# **Hacker News Posts - Data Cleaning**



In this project, we're diving into a dataset of submissions made to the renowned tech website, Hacker News.

Hacker News, founded by the startup incubator Y Combinator, operates much like Reddit. Users submit stories, called "posts," which can garner votes and comments. The site enjoys significant popularity in tech and startup spheres. Stories that climb to the top of the Hacker News rankings can draw in immense traffic, sometimes reaching hundreds of thousands of views.

This is one of the guided projects from Dataquest, and the data can be downloaded from [here](https://dq-content.s3.amazonaws.com/356/hacker_news.csv).


Skills I learnt from this project -
1.   How to work with strings
2.   Object-oriented programming
3.   Dates and times








Below are descriptions of the columns:

1. id: the unique identifier from Hacker News for the post
2. title: the title of the post
3. url: the URL that the post links to, if the post has a URL
4. num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
5. num_comments: the number of comments on the post
6. author: the username of the person who submitted the post
7. created_at: the date and time of the post's submission

## Importing necessary libraries

In [36]:
import pandas as pd
import numpy as np
import csv
import datetime as dt

We're specifically interested in posts with titles that begin with either **Ask HN** or **Show HN**.

Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

> Ask HN: How to improve my personal website?

> Ask HN: Am I the only one outraged by Twitter shutting down share counts?

> Ask HN: Any recent changes to CSS that broke mobile?


Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:


> Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform

> Show HN: Something pointless I made

> Show HN: Shanhu.io, a programming playground powered by e8vm


We'll compare these two types of posts to determine the following:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

With the libraries imported, let's read the dataset into a list of lists.

## Load Dataset

In [37]:
# First, specify the path to the downloaded data.
# I am using Colab, so the path points to the folder I uploaded the dataset to.
file_path = '/content/DCNH.csv'

# Next, load the dataset
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
print(df.head())

         id                                              title  \
0  12578975                      Saving the Hassle of Shopping   
1  12578822    Amazons Algorithms Dont Find You the Best Deals   
2  12578694  Emergency dose of epinephrine that does not co...   
3  12578624  Phone Makers Could Cut Off Drivers. So Why Don...   
4  12578311  Americas Lost Boys: Men who choose video games...   

                                                 url  num_points  \
0  https://blog.menswr.com/2016/09/07/whats-new-w...           1   
1  https://www.technologyreview.com/s/602442/amaz...           1   
2                   http://m.imgur.com/gallery/th6Ua           2   
3  http://www.nytimes.com/2016/09/25/technology/p...           4   
4  https://www.firstthings.com/blogs/firstthought...           5   

   num_comments       author      created_at  
0             1        bdoux  9/26/2016 3:13  
1             1    yarapavan  9/26/2016 2:26  
2             1  dredmorbius  9/26/2016 1:54  
3     

In [38]:
#Read the hacker_news.csv file in as a list of lists

# Initialize an empty list to hold the data
data = []

# Open the CSV file
with open(file_path, mode='r', encoding='utf-8') as file:
    # Create a CSV reader object
    csv_reader = csv.reader(file)

    # Iterate over the rows in the CSV reader
    for row in csv_reader:
        # Append each row to the data list
        data.append(row)

# Print the first few rows to verify
for row in data[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']
['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']
['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']


In [39]:
#storing data in hn
hn = data
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'], ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']]


## Split Data for processing

Notice that the first list in hn contains the column headers, while each list after it contains the data for one row. To analyze the data, we first need to remove the row containing the column headers. Let's do that next.

In [40]:
# Extract the first row as headers
headers = data[0]

# Remove the first row from the data
hn = data[1:]

# Display the headers
print("Headers:")
print(headers)

# Display the first five rows of the dataset to verify
print("\nFirst five rows of hn:")
for row in hn[:5]:
    print(row)

Headers:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

First five rows of hn:
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']
['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']
['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']
['12578311', 'Americas Lost Boys: Men who choose video games over work', 'https://www.firstthings.com/blogs/firstthoughts/2016/08/americas-lost-boys', '5', '1', 'jse

## Filtering data

Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Given a string object, say string1, we can check whether it starts with, say, dq by inspecting the output of string1.startswith('dq'). If string1 starts with dq, it returns True; otherwise, it returns False. Because startswith is case-sensitive, the code below lowercases each title before checking it.
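As a quick illustration of the check described above, here is a minimal sketch. The titles and the `classify` helper are made up for the example and are not part of the dataset or the project code.

```python
# Hypothetical sample titles, not rows from the dataset.
titles = [
    "Ask HN: How do I learn Rust?",
    "show hn: my side project",
    "Saving the Hassle of Shopping",
]


def classify(title):
    """Label a title as 'ask', 'show', or 'other' based on its prefix."""
    lowered = title.lower()  # startswith is case-sensitive, so normalize first
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


labels = [classify(t) for t in titles]
print(labels)
```

Note that startswith only looks at the beginning of the string, so a title that merely mentions "Ask HN" later on would be labeled as other.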

In [41]:
# Create three empty lists
ask_posts = []
show_posts = []
other_posts = []

# Lowercase the title in each row, then append the row to the matching list
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check the number of posts in each list
print("Number of ask posts:", len(ask_posts))
print("Number of show posts:", len(show_posts))
print("Number of other posts:", len(other_posts))

Number of ask posts: 6911
Number of show posts: 5059
Number of other posts: 68431


In [42]:
# Print the first five rows
for row in show_posts[:5]:
    print(row)


for row in ask_posts[:5]:
    print(row)


['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']
['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06']
['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50']
['12575471', 'Show HN: Project-Okot: Novel, CODE-FREE data-apps in mere seconds', 'https://studio.nuchwezi.com/', '3', '1', 'nfixx', '9/25/2016 14:30']
['12574556', 'Show HN: Geto, a mobile local compass', 'https://andreapaiola.name/geto/', '2', '1', 'andreapaiola', '9/25/2016 9:19']
['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']
['12573681', 'Ask HN: Where can I learn more about and contribute to the AI singularity?', ''

## Analyzing posts

Next, let's determine if ask posts or show posts receive more comments on average.

In [43]:
# total number of comments in ask posts

total_ask_comments = 0
for row in ask_posts:
  num_comments = (int(row[4]))
  total_ask_comments += num_comments

In [44]:
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_ask_comments

13.744175951381855

In [45]:
# total number of comments in show posts

total_show_comments = 0
for row in show_posts:
  num_comments = int(row[4])
  total_show_comments += num_comments

In [46]:
avg_show_comments = total_show_comments/ len(show_posts)
avg_show_comments

9.810832180272781

## Findings

In [47]:
# Comparison
if avg_ask_comments > avg_show_comments:
    print("Ask posts receive more comments on average.")
elif avg_ask_comments < avg_show_comments:
    print("Show posts receive more comments on average.")
else:
    print("Both ask and show posts receive the same average number of comments.")

Ask posts receive more comments on average.


Hence we can see that ask posts receive more comments on average than show posts.
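The two averaging loops above follow the same pattern, so they could be folded into one helper. This is only a sketch: `average_column` and the sample rows are hypothetical names and data, and returning 0.0 for an empty list is an assumption made to avoid a division by zero.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # assumption: report 0.0 for an empty list instead of raising
    return sum(int(row[index]) for row in rows) / len(rows)


# Hypothetical rows in the same column order as the dataset.
sample_rows = [
    ["1", "Ask HN: a", "", "2", "4", "user_a", "9/25/2016 19:30"],
    ["2", "Ask HN: b", "", "3", "10", "user_b", "9/25/2016 20:06"],
]

sample_avg = average_column(sample_rows, 4)  # index 4 is num_comments
print(sample_avg)
```

With the real data, the same helper would be called as `average_column(ask_posts, 4)` and `average_column(show_posts, 4)`.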

## Ask posts Analysis

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.




**To calculate the number of ask posts and comments by hour created.**
We'll use the datetime module to work with the data in the created_at column.
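Before looping over every row, it can help to see the parsing step on a single value. The sketch below uses one timestamp in the same format as the created_at column; the variable names are chosen for the example.

```python
import datetime as dt

# A single timestamp in the created_at format (month/day/year hour:minute).
sample = "9/25/2016 19:30"

# strptime turns the string into a datetime object.
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

# strftime("%H") gives the hour as a zero-padded, two-character string.
hour_label = parsed.strftime("%H")

print(parsed.hour, hour_label)
```

The zero-padded string form is what the dictionaries in this analysis use as keys, which is why hours such as 3 AM appear as '03'.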

In [48]:
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])


In [49]:
counts_by_hour = {}
comments_by_hour = {}

# counts_by_hour: contains the number of ask posts created during each hour of the day.
# comments_by_hour: contains the corresponding number of comments ask posts created at each hour received.

for row in result_list:
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]


Next, we'll use these two dictionaries **to calculate the average number of comments for posts created during each hour of the day**.

In [50]:
avg_by_hour = []

for avg in comments_by_hour:
  avg_by_hour.append([avg, comments_by_hour[avg]/counts_by_hour[avg]])

In [51]:
avg_by_hour

[['19', 9.414285714285715],
 ['03', 10.160377358490566],
 ['17', 13.73019801980198],
 ['14', 13.153439153439153],
 ['08', 12.43157894736842],
 ['20', 11.38265306122449],
 ['09', 8.392045454545455],
 ['01', 9.367713004484305],
 ['18', 10.789823008849558],
 ['15', 39.66809421841542],
 ['06', 9.017045454545455],
 ['21', 11.056511056511056],
 ['12', 15.452554744525548],
 ['04', 12.688172043010752],
 ['00', 9.857142857142858],
 ['16', 10.76144578313253],
 ['23', 8.322463768115941],
 ['05', 11.139393939393939],
 ['10', 13.757990867579908],
 ['07', 10.095541401273886],
 ['11', 11.143426294820717],
 ['22', 11.749128919860627],
 ['13', 22.2239263803681],
 ['02', 13.198237885462555]]

We now have the results we need, but this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [52]:
swap_avg_by_hour = []

for swap in avg_by_hour:
  swap_avg_by_hour.append([swap[1], swap[0]])

In [53]:
swap_avg_by_hour

[[9.414285714285715, '19'],
 [10.160377358490566, '03'],
 [13.73019801980198, '17'],
 [13.153439153439153, '14'],
 [12.43157894736842, '08'],
 [11.38265306122449, '20'],
 [8.392045454545455, '09'],
 [9.367713004484305, '01'],
 [10.789823008849558, '18'],
 [39.66809421841542, '15'],
 [9.017045454545455, '06'],
 [11.056511056511056, '21'],
 [15.452554744525548, '12'],
 [12.688172043010752, '04'],
 [9.857142857142858, '00'],
 [10.76144578313253, '16'],
 [8.322463768115941, '23'],
 [11.139393939393939, '05'],
 [13.757990867579908, '10'],
 [10.095541401273886, '07'],
 [11.143426294820717, '11'],
 [11.749128919860627, '22'],
 [22.2239263803681, '13'],
 [13.198237885462555, '02']]

In [54]:
sorted_swap= sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[39.66809421841542, '15'], [22.2239263803681, '13'], [15.452554744525548, '12'], [13.757990867579908, '10'], [13.73019801980198, '17'], [13.198237885462555, '02'], [13.153439153439153, '14'], [12.688172043010752, '04'], [12.43157894736842, '08'], [11.749128919860627, '22'], [11.38265306122449, '20'], [11.143426294820717, '11'], [11.139393939393939, '05'], [11.056511056511056, '21'], [10.789823008849558, '18'], [10.76144578313253, '16'], [10.160377358490566, '03'], [10.095541401273886, '07'], [9.857142857142858, '00'], [9.414285714285715, '19'], [9.367713004484305, '01'], [9.017045454545455, '06'], [8.392045454545455, '09'], [8.322463768115941, '23']]


**Top 5 Hours for Ask Posts Comments**

In [55]:
print(sorted_swap[:5])


[[39.66809421841542, '15'], [22.2239263803681, '13'], [15.452554744525548, '12'], [13.757990867579908, '10'], [13.73019801980198, '17']]


In [56]:
for avg, hour in sorted_swap[:5]:
    hour_dt = dt.datetime.strptime(hour, "%H")
    hour_formatted = hour_dt.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_formatted, avg))

15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
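The swap-then-sort steps above work because sorted compares the first element of each inner list. An alternative sketch, assuming the same [hour, average] layout as avg_by_hour, passes a key function to sorted so no swapped copy is needed. The values here are a small hypothetical subset, rounded for readability.

```python
# Hypothetical subset of [hour, average] pairs, in the same layout as avg_by_hour.
sample_avg_by_hour = [["19", 9.41], ["15", 39.67], ["13", 22.22], ["12", 15.45]]

# Sort by the average (second element) in descending order and keep the top three.
top_three = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_three:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```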


To increase the chances of receiving comments on ask posts, consider posting during the hours with the highest average comments per post. According to the analysis above, the top five hours for ask post comments are:

15:00 (3:00 PM) - 39.67 average comments per post

13:00 (1:00 PM) - 22.22 average comments per post

12:00 (12:00 PM) - 15.45 average comments per post

10:00 (10:00 AM) - 13.76 average comments per post

17:00 (5:00 PM) - 13.73 average comments per post

These hours are based on the Eastern Time (ET) time zone. Depending on the time zone you live in, you may need to adjust these hours accordingly.
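To adjust an hour to another time zone, one option is the standard-library zoneinfo module (Python 3.9+). The sketch below is an assumption-laden example: `convert_hour` is a hypothetical helper, "America/New_York" is assumed to represent the dataset's Eastern Time, and the date matters because daylight saving changes the offset. It also needs time zone data to be available on the system.

```python
import datetime as dt
from zoneinfo import ZoneInfo  # standard library in Python 3.9+


def convert_hour(hour, date, target_zone, source_zone="America/New_York"):
    """Convert an hour on `date` from source_zone to target_zone as 'HH:MM'."""
    local = dt.datetime(
        date.year, date.month, date.day, hour, tzinfo=ZoneInfo(source_zone)
    )
    return local.astimezone(ZoneInfo(target_zone)).strftime("%H:%M")


# 15:00 Eastern on a date from the dataset, expressed in London time.
london_time = convert_hour(15, dt.date(2016, 9, 25), "Europe/London")
print(london_time)
```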

# Thank You

Here are some next steps for you to consider:

1. Determine if show or ask posts receive more points on average.
2. Determine if posts created at a certain time are more likely to receive more points.
3. Compare your results to the average number of comments and points other posts receive.

**Determine if show or ask posts receive more points on average.**


In [57]:
show_num_points = []
for row in show_posts:
  show_num_points.append(int(row[3]))

In [58]:
total_show_num_points = 0
for points in show_num_points:
  total_show_num_points += points

In [59]:
total_show_num_points

134677

In [60]:
avg_show_points = total_show_num_points/len(show_posts)
avg_show_points

26.62126902549911

In [61]:
ask_num_points = []
for row in ask_posts:
  ask_num_points.append(int(row[3]))

In [62]:
total_ask_num_points = 0
for points in ask_num_points:
  total_ask_num_points += points

In [63]:
total_ask_num_points

99550

In [64]:
avg_ask_points = total_ask_num_points/len(ask_posts)
avg_ask_points

14.40457242077847

In [65]:
# Comparison
if avg_ask_points > avg_show_points:
    print("Ask posts receive more points on average.")
elif avg_ask_points < avg_show_points:
    print("Show posts receive more points on average.")
else:
    print("Both ask and show posts receive the same average number of points.")

Show posts receive more points on average.


**Determine if posts created at a certain time are more likely to receive more points.**

Show posts Analysis

In [66]:
test_list = []

for row in show_posts:
  num_points = int(row[3])
  created_at = row[6]
  test_list.append([created_at, num_points])

In [67]:
count_by_hour = {}
point_by_hour = {}

for row in test_list:
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")

    if hour not in count_by_hour:
        count_by_hour[hour] = 1
        point_by_hour[hour] = row[1]
    else:
        count_by_hour[hour] += 1
        point_by_hour[hour] += row[1]

In [68]:
avg_by_hour = []

for avg in point_by_hour:
  avg_by_hour.append([avg, point_by_hour[avg]/count_by_hour[avg]])

In [69]:
avg_by_hour

[['20', 24.8130081300813],
 ['19', 29.8],
 ['16', 26.159383033419022],
 ['14', 28.531722054380666],
 ['09', 20.841772151898734],
 ['06', 29.378947368421052],
 ['03', 18.567010309278352],
 ['18', 28.408360128617364],
 ['17', 25.527027027027028],
 ['11', 31.56578947368421],
 ['01', 19.103703703703705],
 ['12', 33.56666666666667],
 ['04', 26.488888888888887],
 ['21', 25.076555023923444],
 ['00', 27.645390070921987],
 ['13', 28.51197604790419],
 ['07', 23.1496062992126],
 ['15', 26.1734693877551],
 ['22', 23.25],
 ['10', 23.891025641025642],
 ['05', 19.07894736842105],
 ['23', 30.39864864864865],
 ['08', 25.98125],
 ['02', 22.625]]

In [70]:
# Create a list with columns swapped
swap_avg_by_hour = [[row[1], row[0]] for row in avg_by_hour]

# Sort the list by the average number of points in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print the results
print("Top 5 Hours for Show Posts Points")
for avg, hour in sorted_swap[:5]:
    print("{}: {:.2f} average points per post".format(hour, avg))


Top 5 Hours for Show Posts Points
12: 33.57 average points per post
11: 31.57 average points per post
23: 30.40 average points per post
19: 29.80 average points per post
06: 29.38 average points per post
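The comments-by-hour and points-by-hour analyses share the same steps: parse the hour, accumulate a total and a count, then divide. As a sketch of how they could be combined, the hypothetical `average_by_hour` function below takes the column index as a parameter. The sample rows are invented for the example and follow the dataset's column order.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows, value_index, date_index=6):
    """Return {hour: average of the value column} for rows grouped by hour created."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        created_at = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created_at.strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Hypothetical rows: two posts created in hour 12, one in hour 06.
sample_rows = [
    ["1", "Show HN: a", "", "10", "1", "user_a", "9/25/2016 12:05"],
    ["2", "Show HN: b", "", "20", "3", "user_b", "9/24/2016 12:40"],
    ["3", "Show HN: c", "", "6", "2", "user_c", "9/25/2016 6:15"],
]

points_by_hour = average_by_hour(sample_rows, value_index=3)  # index 3 is num_points
print(points_by_hour)
```

Passing `value_index=4` instead would average comments, so the same function could serve both the ask and show analyses.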
