![title](hacker_news.jpg)

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It includes almost 300,000 rows. Below are descriptions of the columns:

`id`: The unique identifier from Hacker News for the post<br>
`title`: The title of the post<br>
`url`: The URL that the post links to, if the post has a URL<br>
`num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes<br>
`num_comments`: The number of comments that were made on the post<br>
`author`: The username of the person who submitted the post<br>
`created_at`: The date and time at which the post was submitted<br>

Here are the first five rows of the file (the header row followed by four posts).

In [1]:
#Import the required packages
from csv import reader
import datetime as dt

In [2]:
#Read in the .csv file
with open('C:/Users/shane.mcdonald/Desktop/HN_posts_year_to_Sep_26_2016.csv', encoding='utf-8') as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
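To make the list-of-lists structure concrete, here is a minimal sketch that parses a tiny in-memory CSV instead of the real file. The sample row and the names `sample_csv`, `header` and `first_row` are made up for illustration. It also shows that every field comes back as a string, which is why the later cells call `int()` on the comment counts.

```python
from csv import reader
from io import StringIO

# Made-up two-line CSV in the same column layout as the data set (illustration only)
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,3,7,example_user,9/26/2016 3:26\n"
)

# reader yields one list of strings per CSV line
rows = list(reader(sample_csv))
header, first_row = rows[0], rows[1]

print(header)
print(first_row)

# Every field is a string, so numeric columns need converting before arithmetic
num_comments = int(first_row[4])
print(type(first_row[4]).__name__, num_comments + 1)
```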


We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. 

Users submit `Ask HN` posts to ask the Hacker News community a specific question.

Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting. 

We'll compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### Removing headers from a list of lists

To clean this up a bit, let's remove the first list, which contains the headers, and assign it to a new variable, `headers`.

In [3]:
#Remove headers from original list
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [4]:
#Re-assign the list without the headers
hn = hn[1:]
print(hn[:6])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-w

### Extracting 'Ask HN' and 'Show HN' posts

Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

In [5]:
#Create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

#Loop through each row to extract the posts
for item in hn:
    news_title = item[1].lower()
    if news_title.startswith('ask hn'):
        ask_posts.append(item)
    elif news_title.startswith('show hn'):
        show_posts.append(item)
    else:
        other_posts.append(item)

#Print the results
print('Ask posts:',len(ask_posts))
print('Show posts:', len(show_posts))
print('Other:', len(other_posts))

Ask posts: 9139
Show posts: 10158
Other: 273822
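The prefix rule used in the loop can also be wrapped in a small function so it can be checked on its own. This is only a sketch: `classify_title` and the sample titles are hypothetical and not part of the analysis above.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the start of the title."""
    lowered = title.lower()  # compare case-insensitively, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Why I ask HN for advice',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

The third title contains "ask HN" but does not start with it, so it falls into the `other` group, which matches how `startswith` behaves in the original loop.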


### Calculating the average number of comments for 'Ask HN' and 'Show HN' posts

Next, let's determine if ask posts or show posts receive more comments on average.

In [6]:
#Store the total in a new variable
total_ask_comments = 0

#Loop through the Ask post list
for item in ask_posts:
    comments = item[4]
    total_ask_comments += int(comments)
    
#Calculate the average
avg_ask_comments = total_ask_comments / len(ask_posts)

#Print the results
print('The average number of ask comments is {:.2f}'.format(avg_ask_comments))

The average number of ask comments is 10.39


In [7]:
#Store the total in a new variable
total_show_comments = 0

#Loop through the Show posts
for item in show_posts:
    comments = item[4]
    total_show_comments += int(comments)

#Calculate the average
avg_show_comments = total_show_comments / len(show_posts)

#Print the results
print('The average number of show comments is {:.2f}'.format(avg_show_comments))

The average number of show comments is 4.89


The results show that ask posts receive, on average, more than twice as many comments as show posts (10.39 versus 4.89), which may mean that the community engages with Hacker News more as a Q&A forum than as a show & tell platform.
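Since the two averaging cells differ only in the list they loop over, the calculation could be factored into one function. The sketch below assumes the same column layout as the data set. `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comment_index=4):
    """Average the num_comments column (stored as strings) over a list of rows."""
    total = sum(int(row[comment_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: First question?', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: Second question?', '', '1', '4', 'user_b', '9/26/2016 4:10'],
]

print('The average number of comments is {:.2f}'.format(average_comments(sample_posts)))
```

Note that the function divides by `len(posts)`, so it would raise `ZeroDivisionError` on an empty list. That is not a concern here because both groups contain thousands of posts.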

### Finding the number of 'Ask' posts and comments by hour created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

In [8]:
#Create an empty list
result_list = []

#Loop through the Ask posts 'created_at' column
for item in ask_posts:
    temp_list = []
    ask_create = item[6]
    num_comments = int(item[4])
    temp_list.append(ask_create)
    temp_list.append(num_comments)
    result_list.append(temp_list)

#Create empty dictionaries
counts_by_hour = {}
comments_by_hour = {}

#Loop through the new list, extract the hour created & number of comments, then store in dictionaries
for item in result_list:
    just_date = dt.datetime.strptime(item[0], "%m/%d/%Y %H:%M")
    just_hour = just_date.strftime("%H")
    if just_hour not in counts_by_hour:
        counts_by_hour[just_hour] = 1
        comments_by_hour[just_hour] = item[1] 
    else:
        counts_by_hour[just_hour] += 1
        comments_by_hour[just_hour] += item[1]

#print the results
print("Counts by hour:", counts_by_hour)
print('\n')
print("Comments by hour:", comments_by_hour)

Counts by hour: {'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


Comments by hour: {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
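To see how the hour is extracted and how the two running totals build up, here is a minimal sketch on three made-up rows. It uses `dict.get` with a default of 0 in place of the `if`/`else` branch. The names `sample`, `counts` and `comments` are assumptions for illustration.

```python
import datetime as dt

# Made-up [created_at, num_comments] pairs in the data set's date format
sample = [
    ['9/26/2016 3:26', 5],
    ['9/26/2016 3:50', 1],
    ['9/25/2016 15:02', 12],
]

counts = {}
comments = {}

for created_at, num_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour as a string
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] = counts.get(hour, 0) + 1
    comments[hour] = comments.get(hour, 0) + num_comments

print(counts)    # {'03': 2, '15': 1}
print(comments)  # {'03': 6, '15': 12}
```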


We now have two frequency tables that tell us how many ask posts were made in each hour and how many comments those posts received. To make this a bit cleaner, let's combine the two dictionaries to calculate the average number of comments per 'Ask' post for each hour.

### Calculating the average number of comments for 'Ask HN' posts by hour

In [9]:
#Create an empty list
avg_comments_list = []

#Loop over the comments dictionary, calculate the average and append to the new list
for key in comments_by_hour:
    avg_comments_list.append([key, comments_by_hour[key] / counts_by_hour[key]])

#Show the list
avg_comments_list

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]
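The same averages could also be held in a single dictionary built with a comprehension. This sketch uses three hours copied from the outputs above. The names `counts_sample`, `comments_sample` and `avg_by_hour` are hypothetical.

```python
# Three entries copied from the counts and comments printed earlier
counts_sample = {'15': 646, '13': 444, '09': 222}
comments_sample = {'15': 18525, '13': 7245, '09': 1477}

# Divide total comments by number of posts for each hour
avg_by_hour = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in comments_sample
}

for hour, avg in avg_by_hour.items():
    print('{}: {:.2f}'.format(hour, avg))
```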

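We now need to swap the items in each inner list so that the average comes first. `sorted` compares lists by their first element, so this puts the averages in position for the final sort.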

In [10]:
#Create a new list
swap_avg_comments_list = []

#Loop through the previous list and append the items in the correct order
for item in avg_comments_list:
    swap_avg_comments_list.append([item[1], item[0]])

#Print the results
print(swap_avg_comments_list)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


### Sorting and printing values from a list of lists

So that we can easily identify the hours with the highest average number of comments, let's sort the list in descending order by the average number of comments received.

In [11]:
#Sort the list in descending order
sorted_swap = sorted(swap_avg_comments_list, reverse=True)

#Show the sorted list
sorted_swap

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
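The swap step is needed because `sorted` compares lists element by element, starting with the first. An alternative, shown here only as a sketch, is to leave the `[hour, average]` order alone and pass a `key` function. The sample values are rounded figures from the output above, and `avg_sample` and `sorted_by_avg` are hypothetical names.

```python
# [hour, average] pairs, rounded from the averages calculated earlier
avg_sample = [['02', 11.14], ['15', 28.68], ['13', 16.32], ['09', 6.65]]

# Sort on the second item of each pair (the average), highest first
sorted_by_avg = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print('{}: {:.2f}'.format(hour, avg))
```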

Let's now look at the top 5 hours for ask post comments to draw our conclusions.

In [12]:
#Print out our title
print('Top 5 Hours for Ask Posts Comments:')

#Loop over the previous list, extract the top 5 hours by comment and convert hour back to full time format
for avg, hr in sorted_swap[:5]:
    hour = dt.datetime.strptime(hr, "%H")
    hour = hour.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour, avg))

Top 5 Hours for Ask Posts Comments:
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


### Conclusion

In this project we analysed nearly 300,000 rows to determine the best time to post a question on Hacker News in order to receive comments. In this data set, the best time to submit an `Ask HN` post is between 3pm and 4pm, when the average number of comments per post (28.68) is about 168% higher than for posts submitted in the morning between 10am and 11am (10.68).
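The 168% figure can be checked from the two hourly averages reported above. The values below are copied from the output for hours 15 and 10. The variable names are assumptions for illustration.

```python
# Average comments per ask post, copied from the results above
avg_15 = 28.676470588235293  # posts created between 3pm and 4pm
avg_10 = 10.684397163120567  # posts created between 10am and 11am

# Percentage increase of the 3pm average relative to the 10am average
pct_increase = (avg_15 - avg_10) / avg_10 * 100

print('Increase: {:.0f}%'.format(pct_increase))
```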