# Exploring Hacker News Posts

In this study we will explore a sample of posts submitted to [Hacker News](https://news.ycombinator.com/). Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.

Some posts attract many views and comments. In this study we will explore two aspects that may affect the number of comments a post receives.

*Post title*: when creating posts, users can optionally start the title with `Ask HN` or `Show HN`. They do so to explicitly 'ask' or 'show' something to the Hacker News community. We'll analyze whether posts with these tags receive more comments on average.

*Post timing*: we will also explore whether posts published at certain times of day receive more comments on average.

**Data** 

The source data for this study can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It contains almost 300,000 rows, each representing a post. The posts date from 2015 and 2016. For this project, however, we use a version that has been reduced to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. This file was prepared by Dataquest and can be downloaded from [here](https://app.dataquest.io/m/356/guided-project%3A-exploring-hacker-news-posts/1/introduction).

Let us start with reading in the data, and displaying the header row and a small sample.

In [52]:
from csv import reader

# Read all rows into a list; the context manager closes the file afterwards
with open('inputdata/hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

for row in hn[:4]:
    print (row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 



Let's split off the header row into `headers` and keep the data itself in `hn`, printing the results as a check.

In [53]:
headers = hn[0]
print ('Number of records before removing the header: ', len(hn))
hn = hn[1:]
print ('Number of records after removing the header: ', len(hn))
print ('\n','The first three rows of the data:', '\n')
for row in hn[:3]:
    print (row, '\n')

Number of records before removing the header:  20101
Number of records after removing the header:  20100

 The first three rows of the data: 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 



Next, let us split the data into three new lists:
* `ask_posts` (posts whose title starts with `Ask HN`, in any capitalization)
* `show_posts` (posts whose title starts with `Show HN`, in any capitalization)
* `other_posts` (the remainder)


In [54]:
# Create empty lists

ask_posts = []
show_posts = []
other_posts = []

# Fill the lists

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

# Print some samples

print('Sample posts for "ask":', '\n')
for post in ask_posts[:2]:
    print (post)

print('\n', 'Sample posts for "show":', '\n')
for post in show_posts[:2]:
    print (post)
    
print('\n', 'Sample posts for other:', '\n')
for post in other_posts[:2]:
    print (post)

# Check the totals
print ('\n')
print ('Number of posts in the original list is', len(hn))
print ('Number of posts in "ask" is', len(ask_posts))
print ('Number of posts in "show" is', len(show_posts))
print ('Number of posts in "other" is', len(other_posts))
print ('Sum of the three new lists is', len(ask_posts)+len(show_posts)+len(other_posts))

Sample posts for "ask": 

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']

 Sample posts for "show": 

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']

 Sample posts for other: 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


Number of posts in the original list is 20100

Next, let's determine if "ask posts" or "show posts" receive more comments on average.

In [55]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments/len(ask_posts)
print ('Average number of comments for "ask" posts is {:.2f}'.format(avg_ask_comments))

total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments/len(show_posts)
print ('Average number of comments for "show" posts is {:.2f}'.format(avg_show_comments))

Average number of comments for "ask" posts is 14.04
Average number of comments for "show" posts is 10.32


It appears that 'ask' posts receive more comments on average than 'show' posts.
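The two loops above follow the same pattern, so the calculation could be wrapped in a small helper. The following is a minimal sketch and not part of the original notebook: the function name `average_comments` is a hypothetical choice, the three sample rows are copied from the outputs shown earlier, and the comment count is assumed to sit at index 4, as in `hn`.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows."""
    if not posts:
        # Avoid dividing by zero for an empty group
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Three rows copied from the sample output above (illustration only)
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'],
]

sample_average = average_comments(sample_posts)
print('Average number of comments in the sample is {:.2f}'.format(sample_average))
```

In the notebook, calling the helper on `ask_posts`, `show_posts` and `other_posts` would produce all three group averages with the same code.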

To analyze whether particular times of the day attract more comments, we will continue with these "ask" posts.
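The `created_at` column stores timestamps as text, such as `8/4/2016 11:52`. As a quick illustration of the parsing step this analysis relies on, the sketch below converts one such string (copied from the sample rows above) into a `datetime` object and reads off the hour. The variable names are illustrative only.

```python
import datetime as dt

# One timestamp copied from the sample rows (month/day/year hour:minute)
sample_created_at = '8/4/2016 11:52'

# Parse the text into a datetime object, then extract the hour of day
parsed = dt.datetime.strptime(sample_created_at, '%m/%d/%Y %H:%M')
sample_hour = parsed.hour

print(parsed, sample_hour)
```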

In [56]:
import datetime as dt

# Create a list that contains the creation times and number of comments (ask-posts only)

result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
#print (result_list[:3])

# Build frequency tables for the number of posts and for the number of comments, per hour of the day
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = created_at.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

# Create a table that contains the hours of day and the average number of comments per posts
avg_by_hour = []
for hour in counts_by_hour:
    num_posts = counts_by_hour[hour]
    num_comments = comments_by_hour[hour]
    average = num_comments / num_posts
    avg_by_hour.append([hour, average])

# Sort the list (on its first element, being the hour of day)
avg_by_hour.sort()    
    
# Print the result
output = "For hour {:02d} the average number of comments per post is {:.2f}"
for row in avg_by_hour:
    print (output.format(row[0], row[1]))  

For hour 00 the average number of comments per post is 8.13
For hour 01 the average number of comments per post is 11.38
For hour 02 the average number of comments per post is 23.81
For hour 03 the average number of comments per post is 7.80
For hour 04 the average number of comments per post is 7.17
For hour 05 the average number of comments per post is 10.09
For hour 06 the average number of comments per post is 9.02
For hour 07 the average number of comments per post is 7.85
For hour 08 the average number of comments per post is 10.25
For hour 09 the average number of comments per post is 5.58
For hour 10 the average number of comments per post is 13.44
For hour 11 the average number of comments per post is 11.05
For hour 12 the average number of comments per post is 9.41
For hour 13 the average number of comments per post is 14.74
For hour 14 the average number of comments per post is 13.23
For hour 15 the average number of comments per post is 38.59
For hour 16 the average number 

The averages indeed differ considerably from hour to hour. Let's present this more clearly by showing the hours of the day at which posts attract the most comments on average.

In [57]:
# Create a list that is sorted on the average number of comments instead
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

# Create a sorted version of this list (highest average first)
sorted_swap = sorted (swap_avg_by_hour, reverse = True)

# Display the results
print ('Top 5 Hours for Ask Posts Comments', '\n')
output = "{}: {:.2f} average comments per post"
for row in sorted_swap[:5]:
    thetime = dt.datetime.strptime(str(row[1]), '%H')
    thetime = thetime.strftime('%H:%M')
    print ( output.format(thetime,row[0] ))

Top 5 Hours for Ask Posts Comments 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
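As an aside, the same ranking can be obtained without building a swapped list, by passing a sort key to `sorted`. This is a sketch of an alternative, not the approach used in the notebook; the list `avg_by_hour_sample` is a hypothetical subset of the `[hour, average]` pairs printed above.

```python
# A few [hour, average] pairs copied from the outputs above (illustration only)
avg_by_hour_sample = [
    [2, 23.81], [9, 5.58], [15, 38.59], [16, 16.80], [20, 21.52], [21, 16.01],
]

# Sort on the second element (the average), highest first
top_hours = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, average in top_hours[:3]:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, average))
```

The two approaches can order ties differently: the swapped list falls back to comparing the hour, while `sorted` with a key keeps tied rows in their original order.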


So those are the best times of day to post if you want to attract comments. Interestingly, the top 5 hours are spread widely across the day. One possible explanation is that commenters are located across the globe, and that these hours represent peak times for different time zones. (That would require further study, though.)

Note that the times above are in US Eastern Time, as per the [dataset documentation](https://www.kaggle.com/hacker-news/hacker-news-posts).

To convert them to our time zone (Central European Time), add six hours.
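The conversion can be sketched as simple modular arithmetic. This is a minimal illustration that assumes the fixed six-hour offset stated above. The function name `shift_hour` is hypothetical, and the sketch ignores the few weeks each year when the US and Europe switch daylight saving time on different dates.

```python
def shift_hour(hour, offset_hours=6):
    """Shift an hour of day by a fixed offset, wrapping around midnight."""
    return (hour + offset_hours) % 24

# The three top hours from the ranking above, in US Eastern Time
eastern_hours = [15, 2, 20]
cet_hours = [shift_hour(hour) for hour in eastern_hours]

for eastern, cet in zip(eastern_hours, cet_hours):
    print('{:02d}:00 Eastern -> {:02d}:00 CET'.format(eastern, cet))
```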

## Conclusions

Referring back to the goal of this study, let's summarize the conclusions.

*Post title*: in this sample, posts with `Ask HN` in the title attract more comments on average than posts with `Show HN`:
- Ask HN: 14.04 average comments per post
- Show HN: 10.32 average comments per post

(Posts without either tag were not included in this comparison.)

*Post timing*: the time of day of posting appears to have a considerable impact on the number of comments a post attracts. Based on an analysis of the `Ask HN` posts, the top hours (in Central European Time) are:
- 21:00 - 22:00: 38.59 average comments per post
- 08:00 - 09:00: 23.81 average comments per post
- 02:00 - 03:00: 21.52 average comments per post