# Hacking Hacker News - Post / Comments Frequency Analysis

*If you've ever wanted to get your post onto the front page of Hacker News, you've probably wondered what type of post attracts the most interaction, or at what time of day user interaction peaks. These are exactly the questions this analysis sets out to answer.*

This dataset contains information on roughly 300,000 posts (293,119 rows once the header is removed) covering the 12 months from September 2015 to September 2016.

After analyzing posts' user interaction by the hour, I found that an 'Ask' post submitted at 2:00pm Central Time (3:00pm Eastern) receives more comments on average than an ask post submitted at any other hour, and that ask posts receive more comments on average than show posts.

## Exploring the Data

In [68]:
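from csv import reader
import datetime as dt  # needed for the time frequency analysis later

# Read the whole CSV into a list of rows; the with-block closes the file for us
with open("HN_posts_year_to_Sep_26_2016.csv", encoding='utf8') as opened_file:
    hn = list(reader(opened_file))
headers = hn[0]  # first row holds the column names
hn = hn[1:]      # remaining rows are the posts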

print(headers)
print('\n')
print(hn[:3])
print(len(hn))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]
293119


The most interesting variables in this dataset for this analysis are 'num_points', 'num_comments', and 'created_at'. 

Before we can use these variables, let's separate the posts into their respective categories by parsing their titles.

In [19]:
ask_posts = []
show_posts = []
other_posts = []

for i in hn:
    title = i[1]
    title = title.lower()
    if title.startswith('ask hn'): ## ask posts title
        ask_posts.append(i)
    elif title.startswith('show hn'): ## show posts title
        show_posts.append(i)
    else:
        other_posts.append(i)

print("Number of Ask Posts: ",len(ask_posts))
print("Number of Show Posts: ",len(show_posts))
print("Number of Other Posts: ",len(other_posts))

Number of Ask Posts:  9139
Number of Show Posts:  10158
Number of Other Posts:  273822


For the purpose of this analysis, we'll look only at Ask and Show posts, since they are the most user-centric. The other posts consist mostly of news articles and job postings rather than user-generated content.

## Frequency Analysis of Comments Per Category

In [18]:
total_ask_comments = 0 ## calculate total ask comments
for i in ask_posts:
    num_comments = int(i[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of Ask Post comments: ", avg_ask_comments)

total_show_comments = 0 ## calculate total show comments
for i in show_posts:
    num_comments = int(i[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print("Average number of Show Post comments: ", avg_show_comments)

Average number of Ask Post comments:  10.393478498741656
Average number of Show Post comments:  4.886099625910612


On average, ask posts receive more than twice as many comments as show posts (about 10.4 versus 4.9). Since the purpose of ask posts is to receive feedback, it makes sense for these posts to have more comments.
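The two averaging loops above share the same logic, so they could be folded into one helper. The following is a minimal sketch, not part of the original notebook: the function name `average_comments` and the sample rows are hypothetical, and the rows only mimic the dataset's column order.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for rows that store counts as strings."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty category
    return sum(int(row[comments_index]) for row in posts) / len(posts)


# Hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ["1", "Ask HN: example question", "", "5", "12", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: another question", "", "2", "8", "user_b", "9/26/2016 4:10"],
]
sample_show = [
    ["3", "Show HN: example project", "http://example.com", "7", "4", "user_c", "9/26/2016 5:00"],
    ["4", "Show HN: another project", "http://example.org", "1", "6", "user_d", "9/26/2016 6:30"],
]

ask_avg = average_comments(sample_ask)    # (12 + 8) / 2
show_avg = average_comments(sample_show)  # (4 + 6) / 2
ratio = ask_avg / show_avg                # how many times more comments ask posts get
print(ask_avg, show_avg, ratio)
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should reproduce the two averages printed above.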

Because this difference in interaction is so large, we'll focus the remaining analysis on ask posts.

## Frequency Analysis of 'Ask' Posts by Hour

To group posts by the hour they were created, we must convert each created_at string into a datetime object and then perform a frequency analysis.
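As a small illustration of that conversion, the sketch below parses one created_at value taken from the first row printed earlier and extracts the hour as a two-digit string. The variable names `raw`, `parsed`, and `hour_key` are only for this example.

```python
import datetime as dt

raw = "9/26/2016 3:26"  # created_at value from the first row shown earlier

# %m, %d and %H accept values without a leading zero when parsing
parsed = dt.datetime.strptime(raw, "%m/%d/%Y %H:%M")

# Format the hour as a zero-padded string, suitable as a dictionary key
hour_key = parsed.strftime("%H")

print(parsed)    # 2016-09-26 03:26:00
print(hour_key)  # 03
```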

In [65]:
result_list = []
for i in ask_posts:
    created_at = i[6] ## time is still a string at this point
    num_comments = int(i[4])
    info = [created_at, num_comments]
    result_list.append(info)
    
counts_by_hour = {} ## number of posts created during each hour
comments_by_hour = {}
for i in result_list:
    time = i[0]
    time_object = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(time_object, "%H")
    comments = i[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

# With the times counted, let's print out the information in a clean format

avg_by_hour = []
for i in counts_by_hour:
    avg_by_hour.append([i, comments_by_hour[i]/counts_by_hour[i]])
    
swap_avg_by_hour = []
for i in avg_by_hour:
    swap_avg_by_hour.append([i[1], i[0]])
##print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments")
print('\n')
new_output = "{0}: {1:.2f} average comments per post"
for entry in sorted_swap[:5]:
    hour = dt.datetime.strptime(entry[1],'%H')
    hour = dt.datetime.strftime(hour, '%H:%M')
    comments = float(entry[0])
    print(new_output.format(hour, comments))

Top 5 Hours for Ask Posts Comments


15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


The dataset uses United States Eastern Time, so the top hour, 15:00, corresponds to 2:00pm Central. That appears to be the most effective hour of the day to make an Ask post, averaging 28.68 comments per post.
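The Eastern-to-Central conversion can be sketched in code as follows. This is an illustration rather than part of the original analysis: it assumes fixed standard-time offsets (UTC-5 and UTC-6) and ignores daylight saving, which works here because both zones shift together and stay one hour apart. The names `EASTERN`, `CENTRAL`, and `eastern_hour_to_central` are hypothetical.

```python
import datetime as dt

# Assumed fixed offsets; real time zones would also need daylight-saving rules
EASTERN = dt.timezone(dt.timedelta(hours=-5), "EST")
CENTRAL = dt.timezone(dt.timedelta(hours=-6), "CST")


def eastern_hour_to_central(hour):
    """Convert an hour of the day in Eastern time to an HH:MM string in Central time."""
    stamp = dt.datetime(2016, 1, 15, hour, 0, tzinfo=EASTERN)  # arbitrary winter date
    return stamp.astimezone(CENTRAL).strftime("%H:%M")


for h in (15, 13, 12):
    print(f"{h:02d}:00 Eastern -> {eastern_hour_to_central(h)} Central")
```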

Behind 2:00pm is 12:00pm Central (13:00 Eastern), which averages 16.32 comments per post, a little more than half as many as the top hour.
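As a quick arithmetic check on the gap between the top two hours, the sketch below divides the two rounded averages printed in the output above. The variable names are only for illustration.

```python
# Rounded averages taken from the printed "Top 5 Hours" output (Eastern hours)
top_hour_avg = 28.68     # 15:00
second_hour_avg = 16.32  # 13:00

# Share of the top hour's interaction that the runner-up reaches
ratio = second_hour_avg / top_hour_avg
print(f"{ratio:.2f}")
```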

## Conclusion

On Hacker News, 'Ask' posts draw more comments on average than 'Show' posts. In this dataset, the peak time of day to make an ask post is 2:00pm US Central Time (3:00pm Eastern).