# Hacker News post analysis

Objective: answer two questions:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time of day receive more comments on average?

In [31]:
from csv import reader
import datetime as dt

In [12]:
opened_file = open("/Users/Taylor/OneDrive/Documents/DC/Dataquest/Hacker News/train.csv", encoding='utf8')
read_file = reader(opened_file)
data = list(read_file)
hn_header = data[0]
hn = data[1:]
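
A minimal sketch of the same loading step using a `with` block, which closes the file automatically once reading finishes. The `load_rows` helper and the in-memory sample are assumptions made for illustration; they are not part of the original notebook.

```python
from csv import reader
import io

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    data = list(reader(file_obj))
    return data[0], data[1:]

# Hypothetical one-row sample standing in for train.csv
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example?,,3,2,alice,9/26/2016 3:26\n"
)
header, rows = load_rows(sample)
print(header)
print(rows)

# With a real file, the with block closes it on exit:
# with open("train.csv", encoding="utf8", newline="") as f:
#     hn_header, hn = load_rows(f)
```

Passing a file object rather than a path keeps the helper easy to try out on small samples before pointing it at the full dataset.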

In [13]:
print(hn_header)


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Fields are:
* id: The unique identifier from Hacker News for the post
* title: The title of the post
* url: The URL that the post links to, if the post has a URL
* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the post
* author: The username of the person who submitted the post
* created_at: The date and time at which the post was submitted

In [14]:
print(hn[:4])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

In [19]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Ask HN posts:", len(ask_posts))
print("Show HN posts:", len(show_posts))
print("Other HN posts:", len(other_posts))
print("Total posts:", len(hn))

Ask HN posts: 9139
Show HN posts: 10158
Other HN posts: 273822
Total posts: 293119
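
The prefix test can also be pulled into a small function, which makes the case-insensitive matching easy to check on its own. This is only a sketch: the `classify` name and the sample titles are hypothetical.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # startswith is case-sensitive, so normalize first
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles for illustration
titles = ["Ask HN: How do you learn?", "SHOW HN: My project", "algorithmic music"]
labels = [classify(t) for t in titles]
print(labels)  # ['ask', 'show', 'other']
```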


Let's determine if ask posts or show posts receive more comments on average.

In [23]:
# Ask posts comments
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)
ask_string = "There are {comments:.1f} comments per average ask post".format(comments = avg_ask_comments)
print(ask_string)

# Show posts comments
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)
show_string = "There are {comments:.1f} comments per average show post".format(comments = avg_show_comments)
print(show_string)

There are 10.4 comments per average ask post
There are 4.9 comments per average show post


Ask posts receive more than twice as many comments on average as show posts (10.4 versus 4.9). Since ask posts draw more comments, we'll focus the remaining analysis on them.
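
Because the same averaging loop runs once for each group, it could be factored into a helper. The sketch below assumes rows shaped like the dataset's (comment count at index 4); the `average_comments` name and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across a list of post rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the dataset
sample_posts = [
    ["1", "Ask HN: A?", "", "1", "4", "a", "9/26/2016 3:26"],
    ["2", "Ask HN: B?", "", "1", "6", "b", "9/26/2016 4:10"],
]
avg = average_comments(sample_posts)
print("{:.1f} comments per post on average".format(avg))
```

Note that the helper raises `ZeroDivisionError` on an empty list, so it should only be called on a non-empty group.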

Next, we'll determine whether ask posts created at a certain hour of the day receive more comments on average.

In [28]:
result_list = []
for row in ask_posts:
    created = row[6]
    comments = int(row[4])
    result = [created, comments]
    result_list.append(result)

In [30]:
print(len(result_list))

9139


In [41]:
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created = row[0]
    comments = row[1]
    # The created_at format looks like this: "9/26/2016 3:26"
    hour = dt.datetime.strptime(created, "%m/%d/%Y %H:%M").hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
print(comments_by_hour)
print(counts_by_hour)

{2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
{2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
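
To make the parsing step concrete, here is a small sketch that extracts the hour from a `created_at` string and tallies posts and comments per hour. The `hour_of` helper and the three sample pairs are assumptions for illustration; `dict.get` with a default replaces the if/else branch.

```python
import datetime as dt

def hour_of(created_at):
    """Parse a string such as '9/26/2016 3:26' and return the hour (0-23)."""
    return dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour

# Hypothetical [created_at, num_comments] pairs
sample = [["9/26/2016 3:26", 2], ["9/26/2016 3:24", 5], ["8/16/2016 15:05", 10]]

counts, comments = {}, {}
for created_at, n in sample:
    hour = hour_of(created_at)
    counts[hour] = counts.get(hour, 0) + 1       # posts seen in this hour
    comments[hour] = comments.get(hour, 0) + n   # comments in this hour

print(counts)    # {3: 2, 15: 1}
print(comments)  # {3: 7, 15: 10}
```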


In [70]:
avg_comments = []

for hour in comments_by_hour:
    # Average comments per post for this hour
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_comments.append([str(hour), avg])
    print("At {hour}:00: {comments:.1f} comments".format(hour = hour, comments = avg))

At 2:00: 11.1 comments
At 1:00: 7.4 comments
At 22:00: 8.8 comments
At 21:00: 8.7 comments
At 19:00: 7.2 comments
At 17:00: 9.4 comments
At 15:00: 28.7 comments
At 14:00: 9.7 comments
At 13:00: 16.3 comments
At 11:00: 9.0 comments
At 10:00: 10.7 comments
At 9:00: 6.7 comments
At 7:00: 7.0 comments
At 3:00: 7.9 comments
At 23:00: 6.7 comments
At 20:00: 8.7 comments
At 16:00: 7.7 comments
At 8:00: 9.2 comments
At 0:00: 7.6 comments
At 18:00: 7.9 comments
At 12:00: 12.4 comments
At 4:00: 9.7 comments
At 6:00: 6.8 comments
At 5:00: 8.8 comments


Let's sort the hours by average comments per post, highest first, to find the top hours of the day.

In [76]:
swap_avg_by_hour = []
for row in avg_comments:
    # Put the average first so that sorted() orders by it
    swap_avg_by_hour.append([row[1], row[0]])
sorted_avg_comments = sorted(swap_avg_by_hour, reverse = True)
for row in sorted_avg_comments:
    print("At {hour}:00: {comments:.1f} comments".format(hour = row[1], comments = row[0]))

At 15:00: 28.7 comments
At 13:00: 16.3 comments
At 12:00: 12.4 comments
At 2:00: 11.1 comments
At 10:00: 10.7 comments
At 4:00: 9.7 comments
At 14:00: 9.7 comments
At 17:00: 9.4 comments
At 8:00: 9.2 comments
At 11:00: 9.0 comments
At 22:00: 8.8 comments
At 5:00: 8.8 comments
At 20:00: 8.7 comments
At 21:00: 8.7 comments
At 3:00: 7.9 comments
At 18:00: 7.9 comments
At 16:00: 7.7 comments
At 0:00: 7.6 comments
At 1:00: 7.4 comments
At 19:00: 7.2 comments
At 7:00: 7.0 comments
At 6:00: 6.8 comments
At 23:00: 6.7 comments
At 9:00: 6.7 comments
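
An alternative to swapping the columns is to pass a `key` function to `sorted`, then slice off the first five entries. The sketch below is illustrative only: `avg_by_hour` is a hypothetical dictionary filled with six of the rounded averages printed above.

```python
# Rounded averages copied from the output above (six hours only)
avg_by_hour = {15: 28.7, 13: 16.3, 12: 12.4, 2: 11.1, 10: 10.7, 4: 9.7}

# Sort (hour, average) pairs by the average, highest first, and keep five
top_five = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)[:5]

for hour, avg in top_five:
    print("At {hour}:00: {comments:.1f} comments".format(hour=hour, comments=avg))
```

Sorting on the value directly keeps the hour as an integer and avoids building a second list.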


The top five hours are 3 pm, 1 pm, 12 pm, 2 am, and 10 am ET. Four of the five fall between 10 am and 3 pm, and 3 pm stands out with 28.7 comments per post on average.

# Conclusion

If your goal is to maximize comments on a Hacker News post, submit an Ask HN post between 10 am and 3 pm ET. The best hour is 3 pm, which averages 28.7 comments per post. One plausible explanation is that this window overlaps the working day on both the East Coast and the West Coast.