# Exploring Hacker News Posts

In this project, I work with a data set of submissions to the popular technology site Hacker News.

I investigate the following questions to make a recommendation to users on how to maximize engagement:
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

Hacker News is a site started by the startup incubator Y Combinator, where users submit, vote on, and comment on posts.
It is popular in technology and startup circles, and top posts receive thousands of visitors.

Users submit Ask HN or Show HN posts to ask the Hacker News community a specific question or show a project, product, or something interesting. 

Some examples from the data set include:

**Ask HN:**

How to improve my personal website?

Am I the only one outraged by Twitter shutting down share counts?

Any recent changes to CSS that broke mobile?

**Show HN:**

Learn Japanese Vocab via multiple choice questions

Turning a Trello list into a shared helpdesk

Something pointless I made

#### Descriptions of the columns:

**id:** The unique identifier from Hacker News for the post

**title:** The title of the post

**url:** The URL that the post links to, if the post has a URL

**num_points:** The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

**num_comments:** The number of comments that were made on the post

**author:** The username of the person who submitted the post

**created_at:** The date and time at which the post was submitted



### Data Exploration

I import the dataset, drop the header row, and preview the first 5 rows.

In [1]:
from csv import reader

open_file = open('hacker_news.csv')
read_file = reader(open_file)
hn = list(read_file)
hn = hn[1:]  # drop the header row
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
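The import step could also be sketched with a `with` block, which closes the file automatically, and with the header row kept in its own variable instead of being discarded. The example below is only an illustration: `sample_csv` and its two rows are made up and stand in for `hacker_news.csv`, and the header is assumed to match the column names described above.

```python
import io
from csv import reader

# Hypothetical two-row sample standing in for hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/4/2016 11:52\n"
    "2,Show HN: Example project,http://example.com,7,2,bob,1/26/2016 19:30\n"
)

# With the real file this would be:
#     with open('hacker_news.csv', newline='') as f:
with io.StringIO(sample_csv) as f:
    rows = list(reader(f))

header = rows[0]      # column names
hn_sample = rows[1:]  # data rows only

print(header)
print(hn_sample[:5])
```

Keeping `header` around makes it possible to look up a column position by name, for example `header.index('num_comments')`, rather than hard-coding the index.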


I divide the dataset into three lists: ask_posts, show_posts, and other_posts.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower() 
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts), len(show_posts), len(other_posts))

1744 1162 17194
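The prefix check above can also be isolated in a small helper, which makes the case-insensitive matching easy to test on its own. This is a minimal sketch: `classify_title` and `sample_titles` are names introduced here for illustration, and the titles are examples rather than rows read from the file.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles only, not rows from the data set.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'ASK HN: Any recent changes to CSS that broke mobile?',
]

groups = {'ask': [], 'show': [], 'other': []}
for title in sample_titles:
    groups[classify_title(title)].append(title)

print({kind: len(titles) for kind, titles in groups.items()})
```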


### Data Analysis

I find which type of post receives more comments.

In [3]:
total_ask_comments = 0

for post in ask_posts:
    num_comment = post[4]
    num_comment = int(num_comment)
    total_ask_comments += num_comment


avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [4]:
total_show_comments = 0

for post in show_posts:
    num_comment = post[4]
    num_comment = int(num_comment)
    total_show_comments += num_comment

avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993


In [5]:
avg_ask_comments-avg_show_comments

3.7217220783527303

Ask posts receive about 3.7 more comments on average than show posts. Since the goal is to maximize engagement, I focus on ask posts for the rest of the analysis because they usually draw more comments than show posts.
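The two averaging cells repeat the same steps, so the calculation could be wrapped in one function. The sketch below assumes the comment count sits at index 4, as in the rows above. `average_comments` and the sample rows are hypothetical and only show the shape of the computation.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set.
sample_ask = [
    ['1', 'Ask HN: A?', '', '5', '6', 'alice', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '2', '10', 'bob', '8/4/2016 12:10'],
]
sample_show = [
    ['3', 'Show HN: C', 'http://example.com', '9', '4', 'carol', '8/5/2016 9:00'],
]

print(average_comments(sample_ask))
print(average_comments(sample_show))
```

The empty-list guard avoids a `ZeroDivisionError` if one of the groups happens to contain no posts.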

I count the number of ask posts created during each hour of the day and the total number of comments those posts received.

In [30]:
import datetime as dt

result_list = []

for post in ask_posts:
    # pair each post's creation time with its comment count
    created_at = [post[6], int(post[4])]
    result_list.append(created_at)

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_str = row[0]
    date_obj = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")  # parse into a datetime object
    hour = date_obj.strftime("%H")  # zero-padded hour as a string, e.g. '04'
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
print(counts_by_hour, '\n')
print(comments_by_hour)


{'12': 73, '11': 58, '13': 85, '03': 54, '05': 46, '06': 44, '10': 59, '08': 48, '23': 68, '17': 100, '16': 108, '01': 60, '20': 80, '15': 116, '04': 47, '19': 110, '22': 71, '09': 45, '14': 107, '18': 109, '02': 58, '07': 34, '21': 109, '00': 55} 

{'12': 687, '11': 641, '13': 1253, '03': 421, '05': 464, '06': 397, '10': 793, '08': 492, '23': 543, '17': 1146, '16': 1814, '01': 683, '20': 1722, '15': 4477, '04': 337, '19': 1188, '22': 479, '09': 251, '14': 1416, '18': 1439, '02': 1381, '07': 267, '21': 1745, '00': 447}
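The `if hour not in ...` branch can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below is one possible variant, not the code that produced the output above. `sample` is a made-up list in the same `[created_at, num_comments]` shape as `result_list`.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs for illustration.
sample = [
    ['8/4/2016 11:52', 3],
    ['8/4/2016 11:05', 7],
    ['9/30/2015 4:12', 2],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for date_str, num_comments in sample:
    parsed = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. '04'
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```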


I calculate the average number of comments per post for posts created during each hour of the day (i.e. the number of comments per post per hour).

In [39]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

#list of lists- the first element is the hour
#the second element is the average number of comments per post
avg_by_hour  

[['12', 9.41095890410959],
 ['11', 11.051724137931034],
 ['13', 14.741176470588234],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['06', 9.022727272727273],
 ['10', 13.440677966101696],
 ['08', 10.25],
 ['23', 7.985294117647059],
 ['17', 11.46],
 ['16', 16.796296296296298],
 ['01', 11.383333333333333],
 ['20', 21.525],
 ['15', 38.5948275862069],
 ['04', 7.170212765957447],
 ['19', 10.8],
 ['22', 6.746478873239437],
 ['09', 5.5777777777777775],
 ['14', 13.233644859813085],
 ['18', 13.20183486238532],
 ['02', 23.810344827586206],
 ['07', 7.852941176470588],
 ['21', 16.009174311926607],
 ['00', 8.127272727272727]]

The averages are all here, but this format makes it difficult to identify the hours with the highest values. I finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

First, I swap the columns so that the average number of comments appears before the hour. Then, I sort the list by the average number of comments in descending order.

In [43]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])   #puts the average number of comments first and the hour second
                            
print(swap_avg_by_hour)

[[9.41095890410959, '12'], [11.051724137931034, '11'], [14.741176470588234, '13'], [7.796296296296297, '03'], [10.08695652173913, '05'], [9.022727272727273, '06'], [13.440677966101696, '10'], [10.25, '08'], [7.985294117647059, '23'], [11.46, '17'], [16.796296296296298, '16'], [11.383333333333333, '01'], [21.525, '20'], [38.5948275862069, '15'], [7.170212765957447, '04'], [10.8, '19'], [6.746478873239437, '22'], [5.5777777777777775, '09'], [13.233644859813085, '14'], [13.20183486238532, '18'], [23.810344827586206, '02'], [7.852941176470588, '07'], [16.009174311926607, '21'], [8.127272727272727, '00']]


In [72]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)  #sorts average number of comments in descending order

print(sorted_swap, '\n')

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']] 
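Swapping the columns is one way to make `sorted` order the rows by average. Another option is to leave each row as `[hour, average]` and pass a `key` function to `sorted`. The sketch below uses a shortened, rounded copy of a few rows as assumed sample data.

```python
# A few [hour, average] rows, rounded for illustration.
sample_avg_by_hour = [
    ['12', 9.41],
    ['15', 38.59],
    ['02', 23.81],
    ['09', 5.58],
]

# key picks the value to sort on; reverse=True gives descending order.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

print(sorted_by_avg)
```

With a `key`, the hour stays in the first position, so no second list is needed.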



During which hours should a user create a post to have a higher chance of receiving comments?

In [78]:
top_5_hours = []

for row in sorted_swap:
    time_str = row[1]
    dt_object_parse = dt.datetime.strptime(time_str, "%H")  #returns a datetime object
    dt_object_format = dt.datetime.strftime(dt_object_parse, "%H:%M")  #formats the datetime object as HH:MM
    print("{}: {:.2f} average comments per post".format(dt_object_format, row[0]))
    top_5_hours.append("{}: {:.2f} average comments per post".format(dt_object_format, row[0]))

print('\n')
print("Top 5 Hours for Ask Posts Comments")

top_5_hours[:5]

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post


Top 5 Hours for Ask Posts Comments


['15:00: 38.59 average comments per post',
 '02:00: 23.81 average comments per post',
 '20:00: 21.52 average comments per post',
 '16:00: 16.80 average comments per post',
 '21:00: 16.01 average comments per post']
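The cell above formats and prints all 24 hours before slicing out the first five. If only the top five are wanted, the slice can be taken first. A minimal sketch, assuming a list already sorted in descending order: `sample_sorted` holds rounded `[average, hour]` values for illustration.

```python
import datetime as dt

# Rounded [average, hour] rows, already sorted in descending order.
sample_sorted = [
    [38.59, '15'],
    [23.81, '02'],
    [21.52, '20'],
    [16.80, '16'],
    [16.01, '21'],
    [14.74, '13'],
]

top_lines = []
for avg, hour in sample_sorted[:5]:  # slice first, then format
    label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    top_lines.append(f"{label}: {avg:.2f} average comments per post")

print("Top 5 Hours for Ask Posts Comments")
print("\n".join(top_lines))
```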

### Conclusion


I recommend that users create ask posts at 15:00 (3pm Eastern Time in the US) to maximize their chance of receiving comments, since ask posts created during that hour averaged 38.59 comments each. The next best hours are 2am, 8pm, 4pm, and 9pm.