# Analysis of Hacker News Posts
#### Do Certain Types of Posts Receive More Comments?
#### Does the Time of the Post Impact Popularity?

In this project, we will work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Specifically, we will analyze posts whose titles begin with "Ask HN" or "Show HN" to see which type receives more comments on average. We will also examine whether the hour at which a post is created affects how many comments it receives.


In [1]:
from csv import reader

# Read every row into a list; the with-block closes the file afterwards
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
#Re-running this cell drops another row from hn each time, so
#restart the kernel and run all cells if you need to re-run it.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [3]:
#Initialize empty lists to sort our data
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    #startswith function requires case match, so make everything
    #lowercase to be sure
    title = title.lower()
    #append each to correct list
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

#Check to make sure each worked correctly
print("The number of 'Ask HN' posts is " + str(len(ask_posts)))
print("The number of 'Show HN' posts is " + str(len(show_posts)))
print("The number of other posts is " + str(len(other_posts)))


The number of 'Ask HN' posts is 1744
The number of 'Show HN' posts is 1162
The number of other posts is 17194


In [4]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


In our data set, 'Ask HN' posts are more common than 'Show HN' posts: 1744 versus 1162.

'Ask HN' posts also appear to be more engaging, averaging just over 14 comments per post compared to 10.3 for 'Show HN' posts.

Because of this finding that 'Ask HN' posts receive more comments on average, we will focus our remaining analysis just on these posts.
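
The two averaging cells above repeat the same loop. As a minimal sketch, the calculation could be wrapped in a single helper. The name `average_comments` and the sample rows are hypothetical; the function assumes the comment count sits at index 4, as in the header row shown earlier.

```python
def average_comments(rows, comments_index=4):
    """Return the mean number of comments across rows (0.0 if rows is empty)."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Made-up rows in the same column order as hacker_news.csv
sample_rows = [
    ["1", "Ask HN: Example question?", "", "5", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Another question?", "", "2", "29", "user_b", "11/22/2015 13:43"],
]
print(average_comments(sample_rows))  # (6 + 29) / 2
```

With the real data, the same helper could be called once with `ask_posts` and once with `show_posts`.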

In this next section, we will determine if 'Ask HN' posts created at a certain time are more likely to engage users using the following steps:
1. Calculate the number of 'Ask HN' posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments 'Ask HN' posts receive by hour created.
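
The two steps above can be sketched in one function. This is an illustrative alternative that assumes the `created_at` format seen in the data (`month/day/year hour:minute`). The name `average_comments_by_hour` and the sample pairs are hypothetical, and `collections.defaultdict` stands in for the explicit "key not in dictionary" check.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(pairs):
    """Map each hour ('00' to '23') to the average comments per post in that hour.

    pairs: iterable of (created_at, num_comments), with created_at formatted
    like '8/16/2016 9:55'.
    """
    counts = defaultdict(int)    # step 1: posts created in each hour
    comments = defaultdict(int)  # step 1: comments received in each hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    # step 2: average comments per post for each hour
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Made-up (created_at, num_comments) pairs
sample_pairs = [("8/16/2016 9:55", 6), ("5/2/2016 9:14", 2), ("8/2/2016 14:20", 3)]
hourly_averages = average_comments_by_hour(sample_pairs)
print(hourly_averages)
```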

In [6]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    new_row = [created_at, num_comments]
    result_list.append(new_row)
print(result_list[:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


In [7]:
#Both dictionaries are rebuilt from scratch, so this cell is safe to re-run
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    post_time = row[0]
    num_comments = row[1]
    #Convert to datetime object
    post_time = dt.datetime.strptime(post_time, "%m/%d/%Y %H:%M")
    #Select just the hour from datetime object
    post_time = post_time.strftime("%H")
    if post_time not in counts_by_hour:
        #Creating dictionaries for number of posts and number of comments
        counts_by_hour[post_time] = 1
        comments_by_hour[post_time] = num_comments
    else: 
        counts_by_hour[post_time] += 1
        comments_by_hour[post_time] += num_comments

print(counts_by_hour)
print('\n')
print(comments_by_hour)

{'06': 44, '19': 110, '12': 73, '07': 34, '18': 109, '04': 47, '09': 45, '10': 59, '23': 68, '22': 71, '14': 107, '02': 58, '13': 85, '16': 108, '00': 55, '17': 100, '03': 54, '11': 58, '05': 46, '15': 116, '01': 60, '21': 109, '20': 80, '08': 48}


{'06': 397, '19': 1188, '12': 687, '07': 267, '18': 1439, '04': 337, '09': 251, '10': 793, '23': 543, '22': 479, '14': 1416, '02': 1381, '13': 1253, '16': 1814, '00': 447, '17': 1146, '03': 421, '11': 641, '05': 464, '15': 4477, '01': 683, '21': 1745, '20': 1722, '08': 492}


In [8]:
#Creating a new list of lists holding the average number
#of comments per post for each hour
avg_by_hour = []
for post in comments_by_hour:
    avg_by_hour.append([post, comments_by_hour[post]/counts_by_hour[post]])
    
print(avg_by_hour)

[['06', 9.022727272727273], ['19', 10.8], ['12', 9.41095890410959], ['07', 7.852941176470588], ['18', 13.20183486238532], ['04', 7.170212765957447], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['23', 7.985294117647059], ['22', 6.746478873239437], ['14', 13.233644859813085], ['02', 23.810344827586206], ['13', 14.741176470588234], ['16', 16.796296296296298], ['00', 8.127272727272727], ['17', 11.46], ['03', 7.796296296296297], ['11', 11.051724137931034], ['05', 10.08695652173913], ['15', 38.5948275862069], ['01', 11.383333333333333], ['21', 16.009174311926607], ['20', 21.525], ['08', 10.25]]


In [9]:
swap_avg_by_hour = []
for row in avg_by_hour:
    first = row[0]
    second = row[1]
    swap_avg_by_hour.append([second, first])
print(swap_avg_by_hour)
print('\n')

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[9.022727272727273, '06'], [10.8, '19'], [9.41095890410959, '12'], [7.852941176470588, '07'], [13.20183486238532, '18'], [7.170212765957447, '04'], [5.5777777777777775, '09'], [13.440677966101696, '10'], [7.985294117647059, '23'], [6.746478873239437, '22'], [13.233644859813085, '14'], [23.810344827586206, '02'], [14.741176470588234, '13'], [16.796296296296298, '16'], [8.127272727272727, '00'], [11.46, '17'], [7.796296296296297, '03'], [11.051724137931034, '11'], [10.08695652173913, '05'], [38.5948275862069, '15'], [11.383333333333333, '01'], [16.009174311926607, '21'], [21.525, '20'], [10.25, '08']]


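[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]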

In [10]:
print("Top 5 Hours for 'Ask HN' Posts Comments")
for avg, hr in sorted_swap[:5]:
    #Format the hour string as HH:MM before printing
    hour_label = dt.datetime.strptime(hr, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, avg))

Top 5 Hours for 'Ask HN' Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Several different hours produce high engagement for 'Ask HN' posts. The most popular hour by a fairly large margin is 3:00 PM (15:00), averaging about 38.6 comments per post. The second most popular is 2:00 AM (02:00), 11 hours later, at about 23.8 comments per post.

In general, posts created in the afternoon or evening tend to receive more comments, with a notable late-night spike at 2:00 AM. Posts created in the early morning (3:00 AM to 9:59 AM) draw less interaction: none of those hours averages more than about 10.3 comments per post.
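
The discussion above quotes hours in 12-hour form, while the notebook output uses 24-hour strings. A small sketch of the conversion follows. `to_12_hour` is a hypothetical helper, and the AM/PM text assumes an English (or default C) locale.

```python
import datetime as dt


def to_12_hour(hour):
    """Convert a two-digit 24-hour string such as '15' to a label like '3:00 PM'."""
    parsed = dt.datetime.strptime(hour, "%H")
    # %I is zero-padded, so strip a leading zero for readability
    return parsed.strftime("%I:%M %p").lstrip("0")


for hour in ["15", "02", "20"]:
    print(hour, "->", to_12_hour(hour))
```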

# In Summary

In this project, we set out to analyze 'Ask HN' and 'Show HN' posts on the Hacker News site because these posts tend to generate a good deal of interaction. 

Our first step was to determine which type of post generated more interaction on average; this turned out to be 'Ask HN' posts (~14 comments per post vs. 10.3 for 'Show HN'). We accomplished this by sorting all of the posts in our data set into lists based on whether the post title began with 'Ask HN', 'Show HN', or neither. We then totaled the posts and comments in the 'Ask HN' and 'Show HN' lists, computed the average number of comments per post for each, and compared the results.
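
As a brief illustration of that sorting step, the prefix check could be isolated in a function. `classify_title` and the sample titles are hypothetical. The sketch lowercases the title first because `str.startswith` is case-sensitive.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles, including one with unusual capitalization
sample_titles = [
    "Ask HN: Example question?",
    "SHOW HN: Example project",
    "Interactive Dynamic Video",
]
print([classify_title(title) for title in sample_titles])
```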

The next step was to determine which time of day is best to post if your goal is to generate user interaction. For this part, we focused solely on 'Ask HN' posts. We first extracted the date/time and number of comments for each post. We then used this data to create two dictionaries: one counting the number of posts created in each hour, and the other summing the comments on those posts. We used these dictionaries to calculate the average number of comments per post for each hour and then analyzed the results, finding that 'Ask HN' posts created from 3:00 to 3:59 PM received the most comments on average.
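
The final ranking can also be produced without swapping the columns first. The sketch below is an alternative to the notebook's approach, not the method used above: it passes a `key` function to `sorted`. The sample pairs are rounded values taken from the `avg_by_hour` output.

```python
# A few [hour, average] pairs, rounded from the avg_by_hour output
avg_by_hour_sample = [["09", 5.58], ["15", 38.59], ["02", 23.81], ["20", 21.52]]

# Sort by the average (index 1), highest first, leaving each pair's order intact
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```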