## Hacker News Analysis

In this analysis, I examine roughly 20,000 posts submitted to the website Hacker News. Each row of the dataset describes one post and includes the following variables:

- id: the unique identifier from Hacker News for the post
- title: the title of the post
- url: the URL that the post links to, if the post has a URL
- num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: the number of comments on the post
- author: the username of the person who submitted the post
- created_at: the date and time of the post's submission

The goal of the analysis is to determine which of two kinds of posts -- Ask HN and Show HN -- is more popular, and whether posts made at a certain time of day are more popular.

### Hacker News Comments Analysis

In [28]:
from csv import reader
with open("hacker_news.csv") as hn:
    csv_reader = reader(hn)
    hn = list(csv_reader)
print("Number of rows in the dataset: ", len(hn))
print("Sample of the dataset: ", hn[:5])

Number of rows in the dataset:  20101
Sample of the dataset:  [['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [13]:
headers = hn[0]
hn = hn[1:]
print("Column Headers:", headers)
print("The first five rows of the data set:", hn[:5])

Column Headers: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
The first five rows of the data set: [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '

Below, I filter the dataset to focus only on those posts with "Ask HN" or "Show HN" tags and print how many posts fall into each category.

In [23]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts)) 
print(len(other_posts))

1744
1162
17194


Next, I compute the average number of comments per post for each of the two post types.

In [26]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments = total_ask_comments + num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average Number of Comments in an Ask HN Post: ", avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments = total_show_comments + num_comments

avg_show_comments = total_show_comments / len(show_posts)
print("Average Number of Comments in a Show HN Post: ",avg_show_comments)

Average Number of Comments in an Ask HN Post:  14.038417431192661
Average Number of Comments in a Show HN Post:  10.31669535283993


Based on the averages above, Ask HN posts receive more comments on average (about 14.0 per post) than Show HN posts do (about 10.3 per post). The remaining analysis therefore focuses on Ask HN posts.
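
The two averaging loops above follow the same pattern, so they could be folded into a single helper. The following is a minimal sketch, assuming rows shaped like those in the dataset (comment count stored as a string at index 4). The function name `average_comments` and the sample rows are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count across posts, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Illustrative rows only (made-up values, same column layout as hacker_news.csv).
sample_posts = [
    ["1", "Ask HN: Example question one?", "", "5", "10", "user_a", "1/1/2016 9:00"],
    ["2", "Ask HN: Example question two?", "", "3", "4", "user_b", "1/2/2016 15:30"],
]

sample_average = average_comments(sample_posts)
print("Average comments per post:", sample_average)
```

With the notebook's own lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.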

### Analysis of Times of Ask HN Posts

Below, I calculate the number of Ask HN posts created in each hour of the day and determine the number of comments these posts receive each hour.

In [67]:
import datetime as dt

result_list = []

for row in ask_posts: 
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])
        
# Number of Ask HN posts and total comments, keyed by hour of day (0-23)
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    num_comments = int(row[1])
    time_str = row[0]
    dt_object = dt.datetime.strptime(time_str, "%m/%d/%Y %H:%M")
    hour = dt_object.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] = comments_by_hour[hour] + num_comments
        
print("Posts by Hour (Hour:Count): ", counts_by_hour)
print("Comments by Hour (Hour:Count): ", comments_by_hour)


Posts by Hour (Hour:Count):  {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Comments by Hour (Hour:Count):  {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
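
The per-hour tally above can also be written in a single pass without the explicit "first time seen" branch. This is a sketch under the same assumptions as the notebook (timestamp at index 6 in `%m/%d/%Y %H:%M` format, comment count at index 4). The function name `tally_by_hour` and the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(posts, date_index=6, comments_index=4):
    """Count posts and sum comments for each hour of the day (0-23)."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for row in posts:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)


# Illustrative rows only (made-up values, same column layout as hacker_news.csv).
sample_posts = [
    ["1", "Ask HN: Example one?", "", "5", "10", "user_a", "1/1/2016 9:00"],
    ["2", "Ask HN: Example two?", "", "3", "4", "user_b", "1/2/2016 15:30"],
    ["3", "Ask HN: Example three?", "", "1", "6", "user_c", "1/3/2016 15:05"],
]

sample_counts, sample_comments = tally_by_hour(sample_posts)
print("Posts by hour:", sample_counts)
print("Comments by hour:", sample_comments)
```

Using `defaultdict(int)` removes the need to check whether an hour is already a key before incrementing it.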


Below I calculate the average number of comments per post for posts created during each hour of the day.

In [70]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

print("Average Number of Comments Per Post for Posts Created During Hour of the Day (Hour,Average)", avg_by_hour)
    

Average Number of Comments Per Post for Posts Created During Hour of the Day (Hour,Average) [[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]


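Finally, I sort the hourly averages in descending order and print the five hours with the highest average number of comments per Ask HN post.

In [86]: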
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("")
print("Top 5 Hours for Ask Posts Comments")
print("")

for row in sorted_swap[:5]:
    avg = row[0]
    hour = row[1]
    string = "{0}:00: {1:.2f} average comments per post".format(hour, avg)
    print(string)


[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]

Top 5 Hours for Ask Posts Comments

15:00: 38.59 average comments per post
2:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
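
Swapping the columns before sorting works, but `sorted` also accepts a `key` function, which avoids building the intermediate list. The sketch below assumes `[hour, average]` pairs like those in `avg_by_hour`; the function name `top_hours` is hypothetical, and the sample values are a small subset of the averages printed above.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest averages."""
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]


# A few [hour, average] pairs taken from the output above.
sample_avg_by_hour = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [20, 21.525],
    [2, 23.810344827586206],
]

top_two = top_hours(sample_avg_by_hour, n=2)
for hour, avg in top_two:
    print("{0}:00: {1:.2f} average comments per post".format(hour, avg))
```

One difference from the swap approach: when two hours have exactly the same average, this version keeps them in their original order, whereas sorting the swapped pairs breaks ties by hour.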
