# Hacker News Data Analysis
This project analyzes posts (user-submitted stories) on Hacker News, a news site run by the startup incubator Y Combinator, using the data set provided in hacker_news.csv. The posts of most interest are those whose titles begin with either "Ask HN" or "Show HN". Ask HN posts put specific questions to the Hacker News community, whereas Show HN posts share projects, products, and other things users find interesting. The project attempts to answer two questions: "Do Ask HN or Show HN posts receive more comments on average?" and "Do posts created at a certain time receive more comments on average?"

In [2]:
# Import the reader function, open the data file, and read it into a
# list of lists. The data set is assigned to the variable 'hn'.

from csv import reader
# The with block closes the file once it has been read.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
# Separating the header from the rest of the data.

headers = hn[:1]
hn = hn[1:]
print(headers)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [4]:
# Separating posts whose titles begin with "Ask HN" or "Show HN"
# (case-insensitive) from all other posts.

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    lowered_title = title.lower()
    if lowered_title.startswith('ask hn'):
        ask_posts.append(row)
    elif lowered_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [5]:
# Determining if ask posts or show posts receive more comments
# on average.

# Calculating average no. of ask comments
total_ask_comments = 0
for row in ask_posts:
    num_comments = row[4]
    total_ask_comments += int(num_comments)
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

# Calculating average no. of show comments
total_show_comments = 0
for row in show_posts:
    num_comments = row[4]
    total_show_comments += int(num_comments)
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


Based on the calculations in the cell above, ask posts received more comments on average than show posts. In this data sample, ask posts averaged about 14 comments each, compared with about 10 for show posts.
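
The two averaging loops above differ only in the list they read, so the calculation could be factored into one helper. The following is a minimal sketch, not part of the original notebook: the function name `average_comments`, its `comments_index` parameter, and the choice to return 0.0 for an empty list are all assumptions made for illustration. The two sample rows are copied from the data preview earlier in the notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for rows that store counts as strings."""
    if not posts:
        # Assumption: an empty list yields 0.0 instead of a ZeroDivisionError.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows copied from the data set preview (52 comments and 1 comment).
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

print(average_comments(sample_rows))  # (52 + 1) / 2 = 26.5
```

With the notebook's own lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.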

In [6]:
# Do ask posts created at a certain time attract more comments?

# To answer this question, the first step is to count the number of
# ask posts and comments by the hour each post was created.

import datetime as dt
result_list = [] # Holds [created_at, num_comments] pairs taken from
                # the 7th and 5th columns of ask_posts
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# Creating dictionaries to count number of posts and their comments
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date_time = row[0]
    date_object = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    hour = date_object.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
print(counts_by_hour)
print("\n")
print(comments_by_hour)

{'09': 45, '13': 85, '22': 71, '19': 110, '21': 109, '07': 34, '06': 44, '01': 60, '12': 73, '20': 80, '08': 48, '00': 55, '23': 68, '03': 54, '11': 58, '17': 100, '15': 116, '14': 107, '02': 58, '05': 46, '16': 108, '18': 109, '10': 59, '04': 47}


{'09': 251, '13': 1253, '22': 479, '19': 1188, '21': 1745, '07': 267, '06': 397, '01': 683, '12': 687, '20': 1722, '08': 492, '00': 447, '23': 543, '03': 421, '11': 641, '17': 1146, '15': 4477, '14': 1416, '02': 1381, '05': 464, '16': 1814, '18': 1439, '10': 793, '04': 337}


In [7]:
# Using the values from the dictionaries above to calculate the
# average number of comments for posts created during each hour

avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['22', 6.746478873239437], ['19', 10.8], ['21', 16.009174311926607], ['07', 7.852941176470588], ['06', 9.022727272727273], ['01', 11.383333333333333], ['12', 9.41095890410959], ['20', 21.525], ['08', 10.25], ['00', 8.127272727272727], ['23', 7.985294117647059], ['03', 7.796296296296297], ['11', 11.051724137931034], ['17', 11.46], ['15', 38.5948275862069], ['14', 13.233644859813085], ['02', 23.810344827586206], ['05', 10.08695652173913], ['16', 16.796296296296298], ['18', 13.20183486238532], ['10', 13.440677966101696], ['04', 7.170212765957447]]
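
The counting and averaging steps in the two cells above could also be combined into a single function. This is only a sketch under stated assumptions: the function name `average_comments_by_hour` and the sample rows are hypothetical, and `collections.defaultdict` is swapped in for the explicit `if hour not in ...` checks used in the notebook. The date format string is the one the notebook already uses.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each zero-padded hour string to the mean comment count of its posts."""
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # comments received by those posts
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Hypothetical rows in the same [created_at, num_comments] shape as result_list.
sample = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 11:05", 10],
    ["6/23/2016 22:20", 1],
]

print(average_comments_by_hour(sample))  # {'11': 31.0, '22': 1.0}
```

Returning a dictionary keeps the hour as the lookup key; it can be turned into the notebook's list-of-lists form with `[[h, a] for h, a in result.items()]`.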


In [25]:
# Making the list above more readable

swap_avg_by_hour = [] # list to swap columns
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("\n")

# Formatting the rows using the string and datetime format methods
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour = row[1]
    dt_object = dt.datetime.strptime(hour, "%H")
    time = dt_object.strftime("%H:%M")
    average = row[0]
    string = "{ti}: {avg:.2f} average comments per post"
    print(string.format(ti = time, avg = average))

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [6.746478873239437, '22'], [10.8, '19'], [16.009174311926607, '21'], [7.852941176470588, '07'], [9.022727272727273, '06'], [11.383333333333333, '01'], [9.41095890410959, '12'], [21.525, '20'], [10.25, '08'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.796296296296297, '03'], [11.051724137931034, '11'], [11.46, '17'], [38.5948275862069, '15'], [13.233644859813085, '14'], [23.810344827586206, '02'], [10.08695652173913, '05'], [16.796296296296298, '16'], [13.20183486238532, '18'], [13.440677966101696, '10'], [7.170212765957447, '04']]


Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


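According to the analysis above, ask posts created at 15:00 (3 p.m.) receive the most comments on average, about 38.59 per post, well ahead of 02:00 (2 a.m.) in second place with 23.81. Posts created between 03:00 and 09:00 average only about 5.6 to 10.3 comments, and 09:00 has the lowest average of any hour (5.58), so users looking for more responses should generally avoid the early morning.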
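
The ranking step can also be done without building a column-swapped copy of the list, by giving `sorted` a `key` that selects the average. The sketch below is an alternative to the notebook's approach, not the method it used; the name `avg_by_hour_sample` is hypothetical, and its six pairs are copied from the `avg_by_hour` output above.

```python
from operator import itemgetter

# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['21', 16.009174311926607],
    ['20', 21.525],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the second element (the average), highest first, and keep five.
top_hours = sorted(avg_by_hour_sample, key=itemgetter(1), reverse=True)[:5]

print("Top 5 Hours for Ask Posts Comments")
for hour, average in top_hours:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

One difference from the swap-and-sort version: when two hours have exactly the same average, sorting with `key` keeps them in their original order, whereas sorting the swapped lists breaks the tie by the hour string.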