# Exploring Hacker News
For this project I'll be looking through a dataset of posts submitted to [Hacker News](https://news.ycombinator.com/). The dataset includes the following columns:
- id
- title
- url
- num_points
- num_comments
- author
- created_at


We are specifically looking at posts whose titles begin with "Ask HN" or "Show HN". Users submit Ask HN posts to ask the community a specific question, and Show HN posts to show the community a project. Do posts with these titles receive more comments on average?
Do posts created at a specific time get more comments on average? Let's find out.

In [1]:
from csv import reader
# Use a context manager so the file is closed once it has been read.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("There are {:,} ask posts".format(len(ask_posts)))
print("There are {:,} show posts".format(len(show_posts)))
print("There are {:,} other posts".format(len(other_posts)))

There are 1,744 ask posts
There are 1,162 show posts
There are 17,194 other posts


In [4]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print("{:.2f} comments on average for ask posts".format(avg_ask_comments))

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print("{:.2f} comments on average for show posts.".format(avg_show_comments))

14.04 comments on average for ask posts
10.32 comments on average for show posts.


Here we see that ask posts receive more comments on average than show posts (14.04 versus 10.32). There are also 582 more ask posts than show posts. Since ask posts attract more comments on average, we will focus on them for the rest of the analysis. Next, let's see whether there is a relationship between the time a post was created and the number of comments it receives.
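The two averaging loops in the cell above repeat the same steps, so they could be folded into one helper. The following is a minimal sketch rather than part of the original analysis: the function name `average_comments`, its `comments_index` parameter, and the `sample_posts` rows are all hypothetical and only mirror the column order of the dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical sample rows, in the same column order as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: Another question?', '', '3', '4', 'user_b', '1/2/2016 15:30'],
]

print("{:.2f} comments on average".format(average_comments(sample_posts)))
```

With the real data, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.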

In [5]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_time = row[6]
    num_comments = int(row[4])
    result_list.append([created_time, num_comments])

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_time = row[0]
    num_comments = row[1]
    hour = dt.datetime.strptime(created_time, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments


comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

In [6]:
# Let's find the average number of comments per post for posts created during
# each hour of the day.
avg_by_hour = []
for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
avg_by_hour

[['00', 8.127272727272727],
 ['11', 11.051724137931034],
 ['14', 13.233644859813085],
 ['13', 14.741176470588234],
 ['20', 21.525],
 ['05', 10.08695652173913],
 ['08', 10.25],
 ['01', 11.383333333333333],
 ['06', 9.022727272727273],
 ['21', 16.009174311926607],
 ['23', 7.985294117647059],
 ['10', 13.440677966101696],
 ['19', 10.8],
 ['12', 9.41095890410959],
 ['04', 7.170212765957447],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['07', 7.852941176470588],
 ['16', 16.796296296296298],
 ['22', 6.746478873239437],
 ['09', 5.5777777777777775],
 ['15', 38.5948275862069],
 ['02', 23.810344827586206],
 ['17', 11.46]]

In [20]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)
print("\n")
print("Top 5 Hours for Ask Posts Comments:")

for average, hour in sorted_swap[:5]:
    hour = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour, average))

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]


Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


According to these statistics, if you want the best chance of your ask post receiving a large number of comments, aim to post around the following hours, listed from highest to lowest average: 3pm, 2am, 8pm, 4pm, and 9pm. These hours are in whatever time zone the dataset's `created_at` column uses.
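Converting 24-hour labels to clock times by hand is easy to get wrong, so here is a small sketch that does it in code. It also sorts the `[hour, average]` pairs with a `key` function instead of swapping the columns first. The helper name `to_12_hour` and the list name `hour_averages` are hypothetical; the six pairs are copied from the `avg_by_hour` output earlier in the notebook.

```python
# Six [hour, average] pairs copied from the avg_by_hour output.
hour_averages = [
    ['13', 14.741176470588234],
    ['20', 21.525],
    ['21', 16.009174311926607],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]


def to_12_hour(hour_str):
    """Convert a zero-padded 24-hour string such as '15' to '3pm'."""
    hour = int(hour_str)
    suffix = "am" if hour < 12 else "pm"
    # 0 and 12 both map to 12 on a 12-hour clock.
    return "{}{}".format(hour % 12 or 12, suffix)


# Sort by the average (second element), highest first, and keep the top five.
top_five = sorted(hour_averages, key=lambda pair: pair[1], reverse=True)[:5]
labels = [to_12_hour(hour) for hour, average in top_five]
print(", ".join(labels))
```

Sorting with `key` leaves each pair in its original `[hour, average]` order, which avoids building the intermediate swapped list.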