# Exploring Hacker News Posts 

This notebook presents a comparative analysis of submissions to the [Hacker News](https://news.ycombinator.com/) website.

The aim of the project is to compare two types of posts, `Ask HN` and `Show HN`, to answer the following questions:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [11]:
from csv import reader

# Open the file in a context manager so it is closed once the rows are read.
with open('hacker_news.csv', newline='') as hn_opened_file:
    hn = list(reader(hn_opened_file))

First the header is separated from the data.

In [12]:
headers = hn[0]
hn = hn[1:]

#print(headers)
print(hn[10:20])


[['11370829', 'Crate raises $4M seed round for its next-gen SQL database', 'http://techcrunch.com/2016/03/15/crate-raises-4m-seed-round-for-its-next-gen-sql-database/', '3', '1', 'hitekker', '3/27/2016 18:08'], ['11665197', 'Advertising Cannot Maintain the Internet. Heres the Secret Sauce Solution', 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/', '2', '1', 'dredmorbius', '5/10/2016 4:46'], ['11981466', 'Coding Is Over', 'https://medium.com/@loorinm/coding-is-over-6d653abe8da8', '18', '14', 'prostoalex', '6/26/2016 16:36'], ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['11587596', 'Custom Deleters for C++ Smart Pointers', 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html', '59', '18', 'ingve', '4/28/2016 10:01'], ['12335860', 'How often to update third party libraries?', '', '7', '5', 'rabid_oxen', '8/22/2016 12:37'], [
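
The header can also be separated while the file is being read, rather than by slicing the list afterwards. The following is a minimal sketch on a small in-memory CSV; `sample_csv` and its header names are assumptions for illustration (the real header row is not printed above), and the two data rows are copied from the output above.

```python
import csv
import io

# Illustrative stand-in for hacker_news.csv; the header names are assumed.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12335860,How often to update third party libraries?,,7,5,rabid_oxen,8/22/2016 12:37\n"
    "11981466,Coding Is Over,https://medium.com/@loorinm/coding-is-over-6d653abe8da8,18,14,prostoalex,6/26/2016 16:36\n"
)

rows = csv.reader(sample_csv)
sample_headers = next(rows)  # consume the first row as the header
sample_data = list(rows)     # the remaining rows are the data

print(sample_headers)
print(len(sample_data))
```

Calling `next()` on the reader takes only the first row, so the data list never contains the header in the first place.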

Next, a **for loop** sorts the posts into three separate lists: `Ask HN` posts, `Show HN` posts, and all remaining posts (`other_posts`). Titles are converted to lowercase before matching, so capitalization does not affect the result.

In [13]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
    
print('There are', len(ask_posts), 'ask posts.')
print('There are', len(show_posts), 'show posts.')
print('There are', len(other_posts), 'uncategorized posts.')

There are 1744 ask posts.
There are 1162 show posts.
There are 17194 uncategorized posts.
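
The prefix test used in the loop can be isolated in a small helper, which makes it easy to check on a handful of titles. This is only a sketch: the `categorize` function and the first sample title are hypothetical, while the other two titles come from the rows printed earlier.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: Example question?',  # hypothetical title
    'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
    'Coding Is Over',
]

labels = [categorize(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```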


Having determined the number of posts in each category, the next step is to find out which of the two categories, `Ask HN` or `Show HN`, receives more comments per post.

In [14]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4]) 
    total_ask_comments += num_comments

avg_ask_comments = round(total_ask_comments / len(ask_posts))

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4]) 
    total_show_comments += num_comments

# Divide by the number of show posts (not ask posts) to get the show average.
avg_show_comments = round(total_show_comments / len(show_posts))

category = ''

if avg_ask_comments > avg_show_comments:
    category = 'Ask HN'
else:
    category = 'Show HN'

statement = '{cat} posts have more comments on average'.format(cat = category)

print('There are', total_ask_comments, 'comments in total for ask posts.')
print('There are', total_show_comments, 'comments in total for show posts.')
print('The Ask HN category has on average', avg_ask_comments, 'comments per post.')
print('The Show HN category has on average', avg_show_comments, 'comments per post.')
print(statement)

There are 24483 comments in total for ask posts.
There are 11988 comments in total for show posts.
The Ask HN category has on average 14 comments per post.
The Show HN category has on average 10 comments per post.
Ask HN posts have more comments on average
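
Rounding to whole numbers hides some detail. As a quick check, the averages can be recomputed to two decimals from the totals and post counts printed above, dividing each total by the post count of its own category. The variable names `num_ask_posts` and `num_show_posts` are introduced here for illustration.

```python
# Totals and post counts as printed in the outputs above.
total_ask_comments, num_ask_posts = 24483, 1744
total_show_comments, num_show_posts = 11988, 1162

avg_ask = total_ask_comments / num_ask_posts
avg_show = total_show_comments / num_show_posts

print(f'Ask HN:  {avg_ask:.2f} comments per post')   # Ask HN:  14.04 comments per post
print(f'Show HN: {avg_show:.2f} comments per post')  # Show HN: 10.32 comments per post
```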


Focusing on the category with the higher average number of comments, `Ask HN`, the next step is to determine whether `Ask HN` posts are more likely to receive comments when created at a certain time. This is done by parsing the `created_at` entry of each row with the `datetime` module to extract the hour in which the post was created. The number of posts and the number of comments they received are then tallied by that hour.

In [15]:
import datetime as dt

results_list = []

for row in ask_posts:
    created_at = row[6]
    num_comms = row[4]
    results_list.append([created_at, num_comms])

counts_by_hour = {}
comments_by_hour = {}

for item in results_list:
    date = dt.datetime.strptime(item[0],'%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(date, '%H')
    cmnts = int(item[1])
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = cmnts
    else: 
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += cmnts

print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
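
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch on three hypothetical `[created_at, num_comments]` pairs in the same format as `results_list`, not a rerun of the full dataset.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical sample in the same format as results_list.
sample_results = [
    ['8/22/2016 12:37', '5'],
    ['6/26/2016 16:36', '14'],
    ['4/28/2016 16:01', '18'],
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts[hour] += 1
    sample_comments[hour] += int(num_comments)

print(dict(sample_counts))    # {'12': 1, '16': 2}
print(dict(sample_comments))  # {'12': 5, '16': 32}
```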


Having counted the number of `Ask HN` posts created in each hour and the number of comments those posts received, the next step is to calculate the average number of comments per post for each hour.

In [16]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour])])

print(avg_by_hour)

[['09', 6], ['13', 15], ['10', 13], ['14', 13], ['16', 17], ['23', 8], ['12', 9], ['17', 11], ['15', 39], ['21', 16], ['20', 22], ['02', 24], ['18', 13], ['03', 8], ['05', 10], ['19', 11], ['01', 11], ['22', 7], ['08', 10], ['04', 7], ['00', 8], ['06', 9], ['07', 8], ['11', 11]]
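
To see how the rounded values arise, the division can be repeated for a few hours with two decimals kept. The sketch below uses a subset of the per-hour totals printed above; the names `counts_subset`, `comments_subset`, and `avg_subset` are introduced here for illustration.

```python
# A subset of the per-hour totals printed above.
counts_subset = {'15': 116, '02': 58, '16': 108}
comments_subset = {'15': 4477, '02': 1381, '16': 1814}

# Average comments per post for each hour, kept to two decimals.
avg_subset = {
    hour: round(comments_subset[hour] / counts_subset[hour], 2)
    for hour in comments_subset
}

print(avg_subset)  # {'15': 38.59, '02': 23.81, '16': 16.8}
```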


Next, a new list of lists is created in which the positions of the `hour` and the `average comments` are swapped. Because lists are compared by their first element, sorting the swapped list in descending order puts the hours with the highest average number of comments first.

In [19]:
swop_avg_by_hour = []
for item in avg_by_hour:
    hour = item[0]
    avg =  item[1]
    swop_avg_by_hour.append([avg, hour]) 

sorted_swop = sorted(swop_avg_by_hour, reverse=True)
sorted_swop[0:5]



[[39, '15'], [24, '02'], [22, '20'], [17, '16'], [16, '21']]

For each of the top 5 entries in `sorted_swop`, a string is printed that gives the hour, its average number of comments per post, and its rank.

In [25]:
rank = 1
for item in sorted_swop[0:5]:
    avg = item[0]
    time = dt.datetime.strptime(item[1], '%H').strftime('%H:%M')
    statement = 'At number {r} comes the hour of {h} with an average of {a} comments.'.format(r=rank, h=time, a=avg)
    print(statement)
    rank += 1

At number 1 comes the hour of 15:00 with an average of 39 comments.
At number 2 comes the hour of 02:00 with an average of 24 comments.
At number 3 comes the hour of 20:00 with an average of 22 comments.
At number 4 comes the hour of 16:00 with an average of 17 comments.
At number 5 comes the hour of 21:00 with an average of 16 comments.
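
As an alternative to swapping the columns and keeping a manual counter, the list can be sorted on its second element with a `key` function and ranked with `enumerate`. This is a sketch on a subset of the `avg_by_hour` values printed earlier; `avg_pairs` and `top_hours` are names introduced here for illustration.

```python
# A subset of avg_by_hour as printed earlier: [hour, average comments].
avg_pairs = [['09', 6], ['15', 39], ['20', 22], ['02', 24], ['16', 17], ['21', 16]]

# Sort by the average (second element), highest first, and keep the top five.
top_hours = sorted(avg_pairs, key=lambda pair: pair[1], reverse=True)[:5]

for rank, (hour, avg) in enumerate(top_hours, start=1):
    print(f'{rank}. {hour}:00 with an average of {avg} comments')
```

One difference from the swapped-list approach: when two hours have the same average, sorting on the key alone keeps them in their original order, whereas sorting the swapped list breaks the tie by the hour string.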


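Through this analysis we have determined that `Ask HN` posts receive more comments on average than `Show HN` posts, and that `Ask HN` posts created during the 15:00 hour (3pm) receive the most comments on average, about 39 per post. The hourly comparison covers `Ask HN` posts only and uses the times as recorded in the `created_at` column of the dataset.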