# Hacker News Submission Analysis

Hacker News is a site started by Y Combinator where users submit posts that are voted on and discussed. The site is popular in the tech industry and in startup circles, so the links that reach the top of the Hacker News listings get a lot of visitors.

The original data has been filtered from 300,000 rows to a smaller dataset of 20,000 rows by:
1. Removing all submissions that do not have a comment
2. Creating a random sample of the remaining submissions

We're interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question, and `Show HN` posts to showcase a product, a project, or anything else of interest.

Our objective is to study the sample and determine:
1. Which of the two post types receives more comments on average
2. Whether posts created at certain times receive more comments on average

# Data Import

The source data can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). We will first read the file, and import it as a list of lists. We'll also separate the header from the dataset.

In [3]:
from csv import reader
import datetime as dt

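# Read the CSV into a list of rows; the file is closed once it has been read.
with open('/Users/shubzroy/Documents/GitHub/Analytics/Datasets/HN_posts_year_to_Sep_26_2016.csv') as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows.
hn_header = hn[0]
hn = hn[1:]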

In [4]:
hn_header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
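
The cells below refer to columns by position (for example `post[1]` for the title and `post[4]` for the comment count). As a minimal sketch, those positions could be looked up from the header instead of being hard-coded. The constant names here are hypothetical and are not used elsewhere in this notebook.

```python
# Header row as printed above.
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical constant names: look each position up once by column name.
TITLE_IDX = hn_header.index('title')
COMMENTS_IDX = hn_header.index('num_comments')
CREATED_AT_IDX = hn_header.index('created_at')

print(TITLE_IDX, COMMENTS_IDX, CREATED_AT_IDX)  # 1 4 6
```

With these in place, `post[COMMENTS_IDX]` would read the same value as `post[4]`.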

In [7]:
hn[0:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

# Filter Data

We need to split the data into `Ask HN` posts, `Show HN` posts, and all other posts based on the title. Each title starts with one of the two labels or with neither, so we'll build 3 separate lists.

Note for DQ: The dataset originally has ~300,000 records, and filters down to ~20,000 records after this step. 

In [8]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
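
The prefix check above can also be written as a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper name, and the sample titles are taken from the rows shown in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Titles copied from the dataset previews in this notebook.
sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'algorithmic music',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```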


In [9]:
ask_posts[0:5]

[['12578908',
  'Ask HN: What TLD do you use for local development?',
  '',
  '4',
  '7',
  'Sevrene',
  '9/26/2016 2:53'],
 ['12578522',
  'Ask HN: How do you pass on your work when you die?',
  '',
  '6',
  '3',
  'PascLeRasc',
  '9/26/2016 1:17'],
 ['12577908',
  'Ask HN: How a DNS problem can be limited to a geographic region?',
  '',
  '1',
  '0',
  'kuon',
  '9/25/2016 22:57'],
 ['12577870',
  'Ask HN: Why join a fund when you can be an angel?',
  '',
  '1',
  '3',
  'anthony_james',
  '9/25/2016 22:48'],
 ['12577647',
  'Ask HN: Someone uses stock trading as passive income?',
  '',
  '5',
  '2',
  '00taffe',
  '9/25/2016 21:50']]

In [10]:
show_posts[0:5]

[['12578335',
  'Show HN: Finding puns computationally',
  'http://puns.samueltaylor.org/',
  '2',
  '0',
  'saamm',
  '9/26/2016 0:36'],
 ['12578182',
  'Show HN: A simple library for complicated animations',
  'https://christinecha.github.io/choreographer-js/',
  '1',
  '0',
  'christinecha',
  '9/26/2016 0:01'],
 ['12578098',
  'Show HN: WebGL visualization of DNA sequences',
  'http://grondilu.github.io/dna.html',
  '1',
  '0',
  'grondilu',
  '9/25/2016 23:44'],
 ['12577991',
  'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules',
  'https://github.com/jakebian/zeal',
  '2',
  '0',
  'dbranes',
  '9/25/2016 23:17'],
 ['12577142',
  'Show HN: Jumble  Essays on the go #PaulInYourPocket',
  'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8',
  '1',
  '1',
  'ryderj',
  '9/25/2016 20:06']]

# Analysis

Let's look at which post type received more comments on average, starting with the `Ask HN` posts.

In [12]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [16]:
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


We can see that the average number of comments is noticeably higher for `Ask HN` posts than for `Show HN` posts. On average, `Ask HN` posts receive about 10.4 comments, roughly twice the 4.9 comments that `Show HN` posts receive.
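
Both averages follow the same pattern: sum the `num_comments` column and divide by the number of posts in the list. A minimal sketch of a shared helper is shown below. `average_comments` is a hypothetical name, and the three sample rows are copied from the `ask_posts` preview above rather than from the full dataset.

```python
def average_comments(posts, comments_idx=4):
    """Mean of the num_comments column over a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_idx]) for post in posts)
    # Divide by the number of posts, not by the number of columns in a row.
    return total / len(posts)

# Three rows copied from the ask_posts preview.
sample = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]
sample_avg = average_comments(sample)
print(round(sample_avg, 2))  # 3.33, i.e. (7 + 3 + 0) / 3
```

Calling the same helper on `ask_posts` and `show_posts` would keep the divisor tied to the list being averaged.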

Since the `Ask HN` posts received more comments, we'll use them to see whether posts created at specific times received more comments. We'll iterate through the `Ask HN` posts to get the total number of comments received in each hour and the number of posts created in each hour.

In [30]:
result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    dt_string = row[0]
    num_comments = row[1]
    dt_time = dt.datetime.strptime(dt_string, "%m/%d/%Y %H:%M")
    dt_hour = dt.datetime.strftime(dt_time, "%H")
    if dt_hour not in counts_by_hour:
        counts_by_hour[dt_hour] = 1 
        comments_by_hour[dt_hour] =  num_comments
    else:
        counts_by_hour[dt_hour] += 1 
        comments_by_hour[dt_hour] +=  num_comments
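
The same per-hour tally can be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch because missing keys start at 0. This is an alternative illustration, not the approach used in the cell above. The sample pairs are taken from the `ask_posts` preview, and the `sample_` names are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs copied from the ask_posts preview.
sample = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 1:17', 3),
    ('9/25/2016 22:57', 0),
    ('9/25/2016 22:48', 3),
    ('9/25/2016 21:50', 2),
]

sample_counts = defaultdict(int)    # posts created in each hour
sample_comments = defaultdict(int)  # comments received by posts from each hour

for created_at, num_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '02'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'02': 1, '01': 1, '22': 2, '21': 1}
print(dict(sample_comments))  # {'02': 7, '01': 3, '22': 3, '21': 2}
```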

We'll use the above to calculate the average number of comments per post for each hour.

In [36]:
avg_by_hour = []

for hours in comments_by_hour:
    avg_by_hour.append([hours, round(comments_by_hour[hours]/counts_by_hour[hours],2)])

print(avg_by_hour)

[['02', 11.14], ['01', 7.41], ['22', 8.8], ['21', 8.69], ['19', 7.16], ['17', 9.45], ['15', 28.68], ['14', 9.69], ['13', 16.32], ['11', 8.96], ['10', 10.68], ['09', 6.65], ['07', 7.01], ['03', 7.95], ['23', 6.7], ['20', 8.75], ['16', 7.71], ['08', 9.19], ['00', 7.56], ['18', 7.94], ['12', 12.38], ['04', 9.71], ['06', 6.78], ['05', 8.79]]


Let's sort the list of lists in descending order of average comments per post.

In [51]:
swap_avg_by_hour = []

for hours in avg_by_hour:
    swap_avg_by_hour.append([hours[1],hours[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for ranking in sorted_swap[0:5]:
    template = "{0}: {1:.2f} average comments per post"
    rank_time = dt.datetime.strptime(ranking[1], "%H")
    time_hr = dt.datetime.strftime(rank_time, "%H:%M")
    print(template.format(time_hr, ranking[0]))

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
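
Swapping the columns before sorting works, but `sorted` also accepts a `key` function, so the list can be ordered by its second element directly. A small sketch of that alternative follows. The pairs are a subset copied from the `avg_by_hour` output above, and `sample_avg_by_hour` and `top_three` are hypothetical names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
sample_avg_by_hour = [['02', 11.14], ['15', 28.68], ['13', 16.32], ['09', 6.65], ['12', 12.38]]

# Sort by the average (second element), highest first, and keep the top three.
top_three = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_three:
    print("{0}:00: {1:.2f} average comments per post".format(hour, avg))
```

One difference worth noting: when two hours have the same average, this version keeps them in their original order, whereas sorting the swapped lists breaks ties by the hour string.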


Given the above ranking, and assuming the `created_at` timestamps are recorded in EST, you'd want to create a post at 3pm EST (2pm CST) for the highest average number of comments per post. If a range is preferred, the top 5 hours suggest posting between 12pm and 3pm EST (11am-2pm CST).
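
The EST-to-CST conversion can be checked with a short sketch. It assumes fixed offsets of UTC-5 for EST and UTC-6 for CST and ignores daylight saving time. The date is simply the most recent one in the dataset, and the variable names are hypothetical.

```python
import datetime as dt

# Assumed fixed offsets (no daylight saving adjustment).
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
CST = dt.timezone(dt.timedelta(hours=-6), "CST")

# The top-ranked hour, 15:00, expressed as an EST timestamp.
posted_est = dt.datetime(2016, 9, 26, 15, 0, tzinfo=EST)
posted_cst = posted_est.astimezone(CST)

print(posted_cst.strftime("%H:%M %Z"))  # 14:00 CST
```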