# Correlation between the Types of Posts and User Interest in Hacker News

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

## Introduction

Each row of the data set describes one post. Below are descriptions of the columns:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted
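
Because each row is read as a plain list of strings, the columns above are addressed by position. The following is a minimal sketch of pairing the column names with one row for more readable access. The sample row is taken from the data set, while `columns` and `row_as_dict` are illustrative names that are not part of the analysis itself.

```python
# Column names in the order listed above.
columns = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# One sample row, as the csv module would return it (all values are strings).
row = ['12579005', 'SQLAR  the SQLite Archiver',
       'https://www.sqlite.org/sqlar/doc/trunk/README.md',
       '1', '0', 'blacksqr', '9/26/2016 3:24']

# Pair each column name with its value.
row_as_dict = dict(zip(columns, row))

print(row_as_dict['author'])
print(int(row_as_dict['num_comments']))   # numeric columns need an explicit conversion
print(columns.index('num_comments'))      # position of the comment count within a row
print(columns.index('created_at'))        # position of the timestamp within a row
```

The positions printed at the end correspond to the indexes `row[4]` and `row[6]` used in the rest of the analysis.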

The titles of some posts begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Below are a few examples:

> Ask HN: How to improve my personal website?
>
> Ask HN: Am I the only one outraged by Twitter shutting down share counts?
>
> Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

> Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
>
> Show HN: Something pointless I made
>
> Show HN: Shanhu.io, a programming playground powered by e8vm

These two types of posts will be compared to answer the following questions:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

First, the `reader` function is imported from the `csv` module, and the data set is read into a list of lists.


In [1]:
from csv import reader

# Read the whole file into a list of rows (each row is a list of strings).
with open('hn-2016-dataset.csv', encoding="utf-8") as f:
    hn = list(reader(f))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


## Removing Headers

Printing the first five rows shows that the first inner list contains the column headers, while each list after it contains the data for one post. The header row therefore has to be removed before the data is analyzed.

In [2]:
headers = hn[0]   # the first row holds the column names
hn = hn[1:]       # keep only the data rows
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.tale
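
The header can also be separated while the file is being read, rather than by slicing afterwards. The sketch below assumes a tiny in-memory CSV (`sample_csv`, with made-up rows) in place of the real file and uses `next()` to consume the first row from the reader.

```python
import io
from csv import reader

# A tiny in-memory stand-in for the CSV file (hypothetical sample data).
sample_csv = io.StringIO(
    "id,title,num_comments\n"
    "1,Ask HN: Example question?,3\n"
    "2,Show HN: Example project,5\n"
)

rows = reader(sample_csv)
headers = next(rows)   # consume the first row, which holds the column names
data = list(rows)      # everything that remains is data

print(headers)
print(data)
```

Both approaches give the same result; `next()` simply avoids building a list that still contains the header row.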

## Extracting Ask HN and Show HN Posts

The posts were distributed into three different categories:
- `ask_posts`, which includes the `Ask HN` posts,
- `show_posts`, which includes the `Show HN` posts,
- `other_posts`, which includes the rest of the posts.

Then, the number of posts in each category was printed:

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts), len(show_posts), len(other_posts))


9139 10158 273822
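
The prefix check in the loop above can be isolated into a small function, which makes the rule easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper name, and the last sample title is invented to show that the comparison ignores case.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "algorithmic music",
    "SHOW HN: mixed-case prefixes still match",  # hypothetical title
]

for title in sample_titles:
    print(classify_title(title), "|", title)
```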


## Calculating the Average Number of Comments for Each Category

Next, the average number of comments per post was calculated for each category.

In [4]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)

print(avg_ask_comments)
print(avg_show_comments)
    

10.393478498741656
4.886099625910612


Ask posts received about 10 comments per post on average, and show posts received about 5 comments per post on average. Since ask posts are more likely to receive comments, the remaining analysis will focus on these posts.
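
The two averaging loops above follow the same pattern, so they could be folded into one function. The sketch below assumes rows shaped like those in the data set, with the comment count stored as a string at index 4; `average_comments` and the sample rows are illustrative and not part of the original analysis.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows: id, title, url, num_points, num_comments.
sample_posts = [
    ['1', 'Ask HN: first example', '', '2', '6'],
    ['2', 'Ask HN: second example', '', '1', '4'],
]

print(average_comments(sample_posts))
print(average_comments([]))
```

With the real data, the same function would be called once with `ask_posts` and once with `show_posts`.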

## Finding the Number of Ask Posts and Comments by Hour Created

The next goal is to find out whether ask posts created at a certain *time* are more likely to attract comments. The following steps will be used to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Then, calculate the average number of comments ask posts receive by hour created.

The first step was carried out to find the number of ask posts created per hour, along with the total number of comments.

In [5]:
from datetime import datetime

# Pair each ask post's creation time with its comment count.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}
for created_at, num_comments in result_list:
    # Parse timestamps such as '9/26/2016 3:26' and keep only the hour.
    created_at_dt = datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    h = created_at_dt.hour
    if h not in counts_by_hour:
        counts_by_hour[h] = 1
        comments_by_hour[h] = num_comments
    else:
        counts_by_hour[h] += 1
        comments_by_hour[h] += num_comments


Here, two dictionaries were created:
- `counts_by_hour`: contains the number of ask posts created during each hour of the day.
- `comments_by_hour`: contains the total number of comments received by the ask posts created during each hour of the day.
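
The same two dictionaries can be built without the explicit `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. The sketch below runs on a handful of made-up timestamp and comment pairs (`sample`) rather than on the real `ask_posts` list.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (created_at, num_comments) pairs in the data set's date format.
sample = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 13:10", 3],
]

counts_by_hour = defaultdict(int)    # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```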

## Calculating the Average Number of Comments for Ask HN Posts by Hour

The two dictionaries created above were used to calculate the average number of comments for posts created during each hour of the day. The printed result is a list of lists in which each inner list holds an hour followed by the average number of comments for that hour.

In [6]:
avg_by_hour = []
for key in counts_by_hour:
    avg_comments = comments_by_hour[key] / counts_by_hour[key]
    l = [key, avg_comments]
    avg_by_hour.append(l)
print(avg_by_hour)

[[2, 11.137546468401487], [1, 7.407801418439717], [22, 8.804177545691905], [21, 8.687258687258687], [19, 7.163043478260869], [17, 9.449744463373083], [15, 28.676470588235293], [14, 9.692007797270955], [13, 16.31756756756757], [11, 8.96474358974359], [10, 10.684397163120567], [9, 6.653153153153153], [7, 7.013274336283186], [3, 7.948339483394834], [23, 6.696793002915452], [20, 8.749019607843136], [16, 7.713298791018998], [8, 9.190661478599221], [0, 7.5647840531561465], [18, 7.94299674267101], [12, 12.380116959064328], [4, 9.7119341563786], [6, 6.782051282051282], [5, 8.794258373205741]]


## Sorting and Printing Values from a List of Lists

Since it is difficult to spot the hours with the highest values in the printed result, the list of lists was sorted, and the five highest averages were printed in a format that is easier to read.

In [7]:
swap_avg_by_hour = []
for row in avg_by_hour:
    l = [row[1], row[0]]
    swap_avg_by_hour.append(l)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("<Top 5 Hours for Ask Posts Comments>")
for row in sorted_swap[:5]:
    form = "{}: {:.2f} average comments per post"
    time_dt = datetime.strptime(str(row[1]), "%H")
    time_str = time_dt.strftime("%H:%M")
    text = form.format(time_str, row[0])
    print(text)

<Top 5 Hours for Ask Posts Comments>
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
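
Swapping the columns before sorting is one way to rank the hours; another is to pass a `key` function to `sorted()` so that the list is ordered by its second element directly. The sketch below uses a shortened copy of `avg_by_hour`, with the averages rounded to two decimals from the output shown earlier.

```python
# A subset of the [hour, average] pairs, rounded for brevity.
avg_by_hour = [
    [2, 11.14], [15, 28.68], [13, 16.32],
    [9, 6.65], [12, 12.38], [10, 10.68],
]

# Sort by the average (index 1), highest first, and keep the top five.
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

This keeps each pair in its original `[hour, average]` order, so no swapped copy of the list is needed.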


Thus, ask posts created at 3 p.m. ET (12 p.m. PT) received the most comments on average, at about 28.68 per post, well ahead of the second-best hour (1 p.m. ET, at about 16.32). One possible explanation is that this hour falls within the daytime across the United States, when many Hacker News users are awake and active, although the data set alone cannot confirm this.
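
The relationship between Eastern and Pacific time can be checked with a short conversion. This is a sketch that assumes fixed standard-time offsets (UTC-5 for Eastern, UTC-8 for Pacific) and an arbitrary winter date; neither the offsets nor the date come from the data set.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed standard-time offsets, ignoring daylight saving time.
eastern = timezone(timedelta(hours=-5), "EST")
pacific = timezone(timedelta(hours=-8), "PST")

# 3 p.m. Eastern on an arbitrary date.
peak_et = datetime(2016, 1, 15, 15, 0, tzinfo=eastern)

# The same instant expressed in Pacific time.
peak_pt = peak_et.astimezone(pacific)

print(peak_et.strftime("%H:%M %Z"))
print(peak_pt.strftime("%H:%M %Z"))
```

Since both zones shift by one hour during daylight saving time, the three-hour difference holds throughout the year.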