# Hacker News Posts Analysis

In this small project, we investigate a [Hacker News dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) from Kaggle. As detailed on Kaggle, this dataset includes all posts from the 12-month period from September 26, 2015 to September 26, 2016. Post data include the number of points (upvotes) and comments at the time of the data scrape.

Here we will investigate two main questions:
1. Do "Ask HN" or "Show HN" posts receive more comments?
 - "Ask HN" posts are submitted by users asking the Hacker News community a specific question, whereas "Show HN" posts highlight something of interest to the community.
2. Do posts created at a certain date/time receive more comments?

Sections of the analysis are outlined below:

[0.0 Data Pre-processing](#0.0-Data-Pre-processing) <br>
[0.1 Import Dataset](#0.1-Import-Dataset) <br>
[0.2 Filter Data](#0.2-Filter-Data) <br>

[1.0 Post Metrics](#1.0-Post-Metrics) <br>
[1.1 Comment Numbers](#1.1-Comment-Numbers) <br>
[1.2 Point Numbers](#1.2-Point-Numbers) <br>

[2.0 Effect of Time of Day](#2.0-Effect-of-Time-of-Day) <br>
[2.1 Comments vs Time of Day](#2.1-Comments-vs-Time-of-Day) <br>
[2.2 High Comment Average Between 2am-3am](#2.2-High-Comment-Average-Between-2am-3am) <br>
***

In [None]:
# imports
import datetime as dt

## 0.0 Data Pre-processing
### 0.1 Import Dataset

In [2]:
from csv import reader

# Read every CSV row into a list of lists; the with-block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
print('{}\n{}\n{}\n'.format(hn[0], hn[1], hn[2]))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']



In [3]:
# Zeroth row includes the column headers, separate to its own variable
headers = hn[0]
hn = hn[1:]

### 0.2 Filter Data

Since we are only concerned with posts beginning with "Ask HN" or "Show HN", we will separate the dataset into 3 tables: Ask HN posts, Show HN posts, other posts.

In [4]:
ask_posts, show_posts, other_posts = [], [], []
for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print('{} posts: {}'.format('Ask HN',len(ask_posts)))
print('{} posts: {}'.format('Show HN',len(show_posts)))
print('{} posts: {}'.format('Other HN',len(other_posts)))

Ask HN posts: 1744
Show HN posts: 1162
Other HN posts: 17194


Ask HN and Show HN posts together make up approximately 14% of the posts in this dataset. It is also worth noting that this subset is split roughly 60% Ask HN to 40% Show HN, so neither category overwhelmingly dominates the other.
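The percentages quoted here can be checked directly from the printed counts. The snippet below is a small sketch; the `counts` dictionary and the variable names are introduced only for illustration and simply restate the numbers printed above.

```python
# Post counts copied from the printed output above
counts = {'Ask HN': 1744, 'Show HN': 1162, 'Other': 17194}

total_posts = sum(counts.values())
ask_show_posts = counts['Ask HN'] + counts['Show HN']

# Share of the whole dataset taken up by Ask HN and Show HN posts
share_of_all = ask_show_posts / total_posts
# Split between the two categories within that subset
ask_share = counts['Ask HN'] / ask_show_posts
show_share = counts['Show HN'] / ask_show_posts

print('Ask/Show share of all posts: {:.1%}'.format(share_of_all))
print('Ask HN share of Ask/Show posts: {:.1%}'.format(ask_share))
print('Show HN share of Ask/Show posts: {:.1%}'.format(show_share))
```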

## 1.0 Post Metrics

### 1.1 Comment Numbers

In [5]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
print('Total Ask HN post comments:',total_ask_comments)
avg_ask_comments = round(total_ask_comments / len(ask_posts))
print('Average comment count per Ask HN post:',avg_ask_comments)

Total Ask HN post comments: 24483
Average comment count per Ask HN post: 14


In [6]:
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
print('Total Show HN post comments:',total_show_comments)
avg_show_comments = round(total_show_comments / len(show_posts))
print('Average comment count per Show HN post:',avg_show_comments)

Total Show HN post comments: 11988
Average comment count per Show HN post: 10


So on average, Ask HN posts receive about 4 more comments than Show HN posts. This is not too surprising: there are only so many ways to respond to a post that shows something, whereas asking the community a question invites more varied responses. Next, we look at the average points as well.
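As a quick check on the comment gap, the averages can be compared before rounding. This sketch only reuses the totals and post counts printed above; the variable names are illustrative.

```python
# Totals and post counts copied from the printed outputs above
ask_total, ask_count = 24483, 1744
show_total, show_count = 11988, 1162

ask_mean = ask_total / ask_count
show_mean = show_total / show_count
gap = ask_mean - show_mean

print('Ask HN mean comments: {:.2f}'.format(ask_mean))
print('Show HN mean comments: {:.2f}'.format(show_mean))
print('Difference: {:.2f}'.format(gap))
```

The unrounded difference is a little under 4, so the gap quoted from the rounded averages is slightly generous but points the same way.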

### 1.2 Point Numbers

In [7]:
def get_post_points_avg(posts):
    """Return the mean of the num_points column (index 3) over the given posts."""
    total_points = 0
    for post in posts:
        total_points += int(post[3])
    return total_points / len(posts)

In [8]:
avg_ask_points = round(get_post_points_avg(ask_posts))
avg_show_points = round(get_post_points_avg(show_posts))
print('Average points per Ask HN post:',avg_ask_points)
print('Average points per Show HN post:',avg_show_points)

Average points per Ask HN post: 15
Average points per Show HN post: 28


Show HN posts earn nearly twice as many points on average, which is also not surprising. People casually browsing are probably more likely to upvote something interesting than to join a discussion, and the number of points a post receives is related to its viewership.

## 2.0 Effect of Time of Day

Focusing on the posts identified with the most comments, Ask HN posts, we will look at how the time of post creation affects how many comments are made. 

Our hypothesis is that posts created after community members have finished work receive more comments. Assuming a 9-to-5 work schedule, afternoon Eastern Time seems a likely high point, since it roughly averages the time zones of Western Europe and America (though it is only speculation that community members are located in these regions).
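To make the time zone reasoning concrete, here is a minimal sketch that converts an Eastern Time hour into local hours elsewhere. It assumes fixed standard-time offsets (daylight saving time is ignored), and the names `EASTERN`, `OTHER_ZONES`, and `local_hours` are hypothetical helpers that are not part of the analysis itself.

```python
import datetime as dt

# Assumption: fixed standard-time UTC offsets; daylight saving time is ignored
EASTERN = dt.timezone(dt.timedelta(hours=-5), 'EST')
OTHER_ZONES = {
    'US Pacific': dt.timezone(dt.timedelta(hours=-8), 'PST'),
    'UK': dt.timezone(dt.timedelta(hours=0), 'GMT'),
    'Central Europe': dt.timezone(dt.timedelta(hours=1), 'CET'),
}

def local_hours(eastern_hour):
    """Return the local hour in each zone for a given Eastern Time hour."""
    # The date is arbitrary; only the hour matters for this sketch
    moment = dt.datetime(2016, 1, 15, eastern_hour, 0, tzinfo=EASTERN)
    return {name: moment.astimezone(tz).hour for name, tz in OTHER_ZONES.items()}

print(local_hours(15))
```

Under these assumptions, 3pm Eastern corresponds to midday on the US West Coast and to the evening in Western Europe.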

### 2.1 Comments vs Time of Day

In [9]:
# Extract the target columns from the dataset
result_list = []
for post in ask_posts:
    created_at = post[6]
    comments = int(post[4])
    result_list.append([created_at,comments])

In [10]:
# Process result_list, make tables for posts and comments by hour
counts_by_hour, comments_by_hour = {}, {}
for post in result_list:
    dt_created = dt.datetime.strptime(post[0], '%m/%d/%Y %H:%M')
    if dt_created.hour in counts_by_hour:
        counts_by_hour[dt_created.hour] += 1
        comments_by_hour[dt_created.hour] += post[1]
    else:
        counts_by_hour[dt_created.hour] = 1
        comments_by_hour[dt_created.hour] = post[1]

In [11]:
# Combine these two tables for average comments per post per hour
# Note: this code diverges from Dataquest, because it's better
avg_by_hour = [[h,0] for h in range(0,24)]
disp_text = 'Average comments for {h:02}:00 to {h:02}:59 posts: {c}'
for hour_data in avg_by_hour:
    h = hour_data[0]
    hour_data[1] = round(comments_by_hour[h] / counts_by_hour[h])
    print(disp_text.format(h=h,c=hour_data[1]))

Average comments for 00:00 to 00:59 posts: 8
Average comments for 01:00 to 01:59 posts: 11
Average comments for 02:00 to 02:59 posts: 24
Average comments for 03:00 to 03:59 posts: 8
Average comments for 04:00 to 04:59 posts: 7
Average comments for 05:00 to 05:59 posts: 10
Average comments for 06:00 to 06:59 posts: 9
Average comments for 07:00 to 07:59 posts: 8
Average comments for 08:00 to 08:59 posts: 10
Average comments for 09:00 to 09:59 posts: 6
Average comments for 10:00 to 10:59 posts: 13
Average comments for 11:00 to 11:59 posts: 11
Average comments for 12:00 to 12:59 posts: 9
Average comments for 13:00 to 13:59 posts: 15
Average comments for 14:00 to 14:59 posts: 13
Average comments for 15:00 to 15:59 posts: 39
Average comments for 16:00 to 16:59 posts: 17
Average comments for 17:00 to 17:59 posts: 11
Average comments for 18:00 to 18:59 posts: 13
Average comments for 19:00 to 19:59 posts: 11
Average comments for 20:00 to 20:59 posts: 22
Average comments for 21:00 to 21:59 posts

The post creation times with the highest average number of comments are as follows:
1. Between 3pm and 4pm
2. Between 2am and 3am
3. Between 8pm and 9pm
4. Between 4pm and 5pm
5. Between 9pm and 10pm
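
A ranking like this can be produced by sorting the hourly averages in descending order. The sketch below restates the rounded averages from the printout for hours 0 to 20 only; later hours are cut off in the printout and are deliberately left out, so only the first four ranks are shown.

```python
# Rounded hourly averages copied from the printout above (hours 0-20 only;
# later hours are cut off in the printout and are not guessed here)
avg_by_hour = [
    [0, 8], [1, 11], [2, 24], [3, 8], [4, 7], [5, 10], [6, 9],
    [7, 8], [8, 10], [9, 6], [10, 13], [11, 11], [12, 9], [13, 15],
    [14, 13], [15, 39], [16, 17], [17, 11], [18, 13], [19, 11], [20, 22],
]

# Sort by the average (second element), highest first
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
top_four = ranked[:4]

for hour, avg in top_four:
    print('{h:02}:00 to {h:02}:59: {c} comments per post on average'.format(h=hour, c=avg))
```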

According to the dataset, these times are in Eastern Time. We hypothesized that posts created in the afternoon, Eastern Time, would receive the most comments. While the top ranks support this, the hypothesis does not explain the second-ranked timeslot of 2am to 3am, so let's investigate it.

### 2.2 High Comment Average Between 2am-3am

In [12]:
disp_text = '(Posts, Comments) from {h:02}:00 to {h:02}:59: ({p}, {c})'
for hour in range(24):
    print(disp_text.format(h=hour,p=counts_by_hour[hour],c=comments_by_hour[hour]))

(Posts, Comments) from 00:00 to 00:59: (55, 447)
(Posts, Comments) from 01:00 to 01:59: (60, 683)
(Posts, Comments) from 02:00 to 02:59: (58, 1381)
(Posts, Comments) from 03:00 to 03:59: (54, 421)
(Posts, Comments) from 04:00 to 04:59: (47, 337)
(Posts, Comments) from 05:00 to 05:59: (46, 464)
(Posts, Comments) from 06:00 to 06:59: (44, 397)
(Posts, Comments) from 07:00 to 07:59: (34, 267)
(Posts, Comments) from 08:00 to 08:59: (48, 492)
(Posts, Comments) from 09:00 to 09:59: (45, 251)
(Posts, Comments) from 10:00 to 10:59: (59, 793)
(Posts, Comments) from 11:00 to 11:59: (58, 641)
(Posts, Comments) from 12:00 to 12:59: (73, 687)
(Posts, Comments) from 13:00 to 13:59: (85, 1253)
(Posts, Comments) from 14:00 to 14:59: (107, 1416)
(Posts, Comments) from 15:00 to 15:59: (116, 4477)
(Posts, Comments) from 16:00 to 16:59: (108, 1814)
(Posts, Comments) from 17:00 to 17:59: (100, 1146)
(Posts, Comments) from 18:00 to 18:59: (109, 1439)
(Posts, Comments) from 19:00 to 19:59: (110, 1188)
(Posts

So the number of posts is not significantly different in the hours around 2am Eastern Time, yet these posts apparently receive many more comments. Perhaps this jump in comments can be explained by the types of posts made at this time. For example, if they are aimed at people in a country whose workday ends around 2am Eastern Time, the spike would make more sense. Alternatively, some media may be released regularly at this time, or there may be any number of other reasons. We start by looking at the 2am posts.

In [13]:
# 2am posts
post_intro = '{h:02}:{m:02} post, {c} comments: {t}'
for post in ask_posts:
    dt_created = dt.datetime.strptime(post[6], '%m/%d/%Y %H:%M')
    if dt_created.hour == 2:
        print(post_intro.format(h=dt_created.hour, m=dt_created.minute,
                                c=int(post[4]), t=post[1]))

02:47 post, 3 comments: Ask HN: $500k revenue business  Shopify vs. Custom website?
02:05 post, 22 comments: Ask HN: Critique my biz idea  local.menu the next Airbnb
02:36 post, 30 comments: Ask HN: Settle for less salary or change to make more in prime years
02:30 post, 7 comments: Ask HN: Your advise wanted for how to get into programming
02:34 post, 15 comments: Ask HN: Client is bullying for refund since beginning
02:41 post, 1 comments: Ask HN: Has Dropbox been compromised recently?
02:02 post, 8 comments: Ask HN: Declarative database migrations?
02:33 post, 4 comments: Ask HN: How can you work at a desk all day?
02:53 post, 4 comments: Ask HN: How to work fewer hours?
02:17 post, 11 comments: Ask HN: Electrical Eng. PhD, Thinking of Moving to Programming Jobs. Worth It?
02:21 post, 11 comments: Ask HN: Should I build an airline booking system?
02:23 post, 8 comments: Ask HN: Cheap databases for new projects?
02:24 post, 41 comments: Ask HN: The best app to keep a work diary
02:34

The second-to-last post in the above printout has 868 comments! Without this post, Ask HN posts created between 2am and 3am have 1381 - 868 = 513 comments, which is much more in line with the comment totals for other hours. The 868-comment post can be found [here](https://news.ycombinator.com/item?id=11694277). Looking at some of the top comments, we can see that they were made throughout the day and not just shortly after 2am, and some follow-up responses were made days later. It seems that this general question simply trended in the HN community, rather than supporting a hypothesis that posts of a particular nature are made around 2am.
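
The effect of this single post on the 2am average can be worked out from the numbers already printed. This is a small sketch with illustrative variable names; the post and comment counts come from the 02:00 row of the table above.

```python
# Values copied from the 02:00 to 02:59 row of the (Posts, Comments) table
posts_2am, comments_2am = 58, 1381
outlier_comments = 868

# Average with every 2am post included
mean_with_outlier = comments_2am / posts_2am
# Average after dropping the single 868-comment post
mean_without_outlier = (comments_2am - outlier_comments) / (posts_2am - 1)

print('With the 868-comment post: {:.1f}'.format(mean_with_outlier))
print('Without it: {:.1f}'.format(mean_without_outlier))
```

Dropping the one post brings the 2am average from roughly 24 down to 9 comments per post, which is in the same range as the neighbouring hours.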