# Exploring Hacker News

*Jack Kolberg-Edelbrock, MS*

## Executive Summary

Using the ["Hacker News Posts" dataset from Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts), I identified how posting type and posting time correlate with received comments and points. 

First, I examined the relationship between post type and the number of comments and points received. The data show that **ask** posts receive the most comments, and **other** posts (neither **ask** nor **show**) receive the most points. I believe that the nature of **ask** and **other** posts leads to this observation. Requests for help in **ask** posts necessitate responses; the **other** category, which primarily consists of industry news and product reviews, is most likely to receive quick point-based feedback, as the content of the post is identifiable from the title.

Second, I examined how posting time-of-day influences the number of comments and points received. Here, the data show that early afternoon posts receive the most comments, although there are other, smaller comment hotspots throughout the day. The clearest trend between number of points awarded and post time-of-day is in the **other** category. With these posts, the *number of points received is nearly independent of the posting time-of-day.* I believe this also ties back to the nature of the posts, with **other** posts being more digestible during the work day, on the bus, or even in the middle of the night. Contrast this with the content-heavy nature of **ask** or **show** posts.

*Overall, the nature of the post type demonstrably correlates with the number of comments and points received. **Ask** posts receive the most comments, and **other** posts receive points independent of post time-of-day.*

---
## Introduction

### What is Hacker News?

[Hacker News](https://news.ycombinator.com/) is an independent news site, similar to Reddit, operated by [Y Combinator](https://www.ycombinator.com/). The website is built on a simple posting platform with an up/down voting system and comments on posts.

### Who uses Hacker News?

Hacker News is popular in technology and start-up circles, and is visited by hundreds of thousands of users daily.

### Why evaluate stories on Hacker News?

By examining how different post categories (i.e., Ask HN, Show HN, etc.) differ in the comments and points they receive, we can gain insight into what sort of posts would generate the most visibility (through comments or aggregate votes) for our client. We can also determine if there is an opportune time of day to post on HN to maximize visibility.

### Posting categories on Hacker News

Within this project, we will sort Hacker News posts into one of three categories:

1. **Ask**: posts in which users are requesting aid with projects
2. **Show**: posts in which users are showcasing their own projects
3. **Other**: posts which fall into neither the ASK nor the SHOW categories
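
The three categories above can be expressed as a small title check. The following is a minimal sketch, assuming (as the analysis later in this notebook does) that a category is identified by a case-insensitive `ask hn` or `show hn` prefix in the title. The function name `categorize_title` is hypothetical.

```python
def categorize_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# The first and last titles come from the dataset sample in this notebook;
# the second is made up to illustrate the case-insensitive match.
titles = [
    'Ask HN: Who works in (mobile) robotics on HN?',
    'show hn: a made-up project title',
    'Backblaze Introducing B2 Cloud Storage',
]
categories = [categorize_title(t) for t in titles]
print(categories)
```

A prefix check like this will miss posts that showcase a project without the `Show HN` label; those fall into the **Other** category.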

---

## Goal
Examine how post type and post time-of-day correlate with the number of comments and points received by a post.

---
## Imports

In [1]:
from csv import reader
import datetime as dt
import numpy as np
import pandas as pd

---

## Data Processing

### Data Source
This project utilizes the **"Hacker News Posts"** dataset, which is [available on Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts).

In [2]:
'''
Dataset Headings:
    [0] = *id*: the post identifier number
    [1] = *title*: the title of the post
    [2] = *url*: the url that the post links to, if it links to a URL
    [3] = *num_points*: the number of points the post acquired, calculated as upvotes less downvotes
    [4] = *num_comments*: the number of comments made on the post
    [5] = *author*: the username of the account that submitted the post
    [6] = *created_at*: the date and time at which the post was created
'''

# Import data from local source
hn = pd.read_csv('./DataSets/HackerNews/HN_posts_year_to_Sep_26_2016.csv', encoding='utf8', header=0)
hn.sample(5)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
173953,11084685,Argon2 code audits Infer,https://lolware.net/2016/02/12/argon2-code-rev...,2,0,technion,2/12/2016 1:16
18689,12403833,Just want to introduce my robot: Moorebot,,1,0,DaisyDai,9/1/2016 8:57
280504,10266176,Backblaze Introducing B2 Cloud Storage,https://www.backblaze.com/b2/cloud-storage.html,4,1,philipp-spiess,9/23/2015 16:48
239826,10562126,ODB: C++ ORM with #pragma,http://www.codesynthesis.com/products/odb/,2,0,vmorgulis,11/13/2015 20:13
137873,11378604,Ask HN: Who works in (mobile) robotics on HN?,,10,1,ibnroberttuta,3/29/2016 0:18


### Data Cleaning
These data were cleaned by *DataQuest* in two steps.

1. Remove all submissions without comments
2. Randomly sample from remaining submissions

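The original data file includes more than 300,000 entries. The cleaned data set has ~30,000 entries. Note that the sample above includes posts with zero comments and the counts below total 293,119 posts, so the file loaded in this notebook appears to be the larger, unsampled one.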
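
The two cleaning steps can be sketched in pandas as follows. This is an illustration on a made-up toy table, not DataQuest's actual procedure; the sample size and the `random_state` seed are assumptions chosen only so the example is reproducible.

```python
import pandas as pd

# Toy stand-in for the raw posts (values are made up for illustration)
raw = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'num_comments': [0, 3, 0, 1, 7, 2],
})

# Step 1: remove all submissions without comments
with_comments = raw.loc[raw['num_comments'] > 0]

# Step 2: randomly sample from the remaining submissions
cleaned = with_comments.sample(n=3, random_state=0)

print(len(raw), len(with_comments), len(cleaned))
```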

### Data sorting and aggregation

With the data read into our program, we start by filtering the remaining data into three buckets:

1. 'Ask HN'
2. 'Show HN'
3. 'Other'

In [3]:
# Generate bool vectors to identify ask and show posts
ask_bool = hn.loc[:,'title']               \
                .str.lower()               \
                .str.startswith('ask hn')
show_bool = hn.loc[:,'title']              \
                .str.lower()               \
                .str.startswith('show hn')

# Sort initial hn list into ask, show, and other sub-lists
ask_posts = hn.loc[ask_bool, :]
show_posts = hn.loc[show_bool, :]
other_posts = hn.loc[~(ask_bool | show_bool), :]

# Output
print('The number of ASK posts is \t{}'.format(ask_posts.shape[0]))
print('The number of SHOW posts is \t{}'.format(show_posts.shape[0]))
print('The number of OTHER posts is \t{}'.format(other_posts.shape[0]))
print('--------------------------------------')
print('The total number of posts is \t{}'.format(hn.shape[0]))

The number of ASK posts is 	9139
The number of SHOW posts is 	10158
The number of OTHER posts is 	273822
--------------------------------------
The total number of posts is 	293119


---
## Functions
### metric_sort

In [4]:
'''
This function accepts a DataFrame which follows the format of this project's imported dataset.
It groups the posts by the hour in which they were created, calculates the average of a
metric (comments or points) for each hour, prints the top hours, and returns the hourly
averages sorted in descending order.

Arguments:
    *post_list*    = a subset of the original dataset as a pandas DataFrame
    *dataset_name* = the name of the data subset, for use in the print-out
    *metric_index* = the column label of the metric of interest. For number of comments,
                         this will be 'num_comments'. For number of points, this will
                         be 'num_points'.
    *metric_name*  = the name of the metric of interest, for use in the print-out
    *top_x_hours*  = the number of hours to include in the print-out
'''

def metric_sort(post_list, dataset_name, metric_index, metric_name, top_x_hours):
    # Convert the 'created_at' column into sortable datetime objects
    post_hour = pd.to_datetime(post_list.loc[:,'created_at'], format='%m/%d/%Y %H:%M')

    # Dictionaries for sorting posts by hour
    hourly_metric_count = {}
    hourly_metric_amount = {}
        
    # Count hourly posts and total the metric using an hour-based bool array
    for hr in range(24):
        dates_bool = post_hour.dt.hour == hr
        hourly_metric_count[hr] = sum(dates_bool)
        hourly_metric_amount[hr] = sum(post_list.loc[dates_bool, metric_index])
    
    # Convert hourly dictionaries into pd.Series for ease of printing and calculation
    hourly_metric_count = pd.Series(hourly_metric_count)
    hourly_metric_amount = pd.Series(hourly_metric_amount)
    hourly_metric_average = pd.Series.divide(hourly_metric_amount, hourly_metric_count)
    hourly_metric_average = hourly_metric_average.sort_values(ascending = False)  

    # OUTPUT
    print(f'Top hours for {metric_name} on {dataset_name} posts')
    print(f'---------------------------------------')
    for hour in range(top_x_hours):
        print(f'Hour {hourly_metric_average.index[hour]:02d}'
              f' = {hourly_metric_average.iloc[hour]:.2f}')
    print('\n')
    
    return hourly_metric_average
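
The hour-by-hour loop in `metric_sort` can also be written with a pandas `groupby`. The following is a minimal sketch of that alternative, not a drop-in replacement: the helper name `hourly_average` and the toy data are hypothetical, it omits the print-out, and hours with no posts are simply absent from the result rather than appearing as `NaN`.

```python
import pandas as pd


def hourly_average(posts, metric_column):
    """Average `metric_column` by posting hour, sorted from highest to lowest."""
    # Parse timestamps in the dataset's format and keep only the hour
    hours = pd.to_datetime(posts['created_at'], format='%m/%d/%Y %H:%M').dt.hour
    return (posts[metric_column]
            .groupby(hours)
            .mean()
            .sort_values(ascending=False))


# Toy data in the dataset's timestamp format (comment counts are made up)
toy = pd.DataFrame({
    'created_at': ['3/29/2016 0:18', '3/29/2016 0:45', '9/1/2016 8:57'],
    'num_comments': [1, 3, 10],
})
result = hourly_average(toy, 'num_comments')
print(result)
```

Here the two posts created during hour 0 are averaged together, and the single post from hour 8 sorts first because its average is higher.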

### avg_metric

In [5]:
'''
This function prints the average number of points or comments per post,
given a data subset in the format of the imported data.

Arguments:
    *data_subset*  = a subset of the original data set as a pandas DataFrame
    *subset_name*  = the name of the data subset, for use in the print-out
    *metric_index* = the positional column index of the metric of interest. For number
                         of comments, this will be 4. For number of points, this will be 3.
    *metric_name*  = the name of the metric of interest, for use in the print-out
'''
def avg_metric(data_subset, subset_name, metric_index, metric_name):      
    sum_metric = data_subset.iloc[:, metric_index].sum()
    num_posts = data_subset.shape[0]
    
    average_metric = sum_metric/num_posts
    
    print(f'The average number of {metric_name} '
          f'per post in the {subset_name}'
          f' subset is: {average_metric:.2f}')
#     return [sum_metric, average_metric]

---
## Data Analysis
### What post types receive the most comments?

In [6]:
avg_metric(ask_posts, "ASK", 4, "comments")
avg_metric(show_posts, "SHOW", 4, "comments")
avg_metric(other_posts, "OTHER", 4, "comments")

The average number of comments per post in the ASK subset is: 10.39
The average number of comments per post in the SHOW subset is: 4.89
The average number of comments per post in the OTHER subset is: 6.46


#### Ask posts receive most comments on average


Based on our analysis, *ask* posts receive the most comments, followed by *other* posts, then *show* posts. The exact results are in the output above. This finding appears logical, as posts seeking help are explicitly soliciting comments.
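
The three `avg_metric` calls can be condensed by labelling each post with its category and averaging per group. This is a sketch on made-up toy rows; the `category` column is an assumption for illustration and is not part of the original dataset.

```python
import pandas as pd

# Toy rows in the shape of the dataset (titles and numbers are made up)
toy = pd.DataFrame({
    'title': ['Ask HN: Who works in robotics?', 'Show HN: My robot',
              'ODB: C++ ORM', 'ask hn: any tips?'],
    'num_points': [10, 4, 2, 6],
    'num_comments': [1, 3, 0, 5],
})

# Label each post using the same case-insensitive prefix rule as above
lowered = toy['title'].str.lower()
toy['category'] = 'other'
toy.loc[lowered.str.startswith('ask hn'), 'category'] = 'ask'
toy.loc[lowered.str.startswith('show hn'), 'category'] = 'show'

# Average comments and points per category in a single pass
summary = toy.groupby('category')[['num_comments', 'num_points']].mean()
print(summary)
```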

---
### What post category receives the most points on average?

In [7]:
avg_metric(ask_posts, "ASK", 3, "points")
avg_metric(show_posts, "SHOW", 3, "points")
avg_metric(other_posts, "OTHER", 3, "points")

The average number of points per post in the ASK subset is: 11.31
The average number of points per post in the SHOW subset is: 14.84
The average number of points per post in the OTHER subset is: 15.16


#### Posts that are neither *Ask HN* nor *Show HN* receive the most points.

The data above show that non-ask, non-show HN posts receive the most points, followed by Show HN posts and then Ask HN posts. I imagine that the **other** category includes memes, news stories, or cat photos, which are easy to award points to but neither promote discussion, ask for information, nor provide information.

A brief exploration of the articles composing the other category is given below.

In [8]:
print(other_posts.iloc[100:110,1])

109        Solve a wooden puzzle with Python and Jupyter
110                                 Dont be too inspired
111    Marc Andreessen suddenly deletes all his tweet...
112    Wearable tech devices have a negative effect o...
113    Strong Growth of Entertainment and Nightlife I...
114    A Possible Future of Software Development by  ...
115               (Idea) A Spotify for Lifelong Learning
116    10,000 Listings and $30M in Bookings in First ...
117                       Carver Mead  Research Overview
118                                             Wikilisp
Name: title, dtype: object


From the post titles, I infer that the **other** category consists of a mixture of unlabeled "Show HN" posts, news/technical stories that users think would be interesting for the community, and personal views or reviews:

**Unlabeled Show HN**
* Solve a wooden puzzle with Python and Jupyter

**News/Technical Stories**
* Dont be too inspired
* Marc Andreessen suddenly deletes all his tweet...
* Wearable tech devices have a negative effect o...
* Strong Growth of Entertainment and Nightlife I...
* 10,000 Listings and $30M in Bookings in First...
* Carver Mead Research Overview
* Wikilisp

**Personal views/reviews**
* A Possible Future of Software Development by...
* (Idea) A Spotify for Lifelong Learning

Given that news stories tend to have broader appeal than any one project, it makes sense that more users would be interested in rating the story.

### Does time-of-day affect the number of comments on *Ask HN* posts?

In [9]:
comments_ask_sort = metric_sort(ask_posts, "ASK", 'num_comments', "comment(s)", 5)
comments_show_sort = metric_sort(show_posts, "SHOW", 'num_comments', "comment(s)", 5)
comments_other_sort = metric_sort(other_posts, "OTHER", 'num_comments', "comment(s)", 5)

Top hours for comment(s) on ASK posts
---------------------------------------
Hour 15 = 28.68
Hour 13 = 16.32
Hour 12 = 12.38
Hour 02 = 11.14
Hour 10 = 10.68


Top hours for comment(s) on SHOW posts
---------------------------------------
Hour 12 = 6.99
Hour 07 = 6.68
Hour 11 = 6.00
Hour 08 = 5.60
Hour 14 = 5.52


Top hours for comment(s) on OTHER posts
---------------------------------------
Hour 12 = 7.59
Hour 11 = 7.37
Hour 02 = 7.18
Hour 13 = 7.15
Hour 05 = 6.79




#### Early afternoon includes several of the best times to post

As shown in the output above, if you want to maximize the number of comments you receive on an *Ask HN* post, it is best to post during the hours of 2pm, 12pm, 11am, 1am, or 9am Central time (the times above are given in Eastern time).
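
The conversion from the dataset's Eastern-time hours to Central time is a one-hour shift. A minimal sketch, assuming that both zones switch to daylight saving time together so the offset is always one hour; the helper names are hypothetical.

```python
def eastern_to_central(hour):
    """Shift an hour of the day (0-23) from US Eastern to US Central time."""
    return (hour - 1) % 24


def to_12_hour(hour):
    """Format an hour of the day (0-23) as a 12-hour label such as '2pm'."""
    suffix = 'am' if hour < 12 else 'pm'
    return f'{hour % 12 or 12}{suffix}'


# Top five Eastern-time hours for comments on ASK posts, from the output above
top_ask_hours_eastern = [15, 13, 12, 2, 10]
central_labels = [to_12_hour(eastern_to_central(h)) for h in top_ask_hours_eastern]
print(central_labels)  # ['2pm', '12pm', '11am', '1am', '9am']
```

The modulo keeps the result within 0-23, so midnight Eastern (hour 0) maps to 11pm Central.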

### Does the number of points awarded to posts vary with time-of-day?

Finally, we can examine whether the number of points, which I would take as a stronger proxy of user activity than the number of comments, varies by time of day.

In [10]:
sorted_ask = metric_sort(ask_posts, 'ASK', 'num_points', 'points', 5)
sorted_show = metric_sort(show_posts, 'SHOW', 'num_points', 'points', 5)
sorted_other = metric_sort(other_posts, 'OTHER', 'num_points', 'points', 5)

Top hours for points on ASK posts
---------------------------------------
Hour 15 = 21.64
Hour 13 = 17.93
Hour 12 = 13.58
Hour 10 = 13.44
Hour 17 = 12.19


Top hours for points on SHOW posts
---------------------------------------
Hour 12 = 20.91
Hour 11 = 19.26
Hour 13 = 17.02
Hour 19 = 16.06
Hour 06 = 15.99


Top hours for points on OTHER posts
---------------------------------------
Hour 02 = 16.71
Hour 12 = 16.70
Hour 11 = 16.29
Hour 00 = 16.12
Hour 13 = 16.02




#### The average number of points per post varies least with time-of-day in the "Other" category

The most striking result of this analysis is that the average points per post in the *other* category is almost flat across a 24-hour period. Contrast this with the ranges of the other two categories, which are **substantially** larger:

In [11]:
range_ask = sorted_ask.max() - sorted_ask.min()
range_show = sorted_show.max() - sorted_show.min()
range_other = sorted_other.max() - sorted_other.min()

print(f'Range of average points per post:\n' \
      f'---------------------------------\n' \
      f'Ask HN:  {range_ask:.2f}\n'          \
      f'Show HN: {range_show:.2f}\n'         \
      f'Other:    {range_other:.2f}')

Range of average points per post:
---------------------------------
Ask HN:  14.01
Show HN: 10.38
Other:    2.93
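
To put these ranges on a common scale, each can be divided by the category's overall average points per post reported earlier (11.31, 14.84, and 15.16). This is a rough sketch: normalizing by the overall average is an assumption made here for illustration rather than part of the original analysis, and the variable names are hypothetical.

```python
# Range of hourly average points per post (from the output above)
ranges = {'Ask HN': 14.01, 'Show HN': 10.38, 'Other': 2.93}
# Overall average points per post for each category (from the avg_metric output)
averages = {'Ask HN': 11.31, 'Show HN': 14.84, 'Other': 15.16}

# Express each range as a fraction of the category's overall average
relative_range = {name: ranges[name] / averages[name] for name in ranges}

for name, value in relative_range.items():
    print(f'{name}: {value:.2f}')
```

On this scale the Ask HN range is larger than the category's own average (about 1.24), while the Other range is roughly a fifth of its average (about 0.19).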


---
## Conclusion
In this project, we examined a dataset containing posting, commenting, and rating statistics from "Hacker News." We addressed four questions:

1. What post category receives the most comments on average?

Ask posts receive the most comments, followed by other posts, followed by show posts. I hypothesize that this is due to the very nature of ask posts requesting feedback.

2. What post category receives the most points on average?

The **Ask** and **Show** posts receive fewer points than **Other** posts. A brief analysis shows that the **Other** category is composed of news stories, opinion pieces, and unlabeled "show" posts. I hypothesize that this observation is due to the digestibility of the **Other** category. These posts do not require in-depth understanding of another person's work, so it is easier to award them points than it is for **Ask** or **Show** posts.

3. Does the number of comments on an **Ask** HN post vary with the time of day it was posted?

Time-of-day does appear to correlate with the number of comments received per **Ask** HN post. Beyond a cluster of hours in the early afternoon Central US time, I do not see a particularly coherent pattern. Overall, the hour with the highest comment number per "Ask HN" post is 2pm Central US time (3pm Eastern).

4. Does the number of points awarded to posts vary with the time of day they were posted?

This question revealed the surprising finding that, over an average 24-hour period, the range in average points per post across posting hours was drastically lower in the **Other** category than in the **Ask** or **Show** categories. This could reflect the broader interest that these topics have, as well as news articles being more digestible during the day than individual projects.