This notebook explores the characteristics of posts that attract the most upvotes on Hacker News. I conclude the ultimate post is **Sam Altman posting about Rust at midday ET on Sunday.**

A fun extension of this basic analysis would be a machine learning model that predicts upvotes based on a post's characteristics.

In [None]:
#load in libraries
import pandas as pd
import re
%matplotlib inline

In [None]:
#read in the data set
df_hn = pd.read_csv('../input/HN_posts_year_to_Sep_26_2016.csv',parse_dates=['created_at'],index_col=[0])

###Top Ten Posts Over the Last Year

Apple's letter to customers about the US government's request to break into the iPhone received the most upvotes, followed by a BBC article about the UK voting to leave the EU.

In [None]:
df_hn[['title','url','num_points']].sort_values(by='num_points',ascending=False)[0:10]

###Exploring relationship between domain names and upvotes

Links to Medium and GitHub appear most often.

In [None]:
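#pull the host name out of each url; the raw string avoids invalid-escape warnings
#urls with nothing after the host (no trailing '/') don't match and are left as NaN
df_hn['domain'] = df_hn['url'].str.extract(r'^https?://([0-9a-z.\-]*)/.*$',flags=re.IGNORECASE,expand=False)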
df_groupby = df_hn.groupby(by='domain')
df_groupby['num_points'].count().sort_values(ascending=False)[0:20]

GitHub and Medium also attract the most aggregate upvotes.

In [None]:
df_groupby['num_points'].sum().sort_values(ascending=False)[0:20]
#ideally I'd strip out the subdomains
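
One rough way to strip out the subdomains, as the comment above suggests, is to keep only the last two dot-separated labels of each host name. This is a minimal sketch on made-up data: `registered_domain`, `toy` and `root_domain` are names I've invented for illustration, and the two-label rule is a naive assumption that breaks on suffixes like `co.uk` and would also merge hosts where each subdomain belongs to a different author.

```python
import pandas as pd

def registered_domain(host):
    """Naive helper (hypothetical): keep the last two labels of a host name."""
    if not isinstance(host, str):
        return host  # leave missing values untouched
    labels = host.lower().split('.')
    return '.'.join(labels[-2:])

# made-up rows standing in for the 'domain' and 'num_points' columns
toy = pd.DataFrame({
    'domain': ['blog.example.com', 'www.example.com', 'example.org', None],
    'num_points': [10, 5, 3, 1],
})

# fold subdomains together, then total the upvotes per root domain
toy['root_domain'] = toy['domain'].map(registered_domain)
totals = toy.groupby('root_domain')['num_points'].sum().sort_values(ascending=False)
print(totals)
```

Rows with no domain drop out of the groupby, which matches how the cells above treat missing domains.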

However, among domains with 10 or more posts, the one covering the programming language Rust attracts the most upvotes per post on average.

In [None]:
df_groupby['num_points'].mean()[df_groupby['num_points'].count() > 9].sort_values(ascending=False)[0:20]
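
The "average upvotes, but only for groups with enough posts" filter can also be written with a single `agg` call, which keeps the mean and the count side by side so the threshold is easy to see. A small sketch on made-up data; `toy`, `stats`, `min_posts` and `ranked` are illustrative names, and the threshold of 2 is only there because the toy table is tiny.

```python
import pandas as pd

# made-up rows: a.com has three posts, b.com has one very popular post
toy = pd.DataFrame({
    'domain': ['a.com', 'a.com', 'a.com', 'b.com'],
    'num_points': [1, 2, 3, 100],
})

# mean and count per domain in one table
stats = toy.groupby('domain')['num_points'].agg(['mean', 'count'])

# keep only domains with enough posts, then rank by average upvotes
min_posts = 2  # the notebook uses 10; lowered here for the toy data
ranked = stats[stats['count'] >= min_posts].sort_values('mean', ascending=False)
print(ranked)
```

Without the threshold, b.com would top the list on the strength of a single post, which is the effect the 10+ cutoff is meant to avoid.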

###Best time to post

Posts made around midday ET attract the most upvotes on average.

In [None]:
df_hn['hour'] = df_hn['created_at'].dt.hour
df_groupby = df_hn.groupby(by='hour')
df_groupby['num_points'].mean().sort_values(ascending=False)
#should really strip out outliers before analyzing the impact of hour of day or day of week
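
A cheap way to see how much outliers matter, without actually removing them, is to compare the mean against the median for each hour. This is a sketch on made-up numbers (the `toy` frame and its values are invented), and swapping in the median is my suggestion rather than something the analysis above does.

```python
import pandas as pd

# made-up rows: hour 9 contains one runaway hit, hour 12 is steady
toy = pd.DataFrame({
    'hour': [9, 9, 9, 12, 12, 12],
    'num_points': [1, 2, 3000, 4, 5, 6],
})

# mean, median and post count per hour, side by side
summary = toy.groupby('hour')['num_points'].agg(['mean', 'median', 'count'])
print(summary)

# the "best" hour flips depending on which statistic is used
best_by_mean = summary['mean'].idxmax()
best_by_median = summary['median'].idxmax()
```

If the ranking by median differs a lot from the ranking by mean on the real data, a handful of viral posts is probably driving the result.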

Sunday is the best day to post, going by average upvotes.

In [None]:
df_hn['dayofweek'] = df_hn['created_at'].dt.dayofweek
df_groupby = df_hn.groupby(by='dayofweek')
df_groupby['num_points'].mean().sort_values(ascending=False)
#Monday is 0 and Sunday is 6
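
To avoid decoding the 0 to 6 day codes by hand, pandas can return the weekday name directly through `dt.day_name()`. A short sketch on made-up timestamps; `toy`, `day_name` and `by_day` are illustrative names, not columns from the data set.

```python
import pandas as pd

# made-up posts: two on Sunday 25 Sep 2016, one on Monday 26 Sep 2016
toy = pd.DataFrame({
    'created_at': pd.to_datetime(['2016-09-25 12:00', '2016-09-26 08:00', '2016-09-25 18:00']),
    'num_points': [10, 2, 20],
})

# label each post with its weekday name instead of a number
toy['day_name'] = toy['created_at'].dt.day_name()
by_day = toy.groupby('day_name')['num_points'].mean().sort_values(ascending=False)
print(by_day)
```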

###Are there users who have a good nose for a popular HN post?

The user ingve has attracted the most upvotes in aggregate over the last year.

In [None]:
##top 20 users whose posts attract the most upvotes
df_groupby = df_hn.groupby(by='author')
df_groupby['num_points'].sum().sort_values(ascending=False)[0:20]

Sam Altman attracts the most upvotes per post on average (among users who have made 10 or more posts).

In [None]:
df_groupby['num_points'].mean()[df_groupby['num_points'].count() > 9].sort_values(ascending=False)[0:20]