### Exploring Hacker News Website Posts Data

### Background
Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Y Combinator, Paul Graham's investment fund and startup incubator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".

Users submit <b><u>Ask HN</u></b> posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit <b><u>Show HN</u></b> posts to show the Hacker News community a project, product, or just generally something interesting.
 <br><br>
 ##### Dataset
The dataset we are working with comprises almost 300,000 posts from September 2015 to September 2016 and can be found on Kaggle: <br>
https://www.kaggle.com/hacker-news/hacker-news-posts <br>

It contains the following columns:


|Feature|Type|Description|
|---|---|---|
|**id**|*int*|Post ID|
|**title**|*object*|Title of the post|
|**url**|*object*|URL of the item being linked to|
|**num_points**|*int*|Number of upvotes the post received|
|**num_comments**|*int*|Number of comments the post received|
|**author**|*object*|Name of the account that created that post|
|**created_at**|*object*|Date and time the post was made (Eastern US Timezone)|


<br><br>
##### Project Scope
We are going to analyze the feedback received by two types of HN posts (Ask HN and Show HN) and see which one is more popular, using the average number of comments per post as our metric. <br>

Specifically, we will address the following: <br>

1. Do Ask HN or Show HN receive more comments on average? 
2. Do Ask HN posts created at a certain time receive more comments on average?
3. What are the most common topics posted on Hacker News?
4. Predict the number of upvotes a headline would receive


In [1]:
import numpy as np
import pandas as pd
import datetime as dt

In [2]:
hn = pd.read_csv(r'./HN_posts_year_to_Sep_26_2016.csv')

In [3]:
hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
id              293119 non-null int64
title           293119 non-null object
url             279256 non-null object
num_points      293119 non-null int64
num_comments    293119 non-null int64
author          293119 non-null object
created_at      293119 non-null object
dtypes: int64(3), object(4)
memory usage: 15.7+ MB


In [4]:
display(hn.head(5))

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12579008,You have two days to comment if you want stem ...,http://www.regulations.gov/document?D=FDA-2015...,1,0,altstar,9/26/2016 3:26
1,12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24
2,12578997,What if we just printed a flatscreen televisio...,https://medium.com/vanmoof/our-secrets-out-f21...,1,0,pavel_lishin,9/26/2016 3:19
3,12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-al...,1,0,poindontcare,9/26/2016 3:16
4,12578979,How the Data Vault Enables the Next-Gen Data W...,https://www.talend.com/blog/2016/05/12/talend-...,1,0,markgainor1,9/26/2016 3:14


In [38]:
hn["num_comments"].value_counts()[0]
#as the output below shows, 212,718 posts have 0 comments

212718

In [6]:
hn_filtered = hn[hn["num_comments"]!=0]
hn_filtered.info()
#a total of 80,401 posts have at least one comment

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80401 entries, 5 to 293116
Data columns (total 7 columns):
id              80401 non-null int64
title           80401 non-null object
url             70664 non-null object
num_points      80401 non-null int64
num_comments    80401 non-null int64
author          80401 non-null object
created_at      80401 non-null object
dtypes: int64(3), object(4)
memory usage: 4.9+ MB
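
As a quick sanity check, the two counts above should add up to the full dataset, and together they give the share of posts that received no comments. A minimal sketch that uses only the numbers printed in the outputs above (the variable names are chosen here for illustration):

```python
# Counts copied from the outputs above
total_posts = 293119          # rows reported by hn.info()
zero_comment_posts = 212718   # hn["num_comments"].value_counts()[0]
posts_with_comments = 80401   # rows reported by hn_filtered.info()

# The two groups should partition the dataset
counts_add_up = zero_comment_posts + posts_with_comments == total_posts
print("Counts add up:", counts_add_up)

# Share of posts that received no comments at all
share_without_comments = zero_comment_posts / total_posts
print("{:.1%} of posts received no comments".format(share_without_comments))
```

Because most posts receive no comments, averages computed over all posts are pulled toward zero, which is worth keeping in mind when reading the per-type averages later on.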


In [7]:
#by looking at the title column of the dataset, we can see how many Ask HN and Show HN posts there are
hn["title"] = hn["title"].str.lower()
ask_posts = hn[hn["title"].str.startswith("ask hn")]
show_posts = hn[hn["title"].str.startswith("show hn")]
other_posts = hn[~(hn["title"].str.startswith("ask hn")) & ~(hn["title"].str.startswith("show hn"))]

In [8]:
print("{} ask posts".format(len(ask_posts)))
print("{} show posts".format(len(show_posts)))
print("{} other posts".format(len(other_posts)))

9139 ask posts
10158 show posts
273822 other posts
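
The same split can also be expressed as a single label column, which keeps all posts in one frame. This is only a sketch on made-up titles; the `post_type` name and the toy data are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

# Toy titles, made up for illustration
titles = pd.Series([
    "Ask HN: How do you learn a new language?",
    "Show HN: A tiny static site generator",
    "Why static typing matters",
    "ask hn: any book recommendations?",
])

# Lowercase first so the prefix check is case-insensitive
lowered = titles.str.lower()
is_ask = lowered.str.startswith("ask hn")
is_show = lowered.str.startswith("show hn")

# Label each title; anything that is neither Ask nor Show is "other"
post_type = pd.Series("other", index=titles.index)
post_type[is_ask] = "ask"
post_type[is_show] = "show"

print(post_type.value_counts())
```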


In [9]:
#average comments per ask post
mean_comments_ask = np.mean(ask_posts["num_comments"])
print("On average, {} comments per Ask HN post".format(round(mean_comments_ask,2)))
#average comments per show post
mean_comments_show = np.mean(show_posts["num_comments"])
print("On average, {} comments per Show HN post".format(round(mean_comments_show,2)))

On average, 10.39 comments per Ask HN post
On average, 4.89 comments per Show HN post


###### To answer our first question, "Do Ask HN or Show HN receive more comments on average?":
###### Ask HN posts receive 10.39 comments per post on average, compared to 4.89 comments per Show HN post.
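
The two averages above can also be computed in one step by grouping on a post-type label. A minimal sketch on hypothetical numbers (the `posts` frame and its values are made up and do not come from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
posts = pd.DataFrame({
    "post_type": ["ask", "ask", "show", "show", "other"],
    "num_comments": [10, 2, 3, 5, 0],
})

# Mean number of comments per post type in a single step
mean_comments = posts.groupby("post_type")["num_comments"].mean()
print(mean_comments)
```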

In [39]:
#convert "created_at" column to datetime datatype
#(work on an explicit copy so the assignment does not act on a slice of hn)
ask_posts = ask_posts.copy()
ask_posts["created_at"] = pd.to_datetime(ask_posts["created_at"],
                                         format = "%m/%d/%Y %H:%M")

In [40]:
ask_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9139 entries, 10 to 293114
Data columns (total 8 columns):
id              9139 non-null int64
title           9139 non-null object
url             56 non-null object
num_points      9139 non-null int64
num_comments    9139 non-null int64
author          9139 non-null object
created_at      9139 non-null datetime64[ns]
hour            9139 non-null int64
dtypes: datetime64[ns](1), int64(4), object(3)
memory usage: 642.6+ KB


In [41]:
#extract hour value from "created_at" column to a new column "hour"
ask_posts.loc[:,"hour"] = ask_posts.loc[:,"created_at"].dt.hour
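
To see what the format string does on a single value, here is a small sketch using the standard library's `datetime.strptime` rather than pandas. The timestamp is the first `created_at` value shown in the raw data above; the names `DATE_FORMAT`, `raw`, and `parsed` are chosen here for illustration:

```python
import datetime as dt

# Same format string as the pd.to_datetime call above
DATE_FORMAT = "%m/%d/%Y %H:%M"

# Timestamp as it appears in the raw "created_at" column
raw = "9/26/2016 3:26"

# Parse the string and pull out the hour of day (0 to 23)
parsed = dt.datetime.strptime(raw, DATE_FORMAT)
hour = parsed.hour

print(parsed, hour)
```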

In [42]:
display(ask_posts)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,hour
10,12578908,ask hn: what tld do you use for local developm...,,4,7,Sevrene,2016-09-26 02:53:00,2
42,12578522,ask hn: how do you pass on your work when you ...,,6,3,PascLeRasc,2016-09-26 01:17:00,1
76,12577908,ask hn: how a dns problem can be limited to a ...,,1,0,kuon,2016-09-25 22:57:00,22
80,12577870,ask hn: why join a fund when you can be an angel?,,1,3,anthony_james,2016-09-25 22:48:00,22
102,12577647,ask hn: someone uses stock trading as passive ...,,5,2,00taffe,2016-09-25 21:50:00,21
...,...,...,...,...,...,...,...,...
293047,10177359,ask hn: is coursera specialization in product ...,,1,0,pipipzz,2015-09-06 11:27:00,11
293052,10177317,ask hn: any meteor devs out there who could sp...,,2,1,louisswiss,2015-09-06 10:52:00,10
293055,10177309,ask hn: any recommendations for books about ra...,,2,4,rationalthrowa,2015-09-06 10:46:00,10
293073,10177200,ask hn: where do you look for work if you need...,,14,20,coroutines,2015-09-06 09:36:00,9


In [14]:
comments_by_hour = ask_posts.groupby("hour")["num_comments"].sum().sort_values(ascending= False)
posts_by_hour = ask_posts.groupby("hour")["id"].count().sort_values(ascending= False)

In [15]:
#combine comments_by_hour and posts_by_hour series
filtered = comments_by_hour.to_frame().join(posts_by_hour)
filtered = filtered.reset_index()
filtered.rename(columns = {"id": "num_of_posts"}, inplace=True)

In [16]:
filtered["avg_comments_per_post"] = filtered["num_comments"]/filtered["num_of_posts"]

In [17]:
filtered = filtered.sort_values(by="avg_comments_per_post", ascending=False)

In [33]:
display(filtered)

Unnamed: 0,hour,num_comments,num_of_posts,avg_comments_per_post
0,15,18525,646,28.676471
1,13,7245,444,16.317568
8,12,4234,342,12.380117
12,2,2996,269,11.137546
11,10,3013,282,10.684397
15,4,2360,243,9.711934
3,14,4972,513,9.692008
2,17,5547,587,9.449744
14,8,2362,257,9.190661
13,11,2797,312,8.964744


###### To answer our second question "Do Ask HN posts created at a certain time receive more comments on average?"
###### Ask HN posts have an average of:
###### 28.68 comments per post at 3pm (15:00 hrs)
###### 16.32 comments per post at 1pm (13:00 hrs)
###### 12.38 comments per post at 12pm (12:00 hrs)
###### 11.14 comments per post at 2am (02:00 hrs)
###### 10.68 comments per post at 10am (10:00 hrs)

###### Each of these averages is higher than the overall Ask HN average of 10.39 comments per post.
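
The per-hour table above was built from two separate series that were joined afterwards. The same result can be sketched with a single `groupby` call, assuming a pandas version that supports named aggregation (0.25 or later). The `ask` frame and its values are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
ask = pd.DataFrame({
    "hour": [15, 15, 13, 13, 2],
    "num_comments": [40, 20, 10, 6, 4],
})

# Total comments and number of posts per hour in one pass
by_hour = ask.groupby("hour").agg(
    num_comments=("num_comments", "sum"),
    num_of_posts=("num_comments", "size"),
)

# Average comments per post, highest first
by_hour["avg_comments_per_post"] = by_hour["num_comments"] / by_hour["num_of_posts"]
by_hour = by_hour.sort_values("avg_comments_per_post", ascending=False)

print(by_hour)
```

Keeping the sum and the count in the same frame avoids the join and makes it harder to divide the wrong pair of columns.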

### Data Preparation
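For the regression analysis we only need the "title" column and a numeric target column. The goal is to estimate roughly how many upvotes ("num_points") a headline would receive; note that the cell below selects "num_comments", so "num_points" should be used there to match this goal.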

In [30]:
train = hn.loc[:, ["title","num_comments"]]
#sampling 80% of the dataset
train = train.sample(frac=0.8,axis=0).reset_index()

train = train.dropna()
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234495 entries, 0 to 234494
Data columns (total 3 columns):
index           234495 non-null int64
title           234495 non-null object
num_comments    234495 non-null int64
dtypes: int64(2), object(1)
memory usage: 7.2+ MB


In [31]:
import string
train["no_puncs"] = train["title"]\
.apply(lambda x : x.translate(str.maketrans("","",string.punctuation)))

In [32]:
train

Unnamed: 0,index,title,num_comments,no_puncs
0,175366,the resetting of the startup industry,0,the resetting of the startup industry
1,42263,maxscale-an intelligent database proxy,0,maxscalean intelligent database proxy
2,45142,comparing mercedes-benz e-class drivepilot and...,2,comparing mercedesbenz eclass drivepilot and t...
3,54963,video: trust me i'm lying. a well done animate...,0,video trust me im lying a well done animated b...
4,146444,go game guru learn all about the board game go,85,go game guru learn all about the board game go
...,...,...,...,...
234490,105141,"hamburg, germany, bans coffee pod machines fro...",0,hamburg germany bans coffee pod machines from ...
234491,44247,"microsoft laying off another 2,850 people in t...",1,microsoft laying off another 2850 people in th...
234492,222176,decoding fallout 4's pip-boy database with a c...,0,decoding fallout 4s pipboy database with a com...
234493,237144,how to get started with ionic framework on mac...,0,how to get started with ionic framework on mac...
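
As the output above shows, deleting punctuation merges hyphenated words ("maxscale-an" becomes "maxscalean"). The sketch below reproduces that behaviour on one title from the table and contrasts it with one possible alternative, replacing punctuation with spaces. The alternative is an assumption offered for comparison, not something the notebook does:

```python
import string

# Translation table that deletes every ASCII punctuation character
# (same approach as the cell above)
strip_table = str.maketrans("", "", string.punctuation)

# Alternative table that turns each punctuation character into a space
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))

title = "maxscale-an intelligent database proxy"

stripped = title.translate(strip_table)
# split/join collapses any repeated spaces left behind by the replacement
spaced = " ".join(title.translate(space_table).split())

print(stripped)  # hyphenated words are merged
print(spaced)    # hyphenated words stay separate
```

Which variant is preferable depends on how the titles are tokenized later; the second keeps word boundaries intact at the cost of splitting terms such as "e-class" into two tokens.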


### Conclusion
To improve the chance of getting input from other users on a question posted to Hacker News, we should post it between 10am and 3pm (Eastern US time), with 3pm showing the highest average of 28.68 comments per post. This may also indicate higher user traffic during these hours.