## Hacker News - Predicting Upvotes

The data set consists of submissions users made to Hacker News from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which you can find in one of his GitHub repositories. We randomly sampled 3000 rows from the data and removed the extraneous columns, leaving four:

- submission_time - When the article was submitted
- upvotes - The number of upvotes the article received
- url - The base URL of the article
- headline - The article's headline

In this mission, we'll predict the number of upvotes each article received, based on its headline. Because upvotes are an indicator of popularity, we'll discover which types of articles tend to be the most popular.
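The cells that follow turn each headline into a row of word counts (a bag-of-words representation) before fitting a model. As a rough illustration of that idea, here is a minimal sketch on two made-up headlines. The `headlines` list and the `tokenize` helper are hypothetical and simpler than the cleaning done in the notebook.

```python
from collections import Counter

# Made-up headlines, for illustration only (not taken from the data set).
headlines = ["Show HN: My new app", "Ask HN: Is my app new?"]

def tokenize(headline):
    """Split on spaces, trim common punctuation from the ends, lowercase."""
    return [word.strip(".,:;?!").lower() for word in headline.split(" ")]

# One Counter per headline: token -> number of occurrences in that headline.
bags = [Counter(tokenize(h)) for h in headlines]

# The vocabulary is every distinct token; each headline becomes one row of counts.
vocabulary = sorted(set().union(*bags))
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row can then serve as the feature vector for one headline, with the upvote count as the value to predict.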

In [1]:
import pandas as pd

In [2]:
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

In [3]:
tokenized_headlines = []
for item in submissions['headline']:
    words = item.split(" ")
    tokenized_headlines.append(words)

In [4]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []
for lst in tokenized_headlines:
    words = []
    for word in lst:
        word = word.lower()
        for punc in punctuation:
            word = word.replace(punc, "")
        words.append(word)
    clean_tokenized.append(words)

In [5]:
import numpy as np
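# single_tokens: every token seen at least once.
# unique_tokens: tokens seen at least twice; these become the columns.
unique_tokens = []
single_tokens = []
for lst in clean_tokenized:
    for token in lst:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            unique_tokens.append(token)

# One row per headline, one column per repeated token, all counts start at 0.
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)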

In [6]:
for i, lst in enumerate(clean_tokenized):
    for token in lst:
        if token in unique_tokens:
            # A single .loc call avoids chained indexing, which can write to a copy.
            counts.loc[i, token] += 1
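Filling the matrix one cell at a time gets slow as the number of headlines grows. One possible alternative, sketched here on made-up tokenized headlines, counts tokens per headline with `collections.Counter` and builds the DataFrame in a single step. The toy `clean_tokenized` list and the names `totals`, `vocab`, and `rows` are assumptions for illustration. The column order may differ from the loop-based version, but the counts are the same.

```python
from collections import Counter

import pandas as pd

# Made-up tokenized headlines standing in for clean_tokenized.
clean_tokenized = [
    ["show", "hn", "my", "app"],
    ["ask", "hn", "my", "question"],
    ["show", "hn", "app"],
]

# Total occurrences of each token across all headlines.
totals = Counter(token for headline in clean_tokenized for token in headline)

# Keep tokens that occur at least twice, like unique_tokens in the notebook.
vocab = [token for token, n in totals.items() if n >= 2]
vocab_set = set(vocab)

# One dict of counts per headline, restricted to the kept tokens.
rows = [
    dict(Counter(token for token in headline if token in vocab_set))
    for headline in clean_tokenized
]

# Tokens absent from a headline come out as NaN, so fill them with 0.
counts = pd.DataFrame(rows, columns=vocab).fillna(0).astype(int)
print(counts)
```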

In [7]:
word_counts = counts.sum(axis=0)
counts = counts.loc[:, (word_counts >= 5) & (word_counts <= 100)]

### Apply machine learning - linear regression

In [8]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

In [9]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
predictions = clf.fit(X_train, y_train).predict(X_test)

In [10]:
mse = sum((y_test - predictions)**2)/len(predictions)

In [11]:
print(mse)

2652.60825125
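The mean squared error is expressed in squared upvotes, which is hard to read directly. Taking its square root gives the root mean squared error (RMSE), in the same units as the target. A minimal sketch, reusing the value printed above:

```python
import math

# MSE value printed by the cell above.
mse = 2652.60825125

# RMSE is in upvotes rather than squared upvotes.
rmse = math.sqrt(mse)
print(round(rmse, 1))  # 51.5
```

Read this way, the error is on the order of 51.5 upvotes per headline, with large misses weighted more heavily than small ones.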
