## Introduction to the Data

Hacker News is a community where users can submit articles, and other users can upvote those articles. The articles with the most upvotes make it to the front page, where they're more visible to the community.

The data set consists of submissions users made to Hacker News from 2006 to 2015.

The columns in the dataset include:

- submission_time - When the article was submitted
- upvotes - The number of upvotes the article received
- url - The base URL of the article
- headline - The article's headline

In this project, I'll predict the number of upvotes each article received, based only on its headline.

Because upvotes are an indicator of popularity, this should also show which types of articles tend to be the most popular.

In [19]:
import pandas as pd
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]

In [20]:
submissions

| | submission_time | upvotes | url | headline |
|---|---|---|---|---|
| 0 | 2010-02-17T16:57:59Z | 1 | blog.jonasbandi.net | Software: Sadly we did adopt from the construc... |
| 1 | 2014-02-04T02:36:30Z | 1 | blogs.wsj.com | Google’s Stock Split Means More Control for L... |
| 2 | 2011-10-26T07:11:29Z | 1 | threatpost.com | SSL DOS attack tool released exploiting negoti... |
| 3 | 2011-04-03T15:43:44Z | 67 | algorithm.com.au | Immutability and Blocks Lambdas and Closures |
| 4 | 2013-01-13T16:49:20Z | 1 | winmacsofts.com | Comment optimiser la vitesse de Wordpress? |
| 5 | 2013-09-04T12:10:52Z | 1 | theincidentaleconomist.com | ilk is not as good for you as you think |
| 6 | 2012-03-09T20:25:42Z | 1 | worldometers.info | Worldometers - Real time world statistics |
| 7 | 2010-04-22T13:23:10Z | 26 | docs.com | icrosoft strikes back: introduces docs for fac... |
| 8 | 2012-05-06T16:08:46Z | 2 | blog.hackplanet.in | Net HTTP status codes |
| 9 | 2014-12-23T00:55:31.000Z | 1 | curt-rice.com | Anecdata or how McKinsey’s story became Sheryl... |


## Finding Missing Values

In [21]:
submissions.isnull().sum()

submission_time      0
upvotes              0
url                189
headline            10
dtype: int64

In [22]:
submissions.notnull().sum()

submission_time    2999
upvotes            2999
url                2810
headline           2989
dtype: int64

Since rows with a missing url or headline make up less than 7% of the 2,999 data points, it's fair to drop every row that contains a NaN value.
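The "less than 7%" figure can be checked from the counts printed above. This is a small arithmetic sketch; the variable names are mine, and the result is an upper bound because a row that is missing both url and headline would be counted twice.

```python
# Counts taken from the isnull() / notnull() outputs above.
total_rows = 2999
missing_url = 189
missing_headline = 10

# Upper bound on the share of rows that dropna() removes: a row missing
# both fields is counted twice here, so the true share can only be lower.
upper_bound = (missing_url + missing_headline) / total_rows
print(f"{upper_bound:.1%}")  # 6.6%
```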

In [23]:
submissions = submissions.dropna(axis=0)

In [24]:
submissions.isnull().sum()

submission_time    0
upvotes            0
url                0
headline           0
dtype: int64

## Tokenizing & Preprocessing the Headlines

As mentioned earlier, I'll use only the headline column to train the model.

In [25]:
tokenized_headlines = []
for item in submissions["headline"]:
    tokenized_headlines.append(item.split())
tokenized_headlines

[['Software:',
  'Sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['Google’s',
  'Stock',
  'Split',
  'Means',
  'More',
  'Control',
  'for',
  'Larry',
  'and',
  'Sergey'],
 ['SSL',
  'DOS',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead'],
 ['Immutability', 'and', 'Blocks', 'Lambdas', 'and', 'Closures'],
 ['Comment', 'optimiser', 'la', 'vitesse', 'de', 'Wordpress?'],
 ['ilk', 'is', 'not', 'as', 'good', 'for', 'you', 'as', 'you', 'think'],
 ['Worldometers', '-', 'Real', 'time', 'world', 'statistics'],
 ['icrosoft', 'strikes', 'back:', 'introduces', 'docs', 'for', 'facebook'],
 ['Net', 'HTTP', 'status', 'codes'],
 ['Anecdata',
  'or',
  'how',
  'McKinsey’s',
  'story',
  'became',
  'Sheryl',
  'Sandberg’s',
  'fact'],
 ['Immigration', 'Overhaul', 'Passes', 'in', 'Senate'],
 ['What', 'matters', 'most', 'at', 'Ad:TECH', 'SF', '2014'],
 ['Amazon',
  'Silk',
  'revisited:',
  'Is',
  'the',
  'split',
  'cloud',

Looking at tokenized_headlines, two problems stand out:

- punctuation is still attached to some of the split tokens
- the same word is treated as different tokens depending on uppercase and lowercase

In [26]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []
for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)

In [27]:
clean_tokenized

[['software',
  'sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['googles',
  'stock',
  'split',
  'means',
  'more',
  'control',
  'for',
  'larry',
  'and',
  'sergey'],
 ['ssl',
  'dos',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead'],
 ['immutability', 'and', 'blocks', 'lambdas', 'and', 'closures'],
 ['comment', 'optimiser', 'la', 'vitesse', 'de', 'wordpress'],
 ['ilk', 'is', 'not', 'as', 'good', 'for', 'you', 'as', 'you', 'think'],
 ['worldometers', '', 'real', 'time', 'world', 'statistics'],
 ['icrosoft', 'strikes', 'back', 'introduces', 'docs', 'for', 'facebook'],
 ['net', 'http', 'status', 'codes'],
 ['anecdata',
  'or',
  'how',
  'mckinseys',
  'story',
  'became',
  'sheryl',
  'sandbergs',
  'fact'],
 ['immigration', 'overhaul', 'passes', 'in', 'senate'],
 ['what', 'matters', 'most', 'at', 'adtech', 'sf', '2014'],
 ['amazon',
  'silk',
  'revisited',
  'is',
  'the',
  'split',
  'cloud',
  'brows
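The cleaned output contains an empty string where a headline had a standalone hyphen ("Worldometers - Real time ..."). As a hedged alternative, the sketch below strips the same punctuation characters with `str.translate` and also drops tokens that end up empty. The function name `clean_headline` is my own, and dropping empty tokens is a change from the loop above, not something the notebook does.

```python
# Same punctuation characters as the list used in the notebook.
PUNCTUATION = ",:;.'\"’?/-+&()"
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)

def clean_headline(headline):
    """Lowercase a headline, strip punctuation, and drop tokens that end up empty."""
    stripped = (token.lower().translate(STRIP_TABLE) for token in headline.split())
    return [token for token in stripped if token]

print(clean_headline("Worldometers - Real time world statistics"))
# ['worldometers', 'real', 'time', 'world', 'statistics']
```

Applied to every headline, this would give the same tokens as clean_tokenized, minus the empty strings.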

#### Assembling a Matrix of Unique Words

In [29]:
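import numpy as np

# Collect every token that occurs at least twice across all headlines.
unique_tokens = []  # tokens seen two or more times (these become the columns)
single_tokens = []  # tokens seen at least once
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            unique_tokens.append(token)

# One row per headline, one column per repeated token, all counts start at zero.
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)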

In [31]:
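for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            # A single .loc call updates counts in place; chained indexing
            # such as counts.iloc[i][token] may write to a temporary copy.
            counts.loc[i, token] += 1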
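Membership tests on Python lists and cell-by-cell updates get slow as the vocabulary grows. The sketch below shows one possible way to build the same kind of matrix with `collections.Counter`. The function name `build_counts` and the toy input are assumptions for illustration, and the column order follows first occurrence, so it can differ from the order produced by the loops above.

```python
from collections import Counter

import pandas as pd

def build_counts(tokenized):
    """Return a headline-by-token count matrix for tokens that occur at least twice."""
    totals = Counter(token for tokens in tokenized for token in tokens)
    vocab = [token for token, n in totals.items() if n > 1]
    # A Counter returns 0 for tokens that are absent from a headline.
    rows = [Counter(tokens) for tokens in tokenized]
    return pd.DataFrame([[row[token] for token in vocab] for row in rows], columns=vocab)

# Toy input standing in for clean_tokenized.
counts_demo = build_counts([["a", "b", "a"], ["b", "c"]])
print(counts_demo)
```

Here "c" occurs only once, so it gets no column, which matches the rule used for unique_tokens.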

#### Removing Stopwords

In [32]:
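# Total number of occurrences of each word across all headlines.
word_counts = counts.sum(axis=0)

# Keep words that occur between 5 and 100 times; more frequent words behave like stopwords.
counts = counts.loc[:, (word_counts >= 5) & (word_counts <= 100)]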
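To see what the frequency filter does, here is a toy example with made-up counts (the words and numbers are invented for illustration only). The boolean mask keeps a column only when its total falls inside the 5 to 100 range.

```python
import pandas as pd

# Invented counts: two headlines, three words.
toy_counts = pd.DataFrame({"the": [60, 50], "python": [3, 4], "zebra": [1, 0]})

word_totals = toy_counts.sum(axis=0)              # the: 110, python: 7, zebra: 1
keep = (word_totals >= 5) & (word_totals <= 100)  # boolean mask, one entry per column
filtered = toy_counts.loc[:, keep]

print(list(filtered.columns))  # ['python']
```

"the" is dropped for being too common and "zebra" for being too rare, which is the effect the filter has on the real matrix.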

## Training a Model with LinearRegression

First, I'll split the data into training and test sets. The target column is the upvotes column.

In [34]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.35, random_state=1)

In [35]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)

#### Picking an Error Metric

I'll use root mean squared error (RMSE) as the error metric.

In [41]:
mse = sum((predictions - y_test) ** 2) / len(predictions)
rmse = np.sqrt(mse)
rmse

54.556993831940254
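Since the same calculation is repeated for each model, it could be wrapped in a small helper. This is a minimal sketch, assuming NumPy is available as in the cells above; the function name `rmse` is my own and the inputs in the example are toy values, not the notebook's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy check: errors of 3 and 4 give sqrt((9 + 16) / 2).
print(rmse([0, 0], [3, 4]))
```

With the notebook's variables, the call would look like rmse(y_test, predictions).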

## Training a Model with RandomForestRegressor

In [49]:
from sklearn.ensemble import RandomForestRegressor


In [47]:
rfr = RandomForestRegressor(n_estimators=5, random_state=1, min_samples_leaf=2)

rfr.fit(X_train, y_train)

predictions_1 = rfr.predict(X_test)

In [48]:
mse = sum((predictions_1 - y_test) ** 2) / len(predictions_1)
rmse = np.sqrt(mse)
rmse

44.84857168288281

There's no hard and fast rule about what a "good" error rate is, because it depends on the problem we're solving and our error tolerance.

In this case, the mean number of upvotes is 10, and the standard deviation is 39.5.

Using the linear regression model, we got an RMSE of 54.6. This means a typical prediction is off by roughly 54.6 upvotes, keeping in mind that RMSE weights large errors more heavily than a plain average would.

Using the random forest regressor, we got an RMSE of 44.8, so a typical prediction is off by roughly 44.8 upvotes.

The random forest regressor performed better than linear regression: its RMSE is lower and closer to the standard deviation of the upvotes. Both values are still above 39.5, though, which suggests that neither model clearly beats simply predicting the mean number of upvotes for every article.
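Comparing RMSE with the standard deviation is a stand-in for comparing against a model that always predicts the mean. A more direct check, sketched below, computes that baseline's RMSE on the test split. The function name `mean_baseline_rmse` and the toy values are assumptions for illustration; with the notebook's data the arguments would be y_train and y_test.

```python
import numpy as np

def mean_baseline_rmse(y_train, y_test):
    """RMSE of a baseline that predicts the training-set mean for every test row."""
    mean_prediction = np.mean(y_train)
    errors = np.asarray(y_test, dtype=float) - mean_prediction
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy values: the training mean is 5, so the test errors are 3 and -4.
print(mean_baseline_rmse([0, 10], [8, 1]))
```

A model is only useful if its RMSE comes out below this baseline on the same test rows.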