**Web-Scraping For Our Data**

We're going to utilize BeautifulSoup to scrape our data from Reddit's main home page.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

reddit = pd.DataFrame(columns=['Title', 'Subreddit', 'Number of Comments', 'Time', 'Domain', 'Score'])
end_page=''
for x in range(5000):
    try:
        url = "https://www.reddit.com/r/all/"+end_page
        response = requests.get(url, headers={'User-Agent': 'DontWorry'})
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        titles = [title.text for title in soup.find_all('a', {'data-event-action': 'title'})]
        subreddits = [sub.text for sub in soup.find_all('a', {'class': 'subreddit hover may-blank'})]
        comments = [comment.text for comment in soup.find_all('a', {'class': 'bylink comments may-blank'})]
        time = [time['title'] for time in soup.find_all('time')]  
        domain = [domain.text for domain in soup.find_all('span', {'class': 'domain'})]
        scores = [score.text for score in soup.find_all('div', {'class': 'score likes'})]
        end_tags = soup.find_all('div', {'class':'reportform'})
        last_tag=end_tags[24]['class'][1].split('-')[1]
        end_page="?count=25&after="+last_tag                                
        df = pd.DataFrame({'Title': titles, 'Subreddit': subreddits, 'Number of Comments': comments, 'Time': time, 'Domain': domain, 'Score': scores})
        reddit = reddit.append(df)
        reddit.to_csv('reddit_af.csv')
    except:
        pass

In [None]:
reddit = pd.read_csv('reddit_af.csv')
reddit

**EDA**

In [151]:
import pandas as pd
import numpy as np

In [152]:
df = pd.read_csv('/Users/m.arthurbentil/Documents/project-3/reddit_fa.csv')

In [153]:
df.head()

Unnamed: 0.1,Unnamed: 0,Domain,Number of Comments,Score,Subreddit,Time,Title
0,0,(gfycat.com),177 comments,5080,r/educationalgifs,Thu Feb 22 18:47:55 2018 UTC,"how the Japanese ""roll their sleeves up"""
1,1,(i.redd.it),47 comments,8821,r/tumblr,Thu Feb 22 16:49:02 2018 UTC,There are some universal truths which cannot b...
2,2,(i.redd.it),547 comments,26.7k,r/ProgrammerHumor,Thu Feb 22 15:00:50 2018 UTC,FrontEnd VS BackEnd
3,3,(press.cc.com),523 comments,14.7k,r/television,Thu Feb 22 16:01:27 2018 UTC,"Comedy Central Renews ""Drunk History"" for a Si..."
4,4,(i.redd.it),905 comments,24.3k,r/facepalm,Thu Feb 22 15:42:38 2018 UTC,ironic.jpg


In [154]:
#remove/replace unnecessary string from the dataframe
df['Domain'] = df['Domain'].map(lambda x: x.lstrip('(').rstrip(')')) #remove brackets
df['Number of Comments'] = df['Number of Comments'].map(lambda x: x.rstrip(' comments')) #remove string ' comments'
df['Score'] = df['Score'].str.replace('.', '') #replace the '.' with nothing
df['Score'] = df['Score'].str.replace('k', '00')#replace the 'k' with '00'
df['Score'] = df['Score'].str.replace('•', '0')# Replace the '.' with '0' (to deal with blank spaces)
df['Time_Hour'] = df['Time'].str[11:13] #slice the time portion from the 'Time' column for just the hour (excluding minutes)
df.drop(['Time', 'Unnamed: 0'], axis=1, inplace=True)



In [155]:
df.dtypes

Domain                object
Number of Comments    object
Score                 object
Subreddit             object
Title                 object
Time_Hour             object
dtype: object

In [156]:
df['Number of Comments'] = df['Number of Comments'].astype(int) # Change the 'Number of Comments' column to an integer type

In [157]:
df['Score'] = df['Score'].astype(int) # Change the 'Number of Comments' column to an integer type

In [158]:
df['Score'].head()

0     5080
1     8821
2    26700
3    14700
4    24300
Name: Score, dtype: int64

In [159]:
# Find the median for the number of comments. 
# Afterwards, we will create a binary variable for the comments column with variables as 'High' (above median) vs. 'Low' (below median).
df['Number of Comments'].median()

22.0

In [160]:
# Create a new column for the binary variables of 'Number of Comments'
df['Comments Magnitude'] = np.where(df['Number of Comments']>= 22, 'High', 'Low')

In [161]:
df.columns

Index(['Domain', 'Number of Comments', 'Score', 'Subreddit', 'Title',
       'Time_Hour', 'Comments Magnitude'],
      dtype='object')

Since we have two variables we created ('High' and 'Low'), our baseline accuracy for the model is 50%.

In [162]:
df.head()

Unnamed: 0,Domain,Number of Comments,Score,Subreddit,Title,Time_Hour,Comments Magnitude
0,gfycat.com,177,5080,r/educationalgifs,"how the Japanese ""roll their sleeves up""",18,High
1,i.redd.it,47,8821,r/tumblr,There are some universal truths which cannot b...,16,High
2,i.redd.it,547,26700,r/ProgrammerHumor,FrontEnd VS BackEnd,15,High
3,press.cc.com,523,14700,r/television,"Comedy Central Renews ""Drunk History"" for a Si...",16,High
4,i.redd.it,905,24300,r/facepalm,ironic.jpg,15,High


**Model 1: Random Forests**

In [163]:
# Now, we are going to create a Random Forest model to predict when comments are 'High' or 'Low' using only 'Subreddits' as a feature
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Domain', 'Time_Hour', 'Score', 'Title', 'Number of Comments'], axis=1))

In [164]:
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)

In [165]:
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.586 ± 0.014


Our Random Forest score is 58.6%, which makes sense as we are using only one feature ('Subreddits'). Using one feature doesn't help to create a strong model. If we were to keep all the features except 'Comments Magnitude' in our code, we would get:

In [166]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.991 ± 0.006


That's a high score! What if we exclude all variables except for one to see which one may have the biggest impact?

In [167]:
# Random Forest Model with 'Domain'
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Time_Hour', 'Score', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.573 ± 0.023


In [169]:
# Random Forest Model with 'Time_Hour'
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Domain','Score', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.644 ± 0.007


In [168]:
# Random Forest Model with 'Score'
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.673 ± 0.006


Between the variables 'Score', 'Time_Hour', and 'Domain', it looks like 'Score' gives the better chance of accuracy with 67.3%.

Let's create some new features to explore our data further. I'm curious to see if some of the most popular Google Trends for the day I scraped data (February 21, 2018) corresponds with the Reddit titles. In recent news, two things that are getting a lot of press are gun control and the "Black Panther" movie. We will create variables for these two.

In [170]:
df['Guns'] = np.where(df['Title'].str.contains('gun', case=False),'Yes', 'No' )

In [171]:
df['Black_P'] = np.where(df['Title'].str.contains('black panther', case=False), 'Yes', 'No')

In [172]:
df.head()

Unnamed: 0,Domain,Number of Comments,Score,Subreddit,Title,Time_Hour,Comments Magnitude,Guns,Black_P
0,gfycat.com,177,5080,r/educationalgifs,"how the Japanese ""roll their sleeves up""",18,High,No,No
1,i.redd.it,47,8821,r/tumblr,There are some universal truths which cannot b...,16,High,No,No
2,i.redd.it,547,26700,r/ProgrammerHumor,FrontEnd VS BackEnd,15,High,No,No
3,press.cc.com,523,14700,r/television,"Comedy Central Renews ""Drunk History"" for a Si...",16,High,No,No
4,i.redd.it,905,24300,r/facepalm,ironic.jpg,15,High,No,No


In [173]:
# Creating a Random Forest model using the two new variables in the model
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Score', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.505 ± 0.003


In [174]:
#Creating a Random Forest model using 'Black Panther' variable
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Guns', 'Score', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.507 ± 0.0


In [175]:
#Creating a Random Forest model using 'Guns' variable
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Black_P', 'Score', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.505 ± 0.003


Those two variables we created did not perform as well (both had scores wavering around 50%).

Next, we are going to create a count-vectorizer based on words in the thread titles.

In [231]:
# Let's set up our new X and y
# First, let's create a new column that will have the number of characters in the title

df['Title_Length'] = df['Title'].apply(lambda x: len(x))

In [232]:
df['Target'] = [1 if x=='High' else 0 for x in df['Comments Magnitude']]
y_2 = df['Target']
X_2 = df['Title']

In [233]:
# Set up our train and test data sets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
X_train, X_test, y_train, y_test = train_test_split(X_2,y_2, test_size=.30)

In [234]:
# Next we will instantiate our data
count = CountVectorizer()

In [235]:
# Now we will remove English stop words and create at most 500 new columns
count = CountVectorizer(stop_words='english', max_features=500)

In [236]:
count.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [237]:
len(count.vocabulary_)

500

In [238]:
# Identify the shape of the training set and fit it
X_train.shape

(1767,)

In [239]:
X_train.head()

1257    Seeing Doom's post-patch-performance makes me ...
544     Depression substantially reduced with multivit...
31                                                 Thanks
2372    🚨🚨 BREAKING : JURY FIND TEXAS DEMOCRAT STATE S...
354     One stem of Lillies I purchased grew triple fl...
Name: Title, dtype: object

In [240]:
# Next, we will fit and transform our data
X_train_matrix = count.transform(X_train)
X_train_matrix = count.fit_transform(X_train)

In [241]:
# Fit our training data to a forest model, then determine our Forest score on the count-vectorized data
forest = RandomForestClassifier(max_depth=10, n_estimators=5)
forest.fit(X_train_matrix, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [242]:
forest.score(X_train_matrix, y_train)

0.5704584040747029

Our score a little bit above 50% (our threshhold), so our score isn't bad. Next, we have to test our data.

In [243]:
X_test_matrix = count.transform(X_test)
forest.predict(X_test_matrix)
forest.score(X_test_matrix, y_test)

0.4868073878627968

Not good. Our score is right below 50% accuracy. We just missed the cut. Now, let build a TDIDF model:

In [247]:
X_train_matrix = tfidf.fit_transform(X_train)
X_test_matrix  = tfidf.transform(X_test)
forest.fit(X_train_matrix, y_train)
forest.score(X_test_matrix, y_test)

0.47493403693931396

The accuracy score actually took a hit. We need to build a table of our new data set.

In [249]:
X_3 = df[['Title','Title_Length']]
X_train, X_test, y_train, y_test = train_test_split(X_3,y_2, test_size=.20)
X_train_matrix = tfidf.fit_transform(X_train['Title'])
X_test_matrix = tfidf.transform(X_test['Title'])

In [250]:
X_train_df = pd.DataFrame(X_train_matrix.todense(),
                     columns=tfidf.get_feature_names(),
                         index=X_train.index)

X_test_df = pd.DataFrame(X_test_matrix.todense(),
                     columns=tfidf.get_feature_names(),
                        index=X_test.index)

In [251]:
X_train = X_train.join(X_train_df).drop('Title', axis=1)

X_test  = X_test.join(X_test_df).drop('Title', axis=1)

In [252]:
X_train.head()

Unnamed: 0,Title_Length,000,10,1080,11,12,15,16,18,1977,...,woman,women,won,wonder,word,work,world,year,years,yes
1317,26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1574,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1404,160,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
867,45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
317,22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.626043,0.0


In [261]:
X_test.head()

Unnamed: 0,Title_Length,000,10,1080,11,12,15,16,18,1977,...,woman,women,won,wonder,word,work,world,year,years,yes
126,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
719,22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2383,53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1590,38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1268,86,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [269]:
X_train.describe()

Unnamed: 0,Title_Length,000,10,1080,11,12,15,16,18,1977,...,woman,women,won,wonder,word,work,world,year,years,yes
count,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,...,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0,2020.0
mean,60.147525,0.001973,0.003632,0.001244,0.00213,0.002388,0.001928,0.001506,0.001569,0.001802,...,0.001215,0.002059,0.003451,0.002226,0.001475,0.006684,0.003189,0.007393,0.005857,0.002216
std,49.310582,0.034651,0.043579,0.032581,0.039332,0.036652,0.035027,0.030517,0.033144,0.038444,...,0.028233,0.035126,0.047251,0.045124,0.03028,0.0633,0.044834,0.063598,0.051207,0.03645
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,48.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,77.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,300.0,0.785535,1.0,1.0,1.0,1.0,1.0,0.707107,1.0,1.0,...,0.768014,1.0,1.0,1.0,0.715143,1.0,1.0,1.0,0.650278,0.806653


In [267]:
# Let's create a new variable where we separate posts with a high number of characters in the title from posts with a low number of characters in their post
X_train['Title_Length'].median()

48.0

In [279]:
# Our median is 48.0 so that will be our threshhold. Next, we'll create some new features based on the most popular words in the posts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X_train)
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names()
top_n = 3
top_features = [features[i] for i in indices[:top_n]]
print(top_features)

# Our three most popular words are 'yes', 'gif', and 'fraud'

['yes', 'gif', 'fraud']


In [284]:
df['Popular'] = np.where(df['Title'].str.contains('yes' or 'gif' or 'fraud', case=False),'Yes', 'No' )
df['Gif'] = np.where(df['Title'].str.contains( 'gif', case=False),'Yes', 'No' )
df['Char'] = np.where(len(df['Title'])>=48, 'High', 'Low')

In [285]:
df.head()

Unnamed: 0,Domain,Number of Comments,Score,Subreddit,Title,Time_Hour,Comments Magnitude,Guns,Black_P,Title_Length,Target,Popular,Gif,Char
0,gfycat.com,177,5080,r/educationalgifs,"how the Japanese ""roll their sleeves up""",18,High,No,No,40,1,No,No,High
1,i.redd.it,47,8821,r/tumblr,There are some universal truths which cannot b...,16,High,No,No,55,1,No,No,High
2,i.redd.it,547,26700,r/ProgrammerHumor,FrontEnd VS BackEnd,15,High,No,No,19,1,No,No,High
3,press.cc.com,523,14700,r/television,"Comedy Central Renews ""Drunk History"" for a Si...",16,High,No,No,57,1,No,No,High
4,i.redd.it,905,24300,r/facepalm,ironic.jpg,15,High,No,No,10,1,No,No,High


In [286]:
# Let's see what type of impact a post with a popular post will make 
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Black_P', 'Title_Length', 'Target', 'Gif', 'Char', 'Guns', 'Score', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.509 ± 0.004


In [287]:
# Let's see what type of impact a post with a gif will make
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Black_P', 'Title_Length', 'Target', 'Popular', 'Char', 'Guns', 'Score', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.507 ± 0.003


In [288]:
# What type of impact does character length make?
y = LabelEncoder().fit_transform(df['Comments Magnitude'])
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Black_P', 'Title_Length', 'Target', 'Popular', 'Gif', 'Guns', 'Score', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.507 ± 0.0


All of our accuracy scores were above 50%, but they didn't perform as well as we'd like them to. Now, let's perform a cross-validation of our data.

In [289]:
import numpy as np
from sklearn.model_selection import KFold
X = pd.get_dummies(df.drop(['Comments Magnitude', 'Black_P', 'Title_Length', 'Target', 'Gif', 'Char', 'Guns', 'Score', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[1263 1264 1265 ... 2522 2523 2524] [   0    1    2 ... 1260 1261 1262]
[   0    1    2 ... 1260 1261 1262] [1263 1264 1265 ... 2522 2523 2524]


**Model 2: KNN**

The second model we're going to try (to see if it's better than the Random Forest) is KNN.

In [291]:
# Let's obtain our baseline accuracy
np.mean(y)

0.49306930693069306

In [292]:
baseline = 1. - np.mean(y)
print('baseline:', baseline)

baseline: 0.5069306930693069


In [294]:
# Our previous model where we used the 'Popular' feature gave us a score of 50.9%. Let's see if we can improve on it.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
Xs = pd.get_dummies(df.drop(['Comments Magnitude', 'Black_P', 'Title_Length', 'Target', 'Popular', 'Char', 'Guns', 'Score', 'Domain','Time_Hour', 'Title', 'Number of Comments', 'Subreddit'], axis=1))
knn5 = KNeighborsClassifier(n_neighbors=5, weights='uniform')

scores = cross_val_score(knn5, Xs, y, cv=5)
np.mean(scores), np.std(scores)

(0.5065346534653465, 0.0007920792079207928)

From our score above, we obtain an accuracy of 50.6%, slightly worse than our previous score. This shows that the Random Forest model works better in this case.