<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Reddit - Preprocessing & Model Selection

--- 
# Notebook 4

This fourth notebook examines preprocessing techniques (lemmatization, stemming and additional stopwords) and then optimises the best-performing models.

---

# 0. Import Packages and Read in Data


In [26]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from nltk.tokenize import word_tokenize, WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

# Import model
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Import Evaluations
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.feature_extraction import text
from sklearn import metrics

In [2]:
# Reading in the dataset
df_reddit = pd.read_csv('./dataset/cleaned.csv') 
df_reddit.head()

Unnamed: 0,subreddit,title,emoji,website,word_count,title_length
0,0,i came home sick from work with pneumonia here...,0,0,23,121
1,0,my sleepyhead curled up this morning,0,0,6,36
2,0,spooky takes care of me when i wfh who needs h...,0,0,15,67
3,0,hi i am a new cat mom and am hoping someone mi...,0,0,25,128
4,0,spooky looks after me when i wfh who needs a h...,0,0,19,94


In [3]:
# Check the number of rows and columns
df_reddit.shape

(19311, 6)

# 1. Text to Matrix Representation
We will use CountVectorizer to split the titles into individual words and count them.

In [4]:
X = df_reddit['title']

# Instantiate a CountVectorizer with english stopwords.
cvec = CountVectorizer(stop_words='english')

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser (get_feature_names_out() requires scikit-learn >= 1.0)
df_vec_reddit = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names_out())

df_vec_reddit.shape

(19311, 13472)

After splitting the titles into words, the number of columns changes from 1 to 13472.

To help the model perform better, we will try to reduce the number of columns by:
* Increasing the stopwords
* Stemming
* Lemmatization
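
As a rough illustration of why the column count grows and how stopwords shrink it, here is a minimal bag-of-words sketch in plain Python. The toy titles and the small stopword set are invented for this example, and the whitespace tokenisation is a simplification of what CountVectorizer does (its default token pattern also drops single-character tokens).

```python
from collections import Counter

# Toy titles, invented for illustration (not taken from the dataset).
titles = [
    "my cat is sleeping",
    "my dog is barking",
    "the cat and the dog",
]
# Small hand-picked stopword set; an assumption, not scikit-learn's list.
toy_stop_words = {"my", "is", "the", "and"}

def bag_of_words(docs, stop_words=frozenset()):
    """Return (sorted vocabulary, count matrix) for whitespace-tokenised docs."""
    counts = [Counter(w for w in doc.split() if w not in stop_words) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix

vocab_all, matrix_all = bag_of_words(titles)
vocab_kept, matrix_kept = bag_of_words(titles, toy_stop_words)

print(len(vocab_all), vocab_all)    # one column per distinct word
print(len(vocab_kept), vocab_kept)  # fewer columns once stopwords are removed
```

Each distinct word becomes one column, so a single text column expands to the size of the vocabulary, and every removed stopword removes one column.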

# 2.1 Stopwords
To find and add more stopwords, we will:
* Check whether any vectorised column names clash with the original column names.
* Drop the vectorised columns with the same name as the original.
* Add back the original columns.
* Split the dataframe into dog and cat subreddits.
* Print out the top words for each subreddit.
* Add more stopwords.
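
The first two steps exist because a word in a title can produce a vectorised column with the same name as an original column. A minimal sketch of that check follows; the vocabulary here is hypothetical, while the original column names are the ones shown in the dataframe above.

```python
# Hypothetical vocabulary produced by the vectoriser (illustration only).
vocabulary = ["cat", "dog", "subreddit", "vet", "website"]
# Column names of the cleaned dataset, as displayed by df_reddit.head().
original_columns = ["subreddit", "title", "emoji", "website", "word_count", "title_length"]

# Names that appear in both would become duplicate columns after concatenation.
clashes = sorted(set(vocabulary) & set(original_columns))
word_columns = [w for w in vocabulary if w not in clashes]

print(clashes)
print(word_columns)
```

Dropping the clashing word columns first keeps column names unique, so that selecting a column such as subreddit later returns a single Series rather than two columns.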

In [5]:
# Drop the two columns as we want to add back the original columns
df_vec_reddit.drop(columns =['subreddit', 'website'], inplace = True)

df_vec_reddit.shape

(19311, 13470)

In [6]:
# Adding back the columns after countvectoriser
df_vec_reddit = pd.concat([df_vec_reddit, df_reddit[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

# Checking if 4 columns have been added.
df_vec_reddit.shape

(19311, 13474)

In [7]:
# Separating the dataframe into cats and dogs to find the top words for each subreddit
df_vec_reddit_dogs = df_vec_reddit[df_vec_reddit['subreddit'] == 1]
df_vec_reddit_cats = df_vec_reddit[df_vec_reddit['subreddit'] == 0]

In [8]:
df_vec_reddit_dogs.sum().sort_values(ascending=False).head(50)

title_length       453198
word_count          87343
subreddit            9737
dog                  4961
dogs                 1257
help                  805
puppy                 579
old                   402
advice                362
need                  312
does                  302
breed                 293
new                   291
food                  283
just                  227
im                    223
like                  215
know                  197
best                  190
getting               183
training              175
time                  170
wont                  165
year                  164
stop                  156
dont                  156
home                  156
looking               155
good                  149
vet                   138
ate                   137
pet                   135
got                   128
breeds                122
rescue                119
question              116
owner                 115
today                 114
tips        

In [9]:
df_vec_reddit_cats.sum().sort_values(ascending=False).head(50)

title_length    497550
word_count       99737
cat               2726
emoji             1008
cats               707
new                580
just               492
like               419
little             400
year               354
kitten             327
im                 298
old                287
got                282
love               275
kitty              272
hes                261
help               260
boy                254
shes               253
happy              253
years              235
does               231
baby               222
time               218
day                199
loves              198
know               186
think              183
best               181
today              178
cute               176
home               174
meet               174
dont               168
christmas          157
ago                157
good               155
months             146
look               143
adopted            137
girl               136
night              134
need       

In [10]:
# Original list of english stop words in SkLearn Library
text.ENGLISH_STOP_WORDS

# Create list for new stop words
add_stop_words = ['im', 'does']

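# Joining new list of stop words to list in SKLearn Library
# (converted to a list, since recent scikit-learn versions validate stop_words as 'english', a list or None)
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))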

# Print to check
#stop_words
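
The notebook does not state how 'im' and 'does' were picked. One plausible reading, which is an assumption here, is that they rank highly in both subreddits and so carry little signal. A minimal plain-Python sketch of that idea, with invented titles, a hypothetical base stopword set and a hypothetical helper `frequent_words`:

```python
from collections import Counter

# Toy titles per subreddit, invented for illustration.
dog_titles = ["im worried my dog does not eat", "does my puppy need training", "im getting a dog"]
cat_titles = ["im in love with my cat", "does my kitten look happy", "im sure my cat does this daily"]

# Hypothetical base stopword set standing in for scikit-learn's English list.
base_stop_words = {"my", "a", "in", "with", "not", "this"}

def frequent_words(docs, min_count):
    """Words that occur at least min_count times across the given titles."""
    counts = Counter(w for doc in docs for w in doc.split())
    return {w for w, n in counts.items() if n >= min_count}

# Words frequent in BOTH classes do not help to tell the classes apart.
shared = frequent_words(dog_titles, 2) & frequent_words(cat_titles, 2)
add_stop_words = sorted(shared - base_stop_words)
stop_words = sorted(base_stop_words | set(add_stop_words))

print(add_stop_words)
```

Class-specific frequent words such as "dog" or "cat" are deliberately kept, because they are exactly what separates the two subreddits.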

# 2.2 Lemmatization

In [11]:
# Instantiate lemmatizer.
w_tokenizer = WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

# Create function to lemmatize
def lemmatize_text(text):
    sentence = " ".join([lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)])
    return sentence

df_lemmatized = pd.DataFrame(df_reddit['title'].apply(lemmatize_text))

In [12]:
X = df_lemmatized['title']

# Instantiate a CountVectorizer with the extended stop words.
cvec = CountVectorizer(stop_words=stop_words)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser (get_feature_names_out() requires scikit-learn >= 1.0)
df_lemma_reddit = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names_out())

df_lemma_reddit.shape

(19311, 12196)

Lemmatization reduces the number of columns from 13472 to 12196, a reduction of 1276 columns, as inflected words are mapped to a shared base form. Note that this comparison also includes the effect of the extended stopword list.

# 2.3 Stemming

In [13]:
# Instantiate PorterStemmer
porter_stemmer = PorterStemmer()

# Create function to stem
def stem_sentences(sentence):
    tokens = sentence.split()
    stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

df_stemming = pd.DataFrame(df_reddit['title'].apply(stem_sentences))

df_stemming.head()

Unnamed: 0,title
0,i came home sick from work with pneumonia here...
1,my sleepyhead curl up thi morn
2,spooki take care of me when i wfh who need hr ...
3,hi i am a new cat mom and am hope someon might...
4,spooki look after me when i wfh who need a hr ...


In [14]:
X = df_stemming['title']

# Instantiate a CountVectorizer with the extended stop words.
cvec = CountVectorizer(stop_words=stop_words)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser (get_feature_names_out() requires scikit-learn >= 1.0)
df_stem_reddit = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names_out())

df_stem_reddit.shape

(19311, 10178)

Stemming reduces the number of columns from 13472 to 10178, a reduction of 3294 columns, as words sharing a stem are merged into one column. Since stemming removes more features than lemmatization (3294 vs 1276), we choose stemming as the preprocessing step.
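
To make the mechanism concrete, the sketch below shows how merging word forms shrinks a vocabulary. The `crude_stem` function is a toy suffix stripper written for this example; it is not the Porter algorithm used in the notebook, and the titles are invented.

```python
def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

titles = ["dogs barking", "dog barked", "cats sleeping", "cat sleeps"]

raw_vocab = {w for t in titles for w in t.split()}
stemmed_vocab = {crude_stem(w) for w in raw_vocab}

print(len(raw_vocab), sorted(raw_vocab))
print(len(stemmed_vocab), sorted(stemmed_vocab))
print(crude_stem("this"))  # over-stemming: a real word is cut to a non-word
```

The last line mirrors a side effect visible in the stemmed output above, where "this" becomes "thi": stemming trades readable tokens for a smaller feature space.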

# 3. Train Test Split

In [15]:
# Reload the data and apply the preprocessing chosen above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv')
df_model['title'] = df_model['title'].apply(stem_sentences)

# Count Vectoriser with the additional stop words
X = df_model['title']

# Instantiate a CountVectorizer with the extended stop words, min_df = 3, ngram range from 1 to 3 and at most 450 features.
cvec = CountVectorizer(stop_words = stop_words, min_df = 3, ngram_range=(1,3), max_features = 450)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser (get_feature_names_out() requires scikit-learn >= 1.0)
df_vec_model = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names_out())

# Remove the subreddit columns and add back the original columns
#df_vec_model.drop(columns =['subreddit'], inplace = True)
df_vec_model = pd.concat([df_vec_model, df_model[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

df_vec_model.shape

(19311, 454)

In [16]:
# Define target variable
y = df_vec_model.pop('subreddit')
X = df_vec_model

# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# 4.1 Modelling Mass Fitting
We will first fit a variety of models and pick the top 2 for optimisation.

In [17]:
# Instantiate models
rfc = RandomForestClassifier(n_jobs=-1)
etc = ExtraTreesClassifier(n_jobs=-1)
lr = LogisticRegression(n_jobs=-1)
nb = MultinomialNB()
ada = AdaBoostClassifier()
knn = KNeighborsClassifier(n_jobs=-1)
bag = BaggingClassifier(n_jobs=-1)

In [18]:
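# Create a function to report the mean 5-fold cross-validated accuracy on each split.
# Note: cross_val_score clones and refits the model on each fold, so the 'test' score is a
# cross-validation score within the test split, not the accuracy of the model fitted on the training split.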
def report_error(model, X1, y1, X2, y2):
    model.fit(X1, y1)
    print('The accuracy train score for', model, 'is', cross_val_score(model, X1, y1, cv=5).mean(),'.')
    print('The accuracy test score for',model, 'is', cross_val_score(model, X2, y2, cv=5).mean(),'.')
    print()

In [19]:
report_error(rfc, X_train, y_train, X_test, y_test)
report_error(etc, X_train, y_train, X_test, y_test)
report_error(lr, X_train, y_train, X_test, y_test)
report_error(nb, X_train, y_train, X_test, y_test)
report_error(ada, X_train, y_train, X_test, y_test)
report_error(knn, X_train, y_train, X_test, y_test)
report_error(bag, X_train, y_train, X_test, y_test)

The accuracy train score for RandomForestClassifier(n_jobs=-1) is 0.8940316249013884 .
The accuracy test score for RandomForestClassifier(n_jobs=-1) is 0.8806604374317143 .

The accuracy train score for ExtraTreesClassifier(n_jobs=-1) is 0.89435510282336 .
The accuracy test score for ExtraTreesClassifier(n_jobs=-1) is 0.8819527579111061 .

The accuracy train score for LogisticRegression(n_jobs=-1) is 0.9044535521701915 .
The accuracy test score for LogisticRegression(n_jobs=-1) is 0.8923064032871058 .

The accuracy train score for MultinomialNB() is 0.8990157789253234 .
The accuracy test score for MultinomialNB() is 0.8938651643217662 .

The accuracy train score for AdaBoostClassifier() is 0.8781069480283415 .
The accuracy test score for AdaBoostClassifier() is 0.865903652414052 .

The accuracy train score for KNeighborsClassifier(n_jobs=-1) is 0.7863156769872426 .
The accuracy test score for KNeighborsClassifier(n_jobs=-1) is 0.7501886868334797 .

The accuracy train score for BaggingC

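None of the models shows strong signs of overfitting, as the train and test scores are close. All models except K-Nearest Neighbours have accuracy scores above 85%. The top 2 performing models are LogisticRegression and MultinomialNB: LogisticRegression is slightly higher on the train score (0.904 vs 0.899), while MultinomialNB is marginally higher on the test score (0.894 vs 0.892). We will proceed to optimise both using grid search.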
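
The scores above are cross-validation scores computed separately within each split. As a complementary check, the sketch below reports cross-validated accuracy on the training split together with the accuracy of a model fitted on the training split and scored once on the held-out split. The function name `cv_and_holdout_accuracy` and the synthetic data are assumptions for illustration; the notebook's own splits could be passed in instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

def cv_and_holdout_accuracy(model, X_train, y_train, X_test, y_test, cv=5):
    """Return (mean CV accuracy on the training split, accuracy on the held-out split)."""
    cv_acc = cross_val_score(model, X_train, y_train, cv=cv).mean()
    model.fit(X_train, y_train)                # fit once on the full training split
    holdout_acc = model.score(X_test, y_test)  # test rows are never used for fitting
    return cv_acc, holdout_acc

# Synthetic stand-in data (assumption); not the Reddit dataset.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

cv_acc, holdout_acc = cv_and_holdout_accuracy(
    LogisticRegression(max_iter=1000), X_train, y_train, X_test, y_test
)
print(f"CV accuracy (train split): {cv_acc:.3f}")
print(f"Held-out accuracy:         {holdout_acc:.3f}")
```

A large gap between the two numbers would be a clearer overfitting signal than comparing two independent cross-validation runs.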

# 4.2 Model Optimisation

In [20]:
# Reload the data and apply the preprocessing chosen above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv')
df_model['title'] = df_model['title'].apply(stem_sentences)

# Define target variable
y = df_model.pop('subreddit')
X = df_model['title']

# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [21]:
# Set up pipeline
pipe_nb_c = Pipeline([
    ('cvec', CountVectorizer(stop_words=stop_words)),
    ('nb', MultinomialNB())])

# Set up pipeline params
pipe_params = {
    'cvec__max_features': [3000, 3500, 4000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [0.25, 0.35, 0.45],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)]
}

# Set up a gridsearch
gs_nb_c = GridSearchCV(pipe_nb_c, param_grid = pipe_params, cv=5)

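# Fit the gridsearch
# (note: it is fitted on the test split here, so best_score_ is a 5-fold CV score within that split)
gs_nb_c.fit(X_test, y_test)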

# Find best score
print(gs_nb_c.best_score_)

# Find best parameters
gs_nb_c.best_params_

0.9005928721286421


{'cvec__max_df': 0.35,
 'cvec__max_features': 3500,
 'cvec__min_df': 1,
 'cvec__ngram_range': (1, 1)}

In [22]:
# Set up pipeline
pipe_lr_c = Pipeline([
    ('cvec', CountVectorizer(stop_words=stop_words)),
    ('lr', LogisticRegression())])

# Set up pipeline params
pipe_params = {
    'cvec__max_features': [1000, 1500, 2000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [0.25, 0.3, 0.35],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)]
}

# Set up a gridsearch
gs_lr_c = GridSearchCV(pipe_lr_c, param_grid = pipe_params, cv=5)

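# Fit the gridsearch
# (note: it is fitted on the test split here, so best_score_ is a 5-fold CV score within that split)
gs_lr_c.fit(X_test, y_test)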

# Find best score
print(gs_lr_c.best_score_)

# Find best parameters
gs_lr_c.best_params_

0.9047325875232088


{'cvec__max_df': 0.3,
 'cvec__max_features': 1500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1)}
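
A common variant of the search above tunes on the training split only and keeps the test split for a single final score. The sketch below shows that pattern on a tiny invented corpus with a reduced parameter grid; the titles, labels and grid values are assumptions for illustration and the resulting numbers say nothing about the Reddit data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stemmed titles invented for illustration: label 1 = dog, 0 = cat.
dog = ["dog need train advic", "puppi wont stop bark", "best food for dog", "dog breed help"]
cat = ["meet my new kitten", "cat love the box", "happi kitti nap", "my cat is cute"]
titles = (dog + cat) * 5
labels = ([1] * len(dog) + [0] * len(cat)) * 5

X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.2, stratify=labels, random_state=42
)

# The vectoriser lives inside the pipeline, so each CV fold builds its own vocabulary.
pipe = Pipeline([("cvec", CountVectorizer()), ("nb", MultinomialNB())])
grid = GridSearchCV(
    pipe,
    param_grid={"cvec__ngram_range": [(1, 1), (1, 2)], "cvec__min_df": [1, 2]},
    cv=5,
)
grid.fit(X_train, y_train)           # tuning sees the training split only

print(grid.best_params_)
print(grid.score(X_test, y_test))    # held-out accuracy of the refit best pipeline
```

Because GridSearchCV refits the best pipeline on the whole training split by default, `grid.score` on the test split gives an estimate that the tuning never touched.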

# 5.1 Choice of Model - MultinomialNB

In [31]:
# Reload the data and apply the preprocessing chosen above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv')
df_model['title'] = df_model['title'].apply(stem_sentences)

# Count Vectoriser with the additional stop words
X = df_model['title']

# Instantiate a CountVectorizer with the best parameters found for MultinomialNB:
# min_df = 1, max_df = 0.35, unigrams only and at most 3500 features.
cvec = CountVectorizer(stop_words = stop_words, min_df = 1, max_df = 0.35,  ngram_range=(1,1), max_features = 3500)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser (get_feature_names_out() requires scikit-learn >= 1.0)
df_vec_model = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names_out())

# Remove the subreddit columns and add back the original columns
df_vec_model.drop(columns =['subreddit'], inplace = True)
df_vec_model = pd.concat([df_vec_model, df_model[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

# Define target variable
y = df_vec_model.pop('subreddit')
X = df_vec_model

# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

report_error(nb, X_train, y_train, X_test, y_test)

The accuracy train score for MultinomialNB() is 0.9228375873885936 .
The accuracy test score for MultinomialNB() is 0.9112072605889174 .



# 5.2 Choice of Model - Logistic Regression

In [24]:
# Reload the data and apply the preprocessing chosen above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv')
df_model['title'] = df_model['title'].apply(stem_sentences)

# Count Vectoriser with the additional stop words
X = df_model['title']

# Instantiate a CountVectorizer with the best parameters found for LogisticRegression:
# min_df = 2, max_df = 0.3, unigrams only and at most 1500 features.
cvec = CountVectorizer(stop_words = stop_words, min_df = 2, max_df = 0.3,  ngram_range=(1,1), max_features = 1500)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser (get_feature_names_out() requires scikit-learn >= 1.0)
df_vec_model = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names_out())

# Remove the subreddit columns and add back the original columns
df_vec_model.drop(columns =['subreddit'], inplace = True)
df_vec_model = pd.concat([df_vec_model, df_model[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

# Define target variable
y = df_vec_model.pop('subreddit')
X = df_vec_model

# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

report_error(lr, X_train, y_train, X_test, y_test)

The accuracy train score for LogisticRegression(n_jobs=-1) is 0.8539612425759637 .
The accuracy test score for LogisticRegression(n_jobs=-1) is 0.8350967564632782 .



Comparing the top 2 models after optimisation, logistic regression and Naive Bayes both score about 0.90, with logistic regression being slightly higher (0.905 vs 0.901). After adding in the held-out features emoji, word_count and title_length, the Naive Bayes model improves to 0.91 while logistic regression drops to 0.83.

The withheld features were quite significant to the model training. As discussed previously, cat owners tend to post many more emojis, and the two subreddits also differ in word_count and title_length (by the column sums in section 2.1, cat titles are somewhat longer on average).

Thus our choice of model would be Naive Bayes.

# 6. Model's Performance

Naïve Bayes is a classification method based on Bayes’ theorem that derives the probability of a label given a feature vector. It makes the naive assumption that the features are conditionally independent given the class, which is not always the case.
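
The calculation can be sketched by hand for word counts. The code below is a minimal multinomial Naive Bayes with Laplace smoothing in plain Python; the training titles and the function names `fit_multinomial_nb` and `predict` are invented for this example, and it is not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy training titles, invented for illustration.
train = [
    ("dog need help", "dog"),
    ("dog food advic", "dog"),
    ("cute cat nap", "cat"),
    ("cat love box", "cat"),
]

def fit_multinomial_nb(docs, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for title, label in docs:
        word_counts[label].update(title.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like, vocab

def predict(title, log_prior, log_like, vocab):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    scores = {
        # Summing per-word log likelihoods is the conditional independence assumption.
        c: log_prior[c] + sum(log_like[c][w] for w in title.split() if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

model = fit_multinomial_nb(train)
print(predict("my dog need food", *model))
print(predict("cat nap", *model))
```

Each word contributes its own term to the class score regardless of which other words are present, which is why the method stays cheap even with thousands of columns.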

Logistic regression is a linear classification method that learns the probability of a sample belonging to a certain class. Logistic regression tries to find the optimal decision boundary that best separates the classes.
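
As a small illustration, the sketch below computes the class probability of a logistic regression model from a weighted sum of features. The weights, bias and two-feature layout are hypothetical values chosen for the example, not coefficients learned in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """P(class = 1 | features) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for the features [count of "dog", count of "cat"].
weights = [2.0, -2.0]
bias = 0.0

p_dog = predict_proba([1, 0], weights, bias)  # a title mentioning "dog" once
p_tie = predict_proba([1, 1], weights, bias)  # one "dog" and one "cat" cancel out

print(round(p_dog, 3))
print(p_tie)
```

The decision boundary is the set of points where the weighted sum is zero, that is where the predicted probability equals 0.5, as in the second example.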

The two models differ in approach: Naive Bayes is a generative model that learns how the features are distributed within each class, whereas logistic regression is a discriminative model that learns the boundary between the classes directly. While both models perform comparably on the text features alone, logistic regression's performance dropped when the withheld features were added back, which suggests it did not make good use of them in this setup. Naive Bayes, on the other hand, improved, apparently because it could combine the withheld features with the word counts to better separate the classes.