<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Reddit

--- 
# Notebook 3

The third notebook will examine preprocessing techniques and try out a few models.

---

# 0. Import Packages and Read in Data


In [28]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from nltk.tokenize import word_tokenize, WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

# Import model
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Import Evaluations
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.feature_extraction import text
from sklearn import metrics

In [29]:
# Reading in the dataset
df_reddit = pd.read_csv('./dataset/cleaned.csv') 
df_reddit.head()

Unnamed: 0.1,Unnamed: 0,subreddit,title
0,0,dogs,My dog bites himself
1,1,dogs,How do you keep your dogs passively stimulated?
2,2,dogs,Dog's Sleeping Behavior Changed with Pregnant ...
3,3,dogs,Soft food diet after surgery
4,4,dogs,Does anyone know much about Intracranial Arach...


In [30]:
# Check the number of rows and columns
df_reddit.shape

(19993, 3)

# 1. Text to Matrix Representation
We will use `CountVectorizer` to split the titles into words, giving one column per distinct word.

In [32]:
X = df_reddit['title']

# Instantiate a CountVectorizer with english stopwords.
cvec = CountVectorizer(stop_words='english')

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser
df_vec_reddit = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names())

df_vec_reddit.shape

(19993, 12974)

After splitting the titles into words, the number of columns changes from 1 to 13472.
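
To make the change in shape concrete, here is a minimal sketch on a made-up three-title corpus (the titles and the `toy_` names are illustrative assumptions, not taken from the dataset). Each distinct token that survives the English stop word filter becomes one column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical titles, not from cleaned.csv)
toy_titles = [
    "My dog bites the cat",
    "The cat bites my dog",
    "Soft food for my cat",
]

toy_cvec = CountVectorizer(stop_words='english')
toy_matrix = toy_cvec.fit_transform(toy_titles)

# vocabulary_ maps each kept token to its column index
toy_vocab = sorted(toy_cvec.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)  # (number of titles, number of distinct kept tokens)
```

One text column turns into as many columns as there are kept tokens, which is why the real corpus ends up with thousands of features.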

To help the model perform better, we will try to reduce the number of columns by:
* Adding more stopwords
* Stemming
* Lemmatization

# 2.1 Stopwords
To find and add more stopwords, we will
* Check if the vectorised columns and the original columns are the same.
* Drop the vectorised columns with the same name as the original.
* Add back the original columns
* Split the dataframe into the dog and cat subreddits
* Print out the top words for each subreddit
* Add more stopwords
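
The "top words per subreddit" step can be sketched in plain Python on a handful of made-up posts. Everything here (`toy_posts`, `base_stop_words`, `top_words`) is hypothetical and only illustrates the idea: filler words that show up in both subreddits are candidates for the stop word list.

```python
from collections import Counter

# Hypothetical (subreddit, title) pairs standing in for df_reddit rows
toy_posts = [
    ("dogs", "my dog does not like the vet"),
    ("dogs", "im worried my dog bites"),
    ("cats", "my cat does not like the vet"),
    ("cats", "im adopting a cat"),
]

base_stop_words = {"my", "not", "the", "a"}  # stand-in for the built-in list
extra_stop_words = {"im", "does"}            # words added after inspection
toy_stop_words = base_stop_words | extra_stop_words

def top_words(posts, label, stop_words, n=3):
    """Return the n most frequent non-stopword tokens for one subreddit."""
    counts = Counter(
        word
        for subreddit, title in posts if subreddit == label
        for word in title.split() if word not in stop_words
    )
    return counts.most_common(n)

print(top_words(toy_posts, "dogs", base_stop_words, n=10))
print(top_words(toy_posts, "dogs", toy_stop_words, n=10))
```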

In [33]:
# Drop the two columns as we want to add back the original columns
df_vec_reddit.drop(columns =['subreddit', 'website'], inplace = True)

df_vec_reddit.shape

(19993, 12972)

In [None]:
# Adding back the columns after countvectoriser
df_vec_reddit = pd.concat([df_vec_reddit, df_reddit[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

# Check that 4 columns have been added.
df_vec_reddit.shape

In [None]:
# Separating the dataframe into cats and dogs to find the top words for each subreddit
df_vec_reddit_dogs = df_vec_reddit[df_vec_reddit['subreddit'] == 1]
df_vec_reddit_cats = df_vec_reddit[df_vec_reddit['subreddit'] == 0]

In [None]:
df_vec_reddit_dogs.sum().sort_values(ascending=False).head(50)

In [None]:
df_vec_reddit_cats.sum().sort_values(ascending=False).head(50)

In [12]:
# Original list of english stop words in SkLearn Library
text.ENGLISH_STOP_WORDS

# Create list for new stop words
add_stop_words = ['im', 'does']

# Join the new stop words to the scikit-learn list; convert to a list,
# the collection type CountVectorizer documents for stop_words
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Print to check
#stop_words

# 2.2 Lemmatization

In [13]:
# Instantiate lemmatizer.
w_tokenizer = WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

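# Create function to lemmatize each whitespace-separated token in a title
# (the parameter is named `title` so it does not shadow the imported `text` module)
def lemmatize_text(title):
    sentence = " ".join([lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(title)])
    return sentence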

df_lemmatized = pd.DataFrame(df_reddit['title'].apply(lemmatize_text))

In [14]:
X = df_lemmatized['title']

# Instantiate a CountVectorizer with the extended stop word list.
cvec = CountVectorizer(stop_words=stop_words)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser
df_lemma_reddit = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names())

df_lemma_reddit.shape

(19278, 12217)

Lemmatization reduces the number of columns from 13472 to 12196: 1276 word forms were merged into a lemma shared with other words.

# 2.3 Stemming

In [15]:
# Instantiate PorterStemmer
porter_stemmer = PorterStemmer()

# Create function to stem
def stem_sentences(sentence):
    tokens = sentence.split()
    stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

df_stemming = pd.DataFrame(df_reddit['title'].apply(stem_sentences))

df_stemming.head()

Unnamed: 0,title
0,i came home sick from work with pneumonia here...
1,my sleepyhead curl up thi morn
2,spooki take care of me when i wfh who need hr ...
3,hi i am a new cat mom and am hope someon might...
4,spooki look after me when i wfh who need a hr ...


In [None]:
X = df_stemming['title']

# Instantiate a CountVectorizer with the extended stop word list.
cvec = CountVectorizer(stop_words=stop_words)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser
df_stem_reddit = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names())

df_stem_reddit.shape

Stemming reduces the number of columns from 13472 to 10178: 3294 word forms were merged into a shared stem. Because stemming removes more features than lemmatization, we use it as the preprocessing step.
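
The mechanism behind this reduction can be illustrated with a deliberately crude suffix stripper. This is not the Porter algorithm used above, only a hypothetical stand-in (`crude_stem`) on made-up titles, showing how mapping several word forms to one stem shrinks the vocabulary.

```python
def crude_stem(token):
    """Very rough stand-in for a stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Hypothetical titles, not from the dataset
toy_titles = [
    "dog barking at cats",
    "dogs barked at the cat",
    "cat sleeps while dog barks",
]

raw_vocab = {tok for title in toy_titles for tok in title.split()}
stemmed_vocab = {crude_stem(tok) for tok in raw_vocab}

print(len(raw_vocab), "->", len(stemmed_vocab))
print(sorted(stemmed_vocab))
```

Here `barking`, `barked` and `barks` collapse into a single column, as do `dog`/`dogs` and `cat`/`cats`.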

# 3. Train Test Split

In [None]:
# Reload the data and apply the preprocessing chosen above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv') 
df_model['title'] = df_model['title'].apply(stem_sentences)

# Count Vectoriser with the additional stop words
X = df_model['title']

# Instantiate a CountVectorizer with the added stop words, min_df = 3, n-grams from 1 to 3 and at most 450 features.
cvec = CountVectorizer(stop_words = stop_words, min_df = 3, ngram_range=(1,3), max_features = 450)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser
df_vec_model = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names())

# Remove the subreddit columns and add back the original columns
#df_vec_model.drop(columns =['subreddit'], inplace = True)
df_vec_model = pd.concat([df_vec_model, df_model[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

df_vec_model.shape

In [None]:
# Define target variable
y = df_vec_model.pop('subreddit')
X = df_vec_model

# Define training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
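
As a small illustration of what `stratify=y` does in the split above, the sketch below uses made-up imbalanced labels (the `toy_` arrays are assumptions, not the Reddit data) and checks that both splits keep the original class proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 of class 0 and 30 of class 1
toy_y = np.array([0] * 70 + [1] * 30)
toy_X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# The share of class 1 is preserved in both splits
print(y_tr.mean(), y_te.mean())
```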

# 4.1 Modelling Mass Fitting
We will first fit a variety of models and pick the top 2 for optimisation.

In [19]:
# Instantiate models
rfc = RandomForestClassifier(n_jobs=-1)
etc = ExtraTreesClassifier(n_jobs=-1)
lr = LogisticRegression(n_jobs=-1)
nb = MultinomialNB()
ada = AdaBoostClassifier()
knn = KNeighborsClassifier(n_jobs=-1)
bag = BaggingClassifier(n_jobs=-1)

In [20]:
# Create a function that fits the model on the first set, then prints the mean 5-fold
# cross-validated accuracy computed separately on the first (train) and second (test) sets
def report_error(model, X1, y1, X2, y2):
    model.fit(X1, y1)
    print('The accuracy train score for', model, 'is', cross_val_score(model, X1, y1, cv=5).mean(),'.')
    print('The accuracy test score for',model, 'is', cross_val_score(model, X2, y2, cv=5).mean(),'.')
    print()
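
`report_error` cross-validates on each split separately, so the "test" figure comes from models refitted on folds of the test set. An alternative arrangement, sketched here on synthetic data, cross-validates on the training split and then scores the fitted model once on the untouched test split. The helper name `report_holdout` and the `make_classification` data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (not the Reddit features)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

def report_holdout(model, X_tr, y_tr, X_te, y_te):
    """Cross-validate on the training split, then score once on the test split."""
    cv_acc = cross_val_score(model, X_tr, y_tr, cv=5).mean()
    model.fit(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    return cv_acc, test_acc

cv_acc, test_acc = report_holdout(
    LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te
)
print(f"CV accuracy on train: {cv_acc:.3f}, hold-out accuracy: {test_acc:.3f}")
```

A large gap between the two numbers would be a sign of overfitting.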

In [None]:
report_error(rfc, X_train, y_train, X_test, y_test)
report_error(etc, X_train, y_train, X_test, y_test)
report_error(lr, X_train, y_train, X_test, y_test)
report_error(nb, X_train, y_train, X_test, y_test)
report_error(ada, X_train, y_train, X_test, y_test)
report_error(knn, X_train, y_train, X_test, y_test)
report_error(bag, X_train, y_train, X_test, y_test)

None of the models shows signs of overfitting. All models except K-Nearest Neighbours have accuracy scores above 85%. The top 2 performing models are LogisticRegression and MultinomialNB, with MultinomialNB scoring about 0.01 higher. We will proceed to optimise both using grid search.

# 4.2 Model Optimisation

In [22]:
# Reload the data and apply the preprocessing chosen above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv') 
df_model['title'] = df_model['title'].apply(stem_sentences)

# Define target variable
y = df_model.pop('subreddit')
X = df_model['title']

# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [23]:
# Set up pipeline
pipe_nb_c = Pipeline([
    ('cvec', CountVectorizer(stop_words=stop_words)),
    ('nb', MultinomialNB())])

# Set up pipeline params
pipe_params = {
    'cvec__max_features': [3000, 3500, 4000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [0.25, 0.35, 0.45],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)]
}

# Set up a gridsearch
gs_nb_c = GridSearchCV(pipe_nb_c, param_grid = pipe_params, cv=5)

# Fit the gridsearch
gs_nb_c.fit(X_test, y_test)

# Find best score
print(gs_nb_c.best_score_)

# Find best parameters
gs_nb_c.best_params_

0.8967799708339215


{'cvec__max_df': 0.35,
 'cvec__max_features': 4000,
 'cvec__min_df': 1,
 'cvec__ngram_range': (1, 1)}
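
The search above is fitted on `X_test`. A more conventional arrangement, shown here as a minimal sketch on a made-up corpus, runs the search on the training split and keeps the test split for a single final score. All `toy_` names and the small parameter grid are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = dogs, 0 = cats
toy_titles = ["dog barks at park", "puppy chews leash", "dog fetch ball",
              "cat purrs on sofa", "kitten climbs curtain", "cat naps on bed"] * 10
toy_labels = [1, 1, 1, 0, 0, 0] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_titles, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

toy_pipe = Pipeline([('cvec', CountVectorizer()), ('nb', MultinomialNB())])
toy_params = {'cvec__ngram_range': [(1, 1), (1, 2)], 'cvec__min_df': [1, 2]}

toy_gs = GridSearchCV(toy_pipe, param_grid=toy_params, cv=3)
toy_gs.fit(X_tr, y_tr)            # the search only sees the training split

print(toy_gs.best_params_)
toy_test_acc = toy_gs.score(X_te, y_te)  # the test split is used once, at the end
print(toy_test_acc)
```

Because the vectorizer sits inside the pipeline, its vocabulary is re-learned within each cross-validation fold, so no information from the validation fold leaks into the features.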

In [24]:
# Set up pipeline
pipe_lr_c = Pipeline([
    ('cvec', CountVectorizer(stop_words=stop_words)),
    ('lr', LogisticRegression())])

# Set up pipeline params
pipe_params = {
    'cvec__max_features': [1000, 1500, 2000],
    'cvec__min_df': [1, 2, 3],
    'cvec__max_df': [0.25, 0.3, 0.35],
    'cvec__ngram_range': [(1,1), (1,2), (1,3)]
}

# Set up a gridsearch
gs_lr_c = GridSearchCV(pipe_lr_c, param_grid = pipe_params, cv=5)

# Fit the gridsearch
gs_lr_c.fit(X_test, y_test)

# Find best score
print(gs_lr_c.best_score_)

# Find best parameters
gs_lr_c.best_params_

0.8962641882220117


{'cvec__max_df': 0.3,
 'cvec__max_features': 2000,
 'cvec__min_df': 1,
 'cvec__ngram_range': (1, 2)}

# 5.1 Choice of Model - MultinomialNB

In [None]:
# Reload the data and apply the preprocessing chosen above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv') 
df_model['title'] = df_model['title'].apply(stem_sentences)

# Count Vectoriser with the additional stop words
X = df_model['title']

# Instantiate a CountVectorizer with the added stop words, min_df = 1, max_df = 0.35, unigrams only and at most 3500 features.
cvec = CountVectorizer(stop_words = stop_words, min_df = 1, max_df = 0.35,  ngram_range=(1,1), max_features = 3500)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser
df_vec_model = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names())

# Remove the vectorised 'subreddit' column (if present) and add back the original columns
df_vec_model.drop(columns =['subreddit'], inplace = True, errors = 'ignore')
df_vec_model = pd.concat([df_vec_model, df_model[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

# Define target variable
y = df_vec_model.pop('subreddit')
X = df_vec_model

# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

report_error(nb, X_train, y_train, X_test, y_test)

# 5.2 Choice of Model - Logistic Regression

In [None]:
# Reload the data and apply the preprocessing chosen above.
# Stemming
df_model = pd.read_csv('./dataset/cleaned.csv') 
df_model['title'] = df_model['title'].apply(stem_sentences)

# Count Vectoriser with the additional stop words
X = df_model['title']

# Instantiate a CountVectorizer with the added stop words, min_df = 2, max_df = 0.3, unigrams only and at most 1500 features.
cvec = CountVectorizer(stop_words = stop_words, min_df = 2, max_df = 0.3,  ngram_range=(1,1), max_features = 1500)

# Fit the vectorizer on our corpus and transform it.
X_vec = cvec.fit_transform(X)

# Create a dataframe after count vectoriser
df_vec_model = pd.DataFrame(X_vec.todense(), columns = cvec.get_feature_names())

# Remove the vectorised 'subreddit' column (if present) and add back the original columns
df_vec_model.drop(columns =['subreddit'], inplace = True, errors = 'ignore')
df_vec_model = pd.concat([df_vec_model, df_model[['subreddit', 'emoji', 'word_count', 'title_length']]], axis=1)

# Define target variable
y = df_vec_model.pop('subreddit')
X = df_vec_model

# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

report_error(lr, X_train, y_train, X_test, y_test)

# 6. Model's Performance

In [None]:
def get_salient_words(nb_clf, vect, class_ind):
    # vect must be the vectorizer whose columns nb_clf was fitted on
    words = vect.get_feature_names()
    zipped = list(zip(words, nb_clf.feature_log_prob_[class_ind]))
    # Sort by log-probability, highest first
    sorted_zip = sorted(zipped, key=lambda t: t[1], reverse=True)
    return sorted_zip

neg_salient_top_20 = get_salient_words(nb, cvec, 0)[:20]
pos_salient_top_20 = get_salient_words(nb, cvec, 1)[:20]

pos_salient_top_20
neg_salient_top_20
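
Ranking by `feature_log_prob_` within one class tends to surface words that are simply frequent in that class, including words common to both. One possible complement, sketched below on a made-up corpus, is to rank words by the difference in log-probability between the two classes. The `toy_` names are assumptions, and the vocabulary is read from `vocabulary_` so that it lines up with the classifier's columns.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = dogs, 0 = cats
toy_titles = ["dog walk help", "dog bark help", "cat purr help", "cat nap help"]
toy_labels = [1, 1, 0, 0]

toy_cvec = CountVectorizer()
toy_X = toy_cvec.fit_transform(toy_titles)
toy_nb = MultinomialNB().fit(toy_X, toy_labels)

# Order the vocabulary by column index so it lines up with feature_log_prob_
toy_words = np.array(sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get))

# Positive values lean towards class 1 (dogs), negative towards class 0 (cats)
log_ratio = toy_nb.feature_log_prob_[1] - toy_nb.feature_log_prob_[0]
order = np.argsort(log_ratio)

print("most cat-like:", toy_words[order[:1]])
print("most dog-like:", toy_words[order[-1:]])
```

A word such as `help`, which appears equally often in both classes, gets a log ratio of zero and drops out of both ends of the ranking.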