<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3: Web APIs & NLP - Ngram

## Problem Statement

The current classifier, which relies on straightforward keywords such as ‘bootcamp’ and ‘coding’, yields around 79% accuracy.

Build a model with >90% accuracy that distinguishes people looking for bootcamp-style learning from computer science majors and prospective students, based on the words they use online.

## Jupyter Notebook Purpose:
#### This notebook contains code for:
- N-gram data preprocessing.
- GridSearchCV (model optimization) on selected models: Bernoulli NB, Multinomial NB, KNN and Logistic Regression.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import string
import jupyternotify
from sklearn import metrics

from sklearn.model_selection import cross_val_predict, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression 

#Ignore warnings
import warnings
warnings.filterwarnings("ignore")

#change display options
pd.set_option("display.max_rows", 160)
pd.set_option("display.max_columns", 80)

In [2]:
%load_ext jupyternotify


## Import Data

In [3]:
#dataset for model training
data = pd.read_csv('../project_3/datasets/data_2.csv')

## Data Dictionary:

**Dataset name: data_2**
- Size: 9982 observations, 2 variables
- Description: Final dataset that contains data scraped from Reddit

|Feature|Type|Dataset|Description|
|:---|:---|:---|:---|
|**subreddit**|*integer*|data|Subreddit categories. 0 refers to csMajors, 1 refers to codingbootcamp| 
|**text**|*string*|data|Posts extracted from the csMajors and codingbootcamp subreddits|

## Pre-processing data

In [4]:
# NLTK(Natural Language Toolkit).
stopword = nltk.corpus.stopwords.words('english')

# add_stopword
add_stopword = stopword + ["want","one","anyone","like","im","get","would","got"] + ['really','also','ive','know','dont','go','much','lot','take','think','even','getting','back',""]

# Lemmatizing
wn = nltk.WordNetLemmatizer()

In [5]:
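def clean_text_ngram(text):
    # Drop punctuation character by character, lowercasing what remains
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    
    # Remove stopwords, lemmatize, and rejoin into one string:
    # CountVectorizer requires a full sentence to be passed in as opposed to a tokenized list
    text = " ".join([wn.lemmatize(word) for word in tokens if word not in add_stopword])
    return text

data['cleaned_text'] = data['text'].apply(clean_text_ngram)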
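
To see what the cleaning step does to a single post, here is a minimal standalone sketch. It mirrors the punctuation removal, tokenization and stopword filtering above, but it skips lemmatization and uses a tiny hypothetical stopword set (`demo_stopwords`) so that it runs without NLTK data. The sample sentence is made up.

```python
import re
import string

# Hypothetical mini stopword list; the notebook uses NLTK's English list plus custom additions.
# The empty string is included because re.split can return "" at the edges of the text.
demo_stopwords = {"i", "a", "is", "the", "it", "im", ""}

def clean_text_demo(text):
    # Remove punctuation character by character and lowercase the rest
    text = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters (whitespace, at this point)
    tokens = re.split(r"\W+", text)
    # Drop stopwords and rejoin into a single string for the vectorizer
    return " ".join(tok for tok in tokens if tok not in demo_stopwords)

cleaned = clean_text_demo("I'm thinking: is a coding bootcamp worth it? ")
print(cleaned)
```

Note that "I'm" becomes "im" once the apostrophe is stripped, which is why forms such as "im", "ive" and "dont" appear in the custom stopword list.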


## Setting Up Data For Modeling

### Train Test Split

In [6]:
X = data["cleaned_text"]
y = data["subreddit"]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)
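
The split sizes can be checked by hand. To my understanding, scikit-learn rounds the test size up and the train size down when `test_size` is given as a fraction. Applying that to the 9,982 rows listed in the data dictionary gives the row counts that appear in later outputs.

```python
import math

n_samples = 9982      # rows in data_2, per the data dictionary
test_size = 0.25

# Test rows are rounded up, training rows are rounded down
n_test = math.ceil(test_size * n_samples)
n_train = math.floor((1 - test_size) * n_samples)

print(n_train, n_test)
```

These match the 7,486 training rows in the vectorizer output shapes and the 2,496 test rows in the classification reports. Because of `stratify=y`, the test rows keep the class proportions of the full dataset, which shows up as a support of 1,248 per class.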

In [8]:
# Create an instance of CountVectorizer
cv = CountVectorizer()

In [9]:
# ngram_range=(min_n, max_n) sets the range of n-gram sizes to extract:
# (1,1) = unigrams only, (2,2) = bigrams only, (3,3) = trigrams only
# (1,3) would return unigrams, bigrams and trigrams, since the lower limit is 1 and the upper limit is 3
ngram_vect_1 = CountVectorizer(ngram_range=(1,1)) 
X_ngram_1 = ngram_vect_1.fit_transform(X_train)

ngram_vect_2 = CountVectorizer(ngram_range=(2,2)) 
X_ngram_2 = ngram_vect_2.fit_transform(X_train)

ngram_vect_3 = CountVectorizer(ngram_range=(3,3)) 
X_ngram_3 = ngram_vect_3.fit_transform(X_train)
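
To make the `ngram_range` settings concrete, the sketch below builds n-grams from a whitespace-tokenized string in plain Python. The helper `word_ngrams` and the sample text are illustrative only. `CountVectorizer` applies its own tokenization rules as well (for example, by default it drops single-character tokens), so this is an approximation of what it does.

```python
def word_ngrams(text, n):
    """Return the list of n-word sequences in a whitespace-tokenized string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "computer science internship offer"
unigrams = word_ngrams(sample, 1)   # like ngram_range=(1,1)
bigrams = word_ngrams(sample, 2)    # like ngram_range=(2,2)
trigrams = word_ngrams(sample, 3)   # like ngram_range=(3,3)
print(bigrams)
```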

In [10]:
print(X_ngram_1.shape)
print(X_ngram_2.shape)
print(X_ngram_3.shape)

(7486, 16457)
(7486, 166109)
(7486, 223311)


- The training set has 16,457 unique unigrams (1-word terms)
- The training set has 166,109 unique bigrams (2-word combinations)
- The training set has 223,311 unique trigrams (3-word combinations)

## Modeling

### Setting up pipelines

#### Bi-Grams (2-2) + Bernoulli Naive Bayes + Data Preprocessing

In [11]:
# Setting up pipeline with 2 stages:
# 1. N-grams (2,2)
# 2. Bernoulli Naive Bayes (estimator)

pipe_bi_ber = Pipeline([
    ("cv", CountVectorizer(ngram_range=(2,2))),
    ('ber', BernoulliNB())
])

In [12]:
params_bi_ber = {
    'cv__max_features': [9_000],
    'cv__min_df': [3],
    'cv__max_df': [.4],
    'ber__alpha': [0],
    'ber__binarize' : [0]
}
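
The three `cv__` settings prune the vocabulary before the classifier sees it. Below is a rough pure-Python sketch of that pruning as I understand `CountVectorizer`'s behaviour: an integer `min_df` is a minimum document count, a float `max_df` is a maximum fraction of documents, and `max_features` keeps the most frequent surviving terms. The function `prune_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = Counter()    # number of documents containing each term
    term_freq = Counter()   # total occurrences of each term
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_df * n_docs]
    # max_features keeps the terms with the highest total counts
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

toy_docs = [
    "bootcamp job bootcamp",
    "bootcamp tuition job",
    "bootcamp degree",
    "bootcamp internship degree",
    "bootcamp job offer",
]
vocab = prune_vocabulary(toy_docs, min_df=2, max_df=0.8)
print(vocab)
```

In the toy corpus, "bootcamp" is dropped for appearing in every document and the one-off terms are dropped for being too rare. With the values in this grid, a bigram must appear in at least 3 posts and in no more than 40% of them, and only the 9,000 most frequent survivors are kept.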

In [13]:
# Instantiate GridSearchCV
gs_bi_ber = GridSearchCV(pipe_bi_ber, # what object are we optimizing?
                  param_grid=params_bi_ber, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.
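
The grid above lists a single value per parameter, so the search fits one configuration and mainly provides its 5-fold cross-validation score. For reference, here is a small sketch of a search over several candidates on a made-up toy corpus. The documents, labels and candidate values are assumptions for illustration, not the project's data or tuned settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 1 = codingbootcamp-like, 0 = csMajors-like
toy_docs = [
    "coding bootcamp worth it",
    "coding bootcamp job search",
    "finished coding bootcamp today",
    "coding bootcamp tuition cost",
    "computer science internship offer",
    "computer science degree plan",
    "computer science internship interview",
    "computer science class schedule",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([
    ("cv", CountVectorizer()),
    ("mnb", MultinomialNB()),
])

# Keys follow the "<step name>__<parameter>" convention
toy_params = {
    "cv__ngram_range": [(1, 1), (2, 2)],
    "mnb__alpha": [0.5, 1.0],
}

toy_gs = GridSearchCV(toy_pipe, param_grid=toy_params, cv=2)
toy_gs.fit(toy_docs, toy_labels)

print(toy_gs.best_params_)
print(len(toy_gs.cv_results_["params"]))  # number of candidate combinations tried
```

`best_score_` is the mean cross-validated accuracy of the winning combination, and `cv_results_` holds the per-candidate scores if a closer comparison is needed.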

In [14]:
# Fit GridSearch to training data.
gs_bi_ber.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(ngram_range=(2, 2))),
                                       ('ber', BernoulliNB())]),
             param_grid={'ber__alpha': [0], 'ber__binarize': [0],
                         'cv__max_df': [0.4], 'cv__max_features': [9000],
                         'cv__min_df': [3]})

In [15]:
# best params
gs_bi_ber.best_params_

{'ber__alpha': 0,
 'ber__binarize': 0,
 'cv__max_df': 0.4,
 'cv__max_features': 9000,
 'cv__min_df': 3}

In [16]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_bi_ber.best_score_}') 

# Score model on training set.
print(f'score on train set: {gs_bi_ber.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_bi_ber.score(X_test, y_test)}')

best cross validation score: 0.8406347184801289
score on train set: 0.9043547956184879
score on test set: 0.8605769230769231



In [17]:
y_pred_bi_ber = gs_bi_ber.predict(X_test)
print(classification_report(y_test, y_pred_bi_ber))

              precision    recall  f1-score   support

           0       0.82      0.93      0.87      1248
           1       0.91      0.80      0.85      1248

    accuracy                           0.86      2496
   macro avg       0.87      0.86      0.86      2496
weighted avg       0.87      0.86      0.86      2496
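
The figures in this report are tied together by simple formulas. The sketch below recomputes F1 from the rounded precision and recall values printed above, and recovers accuracy as the mean of the two recalls, which holds here because both classes have the same support (1,248). Since the inputs are rounded to two decimals, the results only match approximately.

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Rounded values from the bi-gram + Bernoulli NB report
p0, r0 = 0.82, 0.93   # class 0 (csMajors)
p1, r1 = 0.91, 0.80   # class 1 (codingbootcamp)

f1_class0 = f1(p0, r0)
f1_class1 = f1(p1, r1)

# With equal support per class, accuracy equals the average recall
accuracy = (r0 + r1) / 2

print(round(f1_class0, 2), round(f1_class1, 2), round(accuracy, 3))
```

The lower recall for class 1 means this model misses about one in five codingbootcamp posts, while it rarely misses csMajors posts.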



#### Tri-Grams (3-3) + Bernoulli Naive Bayes + Data Preprocessing

In [18]:
# Setting up pipeline with 2 stages:
# 1. N-grams (3,3)
# 2. Bernoulli Naive Bayes (estimator)

pipe_tri_ber = Pipeline([
    ("cv", CountVectorizer(ngram_range=(3,3))),
    ('ber', BernoulliNB())
])

In [19]:
params_tri_ber = {
    'cv__min_df': [1],
    'cv__max_df': [0.4],
    'ber__alpha': [0],
    'ber__binarize' : [0]
}

In [20]:
# Instantiate GridSearchCV
gs_tri_ber = GridSearchCV(pipe_tri_ber, # what object are we optimizing?
                  param_grid=params_tri_ber, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [21]:
# Fit GridSearch to training data.
gs_tri_ber.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(ngram_range=(3, 3))),
                                       ('ber', BernoulliNB())]),
             param_grid={'ber__alpha': [0], 'ber__binarize': [0],
                         'cv__max_df': [0.4], 'cv__min_df': [1]})

In [22]:
# best params: 
gs_tri_ber.best_params_

{'ber__alpha': 0, 'ber__binarize': 0, 'cv__max_df': 0.4, 'cv__min_df': 1}

In [23]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_tri_ber.best_score_}')

# Score model on training set.
print(f'score on train set: {gs_tri_ber.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_tri_ber.score(X_test, y_test)}')

best cross validation score: 0.7568778857224908
score on train set: 0.9687416510820198
score on test set: 0.7688301282051282



In [24]:
y_pred_tri_ber = gs_tri_ber.predict(X_test)
print(classification_report(y_test, y_pred_tri_ber))

              precision    recall  f1-score   support

           0       0.70      0.94      0.80      1248
           1       0.91      0.59      0.72      1248

    accuracy                           0.77      2496
   macro avg       0.81      0.77      0.76      2496
weighted avg       0.81      0.77      0.76      2496



#### Bi-Grams (2-2) + Multinomial Naive Bayes + Data Preprocessing

In [25]:
# Setting up pipeline with 2 stages:
# 1. Ngram (2,2) 
# 2. Multinomial Naive Bayes (estimator)

pipe_bi_mnb = Pipeline([
    ("cv", CountVectorizer(ngram_range=(2,2))),
    ('mnb', MultinomialNB())
])
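
The main difference from the Bernoulli pipelines is the input each model works with: Multinomial NB uses how often each n-gram occurs in a post, whereas Bernoulli NB with `binarize=0` only uses whether it occurs at all. A small plain-Python sketch with a made-up post:

```python
from collections import Counter

post = "coding bootcamp review coding bootcamp cost"
tokens = post.split()
bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# Multinomial NB view: raw counts per bi-gram
counts = Counter(bigrams)

# Bernoulli NB view with binarize=0: any count above 0 becomes 1
presence = {gram: int(count > 0) for gram, count in counts.items()}

print(counts["coding bootcamp"], presence["coding bootcamp"])
```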

In [26]:
#hyperparameter tuning
params_bi_mnb = {
    'cv__min_df': [0],
    'cv__max_df': [0.4]
}

In [27]:
# Instantiate GridSearchCV
gs_bi_mnb = GridSearchCV(pipe_bi_mnb, # what object are we optimizing?
                  param_grid=params_bi_mnb, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [28]:
# Fit GridSearch to training data.
gs_bi_mnb.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(ngram_range=(2, 2))),
                                       ('mnb', MultinomialNB())]),
             param_grid={'cv__max_df': [0.4], 'cv__min_df': [0]})

In [29]:
# best params: 
gs_bi_mnb.best_params_

{'cv__max_df': 0.4, 'cv__min_df': 0}

In [30]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross val score for Bi-gram + Multinomial NB: {gs_bi_mnb.best_score_}')

# Score model on training set.
print(f'Bi-gram + Multinomial NB score on train set: {gs_bi_mnb.score(X_train, y_train)}')

# Score model on testing set.
print(f'Bi-gram + Multinomial NB score on test set: {gs_bi_mnb.score(X_test, y_test)}')

best cross val score for Bi-gram + Multinomial NB: 0.8837829642373309
Bi-gram + Multinomial NB score on train set: 0.9846379909163773
Bi-gram + Multinomial NB score on test set: 0.905448717948718



In [31]:
y_pred_bi_mnb = gs_bi_mnb.predict(X_test)
print(classification_report(y_test, y_pred_bi_mnb))

              precision    recall  f1-score   support

           0       0.91      0.90      0.90      1248
           1       0.90      0.91      0.91      1248

    accuracy                           0.91      2496
   macro avg       0.91      0.91      0.91      2496
weighted avg       0.91      0.91      0.91      2496



#### Tri-Grams (3-3) + Multinomial Naive Bayes + Data Preprocessing

In [32]:
# Setting up pipeline with 2 stages:
# 1. Ngram (3,3) 
# 2. Multinomial Naive Bayes (estimator)

pipe_tri_mnb = Pipeline([
    ("cv", CountVectorizer(ngram_range=(3,3))),
    ('mnb', MultinomialNB())
])

In [33]:
#hyperparameter tuning
params_tri_mnb = {
    'cv__min_df': [0, 1, 2, 3],
    'cv__max_df': [0.4,.9, .95]
}
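
This is the only grid in the notebook with more than one value per parameter. A quick way to see how much work it implies is to enumerate the combinations. With 5-fold cross-validation each one is fitted five times, followed by one final refit of the best candidate on the full training set (GridSearchCV's default behaviour).

```python
from itertools import product

min_df_values = [0, 1, 2, 3]
max_df_values = [0.4, 0.9, 0.95]

# Every pairing of min_df and max_df is one candidate configuration
candidates = list(product(min_df_values, max_df_values))
n_folds = 5
n_cv_fits = len(candidates) * n_folds

print(len(candidates), n_cv_fits)
```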

In [34]:
# Instantiate GridSearchCV
gs_tri_mnb = GridSearchCV(pipe_tri_mnb, # what object are we optimizing?
                  param_grid=params_tri_mnb, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [35]:
# Fit GridSearch to training data.
gs_tri_mnb.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(ngram_range=(3, 3))),
                                       ('mnb', MultinomialNB())]),
             param_grid={'cv__max_df': [0.4, 0.9, 0.95],
                         'cv__min_df': [0, 1, 2, 3]})

In [36]:
# best params: 
gs_tri_mnb.best_params_

{'cv__max_df': 0.4, 'cv__min_df': 0}

In [37]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross val score for Tri-gram + Multinomial NB: {gs_tri_mnb.best_score_}')

# Score model on training set.
print(f'Tri-gram + Multinomial NB score on train set: {gs_tri_mnb.score(X_train, y_train)}')

# Score model on testing set.
print(f'Tri-gram + Multinomial NB score on test set: {gs_tri_mnb.score(X_test, y_test)}')

best cross val score for Tri-gram + Multinomial NB: 0.7618201244500572
Tri-gram + Multinomial NB score on train set: 0.9695431472081218
Tri-gram + Multinomial NB score on test set: 0.7748397435897436



In [38]:
y_pred_tri_mnb = gs_tri_mnb.predict(X_test)
print(classification_report(y_test, y_pred_tri_mnb))

              precision    recall  f1-score   support

           0       0.71      0.93      0.81      1248
           1       0.90      0.62      0.73      1248

    accuracy                           0.77      2496
   macro avg       0.80      0.77      0.77      2496
weighted avg       0.80      0.77      0.77      2496



#### Bi-gram (2,2) + Logistic Regression + Data Preprocessing

In [39]:
# Setting up pipeline with 2 stages:
# 1. Bi-gram (2,2) (transformer) + clean text
# 2. Logistic Regression (estimator)

pipe_bi_lg = Pipeline([
    ("cv", CountVectorizer(ngram_range=(2,2))),
    ('lg', LogisticRegression(solver='liblinear'))
])

In [40]:
#hyperparameter tuning
params_bi_lg = {
    'cv__max_features': [11_000],
    'cv__min_df': [0],
    'cv__max_df': [0.4]
}

In [41]:
# pipe_bi_lg.get_params()

In [42]:
# Instantiate GridSearchCV
gs_bi_lg = GridSearchCV(pipe_bi_lg, # what object are we optimizing?
                  param_grid=params_bi_lg, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [43]:
# Fit GridSearch to training data.
gs_bi_lg.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(ngram_range=(2, 2))),
                                       ('lg',
                                        LogisticRegression(solver='liblinear'))]),
             param_grid={'cv__max_df': [0.4], 'cv__max_features': [11000],
                         'cv__min_df': [0]})

In [44]:
# best params:
gs_bi_lg.best_params_

{'cv__max_df': 0.4, 'cv__max_features': 11000, 'cv__min_df': 0}

In [45]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_bi_lg.best_score_}') 

# Score model on training set.
print(f'score on train set: {gs_bi_lg.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_bi_lg.score(X_test, y_test)}')

best cross validation score: 0.858400289675925
score on train set: 0.9441624365482234
score on test set: 0.875



In [46]:
y_pred_bi_lg = gs_bi_lg.predict(X_test)
print(classification_report(y_test, y_pred_bi_lg))

              precision    recall  f1-score   support

           0       0.85      0.92      0.88      1248
           1       0.91      0.83      0.87      1248

    accuracy                           0.88      2496
   macro avg       0.88      0.88      0.87      2496
weighted avg       0.88      0.88      0.87      2496



#### Tri-gram (3,3) + Logistic Regression + Data Preprocessing

In [47]:
# Setting up pipeline with 2 stages:
# 1. Tri-gram (3,3) (transformer) + clean text
# 2. Logistic Regression (estimator)

pipe_tri_lg = Pipeline([
    ("cv", CountVectorizer(ngram_range=(3,3))),
    ('lg', LogisticRegression(solver='liblinear'))
])

In [48]:
#hyperparameter tuning
params_tri_lg = {
    'cv__max_features': [20_000],
    'cv__min_df': [0],
    'cv__max_df': [0.4]
}

In [49]:
# Instantiate GridSearchCV
gs_tri_lg = GridSearchCV(pipe_tri_lg, # what object are we optimizing?
                  param_grid=params_tri_lg, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [50]:
# Fit GridSearch to training data.
gs_tri_lg.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(ngram_range=(3, 3))),
                                       ('lg',
                                        LogisticRegression(solver='liblinear'))]),
             param_grid={'cv__max_df': [0.4], 'cv__max_features': [20000],
                         'cv__min_df': [0]})

In [51]:
# best params:
gs_tri_lg.best_params_

{'cv__max_df': 0.4, 'cv__max_features': 20000, 'cv__min_df': 0}

In [52]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_tri_lg.best_score_}') 

# Score model on training set.
print(f'score on train set: {gs_tri_lg.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_tri_lg.score(X_test, y_test)}')

best cross validation score: 0.7114615523882657
score on train set: 0.8820464867753139
score on test set: 0.7183493589743589



In [53]:
y_pred_tri_lg = gs_tri_lg.predict(X_test)
print(classification_report(y_test, y_pred_tri_lg))

              precision    recall  f1-score   support

           0       0.65      0.95      0.77      1248
           1       0.91      0.48      0.63      1248

    accuracy                           0.72      2496
   macro avg       0.78      0.72      0.70      2496
weighted avg       0.78      0.72      0.70      2496



#### Tri-Gram (3,3) + KNN + Data Preprocessing

In [54]:
# Setting up pipeline with 2 stages:
# 1. N-grams + clean text
#    NOTE: despite the "bi" naming, ngram_range is (3,3), so the results below are for tri-grams
# 2. KNN (estimator)

pipe_bi_knn = Pipeline([
    ("bi", CountVectorizer(ngram_range=(3,3))),
    ('knn', KNeighborsClassifier())
])

In [55]:
#hyperparameter tuning
params_bi_knn = {
    'bi__max_features': [7_000],
    'bi__min_df': [0],
    'bi__max_df': [0.4],
    'knn__n_neighbors': [3]
}

In [56]:
# Instantiate GridSearchCV
gs_bi_knn = GridSearchCV(pipe_bi_knn, # what object are we optimizing?
                  param_grid=params_bi_knn, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [57]:
# Fit GridSearch to training data.
gs_bi_knn.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('bi',
                                        CountVectorizer(ngram_range=(3, 3))),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'bi__max_df': [0.4], 'bi__max_features': [7000],
                         'bi__min_df': [0], 'knn__n_neighbors': [3]})

In [58]:
# best params:
gs_bi_knn.best_params_

{'bi__max_df': 0.4,
 'bi__max_features': 7000,
 'bi__min_df': 0,
 'knn__n_neighbors': 3}

In [59]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_bi_knn.best_score_}') 

# Score model on training set.
print(f'score on train set: {gs_bi_knn.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_bi_knn.score(X_test, y_test)}')

best cross validation score: 0.5861653881862524
score on train set: 0.7898744322735773
score on test set: 0.6121794871794872


In [60]:
y_pred_bi_knn = gs_bi_knn.predict(X_test)
print(classification_report(y_test, y_pred_bi_knn))

              precision    recall  f1-score   support

           0       0.88      0.26      0.40      1248
           1       0.57      0.96      0.71      1248

    accuracy                           0.61      2496
   macro avg       0.72      0.61      0.56      2496
weighted avg       0.72      0.61      0.56      2496



In [61]:
%notify


## Summary on Model Accuracy

|**Vectorization Method**|**Model**|**Train Accuracy**|**Test Accuracy**|
|:---|:---|:---:|:---:|
|Bi-Gram|Naive Bayes - Bernoulli|0.90435|0.86058|
|Bi-Gram|Naive Bayes - Multinomial|0.98464|0.90545|
|Bi-Gram|Logistic Regression|0.94416|0.87500|
|Tri-Gram|Naive Bayes - Bernoulli|0.96874|0.76883|
|Tri-Gram|Naive Bayes - Multinomial|0.96954|0.77484|
|Tri-Gram|Logistic Regression|0.88205|0.71835|
|Tri-Gram (pipeline named `bi`, run with `ngram_range=(3,3)`)|K-Nearest Neighbours|0.78987|0.61218|
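
One way to read these results is to look at the gap between train and test accuracy, which is a rough indicator of overfitting. The sketch below recomputes that gap from the score outputs in this notebook, rounded to four decimals. The dictionary labels are my own shorthand.

```python
# (train accuracy, test accuracy), rounded from the score outputs in this notebook
scores = {
    "Bi-Gram / Bernoulli NB": (0.9044, 0.8606),
    "Bi-Gram / Multinomial NB": (0.9846, 0.9054),
    "Bi-Gram / Logistic Regression": (0.9442, 0.8750),
    "Tri-Gram / Bernoulli NB": (0.9687, 0.7688),
    "Tri-Gram / Multinomial NB": (0.9695, 0.7748),
    "Tri-Gram / Logistic Regression": (0.8820, 0.7183),
    "K-Nearest Neighbours": (0.7899, 0.6122),
}

# Difference between train and test accuracy for each model
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

# Model with the highest test accuracy
best_model = max(scores, key=lambda name: scores[name][1])

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train-test gap = {gap:.4f}")
print("Highest test accuracy:", best_model)
```

On the test set, only the bi-gram Multinomial NB model clears the 90% target from the problem statement, and every tri-gram model shows a much wider train-test gap than its bi-gram counterpart.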