<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3: Web APIs & NLP - Count Vectorizing

## Problem Statement

The current classification model, which relies on straightforward keywords such as ‘bootcamp’ and ‘coding’, yields around 79% accuracy.

Build a model with >90% accuracy that distinguishes people looking for bootcamp-style learning from computer science majors and prospective students, based on the words they use online.

## Jupyter Notebook Purpose
#### This notebook contains code for:
- The Count Vectorizing data preprocessing method.
- GridSearchCV (model optimization) on selected models: Bernoulli NB, Multinomial NB, KNN and Logistic Regression.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import string
import jupyternotify
from sklearn import metrics

from sklearn.model_selection import cross_val_predict, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression 

#Ignore warnings
import warnings
warnings.filterwarnings("ignore")

#change display options
pd.set_option("display.max_rows", 160)
pd.set_option("display.max_columns", 80)

In [2]:
%load_ext jupyternotify 
# %notify


## Import Data

In [3]:
#dataset for model training
data = pd.read_csv('../project_3/datasets/data_2.csv')

## Data Dictionary:

**Dataset name: data_2**
- Size: 9982 observations, 2 variables
- Description: Final dataset containing data scraped from Reddit

|Feature|Type|Dataset|Description|
|:---|:---|:---|:---|
|**subreddit**|*integer*|data|Subreddit categories. 0 refers to csMajors, 1 refers to codingbootcamp| 
|**text**|*string*|data|Posts extracted from the csMajors and codingbootcamp subreddits|

## Setting Up Data For Modeling

### Train Test Split

In [4]:
X = data["text"]
y = data["subreddit"]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)
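
Passing `stratify=y` keeps the class ratio the same in the train and test parts. The idea can be sketched with the standard library alone; `stratified_split` and the toy labels are hypothetical and only illustrate the principle, they are not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy data: 8 posts per class (0 = csMajors, 1 = codingbootcamp)
toy_labels = [0] * 8 + [1] * 8
train_idx, test_idx = stratified_split(toy_labels)
test_labels = [toy_labels[i] for i in test_idx]
print(len(train_idx), len(test_idx), test_labels.count(0), test_labels.count(1))
```

Because the same share is drawn from every class, a balanced dataset yields a balanced test set.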

## Vectorizing Data

### Count Vectorizer

In [6]:
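# English stopwords from NLTK (Natural Language Toolkit).
# First use requires nltk.download('stopwords') and nltk.download('wordnet').
stopword = nltk.corpus.stopwords.words('english')

# add more stopwords; "" drops the empty tokens that re.split can leave at the ends of a string
add_stopword = stopword + ["want","one","anyone","like","im","get","would","got"] + ['really','also','ive','know','dont','go','much','lot','take','think','even','getting','back',""]

# Lemmatizer
wn = nltk.WordNetLemmatizer()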

In [7]:
# function to remove punctuation, tokenize, remove stopwords and lemmatize
def clean_text(text):
    # lowercase and drop punctuation, character by character
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # \W matches any non-word character (equivalent to [^a-zA-Z0-9_] for ASCII text), which includes whitespace.
    # The + treats a run of such characters (e.g. 2 or more spaces) as a single separator.
    tokens = re.split(r'\W+', text)
    
    # apply lemmatization and stopword exclusion within the same step
    text = [wn.lemmatize(word) for word in tokens if word not in add_stopword]
    return text
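
To see what each step of `clean_text` does, here is a minimal walk-through that needs no NLTK. It assumes a shortened, hypothetical stopword set (`toy_stopwords`) and leaves out lemmatization, so it is a sketch of the cleaning logic rather than the function used in this notebook.

```python
import re
import string

# Hypothetical, shortened stopword set; the notebook uses NLTK's English list plus extras.
toy_stopwords = {"im", "at", "a", ""}

def clean_text_sketch(text):
    """Lowercase, strip punctuation, split on non-word characters, drop stopwords."""
    text = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", text)
    return [tok for tok in tokens if tok not in toy_stopwords]

raw = "I'm looking at a coding bootcamp!! "
no_punct = "".join(ch.lower() for ch in raw if ch not in string.punctuation)
raw_tokens = re.split(r"\W+", no_punct)  # the trailing space leaves an empty last token
cleaned = clean_text_sketch(raw)

print(raw_tokens)
print(cleaned)
```

The empty string in the stopword list matters: without it, text that starts or ends with whitespace would contribute an empty token to the vocabulary.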

## Modeling

### Setting up pipelines

#### Pipeline Count Vectorizer + Bernoulli Naive Bayes + Data Preprocessing

In [8]:
# Setting up pipeline with 2 stages:
# 1. CountVectorizer (transformer) + clean text
# 2. Bernoulli Naive Bayes (estimator)

pipe_cv_ber_cln = Pipeline([
    ("cv", CountVectorizer(analyzer = clean_text)),
    ('ber', BernoulliNB())
])
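
When `analyzer` is a callable, `CountVectorizer` uses it in place of its own preprocessing, tokenization and n-gram steps, so options such as `lowercase`, `stop_words` and `token_pattern` have no effect in this pipeline. What remains is counting tokens per document. A minimal standard-library sketch of that counting step, with a hypothetical `count_vectorize` helper and toy documents (it returns a dense matrix, whereas scikit-learn returns a sparse one):

```python
from collections import Counter

def count_vectorize(docs, analyzer):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    tokenized = [analyzer(doc) for doc in docs]
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, n in Counter(tokens).items():
            row[column[term]] = n
        matrix.append(row)
    return vocabulary, matrix

toy_docs = ["bootcamp bootcamp tuition", "internship resume", "resume bootcamp"]
vocabulary, matrix = count_vectorize(toy_docs, analyzer=str.split)
print(vocabulary)
for row in matrix:
    print(row)
```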

In [9]:
# checking the params for gridsearch
pipe_cv_ber_cln.get_params()

{'memory': None,
 'steps': [('cv',
   CountVectorizer(analyzer=<function clean_text at 0x000002B0F4228F70>)),
  ('ber', BernoulliNB())],
 'verbose': False,
 'cv': CountVectorizer(analyzer=<function clean_text at 0x000002B0F4228F70>),
 'ber': BernoulliNB(),
 'cv__analyzer': <function __main__.clean_text(text)>,
 'cv__binary': False,
 'cv__decode_error': 'strict',
 'cv__dtype': numpy.int64,
 'cv__encoding': 'utf-8',
 'cv__input': 'content',
 'cv__lowercase': True,
 'cv__max_df': 1.0,
 'cv__max_features': None,
 'cv__min_df': 1,
 'cv__ngram_range': (1, 1),
 'cv__preprocessor': None,
 'cv__stop_words': None,
 'cv__strip_accents': None,
 'cv__token_pattern': '(?u)\\b\\w\\w+\\b',
 'cv__tokenizer': None,
 'cv__vocabulary': None,
 'ber__alpha': 1.0,
 'ber__binarize': 0.0,
 'ber__class_prior': None,
 'ber__fit_prior': True}

In [10]:
#hyperparameter tuning
params_cv_ber = {
    'cv__max_features': [3_000],
    'cv__min_df': [5],
    'cv__max_df': [0.4],
    'ber__alpha': [0],
    'ber__binarize' : [0]
}
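
`ber__binarize` is the threshold above which a count is treated as "word present", and `ber__alpha` is the additive smoothing term. The sketch below assumes the usual smoothed Bernoulli estimate, (documents with word + alpha) / (documents in class + 2 * alpha), and uses hypothetical counts to show what `alpha = 0` implies for a word never seen in a class.

```python
import math

def bernoulli_feature_prob(docs_with_word, docs_in_class, alpha):
    """Smoothed estimate of P(word present | class) for Bernoulli Naive Bayes."""
    return (docs_with_word + alpha) / (docs_in_class + 2 * alpha)

def binarize(counts, threshold=0.0):
    """Map counts to presence/absence: 1 if count > threshold, else 0."""
    return [1 if c > threshold else 0 for c in counts]

# Hypothetical counts: a word seen in 0 of 10 posts from one class.
p_unsmoothed = bernoulli_feature_prob(0, 10, alpha=0)
p_smoothed = bernoulli_feature_prob(0, 10, alpha=1)
presence = binarize([2, 0, 0, 1])

print(presence)
print(p_unsmoothed, p_smoothed)
print(math.log(p_smoothed))  # finite; math.log(p_unsmoothed) would raise ValueError
```

With no smoothing, a single unseen word drives the class likelihood to zero, so it may be worth adding a few non-zero `alpha` values to the grid to check whether the cross-validated score changes.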

In [11]:
# Instantiate GridSearchCV
gs_cv_ber_cln = GridSearchCV(pipe_cv_ber_cln, # what object are we optimizing?
                  param_grid=params_cv_ber, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [12]:
# Fit GridSearch to training data.
gs_cv_ber_cln.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(analyzer=<function clean_text at 0x000002B0F4228F70>)),
                                       ('ber', BernoulliNB())]),
             param_grid={'ber__alpha': [0], 'ber__binarize': [0],
                         'cv__max_df': [0.4], 'cv__max_features': [3000],
                         'cv__min_df': [5]})

In [13]:
# best params: 
gs_cv_ber_cln.best_params_

{'ber__alpha': 0,
 'ber__binarize': 0,
 'cv__max_df': 0.4,
 'cv__max_features': 3000,
 'cv__min_df': 5}

In [14]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_cv_ber_cln.best_score_}')

# Score model on training set.
print(f'score on train set: {gs_cv_ber_cln.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_cv_ber_cln.score(X_test, y_test)}')

best cross validation score: 0.8274108519665052
score on train set: 0.8461127437884051
score on test set: 0.8421474358974359



In [15]:
y_pred_cv_ber_cln = gs_cv_ber_cln.predict(X_test)
print(classification_report(y_test, y_pred_cv_ber_cln))

              precision    recall  f1-score   support

           0       0.79      0.92      0.85      1248
           1       0.91      0.76      0.83      1248

    accuracy                           0.84      2496
   macro avg       0.85      0.84      0.84      2496
weighted avg       0.85      0.84      0.84      2496
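
The precision, recall and f1-score columns follow from counts of true positives, false positives and false negatives for each class. A minimal sketch on hypothetical toy labels (not this notebook's predictions); `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1]  # one class-1 post predicted as class 0

metrics_class_0 = per_class_metrics(toy_true, toy_pred, positive=0)
metrics_class_1 = per_class_metrics(toy_true, toy_pred, positive=1)
print(metrics_class_0)
print(metrics_class_1)
```

The toy case shows the same pattern as the report: a class-1 post predicted as class 0 lowers recall for class 1 and lowers precision for class 0.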



#### Pipeline Count Vectorizer + Multinomial Naive Bayes + Data Preprocessing

In [16]:
# Setting up pipeline with 2 stages:
# 1. CountVectorizer (transformer) + clean text
# 2. Multinomial Naive Bayes (estimator)

pipe_cv_mnb_cln = Pipeline([
    ("cv", CountVectorizer(analyzer = clean_text)),
    ('mnb', MultinomialNB())
])
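
Unlike the Bernoulli variant, which only looks at whether a word is present, Multinomial Naive Bayes uses how often each word occurs. A compact from-scratch sketch on toy tokenized posts, assuming Laplace smoothing with `alpha = 1`; `train_multinomial_nb` and `predict` are hypothetical helpers and not scikit-learn's implementation.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log P(word | class) from tokenized docs."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocabulary}
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; out-of-vocabulary words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy tokenized posts: 0 = csMajors, 1 = codingbootcamp
toy_docs = [["internship", "resume"], ["internship", "gpa"],
            ["bootcamp", "tuition"], ["bootcamp", "career", "change"]]
toy_labels = [0, 0, 1, 1]
log_prior, log_likelihood = train_multinomial_nb(toy_docs, toy_labels)

pred_a = predict(["bootcamp", "bootcamp", "resume"], log_prior, log_likelihood)
pred_b = predict(["internship", "gpa"], log_prior, log_likelihood)
print(pred_a, pred_b)
```

In the first toy post the repeated word "bootcamp" contributes twice to the score, which is the count information the Bernoulli model discards.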

In [17]:
#hyperparameter tuning
params_cv_mnb = {
    'cv__max_features': [5_000, 7_000, 9_000, 11_000,13_000],
    'cv__min_df': [3],
    'cv__max_df': [.4]
}
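
The three vectorizer settings in this grid prune the vocabulary: `min_df` drops terms found in too few documents, `max_df` drops terms found in too large a share of documents, and `max_features` keeps only the most frequent remaining terms. A minimal standard-library sketch of that logic on toy data; `prune_vocabulary` is a hypothetical helper, and its alphabetical tie-breaking is an assumption that may differ from scikit-learn.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0, max_features=None):
    """Keep terms whose document frequency is in range, then the most frequent ones.

    min_df is an absolute document count and max_df a proportion of documents.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    term_freq = Counter(tok for doc in tokenized_docs for tok in doc)

    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_df * n_docs]
    # Rank by total corpus frequency; ties broken alphabetically for a stable result.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

toy_tokens = [
    ["help", "bootcamp", "tuition"],
    ["help", "bootcamp", "job"],
    ["help", "internship", "resume"],
    ["help", "internship", "gpa"],
    ["help", "resume", "resume", "offer"],
]
by_doc_freq = prune_vocabulary(toy_tokens, min_df=2, max_df=0.4)
top_two = prune_vocabulary(toy_tokens, min_df=2, max_df=0.4, max_features=2)
print(by_doc_freq)
print(top_two)
```

Here "help" appears in every post and is removed by `max_df`, while words that occur in a single post are removed by `min_df`.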

In [18]:
# Instantiate GridSearchCV
gs_cv_mnb_cln = GridSearchCV(pipe_cv_mnb_cln, # what object are we optimizing?
                  param_grid=params_cv_mnb, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.
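
A grid search fits one pipeline per parameter combination per fold. The sketch below enumerates the combinations for a grid with the same values as `params_cv_mnb` and counts the fits, assuming the default behaviour of refitting the best candidate once on the whole training set; `params_sketch` is an illustrative copy, not the notebook's variable.

```python
from itertools import product

params_sketch = {
    "cv__max_features": [5_000, 7_000, 9_000, 11_000, 13_000],
    "cv__min_df": [3],
    "cv__max_df": [0.4],
}
n_folds = 5

names = sorted(params_sketch)
candidates = [
    dict(zip(names, values))
    for values in product(*(params_sketch[name] for name in names))
]
n_fits = len(candidates) * n_folds + 1  # +1 for the final refit on the full training set

print(len(candidates), n_fits)
print(candidates[0])
```

Since `clean_text` runs inside every fit, widening the grid multiplies the preprocessing cost as well as the model training cost.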

In [19]:
# Fit GridSearch to training data.
gs_cv_mnb_cln.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(analyzer=<function clean_text at 0x000002B0F4228F70>)),
                                       ('mnb', MultinomialNB())]),
             param_grid={'cv__max_df': [0.4],
                         'cv__max_features': [5000, 7000, 9000, 11000, 13000],
                         'cv__min_df': [3]})

In [20]:
# best params:
gs_cv_mnb_cln.best_params_

{'cv__max_df': 0.4, 'cv__max_features': 5000, 'cv__min_df': 3}

In [21]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_cv_mnb_cln.best_score_}')

# Score model on training set.
print(f'score on train set: {gs_cv_mnb_cln.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_cv_mnb_cln.score(X_test, y_test)}')

best cross validation score: 0.9122355971399854
score on train set: 0.9321399946566925
score on test set: 0.9314903846153846



In [22]:
y_pred_cv_mnb = gs_cv_mnb_cln.predict(X_test)
print(classification_report(y_test, y_pred_cv_mnb))

              precision    recall  f1-score   support

           0       0.95      0.91      0.93      1248
           1       0.92      0.95      0.93      1248

    accuracy                           0.93      2496
   macro avg       0.93      0.93      0.93      2496
weighted avg       0.93      0.93      0.93      2496



#### Pipeline Count Vectorizer + Logistic Regression + Data Preprocessing

In [23]:
# Setting up pipeline with 2 stages:
# 1. CountVectorizer (transformer) + clean text
# 2. Logistic Regression (estimator)

pipe_cv_lg = Pipeline([
    ("cv", CountVectorizer(analyzer = clean_text)),
    ('lg', LogisticRegression(solver='liblinear'))
])
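
On count features, logistic regression scores a post as a weighted sum of its word counts and passes that sum through a sigmoid to get a probability for class 1 (codingbootcamp). A minimal sketch with hypothetical weights; these are not coefficients learned in this notebook.

```python
import math

def predict_proba(counts, weights, bias=0.0):
    """P(class 1) = sigmoid(bias + sum of weight * count over the words in the post)."""
    z = bias + sum(weights.get(term, 0.0) * n for term, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive pushes toward class 1, negative toward class 0.
toy_weights = {"bootcamp": 1.5, "internship": -1.2}

p_bootcamp_post = predict_proba({"bootcamp": 2, "resume": 1}, toy_weights)
p_major_post = predict_proba({"internship": 1, "resume": 1}, toy_weights)
print(round(p_bootcamp_post, 4), round(p_major_post, 4))
```

Words without a weight, such as "resume" here, contribute nothing to the score.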

In [24]:
#hyperparameter tuning
params_cv_lg = {
    'cv__max_features': [7_000],
    'cv__min_df': [0],
    'cv__max_df': [0.4]
}

In [25]:
# Instantiate GridSearchCV
gs_cv_lg = GridSearchCV(pipe_cv_lg, # what object are we optimizing?
                  param_grid=params_cv_lg, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [26]:
# Fit GridSearch to training data.
gs_cv_lg.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(analyzer=<function clean_text at 0x000002B0F4228F70>)),
                                       ('lg',
                                        LogisticRegression(solver='liblinear'))]),
             param_grid={'cv__max_df': [0.4], 'cv__max_features': [7000],
                         'cv__min_df': [0]})

In [27]:
# best params:
gs_cv_lg.best_params_

{'cv__max_df': 0.4, 'cv__max_features': 7000, 'cv__min_df': 0}

In [28]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_cv_lg.best_score_}') 

# Score model on training set.
print(f'score on train set: {gs_cv_lg.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_cv_lg.score(X_test, y_test)}')

best cross validation score: 0.9249257750035005
score on train set: 0.986775313919316
score on test set: 0.9342948717948718



In [29]:
y_pred_cv_lg = gs_cv_lg.predict(X_test)
print(classification_report(y_test, y_pred_cv_lg))

              precision    recall  f1-score   support

           0       0.94      0.92      0.93      1248
           1       0.93      0.94      0.93      1248

    accuracy                           0.93      2496
   macro avg       0.93      0.93      0.93      2496
weighted avg       0.93      0.93      0.93      2496



#### Pipeline Count Vectorizer + KNN + Data Preprocessing

In [30]:
# Setting up pipeline with 2 stages:
# 1. CountVectorizer (transformer) + clean text
# 2. KNN (estimator)

pipe_cv_knn = Pipeline([
    ("cv", CountVectorizer(analyzer = clean_text)),
    ('knn', KNeighborsClassifier())
])

In [31]:
#hyperparameter tuning
params_cv_knn = {
    'cv__max_features': [16_000],
    'cv__min_df': [0],
    'cv__max_df': [0.4],
    'knn__n_neighbors': [3]
}
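
With `knn__n_neighbors` set to 3, each post is labelled by a majority vote among its three closest training posts in count-vector space. A minimal sketch on toy vectors, assuming Euclidean distance; `knn_predict` is a hypothetical helper and the columns stand for made-up word counts.

```python
import math
from collections import Counter

def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors closest in Euclidean distance."""
    distances = sorted(
        (math.dist(query, vec), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy count vectors with columns [bootcamp, internship, resume]
toy_vectors = [[2, 0, 0], [1, 0, 1], [0, 2, 1], [0, 1, 1], [0, 1, 0]]
toy_labels = [1, 1, 0, 0, 0]

pred_bootcamp = knn_predict([1, 0, 0], toy_vectors, toy_labels, k=3)
pred_major = knn_predict([0, 1, 0], toy_vectors, toy_labels, k=3)
print(pred_bootcamp, pred_major)
```

With thousands of sparse count features, distances between posts tend to be less informative than in this three-column example, which may be one reason KNN trails the other models here.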

In [32]:
# Instantiate GridSearchCV
gs_cv_knn = GridSearchCV(pipe_cv_knn, # what object are we optimizing?
                  param_grid=params_cv_knn, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [33]:
# Fit GridSearch to training data.
gs_cv_knn.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cv',
                                        CountVectorizer(analyzer=<function clean_text at 0x000002B0F4228F70>)),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'cv__max_df': [0.4], 'cv__max_features': [16000],
                         'cv__min_df': [0], 'knn__n_neighbors': [3]})

In [34]:
# best params:
gs_cv_knn.best_params_

{'cv__max_df': 0.4,
 'cv__max_features': 16000,
 'cv__min_df': 0,
 'knn__n_neighbors': 3}

In [35]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_cv_knn.best_score_}') 

# Score model on training set.
print(f'score on train set: {gs_cv_knn.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_cv_knn.score(X_test, y_test)}')

best cross validation score: 0.8254079141817237
score on train set: 0.9048891263692226
score on test set: 0.8369391025641025



In [36]:
y_pred_cv_knn = gs_cv_knn.predict(X_test)
print(classification_report(y_test, y_pred_cv_knn))

              precision    recall  f1-score   support

           0       0.92      0.74      0.82      1248
           1       0.78      0.93      0.85      1248

    accuracy                           0.84      2496
   macro avg       0.85      0.84      0.84      2496
weighted avg       0.85      0.84      0.84      2496



In [37]:
%notify


## Summary on Model Accuracy

|**Vectorization Method**|**Model**|**Train Accuracy**|**Test Accuracy**|
|:---|:---|:---:|:---:|
|Count Vectorizer|Naive Bayes - Bernoulli|0.84611|0.84215|
|Count Vectorizer|Naive Bayes - Multinomial|0.93214|0.93149|
|Count Vectorizer|Logistic Regression|0.98678|0.93429|
|Count Vectorizer|K-Nearest Neighbours|0.90489|0.83694|
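
The table can be read two ways: which models clear the 90% test-accuracy target from the problem statement, and how far each model's train score sits above its test score (a rough sign of overfitting). A short sketch that computes both from the figures in the table; the `results` dictionary is simply those figures re-typed.

```python
# (train accuracy, test accuracy) copied from the summary table
results = {
    "Naive Bayes - Bernoulli": (0.84611, 0.84215),
    "Naive Bayes - Multinomial": (0.93214, 0.93149),
    "Logistic Regression": (0.98678, 0.93429),
    "K-Nearest Neighbours": (0.90489, 0.83694),
}

gaps = {model: round(train - test, 5) for model, (train, test) in results.items()}
meets_target = [model for model, (_, test) in results.items() if test > 0.90]

for model, gap in gaps.items():
    print(f"{model}: train-test gap = {gap}")
print("Test accuracy above 0.90:", meets_target)
```

Multinomial Naive Bayes and Logistic Regression both pass the target; of the two, Multinomial Naive Bayes has a much smaller train-test gap.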