<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 3: Web APIs & NLP - TF-IDF

## Problem Statement

The current classification model, which relies on straightforward keywords such as ‘bootcamp’ and ‘coding’, yields around 79% accuracy.

Build a model with more than 90% accuracy that distinguishes people looking for bootcamp-style learning from computer science majors and prospective students, based on the words they use online.

## Jupyter Notebook Purpose
#### This notebook contains code for:
- TF-IDF data preprocessing.
- GridSearchCV (model optimization) on selected models: Bernoulli NB, Multinomial NB, KNN, and Logistic Regression.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import nltk
import re
import jupyternotify
import string
from sklearn import metrics

from sklearn.model_selection import cross_val_predict, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression 

from sklearn.feature_extraction.text import TfidfVectorizer

#Ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
%load_ext jupyternotify

## Import Data

In [3]:
#dataset for model training
data = pd.read_csv('../project_3/datasets/data_2.csv')

## Data Dictionary:

**Dataset name: data_2**
- Size: 9982 observations, 2 variables
- Description: Final dataset that contains data scraped from Reddit

|Feature|Type|Dataset|Description|
|:---|:---|:---|:---|
|**subreddit**|*integer*|data|Subreddit categories. 0 refers to csMajors, 1 refers to codingbootcamp| 
|**text**|*string*|data|Posts extracted from the csMajors and codingbootcamp subreddits|

## Setting Up Data For Modeling

### Train Test Split

In [4]:
X = data["text"]
y = data["subreddit"]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)
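
The `stratify=y` argument keeps the class proportions the same in the train and test portions. A minimal sketch of that behaviour, assuming a hypothetical balanced label array (`y_demo`, `X_demo` are illustrative names, not the Reddit data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels standing in for the `subreddit` column.
y_demo = np.array([0] * 100 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# With stratification, the share of class 1 is preserved in both splits.
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_tr), len(y_te), train_share, test_share)
```

The same idea explains the equal `support` values per class in the classification reports further down.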

## Vectorizing Data

### TF-IDF

In [6]:
# English stopwords from NLTK (Natural Language Toolkit)
stopword = nltk.corpus.stopwords.words('english')
# Add corpus-specific stopwords; the empty string drops empty tokens produced by re.split
add_stopword = stopword + ["want","one","anyone","like","im","get","would","got",
                           'really','also','ive','know','dont','go','much','lot','take','think','even','getting','back',""]

# Lemmatizer
wn = nltk.WordNetLemmatizer()

In [7]:
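def clean_text(text):
    # Lowercase the text and drop punctuation characters
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Remove stopwords, then lemmatize the remaining tokens
    return [wn.lemmatize(word) for word in tokens if word not in add_stopword]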
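
To make the cleaning steps concrete, here is a dependency-free sketch of the same sequence (lowercase, strip punctuation, split, drop stopwords). It is an illustration under assumptions: `demo_stopwords` is a small hypothetical stand-in for the NLTK list, and lemmatization is left out so the block runs without downloading NLTK data.

```python
import re
import string

# Hypothetical stand-in for the NLTK stopword list used in the notebook.
demo_stopwords = {"i", "a", "to", "the", "for", "want", "im", ""}

def clean_text_demo(text):
    # Lowercase and drop punctuation characters.
    text = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters, then drop stopwords.
    return [tok for tok in re.split(r"\W+", text) if tok not in demo_stopwords]

tokens = clean_text_demo("I'm looking for a coding bootcamp, any advice?")
print(tokens)
```

Note that removing the apostrophe turns "I'm" into "im", which is why forms such as "im", "ive" and "dont" appear in the extra stopword list.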

## Modeling

### Setting up pipelines

#### TF-IDF + Bernoulli Naive Bayes + Data Preprocessing

In [8]:
# Setting up pipeline with 2 stages:
# 1. TF-IDF + clean text
# 2. Bernoulli Naive Bayes (estimator)

pipe_tf_ber = Pipeline([
    ("tf", TfidfVectorizer(analyzer=clean_text)),
    ('ber', BernoulliNB())
])

In [9]:
pipe_tf_ber.get_params()

{'memory': None,
 'steps': [('tf',
   TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>)),
  ('ber', BernoulliNB())],
 'verbose': False,
 'tf': TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>),
 'ber': BernoulliNB(),
 'tf__analyzer': <function __main__.clean_text(text)>,
 'tf__binary': False,
 'tf__decode_error': 'strict',
 'tf__dtype': numpy.float64,
 'tf__encoding': 'utf-8',
 'tf__input': 'content',
 'tf__lowercase': True,
 'tf__max_df': 1.0,
 'tf__max_features': None,
 'tf__min_df': 1,
 'tf__ngram_range': (1, 1),
 'tf__norm': 'l2',
 'tf__preprocessor': None,
 'tf__smooth_idf': True,
 'tf__stop_words': None,
 'tf__strip_accents': None,
 'tf__sublinear_tf': False,
 'tf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tf__tokenizer': None,
 'tf__use_idf': True,
 'tf__vocabulary': None,
 'ber__alpha': 1.0,
 'ber__binarize': 0.0,
 'ber__class_prior': None,
 'ber__fit_prior': True}
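
The vectorizer defaults listed above (`tf__use_idf: True`, `tf__smooth_idf: True`, `tf__sublinear_tf: False`, `tf__norm: 'l2'`) correspond to the weighting scikit-learn documents for `TfidfVectorizer`: raw term count multiplied by `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. A minimal pure-Python sketch of that arithmetic, assuming a hypothetical three-document toy corpus rather than the Reddit data:

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy documents.
docs = [
    ["bootcamp", "job"],
    ["internship", "job"],
    ["bootcamp", "tuition", "bootcamp"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    # L2-normalize so each document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(docs[1])
print(vec)
```

In this toy corpus "internship" occurs in one document and "job" in two, so "internship" receives the larger weight in the second document.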

In [10]:
# Hyperparameter tuning
params_tf_ber = {
    'tf__min_df': [0,1],  # 0 and 1 keep the same vocabulary: every term occurs in at least one document
    'tf__max_df': [0.4],
    'ber__alpha': [0.3],
    'ber__binarize' : [0.1]
}
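
A grid search tries every combination of the listed values, and each combination is fitted once per cross-validation fold. The sketch below counts the work implied by this grid with `itertools.product`; it is only an illustration of the bookkeeping, not how `GridSearchCV` is implemented internally.

```python
from itertools import product

params_tf_ber = {
    'tf__min_df': [0, 1],
    'tf__max_df': [0.4],
    'ber__alpha': [0.3],
    'ber__binarize': [0.1],
}

# Cartesian product of all value lists gives the candidate settings.
keys = sorted(params_tf_ber)
candidates = [dict(zip(keys, values))
              for values in product(*(params_tf_ber[k] for k in keys))]

n_folds = 5
n_fits = len(candidates) * n_folds  # cross-validation fits, before the final refit

print(len(candidates), n_fits)
for candidate in candidates:
    print(candidate)
```

Here only `tf__min_df` has more than one value, so there are 2 candidates and 10 cross-validation fits; with the default `refit=True`, the best candidate is then fitted once more on the whole training set.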

In [11]:
# Instantiate GridSearchCV
gs_tf_ber = GridSearchCV(pipe_tf_ber, # what object are we optimizing?
                  param_grid=params_tf_ber, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [12]:
# Fit GridSearch to training data.
gs_tf_ber.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf',
                                        TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>)),
                                       ('ber', BernoulliNB())]),
             param_grid={'ber__alpha': [0.3], 'ber__binarize': [0.1],
                         'tf__max_df': [0.4], 'tf__min_df': [0, 1]})

In [13]:
# best params
gs_tf_ber.best_params_

{'ber__alpha': 0.3, 'ber__binarize': 0.1, 'tf__max_df': 0.4, 'tf__min_df': 0}
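
`BernoulliNB` models features as present or absent, so its `binarize` threshold converts each TF-IDF weight into 0 or 1 before fitting: values strictly greater than the threshold become 1. A small sketch with hypothetical weights for a single post:

```python
# Hypothetical TF-IDF weights for one post (not taken from the dataset).
tfidf_row = [0.0, 0.05, 0.1, 0.42]
threshold = 0.1  # the ber__binarize value selected above

# Values strictly above the threshold count as "term present".
binary_row = [1 if weight > threshold else 0 for weight in tfidf_row]
print(binary_row)
```

With a threshold of 0.1, low-weight terms are treated as absent, which differs from the default of 0.0 where any non-zero weight counts as present.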

In [14]:
# Best score: mean cross-validated score of the best estimator (cv=5)
print(f'best cross validation score: {gs_tf_ber.best_score_}')

# Score model on training set.
print(f'score on train set: {gs_tf_ber.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_tf_ber.score(X_test, y_test)}')

best cross validation score: 0.9051567309072974
score on train set: 0.9569863745658562
score on test set: 0.9254807692307693


- best cross validation score: 0.9051567309072974
- score on train set: 0.9569863745658562
- score on test set: 0.9254807692307693

#### TF-IDF + Multinomial Naive Bayes + Data Preprocessing

In [15]:
# Setting up pipeline with 2 stages:
# 1. TF-IDF + clean text
# 2. Multinomial Naive Bayes (estimator)

pipe_tf_mnb = Pipeline([
    ("tf", TfidfVectorizer(analyzer=clean_text)),
    ('mnb', MultinomialNB())
])

In [16]:
pipe_tf_mnb.get_params()

{'memory': None,
 'steps': [('tf',
   TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>)),
  ('mnb', MultinomialNB())],
 'verbose': False,
 'tf': TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>),
 'mnb': MultinomialNB(),
 'tf__analyzer': <function __main__.clean_text(text)>,
 'tf__binary': False,
 'tf__decode_error': 'strict',
 'tf__dtype': numpy.float64,
 'tf__encoding': 'utf-8',
 'tf__input': 'content',
 'tf__lowercase': True,
 'tf__max_df': 1.0,
 'tf__max_features': None,
 'tf__min_df': 1,
 'tf__ngram_range': (1, 1),
 'tf__norm': 'l2',
 'tf__preprocessor': None,
 'tf__smooth_idf': True,
 'tf__stop_words': None,
 'tf__strip_accents': None,
 'tf__sublinear_tf': False,
 'tf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tf__tokenizer': None,
 'tf__use_idf': True,
 'tf__vocabulary': None,
 'mnb__alpha': 1.0,
 'mnb__class_prior': None,
 'mnb__fit_prior': True}

In [17]:
#hyperparameter tuning
params_tf_mnb = {
    'tf__min_df': [1],
    'tf__max_df': [0.6],
    'mnb__alpha': [0.3]
}

In [18]:
# Instantiate GridSearchCV
gs_tf_mnb = GridSearchCV(pipe_tf_mnb, # what object are we optimizing?
                  param_grid=params_tf_mnb, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [19]:
# Fit GridSearch to training data.
gs_tf_mnb.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf',
                                        TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>)),
                                       ('mnb', MultinomialNB())]),
             param_grid={'mnb__alpha': [0.3], 'tf__max_df': [0.6],
                         'tf__min_df': [1]})

In [20]:
# best params:
gs_tf_mnb.best_params_

{'mnb__alpha': 0.3, 'tf__max_df': 0.6, 'tf__min_df': 1}

In [21]:
# Best score: mean cross-validated score of the best estimator (cv=5)
print(f'best cross validation score: {gs_tf_mnb.best_score_}')

# Score model on training set.
print(f'score on train set: {gs_tf_mnb.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_tf_mnb.score(X_test, y_test)}')

best cross validation score: 0.9095654593566305
score on train set: 0.9543147208121827
score on test set: 0.9274839743589743


- best cross validation score: 0.9095654593566305
- score on train set: 0.9543147208121827
- score on test set: 0.9274839743589743

In [22]:
y_pred_tf_mnb = gs_tf_mnb.predict(X_test)
print(classification_report(y_test, y_pred_tf_mnb))

              precision    recall  f1-score   support

           0       0.96      0.90      0.93      1248
           1       0.90      0.96      0.93      1248

    accuracy                           0.93      2496
   macro avg       0.93      0.93      0.93      2496
weighted avg       0.93      0.93      0.93      2496
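
In the report above, a recall of 0.90 for class 0 means roughly one in ten csMajors posts was labelled as codingbootcamp, while the precision of 0.90 for class 1 reflects those same mislabelled posts. The sketch below shows how precision, recall and F1 are computed from predictions, using a hypothetical ten-item example rather than the notebook's actual predictions:

```python
# Hypothetical labels and predictions, for illustration only.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)  # of the items predicted positive, how many were right
    recall = tp / (tp + fn)     # of the truly positive items, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label in (0, 1):
    print(label, precision_recall_f1(y_true_demo, y_pred_demo, label))
```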



#### TF-IDF + Logistic Regression + Data Preprocessing

In [23]:
# Setting up pipeline with 2 stages:
# 1. TF-IDF (transformer) + clean text
# 2. Logistic Regression (estimator)

pipe_tf_lg = Pipeline([
    ("tf", TfidfVectorizer(analyzer=clean_text)),
    ('lg', LogisticRegression())
])

In [24]:
pipe_tf_lg.get_params()

{'memory': None,
 'steps': [('tf',
   TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>)),
  ('lg', LogisticRegression())],
 'verbose': False,
 'tf': TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>),
 'lg': LogisticRegression(),
 'tf__analyzer': <function __main__.clean_text(text)>,
 'tf__binary': False,
 'tf__decode_error': 'strict',
 'tf__dtype': numpy.float64,
 'tf__encoding': 'utf-8',
 'tf__input': 'content',
 'tf__lowercase': True,
 'tf__max_df': 1.0,
 'tf__max_features': None,
 'tf__min_df': 1,
 'tf__ngram_range': (1, 1),
 'tf__norm': 'l2',
 'tf__preprocessor': None,
 'tf__smooth_idf': True,
 'tf__stop_words': None,
 'tf__strip_accents': None,
 'tf__sublinear_tf': False,
 'tf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tf__tokenizer': None,
 'tf__use_idf': True,
 'tf__vocabulary': None,
 'lg__C': 1.0,
 'lg__class_weight': None,
 'lg__dual': False,
 'lg__fit_intercept': True,
 'lg__intercept_scaling': 1,
 'lg__l1_ratio': None,
 'lg__max_iter': 1

In [25]:
#hyperparameter tuning
params_tf_lg = {
    'tf__max_features': [13_000],
    'tf__min_df': [1],
    'tf__max_df': [0.4],
    'lg__solver': ["saga"]
}

In [26]:
# Instantiate GridSearchCV
gs_tf_lg = GridSearchCV(pipe_tf_lg, # what object are we optimizing?
                  param_grid=params_tf_lg, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [27]:
# Fit GridSearch to training data.
gs_tf_lg.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf',
                                        TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>)),
                                       ('lg', LogisticRegression())]),
             param_grid={'lg__solver': ['saga'], 'tf__max_df': [0.4],
                         'tf__max_features': [13000], 'tf__min_df': [1]})

In [28]:
# best params:
gs_tf_lg.best_params_

{'lg__solver': 'saga',
 'tf__max_df': 0.4,
 'tf__max_features': 13000,
 'tf__min_df': 1}

In [29]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_tf_lg.best_score_}') 

# Score model on training set.
print(f'score on train set: {gs_tf_lg.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_tf_lg.score(X_test, y_test)}')

best cross validation score: 0.9275977856915434
score on train set: 0.962463264760887
score on test set: 0.9423076923076923


- best cross validation score: 0.9275977856915434
- score on train set: 0.962463264760887
- score on test set: 0.9423076923076923

In [30]:
y_pred_tf_lg = gs_tf_lg.predict(X_test)
print(classification_report(y_test, y_pred_tf_lg))

              precision    recall  f1-score   support

           0       0.95      0.93      0.94      1248
           1       0.93      0.95      0.94      1248

    accuracy                           0.94      2496
   macro avg       0.94      0.94      0.94      2496
weighted avg       0.94      0.94      0.94      2496



#### TF-IDF + KNN + Data Preprocessing

In [31]:
# Setting up pipeline with 2 stages:
# 1. TF-IDF (transformer) + clean text
# 2. KNN (estimator)

pipe_tf_knn = Pipeline([
    ("tf", TfidfVectorizer(analyzer=clean_text)),
    ('knn', KNeighborsClassifier())
])

In [32]:
pipe_tf_knn.get_params()

{'memory': None,
 'steps': [('tf',
   TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>)),
  ('knn', KNeighborsClassifier())],
 'verbose': False,
 'tf': TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>),
 'knn': KNeighborsClassifier(),
 'tf__analyzer': <function __main__.clean_text(text)>,
 'tf__binary': False,
 'tf__decode_error': 'strict',
 'tf__dtype': numpy.float64,
 'tf__encoding': 'utf-8',
 'tf__input': 'content',
 'tf__lowercase': True,
 'tf__max_df': 1.0,
 'tf__max_features': None,
 'tf__min_df': 1,
 'tf__ngram_range': (1, 1),
 'tf__norm': 'l2',
 'tf__preprocessor': None,
 'tf__smooth_idf': True,
 'tf__stop_words': None,
 'tf__strip_accents': None,
 'tf__sublinear_tf': False,
 'tf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tf__tokenizer': None,
 'tf__use_idf': True,
 'tf__vocabulary': None,
 'knn__algorithm': 'auto',
 'knn__leaf_size': 30,
 'knn__metric': 'minkowski',
 'knn__metric_params': None,
 'knn__n_jobs': None,
 'knn__n_neighbors': 5,


In [33]:
#hyperparameter tuning
params_tf_knn = {
    'tf__max_features': [13_000],
    'tf__min_df': [0],
    'tf__max_df': [0.4],
    'knn__n_neighbors': [19]
}

In [34]:
# Instantiate GridSearchCV
gs_tf_knn = GridSearchCV(pipe_tf_knn, # what object are we optimizing?
                  param_grid=params_tf_knn, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [35]:
# Fit GridSearch to training data.
gs_tf_knn.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tf',
                                        TfidfVectorizer(analyzer=<function clean_text at 0x000001C454A1BEE0>)),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'knn__n_neighbors': [19], 'tf__max_df': [0.4],
                         'tf__max_features': [13000], 'tf__min_df': [0]})

In [36]:
# best params:
gs_tf_knn.best_params_

{'knn__n_neighbors': 19,
 'tf__max_df': 0.4,
 'tf__max_features': 13000,
 'tf__min_df': 0}

In [37]:
# Best score: Mean cross-validated score of the best_estimator (we did a cv = 5)
print(f'best cross validation score: {gs_tf_knn.best_score_}') 

# Score model on training set.
print(f'score on train set: {gs_tf_knn.score(X_train, y_train)}')

# Score model on testing set.
print(f'score on test set: {gs_tf_knn.score(X_test, y_test)}')

best cross validation score: 0.8939358913643932
score on train set: 0.8676195565054768
score on test set: 0.8725961538461539


- best cross validation score: 0.8939358913643932
- score on train set: 0.8676195565054768
- score on test set: 0.8725961538461539

In [38]:
y_pred_tf_knn = gs_tf_knn.predict(X_test)
print(classification_report(y_test, y_pred_tf_knn))

              precision    recall  f1-score   support

           0       0.98      0.76      0.86      1248
           1       0.81      0.98      0.89      1248

    accuracy                           0.87      2496
   macro avg       0.89      0.87      0.87      2496
weighted avg       0.89      0.87      0.87      2496



In [39]:
%notify

## Summary on Model Accuracy

|**Vectorization Method**|**Model**|**Train Accuracy**|**Test Accuracy**|
|:---|:---|:---:|:---:|
|TF-IDF|Naive Bayes - Bernoulli|0.95699|0.92548|
|TF-IDF|Naive Bayes - Multinomial|0.95431|0.92748|
|TF-IDF|Logistic Regression|0.96246|0.94231|
|TF-IDF|K-Nearest Neighbours|0.86762|0.87260|
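
One way to read these results is to compare each model's train and test accuracy and check it against the 90% target from the problem statement. A short sketch, using the printed scores from the cells above rounded to five decimals (the dictionary layout and variable names are illustrative):

```python
# (train accuracy, test accuracy) taken from the printed scores, rounded to 5 decimals.
results = {
    "Naive Bayes - Bernoulli": (0.95699, 0.92548),
    "Naive Bayes - Multinomial": (0.95431, 0.92748),
    "Logistic Regression": (0.96246, 0.94231),
    "K-Nearest Neighbours": (0.86762, 0.87260),
}

# A positive gap means the model scores higher on data it was trained on.
gaps = {name: round(train - test, 5) for name, (train, test) in results.items()}
best_model = max(results, key=lambda name: results[name][1])
meets_target = [name for name, (_, test) in results.items() if test > 0.90]

print(gaps)
print("highest test accuracy:", best_model)
print("above 90% on the test set:", meets_target)
```

On these numbers, Logistic Regression has the highest test accuracy, and every model except K-Nearest Neighbours clears the 90% target on the test set.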