# Project 3: Web APIs & NLP
### Notebook 3: Pre-processing, modelling, assessments and conclusions

**Objective of Notebook**: Pre-process the cleaned data, build and evaluate models, then present the related findings, conclusions and recommendations.

### Notebook organization
1. [Introduction and scraping](./P03_01_data_collection.ipynb) (Generate raw datasets)
2. [Data Cleaning and Exploratory Data Analysis (EDA)](./P03_02_data_cleaning_and_eda.ipynb) (Generate cleaned datasets)
3. Pre-processing, modelling, assessments and conclusions <br> (**CURRENT**: Pre-process, model, evaluate and summarize conclusions & recommendations)

### Methodology: Modelling
We now have cleaned data (from Notebook 2) saved in `./data/combined_data.csv`.

We don't yet know which classifier will work best, nor which vectorizer suits it better.

Hence, we pair each classifier with both `CountVectorizer` and `TfidfVectorizer`, using the stop words imported from `sklearn`: `ENGLISH_STOP_WORDS`.

We test the models iteratively and tune them with a focus on `f1_score`.

Between iterations, we review instances of misclassification and tune the model further.
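As a hedged illustration of tuning with a focus on F1: `GridSearchCV` ranks candidates with the estimator's default scorer (accuracy for classifiers) unless `scoring` is set. The sketch below passes `scoring="f1"` on a small made-up corpus; `toy_texts`, `toy_labels`, `toy_pipe`, `toy_params` and `toy_grid` are hypothetical names and data, not the project's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = positive class, 0 = negative class); not the project data
toy_texts = [
    "got an interview after my application",
    "recruiter sent an email about the offer",
    "hiring manager asked for references",
    "applied last week and got an interview",
    "which career path should i choose",
    "is a computer science masters worth it",
    "need guidance on a finance career path",
    "thinking about teaching or data science",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline(steps=[("tvec", TfidfVectorizer()), ("nb", MultinomialNB())])
toy_params = {"nb__alpha": [0.5, 1, 3]}

# scoring="f1" makes the search rank candidates by F1 instead of accuracy
toy_grid = GridSearchCV(toy_pipe, toy_params, scoring="f1", cv=2)
toy_grid.fit(toy_texts, toy_labels)

print("Best F1 (cross-validated):", round(toy_grid.best_score_, 3))
print("Best params:", toy_grid.best_params_)
```

With `scoring` set this way, `best_score_` is a mean cross-validated F1 rather than a mean accuracy.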

#### In this notebook:
1. [Imports](#Imports)
1. [Train-test-split and Metrics](#Train-test-split-and-Metrics)
1. [Vectorizers and Models Used](#Vectorizers-and-Models-Used)
1. [Modelling and Hyperparameter Tuning](#Modelling-and-Hyperparameter-Tuning)
1. [Model Evaluation](#Model-Evaluation)
1. [Conclusions and recommendations](#Conclusions-and-Recommendations)

#### Imports

In [1]:
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve

In [2]:
# manage warnings
import warnings
warnings.filterwarnings('ignore')
# warnings.filterwarnings(action='once')

In [3]:
# read in data
data = pd.read_csv('./data/combined_data.csv', index_col=0)
data

Unnamed: 0,title,selftext,subreddit_jobs,combined_text
0,Got a offer at one of the first jobs I applied...,"30 years old, BA degree, lot of sales work his...",1,got a offer at one of the first i applied too...
1,How does educational history show up on a back...,Most of the time my resume will be discarded d...,1,how does educational history show up on a back...
2,I was about to quit then I got a bonus and a r...,"So long story short, been with my company for ...",1,i was about to quit then i got a bonus and a r...
3,What jobs do I qualify for with these skills?,Currently a rising sophomore in college so no ...,1,what do i qualify for with these skills? curr...
4,Age requirement?,How old do I have to be to get a job? Any job ...,1,age requirement? how old do i have to be to ge...
...,...,...,...,...
4701,I'm in a favorable negotiating position for th...,"I'm in a position that is OK, making about $80...",0,im in a favorable negotiating position for the...
4702,How to go back to college after years…need advice,[removed],0,how to go back to college after years…need adv...
4703,Is this a good career change resume summary?,"""Technical Sales Engineer with 2+ years of exp...",0,is this a good change resume summary? technic...
4704,"How Do Remote Jobs ""Work""?",So as someone who lives in a really small town...,0,how do remote work? so as someone who lives i...


[Back to top](#Notebook-organization)

### Train-test-split and Metrics

#### Train-test-split

In [4]:
# Define features X and target y
X = data['combined_text']
y = data['subreddit_jobs']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42,
                                                    shuffle=True,
                                                    stratify=y)

In [6]:
display(
    X_train.shape,
    y_train.shape,
    X_test.shape,
    y_test.shape
)

(3294,)

(3294,)

(1412,)

(1412,)

#### Metrics

In [7]:
# Define scores function
def get_scores(y_test, y_pred, rounding_dp=3):
    """Print accuracy, F1, precision, recall and ROC-AUC for hard class predictions."""
    print(f'Accuracy: {round(accuracy_score(y_test, y_pred), rounding_dp)}')
    print(f'F1: {round(f1_score(y_test, y_pred), rounding_dp)}')
    print(f'Precision: {round(precision_score(y_test, y_pred), rounding_dp)}')
    print(f'Recall: {round(recall_score(y_test, y_pred), rounding_dp)}')
    # ROC-AUC is computed from hard labels (not probabilities) here
    print(f'ROC-AUC: {round(roc_auc_score(y_test, y_pred), rounding_dp)}')
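A note on the ROC-AUC line: `get_scores` passes hard class predictions to `roc_auc_score`. For a binary problem this gives the mean of the true positive and true negative rates, i.e. balanced accuracy. A threshold-independent ROC-AUC needs scores such as the positive-class column of `predict_proba`. The sketch below shows the difference on made-up numbers; `y_true`, `proba_pos` and `hard_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical labels and positive-class probabilities (not project outputs)
y_true = np.array([0, 0, 0, 1, 1, 1])
proba_pos = np.array([0.10, 0.40, 0.60, 0.45, 0.70, 0.90])

# Hard predictions at the usual 0.5 threshold
hard_pred = (proba_pos >= 0.5).astype(int)

# ROC-AUC from hard labels collapses to balanced accuracy
auc_from_labels = roc_auc_score(y_true, hard_pred)
bal_acc = balanced_accuracy_score(y_true, hard_pred)

# ROC-AUC from probabilities uses the full ranking of the samples
auc_from_proba = roc_auc_score(y_true, proba_pos)

print(round(auc_from_labels, 3), round(bal_acc, 3), round(auc_from_proba, 3))
```

With a fitted pipeline, the equivalent input would be something like `model.predict_proba(X_test)[:, 1]`, where `model` stands for whichever fitted estimator is being scored.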

[Back to top](#Notebook-organization)

### Vectorizers and Models Used
1. Baseline model, DummyClassifier (dummyc)
1. CountVectorizer (cvec)
1. TfidfVectorizer (tvec)
1. Multinomial Naive Bayes (nb)
1. Random Forest (rf)
1. GridSearchCV (gs; for tuning)

For vectorizers (cvec, tvec):
- We default to a fixed set of stop words from `sklearn`
- Tuning needed for:
    - `ngram_range`,
    - `min_df` (ignore terms that appear in too few documents),
    - `max_df` (ignore terms that appear in too many documents), and
    - `max_features` (keep only the most frequent terms).

For classifiers (nb, rf):
- `nb` will need tuning for
    - `alpha`
- `rf` will need tuning for
    - `max_depth`
    - `n_estimators`
    - `max_leaf_nodes`
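To make the effect of `min_df` and `max_df` concrete, here is a minimal sketch on a made-up four-document corpus. `toy_docs`, `vec` and `kept` are hypothetical names, and the thresholds are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; not the project data
toy_docs = [
    "job interview tomorrow",
    "job offer email",
    "job career path",
    "career path advice",
]

# min_df=2   -> drop terms found in fewer than 2 documents
# max_df=0.5 -> drop terms found in more than 50% of documents
vec = CountVectorizer(min_df=2, max_df=0.5)
vec.fit(toy_docs)

# "job" is in 3 of 4 documents (too common); "career" and "path" are in 2 (kept);
# every other term appears only once (too rare)
kept = sorted(vec.vocabulary_)
print(kept)
```

An integer value is read as a document count and a float as a proportion of documents, which is why the grids in this notebook use floats such as `0.01` and `0.5`.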

In [8]:
%%time

# Dummy classifier (baseline): always predicts the most frequent class
dummyc = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
y_pred = dummyc.predict(X_test)
y_pred_train = dummyc.predict(X_train)

print('Baseline model: Train scores', '\n')
get_scores(y_train, y_pred_train)
print()
print('Baseline model:Test scores', '\n')
get_scores(y_test, y_pred)

Baseline model: Train scores 

Accuracy: 0.515
F1: 0.0
Precision: 0.0
Recall: 0.0
ROC-AUC: 0.5

Baseline model:Test scores 

Accuracy: 0.516
F1: 0.0
Precision: 0.0
Recall: 0.0
ROC-AUC: 0.5
Wall time: 20 ms


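The dummy classifier always predicts the majority class (0), so it makes no positive predictions and precision, recall and F1 are all 0. In this case `sklearn` raises an `UndefinedMetricWarning`, which can be ignored here.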
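As a sketch of how that warning could be handled explicitly rather than filtered, `precision_score` and `f1_score` accept a `zero_division` argument that sets the value returned when there are no positive predictions. The arrays below are made up; `X_toy`, `y_toy`, `dummy` and `pred` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, precision_score

# Hypothetical data: the features are irrelevant to a most-frequent baseline
X_toy = np.zeros((5, 1))
y_toy = np.array([0, 0, 0, 1, 1])

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = dummy.predict(X_toy)  # every prediction is the majority class, 0

# zero_division=0 returns 0 silently instead of warning about the undefined metric
precision = precision_score(y_toy, pred, zero_division=0)
f1 = f1_score(y_toy, pred, zero_division=0)

print(pred, precision, f1)
```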

In [9]:
# y.value_counts(normalize=True)

In [10]:
# Instantiate
# pass the stop words as a list: recent scikit-learn versions reject a frozenset here
cvec = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
tvec = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
nb = MultinomialNB()
rf = RandomForestClassifier()

[Back to top](#Notebook-organization)

### Modelling and Hyperparameter Tuning

Tuning is done iteratively; for simplicity, we do not include all iterations in this notebook. Given more time, the models could be tuned further and additional models tested (e.g. logistic regression).

The image below shows one instance where tuning took over an hour!

![Long Tuning Time!](./images/long_tuning_time.png)
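To see why tuning can be slow, it helps to count the fits a grid implies. The sketch below applies `ParameterGrid` to the random forest grid used for `cvec-rf` in this notebook and assumes the default 5-fold cross-validation of `GridSearchCV`; `rf_params`, `n_candidates`, `n_folds` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the cvec-rf search in this notebook
rf_params = {
    'cvec__ngram_range': ((1, 1), (1, 3)),
    'cvec__min_df': [0.01, 0.1, 0.2],
    'cvec__max_df': [0.1, 0.5, 0.9],
    'cvec__max_features': [None, 5, 100],
    'rf__max_depth': [1, 5, 10, 15, 20],
    'rf__n_estimators': [1, 2, 8, 16, 64, 100],
}

# Number of parameter combinations: 2 * 3 * 3 * 3 * 5 * 6
n_candidates = len(ParameterGrid(rf_params))

# Assumption: GridSearchCV is left at its default of 5 cross-validation folds
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Each extra value in any one list multiplies the total, so trimming a list or sampling the grid (for example with `RandomizedSearchCV`) is one way to shorten the run.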

#### Model cvec-rf

In [11]:
%%time
# tune for best_params_cvec_rf

# set up pipe
pipe = Pipeline(
    steps=[
        ('cvec', cvec),
        ('rf', rf)
    ]
)

# set up gridsearch
params = {
    'cvec__ngram_range': ((1,1), (1,3)),
    'cvec__min_df': [0.01, 0.1, 0.2],
    'cvec__max_df': [0.1, 0.5, 0.9],
    'cvec__max_features': [None, 5, 100],
    'rf__max_depth': [1, 5, 10, 15, 20],
    'rf__n_estimators': [1, 2, 8, 16, 64, 100]
}

grid_cvec_rf = GridSearchCV(pipe, params, n_jobs=-1)

# execute
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipe.steps])
print("parameters:")
print(params)
grid_cvec_rf.fit(X_train, y_train)
print()
print("Best score: %0.3f" % grid_cvec_rf.best_score_)
print("Best parameters set:")
best_params_cvec_rf = grid_cvec_rf.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_params_cvec_rf[param_name]))
    
# Predict and score
y_pred_train = grid_cvec_rf.predict(X=X_train)
y_pred = grid_cvec_rf.predict(X=X_test)

print()
print('-----Train scores-----')
get_scores(y_train, y_pred_train)
print()
print('-----Validation scores-----')
get_scores(y_test, y_pred)

Performing grid search...
pipeline: ['cvec', 'rf']
parameters:
{'cvec__ngram_range': ((1, 1), (1, 3)), 'cvec__min_df': [0.01, 0.1, 0.2], 'cvec__max_df': [0.1, 0.5, 0.9], 'cvec__max_features': [None, 5, 100], 'rf__max_depth': [1, 5, 10, 15, 20], 'rf__n_estimators': [1, 2, 8, 16, 64, 100]}

Best score: 0.660
Best parameters set:
	cvec__max_df: 0.5
	cvec__max_features: None
	cvec__min_df: 0.01
	cvec__ngram_range: (1, 1)
	rf__max_depth: 20
	rf__n_estimators: 64

-----Train scores-----
Accuracy: 0.862
F1: 0.842
Precision: 0.943
Recall: 0.76
ROC-AUC: 0.859

-----Validation scores-----
Accuracy: 0.667
F1: 0.629
Precision: 0.683
Recall: 0.583
ROC-AUC: 0.665
Wall time: 20min 46s


In [12]:
# read features
df_cvec_rf_features = pd.DataFrame({
    'feature names': grid_cvec_rf.best_estimator_['cvec'].get_feature_names_out(),
    'feature importances': grid_cvec_rf.best_estimator_['rf'].feature_importances_
})

# display top 10 features
df_cvec_rf_features.sort_values(by='feature importances', ascending=False).head(10)

Unnamed: 0,feature names,feature importances
383,interview,0.019979
548,path,0.017012
341,hiring,0.012209
206,email,0.011349
29,application,0.0102
847,weeks,0.007973
858,work,0.007561
673,science,0.007522
31,applied,0.007355
299,got,0.007331


In [13]:
# prep df for misclassified lines in training set
# do not prep for test set as model should not "see" that data
df_misclassified_cvec_rf = pd.DataFrame({
    'text': X_train,
    'predicted': grid_cvec_rf.predict(X_train),
    'true': y_train
})

In [14]:
# view misclassified lines only
df_misclassified_cvec_rf[
    df_misclassified_cvec_rf['predicted'] != df_misclassified_cvec_rf['true']
]

Unnamed: 0,text,predicted,true
1455,asking for a raise,0,1
725,"for uiux designers this is for the future, but...",0,1
752,expecting rejection is good of mental health e...,0,1
741,anyone get nervous when starting a new ? anxie...,0,1
1421,"m any ideas? i am looking for a new , so look...",0,1
...,...,...,...
859,so you feel like “excellent communication skil...,0,1
1240,upcoming govt,0,1
3074,applied for an internal position in my same de...,1,0
3973,your opinion on boss who constantly brags abou...,1,0


#### Model tvec-rf

In [15]:
%%time
# tune for best_params_tvec_rf
pipe = Pipeline(
    steps=[
        ('tvec', tvec),
        ('rf', rf)
    ]
)

params = {
    'tvec__ngram_range': ((1,1), (1,3)),
    'tvec__min_df': [0.01, 0.1, 0.2],
    'tvec__max_df': [0.1, 0.5, 0.9],
    'tvec__max_features': [None, 5, 100],
    'rf__max_depth': [1, 5, 10, 15, 20],
    'rf__n_estimators': [1, 2, 8, 16, 64, 100]
}

grid_tvec_rf = GridSearchCV(pipe, params, n_jobs=-1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipe.steps])
print("parameters:")
print(params)
# t0 = time()
grid_tvec_rf.fit(X_train, y_train)
# print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_tvec_rf.best_score_)
print("Best parameters set:")
best_params_tvec_rf = grid_tvec_rf.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_params_tvec_rf[param_name]))

# Predict and score
y_pred_train = grid_tvec_rf.predict(X=X_train)
y_pred = grid_tvec_rf.predict(X=X_test)

print()
print('-----Train scores-----')
get_scores(y_train, y_pred_train)
print()
print('-----Validation scores-----')
get_scores(y_test, y_pred)

Performing grid search...
pipeline: ['tvec', 'rf']
parameters:
{'tvec__ngram_range': ((1, 1), (1, 3)), 'tvec__min_df': [0.01, 0.1, 0.2], 'tvec__max_df': [0.1, 0.5, 0.9], 'tvec__max_features': [None, 5, 100], 'rf__max_depth': [1, 5, 10, 15, 20], 'rf__n_estimators': [1, 2, 8, 16, 64, 100]}

Best score: 0.662
Best parameters set:
	rf__max_depth: 15
	rf__n_estimators: 100
	tvec__max_df: 0.5
	tvec__max_features: None
	tvec__min_df: 0.01
	tvec__ngram_range: (1, 1)

-----Train scores-----
Accuracy: 0.853
F1: 0.828
Precision: 0.96
Recall: 0.728
ROC-AUC: 0.849

-----Validation scores-----
Accuracy: 0.666
F1: 0.617
Precision: 0.695
Recall: 0.556
ROC-AUC: 0.663
Wall time: 20min 43s


In [16]:
# read features
df_tvec_rf_features = pd.DataFrame({
    'feature names': grid_tvec_rf.best_estimator_['tvec'].get_feature_names_out(),
    'feature importances': grid_tvec_rf.best_estimator_['rf'].feature_importances_
})

# display top 10 features
df_tvec_rf_features.sort_values(by='feature importances', ascending=False).head(10)

Unnamed: 0,feature names,feature importances
383,interview,0.034256
548,path,0.019541
341,hiring,0.01496
29,application,0.012297
299,got,0.012207
206,email,0.011157
673,science,0.010867
845,week,0.010184
662,said,0.009849
13,advice,0.009744


In [17]:
# prep df for misclassified lines in training set
# do not prep for test set as model should not "see" that data
df_misclassified_tvec_rf = pd.DataFrame({
    'text': X_train,
    'predicted': grid_tvec_rf.predict(X_train),
    'true': y_train
})

In [18]:
# view misclassified lines only
df_misclassified_tvec_rf[
    df_misclassified_tvec_rf['predicted'] != df_misclassified_tvec_rf['true']
]

Unnamed: 0,text,predicted,true
1455,asking for a raise,0,1
725,"for uiux designers this is for the future, but...",0,1
752,expecting rejection is good of mental health e...,0,1
741,anyone get nervous when starting a new ? anxie...,0,1
1122,received a offer - times what i am currently ...,0,1
...,...,...,...
1695,i feel stuck,0,1
859,so you feel like “excellent communication skil...,0,1
1240,upcoming govt,0,1
156,how to tell the difference between a head hunt...,0,1


#### Model cvec-nb

In [19]:
%%time
# tune for best_params_cvec_nb
pipe = Pipeline(
    steps=[
        ('cvec', cvec),
        ('nb', nb)
    ]
)

params = {
    'cvec__ngram_range': ((1,1), (1,3)),
    'cvec__min_df': [0.01, 0.1, 0.2],
    'cvec__max_df': [0.1, 0.5, 0.9],
    'cvec__max_features': [None, 5, 100],
    'nb__alpha': [1, 2, 3, 5]
}

grid_cvec_nb = GridSearchCV(pipe, params, n_jobs=-1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipe.steps])
print("parameters:")
print(params)
# t0 = time()
grid_cvec_nb.fit(X_train, y_train)
# print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_cvec_nb.best_score_)
print("Best parameters set:")
best_params_cvec_nb = grid_cvec_nb.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_params_cvec_nb[param_name]))

# Predict and score
y_pred_train = grid_cvec_nb.predict(X=X_train)
y_pred = grid_cvec_nb.predict(X=X_test)

print()
print('-----Train scores-----')
get_scores(y_train, y_pred_train)
print()
print('-----Validation scores-----')
get_scores(y_test, y_pred)

Performing grid search...
pipeline: ['cvec', 'nb']
parameters:
{'cvec__ngram_range': ((1, 1), (1, 3)), 'cvec__min_df': [0.01, 0.1, 0.2], 'cvec__max_df': [0.1, 0.5, 0.9], 'cvec__max_features': [None, 5, 100], 'nb__alpha': [1, 2, 3, 5]}

Best score: 0.672
Best parameters set:
	cvec__max_df: 0.5
	cvec__max_features: None
	cvec__min_df: 0.01
	cvec__ngram_range: (1, 3)
	nb__alpha: 5

-----Train scores-----
Accuracy: 0.705
F1: 0.691
Precision: 0.702
Recall: 0.681
ROC-AUC: 0.705

-----Validation scores-----
Accuracy: 0.684
F1: 0.67
Precision: 0.679
Recall: 0.661
ROC-AUC: 0.683
Wall time: 2min 32s


In [20]:
# prep df for features analysis
df_cvec_nb_features = pd.DataFrame({
    'feature names': grid_cvec_nb.best_estimator_['cvec'].get_feature_names_out(),
    'feature_log_prob_0': grid_cvec_nb.best_estimator_['nb'].feature_log_prob_[0],
    'feature_log_prob_1': grid_cvec_nb.best_estimator_['nb'].feature_log_prob_[1]
})

# add log_prob_diff for ranking
df_cvec_nb_features['log_prob_diff'] = abs(
    df_cvec_nb_features['feature_log_prob_1'] - df_cvec_nb_features['feature_log_prob_0']
)

# view df
df_cvec_nb_features.sort_values(
    by='log_prob_diff', ascending=False
).head(20)

Unnamed: 0,feature names,feature_log_prob_0,feature_log_prob_1,log_prob_diff
609,path,-6.272593,-8.313492,2.040899
102,choose,-6.973288,-8.507648,1.53436
282,finance,-6.547822,-8.055663,1.507841
654,programming,-7.291741,-8.74881,1.457068
700,references,-8.390354,-6.95705,1.433303
199,direction,-7.525356,-8.949481,1.424124
606,passion,-7.291741,-8.661798,1.370057
738,science,-6.139062,-7.509119,1.370057
807,study,-6.732126,-8.102183,1.370057
829,teaching,-6.768493,-8.102183,1.333689


In [21]:
# prep df for misclassified lines in training set
# do not prep for test set as model should not "see" that data
df_misclassified_cvec_nb = pd.DataFrame({
    'text': X_train,
    'predicted': grid_cvec_nb.predict(X_train),
    'true': y_train
})

# view misclassified lines only
df_misclassified_cvec_nb[
    df_misclassified_cvec_nb['predicted'] != df_misclassified_cvec_nb['true']
]

Unnamed: 0,text,predicted,true
735,"out of curiosity, what types of can i apply f...",0,1
2423,struggling to find meaning and drive office en...,1,0
3676,how to convince recruiter to transfer me to ny...,1,0
725,"for uiux designers this is for the future, but...",0,1
2625,"new requires travel - concerns? hola, so i j...",1,0
...,...,...,...
1955,"i start my first post-graduation - tomorrow, ...",0,1
1240,upcoming govt,0,1
3074,applied for an internal position in my same de...,1,0
477,"if i enlist in the reserves, is my place of em...",0,1


#### Model tvec-nb

In [22]:
%%time

# tune for best_params_tvec_nb
pipe = Pipeline(
    steps=[
        ('tvec', tvec),
        ('nb', nb)
    ]
)

params = {
    'tvec__ngram_range': ((1,1), (1,3)),
    'tvec__min_df': [0.01, 0.1, 0.2],
    'tvec__max_df': [0.1, 0.5, 0.9],
    'tvec__max_features': [None, 5, 100],
    'nb__alpha': [1, 2, 3, 5]
}

grid = GridSearchCV(pipe, params, n_jobs=-1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipe.steps])
print("parameters:")
print(params)
# t0 = time()
grid.fit(X_train, y_train)
# print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid.best_score_)
print("Best parameters set:")
best_params_tvec_nb = grid.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_params_tvec_nb[param_name]))

# Predict and score
y_pred_train = grid.predict(X=X_train)
y_pred = grid.predict(X=X_test)

print()
print('-----Train scores-----')
get_scores(y_train, y_pred_train)

print()
print('-----Validation scores-----')
get_scores(y_test, y_pred)

Performing grid search...
pipeline: ['tvec', 'nb']
parameters:
{'tvec__ngram_range': ((1, 1), (1, 3)), 'tvec__min_df': [0.01, 0.1, 0.2], 'tvec__max_df': [0.1, 0.5, 0.9], 'tvec__max_features': [None, 5, 100], 'nb__alpha': [1, 2, 3, 5]}

Best score: 0.678
Best parameters set:
	nb__alpha: 3
	tvec__max_df: 0.5
	tvec__max_features: None
	tvec__min_df: 0.01
	tvec__ngram_range: (1, 3)

-----Train scores-----
Accuracy: 0.729
F1: 0.714
Precision: 0.73
Recall: 0.698
ROC-AUC: 0.728

-----Validation scores-----
Accuracy: 0.689
F1: 0.674
Precision: 0.685
Recall: 0.664
ROC-AUC: 0.688
Wall time: 2min 29s


In [23]:
# prep df for misclassified lines in training set
# do not prep for test set as model should not "see" that data
df_misclassified_tvec_nb = pd.DataFrame({
    'text': X_train,
    'predicted': grid.predict(X_train),
    'true': y_train
})

# view misclassified lines only
df_misclassified_tvec_nb[
    df_misclassified_tvec_nb['predicted'] != df_misclassified_tvec_nb['true']
].head()

Unnamed: 0,text,predicted,true
735,"out of curiosity, what types of can i apply f...",0,1
3676,how to convince recruiter to transfer me to ny...,1,0
725,"for uiux designers this is for the future, but...",0,1
2625,"new requires travel - concerns? hola, so i j...",1,0
741,anyone get nervous when starting a new ? anxie...,0,1


In [24]:
# prep df for features analysis
df_tvec_nb_features = pd.DataFrame({
    'feature names': grid.best_estimator_['tvec'].get_feature_names_out(),
    'feature_log_prob_0': grid.best_estimator_['nb'].feature_log_prob_[0],
    'feature_log_prob_1': grid.best_estimator_['nb'].feature_log_prob_[1]
})

# add log_prob_diff for ranking
df_tvec_nb_features['log_prob_diff'] = abs(
    df_tvec_nb_features['feature_log_prob_1'] - df_tvec_nb_features['feature_log_prob_0']
)

# view df
df_tvec_nb_features.sort_values(
    by='log_prob_diff', ascending=False
).head(10)

Unnamed: 0,feature names,feature_log_prob_0,feature_log_prob_1,log_prob_diff
609,path,-6.077097,-7.743109,1.666012
102,choose,-6.431733,-7.675785,1.244052
282,finance,-6.394366,-7.539469,1.145103
224,email,-7.120348,-5.9971,1.123248
738,science,-5.986337,-7.096724,1.110387
31,application,-7.310175,-6.208768,1.101407
128,computer science,-6.565098,-7.620112,1.055013
337,guidance,-6.490284,-7.543298,1.053014
521,masters,-6.179064,-7.231109,1.052045
700,references,-7.716201,-6.688933,1.027269


[Back to top](#Notebook-organization)

### Model Evaluation

#### Metrics

|No.|Predictor|Model|Train accuracy|Validation accuracy|Train F1 Score|Validation F1 Score|Wall Time|
|-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|1|combined_text|DummyClassifier (Most frequent)|0.515|0.516|n.a.|n.a.|<1s|
|2|combined_text|RandomForest  (cvec)|0.862|0.667|0.842|0.629|20min 46s|
|3|combined_text|RandomForest  (tvec)|0.853|0.666|0.828|0.617|20min 43s|
|4|combined_text|MultinomialNB (cvec)|0.705|0.684|0.691|0.670|2min 32s|
|5|combined_text|<font color='blue'><b>MultinomialNB (tvec)</b></font>|0.729|0.689|0.714|<font color='blue'><b>0.674</b></font>|<font color='blue'><b>2min 29s</b></font>|

### Conclusions and Recommendations

**Model selected**: Multinomial Naive Bayes with tf-idf vectorizer

```
Best parameters set:
	nb__alpha: 3
	tvec__max_df: 0.5
	tvec__max_features: None
	tvec__min_df: 0.01
	tvec__ngram_range: (1, 3)
```

The model scores show that while the random forest models achieve high train scores (F1 of 0.842 and 0.828), their validation scores on the test corpus are much lower (F1 of 0.629 and 0.617), meaning they likely over-fit and do not generalize well.

They also take far longer to tune and run, at roughly 20 minutes each.

On the other hand, our Multinomial Naive Bayes models show lower train scores but higher validation F1 scores (0.670 and 0.674), so they generalize better, and they take less than 3 minutes to run.

For our use case, we plan to demo this feature within our internal team first, and deploy to production after augmenting with new findings. As such, a lower tuning run time is preferred for now.

#### Further steps
- Preprocessing: try lemmatization and augment stop words
- Review misclassification iteratively and add relevant steps in preprocessing to improve model scores
- Try other models and vectorizers
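As one possible starting point for the stop-word step above, the built-in list can be extended with corpus-specific tokens before it is passed to a vectorizer. This is a minimal sketch: the extra words in `extra_stop_words` and the two sample sentences are hypothetical, and the right additions would come from reviewing the misclassified posts.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical additions, e.g. contractions left without apostrophes after cleaning
extra_stop_words = {"im", "ive", "dont"}

# ENGLISH_STOP_WORDS is a frozenset; union it, then convert to a list for the vectorizer
custom_stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

tvec_custom = TfidfVectorizer(stop_words=custom_stop_words)
tvec_custom.fit([
    "im looking for a new job",
    "dont know which career path to choose",
])

# The added stop words no longer appear in the vocabulary
vocab = sorted(tvec_custom.vocabulary_)
print(vocab)
```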

[Back to top](#Notebook-organization)