# Model Selection

In this notebook, we'll evaluate several models and select the one best suited to our use case. Throughout the analysis, we'll keep in mind our problem statement: helping our client better tailor advertisements to prospective customers.

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import pickle

from pandas import json_normalize
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             precision_score, recall_score, f1_score)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.svm import SVC, LinearSVC


In [2]:
with open('pickles/df_text.pkl', 'rb') as f:
    df_text = pickle.load(f)

## Model Testing

Now that we have cleaned, processed, and normalized the data, let's model it using several different classification algorithms. We'll create a pipeline for each model and use GridSearchCV to tune the hyperparameters, then select the best-performing model for our data.

#### PreProcessing

To use the sentiment analysis data that was collected in the previous notebook, I'll set up a preprocessor that passes the sentiment columns through unchanged. I'll pass this preprocessor into my model pipeline so that TF-IDF is applied only to the text column and the sentiment scores are left untouched.

In [3]:
# Data variables
text_col = df_text['text']
sentiment_cols = df_text[['neg', 'neu', 'pos', 'compound']]

In [4]:
# Preprocessors
text_preprocessor = TfidfVectorizer(max_features=1000, stop_words='english',use_idf=True)
sentiment_preprocessor = 'passthrough'

In [5]:
#define the column transformer to apply different preprocessing to each column
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_preprocessor, 'text'),
        ('sentiment', sentiment_preprocessor, ['neg', 'neu', 'pos', 'compound'])
    ])
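As a quick check of what this preprocessor produces, here is a minimal sketch on a hypothetical three-row frame (the rows and the `toy_*` names are made up for illustration; only the column names come from this notebook). It suggests that the output is the TF-IDF columns followed by the four sentiment columns, unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy frame that reuses the column names of df_text
toy = pd.DataFrame({
    'text': ['great phone, love it', 'terrible battery life', 'love the battery'],
    'neg': [0.0, 0.6, 0.0],
    'neu': [0.4, 0.4, 0.5],
    'pos': [0.6, 0.0, 0.5],
    'compound': [0.8, -0.5, 0.6],
})
sentiment_names = ['neg', 'neu', 'pos', 'compound']

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('text', TfidfVectorizer(), 'text'),
        ('sentiment', 'passthrough', sentiment_names),
    ])

features = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density, so normalize it
dense = features.toarray() if hasattr(features, 'toarray') else np.asarray(features)

n_vocab = len(toy_preprocessor.named_transformers_['text'].vocabulary_)
print(dense.shape)    # (3, n_vocab + 4)
print(dense[:, -4:])  # the sentiment columns, unchanged
```

The transformers are applied in the order they are listed, which is why the sentiment columns end up last.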

### Naive Bayes

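The first pipeline will consist of a Bernoulli Naive Bayes classifier with a TF-IDF vectorizer for the text data. We're using Bernoulli NB because it accepts negative numbers as input, which Multinomial NB rejects. The sentiment analysis data is normalized on a -1 to 1 scale, so this is an important feature of this model. Keep in mind that `BernoulliNB` binarizes every feature at a threshold (`binarize=0.0` by default), so the model only sees whether each value is above zero.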
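To illustrate the difference, here is a small sketch on hypothetical toy values (not the notebook's data). `MultinomialNB` raises a `ValueError` on negative input, while `BernoulliNB` accepts it and thresholds each value at zero.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical single feature on a -1 to 1 scale, like a compound sentiment score
X_toy = np.array([[-0.8], [-0.3], [0.4], [0.9]])
y_toy = np.array([0, 0, 1, 1])

# MultinomialNB models counts, so negative inputs are rejected
try:
    MultinomialNB().fit(X_toy, y_toy)
    multinomial_error = None
except ValueError as err:
    multinomial_error = str(err)

# BernoulliNB thresholds each feature at `binarize` (0.0 by default),
# so it only uses whether a value is above zero
bnb = BernoulliNB().fit(X_toy, y_toy)
toy_preds = bnb.predict(np.array([[-0.5], [0.5]]))

print(multinomial_error)
print(toy_preds)  # [0 1]
```

Because of the binarization, the magnitude of a sentiment score is discarded. Only its sign is used.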

In [6]:
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('nb', BernoulliNB())
])

In [7]:
# define the hyperparameter grid to search over
params = {
    'preprocessor__text__max_features': [5000, 10000, 15000],
    'preprocessor__text__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'nb__alpha': [0.1, 0.5, 1.0],
    'nb__fit_prior': [True, False],
}

# define the grid search object
gs = GridSearchCV(pipe, params, cv=5)
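To get a feel for the cost of this search before running it, the grid can be expanded with `ParameterGrid`. The sketch below repeats the grid from the cell above under a new name (`nb_grid`) so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Naive Bayes grid defined above
nb_grid = {
    'preprocessor__text__max_features': [5000, 10000, 15000],
    'preprocessor__text__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'nb__alpha': [0.1, 0.5, 1.0],
    'nb__fit_prior': [True, False],
}

n_candidates = len(ParameterGrid(nb_grid))  # 3 * 3 * 3 * 2 combinations
n_fits = n_candidates * 5                   # one fit per candidate per CV fold
print(n_candidates, n_fits)  # 54 270
```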

In [8]:
X = df_text[['text', 'neg', 'neu', 'pos', 'compound']]
y = df_text['subreddit']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
y_test_nb = y_test

In [10]:
# fit the model
gs.fit(X_train, y_train)

#### Refitting Naive Bayes Pipe with Best Parameters

We can now apply the best parameters found during our gridsearch to our Bernoulli Naive Bayes model.

In [11]:
# Get best hyperparameters
best_params_nb = gs.best_params_
pipe.set_params(**best_params_nb)

# Refit pipeline on entire training set
pipe.fit(X_train, y_train)
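As a side note, `GridSearchCV` refits the best candidate on the full training set by default (`refit=True`), so `best_estimator_` should behave the same as the manually refit pipeline. A minimal sketch on hypothetical toy data (the `demo_*` names and the `make_classification` data are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the notebook's features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipe = Pipeline([('nb', BernoulliNB())])
demo_gs = GridSearchCV(demo_pipe, {'nb__alpha': [0.1, 0.5, 1.0]}, cv=5)
demo_gs.fit(X_tr, y_tr)

# refit=True (the default) has already retrained the best candidate on all of X_tr
preds_from_search = demo_gs.best_estimator_.predict(X_te)

# Manual refit, mirroring the set_params + fit pattern used above
demo_pipe.set_params(**demo_gs.best_params_)
demo_pipe.fit(X_tr, y_tr)
preds_from_manual = demo_pipe.predict(X_te)

print(np.array_equal(preds_from_search, preds_from_manual))
```

For a deterministic model like Naive Bayes the two sets of predictions match. For a model with randomness, such as an unseeded random forest, they may differ slightly.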

#### Generating Naive Bayes Model Predictions

In [12]:
y_pred_nb = pipe.predict(X_test)
y_score_nb = pipe.predict_proba(X_test)[:, 1]
accuracy = pipe.score(X_test, y_test)
precision = precision_score(y_test, y_pred_nb)
recall = recall_score(y_test, y_pred_nb)
f1 = f1_score(y_test, y_pred_nb)

#### Naive Bayes Results

In [41]:
print(f"NB Accuracy: {accuracy}")
print(f"NB Precision: {precision}")
print(f"NB Recall: {recall}")
print(f"NB F1 score: {f1}")

NB Accuracy: 0.8702549575070821
NB Precision: 0.9362549800796812
NB Recall: 0.7957110609480813
NB F1 score: 0.8602806589383771
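As a sanity check on these numbers, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two printed values. The constants below are copied from the output above.

```python
# Values copied from the printed Naive Bayes results
nb_precision = 0.9362549800796812
nb_recall = 0.7957110609480813

# F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall
nb_f1_check = 2 * nb_precision * nb_recall / (nb_precision + nb_recall)
print(round(nb_f1_check, 4))  # 0.8603
```

The gap between precision (about 0.94) and recall (about 0.80) means this model is rarely wrong when it predicts the positive class, but it misses roughly one in five positive posts.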


In [14]:
best_params_nb

{'nb__alpha': 0.1,
 'nb__fit_prior': True,
 'preprocessor__text__max_features': 15000,
 'preprocessor__text__ngram_range': (1, 2)}

### Random Forest


Random forests are a type of ensemble learning method that combine multiple decision trees to make predictions. They are a popular choice for classification tasks in machine learning, including those involving natural language processing (NLP). Tree-based models are also exceptional at identifying non-linear trends in the data that other models may miss.
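To make the point about non-linear trends concrete, here is a small sketch on hypothetical synthetic data (an XOR-style pattern, unrelated to our Reddit data). The label depends on the product of two features, which no single straight line can separate.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: class 1 when both features share the same sign
rng = np.random.default_rng(0)
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

Xx_tr, Xx_te, yx_tr, yx_te = train_test_split(
    X_xor, y_xor, test_size=0.25, random_state=0)

# A linear model cannot carve out the four quadrants
linear_acc = LogisticRegression().fit(Xx_tr, yx_tr).score(Xx_te, yx_te)

# A forest can, by splitting on each feature in turn
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest_acc = forest.fit(Xx_tr, yx_tr).score(Xx_te, yx_te)

print(f"logistic regression: {linear_acc:.2f}, random forest: {forest_acc:.2f}")
```

On this kind of pattern the linear model typically scores near chance while the forest scores close to 1.0.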

Below we'll evaluate our data using a Random Forest:

In [15]:
# Setting preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_preprocessor, 'text'),
        ('sentiment', sentiment_preprocessor, ['neg', 'neu', 'pos', 'compound'])
    ])

In [16]:
pipe_rf = Pipeline([
    ('preprocessor', preprocessor),
    ('rfc', RandomForestClassifier())
])

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=420)
y_test_rf= y_test

In [18]:
param_grid = {
    'preprocessor__text__max_features': [10000, 15000, 20000],
    'preprocessor__text__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'rfc__n_estimators': [50, 100, 200],
    'rfc__max_depth': [None, 5, 10],
    'rfc__min_samples_split': [2, 5, 10]
}

In [19]:
gs = GridSearchCV(pipe_rf, param_grid, cv=5, n_jobs=-1,scoring = 'accuracy')
gs.fit(X_train, y_train)



#### Refitting Random Forest Pipe with Best Parameters

In [20]:
best_params_rf = gs.best_params_
pipe_rf.set_params(**best_params_rf)
pipe_rf.fit(X_train, y_train)

#### Generating Random Forest Predictions

In [21]:
y_pred_rf = pipe_rf.predict(X_test)
y_score = pipe_rf.predict_proba(X_test)[:, 1]
accuracy_rf = pipe_rf.score(X_test, y_test_rf)
precision_rf = precision_score(y_test_rf, y_pred_rf)
recall_rf = recall_score(y_test_rf, y_pred_rf)
f1_rf = f1_score(y_test_rf, y_pred_rf)

In [22]:
print(f"Accuracy: {accuracy_rf}")
print(f"Precision: {precision_rf}")
print(f"Recall: {recall_rf}")
print(f"F1 score: {f1_rf}")

Accuracy: 0.8985835694050992
Precision: 0.9085580304806565
Recall: 0.884703196347032
F1 score: 0.8964719491035281


In [23]:
best_params_rf

{'preprocessor__text__max_features': 10000,
 'preprocessor__text__ngram_range': (1, 2),
 'rfc__max_depth': None,
 'rfc__min_samples_split': 10,
 'rfc__n_estimators': 200}

### Support Vector Machine

The third pipeline will consist of a Support Vector Machine (SVM) with a TF-IDF vectorizer for the text data. Compared with models such as a random forest or Naive Bayes, an SVM may be a good choice for an NLP classification problem for the following reasons:

1. Better performance on smaller datasets: SVMs can perform better than other models, such as random forests or neural networks, on smaller datasets.

2. Robustness to noisy and irrelevant features: SVMs are effective at filtering out noisy or irrelevant features, making them a good choice for NLP problems where the feature space may be high-dimensional and sparse.

3. Handling of non-linear relationships: with a non-linear kernel, SVMs are capable of modeling non-linear relationships between features and the target variable. The `LinearSVC` used below fits a linear decision boundary only.


We'll apply an SVM to our data and see whether it provides the benefits stated above.

#### Setting up Preprocessors

We'll set up our preprocessor in a similar way as before, but this time we'll add a scaler. Scaling matters for an SVM because the decision boundary is found by maximizing a margin measured in feature space. If the features are not on the same scale, those with larger values can dominate the optimization of the decision boundary, leading to suboptimal results.

The text data is vectorized and scaled inside its own pipeline, using `with_mean=False` because TF-IDF output is sparse. The sentiment columns are handled separately: they are passed through unchanged and also added as a scaled copy.
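The reason for `with_mean=False` can be seen in a short sketch on hypothetical toy documents. Centering a sparse TF-IDF matrix would turn its zeros into non-zeros, so scikit-learn refuses to do it, while scaling by the standard deviation alone keeps the matrix sparse.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy documents
docs = ['love this phone', 'hate this battery', 'love the battery life']
tfidf = TfidfVectorizer().fit_transform(docs)  # sparse matrix

# Centering is rejected for sparse input
try:
    StandardScaler(with_mean=True).fit(tfidf)
    centering_error = None
except ValueError as err:
    centering_error = str(err)

# with_mean=False only divides each column by its standard deviation,
# so zero entries stay zero and the matrix stays sparse
scaled = StandardScaler(with_mean=False).fit_transform(tfidf)

print(centering_error)
print(sparse.issparse(scaled), scaled.nnz == tfidf.nnz)
```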

In [24]:
# Text: TF-IDF followed by scaling; with_mean=False keeps the matrix sparse
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('scaler', StandardScaler(with_mean=False))
])

# Note: the sentiment columns enter the feature matrix twice,
# once unchanged ('sentiment') and once scaled ('ss')
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_pipeline, 'text'),
        ('sentiment', sentiment_preprocessor, ['neg', 'neu', 'pos', 'compound']),
        ('ss', StandardScaler(with_mean=False), ['neg', 'neu', 'pos', 'compound'])
    ])

#### SVC Pipe

In [25]:
pipe_svc = Pipeline([    
    ('preprocessor', preprocessor),
    ('svc', LinearSVC(class_weight='balanced',max_iter=1000))
])

#### Train Test Split

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=2023)
y_test_svm = y_test

In [27]:
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__class_weight': [None, 'balanced'],
    'svc__loss': ['hinge', 'squared_hinge']
}

In [28]:
pipe_svc.fit(X_train, y_train)



#### Instantiating and fitting GridSearch

In [29]:
gs = GridSearchCV(pipe_svc, param_grid=param_grid, cv=5, n_jobs=-1,scoring = 'accuracy')

In [30]:
# Fit the grid search on the training data
gs.fit(X_train, y_train)






#### Storing the Best Parameters and Refitting

We'll take the best parameters from our gridsearch and apply them to the SVC pipe created previously. 

In [31]:
best_params_svm = gs.best_params_

# Set best hyperparameters in pipeline
pipe_svc.set_params(**best_params_svm)

# Refit pipeline on entire training set
pipe_svc.fit(X_train, y_train)



#### Generating SVC Model Predictions

In [32]:
y_pred_svm = pipe_svc.predict(X_test)
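# LinearSVC has no predict_proba, so squash the signed distances to (0, 1)
# with a sigmoid; these are ranking scores, not calibrated probabilities
distances = pipe_svc.decision_function(X_test)
y_score_svm = 1 / (1 + np.exp(-distances))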
accuracy_svm = pipe_svc.score(X_test, y_test)
precision_svm = precision_score(y_test, y_pred_svm)
recall_svm = recall_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm)
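A note on the sigmoid step: because the sigmoid is monotonic, it preserves the ordering of the decision values, so ranking metrics such as ROC AUC should be unaffected. A minimal sketch on hypothetical toy data (the `make_classification` data and the `svc_demo` name are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the notebook's features
X_s, y_s = make_classification(n_samples=300, n_features=10, random_state=0)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(X_s, y_s, test_size=0.3, random_state=0)

svc_demo = LinearSVC(C=0.1, max_iter=5000).fit(Xs_tr, ys_tr)
raw_scores = svc_demo.decision_function(Xs_te)  # signed distances
squashed = 1 / (1 + np.exp(-raw_scores))        # mapped into (0, 1)

auc_raw = roc_auc_score(ys_te, raw_scores)
auc_squashed = roc_auc_score(ys_te, squashed)
print(np.isclose(auc_raw, auc_squashed))
```

If calibrated probabilities are needed rather than ranking scores, scikit-learn's `CalibratedClassifierCV` is one option worth exploring.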

#### Model Results

In [44]:
print(f"SVC Accuracy: {accuracy_svm}")
print(f"SVC Precision: {precision_svm}")
print(f"SVC Recall: {recall_svm}")
print(f"SVC F1 score: {f1_svm}")

SVC Accuracy: 0.8203966005665723
SVC Precision: 0.8211111111111111
SVC Recall: 0.8256983240223463
SVC F1 score: 0.8233983286908078


In [34]:
best_params_svm

{'svc__C': 0.1, 'svc__class_weight': None, 'svc__loss': 'squared_hinge'}

### Selected Model (Random Forest)

The model that performed best on our dataset was the Random Forest (RF). The RF model is highly accurate while achieving a reasonable trade-off between precision and recall, as shown by the high F1 score below. Note that each model was scored on a different train/test split (different `random_state` values), so the comparison is approximate.

| Model         | Accuracy | Precision | Recall | F1 Score |
|---------------|----------|-----------|--------|----------|
| Random Forest | 0.899    | 0.909     | 0.885  | 0.896    |
| LinearSVC     | 0.820    | 0.821     | 0.826  | 0.823    |
| NB            | 0.870    | 0.936     | 0.796  | 0.860    |

In the next notebook, we'll examine the RF model's results in more detail and provide a summary of our analysis.
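For reference, the comparison can also be assembled programmatically. The sketch below builds a small DataFrame from the test-set metrics printed earlier in this notebook (the values are copied from those outputs; the `results` name is an illustrative choice) and picks the model with the highest F1 score.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs earlier in this notebook
results = pd.DataFrame(
    {
        'Accuracy': [0.8985835694050992, 0.8203966005665723, 0.8702549575070821],
        'Precision': [0.9085580304806565, 0.8211111111111111, 0.9362549800796812],
        'Recall': [0.884703196347032, 0.8256983240223463, 0.7957110609480813],
        'F1 Score': [0.8964719491035281, 0.8233983286908078, 0.8602806589383771],
    },
    index=['Random Forest', 'LinearSVC', 'NB'],
)

print(results.round(3))

# Model with the highest F1 score
best_by_f1 = results['F1 Score'].idxmax()
print(best_by_f1)  # Random Forest
```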

In [35]:
with open('pickles/y_pred_rf.pkl', 'wb') as f:
    pickle.dump(y_pred_rf, f)
      
with open('pickles/accuracy_rf.pkl', 'wb') as f:
    pickle.dump(accuracy_rf, f)
    
with open('pickles/y_test_rf.pkl', 'wb') as f:
    pickle.dump(y_test_rf, f)
      
with open('pickles/y_score.pkl', 'wb') as f:
    pickle.dump(y_score, f)
    
with open('pickles/y_score_nb.pkl', 'wb') as f:
    pickle.dump(y_score_nb, f)

with open('pickles/y_score_svm.pkl', 'wb') as f:
    pickle.dump(y_score_svm, f)
    
with open('pickles/y_pred_svm.pkl', 'wb') as f:
    pickle.dump(y_pred_svm, f)
          
with open('pickles/y_pred_nb.pkl', 'wb') as f:
    pickle.dump(y_pred_nb, f)
    
with open('pickles/y_test_svm.pkl', 'wb') as f:
    pickle.dump(y_test_svm, f)
          
with open('pickles/y_test_nb.pkl', 'wb') as f:
    pickle.dump(y_test_nb, f)

### Model Score Exports

In [36]:
with open('pickles/precision_rf.pkl', 'wb') as f:
    pickle.dump(precision_rf, f)
    
with open('pickles/recall_rf.pkl', 'wb') as f:
    pickle.dump(recall_rf, f)
          
with open('pickles/f1_rf.pkl', 'wb') as f:
    pickle.dump(f1_rf, f)

In [37]:
with open('pickles/accuracy_svm.pkl', 'wb') as f:
    pickle.dump(accuracy_svm, f)

with open('pickles/precision_svm.pkl', 'wb') as f:
    pickle.dump(precision_svm, f)
    
with open('pickles/recall_svm.pkl', 'wb') as f:
    pickle.dump(recall_svm, f)
          
with open('pickles/f1_svm.pkl', 'wb') as f:
    pickle.dump(f1_svm, f)

In [38]:
with open('pickles/accuracy.pkl', 'wb') as f:
    pickle.dump(accuracy, f)
    
with open('pickles/precision.pkl', 'wb') as f:
    pickle.dump(precision, f)
    
with open('pickles/recall.pkl', 'wb') as f:
    pickle.dump(recall, f)
          
with open('pickles/f1.pkl', 'wb') as f:
    pickle.dump(f1, f)
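The export cells above repeat the same three lines for every object. One possible way to shorten them is a small helper that pickles a dictionary of named objects. This is a sketch only: `dump_artifacts` is a hypothetical helper, and the values below are placeholders, not the notebook's metrics.

```python
import pickle
import tempfile
from pathlib import Path

def dump_artifacts(artifacts, out_dir):
    """Pickle each value in `artifacts` to <out_dir>/<name>.pkl."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        with open(out_dir / f'{name}.pkl', 'wb') as f:
            pickle.dump(obj, f)

# Placeholder values for illustration; in the notebook these would be
# the real objects, e.g. {'accuracy_rf': accuracy_rf, 'f1_rf': f1_rf}
demo_artifacts = {'accuracy_demo': 0.5, 'labels_demo': [0, 1, 1]}

# Write to a temporary directory and read one file back to confirm the round trip
with tempfile.TemporaryDirectory() as tmp:
    dump_artifacts(demo_artifacts, tmp)
    with open(Path(tmp) / 'labels_demo.pkl', 'rb') as f:
        reloaded = pickle.load(f)

print(reloaded)  # [0, 1, 1]
```

In the notebook, the output directory would be `pickles`, and the dictionary keys would match the existing file names so that the next notebook can load them unchanged.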