# Parameter tuning 

All of the models we have trained have parameters that must be tuned. It's common practice to train a model with a few different parameter values and compare how they affect the model's performance.

It can also be necessary to retune your model's parameters after your model has been running in production for a while - perhaps the data has drifted, and as such the model should be updated.

In this notebook we load in the feature engineering and model training pipeline stages developed in the previous notebooks, and implement a parameter sweep to identify the best model parameters from a candidate set. 

We start by loading in the training and testing sets:

In [None]:
import pandas as pd
import numpy as np

import os.path

training_data = pd.read_parquet(os.path.join("data", "training.parquet"))
testing_data = pd.read_parquet(os.path.join("data", "testing.parquet"))

Next, we load in the feature engineering and model pipeline stages which were developed in the previous notebooks. We will then combine them into one pipeline, which takes in raw data and returns a prediction.

Note: If you didn't fully run at least one feature engineering notebook and one model training notebook, this next cell will return an error.

In [None]:
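## loading in the feature extraction pipeline and the model saved by the previous notebooks
import cloudpickle as cp

# use context managers so the files are closed after loading
with open('feature_pipeline.sav', 'rb') as f:
    feature_pipeline = cp.load(f)

with open('model.sav', 'rb') as f:
    model = cp.load(f)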

In [None]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('features',feature_pipeline),
    ('model',model)
])

The pipeline can be fit to data (in the same way that we fit the individual feature engineering and model training techniques to the data in the previous notebooks). We can also evaluate the model using the test set, as we did previously. 

In [None]:
pipeline.fit(training_data["Text"], training_data["Category"])

In [None]:
from mlworkflows import plot
df, chart = plot.confusion_matrix(testing_data.Category, pipeline.predict(testing_data["Text"]))
chart

In [None]:
from sklearn.metrics import classification_report
print(classification_report(testing_data.Category, pipeline.predict(testing_data["Text"]) ))

We can also easily retrain the pipeline using different values for the parameters. `pipeline.named_steps` lists the steps in the pipeline, which we can refer to by name. We then use if/else statements to select a parameter grid to sweep over, depending on which type of model was trained.


✅ The parameter sweep below only supports three of the four models you could train in the previous notebooks. Add support for the XGBoost model below. 

In [None]:
pipeline.named_steps
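
The step names shown by `pipeline.named_steps` matter for the sweep: scikit-learn addresses a parameter of a pipeline step as `<step name>__<parameter name>`, which is why the grids in the next cell use keys such as `model__alpha`. The pure-Python sketch below only illustrates that naming convention. `split_param_name` is a hypothetical helper, not part of scikit-learn, and `features__tfidf__norm` is a made-up name used to show a nested step.

```python
def split_param_name(name):
    """Split 'step__param' at the first double underscore (illustrative helper)."""
    step, sep, param = name.partition("__")
    if not sep:
        # No double underscore: the name refers to the pipeline itself.
        return None, name
    return step, param


# 'features__tfidf__norm' is a hypothetical nested name, not taken from the notebooks.
for key in ["model__alpha", "model__max_depth", "features__tfidf__norm"]:
    print(key, "->", split_param_name(key))
```

Only the first `__` is split off, so the remainder of a nested name is left for the named step to resolve.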

In [None]:

param_grid = {}

if 'MultinomialNB' in str(pipeline.named_steps['model']):
    # we trained the naive Bayes model. 
    print("Parameter sweep for the Multinomial Naive Bayes Model")
    param_grid = { 'model__alpha' : [0.1,0.25,0.5,0.75,1] }
    print(param_grid)
elif 'LinearSVC' in str(pipeline.named_steps['model']):
    # We trained the Support vector classifier. 
    print("Parameter sweep for the Linear Support Vector Classifier")
    param_grid = {'model__multi_class' : ['ovr', 'crammer_singer'], 
                  'model__C': [0.3, 0.6, 1], 
                  'model__max_iter': [20000]}
elif 'RandomForestClassifier' in str(pipeline.named_steps['model']):
    print("Parameter sweep for the Random Forest Classifier")
    param_grid = {'model__max_depth': [3, 4, 5, 6], 
                  'model__n_estimators': [100, 250, 500]}
else:
    # we haven't dealt with this model yet 
    print("Parameter grid not defined for this model")
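
A grid search tries every combination of the values listed in the grid, so the cost of a sweep grows quickly as values are added. The sketch below uses only the standard library to enumerate the combinations for the random forest grid above. `expand_grid` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of values in a {name: [values]} grid."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[n] for n in names))]


# Same grid as the random forest branch above.
rf_grid = {"model__max_depth": [3, 4, 5, 6],
           "model__n_estimators": [100, 250, 500]}

candidates = expand_grid(rf_grid)
n_folds = 3  # matches cv=3 in the search below
print(len(candidates), "candidates,",
      len(candidates) * n_folds, "cross-validation fits")
```

With 4 depths and 3 forest sizes there are 12 candidates, so 3-fold cross-validation means 36 fits, plus one more when the best candidate is refit on the full training set (the default behaviour of `GridSearchCV`).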


In [None]:
%%time
from sklearn.model_selection import GridSearchCV

search = None


search = GridSearchCV(pipeline, param_grid, cv=3, return_train_score=True)
search.fit(training_data["Text"], training_data["Category"])

print("Best parameters were %s" % str(search.best_params_))

You can use `.cv_results_` to see more information about the cross-validation performance for each candidate set of parameter values:

In [None]:
search.cv_results_
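
`cv_results_` is a dictionary of parallel arrays, which can be hard to read in raw form. One possible way to summarize it is sketched below. `summarize` is a hypothetical helper, and the scores in `example_results` are made-up placeholders so that the block runs on its own; in the notebook you would pass `search.cv_results_` instead. The `mean_train_score` key is available because the search above was created with `return_train_score=True`.

```python
def summarize(cv_results, top=3):
    """Return (params, mean train score, mean test score) rows, best test score first."""
    rows = zip(cv_results["params"],
               cv_results["mean_train_score"],
               cv_results["mean_test_score"])
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top]


# Made-up placeholder scores, for illustration only; not results from this notebook.
example_results = {
    "params": [{"model__alpha": 0.1}, {"model__alpha": 0.5}, {"model__alpha": 1}],
    "mean_train_score": [0.99, 0.97, 0.95],
    "mean_test_score": [0.90, 0.92, 0.91],
}

for params, train, test in summarize(example_results):
    print(params, "train=%.2f" % train, "test=%.2f" % test)
```

A large gap between the train and test scores for a candidate can be a sign that it is overfitting the training folds.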