Every model we have trained has parameters that must be tuned. It is common practice to train a model with a few different parameter values and compare how each choice affects its performance.

It can also be necessary to retune a model's parameters after it has been running in production for a while. For example, the data may have drifted, in which case the model should be updated.

In this notebook we load in the feature engineering and model training pipeline stages developed in the previous notebooks, and do a parameter sweep to identify the best model parameters from a candidate set. 

We start by loading in the training and testing sets.

In [1]:
import pandas as pd
import numpy as np

import os.path

training_data = pd.read_parquet(os.path.join("data", "training.parquet"))
testing_data = pd.read_parquet(os.path.join("data", "testing.parquet"))

Next, we load in the feature engineering and model pipeline stages which were developed in the previous notebooks. We will then combine them into one pipeline, which takes in raw data and returns a prediction.

Note: If you didn't run at least one feature engineering notebook and one model training notebook to completion, the next cell will raise an error when it is run.

In [2]:
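# Load the feature engineering pipeline and the model saved by the previous notebooks.
import cloudpickle as cp

# Use context managers so the file handles are closed after loading.
with open('feature_pipeline.sav', 'rb') as f:
    feature_pipeline = cp.load(f)

with open('model.sav', 'rb') as f:
    model = cp.load(f)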

In [3]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('features',feature_pipeline),
    ('model',model)
])
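
Because one pipeline is nested inside another, each parameter is addressed by joining the step names with double underscores, and a parameter sweep relies on these names. The following is a minimal sketch of that naming scheme, assuming stand-in stages (a hashing vectorizer, a TF-IDF transformer and a naive Bayes classifier) rather than the saved ones. The `demo_` names are hypothetical and are not part of the earlier notebooks.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stand-in stages, assumed for illustration only; the real stages are
# loaded from the .sav files written by the previous notebooks.
demo_features = Pipeline([
    ("vect", HashingVectorizer(alternate_sign=False, n_features=2048, norm=None)),
    ("tfidf", TfidfTransformer()),
])
demo_pipeline = Pipeline([
    ("features", demo_features),
    ("model", MultinomialNB()),
])

# Nested parameters are named <step>__<substep>__<parameter>.
demo_params = demo_pipeline.get_params()
print(demo_params["features__vect__n_features"])
print(demo_params["model__alpha"])

# set_params accepts the same names, so any stage can be changed in place.
demo_pipeline.set_params(model__alpha=0.5, features__vect__n_features=1024)
```

Calling `pipeline.get_params().keys()` on the real pipeline lists the names that are available for tuning.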

The pipeline can be fit to data, in the same way that we fit the individual feature engineering and model training techniques to the data in the previous notebooks. We can also evaluate the model using the test set, as we did previously.

In [4]:
pipeline.fit(training_data["Text"], training_data["Category"])

Pipeline(steps=[('features',
                 Pipeline(steps=[('vect',
                                  HashingVectorizer(alternate_sign=False,
                                                    n_features=2048, norm=None,
                                                    stop_words='english',
                                                    token_pattern='(?u)\\b[A-Za-z]\\w+\\b')),
                                 ('tfidf', TfidfTransformer())])),
                ('model', MultinomialNB())])

In [None]:
from mlworkflows import plot
# Predict on the raw test text; the pipeline applies the feature engineering itself.
df, chart = plot.confusion_matrix(testing_data.Category, pipeline.predict(testing_data["Text"]))
chart
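
One possible way to run the parameter sweep is scikit-learn's `GridSearchCV`, which fits the pipeline once per combination of candidate values and scores each combination by cross-validation. The sketch below is an illustration under stated assumptions, not necessarily the approach this notebook takes: the corpus is a tiny made-up stand-in for `training_data`, the pipeline is rebuilt from stand-in stages, and the candidate values are arbitrary.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus so the sketch runs on its own; in the notebook this
# would be training_data["Text"] and training_data["Category"].
toy_text = [
    "great product works well", "really love this item",
    "excellent quality and value", "happy with this purchase",
    "fantastic would buy again", "works great love it",
    "terrible broke after a day", "awful waste of money",
    "very poor quality", "disappointed do not buy",
    "bad item stopped working", "horrible would not recommend",
]
toy_labels = ["good"] * 6 + ["bad"] * 6

# Stand-in for the combined feature engineering and model pipeline.
sweep_pipeline = Pipeline([
    ("features", Pipeline([
        ("vect", HashingVectorizer(alternate_sign=False, norm=None)),
        ("tfidf", TfidfTransformer()),
    ])),
    ("model", MultinomialNB()),
])

# Candidate values are illustrative only, not recommendations.
param_grid = {
    "features__vect__n_features": [256, 2048],
    "model__alpha": [0.1, 1.0],
}

# Fit every combination with 3-fold cross-validation and keep the best one.
search = GridSearchCV(sweep_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_text, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

By default `GridSearchCV` refits the best combination on all the data passed to `fit`, so `search.best_estimator_` can be evaluated on the test set in the same way as `pipeline`.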