<center>
<h1>The Full Machine Learning Lifecycle - How to Use Machine Learning in Production (MLOps)</h1>
<hr>
<h2>Model tuning - Improve your model performance</h2>
<hr>
 </center>

If you opened this notebook, your model probably needs improvement. Use this notebook to find a setup that produces a better model. You might experiment with hyperparameter tuning or try different algorithms.

First, as in the MLflow exercise, let's import the relevant libraries ...

In [1]:
import sys
import mlflow
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)
import os
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# Load plugin
sys.path.append('/cd4ml/plugins/')

# functions needed for data preprocessing
from cd4ml.data_processing.ingest_data import get_data
from cd4ml.data_processing.split_train_test import get_train_test_split
from cd4ml.data_processing.transform_data import get_transformed_data

... set the paths and variables ...

In [2]:
#_raw_data_dir = '/mnt/raw/winji/day2'
_raw_data_dir = '/data/batch1'
_root_dir = './' #os.environ.get('PROJECT_PATH')

if _root_dir is None:
    raise ValueError('PROJECT_PATH environment variable not set')

_data_dir = os.path.join(_root_dir, 'data')
_data_dir

'./data'

... and prepare the validation data

In [3]:
# get the data
n_days_validation_set = 20

df_raw = get_data(_raw_data_dir)
df_all_train_data, _ = get_train_test_split(df_raw, n_days_test=20)
df_train, df_val = get_train_test_split(df_all_train_data, n_days_validation_set)
x_train, y_train = get_transformed_data(df_train)
x_val, y_val = get_transformed_data(df_val)

Now see how well the model performs on your training and validation data and try to adjust and improve it. Make sure you set the ```_experiment_name``` and use the MLflow UI to inspect your model runs. Log all the relevant metrics and parameters as you have seen in the MLflow exercise.

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, make_scorer

# DEFINE YOUR MODEL HERE:
def get_model():
    model = LogisticRegression()
    return model

# SET THE MLFLOW EXPERIMENT HERE:
_experiment_name = "finetuning_tracking_exercise"

if not _experiment_name:
    raise ValueError('_experiment_name not set')

mlflow.set_experiment(_experiment_name)
mlflow.autolog()

with mlflow.start_run() as run:
    print(f"\nActive run_id: {run.info.run_id}")

    # instantiate your classifier (it is fitted by the grid search below)
    clf = get_model()

    # define the grid of parameters to search
    param_grid = {'C': [0.1, 1.0, 10.0], 'max_iter': [50, 100, 200, 1000]}

    # create a scoring object for f1_score (needed for GridSearchCV)
    f1_scorer = make_scorer(f1_score, average='macro', greater_is_better=True)

    # create a GridSearchCV object
    grid_search = GridSearchCV(clf, param_grid, scoring={'f1_macro': f1_scorer, 'accuracy': 'accuracy'}, refit='f1_macro', cv=3)

    # fit the GridSearchCV object to your data
    grid_search.fit(x_train, y_train)

    y_val_pred = grid_search.predict(x_val)

    # CALCULATE YOUR METRICS HERE:
    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred, average='macro')
    print('Accuracy validation set:', val_accuracy)
    print('F1-score validation set:', val_f1)

    # LOG COMMANDS HERE:
    mlflow.log_metric('val_acc', val_accuracy)
    mlflow.log_metric('val_f1', val_f1)

    # log the best parameters found by grid search
    mlflow.log_params({'best_params': grid_search.best_params_})

    # log the best model found by grid search
    best_model = grid_search.best_estimator_
    mlflow.sklearn.log_model(best_model, 'best_model')

    # log model metrics on the validation set
    mlflow.log_metric('best_model_val_acc', accuracy_score(y_val, best_model.predict(x_val)))
    mlflow.log_metric('best_model_val_f1', f1_score(y_val, best_model.predict(x_val), average='macro'))

mlflow.end_run()

2024/05/15 15:31:14 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.



Active run_id: 2e353be099754aa7ad7cad0192dceaea


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Accuracy validation set: 0.9902654054550464
F1-score validation set: 0.6503094679350301
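
The gap between the accuracy (about 0.99) and the macro F1-score (about 0.65) printed above is what you would expect if one class is much rarer than the other. Whether that is the case for this dataset is an assumption here, so check the label counts yourself. The sketch below uses made-up labels and plain Python (no scikit-learn) to show how a classifier that always predicts the majority class can reach high accuracy while its macro F1 stays close to 0.5.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up labels for illustration only: 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10
# A classifier that always predicts the majority class.
y_pred = [0] * 1000

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
print('accuracy:', toy_accuracy)
print('macro F1:', toy_macro_f1)
```

Because macro averaging weights every class equally, the rare class pulls the score down even when it barely affects accuracy. This is why the grid search in this notebook refits on `f1_macro` rather than on accuracy.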


In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, make_scorer

# DEFINE YOUR MODEL HERE:
def get_model():
    model = RandomForestClassifier(random_state=42)
    return model

# SET THE MLFLOW EXPERIMENT HERE:
_experiment_name = "finetuning_tracking_exercise"

if not _experiment_name:
    raise ValueError('_experiment_name not set')

mlflow.set_experiment(_experiment_name)
mlflow.autolog()

with mlflow.start_run() as run:
    print(f"\nActive run_id: {run.info.run_id}")

    # instantiate your classifier (it is fitted by the grid search below)
    clf = get_model()

    # define the grid of parameters to search
    param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}

    # create a scoring object for f1_score (needed for GridSearchCV)
    f1_scorer = make_scorer(f1_score, average='macro', greater_is_better=True)

    # create a GridSearchCV object
    grid_search = GridSearchCV(clf, param_grid, scoring={'f1_macro': f1_scorer, 'accuracy': 'accuracy'}, refit='f1_macro', cv=3)

    # fit the GridSearchCV object to your data
    grid_search.fit(x_train, y_train)

    y_val_pred = grid_search.predict(x_val)

    # CALCULATE YOUR METRICS HERE:
    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred, average='macro')
    print('Accuracy validation set:', val_accuracy)
    print('F1-score validation set:', val_f1)

    # LOG COMMANDS HERE:
    mlflow.log_metric('val_acc', val_accuracy)
    mlflow.log_metric('val_f1', val_f1)

    # log the best parameters found by grid search
    mlflow.log_params({'best_params': grid_search.best_params_})

    # log the best model found by grid search
    best_model = grid_search.best_estimator_
    mlflow.sklearn.log_model(best_model, 'best_model')

    # log model metrics on the validation set
    mlflow.log_metric('best_model_val_acc', accuracy_score(y_val, best_model.predict(x_val)))
    mlflow.log_metric('best_model_val_f1', f1_score(y_val, best_model.predict(x_val), average='macro'))

mlflow.end_run()

2024/05/15 15:22:25 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.



Active run_id: 535c710765424d219d695622eb809d9d


2024/05/15 15:23:17 INFO mlflow.sklearn.utils: Logging the 5 best runs, 4 runs will be omitted.


Accuracy validation set: 0.9905409128478281
F1-score validation set: 0.511688673397201


## Adding a scaler step

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, make_scorer

# DEFINE YOUR MODEL HERE:
def get_model():
    steps = [('scaler', StandardScaler()),
             ('classifier', LogisticRegression())]
    model = Pipeline(steps)
    return model

# SET THE MLFLOW EXPERIMENT HERE:
_experiment_name = "finetuning_tracking_exercise"

if not _experiment_name:
    raise ValueError('_experiment_name not set')

mlflow.set_experiment(_experiment_name)
mlflow.autolog()

with mlflow.start_run() as run:
    print(f"\nActive run_id: {run.info.run_id}")

    # instantiate your classifier (it is fitted by the grid search below)
    clf = get_model()

    # define the grid of parameters to search
    param_grid = {'classifier__C': [0.1, 1.0, 10.0], 'classifier__max_iter': [50, 100, 200, 1000]}

    # create a scoring object for f1_score (needed for GridSearchCV)
    f1_scorer = make_scorer(f1_score, average='macro', greater_is_better=True)

    # create a GridSearchCV object
    grid_search = GridSearchCV(clf, param_grid, scoring={'f1_macro': f1_scorer, 'accuracy': 'accuracy'}, refit='f1_macro', cv=3)

    # fit the GridSearchCV object to your data
    grid_search.fit(x_train, y_train)

    y_val_pred = grid_search.predict(x_val)

    # CALCULATE YOUR METRICS HERE:
    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_f1 = f1_score(y_val, y_val_pred, average='macro')
    print('Accuracy validation set:', val_accuracy)
    print('F1-score validation set:', val_f1)

    # LOG COMMANDS HERE:
    mlflow.log_metric('val_acc', val_accuracy)
    mlflow.log_metric('val_f1', val_f1)

    # log the best parameters found by grid search
    mlflow.log_params({'best_params': grid_search.best_params_})

    # log the best model found by grid search
    best_model = grid_search.best_estimator_
    mlflow.sklearn.log_model(best_model, 'best_model')

    # log model metrics on the validation set
    mlflow.log_metric('best_model_val_acc', accuracy_score(y_val, best_model.predict(x_val)))
    mlflow.log_metric('best_model_val_f1', f1_score(y_val, best_model.predict(x_val), average='macro'))

mlflow.end_run()

2024/05/15 15:36:12 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.



Active run_id: 8cfca1d8e52e42c6b4aaee31fdf987f3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Accuracy validation set: 0.9904490770502342
F1-score validation set: 0.6513565627186337


Once you are satisfied with the model performance, adjust the ```get_model()``` function in the code (located in ```cd4ml-workshop/cd4ml/model_training/train_model.py```) accordingly and (re-)run the CI pipeline with Apache Airflow.
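
As a rough sketch of what the adjusted function could look like, the block below returns the scaled logistic regression pipeline from the last experiment with fixed hyperparameters. The actual signature and contents of `get_model()` in `train_model.py` are not shown in this notebook, so treat the shape of this function as an assumption. The values `C=1.0` and `max_iter=1000` are placeholders: replace them with the best parameters recorded for your run in the MLflow UI.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def get_model():
    """Return an unfitted pipeline: standard scaling, then logistic regression.

    C and max_iter are placeholder values, not results from this notebook.
    """
    steps = [
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(C=1.0, max_iter=1000)),
    ]
    return Pipeline(steps)
```

Keeping the scaler inside the pipeline means the same scaling is applied at training and at prediction time, so the calling code does not have to scale the data separately.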