# Model Development Example with MLFlow

This notebook serves as an example of how MLFlow can be used in the development of a machine learning model.

The rest of the notebook is organized as follows:
- 1) The Problem;
- 2) Setup Development Environment;
- 3) Training different Models;
- 4) Setting the Best Model to Production and the second best to Staging (so that we can deploy an API later).

In [1]:
!which python

/media/vinicius/Dados/poetry/virtualenvs/ml-pipes-VBbH4xSK-py3.8/bin/python


In [3]:
# Import Libs
import os
from dotenv import load_dotenv
import pandas as pd
import mlflow
from mlflow.tracking import MlflowClient
from mlflow.entities import ViewType

from dao.CreditCardDefault import load_creditcard_dataset
from trainers.h2o_automl import H2OClassifier
from trainers.pycaret import PycaretClassifier
from trainers.spark import SparkClassifier

load_dotenv(dotenv_path='../ml-pipes/')
pd.set_option('display.max_columns', 500)

EXPERIMENT_NAME = "CreditCardDefault"
TRACKING_URI = 'http://127.0.0.1:5000'
CREDIT_CARD_MODEL_NAME = EXPERIMENT_NAME
THRESHOLD = 0.5
CHAMPION_METRIC = 'ks'
FOLDS = 5

## 1) The problem

The first thing we need in order to build a model is a problem to solve. Here we use as an example the [Credit Card Default dataset from Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud), where the objective is to predict, based on a few features, whether or not a client will default on their credit card (the linked Kaggle page describes the positive class as fraud; this project treats it as default). The target variable `Class` takes the value 1 for default and 0 for non-default, so this is a binary classification problem.

Below, the dataset is imported and its first rows are displayed. Note that the Time column has been removed from the original dataset.

In [4]:
dataset = load_creditcard_dataset()
dataset.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
80863,1.245674,0.166975,0.488306,0.635322,-0.562777,-1.011073,0.014953,-0.160211,0.170362,-0.044575,-0.356749,-0.07346,-0.51776,0.406969,1.124147,0.34247,-0.374656,-0.438992,-0.116091,-0.13208,-0.262581,-0.816264,0.140304,0.357827,0.186423,0.096544,-0.035866,0.018495,8.99,0
35521,-0.882173,-1.725795,1.780052,-0.960257,-0.873474,-0.734802,-1.257349,0.059632,-1.980021,1.447301,-0.094626,-1.06587,0.517367,-0.598702,1.456061,-0.932074,1.086826,0.142174,0.189911,-0.371194,0.075282,0.945697,0.918384,0.384133,-0.237373,0.068078,0.12596,-0.042035,11.0,0
11717,-0.63282,-0.080301,2.062261,-0.07848,-0.960507,0.567219,0.519078,0.08712,1.732191,-1.488289,0.359036,-1.766809,2.373478,0.650105,-1.077059,-0.530218,1.107549,-0.67387,-0.007732,0.371242,-0.162851,-0.264086,0.426659,0.091449,-0.288418,0.884637,-0.080061,0.025521,197.84,0
222254,0.218703,0.937836,-0.669641,-0.154342,0.388547,-0.668786,-0.143498,-2.980654,-0.38959,-0.979103,-1.353385,-0.016372,-1.125787,1.011845,-0.750857,-0.274635,-0.088236,-0.618918,-0.271763,0.402438,-1.290661,0.412819,-0.144995,-0.083913,0.754766,0.401134,0.056396,0.267991,1.0,0
221679,1.913739,-0.802655,-0.78327,-1.028325,-0.05854,0.952319,-0.907332,0.457399,1.631972,-0.522016,-0.240486,0.123593,-0.877798,0.265922,2.147953,-0.523533,0.364879,-1.530706,-0.804835,-0.282553,-0.111758,-0.201889,0.521655,-0.450327,-0.871367,0.409314,0.000881,-0.052602,15.02,0


## 2) Setup Development Environment

Now that a problem has been stated and some data to help solve it has been gathered, the next step is to set up our environment to make use of the MLFlow tracking module. To do that we need to (i) make sure that our environment (the Python session running this notebook) has access to the bucket that we created and (ii) set up an MLFlow experiment.

In [5]:
# Setting the credentials for the bucket (the endpoint URL is hard-coded here)
# The access and secret keys are the ones defined in the .env file
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://127.0.0.1:9000'
os.environ['AWS_ACCESS_KEY_ID'] = os.getenv('MINIO_ACCESS_KEY')
os.environ['AWS_SECRET_ACCESS_KEY'] = os.getenv('MINIO_SECRET_KEY')
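
One caveat with the cell above: `os.getenv` returns `None` when a variable is missing, and assigning `None` to `os.environ` raises a `TypeError`. A minimal sketch of a more defensive variant follows. The helper name `export_minio_credentials` is hypothetical and not part of this project; it only assumes the `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` names used above.

```python
import os


def export_minio_credentials(env=os.environ):
    """Copy the MinIO keys to the AWS_* names, failing early if any is missing.

    `env` is any mutable mapping; it defaults to the process environment.
    """
    required = ("MINIO_ACCESS_KEY", "MINIO_SECRET_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        # Report every missing key at once instead of a TypeError on assignment
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    env["AWS_ACCESS_KEY_ID"] = env["MINIO_ACCESS_KEY"]
    env["AWS_SECRET_ACCESS_KEY"] = env["MINIO_SECRET_KEY"]
```

Calling `export_minio_credentials()` after `load_dotenv(...)` would then either set both variables or point directly at the key that is missing from the `.env` file.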

**MLFlow is built upon the concept of experiments. An experiment is a series of fits, where parameters, metrics, models and artifacts can be associated with the respective fit (in a way that is agnostic to the machine learning package).**

The code below tries to create an experiment; if that experiment already exists, it is set as the active one.

In [6]:
mlflow.set_tracking_uri(TRACKING_URI)
try:
    experiment = mlflow.create_experiment(EXPERIMENT_NAME)
except Exception:
    client = MlflowClient()
    experiment = client.get_experiment_by_name(EXPERIMENT_NAME)

mlflow.set_experiment(EXPERIMENT_NAME)
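
The same get-or-create logic can also be written without relying on an exception, by looking the experiment up first. This is only a sketch: the function name `get_or_create_experiment_id` is hypothetical, and it assumes that `client` is an `MlflowClient` (or any object exposing `get_experiment_by_name`, which returns `None` for unknown names, and `create_experiment`, which returns the new experiment ID).

```python
def get_or_create_experiment_id(client, name):
    """Return the ID of the experiment called `name`, creating it if needed."""
    experiment = client.get_experiment_by_name(name)
    if experiment is not None:
        # The experiment already exists, so reuse it
        return experiment.experiment_id
    # Otherwise create it; create_experiment returns the new experiment ID
    return client.create_experiment(name)
```

In this notebook it would be called as `get_or_create_experiment_id(MlflowClient(), EXPERIMENT_NAME)` after `mlflow.set_tracking_uri(TRACKING_URI)`.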

## 3) Training different Models

The next step is to train, evaluate and log a few different models. To demonstrate that MLFlow allows us to use different machine learning packages, we will train an H2O AutoML model, scikit-learn models (using PyCaret) and a Spark model. This is where MLFlow is put into action: for each fitted model, a few parameters, metrics and artifacts are logged, along with the model itself. To understand how this is done, check out the classifier definitions in the `./trainers` folder and the [MLFlow Logging Documentation](https://www.mlflow.org/docs/latest/tracking.html#logging-data-to-runs); it all happens inside the `mlflow.start_run()` context manager.
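
As a rough illustration of the logging pattern, the sketch below computes two metrics at a given threshold and logs them, together with the parameters, inside a run. It is not the code from `./trainers`: the names `classification_metrics` and `log_fit` are hypothetical, and the real classifiers log more (artifacts and the model itself). Only `mlflow.start_run`, `mlflow.log_params` and `mlflow.log_metrics` are actual MLFlow calls.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Precision and recall of the labels obtained by thresholding y_prob."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


def log_fit(run_name, params, y_true, y_prob, threshold=0.5):
    """Log one fit (parameters and metrics) as an MLFlow run."""
    import mlflow  # imported here so the metric helper works without MLFlow

    metrics = classification_metrics(y_true, y_prob, threshold)
    with mlflow.start_run(run_name=run_name):
        # Everything logged inside the context manager is attached to this run
        mlflow.log_params({**params, "threshold": threshold})
        mlflow.log_metrics(metrics)
    return metrics
```

Because the run only receives plain parameters and numbers, the same pattern works regardless of which package produced `y_prob`.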

In [8]:
H2OClassifier(
    run_name='H2O',
    max_mem_size='3G',
    threshold=THRESHOLD,
    df=dataset,
    target_col='Class',
    sort_metric='aucpr',
    max_models=8,
    max_runtime_secs=10,
    nfolds=FOLDS,
    seed=90
)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O_cluster_uptime:,1 min 28 secs
H2O_cluster_timezone:,America/Sao_Paulo
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.1
H2O_cluster_version_age:,1 month and 25 days
H2O_cluster_name:,H2O_from_python_vinicius_ai63is
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,2.993 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |
19:12:49.777: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.

████████████████████████████████████████████████████████| 100%
Could not find exact threshold 0.5; using closest threshold found 0.36810806231534454.
Could not find exact threshold 0.5; using closest threshold found 0.36810806231534454.
Could not find exact threshold 0.5; using closest threshold found 0.36810806231534454.
Could not find exact threshold 0.5; using closest threshold found 0.36810806231534454.
Could not find exact threshold 0.5; using closest threshold found 0.36810806231534454.


<trainers.h2o_automl.H2OClassifier at 0x7faf526d1e50>

In [9]:
PycaretClassifier(
        experiment_name=EXPERIMENT_NAME,
        run_name='Pycaret2',
        sort_metric='precision',
        df=dataset,
        target='Class',
        threshold=THRESHOLD,
        n_best_models=3,
        data_split_stratify=True,
        nfolds=FOLDS,
        normalize=True,
        transformation=True,
        ignore_low_variance=True,
        remove_multicollinearity=True,
        multicollinearity_threshold=0.95,
        session_id=54321
)

Unnamed: 0,Parameters
n_components,
priors,
shrinkage,
solver,svd
store_covariance,False
tol,0.0001


Error in logging parameter for                                 pycaret_precision_2
[Errno 2] No such file or directory: 'Hyperparameters.png'


<trainers.pycaret.PycaretClassifier at 0x7faf34558d60>

In [10]:
SparkClassifier(
    df = dataset,
    target_col = 'Class',
    run_name = 'spark_classifier',
    max_mem_size = 4,
    n_cores = 4,
    seed = 90
)

<trainers.spark.SparkClassifier at 0x7fe32bb66310>

If everything ran as expected, you can now check the MLFlow server at [http://127.0.0.1:5000](http://127.0.0.1:5000) to compare and explore the model runs.

## 4) Setting the Best Model to Production

The final step in this notebook is to promote the model with the best value of the selected metric, defined above as `CHAMPION_METRIC`. This is done to show that it is possible to create an automated workflow using MLFlow to deploy a model. However, it is also possible to deploy the model using the [UI server](https://www.mlflow.org/docs/latest/model-registry.html#ui-workflow). Note that the cell below transitions the registered version to the `Staging` stage; pass `stage="Production"` instead to promote it to production.

Once this is done you can return to the README file to check how the model is now deployed.
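
The notebook does not show how the `ks` metric behind `CHAMPION_METRIC` is computed. Assuming it is the two-sample Kolmogorov-Smirnov statistic commonly used in credit scoring (the largest gap between the cumulative score distributions of the two classes), a minimal pure-Python sketch could look like this. The function name `ks_statistic` is hypothetical and is not taken from the `./trainers` code.

```python
from bisect import bisect_right


def ks_statistic(y_true, y_score):
    """Largest gap between the empirical score CDFs of the two classes.

    Assumes binary labels (1 = positive, 0 = negative). Returns a value
    between 0 (identical distributions) and 1 (perfect separation).
    """
    pos = sorted(s for y, s in zip(y_true, y_score) if y == 1)
    neg = sorted(s for y, s in zip(y_true, y_score) if y == 0)
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute KS")
    best = 0.0
    for threshold in sorted(set(y_score)):
        # Fraction of each class scoring at or below the threshold
        cdf_pos = bisect_right(pos, threshold) / len(pos)
        cdf_neg = bisect_right(neg, threshold) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best
```

Under this assumption a higher value means the model separates the two classes better, which is consistent with ordering the runs by `metrics.ks DESC`.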

In [10]:
# Getting The best Model according to CHAMPION_METRIC
champion = MlflowClient().search_runs(
    experiment_ids=[
        str(
            mlflow.get_experiment_by_name(name=EXPERIMENT_NAME).experiment_id
        )
    ],
    run_view_type=ViewType.ALL,
    order_by=[f"metrics.{CHAMPION_METRIC} DESC"],
    max_results=1
)
run_id = champion[0].info.run_id

# Registering the model in the model registry
model = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=CREDIT_CARD_MODEL_NAME
)

# Adding a description to the newly registered version
MlflowClient().update_model_version(
    name=CREDIT_CARD_MODEL_NAME,
    version=model.version,
    description='Deploying model with model registery'
)

# Transitioning the version to the Staging stage
MlflowClient().transition_model_version_stage(
    name=CREDIT_CARD_MODEL_NAME,
    version=model.version,
    stage="Staging"
)

Registered model 'CreditCardDefault' already exists. Creating a new version of this model...
2021/05/19 09:02:38 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: CreditCardDefault, version 3
Created version '3' of model 'CreditCardDefault'.


<ModelVersion: creation_timestamp=1621425758707, current_stage='Staging', description='Deploying model with model registery', last_updated_timestamp=1621425758825, name='CreditCardDefault', run_id='c325b227a2d24a3994ca8e75b0201117', run_link='', source='/media/vinicius/Dados/projects/ml-pipes/mlflow_artifact_store/1/c325b227a2d24a3994ca8e75b0201117/artifacts/model', status='READY', status_message='', tags={}, user_id='', version='3'>
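
The outline at the top of the notebook mentions setting the best model to Production and the second best to Staging. A minimal sketch of how both runs could be selected in a single query is shown below. The function name `pick_champion_and_challenger` is hypothetical, and `client` is assumed to be an `MlflowClient` (or any object with a compatible `search_runs` method).

```python
def pick_champion_and_challenger(client, experiment_id, metric):
    """Map each target stage to the run ID of the best and second-best run."""
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        order_by=[f"metrics.{metric} DESC"],  # best value first
        max_results=2,
    )
    stages = ("Production", "Staging")
    # zip stops early if fewer than two runs exist
    return {stage: run.info.run_id for stage, run in zip(stages, runs)}
```

Each returned run ID could then be registered with `mlflow.register_model` and moved to its stage with `transition_model_version_stage`, following the same steps as the cell above.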