# Ml-Pipes

The objective of this project is to present a gentle introduction to how a machine learning model can be trained and deployed, and to how [MLFlow](https://www.mlflow.org/), [FastAPI](https://fastapi.tiangolo.com/) and [Docker](https://docs.docker.com/) can facilitate a few aspects of that task. Within this context, this notebook represents the development part of the process, where a data scientist creates a few models, evaluates them with some metrics and selects the best one according to a chosen metric. Many different machine learning tools can be used to create models. The objective of this notebook is to give a broad overview of an end-to-end machine learning process, so it is recommended to keep the documentation of the frameworks used here at hand as a companion. The project's README file explains how to reproduce the whole project.

The rest of the notebook is organized as follows:
- 1) The Problem;
- 2) Setting up an MLFlow Experiment;
- 3) Training different Models;
- 4) Setting the Best Model to Production.

The code below imports everything that will be used throughout the notebook.

In [1]:
# Import Libs
import pandas as pd
pd.set_option('display.max_columns', 500)

import mlflow
from mlflow.tracking import MlflowClient
from mlflow.entities import ViewType

from settings import EXPERIMENT_NAME, FOLDS, CREDIT_CARD_MODEL_NAME,\
     CHAMPION_METRIC, THRESHOLD  # pylint: disable=import-error
from dao.CreditCardDefault \
    import load_creditcard_dataset  # pylint: disable=import-error

from trainers.h2o_automl import H2OClassifier  # pylint: disable=import-error
from trainers.pycaret import PycaretClassifier  # pylint: disable=import-error
from trainers.spark import SparkClassifier


## 1) The problem

The first thing we need in order to build a model is a problem to solve. The example used here is the [Credit Card Default dataset from Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud), where the objective is to predict, based on a few features, whether or not a client will default on their credit card. The target variable takes the value 1 for default and 0 for non-default, so this is a binary classification problem.

Below, the dataset is imported and its first rows are displayed. Note that the Time column has been removed from the original dataset.

In [2]:
dataset = load_creditcard_dataset()
dataset.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
68845,-0.998343,0.641091,1.088884,-0.738617,0.957973,-0.38251,0.931253,0.033828,-0.861026,-0.498659,1.120652,0.460317,-0.506587,0.588127,-0.235785,0.201504,-0.576826,-0.595309,-0.354798,-0.203423,-0.244445,-0.86216,-0.091341,-0.3207,-0.166933,0.009657,-0.074248,0.06486,38.9,0
247541,-0.716504,-0.413574,2.192305,-2.448089,-0.088301,1.349923,-0.744641,0.56954,-0.262896,-0.4964,-0.7998,-0.273611,0.105582,-1.248839,-2.212797,1.592573,-0.374872,-0.336119,0.970103,0.129017,0.121044,0.400631,-0.511535,-0.31787,0.667286,-0.137734,0.057048,0.046901,5.8,0
19085,1.176214,0.14343,0.800902,0.663136,-0.576966,-0.573517,-0.16522,-0.055699,0.113294,-0.12833,0.307778,0.733804,0.653405,0.096494,1.314621,-0.009988,-0.101657,-0.896489,-0.73099,-0.099048,-0.049434,-0.071165,0.12733,0.427297,0.163114,0.238571,0.000802,0.02195,5.37,0
174248,2.097365,0.002373,-1.886971,0.441321,0.60084,-0.772248,0.472642,-0.345949,0.3656,0.124068,-1.201862,0.195007,-0.075479,0.396852,-0.392231,-0.46074,-0.228219,-0.664436,0.266854,-0.192546,-0.049559,0.018541,0.059929,0.52329,0.274317,0.361685,-0.089921,-0.068673,12.99,0
2784,-2.106211,0.577057,1.717694,1.444458,-0.29129,1.200384,0.382507,-0.27132,1.875879,1.899204,0.92097,1.155478,-0.267004,-1.663355,-1.837417,-1.697223,0.482872,-0.786058,1.53424,-0.004477,-0.57159,-0.227883,0.050134,0.017689,-0.338141,-0.687573,-1.906314,-0.920435,35.99,0
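Since `Class` is a binary target, a classifier usually produces a probability for class 1 that has to be turned into a 0/1 label before metrics such as precision can be computed. The snippet below is a minimal, library-free sketch of that idea. The helper names `apply_threshold` and `precision` and the toy numbers are hypothetical, and it is only an assumption that the `THRESHOLD` setting imported above plays this role in the trainers.

```python
def apply_threshold(probabilities, threshold):
    """Turn predicted probabilities of class 1 into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(y_pred)
    # Convention chosen for this sketch: no predicted positives gives 0.0
    return true_pos / pred_pos if pred_pos else 0.0


# Toy labels and probabilities (illustrative only, not taken from the dataset)
y_true = [0, 0, 1, 1, 0]
probs = [0.10, 0.60, 0.80, 0.40, 0.20]

y_pred = apply_threshold(probs, threshold=0.5)
print(y_pred, precision(y_true, y_pred))
```

Raising the threshold makes the model more conservative about predicting class 1, which typically trades recall for precision.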


## 2) Setting up an MLFlow Experiment

Now that a problem has been stated and some data to help solve it has been gathered, the next step is to set up an MLFlow experiment to log our models. **MLFlow is built upon the concept of experiments. An experiment is a series of fits, where parameters, metrics, models and artifacts can be associated with their respective fit (in a way that is agnostic to the machine learning package).**

The code below tries to create an experiment; if that experiment already exists, it looks up the existing one instead. Either way, the experiment is then set as the active one.
 

In [3]:
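# Use a local SQLite database as the tracking store
mlflow.set_tracking_uri("sqlite:///mlruns.db")
try:
    # create_experiment returns the ID of the new experiment
    experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
except Exception:
    # The experiment already exists, so look up its ID instead
    client = MlflowClient()
    experiment_id = client.get_experiment_by_name(EXPERIMENT_NAME).experiment_id
# Make the experiment the active one for the runs that follow
mlflow.set_experiment(EXPERIMENT_NAME)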

## 3) Training different Models

The next step is to train, evaluate and log a few different models. To demonstrate that MLFlow allows us to use different machine learning packages, we will train H2O AutoML models, scikit-learn models (through PyCaret) and a Spark classifier. This is where MLFlow is put into action: for each model that is fitted, a few parameters, metrics and artifacts are logged, along with the model itself. To understand how this is done, check out the classifier definitions in the `src/trainers/` folder and the [MLFlow Logging Documentation](https://www.mlflow.org/docs/latest/tracking.html#logging-data-to-runs); it all happens inside the `mlflow.start_run()` context manager.

The next cells train the different classifiers. Once they finish running, you can launch the [MLFlow Tracking UI](https://www.mlflow.org/docs/latest/tracking.html#tracking-ui) by executing `mlflow ui -p 5000 --backend-store-uri sqlite:///mlruns.db` in a terminal inside the `src/` folder and see the results at [127.0.0.1:5000](http://127.0.0.1:5000).

In [5]:
H2OClassifier(
    run_name='H2O',
    max_mem_size='3G',
    threshold=THRESHOLD,
    df=dataset,
    target_col='Class',
    sort_metric='aucpr',
    max_models=8,
    max_runtime_secs=60,
    nfolds=FOLDS,
    seed=90
)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.11" 2021-04-20; OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04); OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
  Starting server from /media/vinicius/Dados/poetry/virtualenvs/ml-pipes-VBbH4xSK-py3.8/lib/python3.8/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpm4g3x_95
  JVM stdout: /tmp/tmpm4g3x_95/h2o_vinicius_started_from_python.out
  JVM stderr: /tmp/tmpm4g3x_95/h2o_vinicius_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,America/Sao_Paulo
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.1
H2O_cluster_version_age:,1 month and 17 days
H2O_cluster_name:,H2O_from_python_vinicius_0nznbh
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |
19:36:14.309: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.

████████████████████████████████████████████████████████| 100%
Could not find exact threshold 0.5; using closest threshold found 0.6850710478928808.


<trainers.h2o_automl.H2OClassifier at 0x7f895bc87af0>

In [6]:
PycaretClassifier(
        experiment_name=EXPERIMENT_NAME,
        run_name='Pycaret',
        sort_metric='precision',
        df=dataset,
        target='Class',
        threshold=THRESHOLD,
        n_best_models=3,
        data_split_stratify=True,
        nfolds=FOLDS,
        normalize=True,
        transformation=True,
        ignore_low_variance=True,
        remove_multicollinearity=True,
        multicollinearity_threshold=0.95,
        session_id=54321
)

Unnamed: 0,Parameters
bootstrap,True
ccp_alpha,0.0
class_weight,
criterion,gini
max_depth,
max_features,auto
max_leaf_nodes,
max_samples,
min_impurity_decrease,0.0
min_impurity_split,


<trainers.pycaret.PycaretClassifier at 0x7f89d89b2a00>

In [4]:
SparkClassifier(
    df = dataset,
    target_col = 'Class',
    run_name = 'spark_classifier',
    max_mem_size = 4,
    n_cores = 4,
    seed = 90
)

<trainers.spark.SparkClassifier at 0x7f9e204f4a60>

## 4) Setting the Best Model to Production

The final step in this notebook is to promote to production the model with the best value of the selected metric, imported as `CHAMPION_METRIC`. This is done to show that it is possible to create an automated workflow using MLFlow to deploy a model. However, it is also possible to deploy the model using the [UI server](https://www.mlflow.org/docs/latest/model-registry.html#ui-workflow).

Once this is done you can return to the README file to check how the model is now deployed.

In [5]:
# Get the best run according to CHAMPION_METRIC
champion = MlflowClient().search_runs(
    experiment_ids=[
        str(
            mlflow.get_experiment_by_name(name=EXPERIMENT_NAME).experiment_id
        )
    ],
    run_view_type=ViewType.ALL,
    order_by=[f"metrics.{CHAMPION_METRIC} DESC"],
    max_results=1
)
run_id = champion[0].info.run_id

# Register the model in the model registry
model = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=CREDIT_CARD_MODEL_NAME
)

# Add a description to the newly created model version
MlflowClient().update_model_version(
    name=CREDIT_CARD_MODEL_NAME,
    version=model.version,
    description='Deploying model with model registery'
)

# Setting it to production
MlflowClient().transition_model_version_stage(
    name=CREDIT_CARD_MODEL_NAME,
    version=model.version,
    stage="Production"
)

Registered model 'CreditCardDefault' already exists. Creating a new version of this model...
2021/05/12 19:40:19 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: CreditCardDefault, version 3
Created version '3' of model 'CreditCardDefault'.


<ModelVersion: creation_timestamp=1620859218737, current_stage='Production', description='Deploying model with model registery', last_updated_timestamp=1620859219580, name='CreditCardDefault', run_id='498b06c18dbd4d79b3229ccddd7ed9c4', run_link=None, source='./mlruns/1/498b06c18dbd4d79b3229ccddd7ed9c4/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=3>
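The champion selection performed by `search_runs` with `order_by=[f"metrics.{CHAMPION_METRIC} DESC"]` and `max_results=1` amounts to picking the run with the highest value of one metric. The sketch below reproduces that logic in plain Python, without a tracking server, as an illustration only. `RunSummary`, `pick_champion` and the metric values are hypothetical and are not part of MLFlow or of this project.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Hypothetical stand-in for the parts of an MLFlow run used here."""
    run_id: str
    metrics: dict = field(default_factory=dict)


def pick_champion(runs, metric):
    """Return the run with the highest value of `metric`.

    Runs that never logged the metric are ignored in this sketch.
    """
    scored = [r for r in runs if metric in r.metrics]
    if not scored:
        raise ValueError(f"no run logged the metric {metric!r}")
    return max(scored, key=lambda r: r.metrics[metric])


# Toy runs with made-up metric values
runs = [
    RunSummary("run-a", {"precision": 0.71}),
    RunSummary("run-b", {"precision": 0.93}),
    RunSummary("run-c", {"recall": 0.99}),  # did not log precision
]

champion = pick_champion(runs, "precision")
print(champion.run_id)
```

Ordering in descending order only makes sense for metrics where higher is better. For a loss-like metric the equivalent choice would be `min` instead of `max`.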