# Ml-Pipes

The objective of this project is to give a gentle introduction to how a machine learning model can be trained and deployed, and to how [MLFlow](https://www.mlflow.org/), [FastAPI](https://fastapi.tiangolo.com/) and [Docker](https://docs.docker.com/) can facilitate a couple of aspects of such a task. Within this context, this notebook represents the development part of the model, where a data scientist would create a few models, evaluate them using some metrics and select the best one based on a single metric. Many different machine learning tools can be used to create models. Note that the objective of this notebook is to give a broad overview of an end-to-end machine learning process, so it is recommended to keep the documentation of the frameworks used here at hand as a companion. The project's README file explains how to reproduce the whole project.

The rest of the notebook is organized as follows:
- 1) The Problem;
- 2) Setting up an MLFlow Experiment;
- 3) Training different Models;
- 4) Setting the Best Model to Production.

The code below imports everything that will be used throughout the notebook.

In [1]:
# Import Libs
import pandas as pd
pd.set_option('display.max_columns', 500)

import mlflow
from mlflow.tracking import MlflowClient
from mlflow.entities import ViewType

from settings import EXPERIMENT_NAME, FOLDS, CREDIT_CARD_MODEL_NAME,\
     CHAMPION_METRIC, THRESHOLD  # pylint: disable=import-error
from dao.CreditCardDefault \
    import load_creditcard_dataset  # pylint: disable=import-error
from trainers.h2o_automl import H2OClassifier  # pylint: disable=import-error
from trainers.pycaret import PycaretClassifier  # pylint: disable=import-error

## 1) The problem

The first thing we need in order to build a model is a problem to solve. The example used here is the [Credit Card Default dataset from Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud), where the objective is to predict, based on a few features, whether or not a client will default on their credit card. The target variable can take the values 1, for default, and 0, for non-default. It is therefore a binary classification problem.

Below, the dataset is loaded and its first rows are displayed. Note that the Time column has been removed from the original dataset.

In [2]:
dataset = load_creditcard_dataset()
dataset.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
182875,2.008438,0.162598,-2.218728,0.954302,0.196069,-1.433587,0.097916,-0.19385,1.029088,-1.008799,-0.59545,-1.173956,-2.300362,-2.225345,-0.122508,0.37276,2.212081,0.782419,-0.332493,-0.280114,-0.033334,0.139415,-0.016894,-0.206881,0.11962,0.735991,-0.047406,-0.016285,12.31,0
28462,-0.763624,1.027956,0.739337,-0.6385,0.204582,-0.541075,0.570384,0.284018,-0.616243,-0.961992,-1.336262,0.291892,1.086139,0.205848,0.240597,0.72697,-0.837433,-0.035104,-0.421264,-0.08355,0.068373,0.007377,-0.092307,-0.404129,-0.211563,0.243181,-0.049706,0.066394,33.99,0
187056,1.486471,-0.83732,-0.732272,1.629191,-0.233813,0.313185,0.078676,-0.027234,0.988416,-0.165763,-1.209107,1.230106,0.906739,-0.436578,-1.199565,-0.437594,-0.114801,-0.974397,0.117529,0.283902,-0.376984,-1.305845,0.243207,0.557332,-0.35021,-1.157152,0.014695,0.018601,245.23,0
126572,-2.868247,-1.139031,-0.708736,0.191714,-1.597907,0.124855,1.662722,0.128932,-1.338376,0.121968,-0.993437,0.135899,0.503096,0.248301,-0.505767,-1.577313,0.223838,0.577544,-1.885893,-1.037271,-0.39494,-0.084426,0.046102,0.112023,-0.337455,-0.527738,0.232477,-0.366651,502.89,0
113013,1.205762,0.013247,0.951732,1.138075,-0.479702,0.307368,-0.500346,0.082905,0.785942,-0.233807,-1.357437,0.776953,1.255903,-0.6231,0.122262,0.272778,-0.61451,0.080188,0.096515,-0.03056,-0.103054,-0.041808,-0.124058,-0.415429,0.562038,-0.361177,0.077714,0.032766,12.99,0


## 2) Setting up an MLFlow Experiment

Now that a problem has been stated and some data to help solve it has been gathered, the next step is to set up an MLFlow experiment to log our models. **MLFlow is built upon the concept of experiments. An experiment is a series of fits, where parameters, metrics, models and artifacts can be associated with the respective fit (in a way that is agnostic to the machine learning package).**

The code below tries to create an experiment; if that experiment already exists, it is looked up instead. Either way, it is then set as the active experiment.
 

In [3]:
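# Point MLflow at the local SQLite tracking store
mlflow.set_tracking_uri("sqlite:///mlruns.db")

# Create the experiment; creation raises if the name is already taken,
# in which case the existing experiment is looked up instead
try:
    experiment = mlflow.create_experiment(EXPERIMENT_NAME)
except Exception:
    client = MlflowClient()
    experiment = client.get_experiment_by_name(EXPERIMENT_NAME)

# Make it the active experiment for the runs that follow
mlflow.set_experiment(EXPERIMENT_NAME)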

2021/05/11 20:41:36 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2021/05/11 20:41:37 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  
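The create-or-reuse logic above can also be written without relying on an exception. The following is a minimal sketch, not the project's code: `get_or_create_experiment` is a hypothetical helper, and `client` is assumed to be any object exposing `get_experiment_by_name` and `create_experiment` the way `MlflowClient` does. A small in-memory stand-in (`FakeClient`, also hypothetical) is included so the sketch runs without a tracking store.

```python
from types import SimpleNamespace


def get_or_create_experiment(client, name):
    """Return the ID of the experiment called `name`, creating it if needed.

    `client` is assumed to behave like `mlflow.tracking.MlflowClient`:
    `get_experiment_by_name` returns an object with an `experiment_id`
    attribute (or None when nothing matches), and `create_experiment`
    returns the new experiment's ID.
    """
    experiment = client.get_experiment_by_name(name)
    if experiment is not None:
        return experiment.experiment_id
    return client.create_experiment(name)


class FakeClient:
    """Hypothetical in-memory stand-in, used only to exercise the helper."""

    def __init__(self):
        self._ids_by_name = {}

    def get_experiment_by_name(self, name):
        experiment_id = self._ids_by_name.get(name)
        if experiment_id is None:
            return None
        return SimpleNamespace(experiment_id=experiment_id, name=name)

    def create_experiment(self, name):
        experiment_id = str(len(self._ids_by_name) + 1)
        self._ids_by_name[name] = experiment_id
        return experiment_id


client = FakeClient()
first_id = get_or_create_experiment(client, "demo-experiment")
second_id = get_or_create_experiment(client, "demo-experiment")
print(first_id == second_id)  # True: the second call reuses the experiment
```

With MLflow installed, passing `MlflowClient()` in place of `FakeClient()` should follow the same path: look the experiment up first, and create it only when the lookup returns nothing.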

## 3) Training different Models

The next step is to train, evaluate and log a few different models. To demonstrate that MLFlow allows us to use different machine learning packages, we will train H2O AutoML and scikit-learn models (the latter through PyCaret). This is where MLFlow is put into action: for each model that is fitted, a few parameters, metrics and artifacts are logged, along with the model itself. To understand how this is done, check out the classifier definitions in the `src/trainers/` folder and the [MLFlow Logging Documentation](https://www.mlflow.org/docs/latest/tracking.html#logging-data-to-runs).

The next cells train the different classifiers. Once they finish running, you can launch the [MLFlow Tracking UI](https://www.mlflow.org/docs/latest/tracking.html#tracking-ui) by executing `mlflow ui -p 5000 --backend-store-uri sqlite:///mlruns.db` in a terminal inside the `src/` folder, and see the results at [127.0.0.1:5000](http://127.0.0.1:5000).

In [4]:
H2OClassifier(
    run_name='H2O',
    max_mem_size='3G',
    threshold=THRESHOLD,
    df=dataset,
    target_col='Class',
    sort_metric='aucpr',
    max_models=8,
    max_runtime_secs=60,
    nfolds=FOLDS,
    seed=90
)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.11" 2021-04-20; OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04); OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
  Starting server from /media/vinicius/Dados/poetry/poetry-envs/ml-pipes-VBbH4xSK-py3.8/lib/python3.8/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp1znk3ipq
  JVM stdout: /tmp/tmp1znk3ipq/h2o_vinicius_started_from_python.out
  JVM stderr: /tmp/tmp1znk3ipq/h2o_vinicius_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,America/Sao_Paulo
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.1
H2O_cluster_version_age:,1 month and 16 days
H2O_cluster_name:,H2O_from_python_vinicius_sr66om
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


Parse progress: |█████████████████████████████████████████████████████████| 100%
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
AutoML progress: |
20:41:44.359: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.

████████████████████████████████████████████████████████| 100%
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Could not find exact threshold 0.5; using closest threshold found 0.48254385590553284.

<trainers.h2o_automl.H2OClassifier at 0x7fa72d0f86a0>

In [5]:
PycaretClassifier(
        experiment_name=EXPERIMENT_NAME,
        run_name='Pycaret',
        sort_metric='precision',
        df=dataset,
        target='Class',
        threshold=THRESHOLD,
        n_best_models=3,
        data_split_stratify=True,
        nfolds=FOLDS,
        normalize=True,
        transformation=True,
        ignore_low_variance=True,
        remove_multicollinearity=True,
        multicollinearity_threshold=0.95,
        session_id=54321
)

Unnamed: 0,Parameters
n_components,
priors,
shrinkage,
solver,svd
store_covariance,False
tol,0.0001


INFO  [logs] Visual Rendered Successfully
INFO  [logs] plot_model() succesfully completed......................................
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.


<trainers.pycaret.PycaretClassifier at 0x7fa7a3d69a60>
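The trainer classes live in `src/trainers/` and are not reproduced in this notebook. As a rough sketch of the per-fit logging described at the start of this section, the hypothetical `log_fit` helper below opens one run and records parameters and metrics through `start_run`, `log_params` and `log_metrics`, which are functions of the `mlflow` module. The `tracker` argument is assumed to be that module; a recording stand-in (`RecordingTracker`, hypothetical) is used here so the example runs without MLflow, and the parameter and metric values are made up.

```python
from contextlib import contextmanager


def log_fit(tracker, run_name, params, metrics):
    """Log one fit as a single run.

    `tracker` is assumed to expose `start_run`, `log_params` and
    `log_metrics` with the same call shape as the `mlflow` module.
    """
    with tracker.start_run(run_name=run_name):
        tracker.log_params(params)
        tracker.log_metrics(metrics)


class RecordingTracker:
    """Hypothetical stand-in that records calls instead of storing runs."""

    def __init__(self):
        self.runs = []
        self._active = None

    @contextmanager
    def start_run(self, run_name=None):
        # Open a new in-memory run and close it when the block exits.
        self._active = {"run_name": run_name, "params": {}, "metrics": {}}
        try:
            yield self._active
        finally:
            self.runs.append(self._active)
            self._active = None

    def log_params(self, params):
        self._active["params"].update(params)

    def log_metrics(self, metrics):
        self._active["metrics"].update(metrics)


tracker = RecordingTracker()
# Illustrative values only; they are not results from this notebook.
log_fit(tracker, "example-fit", {"nfolds": 5}, {"precision": 0.9})
print(tracker.runs)
```

With MLflow installed, calling `log_fit(mlflow, ...)` would write to the active experiment instead. Models and other artifacts have their own logging calls, which are covered in the MLFlow logging documentation.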

## 4) Setting the Best Model to Production

The final step in this notebook is to promote to production the model with the best value of the selected metric, imported as `CHAMPION_METRIC`. This is done to show that it is possible to create an automated workflow using MLFlow to deploy a model. However, it is also possible to deploy the model using the [UI server](https://www.mlflow.org/docs/latest/model-registry.html#ui-workflow).

Once this is done, you can return to the README file to check how the model is now deployed.

In [6]:
# Getting The best Model according to CHAMPION_METRIC
champion = MlflowClient().search_runs(
    experiment_ids=[
        str(
            mlflow.get_experiment_by_name(name=EXPERIMENT_NAME).experiment_id
        )
    ],
    run_view_type=ViewType.ALL,
    order_by=[f"metrics.{CHAMPION_METRIC} DESC"],
    max_results=1
)
run_id = champion[0].info.run_id

# Registering the model in the model registry
model = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name=CREDIT_CARD_MODEL_NAME
)

# Adding a description to the newly registered version
MlflowClient().update_model_version(
    name=CREDIT_CARD_MODEL_NAME,
    version=model.version,
    description='Deploying model with model registery'
)

# Setting it to production
MlflowClient().transition_model_version_stage(
    name=CREDIT_CARD_MODEL_NAME,
    version=model.version,
    stage="Production"
)

INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Successfully registered model 'CreditCardDefault'.
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
2021/05/11 20:43:40 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: CreditCardDefault, version 1
Created version '1' of model 'CreditCardDefault'.

<ModelVersion: creation_timestamp=1620776620726, current_stage='Production', description='Deploying model with model registery', last_updated_timestamp=1620776621049, name='CreditCardDefault', run_id='da26388cb5d04ec78d027cd0010855a3', run_link=None, source='./mlruns/1/da26388cb5d04ec78d027cd0010855a3/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=1>
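The `search_runs` call above asks the tracking store to sort runs by `CHAMPION_METRIC` in descending order and return only the first one. The same selection rule is sketched below in plain Python over made-up run records; `pick_champion`, the run IDs and the metric values are all hypothetical and are not results from this notebook. Treating runs that lack the metric as ineligible is a choice made for this sketch, not a statement about how MLFlow orders them.

```python
def pick_champion(runs, metric):
    """Return the run with the highest value of `metric`.

    `runs` is a list of dicts shaped like {"run_id": str, "metrics": dict}.
    Runs that did not log `metric` are skipped (an assumption of this sketch).
    """
    candidates = [run for run in runs if metric in run["metrics"]]
    if not candidates:
        raise ValueError(f"no run logged the metric {metric!r}")
    return max(candidates, key=lambda run: run["metrics"][metric])


# Made-up records for illustration only.
example_runs = [
    {"run_id": "run-a", "metrics": {"precision": 0.81, "recall": 0.70}},
    {"run_id": "run-b", "metrics": {"precision": 0.93, "recall": 0.62}},
    {"run_id": "run-c", "metrics": {"recall": 0.75}},
]

champion = pick_champion(example_runs, "precision")
print(champion["run_id"])  # run-b
```

Changing the metric changes the champion: ranking the same records by `recall` would pick `run-c`, which is why the choice of `CHAMPION_METRIC` matters as much as the models themselves.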