# ML Lifecycle Management

> _"Hardest Part of ML isn’t ML, it’s Data"_

![Hardest part of ML](images/mlflow_tech_icon.png)

[Hidden Technical Debt in Machine Learning Systems](http://papers.neurips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf) (Google NIPS, 2015)  

## Objective

This module introduces [MLflow](https://mlflow.org/docs/latest/index.html) for machine learning lifecycle management. We will use MLflow for...

1. managing model experiments,
1. tracking hyperparameter tuning,
1. registering and serving models.

MLflow is a very comprehensive tool, so we are only going to scratch the surface. However, we will provide you with resources to dig deeper into MLflow.

## Problem

The ML process can be tedious and difficult, and it can generate substantial technical debt.

* How do we ***track***...
  - model runs
  - hyperparameter experiments
  - performance metrics

* How do we manage ML ***projects*** for...
  - reproducibility
  - collaboration

* How do we package ML ***models*** for downstream deployment?

* How do we ***register*** models for managing...
  - model lineage
  - model versioning
  - stage transitions (e.g. dev to stage to prod)

## Intro to MLflow

MLflow is an open source platform designed to manage the complete Machine Learning Lifecycle.

![](images/mlflow_capabilities.png)

* Used heavily --> 1.7M+ monthly downloads
* Well supported --> 170+ contributors & 40 contributing organizations
* Well documented
   - [mlflow.org](https://mlflow.org/)
   - [github.com/mlflow](https://github.com/mlflow)
   - [Slack channel](https://join.slack.com/t/mlflow-users/shared_invite/zt-g6qwro5u-odM7pRnZxNX_w56mcsHp8g)
   - [stackoverflow.com/questions/tagged/mlflow](https://stackoverflow.com/questions/tagged/mlflow)
   - [twitter.com/MLflow](https://twitter.com/MLflow)
   - [databricks.com/mlflow](https://databricks.com/product/managed-mlflow)

## Model Tracking

Key concepts in tracking:

* __Date/time:__ Start and end time of each model run.
* __Parameters:__ Key-value inputs to your code.
* __Metrics:__ Numeric values to track how your model’s loss function is converging.
* __Artifacts:__ Output files in any format, including models.
* __Source:__ The code that was run.

There are several ways to record your modeling experiment:

- SQLAlchemy compatible database
- remotely to a tracking server
- __locally__

Let's create an experiment:

In [1]:
import mlflow

experiment = mlflow.set_experiment("Predicting income")

2021/12/29 10:58:43 INFO mlflow.tracking.fluent: Experiment with name 'Predicting income' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///Users/bradley.boehmke/Desktop/workspace/trainings/advanced-python-datasci/notebooks/mlruns/1', experiment_id='1', lifecycle_stage='active', name='Predicting income', tags={}>

<div class="admonition tip alert alert-warning">
    <p class="first admonition-title" style="font-weight: bold;"><b>Tip</b></p>
    <p class="last"><tt class="docutils literal">set_experiment</tt> sets the active experiment, creating it first if it does not already exist.</p>
</div>

Note the new local `mlruns/` directory:
    
![mlruns directory](images/mlruns_directory.png)

Before we record a model run let's import and prepare our data:

In [2]:
# packages used
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# import data
adult_census = pd.read_csv('../data/adult-census.csv')

# separate feature & target data
target = adult_census['class']
features = adult_census.drop(columns='class')

# drop the duplicated column `"education-num"` as stated in the data exploration notebook
features = features.drop(columns='education-num')

# split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)

# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)

# preprocessors to handle numeric and categorical features
numerical_preprocessor = StandardScaler()
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)
])

To log a model run we use `start_run()` along with various other logging functions:

- `log_param(s)`: to log parameters of interest
- `log_metric(s)`: to log model metrics of interest
- `set_tag(s)`: to log descriptive information (version number, platform the run executed on, etc.)
- `log_artifact(s)`: allows you to log items such as data, models, files, etc.

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


mlflow.start_run(run_name='first_mlflow_run')

mlflow.log_param('max_iter', 500)
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

_ = model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
mlflow.log_metric('accuracy', accuracy)

mlflow.end_run()

A more common approach you'll see is to use `start_run()` as a [context manager](https://book.pythontips.com/en/latest/context_managers.html).

In [4]:
with mlflow.start_run(run_name='run_as_context_mgr') as run:

    mlflow.log_param('max_iter', 500)
    model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

    _ = model.fit(X_train, y_train)

    accuracy = model.score(X_test, y_test)
    mlflow.log_metric('accuracy', accuracy)

<div class="admonition tip alert alert-warning">
    <p class="first admonition-title" style="font-weight: bold;"><b>Question?</b></p>
<p class="last">
What other useful information could we log?
</p>
</div>

The following also logs the model type as a tag and the model itself as an artifact:

In [5]:
with mlflow.start_run(run_name='baseline_model') as run:

    mlflow.set_tag('Estimator', 'LogisticRegression')
    mlflow.log_param('max_iter', 500)
    model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

    _ = model.fit(X_train, y_train)
    mlflow.sklearn.log_model(model, 'baseline_model')

    accuracy = model.score(X_test, y_test)
    mlflow.log_metric('accuracy', accuracy)

<div class="admonition tip alert alert-warning">
    <p class="first admonition-title" style="font-weight: bold;"><b>Your Turn</b></p>
<p class="last">
    Log a model run using <tt class="docutils literal">KNeighborsClassifier()</tt> as a classifier. Pick one or more parameters to log (e.g. <tt class="docutils literal">n_neighbors</tt>). Record the ROC AUC metric using <tt class="docutils literal">sklearn.metrics.roc_auc_score</tt>.
</p>
</div>

## MLflow UI

We can retrieve our model run information programmatically, but MLflow also provides a UI for browsing and comparing runs.

- Remove the `#` from the following line of code
- Click on the local URL provided

![Launch MLflow UI](images/local_url.png)

In [7]:
#!mlflow ui

<div class="admonition warning alert alert-danger">
    <p class="first admonition-title" style="font-weight: bold;"><b>Warning</b></p>
<p class="last">You'll need to stop the previous code cell when you are done viewing the MLflow UI.
</p>
</div>

## Auto logging

MLflow has built-in [auto logging](https://www.mlflow.org/docs/latest/tracking.html#automatic-logging) for many common model libraries:

- Scikit-learn
- TensorFlow & Keras
- XGBoost
- Spark ML
- PyTorch
- etc.

This can simplify our logging.

For example, the built-in `mlflow.sklearn.autolog` functionality will automatically log:

- Training score obtained by `estimator.score`
- Parameters obtained by `estimator.get_params`
- Model class name
- Fitted estimator as an artifact

In [8]:
# enable autologging
mlflow.sklearn.autolog()

with mlflow.start_run(run_name='autolog_run') as run:

    model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
    _ = model.fit(X_train, y_train)

    mlflow.log_metric('test_accuracy', model.score(X_test, y_test))


Let's check out this auto-logged run in the UI. You'll notice some additional information logged.

In [11]:
#!mlflow ui

<div class="admonition tip alert alert-warning">
    <p class="first admonition-title" style="font-weight: bold;"><b>Your Turn</b></p>
<p class="last">
    Re-run the <tt class="docutils literal">KNeighborsClassifier()</tt> model from the previous Your Turn exercise; this time, use <tt class="docutils literal">mlflow.sklearn.autolog</tt> for autologging.
</p>
</div>

## Hyperparameter tuning

So far, we've just been logging individual runs.

However, MLflow makes it easy to track hyperparameter search experiments:

In [9]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# basic model object
knn = KNeighborsClassifier()

# Create grid of hyperparameter values
hyper_grid = {'knn__n_neighbors': [5, 10, 15, 20]}

# create preprocessor & modeling pipeline
pipeline = Pipeline([('preprocessor', preprocessor), ('knn', knn)])

# enable autologging
mlflow.sklearn.autolog()

with mlflow.start_run(run_name='knn_grid_search') as run:
    # Tune a knn model using grid search
    grid_search = GridSearchCV(pipeline, hyper_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    results = grid_search.fit(X_train, y_train)

2021/12/29 11:04:41 INFO mlflow.sklearn.utils: Logging the 5 best runs, no runs will be omitted.

If we look at the MLflow UI, we'll notice that autologging a parameter search results in a single parent run and nested child runs, which contain:

* Parent
   - Training score
   - Best parameter combination
   - Fitted best estimator
   - and more
* Child
   - CV test score for each parameter combination

In [19]:
#!mlflow ui

<div class="admonition tip alert alert-warning">
    <p class="first admonition-title" style="font-weight: bold;"><b>Your Turn</b></p>
<p class="last">
    Based on the results we found, run another grid search with adjusted <tt class="docutils literal">n_neighbors</tt> and see if the results improve.
</p>
</div>

## Accessing run information

We can programmatically access our MLflow run information:

In [11]:
# experiment_ids is documented as a list of experiment IDs
df = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
df

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.std_score_time,metrics.rank_test_score,metrics.mean_score_time,metrics.mean_fit_time,...,tags.mlflow.autologging,tags.mlflow.parentRunId,tags.mlflow.user,tags.mlflow.source.name,tags.estimator_name,tags.mlflow.source.type,tags.estimator_class,tags.mlflow.log-model.history,tags.mlflow.runName,tags.Estimator
0,448506ac88724441ae908796af21c51b,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:59:55.663000+00:00,2021-12-29 16:04:41.955000+00:00,0.241726,2.0,8.534959,0.066297,...,sklearn,658f56509691443e91201f10cc71bbd9,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,Pipeline,LOCAL,sklearn.pipeline.Pipeline,,,
1,44e2b9f26941456c904cbed7ae31de56,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:59:55.663000+00:00,2021-12-29 16:04:41.955000+00:00,0.282919,4.0,14.442268,0.066455,...,sklearn,658f56509691443e91201f10cc71bbd9,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,Pipeline,LOCAL,sklearn.pipeline.Pipeline,,,
2,53c4bf41200042a2a23b60a84f1f8642,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:59:55.663000+00:00,2021-12-29 16:04:41.955000+00:00,0.257934,1.0,8.672409,0.060784,...,sklearn,658f56509691443e91201f10cc71bbd9,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,Pipeline,LOCAL,sklearn.pipeline.Pipeline,,,
3,5bdb217baec94c8093208fb5c4a103f5,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:59:55.663000+00:00,2021-12-29 16:04:41.955000+00:00,0.255885,3.0,14.387484,0.075267,...,sklearn,658f56509691443e91201f10cc71bbd9,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,Pipeline,LOCAL,sklearn.pipeline.Pipeline,,,
4,658f56509691443e91201f10cc71bbd9,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:59:55.663000+00:00,2021-12-29 16:04:42.021000+00:00,,,,,...,,,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,GridSearchCV,LOCAL,sklearn.model_selection._search.GridSearchCV,"[{""run_id"": ""658f56509691443e91201f10cc71bbd9""...",knn_grid_search,
5,0645840a3bfc473f9cbd4db95a51ae14,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:59:45.657000+00:00,2021-12-29 15:59:48.416000+00:00,,,,,...,,,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,Pipeline,LOCAL,sklearn.pipeline.Pipeline,"[{""run_id"": ""0645840a3bfc473f9cbd4db95a51ae14""...",autolog_run,
6,13c968b931f84ccda1cd1d890e065f71,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:58:58.381000+00:00,2021-12-29 15:58:59.875000+00:00,,,,,...,,,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,,LOCAL,,"[{""run_id"": ""13c968b931f84ccda1cd1d890e065f71""...",baseline_model,LogisticRegression
7,e2cd789a7e4846ecae1cfc7d5234483a,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:58:54.157000+00:00,2021-12-29 15:58:54.656000+00:00,,,,,...,,,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,,LOCAL,,,run_as_context_mgr,
8,7fdd4ab7845d4cbb8d8de6d97100855f,1,FINISHED,file:///Users/bradley.boehmke/Desktop/workspac...,2021-12-29 15:58:52.394000+00:00,2021-12-29 15:58:52.873000+00:00,,,,,...,,,bradley.boehmke,/Users/bradley.boehmke/Downloads/ENTER/envs/uc...,,LOCAL,,,first_mlflow_run,


The returned DataFrame lets us filter for the KNN grid search run and access its run ID:

In [31]:
model_filter = df['tags.mlflow.runName'] == 'knn_grid_search'
run_id = df.loc[model_filter, 'run_id'].item()
run_id

'658f56509691443e91201f10cc71bbd9'

We can use this run ID to load the model:

In [32]:
model_path = f'mlruns/{experiment.experiment_id}/{run_id}/artifacts/best_estimator'
model = mlflow.sklearn.load_model(model_path)
model

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['workclass', 'education',
                                                   'marital-status',
                                                   'occupation', 'relationship',
                                                   'race', 'sex',
                                                   'native-country']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['age', 'capital-gain',
                                                   'capital-loss',
                                                   'hours-per-week'])])),
                ('knn', KNeighborsClassifier(n_neighbors=20))])

However, using the ___Model Registry___ is a more sophisticated approach for saving and accessing models.

## Registering models

MLflow's ___Model Registry___ offers a centralized and collaborative approach to model lifecycle management.

<div class="admonition note alert alert-info">
    <p class="first admonition-title" style="font-weight: bold;"><b>Note</b></p>
<p class="last">You can register models programmatically or via the MLflow UI. However, to do so locally requires additional setup that we don't have time for. If using one of the main cloud providers (e.g. Databricks on AWS, Azure, or GCP) the setup is already done and model registration is very straightforward.</p>
</div>

__One Collaborative Hub__: The Model Registry provides a central hub for making models discoverable, improving collaboration and knowledge sharing across the organization.

![](images/registered_models.png)

__Manage the entire Model Lifecycle (MLOps)__: The Model Registry provides lifecycle management for models from experimentation to deployment, improving reliability and robustness of the model deployment process.

1. Overview of active model versions and their deployment stage
2. Request/Approval workflow for transitioning deployment stages

![](images/model_registry_mlops.png)

__Visibility and Governance__: The Model Registry provides full visibility into the deployment stage of all models, who requested and approved changes, allowing for full governance and auditability.

1. Full activity log of stage transition requests, approvals, etc.

![](images/model_registry_visibility.png)

Full provenance from a model marked as production in the Registry back to:
1. Run that produced the model
2. Notebook that produced the run
3. Exact revision history of the notebook that produced the run

<img src='images/model_registry_governance.png' id="logo" height="70%" width="70%"/>

## Wrapping up

This module introduced you to [MLflow](https://mlflow.org/docs/latest/index.html) for machine learning lifecycle management. We provided a very brief introduction to MLflow for... 

1. managing model experiments,
1. tracking hyperparameter tuning,
1. registering and serving models.

MLflow provides so much more than we have time to cover; however, this should give you a decent foundation to build upon. 

We recommend the following resources to learn more:

   - [mlflow.org](https://mlflow.org/)
   - [github.com/mlflow](https://github.com/mlflow)
   - [Slack channel](https://join.slack.com/t/mlflow-users/shared_invite/zt-g6qwro5u-odM7pRnZxNX_w56mcsHp8g)
   - [stackoverflow.com/questions/tagged/mlflow](https://stackoverflow.com/questions/tagged/mlflow)
   - [twitter.com/MLflow](https://twitter.com/MLflow)
   - [databricks.com/mlflow](https://databricks.com/product/managed-mlflow)