
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Model Management

An MLflow model is a standard format for packaging models so that they can be used by a variety of downstream tools.  This lesson provides a generalizable way of handling machine learning models created in and deployed to a variety of environments.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Introduce model management best practices
 - Store and use different flavors of models for different deployment environments
 - Apply models combined with arbitrary pre and post-processing code using Python models

<iframe  
src="//fast.wistia.net/embed/iframe/bbyhkgxzoz?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/bbyhkgxzoz?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Managing Machine Learning Models

Once a model has been trained and bundled with the environment it was trained in, the next step is to package the model so that it can be used by a variety of serving tools.  The current deployment options include Docker-based REST servers, Spark using streaming or batch, and cloud platforms such as Azure ML and AWS SageMaker.  Packaging the final model in a platform-agnostic way offers the most flexibility in deployment options and allows for model reuse across a number of platforms.

**MLflow Models is a tool for deploying models that's agnostic to both the framework the model was trained in and the environment it's being deployed to.  It's a convention for packaging machine learning models that offers self-contained code, environments, and models.**  The main abstraction in this package is the concept of **flavors,** which are different ways the model can be used.  For instance, a TensorFlow model can be loaded as a TensorFlow DAG or as a Python function.  Using the MLflow model convention allows the model to be used regardless of the library that originally trained it.

The primary difference between MLflow projects and models is that models are geared more towards inference and serving.  The `python_function` flavor gives a generic way of bundling a model regardless of whether `sklearn`, `keras`, or any other machine learning library trained it.  We can thereby deploy a Python function without worrying about the underlying format of the model.  **MLflow therefore maps any training framework to any deployment environment**, greatly reducing the complexity of inference.
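The idea of a single generic flavor can be sketched without MLflow at all. In the minimal sketch below, `ThresholdModel`, `LinearModel`, and `PyFuncLike` are hypothetical stand-ins, not MLflow classes. Two models expose different native APIs, and a thin wrapper gives both the same `predict` method, which is roughly the role the `python_function` flavor plays.

```python
class ThresholdModel:
    """Hypothetical 'framework A' model whose native method is `classify`."""

    def __init__(self, threshold):
        self.threshold = threshold

    def classify(self, values):
        return [1 if v >= self.threshold else 0 for v in values]


class LinearModel:
    """Hypothetical 'framework B' model whose native method is `run`."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept

    def run(self, values):
        return [self.slope * v + self.intercept for v in values]


class PyFuncLike:
    """Generic wrapper that hides the native API behind a single predict()."""

    def __init__(self, native_call):
        self._native_call = native_call

    def predict(self, model_input):
        return self._native_call(model_input)


# Each model is wrapped once; the serving code never sees the native API.
models = {
    "a": PyFuncLike(ThresholdModel(threshold=3).classify),
    "b": PyFuncLike(LinearModel(slope=2, intercept=1).run),
}


def serve(model, batch):
    """Serving code written once against the common interface."""
    return model.predict(batch)


predictions = {name: serve(model, [1, 3, 5]) for name, model in models.items()}
```

With a shared interface, each framework and each deployment target needs one adapter, instead of one integration per framework and target pair.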

Finally, arbitrary pre- and post-processing steps, such as data loading, cleansing, and featurization, can be included in the pipeline.  This means that the full pipeline, not just the model, can be preserved.

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlflow-models-enviornments.png" style="height: 400px; margin: 20px"/></div>

Run the following cell to set up our environment.

In [6]:
%run "./Includes/Classroom-Setup"

### Model Flavors

Flavors offer a way of saving models that's agnostic to the training environment, making them significantly easier to use across deployment options.  Some of the most popular built-in flavors include the following:<br><br>

* <a href="https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#module-mlflow.pyfunc" target="_blank">mlflow.pyfunc</a>
* <a href="https://mlflow.org/docs/latest/python_api/mlflow.keras.html#module-mlflow.keras" target="_blank">mlflow.keras</a>
* <a href="https://mlflow.org/docs/latest/python_api/mlflow.pytorch.html#module-mlflow.pytorch" target="_blank">mlflow.pytorch</a>
* <a href="https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#module-mlflow.sklearn" target="_blank">mlflow.sklearn</a>
* <a href="https://mlflow.org/docs/latest/python_api/mlflow.spark.html#module-mlflow.spark" target="_blank">mlflow.spark</a>
* <a href="https://mlflow.org/docs/latest/python_api/mlflow.tensorflow.html#module-mlflow.tensorflow" target="_blank">mlflow.tensorflow</a>

Models also offer reproducibility since the run ID and the timestamp of the run are preserved as well.  

<a href="https://mlflow.org/docs/latest/python_api/index.html" target="_blank">You can see all of the flavors and modules here.</a>

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/mlflow-models.png" style="height: 400px; margin: 20px"/></div>

To demonstrate the power of model flavors, let's first create two models using different frameworks.

Import the data.

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv")
X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1), df[["price"]].values.ravel(), random_state=42)

In [10]:
display(df)

host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
1.0,0,0,37.76931037734077,-122.43385634489,0,0,3.0,1.0,1.0,2.0,0,1.0,127.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,170.0
2.0,1,1,37.745112331410034,-122.42101788836888,0,0,5.0,1.0,2.0,3.0,0,30.0,112.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,235.0
10.0,2,0,37.766689597862175,-122.45250461761628,0,1,2.0,4.0,1.0,1.0,0,32.0,17.0,85.0,8.0,8.0,9.0,9.0,9.0,8.0,65.0
4.0,3,2,37.73074592978503,-122.44840862635228,1,1,1.0,2.0,1.0,1.0,0,3.0,76.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,60.0
10.0,2,0,37.76487219421756,-122.45182799146508,1,1,2.0,4.0,1.0,1.0,0,32.0,7.0,91.0,9.0,9.0,9.0,9.0,9.0,9.0,65.0
2.0,0,0,37.77524858589268,-122.43637374831292,1,0,5.0,1.5,2.0,2.0,0,5.0,26.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,575.0
1.0,0,3,37.78470745496072,-122.44555431261593,0,0,7.0,1.0,2.0,1.0,0,2.0,27.0,88.0,9.0,7.0,10.0,10.0,9.0,9.0,255.0
2.0,4,1,37.75918889708064,-122.42236687240562,0,1,3.0,1.0,1.0,2.0,0,1.0,559.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,139.0
1.0,4,1,37.75174004606522,-122.4094205953428,0,0,4.0,2.5,3.0,3.0,0,3.0,24.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,285.0
1.0,5,4,37.76258885144137,-122.40543055237004,1,1,2.0,1.0,1.0,1.0,0,1.0,386.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,135.0


Train a random forest model.

In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rf = RandomForestRegressor(n_estimators=100, max_depth=5)
rf.fit(X_train, y_train)

rf_mse = mean_squared_error(y_test,rf.predict(X_test))

rf_mse
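For reference, mean squared error is the average of the squared differences between actual and predicted values. A minimal sketch in plain Python follows; the prices are made up rather than taken from the Airbnb data, and `mse` is a hypothetical helper, not the `sklearn` function.

```python
def mse(y_true, y_pred):
    """Average of squared residuals: sum((y - y_hat)^2) / n."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


actual = [170.0, 235.0, 65.0]      # made-up prices
predicted = [160.0, 240.0, 80.0]   # made-up predictions

# Residuals are 10, -5, and -15, so the squared errors are 100, 25, and 225.
example_mse = mse(actual, predicted)
```

Because the errors are squared, the result is in squared units of `price`, and large misses are penalized more heavily than small ones.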

Train a neural network.

In [14]:
import tensorflow as tf
tf.set_random_seed(42) # For reproducibility

from keras.models import Sequential
from keras.layers import Dense

nn = Sequential([
  Dense(40, input_dim=21, activation='relu'),
  Dense(20, activation='relu'),
  Dense(1, activation='linear')
])

nn.compile(optimizer="adam", loss="mse")
nn.fit(X_train,y_train,validation_split=.2, epochs=40, verbose=2)

nn_mse = mean_squared_error(y_test,nn.predict(X_test)) 

nn_mse

Now log the two models.

In [16]:
import mlflow.sklearn

with mlflow.start_run(run_name="RF Model") as run:
  mlflow.sklearn.log_model(rf,"model")
  mlflow.log_metric("mse",rf_mse)

  sklearnRunID = run.info.run_id # run_uuid is the deprecated name for this field
  sklearnURI = run.info.artifact_uri 
  
  experimentID = run.info.experiment_id 

In [17]:
import mlflow.keras

with mlflow.start_run(run_name="NN Model") as run:
  mlflow.keras.log_model(nn,"model")
  mlflow.log_metric("mse",nn_mse)

  kerasRunID = run.info.run_id # run_uuid is the deprecated name for this field
  kerasURI = run.info.artifact_uri 

Look at the model flavors.  Both have their respective `keras` or `sklearn` flavors as well as a `python_function` flavor.

In [19]:
print(dbutils.fs.head(sklearnURI+"/model/MLmodel"))

In [20]:
print(dbutils.fs.head(kerasURI+"/model/MLmodel"))
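The `MLmodel` file is a small YAML document that lists the flavors a model supports. The sketch below mirrors that structure as a Python dictionary to show how a deployment tool might choose a flavor. The field values are illustrative rather than copied from the runs above, and `choose_flavor` is a hypothetical helper, not part of the MLflow API.

```python
# Illustrative stand-in for a parsed MLmodel file from an sklearn run.
mlmodel = {
    "artifact_path": "model",
    "flavors": {
        "python_function": {
            "loader_module": "mlflow.sklearn",
            "model_path": "model.pkl",
        },
        "sklearn": {
            "pickled_model": "model.pkl",
            "serialization_format": "cloudpickle",
        },
    },
}


def choose_flavor(mlmodel, supported):
    """Return the first flavor, in the tool's order of preference, that the model offers."""
    available = mlmodel["flavors"]
    for flavor in supported:
        if flavor in available:
            return flavor
    raise ValueError("no supported flavor; model offers: " + ", ".join(sorted(available)))


# A tool that understands sklearn natively can use that flavor directly.
native_choice = choose_flavor(mlmodel, ["sklearn", "python_function"])

# A generic tool that has no sklearn support still works through python_function.
generic_choice = choose_flavor(mlmodel, ["keras", "python_function"])
```

This is why every built-in flavor also writes a `python_function` entry: it gives tools a fallback that does not depend on the training library.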

Now we can use both of these models in the same way, even though they were trained by different packages. For full documentation:
https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html

In [22]:
import mlflow.pyfunc

rf_pyfunc_model = mlflow.pyfunc.load_model(model_uri=(sklearnURI+"/model").replace("dbfs:","/dbfs")) 
type(rf_pyfunc_model)

In [23]:
import mlflow.pyfunc

nn_pyfunc_model = mlflow.pyfunc.load_model(model_uri=(kerasURI+"/model").replace("dbfs:","/dbfs"))
type(nn_pyfunc_model)
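The `.replace("dbfs:","/dbfs")` calls above convert a DBFS URI into the path under the local `/dbfs` mount, so that libraries expecting ordinary file paths can read the model. A small sketch of the same conversion as a reusable function follows; `dbfs_uri_to_local_path` is a hypothetical helper name, and the example URI is made up.

```python
def dbfs_uri_to_local_path(uri):
    """Map 'dbfs:/some/path' to '/dbfs/some/path'; leave other paths unchanged."""
    prefix = "dbfs:"
    if uri.startswith(prefix):
        return "/dbfs" + uri[len(prefix):]
    return uri


local_path = dbfs_uri_to_local_path("dbfs:/example/artifacts/model")
unchanged = dbfs_uri_to_local_path("/dbfs/example/artifacts/model")
```

Unlike a plain `str.replace`, this version only rewrites a leading `dbfs:` prefix, so a path that happens to contain that text elsewhere is left alone.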

Both implement a `predict` method.  The loaded `sklearn` model is still a native `sklearn` object because that library already implements `predict`.

In [25]:
rfOutput = rf_pyfunc_model.predict(X_test) 
rfOutput

In [26]:
nnOutput = nn_pyfunc_model.predict(X_test) 
nnOutput

Unnamed: 0,0
1210,144.393906
1729,129.312897
4428,144.466171
3720,439.941681
2970,131.841064
291,91.732430
4222,107.979340
4622,147.816971
4477,523.152283
960,126.100952


In [27]:
print('rfOutput: {}; nnOutput: {}'.format(type(rfOutput), type(nnOutput)))

### Pre- and Post-Processing Code Using `pyfunc`

A `pyfunc` is a generic Python model that can represent any model, regardless of the libraries used to train it.  It's defined as a directory structure containing all of its dependencies, and once loaded it is "just an object" with a `predict` method.  Since it makes very few assumptions, it can be deployed using MLflow, SageMaker, a Spark UDF, or any other environment.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out <a href="https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#pyfunc-create-custom" target="_blank">the `pyfunc` documentation for details</a><br>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out <a href="https://github.com/mlflow/mlflow/blob/master/docs/source/models.rst#example-saving-an-xgboost-model-in-mlflow-format" target="_blank">this README for generic example code and integration with `XGBoost`</a>

To demonstrate how `pyfunc` works, create a basic class that adds `n` to the input values.

Define a model class.

In [30]:
import mlflow.pyfunc

class AddN(mlflow.pyfunc.PythonModel):

    def __init__(self, n):
        self.n=n 

    def predict(self, context, model_input):
        return model_input.apply(lambda column: column + self.n) 

Construct and save the model.

In [32]:
model_path = userhome + "/add_n_model2"
add5_model = AddN(n=5)

dbutils.fs.rm(model_path, True) # Remove any earlier copy so the cell can be rerun

# save_model writes through the local filesystem, so use the /dbfs mount path
mlflow.pyfunc.save_model(path=model_path.replace("dbfs:","/dbfs"), python_model=add5_model)

Load the model in `python_function` format.

In [34]:
loaded_model = mlflow.pyfunc.load_model(model_path.replace("dbfs:","/dbfs")) # Load from the same local path the model was saved to

Evaluate the model.

In [36]:
import pandas as pd

model_input = pd.DataFrame([range(10)])
model_output = loaded_model.predict(model_input)

assert model_output.equals(pd.DataFrame([range(5, 15)]))

model_output

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,5,6,7,8,9,10,11,12,13,14
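The same pattern extends to real pre- and post-processing. The sketch below is an illustration under stated assumptions rather than MLflow code: `PricePipeline` and `FixedRateModel` are hypothetical, and the class is a plain Python object that follows the `predict(self, context, model_input)` signature used by `AddN`. In a notebook it would subclass `mlflow.pyfunc.PythonModel` in the same way.

```python
class FixedRateModel:
    """Hypothetical inner model: predicts a flat price per guest."""

    def predict(self, rows):
        return [50.0 * row["accommodates"] for row in rows]


class PricePipeline:
    """Wraps a model with cleansing before prediction and formatting after it."""

    def __init__(self, model):
        self.model = model

    def preprocess(self, rows):
        # Cleansing: drop rows with a missing or non-positive guest count.
        return [r for r in rows if r.get("accommodates") and r["accommodates"] > 0]

    def postprocess(self, predictions):
        # Formatting: round to whole currency units.
        return [round(p) for p in predictions]

    def predict(self, context, model_input):
        cleaned = self.preprocess(model_input)
        raw = self.model.predict(cleaned)
        return self.postprocess(raw)


pipeline = PricePipeline(FixedRateModel())
rows = [{"accommodates": 3}, {"accommodates": None}, {"accommodates": 2}]

# The row with a missing guest count is dropped before prediction.
pipeline_output = pipeline.predict(context=None, model_input=rows)
```

Keeping the cleansing and formatting inside `predict` means that whatever environment loads the model applies exactly the same steps as training-time code did.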


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Lab


### [Click here to start the lab for this lesson.]($./Labs/05-Lab)

## Review
**Question:** How do MLflow projects differ from models?  
**Answer:** MLflow projects focus on the reproducibility of runs and the packaging of code.  MLflow models focus on deployment to various environments.

**Question:** What is an ML model flavor?  
**Answer:** Flavors are a convention that deployment tools can use to understand the model, which makes it possible to write tools that work with models from any ML library without having to integrate each tool with each library.  Instead of you having to map each training environment to each deployment environment, ML model flavors manage this mapping for you.

**Question:** How do I add pre- and post-processing logic to my models?  
**Answer:** A model class that extends `mlflow.pyfunc.PythonModel` allows you to have load, pre-processing, and post-processing logic.

## Additional Topics & Resources

**Q:** Where can I find out more information on MLflow Models?  
**A:** Check out <a href="https://www.mlflow.org/docs/latest/models.html" target="_blank">the MLflow documentation</a>.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>