# Patch MLflow
Patch MLflow with `whylogs.enable_mlflow()`

In [1]:
import whylogs
whylogs.enable_mlflow()

True

# Import MLflow

Set the tracking URI to the local MLflow server.

We already have an MLflow server running at `http://localhost:5000`.

In [2]:
import mlflow
print(mlflow.__version__)
mlflow.set_tracking_uri("http://localhost:5000")

1.13.1
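
The training cell later in this notebook branches on the scheme of the tracking URI, because the Model Registry does not work with a file store. A minimal sketch of that check using only the standard library; the helper name `supports_model_registry` is hypothetical and not part of MLflow:

```python
from urllib.parse import urlparse


def supports_model_registry(tracking_uri: str) -> bool:
    """Hypothetical helper: mirror the notebook's check that the scheme is not "file"."""
    return urlparse(tracking_uri).scheme != "file"


print(urlparse("http://localhost:5000").scheme)          # http
print(supports_model_registry("http://localhost:5000"))  # True
print(supports_model_registry("file:///tmp/mlruns"))     # False
```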


# Build a model

We build a simple ElasticNet model using the wine quality dataset.

This is adapted from the MLflow example at: https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html

First we need to load the data and split it into train and test datasets.

In [5]:
import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn

import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2

warnings.filterwarnings("ignore")
np.random.seed(40)
# Read the wine-quality csv file from the URL
csv_url = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
)
try:
    data = pd.read_csv(csv_url, sep=";")
except Exception as e:
    logger.exception(
        "Unable to download training & test CSV, check your internet connection. Error: %s", e
    )
    raise  # Stop here: without `data`, the split below would fail with a NameError
# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)
# The predicted column is "quality" which is a scalar from [3, 9]
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]
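
`train_test_split` is called without `test_size` or `train_size`, so scikit-learn falls back to a 25% test split. The sketch below reproduces the row arithmetic with the standard library only. The rounding rule (round the test set up, give the remainder to training) is our reading of scikit-learn's behavior, and the row count of 1,599 is an assumption about the red wine CSV, so check `len(data)` in your own run.

```python
import math


def default_split_sizes(n_samples: int, test_fraction: float = 0.25):
    """Return (n_train, n_test) for a fractional test split, rounding the test set up."""
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 1599 is the assumed number of rows in winequality-red.csv
print(default_split_sizes(1599))  # (1199, 400)
```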

# Train the model

We train the model and register it with MLflow.

Since we already enabled the whylogs integration, whylogs will pick up the `.whylogs.yaml` file in the current directory and add it as a model artifact.

In [9]:
!cat .whylogs.yaml

# .whylogs.yaml

# Example WhyLogs YAML configuration
project: demo-project
pipeline: sagemaker-pipeline
verbose: false
writers:
# Save out the full protobuf datasketches data locally
- formats:
    - protobuf
  output_path: s3://whylabs-demo-artifacts-us-west-2/sagemaker
  path_template: $name/dataset_profile
  filename_template: datase_profile-$dataset_timestamp
  type: s3
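
The `path_template` and `filename_template` values use `$`-style placeholders. As an illustration only, Python's `string.Template` expands the same syntax; that whylogs resolves the placeholders this way is an assumption. The values `live` and `1611296040000` are taken from the S3 listing shown at the end of this notebook.

```python
from string import Template

# Templates copied from .whylogs.yaml (including the "datase_profile" spelling)
path_template = "$name/dataset_profile"
filename_template = "datase_profile-$dataset_timestamp"

# Assumed placeholder values, taken from the S3 listing in this notebook
path = Template(path_template).substitute(name="live")
filename = Template(filename_template).substitute(dataset_timestamp=1611296040000)

print(path)      # live/dataset_profile
print(filename)  # datase_profile-1611296040000
```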


In [10]:
alpha = 0.5
l1_ratio = 0.5
with mlflow.start_run():
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)
    predicted_qualities = lr.predict(test_x)
    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)
    tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
    # Model registry does not work with file store
    if tracking_url_type_store != "file":
        # Register the model
        # There are other ways to use the Model Registry, which depends on the use case,
        # please refer to the doc for more information:
        # https://mlflow.org/docs/latest/model-registry.html#api-workflow
        mlflow.sklearn.log_model(lr, "model", registered_model_name="ElasticnetWineModel")
    else:
        mlflow.sklearn.log_model(lr, "model")

Elasticnet model (alpha=0.500000, l1_ratio=0.500000):
  RMSE: 0.7931640229276851
  MAE: 0.6271946374319586
  R2: 0.10862644997792614


Registered model 'ElasticnetWineModel' already exists. Creating a new version of this model...
2021/01/21 21:37:05 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: ElasticnetWineModel, version 8
Created version '8' of model 'ElasticnetWineModel'.
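
For reference, the three metrics returned by `eval_metrics` can be computed by hand. A minimal pure-Python sketch on made-up numbers (not the wine data); the function name `eval_metrics_plain` is ours:

```python
import math


def eval_metrics_plain(actual, pred):
    """Compute RMSE, MAE and R2 without numpy or scikit-learn."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, pred)]
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2


# Made-up labels and predictions, for illustration only
actual = [5, 6, 7, 5]
pred = [5.5, 5.5, 6.0, 5.0]
rmse, mae, r2 = eval_metrics_plain(actual, pred)
print(rmse, mae, r2)
```

An R2 close to 0.11, as in the run above, means the model explains roughly 11% of the variance in the test labels.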


# Examine run artifact

We can open the MLflow interface to see the run artifacts, or examine the run info using the MLflow API.

In [15]:
latest_run = mlflow.list_run_infos(experiment_id="0")[0]
latest_run

<RunInfo: artifact_uri='data/0/c67968018081464eabf1bece1dde68f3/artifacts', end_time=1611293825505, experiment_id='0', lifecycle_stage='active', run_id='c67968018081464eabf1bece1dde68f3', run_uuid='c67968018081464eabf1bece1dde68f3', start_time=1611293824555, status='FINISHED', user_id='andy'>
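
`start_time` and `end_time` in `RunInfo` are Unix timestamps in milliseconds. A small sketch that converts the values printed above into UTC datetimes and a run duration, using only the standard library:

```python
from datetime import datetime, timedelta, timezone

# Values copied from the RunInfo output above
start_ms = 1611293824555
end_ms = 1611293825505


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime (integer math, no float rounding)."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc) + timedelta(milliseconds=ms % 1000)


print(ms_to_utc(start_ms).isoformat())  # 2021-01-22T05:37:04.555000+00:00
duration_ms = end_ms - start_ms
print(duration_ms)                      # 950
```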

## Model artifacts
You should see that `.whylogs.yaml` is included alongside the model artifacts.

We also add `whylogs` as a dependency in `conda.yaml`.

In [33]:
client = mlflow.tracking.MlflowClient()
info = client.list_artifacts(run_id=latest_run.run_id, path="model")
info

[<FileInfo: file_size=378, is_dir=False, path='model/.whylogs.yaml'>,
 <FileInfo: file_size=357, is_dir=False, path='model/MLmodel'>,
 <FileInfo: file_size=178, is_dir=False, path='model/conda.yaml'>,
 <FileInfo: file_size=645, is_dir=False, path='model/model.pkl'>]

# Hosting in SageMaker

## Push the base container

Before you start, you'll need to push the base MLflow container to your AWS account. 
See https://www.mlflow.org/docs/latest/cli.html#mlflow-sagemaker-build-and-push-container

In [None]:
!mlflow sagemaker build-and-push-container

## Deploy the model

MLflow makes it really easy to deploy the model.

We have set up a SageMaker execution role with S3 write permission to the `whylabs-demo-artifacts-us-west-2` bucket (the bucket configured in `.whylogs.yaml`).

Note that it takes a while for a model deployment to reach a stable state.

In [37]:
import mlflow.sagemaker

mlflow.sagemaker.deploy(
    app_name="whylogs", 
    model_uri=f"runs:/{latest_run.run_id}/model", 
    execution_role_arn="arn:aws:iam::207285235248:role/service-role/AmazonSageMaker-ExecutionRole-20200529T103706",
    mode="replace")

2021/01/21 21:50:59 INFO mlflow.sagemaker: Using the python_function flavor for deployment!
2021/01/21 21:50:59 INFO mlflow.sagemaker: No model data bucket specified, using the default bucket
2021/01/21 21:51:01 INFO mlflow.sagemaker: Default bucket `mlflow-sagemaker-us-west-2-207285235248` already exists. Skipping creation.
2021/01/21 21:51:01 INFO mlflow.sagemaker: tag response: {'ResponseMetadata': {'RequestId': '55E816D19DDB953D', 'HostId': 't7zVriBj3Gdtwn7nXaIMGWfsLBpZqTnzBWF+oEjdnAP7g79bEgJ9XtTwFNNAn8AgQDcFkiQGX5Y=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 't7zVriBj3Gdtwn7nXaIMGWfsLBpZqTnzBWF+oEjdnAP7g79bEgJ9XtTwFNNAn8AgQDcFkiQGX5Y=', 'x-amz-request-id': '55E816D19DDB953D', 'date': 'Fri, 22 Jan 2021 05:51:02 GMT', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}}
2021/01/21 21:51:01 INFO mlflow.sagemaker: Creating new endpoint with name: whylogs ...
2021/01/21 21:51:02 INFO mlflow.sagemaker: Created model with arn: arn:aws:sagemaker:us-west-2:2072852
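
Because deployment takes a while, it can help to poll the endpoint status before sending traffic. The sketch below is a hypothetical helper, not part of MLflow: it accepts any callable that returns a dict with an `EndpointStatus` key, which is the shape returned by the boto3 SageMaker client's `describe_endpoint`. The polling interval and timeout are arbitrary choices.

```python
import time


def wait_until_in_service(describe_endpoint, endpoint_name,
                          poll_seconds=30.0, timeout_seconds=1800.0,
                          sleep=time.sleep):
    """Poll until the endpoint reports "InService"; raise on "Failed" or on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name!r} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name!r} still {status!r} after timeout")
        sleep(poll_seconds)


# Usage with boto3 (requires AWS credentials, so it is left commented out):
#   import boto3
#   sm = boto3.client("sagemaker")
#   wait_until_in_service(sm.describe_endpoint, "whylogs")
```

Taking the describe function as a parameter keeps the helper testable without AWS access. boto3 also ships an `endpoint_in_service` waiter on the SageMaker client, which may be a simpler option if its default polling behavior suits you.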

## SageMaker API Input

The input to the SageMaker API is JSON in the pandas `split` orient. An example is below:

In [39]:
train_x.head(3).to_json(orient='split')

'{"columns":["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol"],"index":[1316,1507,849],"data":[[5.4,0.74,0.0,1.2,0.041,16.0,46.0,0.99258,4.01,0.59,12.5],[7.5,0.38,0.57,2.3,0.106,5.0,12.0,0.99605,3.36,0.55,11.4],[6.4,0.63,0.21,1.6,0.08,12.0,32.0,0.99689,3.58,0.66,9.8]]}'
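
The `split` orient is a JSON object with `columns`, `index`, and `data` keys. If pandas is not available on the calling side, an equivalent payload can be assembled with the standard `json` module. A minimal sketch using the first row from the output above; the variable names are ours:

```python
import json

# Feature columns, in the same order as the training data
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
# One row of feature values, copied from the output above
rows = [[5.4, 0.74, 0.0, 1.2, 0.041, 16.0, 46.0, 0.99258, 4.01, 0.59, 12.5]]

payload = json.dumps({
    "columns": columns,
    "index": list(range(len(rows))),  # row labels; arbitrary for this sketch
    "data": rows,
})
print(payload)
```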

## Calling the endpoint

First we create a SageMaker runtime client:

In [40]:
import boto3
client = boto3.client('runtime.sagemaker')

Now we can call the endpoint and see the output:

In [41]:
res = client.invoke_endpoint(
    EndpointName="whylogs",  # the app_name used in mlflow.sagemaker.deploy above
    ContentType="application/json",
    Body=test_x.head(10).to_json(orient="split"),
)
print(res['Body'].read().decode())

[5.731344540042413, 5.247960704489776, 5.754717372954283, 5.751114228483034, 5.635968698470214, 5.650958816069663, 5.6375782441532145, 5.781663284991655, 5.578837218114763, 5.603618905770178]
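
The response body is a JSON array with one prediction per input row. A short sketch of decoding it, using a shortened copy of the output above rather than a live call. Rounding to the nearest integer quality score is our illustration, not something the endpoint does:

```python
import json

# First three predictions copied from the output above
body = "[5.731344540042413, 5.247960704489776, 5.754717372954283]"

predictions = json.loads(body)               # list of floats
rounded = [round(p) for p in predictions]    # nearest integer quality score
print(rounded)  # [6, 5, 6]
```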


## Run our data for a while
To demonstrate the ability to run whylogs in a live inference environment, we'll run the above request a few times.

By default, whylogs writes to S3 every minute (this is configurable). The short interval is for demo and testing purposes.

In [45]:
import time
for i in range(1, 30):
    res = client.invoke_endpoint(EndpointName="whylogs", ContentType="application/json", Body=test_x.to_json(orient="split"))
    res['Body'].read().decode()
    time.sleep(5)

## Verify the data in the bucket

We verify that whylogs sends data to the bucket periodically.

In [46]:
!aws s3 ls --recursive s3://whylabs-demo-artifacts-us-west-2/sagemaker/live/

2021-01-21 22:16:27      38653 sagemaker/live/dataset_profile/protobuf/datase_profile-1611296040000.2021-01-22_06-14.bin
2021-01-21 22:17:57      41127 sagemaker/live/dataset_profile/protobuf/datase_profile-1611296100000.2021-01-22_06-15.bin
2021-01-21 22:18:29      38653 sagemaker/live/dataset_profile/protobuf/datase_profile-1611296160000.2021-01-22_06-16.bin
2021-01-21 22:18:50      45135 sagemaker/live/dataset_profile/protobuf/datase_profile-1611296220000.2021-01-22_06-17.bin
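
The number in each file name appears to be a Unix timestamp in milliseconds aligned to the start of a minute, which matches the one-minute write interval. A sketch that decodes the four timestamps listed above; this interpretation is inferred from the listing, not taken from whylogs documentation:

```python
from datetime import datetime, timezone

# Timestamps copied from the file names in the S3 listing above
timestamps_ms = [1611296040000, 1611296100000, 1611296160000, 1611296220000]

labels = []
for ts in timestamps_ms:
    # Each value is a whole number of minutes since the epoch
    assert ts % 60_000 == 0
    utc = datetime.fromtimestamp(ts // 1000, tz=timezone.utc)
    labels.append(utc.strftime("%Y-%m-%d_%H-%M"))

print(labels)
# ['2021-01-22_06-14', '2021-01-22_06-15', '2021-01-22_06-16', '2021-01-22_06-17']
```

The decoded labels agree with the date suffix in each file name, and consecutive files are 60,000 ms apart.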


You should be able to analyze this data using `whylogs` visualization. Check out our example repo for more notebooks!