In this notebook, we will explore the basic functionality of the MLFlow integration in the WhyLogs Python library.

# MLFlow + WhyLogs Integration Example
We will first load the UCI wine quality dataset into Pandas and split it into training data and small test batches. We will then enable the WhyLogs integration in MLFlow, train a model, and log a WhyLogs profile alongside the MLFlow parameters and metrics for each batch. Finally, we'll retrieve the stored profiles from MLFlow and visualize them.

First, we will import a few standard data science Python libraries.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import glob
import random
import time

import pandas as pd
import numpy as np
import mlflow
import whylogs

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

Enable WhyLogs in MLFlow to allow storage of WhyLogs statistical profiles. This can be disabled using `whylogs.disable_mlflow()`.

In [None]:
whylogs.enable_mlflow()

Download and prepare the UCI wine quality dataset. We then sample the test set further to simulate batches of data arriving every second.

In [None]:
# Load wine quality dataset
data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(data_url, sep=";")

# Split the data into training and test sets
train, test = train_test_split(data)

# Relocate predicted variable "quality" to y vectors
train_x = train.drop(["quality"], axis=1).reset_index(drop=True)
test_x = test.drop(["quality"], axis=1).reset_index(drop=True)
train_y = train[["quality"]].reset_index(drop=True)
test_y = test[["quality"]].reset_index(drop=True)

# Sample 20 batches of 5 rows from the test data to simulate incoming batches
subset_test_x = []
subset_test_y = []
for i in range(20):
    indices = random.sample(range(len(test)), 5)
    subset_test_x.append(test_x.loc[indices, :])
    subset_test_y.append(test_y.loc[indices, :])
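
The sampling above uses the global `random` state, so the batches differ on every run. As a minimal sketch of one way to make them repeatable, the snippet below draws the same kind of index batches from a seeded generator. The helper name `sample_batches`, the seed value, and the row count of 400 are assumptions made for illustration and are not part of the original notebook.

```python
import random


def sample_batches(n_rows, n_batches=20, batch_size=5, seed=42):
    """Return n_batches lists of distinct row positions drawn from range(n_rows)."""
    # A private generator keeps the global random state untouched
    rng = random.Random(seed)
    return [rng.sample(range(n_rows), batch_size) for _ in range(n_batches)]


# Illustrative row count; in the notebook this would be len(test)
batches = sample_batches(n_rows=400)
print(len(batches), len(batches[0]))
```

Each list of positions could then be passed to `test_x.loc[...]` and `test_y.loc[...]` in the same way as `indices` in the cell above.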

Train an ElasticNet model using scikit-learn.

We then run this model on each batch of data, logging the model parameters, the MAE evaluation metric, and a WhyLogs profile of the data (from a Pandas DataFrame).
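
As a rough illustration of what the `alpha` and `l1_ratio` parameters control, the sketch below computes the penalty term from scikit-learn's documented ElasticNet objective in plain Python. The function name `elastic_net_penalty` and the example weights are hypothetical and chosen only for illustration; this is not how scikit-learn implements its solver.

```python
def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Penalty: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2."""
    if not 0.0 <= l1_ratio <= 1.0:
        # Outside [0, 1] one of the two penalty weights would become negative
        raise ValueError("l1_ratio must be between 0 and 1")
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1.0 - l1_ratio) * l2_norm_squared


weights = [0.5, -2.0, 0.0]  # illustrative coefficients, not fitted values
print(elastic_net_penalty(weights, l1_ratio=1.0))  # pure L1 (lasso-style) penalty
print(elastic_net_penalty(weights, l1_ratio=0.0))  # pure L2 (ridge-style) penalty
```

`l1_ratio` only mixes the two penalties, while `alpha` scales their combined strength.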

In [None]:
# l1_ratio must lie in [0, 1]: 0 is a pure L2 penalty, 1 is a pure L1 penalty
model_params = {"alpha": 1.0,
                "l1_ratio": 0.5}

lr = ElasticNet(**model_params)
lr.fit(train_x, train_y)
print("Elasticnet model (%s):" % (model_params))

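for i in range(20):
    batch_x = subset_test_x[i]
    predicted_output = lr.predict(batch_x)

    mae = mean_absolute_error(subset_test_y[i], predicted_output)
    print("  Subset %d, mean absolute error: %s" % (i, mae))

    # log_params starts a new run implicitly when none is active;
    # end_run() below closes it, so each batch gets its own run
    mlflow.log_params(model_params)
    mlflow.log_metric("mae", mae)

    # Profile the current batch rather than the full training set
    mlflow.whylogs.log_pandas(batch_x)
    mlflow.end_run()

    # Pause so that batches are spaced one second apart
    time.sleep(1)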
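
To make the logged metric concrete, here is a small plain-Python sketch of the mean absolute error for a single five-row batch. The function name `mean_absolute_error_plain` and the quality scores are made up for illustration; the notebook itself relies on scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error_plain(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical wine quality scores for one batch of five rows
y_true = [5, 6, 5, 7, 4]
y_pred = [5.5, 5.5, 5.0, 6.0, 5.0]

batch_mae = mean_absolute_error_plain(y_true, y_pred)
print(batch_mae)  # absolute errors 0.5, 0.5, 0, 1, 1 average to 0.6
```

With only five rows per batch, a single poor prediction shifts the metric noticeably, so some variation between runs is expected.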

Let's now collect the `experiment_id` from MLFlow for the previous experiment.

In [None]:
client = mlflow.tracking.MlflowClient()
experiment = client.list_experiments()[0]

Inside MLFlow, the profiles are stored as *artifacts*. They can be retrieved in the same way as other run data in MLFlow, such as parameters and metrics. Here is one example using MLFlow's Python API.

In [None]:
runs = client.list_run_infos(experiment.experiment_id)

for run in runs:
    artifacts = client.list_artifacts(run.run_id)
    for artifact in artifacts:
        if artifact.path == "whylogs":
            print(artifact)

Our integration allows you to quickly collect the statistical profiles produced during experimentation.

In [None]:
mlflow_profiles = whylogs.mlflow.get_experiment_profiles(experiment.experiment_id)
mlflow_profiles

You can then use `whylogs.viz` to easily produce visualizations for the WhyLogs profile data.

In [None]:
from whylogs.viz import ProfileVisualizer

viz = ProfileVisualizer()
viz.set_profiles(mlflow_profiles)

In [None]:
viz.plot_distribution("free sulfur dioxide", ts_format="%d-%b-%y %H:%M:%S")
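
The `ts_format` string above looks like a standard `strftime` pattern. Assuming the visualizer interprets it that way, the short sketch below shows the label such a pattern produces for one timestamp. The timestamp is arbitrary, and the month abbreviation depends on the active locale.

```python
from datetime import datetime

ts_format = "%d-%b-%y %H:%M:%S"

# Arbitrary illustrative timestamp, not taken from the experiment
stamp = datetime(2021, 3, 5, 14, 7, 9)
label = stamp.strftime(ts_format)
print(label)

# Parsing the label with the same pattern recovers the timestamp
assert datetime.strptime(label, ts_format) == stamp
```

Because the runs in this notebook are only one second apart, a pattern without the `%H:%M:%S` part would give every profile the same label.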

MLFlow provides a command-line interface that can start an HTTP server where we can examine the experiment information, including artifacts such as WhyLogs profiles. We will use `!` to start that server from this Jupyter notebook instead of returning to the command line.

In [None]:
!mlflow ui