# Using MLflow and Evidently to Evaluate Data Drift

In this example, we will explore the MLflow integration with Evidently.

This notebook shows how you can use Evidently and MLflow to:
* calculate data drift for the model, performed as batch checks
* log data drift using MLflow Tracking
* explore the results using the MLflow UI

Acknowledgments:
* The dataset used in the example is from: https://www.kaggle.com/c/bike-sharing-demand/data?select=train.csv
* Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
* More information about the dataset can be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

## Getting Started
To run this tutorial:

1. Install MLflow
You can install MLflow with `pip install mlflow`, or install MLflow together with scikit-learn via `pip install mlflow[extras]`.
More details: https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#id5

2. Install Evidently
You can install Evidently with `pip install evidently`.
If you also want to view Evidently Dashboard visualizations in Jupyter, install the Jupyter nbextension:
`jupyter nbextension install --sys-prefix --symlink --overwrite --py evidently`
Then activate it:
`jupyter nbextension enable evidently --py --sys-prefix`
More details: https://docs.evidentlyai.com/install-evidently

3. Download the data from https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset and save it locally (the loading cell below reads `bike_demand_prediction_data.csv`)

In [8]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [9]:
import json
import pandas as pd

from evidently.model_profile import Profile
from evidently.profile_sections import DataDriftProfileSection

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

In [2]:
#load data
raw_data = pd.read_csv('bike_demand_prediction_data.csv', header=0, 
                       sep=',', parse_dates=['datetime'], index_col='datetime')

In [3]:
#observe data structure
raw_data.head()

| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |


In [12]:
#set column mapping for Evidently Profile
data_columns = {}
data_columns['numerical_features'] = ['weather', 'temp', 'atemp', 'humidity', 'windspeed']
data_columns['categorical_features'] = ['holiday', 'workingday']
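
Since the drift evaluation later looks up every mapped feature by name, it can help to confirm that each name in the mapping exists in the data. The following is a minimal sketch, not part of the original notebook: `missing_columns` is a hypothetical helper, and the column list is copied by hand from the data preview rather than read from the file.

```python
# Column names copied from the data preview shown earlier in the notebook.
available_columns = ['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
                     'humidity', 'windspeed', 'casual', 'registered', 'count']

# Same mapping as in the cell above, written as a dict literal.
data_columns = {
    'numerical_features': ['weather', 'temp', 'atemp', 'humidity', 'windspeed'],
    'categorical_features': ['holiday', 'workingday'],
}

def missing_columns(column_mapping, columns):
    """Return mapped feature names that are absent from `columns` (hypothetical helper)."""
    mapped = column_mapping['numerical_features'] + column_mapping['categorical_features']
    return [name for name in mapped if name not in columns]

# An empty list means every mapped feature is present.
print(missing_columns(data_columns, available_columns))
```

In the notebook itself, `list(raw_data.columns)` could be passed instead of the hand-copied list.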

In [13]:
#evaluate data drift with Evidently Profile
def eval_drift(reference, production, column_mapping):
    """Return a list of (feature, p_value) pairs from the Evidently data drift profile."""
    data_drift_profile = Profile(sections=[DataDriftProfileSection])
    data_drift_profile.calculate(reference, production, column_mapping=column_mapping)
    report = data_drift_profile.json()
    json_report = json.loads(report)

    #collect the drift test p-value for every mapped feature
    drifts = []
    for feature in column_mapping['numerical_features'] + column_mapping['categorical_features']:
        drifts.append((feature, json_report['data_drift']['data']['metrics'][feature]['p_value']))
    return drifts
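
The function above returns `(feature, p_value)` pairs. As a rough sketch of how such pairs might be turned into a drift flag, the snippet below compares each p-value with a threshold. The 0.05 threshold, the `flag_drift` helper, and the example p-values are all assumptions made for illustration; the notebook itself does not fix a threshold.

```python
# Made-up p-values in the (feature, p_value) shape returned by eval_drift.
example_drifts = [('temp', 0.002), ('humidity', 0.31), ('holiday', 0.87)]

def flag_drift(drifts, threshold=0.05):
    """Return the names of features whose p-value falls below the threshold.

    Hypothetical helper; the 0.05 default is an assumed significance level.
    """
    return [feature for feature, p_value in drifts if p_value < threshold]

drifted = flag_drift(example_drifts)
print(drifted)
```

A lower threshold flags fewer features, so the value is worth choosing with the number of monitored features in mind.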

In [16]:
#set reference dates
reference_dates = ('2011-01-01 00:00:00','2011-01-28 23:00:00')

#set experiment batches dates
experiment_batches = [
    ('2011-01-01 00:00:00','2011-01-29 23:00:00'),
    ('2011-01-29 00:00:00','2011-02-07 23:00:00'),
    ('2011-02-07 00:00:00','2011-02-14 23:00:00'),
    ('2011-02-15 00:00:00','2011-02-21 23:00:00'),  
]
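
The date pairs above are later used as label-based slices on the datetime index. The sketch below illustrates how such slices behave, using a synthetic hourly index in place of the real dataset (an assumption: the real file may have gaps, so its row counts can differ).

```python
import pandas as pd

# Synthetic hourly index standing in for the bike-sharing data (not the real dataset).
index = pd.date_range('2011-01-01 00:00:00', '2011-02-21 23:00:00',
                      freq=pd.Timedelta(hours=1))
frame = pd.DataFrame({'temp': range(len(index))}, index=index)

reference_dates = ('2011-01-01 00:00:00', '2011-01-28 23:00:00')
batch_dates = ('2011-01-29 00:00:00', '2011-02-07 23:00:00')

# Label-based slicing on a DatetimeIndex includes both endpoints.
reference = frame.loc[reference_dates[0]:reference_dates[1]]
batch = frame.loc[batch_dates[0]:batch_dates[1]]

print(len(reference), reference.index.min(), reference.index.max())
print(len(batch), batch.index.min(), batch.index.max())
```

Because both endpoints are inclusive, batches that share a boundary day (for example, the second and third batches above both cover 2011-02-07) will contain the same rows.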

In [17]:
#log into MLflow
client = MlflowClient()

#set experiment
mlflow.set_experiment('Data Drift Evaluation with Evidently')

#start new run
for date in experiment_batches:
    with mlflow.start_run() as run: #optionally pass run_name='test' to name the run
        
        # Log parameters
        mlflow.log_param("begin", date[0])
        mlflow.log_param("end", date[1])

        # Log metrics
        metrics = eval_drift(raw_data.loc[reference_dates[0]:reference_dates[1]], 
                             raw_data.loc[date[0]:date[1]], 
                             column_mapping=data_columns)
        for feature in metrics:
            mlflow.log_metric(feature[0], round(feature[1], 3))

        print(run.info)

<RunInfo: artifact_uri='file:///Users/emeli/Dev/evidently/mlflow/examples/evidently/mlruns/3/dafc7696e7ab4418b1ea3c77799bc0b6/artifacts', end_time=None, experiment_id='3', lifecycle_stage='active', run_id='dafc7696e7ab4418b1ea3c77799bc0b6', run_uuid='dafc7696e7ab4418b1ea3c77799bc0b6', start_time=1626195935151, status='RUNNING', user_id='emeli'>
<RunInfo: artifact_uri='file:///Users/emeli/Dev/evidently/mlflow/examples/evidently/mlruns/3/df9e6ff2ae7b4266b347d6071302842a/artifacts', end_time=None, experiment_id='3', lifecycle_stage='active', run_id='df9e6ff2ae7b4266b347d6071302842a', run_uuid='df9e6ff2ae7b4266b347d6071302842a', start_time=1626195935242, status='RUNNING', user_id='emeli'>
<RunInfo: artifact_uri='file:///Users/emeli/Dev/evidently/mlflow/examples/evidently/mlruns/3/d960a43851ec42549d48d644772538f8/artifacts', end_time=None, experiment_id='3', lifecycle_stage='active', run_id='d960a43851ec42549d48d644772538f8', run_uuid='d960a43851ec42549d48d644772538f8', start_time=162619593
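
The loop above logs one metric per feature with `mlflow.log_metric`. As an alternative sketch, the pairs could first be collected into a dictionary, which `mlflow.log_metrics` accepts in a single call. The `to_metrics_dict` helper and the example p-values are hypothetical and only illustrate the shape of the data.

```python
def to_metrics_dict(drifts, digits=3):
    """Convert (feature, p_value) pairs into a {metric_name: rounded_value} dict."""
    return {feature: round(p_value, digits) for feature, p_value in drifts}

# Made-up p-values in the shape returned by eval_drift.
example_drifts = [('temp', 0.00213), ('humidity', 0.30961)]
metrics = to_metrics_dict(example_drifts)
print(metrics)

# Inside an active run, this dict could be logged in one call:
# mlflow.log_metrics(metrics)
```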

In [19]:
#run MLflow UI (it is more convenient to run it directly from the terminal)
!mlflow ui

[2021-07-13 17:26:11 +0300] [2234] [INFO] Starting gunicorn 20.1.0
[2021-07-13 17:26:11 +0300] [2234] [INFO] Listening at: http://127.0.0.1:5000 (2234)
[2021-07-13 17:26:11 +0300] [2234] [INFO] Using worker: sync
[2021-07-13 17:26:11 +0300] [2237] [INFO] Booting worker with pid: 2237
^C

Aborted!
[2021-07-13 17:27:01 +0300] [2234] [INFO] Handling signal: int
[2021-07-13 17:27:01 +0300] [2237] [INFO] Worker exiting (pid: 2237)
