# Data Drift Monitor
In this exercise we will create a Data Drift monitor on a target dataset in our Azure Machine Learning (AML) workspace. We'll then establish a pipeline for triggering the Data Drift monitor that we can orchestrate with the model retraining pipeline that was created in an earlier exercise.

The Data Drift monitor is one option for monitoring ML models in production and signaling the need for a model retrain. The other option is comparing actuals to predictions. Monitoring models and pipelines is a core concept of MLOps. The result of this exercise is a data drift monitor pipeline that will be orchestrated with a retraining pipeline using Azure Data Factory.
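
The second option, comparing actuals to predictions, can be sketched in a few lines. The example below is a minimal illustration rather than part of the exercise: the function names `accuracy` and `needs_retrain`, the sample labels, and the `min_accuracy` cutoff of 0.8 are all assumptions chosen for demonstration.

```python
def accuracy(actuals, predictions):
    """Return the fraction of predictions that match the observed labels."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    if len(actuals) == 0:
        raise ValueError("at least one observation is required")
    matches = sum(1 for actual, predicted in zip(actuals, predictions) if actual == predicted)
    return matches / len(actuals)


def needs_retrain(actuals, predictions, min_accuracy=0.8):
    """Flag a retrain when observed accuracy falls below the (assumed) cutoff."""
    return accuracy(actuals, predictions) < min_accuracy


# Hypothetical labels collected after deployment
observed = [1, 0, 1, 1]
predicted = [1, 0, 0, 1]
print(accuracy(observed, predicted), needs_retrain(observed, predicted))
```

This approach needs ground-truth labels, which often arrive late. Data drift monitoring, by contrast, only needs the input features.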

## Step 1: Create Target and Baseline Datasets
We will first create a target dataset where we simulate data drift for the sake of an example. At the same time we will create a baseline dataset to use as the drift comparison.
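
To get a feel for how large the simulated drift is, the sketch below applies the same three transformations used in this step to a handful of made-up rows and reports the relative change in each column mean. The rows and the helper name `relative_mean_shift` are hypothetical, and the relative mean shift is only a rough intuition aid; it is not the metric that the Data Drift monitor reports.

```python
from statistics import mean


def relative_mean_shift(baseline, target):
    """Relative change of the target mean with respect to the baseline mean."""
    baseline_mean = mean(baseline)
    if baseline_mean == 0:
        raise ValueError("baseline mean must be non-zero")
    return (mean(target) - baseline_mean) / baseline_mean


# Made-up sample rows (not taken from diabetes2.csv)
baseline_rows = {
    'Pregnancies': [1, 3, 6],
    'Age': [31, 32, 50],
    'BMI': [26.6, 23.3, 33.6],
}

# Same transformations as the exercise: +1, x1.2 (rounded), x1.1
target_rows = {
    'Pregnancies': [v + 1 for v in baseline_rows['Pregnancies']],
    'Age': [round(v * 1.2) for v in baseline_rows['Age']],
    'BMI': [v * 1.1 for v in baseline_rows['BMI']],
}

for column in baseline_rows:
    shift = relative_mean_shift(baseline_rows[column], target_rows[column])
    print(f"{column}: {shift:+.1%}")
```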

In [None]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

In [None]:
#TODO: Name your artifacts
userid = ''

In [None]:
import os

# Create a folder for the experiment files
data_drift_folder = 'data_drift_folder'
os.makedirs(data_drift_folder + '/baseline', exist_ok=True)
os.makedirs(data_drift_folder + '/target', exist_ok=True)
print(data_drift_folder, 'folder created')

In [None]:
import datetime as dt
from datetime import timedelta
import pandas as pd
import numpy as np

df_baseline = pd.read_csv('../data/diabetes/diabetes2.csv')
df_target = df_baseline.copy()

# Modify data to create some drift
df_target['Pregnancies'] = df_target['Pregnancies'] + 1
df_target['Age'] = round(df_target['Age'] * 1.2).astype(int)
df_target['BMI'] = df_target['BMI'] * 1.1

baseline_date = dt.date(2022,1,1)
target_date = dt.date(2022,2,1)

# Stamp every row with the same date; pandas broadcasts the scalar to the column
df_baseline['Datetime'] = baseline_date
df_target['Datetime'] = target_date

df_baseline.head()

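# index=False keeps the unnamed pandas index out of the registered datasets
df_baseline.to_csv(data_drift_folder + '/baseline/diabetes_baseline.csv', index=False)
df_target.to_csv(data_drift_folder + '/target/diabetes_target.csv', index=False)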

In [None]:
from azureml.core import Datastore, Dataset
from azureml.data.datapath import DataPath

datastore_name = 'workshop_datalake'

datastore = Datastore.get(ws, datastore_name)

# Register baseline dataset
Dataset.File.upload_directory(src_dir=data_drift_folder + '/baseline',
           target=DataPath(datastore,  '0-raw/diabetes/' + userid + '/baseline/'),
           show_progress=True)

diabetes_baseline_ds = Dataset.Tabular.from_delimited_files(path=(datastore,'0-raw/diabetes/' + userid + '/baseline'))
diabetes_baseline_ds.register(ws,name='diabetes-data-baseline-' + userid ,create_new_version=False)

# Register target dataset
Dataset.File.upload_directory(src_dir=data_drift_folder + '/target',
           target=DataPath(datastore,  '0-raw/diabetes/' + userid + '/target/'),
           show_progress=True)

diabetes_target_ds = Dataset.Tabular.from_delimited_files(path=(datastore,'0-raw/diabetes/' + userid + '/target'))
# Note the addition of .with_timestamp_columns. This is required for the dataset to be used as a Data Drift Monitor target dataset.
diabetes_target_ds.with_timestamp_columns('Datetime').register(ws,name='diabetes-data-target-' + userid,create_new_version=False)

## Step 2: Create Data Drift Monitor
In this step we will create a data drift monitor using the AML Studio.

1. In the AML Studio, navigate to <b>Datasets</b>.
1. Select the <b>Dataset monitors</b> tab.
1. Press the <b>+ Create</b> button.

![AML studio screen with Create button for Data Drift Monitor](./img/dfcreatebutton.png)

1. Select the target dataset registered in Step 1.
1. Press <b>Next</b>.

![Data Drift Monitor wizard Select target dataset screen](./img/dfselecttarget.png)

1. Select the <b>Choose a baseline dataset</b> radio button.
1. Choose the baseline dataset registered in Step 1.
1. Press <b>Next</b>.

![Data Drift Monitor wizard Select baseline dataset screen](./img/dfselectbaseline.png)

1. Name the drift monitor using the convention "diabetes-drift-monitor-\<userid\>".
1. In <b>Features</b> select <b>Pregnancies, Age, BMI</b>.
1. Choose the desired Compute Cluster for <b>Compute Target</b>.
1. Disable the monitor with the switch under <b>Enable or disable schedule monitor runs</b>.
1. Set <b>Threshold</b> to <b>10%</b>.
1. Press <b>Create</b>.

![Data Drift Monitor wizard Monitor settings screen](./img/dfmonitorsettings.png)

Finally, we will test the new Data Drift Monitor with a target date run of '2022-02-01', the date we assigned to the target dataset in Step 1.
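
The monitor was configured with a threshold of 10%, so a natural follow-up is to compare the reported drift against that value. The sketch below assumes the metrics dictionary has the shape used in this notebook (`['Datadrift percentage']['drift_percentage']`) and that the value is on a 0-100 scale; the helper name `drift_exceeds_threshold` and the sample value are hypothetical.

```python
def drift_exceeds_threshold(metrics, threshold_percent=10.0):
    """Return True when the reported drift percentage is above the threshold.

    Assumes `metrics` is shaped like the dictionary read in this notebook and
    that the value is expressed on a 0-100 scale (an assumption to verify).
    """
    drift_percent = float(metrics['Datadrift percentage']['drift_percentage'])
    return drift_percent > threshold_percent


# Hypothetical metrics payload, for illustration only
sample_metrics = {'Datadrift percentage': {'drift_percentage': 42.5}}
print(drift_exceeds_threshold(sample_metrics))
```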

In [None]:
from azureml.datadrift import DataDriftDetector
from datetime import datetime

monitor = DataDriftDetector.get_by_name(ws,name='diabetes-drift-monitor-' + userid)
print(monitor)

target_datetime = datetime.strptime('2022-02-01', '%Y-%m-%d')

monitor_run = monitor.run(target_date=target_datetime)

monitor_run.wait_for_completion()

drift_percent = monitor_run.get_metrics()['Datadrift percentage']['drift_percentage']

print(drift_percent)

## Step 3: Create Pipeline for Data Drift Monitor
In this step we create a pipeline that authenticates with the AML workspace and submits a target date run of the Data Drift Monitor created in Step 2. The drift percentage metric is then logged to the pipeline run, so that it can be retrieved from the retraining pipeline.
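
The script below expects the target date in `YYYY-MM-DD` format and parses it with `datetime.strptime`. One possible hardening, shown here as a sketch and not as the exercise's implementation, is to validate the value while parsing arguments so that a missing or malformed date produces a clear usage error. The helper names `parse_target_date` and `build_parser` are hypothetical.

```python
import argparse
from datetime import datetime


def parse_target_date(value):
    """Convert a YYYY-MM-DD string to a datetime, or raise an argparse error."""
    try:
        return datetime.strptime(value, '%Y-%m-%d')
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"invalid target date {value!r}; expected YYYY-MM-DD")


def build_parser():
    """Build a parser that requires and validates --target-date."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--target-date', dest='target_date', required=True,
                        type=parse_target_date, help='target date (YYYY-MM-DD)')
    return parser


# Parse an explicit argument list so the sketch runs outside a pipeline step
args = build_parser().parse_args(['--target-date', '2022-02-01'])
print(args.target_date)
```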

In [None]:
%%writefile $data_drift_folder/diabetes_drift_monitor.py
import argparse
from azureml.core import Workspace, Run, Experiment
from azureml.pipeline.core import PipelineRun
from azureml.core.authentication import MsiAuthentication
from azureml.datadrift import DataDriftDetector
from datetime import datetime

parser = argparse.ArgumentParser()
parser.add_argument('--target-date', type=str, dest='target_date', default='', help='target date')
args = parser.parse_args()

#TODO: Provide userid and workspace values
userid = ''
subscription_id = ''
resource_group = ''
workspace_name = ''

experiment_name = 'diabetes-drift-monitor-' + userid + '-Monitor-Runs'

target_date = args.target_date
target_datetime = datetime.strptime(target_date, '%Y-%m-%d')

msi_auth = MsiAuthentication()

ws = Workspace(subscription_id=subscription_id,
               resource_group=resource_group,
               workspace_name=workspace_name,
               auth=msi_auth)

ex = Experiment(ws,name=experiment_name)

monitor = DataDriftDetector.get_by_name(ws,name='diabetes-drift-monitor-' + userid)
print(monitor)

monitor_run = monitor.run(target_date=target_datetime)
monitor_run.wait_for_completion()

# Log drift percent to pipeline run and tag with runid.
drift_percent = monitor_run.get_metrics()['Datadrift percentage']['drift_percentage']
pipeline_step_run = Run.get_context()
pipeline_run = PipelineRun(ex,run_id=pipeline_step_run.get_properties()['azureml.pipelinerunid'])
pipeline_run.log('drift_percent', str(drift_percent))
pipeline_run.tag('drift_percent', str(drift_percent))

print(drift_percent)

In [None]:
%%writefile $data_drift_folder/drift_environment.yml
name: train_environment
dependencies:
- python=3.6.2
- scikit-learn
- pandas
- numpy
- pip
- pip:
  - azureml-defaults
  - azureml-datadrift

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute

#TODO: Set compute cluster name
aml_compute_target = ""

aml_compute_drift_target = AmlCompute(ws, aml_compute_target)
print("found existing compute target.")
print("Azure Machine Learning Compute attached")

In [None]:
from azureml.core import Environment
from azureml.core import ScriptRunConfig
from azureml.pipeline.core.graph import PipelineParameter

drift_env = Environment.from_conda_specification(
    name="diabetes-drift-env", file_path="./data_drift_folder/drift_environment.yml"
)

drift_cfg = ScriptRunConfig(
    source_directory='data_drift_folder',
    script="diabetes_drift_monitor.py",
    compute_target=aml_compute_drift_target,
    environment=drift_env,
)

In [None]:
from azureml.pipeline.steps import PythonScriptStep

pipeline_param = PipelineParameter(
  name="target-date",
  default_value='')

drift_monitor_step = PythonScriptStep(name='drift_monitor_step',
                            source_directory=drift_cfg.source_directory,
                            script_name=drift_cfg.script,
                            runconfig=drift_cfg.run_config,
                            arguments = ['--target-date', pipeline_param],
                            allow_reuse=False
                            )

print("Step1 created")

In [None]:
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

drift_pipeline = Pipeline(workspace=ws, steps=[drift_monitor_step])
drift_pipeline_run = drift_pipeline.submit(experiment_name='diabetes-drift-monitor-' + userid + '-Monitor-Runs'
                                            ,pipeline_parameters={'target-date':'2022-02-01'})
                                            
RunDetails(drift_pipeline_run).show()
drift_pipeline_run.wait_for_completion()

In [None]:
published_pipeline = drift_pipeline.publish(name='pipeline-drift-monitor-diabetes-' + userid + '-prod', description="Execute data drift monitor for diabetes.")

<b>Note:</b> Run the code below only once to create a consistent endpoint.

In [None]:
from azureml.pipeline.core import PipelineEndpoint, PublishedPipeline

pipeline_endpoint = PipelineEndpoint.publish(workspace=ws, name='endpoint-drift-monitor-diabetes-' + userid + '-prod',
                                            pipeline=published_pipeline, description="Endpoint")

pipeline_endpoint

### Clean-up local workspace
Remove the files and directories created during the exercise.

In [None]:
import os
import shutil

shutil.rmtree('data_drift_folder', ignore_errors=True)