# Monitoring Data Drift

Over time, models can become less effective at predicting accurately due to changing trends in feature data. This phenomenon is known as *data drift*, and it's important to monitor your machine learning solution to detect it so you can retrain your models if necessary.

In this lab, you'll configure data drift monitoring for datasets.

## Before You Start

Before you start this lab, ensure that you have completed the *Create an Azure Machine Learning Workspace* and *Create a Compute Instance* tasks in [Lab 1: Getting Started with Azure Machine Learning](./labdocs/Lab01.md). Then open this notebook in Jupyter on your Compute Instance.

## Connect to Your Workspace

First, connect to your workspace using the Azure ML SDK.

> **Note**: You may be prompted to authenticate. Just copy the code and click the link provided to sign in to your Azure subscription, and then return to this notebook.

In [14]:
import datetime
from azureml.core import Workspace, Dataset, ComputeTarget
import pandas as pd
from azureml.datadrift import DataDriftDetector
from azureml.widgets import RunDetails


In [3]:
ws = Workspace.from_config()
print('Ready to work with', ws.name)


Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code FYAQE5M3L to authenticate.
Interactive authentication successfully completed.
Ready to work with workspace


## Create a Baseline Dataset

To monitor a dataset for data drift, you must register a *baseline* dataset (usually the dataset used to train your model) to use as a point of comparison with data collected in the future. 

In [4]:
default_ds = ws.get_default_datastore()
default_ds.upload_files(
    files=['./data/diabetes.csv', './data/diabetes2.csv'],
    target_path='diabetes-baseline', overwrite=True, show_progress=True,
)

print('Registering baseline dataset...')
baseline_data_set = Dataset.Tabular.from_delimited_files(
    (default_ds, 'diabetes-baseline/*.csv')
)
baseline_data_set = baseline_data_set.register(
    ws, 'diabetes baseline', description='diabetes baseline data',
    tags = {'format':'CSV'}, create_new_version=True
)

print('Baseline dataset registered!')


Uploading an estimated of 2 files
Uploading ./data/diabetes.csv
Uploading ./data/diabetes2.csv
Uploaded ./data/diabetes2.csv, 1 files out of an estimated total of 2
Uploaded ./data/diabetes.csv, 2 files out of an estimated total of 2
Uploaded 2 files
Registering baseline dataset...
Baseline dataset registered!


## Create a Target Dataset

Over time, you can collect new data with the same features as your baseline training data. To compare this new data to the baseline data, you must define a target dataset that includes the features you want to analyze for data drift, as well as a timestamp field that indicates the point in time at which the new data was current. The timestamp enables you to measure data drift over temporal intervals. It can either be a field in the dataset itself, or be derived from the folder and filename pattern used to store the data. For example, you might store new data in a folder hierarchy that consists of a folder for the year, containing a folder for the month, which in turn contains a folder for the day. Alternatively, you might encode the year, month, and day in the file name, like this: *data_2020-01-29.csv*. The following code takes the file name approach:

In [5]:
print('Generating simulated data...')

data = pd.read_csv('data/diabetes2.csv')

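# Week offsets 5, 4, ..., 0 so that the oldest file is generated first
weeknos = reversed(range(6))

file_paths = []
for weekno in weeknos:
    # Date of this simulated weekly extract
    data_date = datetime.date.today() - datetime.timedelta(weeks=weekno)

    # Modify the data cumulatively so that the drift grows each week
    data['Pregnancies'] = data['Pregnancies'] + 1
    data['Age'] = round(data['Age'] * 1.2)
    data['BMI'] = data['BMI'] * 1.1

    # Save the file with the date encoded in the file name
    file_path = f'data/diabetes_{data_date.strftime("%Y-%m-%d")}.csv'
    data.to_csv(file_path)
    file_paths.append(file_path)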

path_on_datastore = 'diabetes-target'
default_ds.upload_files(
    files=file_paths, target_path=path_on_datastore, overwrite=True,
    show_progress=True,
)

partition_format = path_on_datastore + '/diabetes_{date:yyyy-MM-dd}.csv'
target_data_set = Dataset.Tabular.from_delimited_files(
    (default_ds, path_on_datastore + '/*.csv'),
    partition_format=partition_format
)

print('Registering target dataset...')
target_data_set = (
    target_data_set.with_timestamp_columns('date')
        .register(
            ws, 'diabetes target', description='diabetes target data',
            tags = {'format':'CSV'}, create_new_version=True
    )
)

print('Target dataset registered!')


Generating simulated data...
Uploading an estimated of 6 files
Uploading data/diabetes_2020-07-25.csv
Uploading data/diabetes_2020-08-01.csv
Uploading data/diabetes_2020-08-08.csv
Uploading data/diabetes_2020-08-15.csv
Uploading data/diabetes_2020-08-22.csv
Uploading data/diabetes_2020-08-29.csv
Uploaded data/diabetes_2020-08-01.csv, 1 files out of an estimated total of 6
Uploaded data/diabetes_2020-07-25.csv, 2 files out of an estimated total of 6
Uploaded data/diabetes_2020-08-29.csv, 3 files out of an estimated total of 6
Uploaded data/diabetes_2020-08-15.csv, 4 files out of an estimated total of 6
Uploaded data/diabetes_2020-08-08.csv, 5 files out of an estimated total of 6
Uploaded data/diabetes_2020-08-22.csv, 6 files out of an estimated total of 6
Uploaded 6 files
Registering target dataset...
Target dataset registered!
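
The date-in-file-name convention used for the target dataset can be illustrated with the standard library alone. The following is a minimal sketch and not part of the lab code: the helper names `weekly_file_names` and `date_from_file_name` are hypothetical, and the fixed reference date is an assumption chosen to match the sample output above. It builds the six weekly file names and then recovers the date from each one, which is roughly what the `{date:yyyy-MM-dd}` partition format asks the SDK to do.

```python
import datetime
import re


def weekly_file_names(today, weeks=6):
    """Build one file name per week, oldest first (hypothetical helper)."""
    names = []
    for weekno in reversed(range(weeks)):
        data_date = today - datetime.timedelta(weeks=weekno)
        names.append(f'diabetes_{data_date.strftime("%Y-%m-%d")}.csv')
    return names


def date_from_file_name(file_name):
    """Recover the date encoded in a name like diabetes_2020-08-29.csv."""
    match = re.fullmatch(r'diabetes_(\d{4}-\d{2}-\d{2})\.csv', file_name)
    if match is None:
        raise ValueError(f'Unexpected file name: {file_name}')
    return datetime.datetime.strptime(match.group(1), '%Y-%m-%d').date()


# Fixed reference date so the result is reproducible (assumption for illustration)
names = weekly_file_names(datetime.date(2020, 8, 29))
dates = [date_from_file_name(name) for name in names]
for name, parsed in zip(names, dates):
    print(name, '->', parsed.isoformat())
```

A file whose name does not match the pattern raises a `ValueError` in this sketch, which makes naming mistakes visible before any upload.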


## Create a Data Drift Monitor

Now you're ready to create a data drift monitor for the diabetes data. The data drift monitor will run periodically or on demand to compare the baseline dataset with the target dataset, to which new data will be added over time.

### Create a Compute Target

To run the data drift monitor, you'll need a compute target. Create an Azure Machine Learning compute cluster in your workspace, or use an existing one if you have already created it. The code below only looks up an existing cluster by name; it does not create one.

> **Important**: Change *susumu-cluster* to the name of your own compute cluster in the code below before running it! Cluster names must be globally unique and between 2 and 16 characters in length. Valid characters are letters, digits, and the - character.

In [8]:
cluster_name = "susumu-cluster"
# Look up the existing compute cluster by name (raises ComputeTargetException if it is not found)
training_cluster = ComputeTarget(ws, cluster_name)
print('Found existing cluster, use it.')
    

Found existing cluster, use it.
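
The naming rules stated in the note above can be checked locally before calling the SDK. This is a minimal sketch under those stated rules only: the helper `is_valid_cluster_name` is hypothetical, Azure may enforce further constraints that it does not cover, and uniqueness cannot be verified offline.

```python
import re

# Rules taken from the note in this lab: 2 to 16 characters,
# consisting of letters, digits, and the '-' character.
# Assumption: no other constraints are checked here.
_NAME_PATTERN = re.compile(r'[A-Za-z0-9-]{2,16}')


def is_valid_cluster_name(name):
    """Return True if the name satisfies the rules stated in this lab."""
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ['susumu-cluster', 'x', 'my_cluster', 'a-very-long-cluster-name']:
    print(candidate, is_valid_cluster_name(candidate))
```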


### Define the Data Drift Monitor

Now you're ready to use the **DataDriftDetector** class to define the data drift monitor for your data. You can specify the features you want to monitor for data drift, the name of the compute target to be used to run the monitoring process, the frequency at which the data should be compared, the data drift threshold above which an alert should be triggered, and the latency (in hours) to allow for data collection.

In [11]:
features = ['Pregnancies', 'Age', 'BMI']
monitor = DataDriftDetector.create_from_datasets(
    ws, 'diabetes-drift-detector', baseline_data_set, target_data_set,
    compute_target=cluster_name, frequency='Week', feature_list=features, 
    drift_threshold=.3, latency=24
)
monitor


{'_workspace': Workspace.create(name='workspace', subscription_id='84170def-2683-47c0-91ed-1f34057afd69', resource_group='resources'), '_frequency': 'Week', '_schedule_start': None, '_schedule_id': None, '_interval': 1, '_state': 'Disabled', '_alert_config': None, '_type': 'DatasetBased', '_id': '66e584c9-9ef1-416e-a597-8875e80eccee', '_model_name': None, '_model_version': 0, '_services': None, '_compute_target_name': 'susumu-cluster', '_drift_threshold': 0.3, '_baseline_dataset_id': 'da146a64-fd0e-4421-9910-0ada1ebd26af', '_target_dataset_id': 'df71d6c9-cbd7-42e7-ae3c-74007a54bd7f', '_feature_list': ['Pregnancies', 'Age', 'BMI'], '_latency': 24, '_name': 'diabetes-drift-detector', '_latest_run_time': None, '_client': <azureml.datadrift._restclient.datadrift_client.DataDriftClient object at 0x7f3e599cdeb8>, '_logger': <_TelemetryLoggerContextAdapter azureml.datadrift._logging._telemetry_logger.azureml.datadrift.datadriftdetector (DEBUG)>}

## Backfill the Monitor

You have a baseline dataset and a target dataset that includes simulated weekly data collection for six weeks. You can use this to backfill the monitor so that it can analyze data drift between the original baseline and the target data.

> **Note**: This may take some time to run, as the compute target must be started to run the backfill analysis. The widget may not always update to show the status, so click the link to observe the experiment status in Azure Machine Learning studio!

In [15]:
backfill = monitor.backfill(
    datetime.datetime.now() - datetime.timedelta(weeks=6),
    datetime.datetime.now(),
)

RunDetails(backfill).show()
backfill.wait_for_completion()


_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'diabetes-drift-detector-Monitor-Runs_1598747869145',
 'target': 'susumu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-08-30T00:47:54.362379Z',
 'endTimeUtc': '2020-08-30T00:53:46.551898Z',
   'message': 'target dataset id:df71d6c9-cbd7-42e7-ae3c-74007a54bd7f do not contain sufficient amount of data after timestamp filteringMinimum needed: 50 rows.Skipping calculation for time slice 2020-08-30 00:00:00 to 2020-09-06 00:00:00.'}],
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '646cdd85-cb8b-49fb-8231-53d7c89b4879',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': 'da146a64-fd0e-4421-9910-0ada1ebd26af'}, 'consumptionDetails': {'type': 'Reference'}}, {'dataset': {'id': 'df71d6c9-cbd7-42e7-ae3c-74007a54bd7f'}, 'consumptionDetails': {'type': 'Reference'}}],
 'runDefinition': {'script': '_generate_script_datasets.py',
  'scriptType'

## Analyze Data Drift

You can use the following code to examine data drift for the points in time collected in the backfill run.

In [16]:
drift_metrics = backfill.get_metrics()
for metric in drift_metrics:
    print(metric, drift_metrics[metric])


start_date 2020-07-19
end_date 2020-09-06
frequency Week
Datadrift percentage {'days_from_start': [0, 7, 14, 21, 28, 35], 'drift_percentage': [74.19152901127207, 87.23985219136877, 91.74192122865539, 94.96492628559955, 97.58354951107833, 99.23199438682525]}
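
To make these metrics easier to read, the day offsets can be converted to calendar dates and paired with the drift values. The following is a minimal sketch that uses the sample output above as literal data; the helper `drift_by_date` is hypothetical, and treating `drift_threshold=.3` as 30 percent on this scale is an assumption rather than something this lab states.

```python
import datetime

# Values copied from the sample output above
start_date = datetime.date(2020, 7, 19)
days_from_start = [0, 7, 14, 21, 28, 35]
drift_percentage = [74.19152901127207, 87.23985219136877, 91.74192122865539,
                    94.96492628559955, 97.58354951107833, 99.23199438682525]

# Assumption: drift_threshold=.3 corresponds to 30 percent on this scale
threshold_percent = 0.3 * 100


def drift_by_date(start, offsets, percentages):
    """Pair each day offset with its calendar date and drift value."""
    return [(start + datetime.timedelta(days=d), p)
            for d, p in zip(offsets, percentages)]


weekly = drift_by_date(start_date, days_from_start, drift_percentage)
for slice_date, pct in weekly:
    flag = 'above threshold' if pct > threshold_percent else 'ok'
    print(f'{slice_date.isoformat()}  {pct:6.2f}%  {flag}')
```

Under that assumption, every time slice in the sample output exceeds the threshold, which is expected because the simulated changes compound from week to week.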


You can also visualize the data drift metrics in [Azure Machine Learning studio](https://ml.azure.com) by following these steps:

1. On the **Datasets** page, view the **Dataset monitors** tab.
2. Click the data drift monitor you want to view.
3. Select the date range over which you want to view data drift metrics (if the column chart does not show multiple weeks of data, wait a minute or so and click **Refresh**).
4. Examine the charts in the **Drift overview** section at the top, which show overall drift magnitude and the drift contribution per feature.
5. Explore the charts in the **Feature detail** section at the bottom, which enable you to see various measures of drift for individual features.

> **Note**: For help understanding the data drift metrics, see [How to monitor datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets#understanding-data-drift-results) in the Azure Machine Learning documentation.

## Explore Further

This lab is designed to introduce you to the concepts and principles of data drift monitoring. To learn more about monitoring data drift using datasets, see [Detect data drift on datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets) in the Azure Machine Learning documentation.

You can also configure data drift monitoring for services deployed in an Azure Kubernetes Service (AKS) cluster. For more information about this, see [Detect data drift on models deployed to Azure Kubernetes Service (AKS)](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-data-drift) in the Azure Machine Learning documentation.
