# Introduction to Model Drift Workshop

TO ADD CREDITS

In [None]:
%load_ext autoreload
%autoreload 2

In [49]:
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset, DataDriftPreset, DataQualityPreset, TargetDriftPreset
from evidently.test_suite import TestSuite
from evidently.test_preset import NoTargetPerformanceTestPreset, DataQualityTestPreset, DataStabilityTestPreset, DataDriftTestPreset, MulticlassClassificationTestPreset

from pathlib import Path

from sklearn import datasets, ensemble, model_selection

## Workshop setup: datasets and models

### Classification model and Iris dataset

In [50]:
# as_frame expects a boolean; True returns the data as a pandas DataFrame in `.frame`
iris_data = datasets.load_iris(as_frame=True)
iris = iris_data.frame

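# A fixed random_state keeps the reference/current split reproducible between runs
iris_ref, iris_cur = model_selection.train_test_split(iris, test_size=0.3, random_state=42)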

clas_model = ensemble.RandomForestClassifier(random_state=42, n_estimators=3)
clas_model.fit(iris_ref[iris_data.feature_names], iris_ref.target)

iris_ref["prediction"] = clas_model.predict(iris_ref[iris_data.feature_names])
iris_cur["prediction"] = clas_model.predict(iris_cur[iris_data.feature_names])

# Keep only the input features by dropping the target and prediction columns
iris_ref_input_data = iris_ref.drop(columns=["target", "prediction"])
iris_cur_input_data = iris_cur.drop(columns=["target", "prediction"])

### Regression model and California housing dataset

In [52]:
# as_frame expects a boolean; True returns the data as a pandas DataFrame in `.frame`
housing_data = datasets.fetch_california_housing(as_frame=True)
housing = housing_data.frame

housing.rename(columns={"MedHouseVal": "target"}, inplace=True)
numerical_features_reg = [
    "MedInc",
    "HouseAge",
    "AveRooms",
    "AveBedrms",
    "Population",
    "AveOccup",
    "Latitude",
    "Longitude",
]
categorical_features_reg = []
features_reg = numerical_features_reg

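housing_ref, housing_cur = model_selection.train_test_split(housing, test_size=0.3, random_state=42)

# Subsample to keep the reports fast; fixed seeds make the samples reproducible
housing_ref = housing_ref.sample(n=5000, replace=False, random_state=42)
housing_cur = housing_cur.sample(n=1000, replace=False, random_state=42)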

reg_model = ensemble.RandomForestRegressor(random_state=42)
reg_model.fit(housing_ref[features_reg], housing_ref.target)

housing_ref["prediction"] = reg_model.predict(housing_ref[features_reg])
housing_cur["prediction"] = reg_model.predict(housing_cur[features_reg])

# Keep only the input features by dropping the target and prediction columns
housing_ref_input_data = housing_ref.drop(columns=["target", "prediction"])
housing_cur_input_data = housing_cur.drop(columns=["target", "prediction"])

## Model performance

Goal:
* understand the Column Mapping concept
* work through Exercise 1 to reinforce the Column Mapping concept
* explore a pre-built report for the Classification model performance
* explore a pre-built report for the Regression model performance in Exercise 2

### Column mapping

Evidently expects a certain dataset structure and input column names. You can specify any differences by creating a `ColumnMapping` object, which works the same way for test suites and reports. If `column_mapping` is not specified or is set to `None`, Evidently falls back to a default strategy that applies heuristics and rules to map the input data automatically. Because these heuristics can guess wrong, we recommend always setting the column mapping manually.

In [None]:
clas_column_mapping = ColumnMapping()

clas_column_mapping.numerical_features = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

clas_column_mapping.target = "target"
clas_column_mapping.target_names = ["Setosa", "Versicolour", "Virginica"]
clas_column_mapping.prediction = "prediction"

clas_column_mapping.task = "classification"

#### Exercise 1

Map columns for the regression model


### Classification Performance report

**Classification Performance report** evaluates the quality of a classification model.
- Can be generated for a single dataset or compared against a reference (e.g. past performance or an alternative model).
- Works for binary and multi-class, probabilistic and non-probabilistic classification.
- Displays a variety of metrics and plots related to the model performance.
- Helps explore regions where the model makes different types of errors.

#### When to use the report
These presets help evaluate and test the quality of classification models. You can use them:
1. To monitor the performance of a classification model in production. You can run the test suite as a regular job (e.g., weekly or when you get the labels) to contrast the model performance against the expectation. You can generate visual reports for documentation and sharing with stakeholders.
2. To trigger or decide on the model retraining. You can use the test suite to check if the model performance is below the threshold to initiate a model update.
3. To debug or improve model performance. If you detect a quality drop, you can use the visual report to explore the model errors and underperforming segments. By manipulating the input data frame, you can explore how the model performs on different data segments (e.g., users from a specific region). You can also combine it with the Data Drift report.
4. To analyze the results of the model test. You can explore the results of an online or offline test and contrast it to the performance in training. You can also use this report to compare the model performance in an A/B test or during a shadow model deployment.

To run performance checks as part of the pipeline, use the Test Suite. To explore and debug, use the Report.
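
The retraining trigger from point 2 boils down to comparing a quality metric with an agreed threshold. The following is a minimal plain-Python sketch of that idea and does not use the Evidently Test Suite; the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def needs_retraining(y_true, y_pred, min_accuracy=0.9):
    """Return True when accuracy falls below the agreed threshold (hypothetical rule)."""
    return accuracy(y_true, y_pred) < min_accuracy


# Toy labels: 8 of the 10 predictions are correct
y_true = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
y_pred = [0, 0, 1, 1, 2, 2, 0, 1, 1, 2]

if needs_retraining(y_true, y_pred):
    print("Accuracy is below the threshold: consider retraining")
```

In a scheduled job, the same comparison would run each time new labels arrive.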

#### How to run the report
To run this report, you need to have both target and prediction columns available. Input features are optional. Pass them if you want to explore the relations between features and target. Refer to the column mapping section to see how to pass model predictions and labels in different cases. The tool does not yet work for multi-label classification. It expects a single true label. To generate a comparative report, you will need two datasets. You can also run this report for a single dataset, with no comparison performed.

#### How it looks

The report includes multiple components. The composition might vary based on problem type (there are more plots in the case of probabilistic classification). All plots are interactive.

1. Model Quality Summary Metrics
Evidently calculates a few standard model quality metrics: Accuracy, Precision, Recall, F1-score, ROC AUC, and LogLoss. To support the model performance analysis, Evidently also generates interactive visualizations. They help analyze where the model makes mistakes and come up with improvement ideas.

2. Class Representation
Shows the number of objects of each class.

3. Confusion Matrix
Visualizes the classification errors and their type.

4. Quality Metrics by Class
Shows the model quality metrics for the individual classes. In the case of multi-class problems, it will also include ROC AUC.

5. Class Separation Quality
A scatter plot of the predicted probabilities shows correct and incorrect predictions for each class. It serves as a representation of both model accuracy and the quality of its calibration. It also helps visually choose the best probability threshold for each class.

6. Probability Distribution
A similar view as above, it shows the distribution of predicted probabilities.

7. ROC Curve
The ROC curve (receiver operating characteristic curve) plots the true positive rate against the false positive rate at different classification thresholds.

8. Precision-Recall Curve
The precision-recall curve shows the trade-off between precision and recall for different classification thresholds.

9. Precision-Recall Table
The table shows possible outcomes for different classification thresholds and prediction coverage. If you have two datasets, the table is generated for both. Each line in the table defines a case when only top-X% predictions are considered, with a 5% step. It shows the absolute number of predictions (Count) and the probability threshold (Prob) that correspond to this combination. The table then shows the quality metrics for a given combination. It includes Precision, Recall, the share of True Positives (TP), and False Positives (FP). This helps explore the quality of the model if you choose to act only on some of the predictions.

10. Classification Quality by Feature
In this table, we show a number of plots for each feature. To expand the plots, click on the feature name. In the tab “ALL”, you can see the distribution of classes against the values of the feature. If you compare the two datasets, it visually shows the changes in the feature distribution and in the relationship between the values of the feature and the target. For each class, you can see the predicted probabilities alongside the values of the feature. It visualizes the regions where the model makes errors of each type and reveals the low-performance segments. You can compare the distributions and see if the errors are sensitive to the values of a given feature.
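
The arithmetic behind the Precision-Recall Table (item 9) can be sketched in plain Python. This is an illustration under stated assumptions, not Evidently's implementation: the function name, the row keys, and the toy data are made up, and the labels are assumed to be 0/1 for a single positive class.

```python
import math


def precision_recall_table(y_true, y_prob, step_percent=5):
    """Build one row per top-X% of predictions, ranked by predicted probability."""
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    rows = []
    for percent in range(step_percent, 101, step_percent):
        count = math.ceil(len(ranked) * percent / 100)  # number of top predictions kept
        selected = ranked[:count]
        tp = sum(label for _, label in selected)  # true positives among the kept rows
        fp = count - tp                           # false positives among the kept rows
        rows.append({
            "top_percent": percent,
            "count": count,
            "prob": selected[-1][0],  # lowest probability that is still included
            "tp": tp,
            "fp": fp,
            "precision": tp / count,
            "recall": tp / total_positives if total_positives else 0.0,
        })
    return rows


# Toy data: 20 predictions, already sorted by probability for readability
y_prob = [0.98, 0.95, 0.91, 0.88, 0.84, 0.80, 0.75, 0.71, 0.66, 0.60,
          0.55, 0.50, 0.44, 0.40, 0.35, 0.30, 0.22, 0.15, 0.10, 0.05]
y_true = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1,
          0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

table = precision_recall_table(y_true, y_prob)
for row in table[:5]:
    print(row)
```

Reading the rows from top to bottom shows how precision tends to fall and recall to rise as more predictions are acted on.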

#### Report customization
- You can perform the analysis of relations between features and target only for selected columns.
- You can pass relevant parameters to change the way some of the metrics are calculated, such as decision threshold or K to evaluate precision@K. 
- You can use a different color schema for the report.
- If you want to exclude some of the metrics, you can create a custom report by combining the chosen metrics. 


In [None]:
classification_performance_report = Report(
    metrics=[
        ClassificationPreset(),
    ]
)

classification_performance_report.run(
    reference_data=iris_ref, current_data=iris_cur, column_mapping=clas_column_mapping
)

In [None]:
classification_performance_report.show(mode="inline")

In [None]:
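# Create the output directory first; saving fails if it does not exist
Path("reports").mkdir(parents=True, exist_ok=True)

classification_performance_report.save_html(
    Path("reports", "clas_perf_report.html")
)
classification_performance_report.save_json(
    Path("reports", "clas_perf_report.json")
)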

### Regression Performance report

**The Regression Performance report** evaluates the quality of a regression model. It can also compare it to the past performance of the same model, or the performance of an alternative model.
- Works for a single model or helps compare the two
- Displays a variety of plots related to the performance and errors
- Helps explore areas of under- and overestimation

#### When to use the report
These presets help evaluate and test the quality of regression models. You can use them in different scenarios.
1. To monitor the performance of a regression model in production. You can run the test suite as a regular job (e.g., weekly or every time you get the labels) to contrast the model performance against the expectation. You can generate visual reports for documentation and sharing with stakeholders.
2. To trigger or decide on the model retraining. You can use the test suite to check if the model performance is below the threshold to initiate a model update.
3. To debug or improve model performance. If you detect a quality drop, you can use the visual report to explore the model errors. You can use the Error Bias table to identify the groups with high error where the model under- or over-estimates the target function. By manipulating the input data frame, you can explore the performance on different data segments (e.g., users from a specific region). You can also combine it with the Data Drift report.
4. To analyze the results of the model test. You can explore the results of an online or offline test and contrast it to the performance in training. You can use this report to compare the model performance in an A/B test or during a shadow model deployment.

#### How to run the report
To run this report, you need to have both target and prediction columns available. Input features are optional. Pass them if you want to explore the relations between features and target.
To generate a comparative report, you will need two datasets. The reference dataset serves as a benchmark. Evidently analyzes the change by comparing the current production data to the reference data. You can also run this report for a single dataset, with no comparison performed.

#### How it looks
The report includes multiple components. All plots are interactive.

1. Model Quality Summary Metrics
Evidently calculates a few standard model quality metrics: Mean Error (ME), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE). For each quality metric, Evidently also shows one standard deviation of its value (in brackets) to estimate the stability of the performance. To support the model performance analysis, Evidently also generates interactive visualizations. They help analyze where the model makes mistakes and come up with improvement ideas.

2. Predicted vs Actual
Predicted versus actual values in a scatter plot.

3. Predicted vs Actual in Time
Predicted and Actual values over time or by index, if no datetime is provided.

4. Error (Predicted - Actual)
Model error values over time or by index, if no datetime is provided.

5. Absolute Percentage Error
Absolute percentage error values over time or by index, if no datetime is provided.

6. Error Distribution
Distribution of the model error values.

7. Error Normality
Quantile-quantile plot (Q-Q plot) to estimate value normality.

Next, Evidently explores in detail two segments of the dataset: the 5% of predictions with the highest negative errors and the 5% with the highest positive errors. We refer to them as the "underestimation" and "overestimation" groups. We refer to the rest of the predictions as the "majority".

8. Mean Error per Group
A summary of the model quality metrics for each of the two segments: Mean Error (ME), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE).

9. Predicted vs Actual per Group
Prediction plots that visualize the regions where the model underestimates and overestimates the target function.

10. Error Bias: Mean/Most Common Feature Value per Group
This table helps quickly see the differences in feature values between the 3 groups:
- OVER (top-5% of predictions with overestimation)
- UNDER (top-5% of the predictions with underestimation)
- MAJORITY (the remaining 90%)

For the numerical features, it shows the mean value per group. For the categorical features, it shows the most common value. If you have two datasets, the table displays the values for both REF (reference) and CURR (current). If you observe a large difference between the groups, it means that the model error is sensitive to the values of a given feature. To search for cases like this, you can sort the table using the column "Range(%)". It increases when either or both of the "extreme" groups are different from the majority.

11. Error Bias per Feature
For each feature, Evidently shows a histogram to visualize the distribution of its values in the segments with extreme errors and in the rest of the data. You can visually explore if there is a relationship between the high error and the values of a given feature.

12. Predicted vs Actual per Feature
For each feature, Evidently also shows the Predicted vs Actual scatterplot. It helps visually detect and explore underperforming segments which might be sensitive to the values of the given feature.
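
The summary metrics (item 1) and the error groups (items 8 to 10) follow from the signed error, Predicted minus Actual. The sketch below shows the arithmetic in plain Python under stated assumptions: the function names and toy numbers are illustrative, no actual value is zero (MAPE is undefined there), and Evidently's own implementation may differ in details such as tie handling.

```python
def regression_summary(actual, predicted):
    """Mean Error, Mean Absolute Error and Mean Absolute Percentage Error."""
    errors = [p - a for a, p in zip(actual, predicted)]  # Error = Predicted - Actual
    n = len(errors)
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Assumes no actual value is zero, otherwise the ratio is undefined
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return {"ME": me, "MAE": mae, "MAPE": mape}


def split_error_groups(actual, predicted, share=0.05):
    """Label each row UNDER, OVER or MAJORITY by its signed error."""
    errors = [p - a for a, p in zip(actual, predicted)]
    k = max(1, int(len(errors) * share))  # rows in each extreme group
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    labels = ["MAJORITY"] * len(errors)
    for i in order[:k]:
        labels[i] = "UNDER"  # most negative errors: prediction too low
    for i in order[-k:]:
        labels[i] = "OVER"   # most positive errors: prediction too high
    return labels


# Toy data: 20 rows, with one strong overestimate and one strong underestimate
actual = [100.0] * 20
predicted = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 100.0,
             101.0, 99.0, 100.0, 102.0, 98.0, 100.0, 100.0, 100.0, 150.0, 60.0]

print(regression_summary(actual, predicted))
print(split_error_groups(actual, predicted))
```

Comparing feature means between the `OVER`, `UNDER` and `MAJORITY` rows is then what the Error Bias table summarizes.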

#### Report customization
- You can perform the Error bias analysis only for selected columns.
- You can use a different color schema for the report.
- If you want to exclude some of the metrics, you can create a custom report by combining the chosen metrics.

#### Exercise 2

Create a Regression Performance report for the regression model, run it, show it, and save it as HTML and JSON.

In [None]:
# create the report and run it

In [None]:
# show the report

In [None]:
# save the report as html and json

# "reports", "reg_perf_report.html"
# "reports", "reg_perf_report.json"

## Handling the drift

Goal:

* get familiar with pre-built Data Quality, Data Drift and Target Drift reports for the classification model
* explore pre-built Data Quality, Data Drift and Target Drift reports for the regression model in Exercise 3

### Data Quality report
**The Data Quality report** provides detailed feature statistics and a feature behavior overview.
- The report works for a single dataset or compares the two.
- Calculates base statistics for numerical, categorical and datetime features
- Displays interactive plots with data distribution and behavior in time
- Plots interactions and correlations between features and target

#### When to use the report
You might need to track and evaluate data quality and integrity in different scenarios.
1. Data quality tests in production. You can check the quality and stability of the input data before you generate the predictions, every time you perform a certain transformation, add a new data source, etc.
2. Data profiling in production. You can log and store JSON snapshots of your production data stats for future analysis and visualization.
3. Exploratory data analysis. You can use the visual report to explore your training dataset and understand which features are stable and useful enough to use in modeling.
4. Dataset comparison. You can use the report to compare two datasets to confirm similarities or understand the differences. For example, you might compare training and test datasets, subgroups in the same dataset (e.g., customers from Region 1 and Region 2), or current production data against training.
5. Production model debugging. If your model is underperforming, you can use this report to explore and interpret the details of changes in the input data or debug the quality issues.

For production pipeline tests, use Test Suites. For exploratory analysis and debugging, use Report.

#### How to run the report
- **Input features**. You need to pass only the input features. Target and prediction are optional.
- **One or two datasets**. If you want to perform a side-by-side comparison, pass two datasets with identical schema. You can also pass a single dataset.
- **Column mapping**. Feature types (numerical, categorical, datetime) will be parsed based on pandas column type. If you want to specify a different feature mapping strategy, you can explicitly set the feature type using `column_mapping`.

You might also need to specify additional column mapping:
- If you have a **datetime** column and want to learn how features change with time, specify the datetime column in the `column_mapping`.
- If you have a **target** column and want to see feature distributions by target, specify the target column in the `column_mapping`.
- Specify the **task** if you want to explore interactions between the features and the target. This section looks slightly different for classification and regression tasks. By default, if the target has a numeric type and has >5 unique values, Evidently will treat it as a regression problem. Everything else is treated as a classification problem. If you want to explicitly define your task as `regression` or `classification`, you should set the `task` parameter in the `column_mapping` object.
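
The default task rule described above can be written down in a few lines. This is a plain-Python sketch of the stated rule, not Evidently's code; `infer_task` and `max_classes` are hypothetical names.

```python
from numbers import Number


def infer_task(target_values, max_classes=5):
    """Guess the task from the target column using the rule described above."""
    values = [v for v in target_values if v is not None]
    # Booleans are a subclass of int in Python, so exclude them explicitly
    is_numeric = all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    if is_numeric and len(set(values)) > max_classes:
        return "regression"
    return "classification"


print(infer_task([0, 1, 2, 0, 1, 2]))                # three numeric classes
print(infer_task([1.5, 2.1, 3.3, 4.8, 0.7, 2.9]))    # many distinct numeric values
print(infer_task(list(range(1, 11))))                # ratings 1..10 stored as integers
```

The last call shows why setting `task` explicitly is safer: a 10-class rating target has more than 5 unique numeric values, so a rule like this would treat it as regression.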

#### How it looks
The default report includes 3 widgets. All plots are interactive.
1. Summary widget
The table gives an overview of the dataset, including missing or empty features and other general information. It also shows the share of almost empty and almost constant features, that is, features where 95% or more of the values are missing or constant.
2. Features widget
For each feature, this widget generates a set of visualizations. They vary depending on the feature type. There are 3 components:
- 2.1. Feature overview table
The table shows relevant statistical summaries for each feature based on its type and a visualization of feature distribution.
- 2.2. Feature in time
If you click on "details", each feature includes an additional visualization that shows the feature behavior in time.
- 2.3. Feature by target
Categorical and numerical features include an additional visualization that plots the interaction between a given feature and the target.
3. Correlation widget
This widget shows the correlations between different features.
- 3.1. Insights
This table shows a summary of pairwise feature correlations. For a single dataset, it lists the top-5 highly correlated variables from the Cramér's V correlation matrix (categorical features) and from the Spearman correlation matrix (numerical features). For two datasets, it lists the top-5 pairs of variables where correlation changes the most between the reference and current datasets. Similarly, it uses categorical features from the Cramér's V correlation matrix and numerical features from the Spearman correlation matrix.
- 3.2. Correlation heatmaps
This section includes four heatmaps. For categorical features, Evidently calculates the Cramér's V correlation matrix. For numerical features, Evidently calculates the Pearson, Spearman and Kendall matrices. If your dataset includes the target, the target is also shown in the matrix according to its type.
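
The "almost empty" and "almost constant" flags from the Summary widget can be illustrated with a short plain-Python sketch. It rests on assumptions: missing values are represented by `None`, the constant share is measured against all rows, and the function name and toy columns are made up. Evidently's exact definition may differ.

```python
from collections import Counter


def feature_flags(values, threshold=0.95):
    """Flag a feature as almost empty or almost constant."""
    n = len(values)
    present = [v for v in values if v is not None]
    missing_share = (n - len(present)) / n
    if present:
        # Share of rows taken by the single most frequent value
        constant_share = Counter(present).most_common(1)[0][1] / n
    else:
        constant_share = 0.0
    return {
        "missing_share": missing_share,
        "almost_empty": missing_share >= threshold,
        "almost_constant": constant_share >= threshold,
    }


# Toy columns with 100 rows each
dataset = {
    "sensor_a": [None] * 97 + [1.0, 2.0, 3.0],  # 97% missing
    "country": ["US"] * 96 + ["DE"] * 4,        # 96% identical
    "age": list(range(100)),                    # all values distinct
}

for name, column in dataset.items():
    print(name, feature_flags(column))
```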

#### Report customization
- You can use a different color schema for the report.
- You can create a different report from scratch taking this one as an inspiration by combining chosen metrics.
- You can apply the report only to selected columns, for example, the most important features.


In [None]:
clas_data_quality_report = Report(metrics=[DataQualityPreset()])
clas_data_quality_report.run(reference_data=iris_ref_input_data, current_data=iris_cur_input_data, column_mapping=clas_column_mapping)
clas_data_quality_report.save_html(Path("reports", "clas_data_quality_report.html"))
clas_data_quality_report.save_json(Path("reports", "clas_data_quality_report.json"))
clas_data_quality_report.show(mode='inline')

### Data Drift report
**The Data Drift report** helps detect and explore changes in the input data.
- Applies a suitable **drift detection method** for numerical and categorical features.
- Plots **feature values and distributions** for the two datasets.

#### When to use the report
You can evaluate data drift in different scenarios.
1. To monitor the model performance without ground truth. When you do not have true labels or actuals, you can monitor the feature drift to check if the model operates in a familiar environment. If you detect drift, you can trigger labeling and retraining, or decide to pause and switch to a different decision method.
2. When you are debugging the model quality decay. If you observe a drop in the model quality, you can evaluate Data Drift to explore the change in the feature patterns, e.g., to understand the change in the environment or discover the appearance of a new segment.
3. To understand model drift in an offline environment. You can explore the historical data drift to understand past changes in the input data and define the optimal drift detection approach and retraining strategy.
4. To decide on the model retraining. Before feeding fresh data into the model, you might want to verify whether it even makes sense. If there is no data drift, the environment is stable, and retraining might not be necessary.

To run drift checks as part of the pipeline, use the Test Suite. To explore and debug, use the Report.

#### How to run the report
- You will need two datasets. The reference dataset serves as a benchmark. Evidently analyzes the change by comparing the current production data to the reference data to detect distribution drift.
- Input features. The dataset should include the features you want to evaluate for drift. The schema of both datasets should be identical. If your dataset contains a target or prediction column, it will also be analyzed for drift.
- Column mapping. Evidently can evaluate drift both for numerical and categorical features. You can explicitly specify the type of the column in column mapping. If it is not specified, Evidently will define the column type automatically.

#### How it looks
The default report includes 4 components. All plots are interactive.
1. Data Drift Summary
The report returns the share of drifting features and an aggregate Dataset Drift result. Dataset Drift sets a rule on top of the results of the statistical tests for individual features. By default, Dataset Drift is detected if at least 50% of features drift. Evidently uses the default data drift detection algorithm to select the drift detection method based on feature type and the number of observations in the reference dataset.
2. Data Drift Table
The table shows the drifting features first. You can also choose to sort the rows by the feature name or type.
3. Data Distribution by Feature
By clicking on each feature, you can explore the distributions.
4. Data Drift by Feature
For numerical features, you can also explore the values mapped in a plot.
- The dark green line is the mean, as seen in the reference dataset.
- The green area covers one standard deviation from the mean.
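
Two ideas from this list can be sketched in plain Python: the Dataset Drift rule from item 1 and the reference band from item 4. The p-values below are invented, the 0.05 significance level and the use of the sample standard deviation are assumptions, and the function names are hypothetical. The sketch does not reproduce how Evidently selects or runs its statistical tests.

```python
from statistics import mean, stdev


def dataset_drift(p_values, alpha=0.05, drift_share=0.5):
    """Apply the Dataset Drift rule on top of per-feature test results."""
    drifted = [name for name, p in p_values.items() if p < alpha]
    share = len(drifted) / len(p_values)
    return {
        "drifted_features": drifted,
        "share": share,
        "dataset_drift": share >= drift_share,  # "at least 50% of features drift"
    }


def share_outside_reference_band(reference, current):
    """Share of current values outside mean +/- one standard deviation of the reference."""
    center, spread = mean(reference), stdev(reference)
    low, high = center - spread, center + spread
    outside = sum(1 for value in current if value < low or value > high)
    return outside / len(current)


# Made-up p-values of per-feature drift tests
p_values = {
    "sepal length (cm)": 0.01,
    "sepal width (cm)": 0.40,
    "petal length (cm)": 0.03,
    "petal width (cm)": 0.70,
}
print(dataset_drift(p_values))

# Toy feature values for the band check
reference = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0]
current = [5.0, 5.5, 7.0, 7.5, 4.5, 8.0]
print(share_outside_reference_band(reference, current))
```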

#### Report customization
- You can specify the drift detection methods and thresholds.
- You can add a custom drift detection method.
- You can use a different color schema for the report.
- You can create a different report from scratch taking this one as an inspiration.
- You can apply the report only to selected columns, for example, the most important features.

In [None]:
clas_data_drift_report = Report(metrics=[DataDriftPreset()])
clas_data_drift_report.run(reference_data=iris_ref_input_data, current_data=iris_cur_input_data, column_mapping=clas_column_mapping)
clas_data_drift_report.save_html(Path("reports", "clas_data_drift_report.html"))
clas_data_drift_report.save_json(Path("reports", "clas_data_drift_report.json"))
clas_data_drift_report.show(mode="inline")

### Target Drift report
**The Target Drift report** helps detect and explore changes in the target function and/or model predictions:
- Performs a suitable statistical test to compare target (prediction) distribution.
- For numerical targets, calculates the correlations between the feature and the target (prediction)
- Plots the relations between each individual feature and the target (prediction)

You can generate this preset for both numerical targets (e.g. if you have a regression problem) and categorical targets (e.g. if you have a classification problem). You can explicitly specify the type of the target column in column mapping. If it is not specified, Evidently will define the column type automatically.

#### When to use the report
You can analyze target or prediction drift:
1. To monitor the model performance without ground truth. When you do not have true labels or actuals, you can monitor Prediction Drift to react to meaningful changes. For example, to detect when there is a distribution shift in predicted values, probabilities, or classes. You can often combine it with the Data Drift analysis.
2. When you are debugging the model decay. If you observe a drop in performance, you can evaluate Target Drift to see how the behavior of the target changed and explore the shift in the relationship between the features and prediction.
3. Before model retraining. Before feeding fresh data into the model, you might want to verify whether it even makes sense. If there is no target drift, the concept is stable, and retraining might not be necessary.

To run drift checks as part of the pipeline, use the Test Suite. To explore and debug, use the Report.

#### How to run the report
To run this preset, you need to have target and/or prediction columns available. Input features are optional. Pass them if you want to analyze the correlations between the features and target (prediction). Evidently estimates the drift for the target and predictions in the same manner. If you pass both columns, Evidently will generate two sets of plots. If you pass only one of them (either target or predictions), Evidently will build one set of plots. You will need two datasets. The reference dataset serves as a benchmark. Evidently analyzes the change by comparing the current production data to the reference data.

#### How it looks
The report includes 4 components. All plots are interactive.
1. Target (Prediction) Drift
The report first shows the comparison of target (prediction) distributions in the current and reference datasets. You can see the result of the statistical test or the value of a distance metric. Evidently uses the default data drift detection algorithm to select the drift detection method based on target type and the number of observations in the reference dataset.
2. Target (Prediction) Correlations
For numerical targets, the report calculates the Pearson correlation between the target (prediction) and each individual feature in the current and reference datasets. Comparing the two sets of correlations helps detect shifts in the relationship.
3. Target (Prediction) Values
For numerical targets, the report visualizes the target (prediction) values by index or time (if the datetime column is available or defined in the `column_mapping` object). This plot helps explore the target behavior and compare it between the datasets.
4. Target (Prediction) Behavior By Feature
Finally, it generates an interactive table with the visualizations of dependencies between the target and each feature. If you click on any feature in the table, you get an overview of its behavior. The plot shows how feature values relate to the target (prediction) values and if there are differences between the datasets. It helps explore if they can explain the target (prediction) shift. We recommend paying attention to the behavior of the most important features since significant changes might confuse the model and cause higher errors.
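
The correlation comparison from item 2 can be sketched in plain Python. The data is a toy dictionary of columns rather than the workshop's DataFrames, and `pearson` and `correlation_shift` are hypothetical helper names, so treat this as an illustration of the idea rather than of Evidently's internals.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_shift(reference, current, feature, target="target"):
    """Feature/target correlation in both datasets and the change between them."""
    ref_corr = pearson(reference[feature], reference[target])
    cur_corr = pearson(current[feature], current[target])
    return ref_corr, cur_corr, cur_corr - ref_corr


# Toy data: the relationship between the feature and the target flips sign
reference = {"MedInc": [1, 2, 3, 4, 5], "target": [2, 4, 6, 8, 10]}
current = {"MedInc": [1, 2, 3, 4, 5], "target": [10, 8, 6, 4, 2]}

print(correlation_shift(reference, current, "MedInc"))
```

A large change in correlation for an important feature suggests that the relationship the model learned may no longer hold.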

#### Report customization
- You can specify the drift detection methods and thresholds.
- You can add a custom drift detection method.
- You can use a different color schema for the report.
- You can create a different report or test suite from scratch taking this one as an inspiration.

In [None]:
clas_target_drift_report = Report(metrics=[TargetDriftPreset()])
clas_target_drift_report.run(reference_data=iris_ref, current_data=iris_cur, column_mapping=clas_column_mapping)
clas_target_drift_report.save_html(Path("reports", "clas_target_drift_report.html"))
clas_target_drift_report.save_json(Path("reports", "clas_target_drift_report.json"))
clas_target_drift_report.show(mode="inline")

#### Exercise 3

Create a Data Quality report, a Data Drift report, and a Target Drift report for the regression model. Run them, show them, and save them as HTML and JSON.

In [None]:
# create all 3 reports and run them

In [None]:
# show the reports

In [None]:
# save the reports as html and json

# "reports", "reg_drift_report.html"
# "reports", "reg_drift_report.json"

## Test-based monitoring

Goal:
* lorem ipsum
* lorem ipsum