# Introduction to Model Drift Workshop

TO ADD CREDITS

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
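# RegressionPreset is added because Exercise 2 builds a regression performance report.
from evidently.metric_preset import (
    ClassificationPreset,
    DataDriftPreset,
    DataQualityPreset,
    RegressionPreset,
    TargetDriftPreset,
)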
from evidently.test_suite import TestSuite
from evidently.test_preset import NoTargetPerformanceTestPreset, DataQualityTestPreset, DataStabilityTestPreset, DataDriftTestPreset, MulticlassClassificationTestPreset

from pathlib import Path

from sklearn import datasets, ensemble, model_selection

## Workshop setup: datasets and models

### Classification model and Iris dataset

In [None]:
# as_frame must be a boolean for load_iris; True returns pandas objects.
iris_data = datasets.load_iris(as_frame=True)
iris = iris_data.frame

iris_ref, iris_cur = model_selection.train_test_split(iris, test_size=0.3)

clas_model = ensemble.RandomForestClassifier(random_state=42, n_estimators=3)
clas_model.fit(iris_ref[iris_data.feature_names], iris_ref.target)

iris_ref["prediction"] = clas_model.predict(iris_ref[iris_data.feature_names])
iris_cur["prediction"] = clas_model.predict(iris_cur[iris_data.feature_names])

### Regression model and California housing dataset

In [None]:
# as_frame must be a boolean here as well; True returns pandas objects.
housing_data = datasets.fetch_california_housing(as_frame=True)
housing = housing_data.frame

housing.rename(columns={"MedHouseVal": "target"}, inplace=True)
numerical_features_reg = [
    "MedInc",
    "HouseAge",
    "AveRooms",
    "AveBedrms",
    "Population",
    "AveOccup",
    "Latitude",
    "Longitude",
]
categorical_features_reg = []
features_reg = numerical_features_reg

housing_ref = housing.sample(n=5000, replace=False)
housing_cur = housing.sample(n=5000, replace=False)

reg_model = ensemble.RandomForestRegressor(random_state=42)
reg_model.fit(housing_ref[features_reg], housing_ref.target)

housing_ref["prediction"] = reg_model.predict(housing_ref[features_reg])
housing_cur["prediction"] = reg_model.predict(housing_cur[features_reg])

## Model performance

Goal:
* understand the Column Mapping concept
* practice Column Mapping in Exercise 1
* explore a pre-built report for classification model performance
* explore a pre-built report for regression model performance in Exercise 2

### Column mapping

Evidently expects a certain dataset structure and input column names. You can specify any differences by creating a ColumnMapping object. It works the same way for test suites and reports. If the `column_mapping` is not specified or set as `None`, Evidently will use the default mapping strategy.

In [None]:
clas_column_mapping = ColumnMapping()

clas_column_mapping.numerical_features = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

clas_column_mapping.target = "target"
clas_column_mapping.target_names = ["Setosa", "Versicolour", "Virginica"]
clas_column_mapping.prediction = "prediction"

#### Exercise 1

Map columns for the regression model


### Explore a pre-built report for the Classification model performance

**Classification Performance report** evaluates the quality of a classification model. It works both for binary and multi-class classification. 

NOTE: There is a separate report for probabilistic classification.

This report can be generated for a single model, or as a comparison. You can contrast your current production model performance against the past or an alternative model.

#### When to use the report

1. To analyze the results of the model test. You can explore the results of an online or offline test and contrast them with the performance in training. Though this is not the primary use case, you can use this report to compare the model performance in an A/B test, or during a shadow model deployment.

2. To generate regular reports on the performance of a production model. You can run this report as a regular job (e.g. weekly or at every batch model run) to analyze its performance and share it with other stakeholders.

3. To analyze the model performance on the slices of data. By manipulating the input data frame, you can explore how the model performs on different data segments (e.g. users from a specific region).

4. To trigger or decide on the model retraining. You can use this report to check if your performance is below the threshold to initiate a model update and evaluate if retraining is likely to improve performance.

5. To debug or improve model performance. You can use the Classification Quality table to identify underperforming segments and decide on ways to address them.

To run this report, you need to have **input features** and **both target and prediction** columns available. You can use either **numerical labels** like "0", "1", "2" or **class names** like "virginica", "setosa", "versicolor" in the target and prediction columns. The labels should be the same for the target and predictions.

NOTE: Class order matters for binary classification. The tool expects the target (so-called positive) class to be the first in the `column_mapping['prediction']` list.
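The requirement that target and prediction use the same labels can be checked before the report is run. The sketch below is only an illustration: `compare_labels` is a hypothetical helper, not part of Evidently, and it works on plain Python lists (for a data frame you could pass something like `df["target"].tolist()`).

```python
def compare_labels(target, prediction):
    """Summarize label mismatches between a target and a prediction column."""
    target_labels = set(target)
    prediction_labels = set(prediction)
    all_labels = target_labels | prediction_labels
    return {
        # Labels the model predicted that never occur in the target column.
        "only_in_prediction": sorted(prediction_labels - target_labels, key=str),
        # True when, for example, integers and strings are mixed.
        "mixed_types": len({type(label) for label in all_labels}) > 1,
    }


# Hypothetical columns: integer targets but string predictions.
mismatch = compare_labels([0, 1, 2, 1], ["0", "1", "2", "1"])

# Consistent columns that both use class names.
consistent = compare_labels(
    ["setosa", "virginica", "setosa"],
    ["setosa", "setosa", "virginica"],
)

print(mismatch)
print(consistent)
```

A non-empty `only_in_prediction` list is not always an error, because a small data slice may simply lack some classes in the target. It is a prompt to look closer, especially when `mixed_types` is also true.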

To generate a comparative report, you will need **two** datasets. **The reference dataset** serves as a benchmark. We analyze the change by comparing **the current production data** to **the reference data**.

In [None]:
classification_performance_report = Report(
    metrics=[
        ClassificationPreset(),
    ]
)

classification_performance_report.run(
    reference_data=iris_ref, current_data=iris_cur, column_mapping=clas_column_mapping
)

In [None]:
classification_performance_report.show(mode="inline")

In [None]:
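# Make sure the output directory exists before saving.
Path("reports").mkdir(exist_ok=True)

classification_performance_report.save_html(
    Path("reports", "clas_perf_report.html")
)
# Same base name as the HTML file for consistency.
classification_performance_report.save_json(
    Path("reports", "clas_perf_report.json")
)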

#### How it looks

The report includes 5 components. All plots are interactive.

1. Model Quality Summary Metrics
We calculate a few standard model quality metrics: Accuracy, Precision, Recall, and F1-score. To support the model performance analysis, we also generate interactive visualizations. They help analyze where the model makes mistakes and come up with improvement ideas.

2. Class Representation
Shows the number of objects of each class.

3. Confusion Matrix
Visualizes the classification errors and their type.

4. Quality Metrics by Class
Shows the model quality metrics for the individual classes.

5. Classification Quality by Feature
In this table, we show a number of plots for each feature. To expand the plots, click on the feature name. In the tab “ALL”, we plot the distribution of classes against the values of the feature. This is the “Target Behavior by Feature” plot from the Categorical Target Drift report. If you compare the two datasets, it visually shows the changes in the feature distribution and in the relationship between the values of the feature and the target. Then, for each class, we plot the distribution of the True Positive, True Negative, False Positive, and False Negative predictions alongside the values of the feature. It visualizes the regions where the model makes errors of each type and reveals the low-performance segments. This helps explore if a specific type of misclassification error is sensitive to the values of a given feature.
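To make the summary and per-class metrics concrete, here is a minimal sketch that computes accuracy and one-vs-rest precision, recall, and F1 from two label lists. It uses only the standard library and toy data. The function names are hypothetical, and this is not how Evidently computes the values internally.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the actual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred):
    """One-vs-rest TP/FP/FN/TN, precision, recall and F1 for every label."""
    pairs = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count
    total = len(y_true)
    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(label, label)]
        fp = sum(n for (a, p), n in pairs.items() if p == label and a != label)
        fn = sum(n for (a, p), n in pairs.items() if a == label and p != label)
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        result[label] = {
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1,
        }
    return result


# Toy labels for three classes (made up for illustration).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

overall_accuracy = accuracy(y_true, y_pred)
metrics = per_class_metrics(y_true, y_pred)

for label, values in metrics.items():
    print(label, values)
```

The per-class TP, FP, FN, and TN counts are the same quantities that the Classification Quality by Feature table plots against feature values.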

### Explore a pre-built report for the Regression model performance

**The Regression Performance report** evaluates the quality of a regression model. It can also compare it to the past performance of the same model, or the performance of an alternative model.

#### When to use the report

1. To analyze the results of the model test. You can explore the results of an online or offline test and contrast them with the performance in training. Though this is not the primary use case, you can use this report to compare the model performance in an A/B test, or during a shadow model deployment.

2. To generate regular reports on the performance of a production model. You can run this report as a regular job (e.g. weekly or at every batch model run) to analyze its performance and share it with other stakeholders.

3. To analyze the model performance on the slices of data. By manipulating the input data frame, you can explore how the model performs on different data segments (e.g. users from a specific region).

4. To trigger or decide on the model retraining. You can use this report to check if your performance is below the threshold to initiate a model update and evaluate if retraining is likely to improve performance.

5. To debug or improve model performance by identifying areas of high error. You can use the Error Bias table to identify the groups that contribute disproportionately to the total error, or where the model under- or over-estimates the target function.

To run this report, you need to have **input features**, and **both target and prediction columns** available. To generate a comparative report, you will need **two** datasets. **The reference dataset** serves as a benchmark. We analyze the change by comparing **the current production data** to **the reference data**.

#### Exercise 2

Create a Regression Performance report for the regression model, run it, show it, and save it as HTML and JSON.

In [None]:
# create the report and run it

In [None]:
# show the report

In [None]:
# save the report as html and json

# "reports", "reg_perf_report.html"
# "reports", "reg_perf_report.json"

#### How it looks

The report includes 12 components. All plots are interactive.

1. Model Quality Summary Metrics
We calculate a few standard model quality metrics: Mean Error (ME), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE). For each quality metric, we also show one standard deviation of its value (in brackets) to estimate the stability of the performance. Next, we generate a set of plots. They help analyze where the model makes mistakes and come up with improvement ideas.

2. Predicted vs Actual
Predicted versus actual values in a scatter plot.

3. Predicted vs Actual in Time
Predicted and Actual values over time or by index, if no datetime is provided.

4. Error (Predicted - Actual)
Model error values over time or by index, if no datetime is provided.

5. Absolute Percentage Error
Absolute percentage error values over time or by index, if no datetime is provided.

6. Error Distribution
Distribution of the model error values.

7. Error Normality
Quantile-quantile plot (Q-Q plot) to estimate value normality.

Next, we explore in detail two segments of the dataset: the 5% of predictions with the highest negative errors and the 5% with the highest positive errors. We refer to them as the "underestimation" and "overestimation" groups. We refer to the rest of the predictions as the "majority".

8. Mean Error per Group
We show a summary of the model quality metrics for each of the two groups: Mean Error (ME), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE).

9. Predicted vs Actual per Group
We plot the predictions, coloring them by the group they belong to. It visualizes the regions where the model underestimates and overestimates the target function.

10. Error Bias: Mean/Most Common Feature Value per Group
This table helps quickly see the differences in feature values between the 3 groups:

* OVER (top 5% of predictions with overestimation)
* UNDER (top 5% of predictions with underestimation)
* MAJORITY (the remaining 90%)

For the numerical features, it shows the mean value per group. For the categorical features, it shows the most common value. If you have two datasets, the table displays the values for both REF (reference) and CURR (current). If you observe a large difference between the groups, it means that the model error is sensitive to the values of a given feature. To search for cases like this, you can sort the table using the column "Range(%)". It increases when either or both of the "extreme" groups are different from the majority.

11. Error Bias per Feature
For each feature, we show a histogram to visualize the distribution of its values in the segments with extreme errors and in the rest of the data. You can visually explore if there is a relationship between the high error and the values of a given feature.

12. Predicted vs Actual per Feature
For each feature, we also show the Predicted vs Actual scatterplot. We use colors to show the distribution of the values of a given feature. It helps visually detect and explore underperforming segments which might be sensitive to the values of the given feature.
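The summary metrics and the three error groups described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Evidently's implementation: the error is taken as predicted minus actual, as in component 4. The 5% cut is a simple rank-based split, MAPE assumes no actual value is zero, and the function names and toy data are made up.

```python
from statistics import fmean


def error_summary(actual, predicted):
    """Return ME, MAE and MAPE (in percent) with error = predicted - actual."""
    errors = [p - a for a, p in zip(actual, predicted)]
    return {
        "me": fmean(errors),
        "mae": fmean([abs(e) for e in errors]),
        # Assumes no actual value is zero.
        "mape": 100 * fmean([abs(e) / abs(a) for a, e in zip(actual, errors)]),
    }


def split_error_groups(actual, predicted, share=0.05):
    """Split row indices into UNDER, OVER and MAJORITY groups by error rank."""
    errors = [p - a for a, p in zip(actual, predicted)]
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    k = int(len(errors) * share)  # number of rows in each extreme group
    return {
        "under": order[:k],                     # most negative errors
        "over": order[len(order) - k:],         # most positive errors
        "majority": order[k:len(order) - k],    # everything in between
    }


# Toy data: 20 rows, one strong underestimate and one strong overestimate.
actual = [10.0] * 20
predicted = [5.0] + [9.0] * 9 + [11.0] * 9 + [14.0]

summary = error_summary(actual, predicted)
groups = split_error_groups(actual, predicted)

print(summary)
print({name: len(rows) for name, rows in groups.items()})
```

With the index lists from `split_error_groups`, you could compute the mean of a numerical feature per group, which is the idea behind the Error Bias table.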

## Handling the drift

Goal:

* get familiar with the pre-built Data Quality, Data Drift, and Target Drift reports for the classification model
* explore the pre-built Data Quality, Data Drift, and Target Drift reports for the regression model in Exercise 3

### Data Quality report

### Data Drift report

### Target Drift report

#### Exercise 3

Create a Data Quality report, a Data Drift report, and a Target Drift report for the regression model, run them, show them, and save them as HTML and JSON.

In [None]:
# create all 3 reports and run them

In [None]:
# show the reports

In [None]:
# save the reports as html and json

# "reports", "reg_drift_report.html"
# "reports", "reg_drift_report.json"

## Test-based monitoring

Goal:
* lorem ipsum
* lorem ipsum