# Evidently

Evidently is an open-source Python library for evaluating, testing, and monitoring data and ML models. To understand how to use Evidently, please first go through the [core concepts](https://docs.evidentlyai.com/readme/core-concepts) of Evidently. 

This tutorial provides examples of Evidently Report, TestSuite, and Monitoring UI, using the familiar dataset of bike sharing demand. 

In [1]:
# Uninstall kserve as it causes version conflicts with Evidently and we don't need it anymore
%pip uninstall kserve -y 

# The latest release of Evidently (at the time this assignment was written) has some issues with the Monitoring UI. The fixes have been pushed to its GitHub repository,
# so we install the package directly from GitHub.
# The repository used in this assignment is a fork of the Evidently repository, kept for immutability because the original repository is updated continuously.
%pip install git+https://github.com/yumoL/evidently.git 

Found existing installation: kserve 0.10.1
Uninstalling kserve-0.10.1:
  Successfully uninstalled kserve-0.10.1
Note: you may need to restart the kernel to use updated packages.
Collecting git+https://github.com/yumoL/evidently.git
  Cloning https://github.com/yumoL/evidently.git to /tmp/pip-req-build-hlvan4ew
  Running command git clone --filter=blob:none --quiet https://github.com/yumoL/evidently.git /tmp/pip-req-build-hlvan4ew
  Resolved https://github.com/yumoL/evidently.git to commit 2fb9dbe2cdfc619498d1804b7e52e9c95dd8d0e2
  Preparing metadata (setup.py) ... done
Collecting nltk>=3.6.7 (from evidently==0.4.9)
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting fastapi>=0.100.0 (from evidently==0.4.9)
  Downloading fastapi-0.104.1-py3-none-any.whl (92 kB)

In [2]:
import pandas as pd
from lightgbm import LGBMRegressor
import webbrowser
from pathlib import Path
from datetime import datetime, timedelta
from typing import Tuple
from sklearn.metrics import r2_score

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, RegressionPreset
from evidently.test_suite import TestSuite
from evidently.tests import TestValueR2Score
from evidently.ui.dashboards import ReportFilter, DashboardPanelTestSuiteCounter, CounterAgg, DashboardPanelPlot, PanelValue, PlotType, DashboardConfig
from evidently.renderers.html_widgets import WidgetSize
from evidently import metrics
from evidently.ui.workspace import Workspace, Project

import warnings
warnings.filterwarnings('ignore')

WORKING_DIR = Path.cwd()

Let's first download the dataset and do some pre-processing.

In [3]:
# Download data
dataset_url = "https://raw.githubusercontent.com/yumoL/mlops_eng_course_datasets/master/intro/bike-demanding/train_full.csv"
input_df = pd.read_csv(dataset_url)

# Preprocess
input_df["datetime"] = pd.to_datetime(input_df["datetime"])

# create hour, day and month variables from datetime column
input_df["hour"] = input_df["datetime"].dt.hour
input_df["day"] = input_df["datetime"].dt.day
input_df["month"] = input_df["datetime"].dt.month

# Set datetime as index
input_df.set_index("datetime", inplace=True)

# drop casual and registered columns, we only use the "count" column as the target
input_df.drop(["casual", "registered"], axis=1, inplace=True)


In [4]:
target = "count"
categorical_features=["season", "holiday", "workingday"]

def split_x_y(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Split features and targets from a given DataFrame
    """
    return df.drop([target], axis=1), df[[target]]

We use data for January of 2011 as the training data to train a LightGBM regression model. (The dataset originates from Kaggle, where only data from the initial 19 days of each month is accessible. The remaining days are reserved for Kaggle's evaluation purposes.)

In [5]:
train = input_df.loc['2011-01-01 00:00:00':'2011-01-19 23:00:00']
train_x, train_y = split_x_y(train)

model = LGBMRegressor(random_state=42)
model.fit(train_x, train_y, categorical_feature=categorical_features)

We then use the data for February of 2011 as the testing data.

In [6]:
test = input_df.loc['2011-02-01 00:00:00':'2011-02-19 23:00:00']
test_x, test_y = split_x_y(test)
predictions = model.predict(test_x)
r2 = r2_score(y_true=test_y, y_pred=predictions)
print(f"The r2 score for the testing data is {r2}")

The r2 score for the testing data is 0.7758083893824562
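
For intuition about the score printed above, the coefficient of determination can be computed by hand from its standard definition, R² = 1 − SS_res / SS_tot. The sketch below is dependency-free and uses made-up values rather than the bike data; `r2_by_hand` is a hypothetical helper name, not part of scikit-learn or Evidently.

```python
from typing import Sequence


def r2_by_hand(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


# Illustrative values only (not the bike data)
actual = [3.0, 5.0, 7.0, 9.0]
perfect = list(actual)             # predictions identical to the targets
mean_only = [6.0] * len(actual)    # always predicts the mean of `actual`

r2_perfect = r2_by_hand(actual, perfect)      # 1.0
r2_mean_only = r2_by_hand(actual, mean_only)  # 0.0
```

A model that always predicts the mean of the targets scores 0, a perfect model scores 1, and a model worse than the mean baseline gets a negative score.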


Once the model is trained, we can start to prepare a reference dataset. This dataset serves as a baseline for detecting target and data drift, as well as the training-serving skew. To calculate data drift, the reference dataset should have the features. To calculate model performance metrics (e.g., MAE and R2 score), the reference should also have the targets (i.e., ground truth) and the predicted values.

In this example, we use the testing data as the reference data. Note that the two are different concepts: testing data is one possible choice of reference data, but the reference data doesn't have to be the testing data.

In [7]:
reference = test.copy()
# Add predicted values as a new column
reference["prediction"] = model.predict(test_x)
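
The column requirements described above can be summarized in a small check. This is only an illustrative sketch: `supported_analyses` is a hypothetical helper (Evidently does not expose it), and the feature names are a subset of the columns mentioned in this notebook.

```python
from typing import Iterable, List


def supported_analyses(columns: Iterable[str], features: List[str],
                       target: str = "count",
                       prediction: str = "prediction") -> List[str]:
    """List which analyses a dataset can support, based on its columns."""
    cols = set(columns)
    supported = []
    if set(features) <= cols:
        # Features alone are enough to compare input distributions
        supported.append("data drift")
    if {target, prediction} <= cols:
        # Performance metrics need both ground truth and predictions
        supported.append("model performance")
    return supported


features = ["season", "holiday", "workingday", "windspeed"]

features_only = supported_analyses(features, features)
full_dataset = supported_analyses(features + ["count", "prediction"], features)
```

Here `features_only` contains just `"data drift"`, while `full_dataset` contains both entries, which is why the `prediction` column is added to the reference data.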

Now, suppose our model needs to make predictions on the data for March of 2011. Similar to the reference dataset, we also need to construct a production dataset that has model inputs (features), outputs (predictions), and targets (ground truth). In reality, this involves collecting inputs and ground truth from different sources. In our example, the inputs and ground truth are already available in the downloaded data.

In [8]:
current = input_df.loc['2011-03-01 00:00:00':'2011-03-19 23:00:00']

# curr_x is the inputs received by the model 
curr_x, _ = split_x_y(current)

production = current.copy()
production["prediction"] = model.predict(curr_x)

Evidently needs information about the dataset columns: which columns serve as inputs, which column contains the predicted values, and which column represents the ground truth. Evidently also needs to know which input columns are numerical and which are categorical so that it can select the correct methods for drift detection.
This information can be passed as an argument to Evidently using a `ColumnMapping` object.

More details about ColumnMapping can be found [here](https://docs.evidentlyai.com/user-guide/input-data/column-mapping).

In [9]:
categorical_features=["season", "holiday", "workingday"]
numerical_features = list(filter(lambda feature: feature not in categorical_features, train_x.columns))
column_mapping_conf = {
    "numerical_features": numerical_features,
    "categorical_features": categorical_features,
    "prediction": "prediction",
    "target": target
}
column_mapping = ColumnMapping(**column_mapping_conf)

## Report
Now, let's build an Evidently Report using the `RegressionPreset` and `DataDriftPreset` Metric Presets.  

After running the next code cell, you'll see an Evidently Report saved to an HTML file ("bike_report.html"). 

In [10]:
report = Report(metrics=[RegressionPreset(), DataDriftPreset()])

# reference_data is, literally, the reference that serves as a baseline. current_data is the data we want to monitor
report.run(reference_data=reference, current_data=production, column_mapping=column_mapping)
report.save_html("bike_report.html")
webbrowser.open("file:///" + str(WORKING_DIR/"bike_report.html"))

True

The Report is mostly self-explanatory. For more information, please refer to the Evidently docs of [RegressionPreset](https://docs.evidentlyai.com/presets/reg-performance) and [DataDriftPreset](https://docs.evidentlyai.com/presets/data-drift).

### Some clarifications:

In the "Regression Model Performance" part, you'll see a summary of model quality metrics as below. The values in parentheses represent the standard deviation of the error values. For instance, `32.89 (38.76)` indicates that the mean absolute error is 32.89, and the standard deviation of the absolute error values is 38.76.

<img src="./images/performance-metric-summary.png" width=700/>
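
To make the `mean (standard deviation)` notation concrete, here is a minimal sketch with made-up numbers. It assumes the population standard deviation (`statistics.pstdev`); whether Evidently uses the population or the sample form is not verified here.

```python
from statistics import mean, pstdev

# Illustrative values only (not the bike data)
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 17.0, 35.0, 40.0]

# Absolute error of each prediction: [2.0, 3.0, 5.0, 0.0]
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

mae = mean(abs_errors)               # mean absolute error: 2.5
abs_error_std = pstdev(abs_errors)   # spread of the absolute errors (assumed population std)

summary = f"{mae:.2f} ({abs_error_std:.2f})"   # "2.50 (1.80)"
```

A large value in parentheses relative to the mean indicates that the error size varies a lot between individual predictions.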

In the "Predicted vs Actual" part, you'll see two contour plots showing predicted versus actual values for the production and reference datasets. You may notice that these plots are scatter plots in the documentation. In a scatter plot, you can see all the individual data points. A contour plot aggregates the data points and presents them as contours. The idea is that if we have a large dataset and try to present it as a scatter plot, this would result in points overlapping and hiding patterns. With a contour plot, we can get the general shape of the distribution, and the plot is more lightweight. We can also visually see more "dense" areas where most of the original data points are (with darker color), and more "sparse" areas where there are only a few observations (with light line contour). 

<img src="./images/contour1.png" width=1000/>
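
The aggregation idea behind a contour plot can be sketched as counting points per grid cell instead of keeping every point. The snippet below is a simplified illustration with made-up values and an arbitrary bin width; it is not how Evidently builds its plots.

```python
from collections import Counter

# Illustrative (actual, predicted) pairs
points = [(12, 10), (14, 18), (18, 11), (55, 52), (58, 59), (95, 90)]
bin_width = 20  # arbitrary cell size chosen for this sketch

# Map each point to its 2-D grid cell and count the points per cell
density = Counter((a // bin_width, p // bin_width) for a, p in points)

densest_cell, densest_count = density.most_common(1)[0]
```

Six points collapse into three cells, and the cell `(0, 0)` holds the most points. A plot only has to draw the per-cell density, which stays small no matter how many rows the dataset has.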

If you want to see the scatter plots, you can add an option to the report as follows.

```python
my_report = Report(metrics=[RegressionPreset(),],
    options={"render": {"raw_data": True}}
)
```
This will result in scatter plots (showing all individual data points) as below. 

<img src="./images/scatterplot.png" width=1000/>


In the "Predicted vs Actual in Time" part as shown below,

<img src="./images/predicted-actual-in-time.png" width=1000 />

the index of a DataFrame gets binned based on the date. The values in the bin are averaged. For each bin, it shows the mean (the lines), and +/-standard deviation (the lightly colored zones). The y-axis is the mean value.
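
The binning described above can be sketched as follows. This is a dependency-free illustration with made-up values; the daily bin width and the population standard deviation are assumptions of the sketch, since Evidently chooses the bin size itself.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

# (timestamp, value) pairs; illustrative values, not the bike data
observations = [
    (datetime(2011, 3, 1, 0), 10.0),
    (datetime(2011, 3, 1, 12), 30.0),
    (datetime(2011, 3, 2, 0), 20.0),
    (datetime(2011, 3, 2, 12), 60.0),
]

# Group the values by calendar date (one bin per day)
bins = defaultdict(list)
for ts, value in observations:
    bins[ts.date()].append(value)

# For each bin: (mean, mean - std, mean + std).
# The mean is the plotted line; the other two values bound the shaded zone.
bands = {
    day: (mean(vals), mean(vals) - pstdev(vals), mean(vals) + pstdev(vals))
    for day, vals in sorted(bins.items())
}
```

For the first day the line sits at 20 with a band from 10 to 30; for the second day it sits at 40 with a band from 20 to 60.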

If no index is specified in the DataFrames, Evidently uses the default index starting at 0, as shown below (the number on the x-axis is the bin number):

<img src="./images/predicted-actual-in-time-no-index.png" width=1000 />


In the data drift part, the `drift-score` is the p-value of the statistical test.

<img src="./images/data-drift.png" width=1000 />

By default, Evidently selects a statistical test based on the column type and data size. More about the data drift algorithms used by Evidently can be found [here](https://docs.evidentlyai.com/reference/data-drift-algorithm). In this example, the K-S test, Z-test, and Chi-square test are used; the default p-value threshold for these tests is 0.05. Evidently also lets you customize which statistical test is applied to which column, along with its threshold. For example:

```python
# Use Wasserstein distance to calculate drift for the "windspeed" column and set the threshold to 0.02
from evidently.calculations.stattests import wasserstein_stat_test
report = Report(metrics=[RegressionPreset(), DataDriftPreset(per_column_stattest={"windspeed": wasserstein_stat_test}, per_column_stattest_threshold={"windspeed": 0.02})])
```
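
To illustrate how a p-value turns into a drift decision, here is a dependency-free sketch of a two-sample K-S test using the asymptotic Kolmogorov approximation. It is not Evidently's code: `ks_two_sample` is a hypothetical helper, the data is synthetic, and the approximation is only reasonable for fairly large samples. In practice, a tested implementation such as `scipy.stats.ks_2samp` is the better choice.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def ks_two_sample(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (D statistic, approximate p-value) for a two-sample K-S test."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    # D is the largest vertical gap between the two empirical CDFs
    d = max(
        abs(bisect_right(a_sorted, x) / n - bisect_right(b_sorted, x) / m)
        for x in a_sorted + b_sorted
    )
    # Asymptotic Kolmogorov distribution; below lam ~ 0.3 the p-value is ~1
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))


# Synthetic data: one unchanged column and one clearly shifted column
reference_values = [float(i) for i in range(100)]
shifted_values = [v + 50.0 for v in reference_values]

P_VALUE_THRESHOLD = 0.05
_, p_same = ks_two_sample(reference_values, reference_values)
d_shifted, p_shifted = ks_two_sample(reference_values, shifted_values)

drift_same = p_same < P_VALUE_THRESHOLD        # False: distributions are identical
drift_shifted = p_shifted < P_VALUE_THRESHOLD  # True: p-value is far below 0.05
```

With a p-value based test, a lower score means stronger evidence of drift, so drift is flagged when the score falls below the threshold.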

## TestSuite
An Evidently TestSuite works similarly to a Report. In this example, we specify an individual test instead of using a pre-built Test Preset. Running the next code cell also saves the TestSuite to an HTML file.

In [11]:
# The TestSuite has only one test. The test will fail if the r2 score of the predictions on the production data is not greater than 0.9
test_suite = TestSuite(tests=[TestValueR2Score(gt=0.9)])
test_suite.run(reference_data=reference, current_data=production, column_mapping=column_mapping)
test_suite.save_html("bike_test.html")
webbrowser.open("file:///" + str(WORKING_DIR/"bike_test.html"))

True
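
The `gt=0.9` condition is a strict "greater than" check, so a value exactly equal to the threshold fails. The sketch below mimics that logic with a hypothetical `GreaterThan` class and made-up scores; it is not Evidently's implementation.

```python
from dataclasses import dataclass


@dataclass
class GreaterThan:
    """Hypothetical stand-in for a `gt=` test condition."""
    threshold: float

    def check(self, value: float) -> str:
        # Strict comparison: the value must exceed the threshold to pass
        return "SUCCESS" if value > self.threshold else "FAIL"


condition = GreaterThan(threshold=0.9)

# Made-up R2 scores: above, exactly at, and below the threshold
results = {value: condition.check(value) for value in (0.95, 0.9, 0.78)}
```

Only 0.95 passes; both 0.9 and 0.78 fail.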

## Monitoring UI
Evidently provides a user interface (Monitoring UI) that helps you organize your Reports and TestSuites and visualize the metrics and test results in a dashboard. Let's first look at four additional concepts used by Evidently:
- Snapshot: As per Evidently docs, "a snapshot is a JSON version of the Evidently Report or Test Suite. After you generate the Report or Test Suite and save it as a snapshot, you can load it back and restore it as in the HTML or other formats". Multiple snapshots need to be captured (e.g., periodically) so that we can see how some metrics/test results change over a period of time. 
- Workspace: A Workspace is a directory that stores snapshots. 
- Project: A Project is a sub-directory of a Workspace. It allows for organizing snapshots, for example, for individual models.
- Monitoring Dashboard: Each Project can have a Dashboard that visualizes how some metrics/test results change over a period of time.
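
As a purely conceptual sketch of how these pieces relate, the snippet below writes a made-up "snapshot" as JSON into a project directory inside a workspace directory and loads it back. The directory layout, file name, and JSON fields are all hypothetical; Evidently's real snapshot format and storage layout differ, and the Workspace API handles them for you.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <workspace>/<project>/snapshots/<name>.json
    snapshot_dir = Path(tmp) / "my_workspace" / "bike_project" / "snapshots"
    snapshot_dir.mkdir(parents=True)

    # A made-up, minimal snapshot: results plus the metadata that lets a
    # dashboard place it on a timeline and filter it by tag
    snapshot = {
        "timestamp": "2011-03-20T00:00:00",
        "tags": ["bike-model-v1"],
        "metrics": {"r2_score": 0.5},  # illustrative value
    }
    path = snapshot_dir / "2011-03.json"
    path.write_text(json.dumps(snapshot))

    # Loading the file restores the same content
    restored = json.loads(path.read_text())
    round_trip_ok = restored == snapshot
```

A dashboard panel is then a query over many such snapshots in one project, ordered by timestamp.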

Let's first create an Evidently Workspace and initialize a Project in that Workspace.

In [None]:
def init_evidently_project(workspace: Workspace, project_name: str) -> Project:
    """
    Create a Project in a Workspace
    Args:
        workspace: An Evidently Workspace
        project_name: Name of the Project
    """
    # Delete any projects whose name is the given project_name to avoid duplicated projects
    for project in workspace.search_project(project_name=project_name):
        workspace.delete_project(project_id=project.id)

    # Create a project at Evidently
    project = workspace.create_project(name=project_name)

    # Create a dashboard
    project.dashboard = DashboardConfig(name=project_name, panels=[])

    project.dashboard.add_panel(
        DashboardPanelTestSuiteCounter(
            title="R2 Score",
            agg=CounterAgg.LAST
        ),
    )
    project.dashboard.add_panel(
         DashboardPanelPlot(
                title="R2 Score",
                filter=ReportFilter(metadata_values={}, tag_values=[]),
                values=[
                    PanelValue(
                        metric_id="RegressionQualityMetric",
                        field_path=metrics.RegressionQualityMetric.fields.current.r2_score,
                        legend="R2",
                    ),
                ],
                plot_type=PlotType.LINE,
                size=WidgetSize.FULL,
            )
    )

    project.save()
    return project

In [None]:
# Create a Workspace
workspace = Workspace.create(WORKING_DIR/"my_workspace")

# Init a Project
project = init_evidently_project(workspace=workspace, project_name="bike_project")


In the next code cell, we'll generate Reports and TestSuites for March, April, May and June of 2011 and upload the Reports and TestSuites as snapshots to the Evidently Project we just created. We'll also add a tag and timestamp to the Reports and TestSuites:
```python
report = Report(metrics=[...], tags=..., timestamp=...)
```
Tags can be used to provide some additional information for the Reports/TestSuites, such as the model version being monitored. If no timestamp is specified, Evidently will use the computation time of the Report/TestSuite as the timestamp. 

In [None]:
time_periods = [('2011-03-01 00:00:00', '2011-03-19 23:00:00'), ('2011-04-01 00:00:00', '2011-04-19 23:00:00'),  
                ('2011-05-01 00:00:00', '2011-05-19 23:00:00'), ('2011-06-01 00:00:00', '2011-06-19 23:00:00')]

for i, period in enumerate(time_periods):
    # Copy the slice so that adding the "prediction" column does not modify a view of input_df
    production = input_df.loc[period[0]:period[1]].copy()
    prod_x, _ = split_x_y(production)
    production["prediction"] = model.predict(prod_x)

    # Suppose the Report/TestSuite is generated one hour after the period ends
    timestamp = datetime.strptime(
        period[1], "%Y-%m-%d %H:%M:%S") + timedelta(hours=1)

    # Suppose we retrain a new model version for every period
    tags = [f"bike-model-v{i+1}"]

    report = Report(metrics=[RegressionPreset(),
                    DataDriftPreset()], tags=tags, timestamp=timestamp)
    report.run(reference_data=reference, current_data=production,
               column_mapping=column_mapping)
    # Upload Report snapshot
    workspace.add_report(project_id=project.id, report=report)

    test_suite = TestSuite(tests=[TestValueR2Score(
        gt=0.6)], tags=tags, timestamp=timestamp)
    test_suite.run(reference_data=reference,
                   current_data=production, column_mapping=column_mapping)
    # Upload TestSuite snapshot
    workspace.add_test_suite(project_id=project.id, test_suite=test_suite)
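
The timestamps and tags attached to the snapshots can be checked without running Evidently. The sketch below repeats only the date arithmetic and tag formatting from the loop above; `snapshot_meta` is a name introduced here for illustration.

```python
from datetime import datetime, timedelta

# End of each monitored period, as in the loop above
period_ends = ["2011-03-19 23:00:00", "2011-04-19 23:00:00",
               "2011-05-19 23:00:00", "2011-06-19 23:00:00"]

snapshot_meta = []
for i, end in enumerate(period_ends):
    # One hour after the period ends
    timestamp = datetime.strptime(end, "%Y-%m-%d %H:%M:%S") + timedelta(hours=1)
    # One model version per period
    snapshot_meta.append((timestamp.isoformat(), f"bike-model-v{i + 1}"))
```

The first snapshot is stamped `2011-03-20T00:00:00` with the tag `bike-model-v1`, and the last one `2011-06-20T00:00:00` with `bike-model-v4`. These timestamps determine where each snapshot appears on the dashboard's time axis.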

Let's run the Evidently Monitoring UI service using the following command:
```bash
# switch to mlops_eng in your terminal
# under the same directory as this notebook
evidently ui --workspace ./my_workspace
```

Go to [http://localhost:8000](http://localhost:8000) and you'll see one Evidently Project ("bike_project") listed in the Workspace:

<img src="./images/bike_project.png" width=1000/>

Click the project and you'll see a Monitoring Dashboard with two panels. The upper panel shows that the latest test of R2 score failed. The lower panel shows the changes of R2 score over the past four months. The data shown in the panels is captured from the four TestSuite snapshots we just generated.

<img src="./images/dashboard.png" width=1000/>

If you go to the "REPORTS" and "TEST SUITES" tabs, you'll see four Reports and four TestSuites, respectively, one for each month and each model version.

<img src="./images/reports.png" width=1000/>
<img src="./images/test-suites.png" width=1000/>

If you open one of the Reports/Test Suites, you'll see a page similar to the previous "bike_report.html"/"bike_test.html" file.