# Lesson 3.1: Train-Test Skew Detection with Evidently

***Key Concepts:*** *Data-Centric ML, Data Skew, Train-Test Skew, Training-Serving Skew, Evidently*

In academia and research, the focus of ML is usually to build the best possible models for a given dataset. However, in practical applications, overall performance is usually determined by the quality of the data, and the model is only secondary. That is why many ML practitioners advocate for **Data-Centric** ML approaches, where we focus on improving the data, while keeping the ML model (mostly) fixed. See also [this great article](https://neptune.ai/blog/data-centric-vs-model-centric-machine-learning) by neptune.ai for more details on model- vs. data-centric approaches.

One of the most important parts of data-centric ML is to monitor data quality. Throughout this chapter, we will learn about many potential data issues, such as train-test skew, training-serving skew, data drift, and more. Being aware of these issues, and having respective safety mechanisms in place, is essential when serving ML models to real users.

In this lesson, we will start by automatically checking for **Data Skew** within our ML pipelines. Since the performance of ML models on unseen data can be unpredictable, we should always try to design our training data to match the real environment where our model will later be deployed. The difference between those two data distributions is called **Training-Serving Skew**. Similarly, differences in distribution between our training and testing datasets are called **Train-Test Skew**.
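To make "difference in distribution" concrete, here is a minimal sketch that measures skew for a single feature with a hand-written two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). The helper name `ks_statistic` and the synthetic data are assumptions made for illustration; this is not how Evidently is necessarily implemented.

```python
import numpy as np


def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a = np.sort(a)
    b = np.sort(b)
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


rng = np.random.default_rng(seed=0)
train = rng.normal(loc=0.0, scale=1.0, size=2000)
test_same = rng.normal(loc=0.0, scale=1.0, size=2000)  # same distribution
test_shifted = rng.normal(loc=1.0, scale=1.0, size=2000)  # mean shifted by 1

ks_same = ks_statistic(train, test_same)
ks_shifted = ks_statistic(train, test_shifted)
print(f"no skew:   KS = {ks_same:.3f}")
print(f"with skew: KS = {ks_shifted:.3f}")
```

A value close to 0 means the two samples look alike, while a value close to 1 means they barely overlap. In this sketch the shifted test set should score clearly higher than the unshifted one.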

In the following, we will use the open-source data monitoring tool [Evidently](https://evidentlyai.com/) to measure distribution differences between our datasets. See our short [blog post](https://blog.zenml.io/zenml-loves-evidently/) that explains the Evidently integration in a bit more detail.

If you haven't done so, install Evidently by running the following cell, then restart your notebook kernel:

In [None]:
!zenml integration install evidently -f

## Detect Train-Test Skew

To start out, we will use Evidently to check for skew between our training and validation datasets. To do so, we will define a new pipeline with an Evidently step, into which we will then pass our training and validation datasets.

At its core, Evidently’s distribution difference calculation functions take in a reference dataset and compare it with a separate comparison dataset. These are both passed in as [pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), though CSV inputs are also possible. ZenML wraps this functionality in several standardized steps and also provides an easy way to use Evidently’s visualization tools, called ‘Dashboards’.
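The reference-versus-comparison idea can be sketched with plain pandas. The example below is only an illustration under stated assumptions: the helper `mean_shift_report`, its `threshold` parameter, and the synthetic data are all hypothetical, and the standardized mean difference used here is a deliberately simple stand-in, not the statistical test Evidently applies.

```python
import numpy as np
import pandas as pd


def mean_shift_report(
    reference: pd.DataFrame, comparison: pd.DataFrame, threshold: float = 0.5
) -> pd.DataFrame:
    """Flag columns whose mean moved by more than `threshold` reference std devs."""
    # Constant reference columns have zero std; mask them to avoid division by zero.
    ref_std = reference.std(ddof=0).replace(0.0, np.nan)
    shift = (comparison.mean() - reference.mean()).abs() / ref_std
    # Masked (constant) columns get a shift of 0 in this simplified sketch.
    report = pd.DataFrame({"shift": shift.fillna(0.0)})
    report["flagged"] = report["shift"] > threshold
    return report


rng = np.random.default_rng(seed=0)
columns = ["0", "1", "2"]
reference = pd.DataFrame(rng.normal(size=(1000, 3)), columns=columns)
comparison = pd.DataFrame(rng.normal(size=(1000, 3)), columns=columns)
comparison["2"] += 2.0  # inject skew into the last feature only

report = mean_shift_report(reference, comparison)
print(report)
```

Each row of the report corresponds to one feature, which mirrors the per-feature view that a drift report gives: only the column with the injected shift should be flagged.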

Since our datasets are in [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) format, we just need to add another simple step that converts them from numpy to pandas. The overall pipeline will then look like this:

![Pipeline2](_assets/3-1/second_pipeline.png)

Let's define this pipeline in code and import the other steps (which we have already built during previous lessons):

In [None]:
from steps.importer import importer
from steps.evaluator import evaluator
from steps.mlflow_trainer import svc_trainer_mlflow

In [None]:
from zenml.pipelines import pipeline


@pipeline(enable_cache=False)
def digits_pipeline_with_train_test_checks(
    importer,
    trainer,
    evaluator,
    get_reference_data,
    skew_detector,
):
    """Digits pipeline with train-test check."""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)
    reference, comparison = get_reference_data(X_train, X_test)
    skew_detector(reference, comparison)

Next, let's define the two new steps. For data distribution comparison, we can simply use the predefined step of ZenML's Evidently integration:

In [None]:
from zenml.integrations.evidently.steps import (
    EvidentlyProfileConfig,
    EvidentlyProfileStep,
)

# configure the Evidently step
evidently_profile_config = EvidentlyProfileConfig(
    column_mapping=None, profile_sections=["datadrift"]
)

The step for converting numpy to pandas is also fairly easy to implement:

In [None]:
import numpy as np
import pandas as pd
from zenml.steps import step, Output


@step
def get_reference_data(
    X_train: np.ndarray,
    X_test: np.ndarray,
) -> Output(reference=pd.DataFrame, comparison=pd.DataFrame):
    """Convert numpy data to pandas for distribution difference calculation."""
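    # Evidently expects string column names, so name each feature by its index.
    columns = [str(i) for i in range(X_train.shape[1])]
    # Training data is the reference; test data is what we compare against it.
    reference = pd.DataFrame(X_train, columns=columns)
    comparison = pd.DataFrame(X_test, columns=columns)
    return reference, comparison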
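
The intended conversion can be sanity-checked outside of a pipeline. The following sketch uses a plain function (the name `to_dataframes` is hypothetical and not part of ZenML) with small dummy arrays that have 64 feature columns, and verifies that the training data ends up as the reference and the test data as the comparison:

```python
import numpy as np
import pandas as pd


def to_dataframes(X_train: np.ndarray, X_test: np.ndarray):
    """Plain-Python version of the numpy-to-pandas conversion."""
    columns = [str(i) for i in range(X_train.shape[1])]
    reference = pd.DataFrame(X_train, columns=columns)
    comparison = pd.DataFrame(X_test, columns=columns)
    return reference, comparison


# Dummy data: distinguishable values so a swap would be noticed.
X_train = np.zeros((8, 64))
X_test = np.ones((2, 64))

reference, comparison = to_dataframes(X_train, X_test)
print(reference.shape, comparison.shape)  # (8, 64) (2, 64)
```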

And that's it. Let's initialize and run our pipeline to try it out:

In [None]:
evidently_pipeline = digits_pipeline_with_train_test_checks(
    importer=importer(),
    trainer=svc_trainer_mlflow(),
    evaluator=evaluator(),
    get_reference_data=get_reference_data(),
    skew_detector=EvidentlyProfileStep(config=evidently_profile_config),
)
evidently_pipeline.run()

Now we can use ZenML's `EvidentlyVisualizer` to see the distribution comparison right in our notebook, where we can visually compare the distributions for each feature.

In [None]:
from zenml.integrations.evidently.visualizers import EvidentlyVisualizer
from zenml.repository import Repository

repo = Repository()
p = repo.get_pipeline("digits_pipeline_with_train_test_checks")
last_run = p.runs[-1]

skew_detection_step = last_run.get_step(name="skew_detector")
evidently_outputs = skew_detection_step

EvidentlyVisualizer().visualize(evidently_outputs)

As we see, there is no skew between our training and validation sets. That's great!

In the next lessons, we will add mechanisms for training-serving skew and data drift detection to our inference pipelines and set up automated alerts for whenever any data issues are detected. Those lessons are still a work in progress, so stay tuned!