# Deepchecks for Data Drift Detection

In this notebook, we check our data for drift using the tabular sub-package of the Deepchecks library.

Data drift is a change in the distribution of the data over time. It is one of the main reasons a machine learning model's performance degrades as time passes.

We first import the required libraries. If you get errors related to the deepchecks library, install it with pip in your virtual environment by running `pip install deepchecks`, as in the cell below.

In [1]:
%pip install deepchecks

Note: you may need to restart the kernel to use updated packages.





In [2]:
import os
import sys

import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import FeatureDrift, MultivariateDrift

# Datasets

In this section, we define the file paths of the train and test sets (default playlists) and load both sets so that we can check for drift between them.

In [3]:
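# Put the directory two levels above the notebook on the import path
# so that the project's conf module can be imported.
current_dir = os.getcwd()
two_levels_up = os.path.dirname(os.path.dirname(current_dir))
sys.path.insert(0, two_levels_up)
import conf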

In [4]:
three_levels_up = os.path.dirname(os.path.dirname(os.path.dirname(current_dir)))
sys.path.insert(0, three_levels_up)

In [5]:
# Build the CSV file paths under the processed-data directory defined in conf.
processed_train_data_directory = os.path.join(three_levels_up, conf.PRO_DATA_DIR, 'trainSet.csv')
processed_test_data_directory = os.path.join(three_levels_up, conf.PRO_DATA_DIR, 'testSet.csv')
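
The path handling in the cells above can also be written with `pathlib`. The following is a minimal sketch, not the project's actual layout: the notebook location `/home/user/project/src/notebooks/drift` and the value `data/processed/` standing in for `conf.PRO_DATA_DIR` are made up for illustration, and `ancestor` is a hypothetical helper.

```python
from pathlib import Path


def ancestor(path, levels):
    """Return the directory `levels` steps above `path`."""
    return Path(path).parents[levels - 1]


# Hypothetical values, standing in for os.getcwd() and conf.PRO_DATA_DIR.
notebook_dir = Path("/home/user/project/src/notebooks/drift")
pro_data_dir = "data/processed/"

# Equivalent of applying os.path.dirname three times to the notebook directory.
root = ancestor(notebook_dir, 3)

# The / operator joins path components and takes care of the separators.
train_path = root / pro_data_dir / "trainSet.csv"
test_path = root / pro_data_dir / "testSet.csv"

print(train_path)
print(test_path)
```

Because `pandas.read_csv` accepts `Path` objects, paths built this way could be passed to it directly.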

In [6]:
df_processed_train = pd.read_csv(processed_train_data_directory)
df_processed_test = pd.read_csv(processed_test_data_directory)

In [7]:
ds_processed_train = Dataset(df_processed_train, cat_features=['Name','Artist'])
ds_processed_test = Dataset(df_processed_test, cat_features=['Name','Artist'])

# With Categorical Features

In this section, we check for data drift on the full dataset, including the categorical features (the song's name and artist).

## Univariate Drift

The simplest way to detect drift is to compare one variable at a time. Statistics such as the Kolmogorov-Smirnov statistic or the Jensen-Shannon divergence measure how much the distribution of a variable differs between an older and a newer sample. In deepchecks, the best results are reported with:

- For continuous numeric distributions: the Kolmogorov-Smirnov statistic or the Wasserstein metric (Earth Mover's Distance)
- For discrete or categorical distributions: Cramér's V or the Population Stability Index (PSI)

These methods are simple and their results are easy to explain, but because they check each feature individually, they can miss drift in the relationships between features.
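
To make two of these measures concrete, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic and the PSI. It is an illustration only and not the deepchecks implementation: the function names, the toy values, and the `eps` floor used to avoid taking the logarithm of zero are all assumptions.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that occurs in either sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def psi(expected, actual, eps=1e-4):
    """Population Stability Index between two categorical samples."""
    expected_counts, actual_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(expected) | set(actual):
        # Floor the shares at eps (an assumption) so the log is always defined.
        e = max(expected_counts[category] / len(expected), eps)
        a = max(actual_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


# Made-up numeric values: the second sample is shifted to the right.
print(ks_statistic([0.1, 0.2, 0.3, 0.4], [0.3, 0.4, 0.5, 0.6]))

# Made-up categories: "b" is much more frequent in the second sample.
print(psi(["a"] * 9 + ["b"], ["a"] * 5 + ["b"] * 5))
```

Both functions return 0 for identical samples and grow as the samples diverge; the KS statistic is bounded by 1, while the PSI has no upper bound.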

In [8]:
check = FeatureDrift()
result = check.run(train_dataset=ds_processed_train, test_dataset=ds_processed_test)
result.show()


## Multivariate Drift

Multivariate drift is a change that affects several features at once and may alter the relationships between them. Univariate methods, which examine one feature at a time, can miss such changes. The multivariate drift check looks for drift across all features of the two input datasets by training a domain classifier, that is, a model that tries to tell train samples apart from test samples. The better this classifier succeeds, the more the two datasets differ.
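
The idea of a domain classifier can be sketched in a few lines of pure Python. The example below is an illustration under stated assumptions, not what deepchecks runs internally: it uses a leave-one-out k-nearest-neighbours classifier and plain accuracy as a stand-in for the library's own model and score, and the synthetic data is made up so that both features keep the same marginal distribution while their relationship flips.

```python
import random


def knn_domain_accuracy(train_rows, test_rows, k=5):
    """Leave-one-out accuracy of a k-NN model that guesses each row's dataset."""
    rows = [(r, 0) for r in train_rows] + [(r, 1) for r in test_rows]
    correct = 0
    for i, (row, label) in enumerate(rows):
        # Squared Euclidean distance to every other row, nearest first.
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, other)), other_label)
            for j, (other, other_label) in enumerate(rows)
            if j != i
        )
        votes_for_test = sum(lbl for _, lbl in neighbours[:k])
        predicted = 1 if votes_for_test > k / 2 else 0
        correct += predicted == label
    return correct / len(rows)


rng = random.Random(0)


def sample(n, increasing):
    """x is uniform on [0, 1]; y follows x (or 1 - x) plus a little noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = (x if increasing else 1 - x) + rng.gauss(0, 0.05)
        rows.append((x, y))
    return rows


reference = sample(200, increasing=True)
same = sample(200, increasing=True)      # same joint distribution
flipped = sample(200, increasing=False)  # same marginals, flipped relationship

acc_same = knn_domain_accuracy(reference, same)
acc_flipped = knn_domain_accuracy(reference, flipped)
print(acc_same, acc_flipped)
```

An accuracy close to 0.5 means the classifier cannot tell the datasets apart, while an accuracy close to 1 signals drift. In the flipped case a per-feature comparison would see nearly identical distributions for both `x` and `y`, which is the kind of change the multivariate check is meant to catch.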

In [9]:
check = MultivariateDrift()
check.run(train_dataset=ds_processed_train, test_dataset=ds_processed_test)


# Numerical Features Only

In this section, we will exclusively assess data drift in the numerical features present in our dataset. We are excluding categorical features since the current model utilizes only numerical features to generate clusters.

## Loading Dataset

We reload the data, this time excluding the "Artist" and "Name" columns.

In [10]:
df_processed_train = pd.read_csv(processed_train_data_directory)
df_processed_train = df_processed_train.drop(['Artist', 'Name'], axis=1)
df_processed_train

| Unnamed: 0 | Energy | Liveness | Loudness |
|---|---|---|---|
| 0 | 0.418 | 0.159 | 0.671575 |
| 1 | 0.910 | 0.252 | 0.897120 |
| 2 | 0.423 | 0.143 | 0.691744 |
| 3 | 0.859 | 0.127 | 0.794126 |
| 4 | 0.899 | 0.241 | 0.756586 |
| ... | ... | ... | ... |
| 605 | 0.835 | 0.155 | 0.848115 |
| 606 | 0.801 | 0.108 | 0.791058 |
| 607 | 0.748 | 0.129 | 0.845661 |
| 608 | 0.967 | 0.203 | 0.889835 |


In [11]:
df_processed_test = pd.read_csv(processed_test_data_directory)
df_processed_test = df_processed_test.drop(['Artist', 'Name'], axis=1)

In [12]:
ds_processed_train = Dataset(df_processed_train)
ds_processed_test = Dataset(df_processed_test)



## Univariate Drift

As in the section with categorical features, we first run the univariate check, which compares the train and test distributions of each feature separately. This time, all features are numerical.
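
For numerical features such as these, the Wasserstein metric mentioned earlier has a simple one-dimensional form: the area between the two empirical CDFs. The sketch below is a pure-Python illustration with made-up values, not the code deepchecks uses; the function name `wasserstein_1d` is an assumption.

```python
from bisect import bisect_right


def wasserstein_1d(sample_a, sample_b):
    """1-D Wasserstein (Earth Mover's) distance: area between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a + b))
    area = 0.0
    # Between two consecutive observed values both CDFs are constant,
    # so the area is a sum of rectangles.
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        area += abs(cdf_a - cdf_b) * (right - left)
    return area


# Made-up values on the same 0-1 scale as the features in this dataset.
old_values = [0.40, 0.45, 0.50, 0.55, 0.60]
new_values = [0.50, 0.55, 0.60, 0.65, 0.70]
print(wasserstein_1d(old_values, new_values))
```

Unlike the Kolmogorov-Smirnov statistic, this distance is expressed in the units of the feature itself, so shifting every value by a constant yields a distance equal to that constant.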

In [13]:
check = FeatureDrift()
result = check.run(train_dataset=ds_processed_train, test_dataset=ds_processed_test)
result.show()


## Multivariate Drift


We then repeat the multivariate check, which uses a domain classifier to detect drift across all numerical features at once, including changes in the relationships between them that the univariate check can miss.
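
A domain classifier is commonly summarized by its ROC AUC, which can then be rescaled into a drift score between 0 and 1. The sketch below computes the AUC in pure Python and applies the rescaling `max(2 * AUC - 1, 0)`. Treat that formula as an assumption made for illustration and consult the deepchecks documentation for the exact definition the library uses; the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    positives = [s for lbl, s in zip(labels, scores) if lbl == 1]
    negatives = [s for lbl, s in zip(labels, scores) if lbl == 0]
    # Count pairwise wins; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def drift_score(auc):
    """Rescale AUC so that 0.5 (no separation) maps to 0 and 1.0 maps to 1."""
    return max(2 * auc - 1, 0.0)


# Label 0 = row from the train set, label 1 = row from the test set.
# The scores are made-up classifier outputs for "this row is from the test set".
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

auc = roc_auc(labels, scores)
print(auc, drift_score(auc))
```

With this convention, a classifier that does no better than random guessing gives a score of 0, and one that separates the two datasets perfectly gives a score of 1.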

In [14]:
check = MultivariateDrift()
check.run(train_dataset=ds_processed_train, test_dataset=ds_processed_test)
