In this notebook, we check our data for drift using the tabular sub-package of the Deepchecks library.

Drift is a change in the distribution of data over time. It is one of the main reasons a machine learning model's performance deteriorates as time passes.

We first import the required libraries. If an import error related to deepchecks occurs, install the library with pip in your virtual environment: `pip install deepchecks`

In [17]:
import os
import sys

import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import FeatureDrift, MultivariateDrift

# With Categorical Features

In this section, we check data drift on the full dataset, including the categorical features.

In [18]:
# Add the directory two levels above the notebook to the import path so that conf can be imported
current_dir = os.getcwd()
two_levels_up = os.path.dirname(os.path.dirname(current_dir))
sys.path.insert(0, two_levels_up)
import conf

In [19]:
# Directory three levels above the notebook; the data paths below are built from it
three_levels_up = os.path.dirname(os.path.dirname(os.path.dirname(current_dir)))
sys.path.insert(0, three_levels_up)
# Processed Data
processed_train_data_directory = three_levels_up + '/' + conf.PRO_DATA_DIR + 'trainSet.csv'
processed_test_data_directory = three_levels_up + '/' + conf.PRO_DATA_DIR + 'testSet.csv'
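
The same path arithmetic can be sketched with `pathlib`, which avoids manual `'/'` concatenation. This is only an illustration: the `levels_up` helper, the example notebook location, and the `data/processed/` value are hypothetical stand-ins, since the real value of `conf.PRO_DATA_DIR` is not shown here.

```python
from pathlib import Path


def levels_up(start: Path, n: int) -> Path:
    """Return the directory n levels above start (n=0 returns start itself)."""
    return start.parents[n - 1] if n > 0 else start


# Hypothetical stand-ins: the notebook itself uses os.getcwd() and conf.PRO_DATA_DIR.
notebook_dir = Path("/home/user/project/src/notebooks/drift")
pro_data_dir = "data/processed/"

root = levels_up(notebook_dir, 3)
train_csv = root / pro_data_dir / "trainSet.csv"
test_csv = root / pro_data_dir / "testSet.csv"
print(train_csv.as_posix())
print(test_csv.as_posix())
```

`Path` objects can be passed directly to `pd.read_csv`, so the rest of the notebook would not need to change.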

In [20]:
df_processed_train = pd.read_csv(processed_train_data_directory)
df_processed_train

Unnamed: 0,Name,Artist,Energy,Liveness,Loudness
0,Serotonin,Call Me Karizma,0.418,0.159,0.671575
1,Joy,Thornhill,0.910,0.252,0.897120
2,Like,Alissic,0.423,0.143,0.691744
3,Can You Afford to Be An Individual?,Nothing But Thieves,0.859,0.127,0.794126
4,This Feels Like the End,Nothing But Thieves,0.899,0.241,0.756586
...,...,...,...,...,...
605,Tomorrow Is Closed,Nothing But Thieves,0.835,0.155,0.848115
606,Do You Love Me Yet?,Nothing But Thieves,0.801,0.108,0.791058
607,Welcome to the DCC,Nothing But Thieves,0.748,0.129,0.845661
608,CODE MISTAKE,CORPSE,0.967,0.203,0.889835


In [21]:
df_processed_test = pd.read_csv(processed_test_data_directory)
df_processed_test

Unnamed: 0,Name,Artist,Energy,Liveness,Loudness
0,Blinding Lights,The Weeknd,0.730,0.0897,0.756423
1,Shape of You,Ed Sheeran,0.652,0.0931,0.927975
2,Dance Monkey,Tones And I,0.588,0.1490,0.727363
3,Someone You Loved,Lewis Capaldi,0.405,0.1050,0.772325
4,Sunflower - Spider-Man: Into the Spider-Verse,Post Malone,0.522,0.0685,0.854078
...,...,...,...,...,...
495,Hey There Delilah,Plain White T's,0.291,0.1140,0.467199
496,idontwannabeyouanymore,Billie Eilish,0.412,0.1160,0.598840
497,Zombie,The Cranberries,0.635,0.3660,0.567910
498,No Lie,Sean Paul,0.882,0.2060,0.947992
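
The `Unnamed: 0` column in both tables appears to be a row index that was written to the CSV files. Unless it is excluded, it is handed to the drift checks like any other column. One possible way to keep it out, sketched here on a tiny inline CSV rather than the real files, is to read it back as the index with `index_col=0`.

```python
import io

import pandas as pd

# Tiny inline stand-in for testSet.csv; the two rows are copied from the table above.
csv_text = """,Name,Artist,Energy,Liveness,Loudness
0,Blinding Lights,The Weeknd,0.730,0.0897,0.756423
1,Shape of You,Ed Sheeran,0.652,0.0931,0.927975
"""

# Default read: the unnamed first column becomes a regular column.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores it as the row index, so it is no longer a data column.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Whether the index should be excluded is a judgment call for this project. The outputs shown in this notebook were produced with the column included.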


In [22]:
# Declare the text columns as categorical features
ds_processed_train = Dataset(df_processed_train, cat_features=['Name', 'Artist'])
ds_processed_test = Dataset(df_processed_test, cat_features=['Name', 'Artist'])

## Univariate Drift

The simplest way to detect drift is to assess one variable at a time. Measures such as the Kolmogorov-Smirnov test or the Jensen-Shannon divergence quantify the difference between newer and older samples of the variable. In deepchecks, the best results are obtained with:

- For continuous numeric distributions: Kolmogorov-Smirnov statistic or Wasserstein metric (Earth Mover's Distance)
- For discrete or categorical distributions: Cramér's V or Population Stability Index (PSI)
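
To make two of these measures concrete, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic and the PSI. It is not the deepchecks implementation: the function names, the `eps` floor, and the toy category lists are assumptions made for illustration, and library scores may be computed or scaled differently.

```python
import math
from collections import Counter


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def psi(expected, actual, eps=1e-4):
    """Population Stability Index over the categories seen in either sample."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for cat in set(e_counts) | set(a_counts):
        # Floor empty categories at eps so the logarithm stays defined.
        e = max(e_counts[cat] / len(expected), eps)
        a = max(a_counts[cat] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


# Energy values from the first five displayed train and test rows (far too few to conclude anything).
train_energy = [0.418, 0.910, 0.423, 0.859, 0.899]
test_energy = [0.730, 0.652, 0.588, 0.405, 0.522]
print(ks_statistic(train_energy, test_energy))

# Toy categorical samples (hypothetical, not taken from the dataset).
print(psi(["rock"] * 5 + ["pop"] * 5, ["rock"] * 9 + ["pop"] * 1))
```

Both functions return 0 for identical samples and grow as the samples diverge. The KS statistic is bounded by 1, while the PSI is unbounded.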

While these methods are simple and give explainable results, they check each feature individually and can miss drift in the relationships between features.
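
The limitation is easy to reproduce on toy data. In the sketch below (hypothetical values, not taken from the dataset), each feature has exactly the same values in both samples, so any per-feature measure reports no drift, yet the correlation between the two features flips sign.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Toy "train" sample: feature_b rises with feature_a.
train_a = [0.1, 0.2, 0.3, 0.4, 0.5]
train_b = [0.1, 0.2, 0.3, 0.4, 0.5]

# Toy "test" sample: same values per feature, but the pairing is reversed.
test_a = [0.1, 0.2, 0.3, 0.4, 0.5]
test_b = [0.5, 0.4, 0.3, 0.2, 0.1]

# Identical marginals: a univariate comparison sees two matching distributions.
same_marginals = sorted(train_a) == sorted(test_a) and sorted(train_b) == sorted(test_b)
print(same_marginals)

# The joint behaviour changed completely.
print(pearson(train_a, train_b), pearson(test_a, test_b))
```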

In [23]:
check = FeatureDrift()
result = check.run(train_dataset=ds_processed_train, test_dataset=ds_processed_test)
result.show()

[Output: interactive Feature Drift report, not rendered in this export]

## Multivariate Drift


Multivariate drift is a change that occurs in several features at once and may alter the relationships among them. Univariate methods, which examine one feature at a time, can overlook such interconnected changes. The multivariate drift check looks for drift across all features of the two input datasets. It does so with a domain classifier: a model trained to tell train samples apart from test samples.
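
The domain-classifier idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the deepchecks implementation: it swaps in a hand-written logistic regression, uses synthetic Gaussian data, and assumes the drift score is derived from the held-out AUC as `max(2 * AUC - 1, 0)`. All function names are hypothetical.

```python
import math
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logistic(rows, labels, epochs=200, lr=0.5):
    """Plain batch gradient-descent logistic regression; returns weights and bias."""
    w, b = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(min(z, 30.0), -30.0)  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def domain_classifier_drift(train_rows, test_rows, seed=0):
    """Label train rows 0 and test rows 1, fit on one half, score AUC on the other."""
    rng = random.Random(seed)
    data = [(r, 0) for r in train_rows] + [(r, 1) for r in test_rows]
    rng.shuffle(data)
    split = len(data) // 2
    fit_part, eval_part = data[:split], data[split:]
    w, b = fit_logistic([r for r, _ in fit_part], [y for _, y in fit_part])
    scores = [sum(wi * xi for wi, xi in zip(w, r)) + b for r, _ in eval_part]
    held_out_auc = auc(scores, [y for _, y in eval_part])
    # Assumed mapping: AUC 0.5 (indistinguishable) -> 0, AUC 1.0 (separable) -> 1.
    return max(2 * held_out_auc - 1, 0.0)


# Synthetic data: two samples from the same distribution, and one with a shifted mean.
rng = random.Random(42)
same_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
same_b = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
shifted = [[rng.gauss(2, 1), rng.gauss(0, 1)] for _ in range(200)]

no_drift_score = domain_classifier_drift(same_a, same_b)
drift_score = domain_classifier_drift(same_a, shifted)
print(no_drift_score, drift_score)
```

A linear model like this one only picks up shifts it can separate with a hyperplane, such as a change in a feature's mean. Detecting a change in the relationship between features calls for a non-linear classifier.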

In [24]:
check = MultivariateDrift()
result = check.run(train_dataset=ds_processed_train, test_dataset=ds_processed_test)
result.show()

[Output: interactive Multivariate Drift report, not rendered in this export]

# Numerical Features Only

In this section, we assess data drift in the numerical features only. We exclude the categorical features because the current model uses only numerical features to generate clusters.

## Loading Dataset

In [25]:
current_dir = os.getcwd()
two_levels_up = os.path.dirname(os.path.dirname(current_dir))
sys.path.insert(0, two_levels_up)
import conf

In [26]:
three_levels_up = os.path.dirname(os.path.dirname(os.path.dirname(current_dir)))
sys.path.insert(0, three_levels_up)
# Processed Data
processed_train_data_directory = three_levels_up + '/' + conf.PRO_DATA_DIR + 'trainSet.csv'
processed_test_data_directory = three_levels_up + '/' + conf.PRO_DATA_DIR + 'testSet.csv'

In [27]:
df_processed_train = pd.read_csv(processed_train_data_directory)
df_processed_train = df_processed_train.drop(['Artist', 'Name'], axis=1)
df_processed_train

Unnamed: 0,Energy,Liveness,Loudness
0,0.418,0.159,0.671575
1,0.910,0.252,0.897120
2,0.423,0.143,0.691744
3,0.859,0.127,0.794126
4,0.899,0.241,0.756586
...,...,...,...
605,0.835,0.155,0.848115
606,0.801,0.108,0.791058
607,0.748,0.129,0.845661
608,0.967,0.203,0.889835


In [28]:
df_processed_test = pd.read_csv(processed_test_data_directory)
df_processed_test = df_processed_test.drop(['Artist', 'Name'], axis=1)
df_processed_test

Unnamed: 0,Energy,Liveness,Loudness
0,0.730,0.0897,0.756423
1,0.652,0.0931,0.927975
2,0.588,0.1490,0.727363
3,0.405,0.1050,0.772325
4,0.522,0.0685,0.854078
...,...,...,...
495,0.291,0.1140,0.467199
496,0.412,0.1160,0.598840
497,0.635,0.3660,0.567910
498,0.882,0.2060,0.947992


In [29]:
ds_processed_train = Dataset(df_processed_train)
ds_processed_test = Dataset(df_processed_test)



## Univariate Drift

As in the section with categorical features, we start by checking one feature at a time. With the categorical columns dropped, the relevant measures are those for continuous numeric distributions: the Kolmogorov-Smirnov statistic or the Wasserstein metric (Earth Mover's Distance). The same caveat applies: each feature is checked individually, so drift in the relationships between features can go unnoticed.
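
For numerical features, the Wasserstein metric has a simple reading in one dimension: it is the area between the two empirical CDFs. The sketch below is a minimal, un-normalized version written for illustration. The function name is hypothetical, and library scores may be scaled differently (for example, by the value range).

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 (Earth Mover's) distance between two 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    i = j = 0
    dist = 0.0
    for left, right in zip(points, points[1:]):
        # Count the values <= left to get both CDF heights on [left, right).
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        dist += abs(i / len(a) - j / len(b)) * (right - left)
    return dist


# Liveness values from the first five displayed train and test rows (illustrative only).
train_liveness = [0.159, 0.252, 0.143, 0.127, 0.241]
test_liveness = [0.0897, 0.0931, 0.1490, 0.1050, 0.0685]
print(wasserstein_1d(train_liveness, test_liveness))
```

Unlike the Kolmogorov-Smirnov statistic, this distance is expressed in the units of the feature, so it reflects how far values moved and not only how many of them moved.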

In [30]:
check = FeatureDrift()
result = check.run(train_dataset=ds_processed_train, test_dataset=ds_processed_test)
result.show()

[Output: interactive Feature Drift report, not rendered in this export]

## Multivariate Drift


As before, the univariate check can miss changes that affect several features at once. We therefore also run the multivariate drift check, which uses a domain classifier to detect drift across the numerical features of the two datasets.

In [31]:
check = MultivariateDrift()
result = check.run(train_dataset=ds_processed_train, test_dataset=ds_processed_test)
result.show()

[Output: interactive Multivariate Drift report, not rendered in this export]