# 🏆 Detecting Data Drift with Deepchecks – Wine Quality Dataset

## Introduction  
In machine learning, **data drift** occurs when the statistical properties of input data change over time, potentially leading to model degradation. Identifying and addressing drift is crucial for maintaining model performance in production.

In this notebook, we will:
- Explore **data drift detection** using [Deepchecks](https://deepchecks.com/).
- Use the **Wine Quality (Red) dataset** from the UCI Machine Learning Repository.
- Fit a `RandomForestRegressor` model.
- Simulate drift by altering feature distributions.
- Run Deepchecks to detect and analyze the drift.

Let's get started! 🚀

## 1- Read data

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Deepchecks drift checks used in this notebook
from deepchecks.tabular.checks import FeatureDrift, PredictionDrift, MultivariateDrift
# Deepchecks Dataset class to wrap the data for drift detection
from deepchecks.tabular import Dataset

In [2]:
# Read the wine-quality csv file from the URL
csv_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(csv_url,sep=";")

In [3]:
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
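
The `sep=";"` argument in the read step matters because the wine-quality file is semicolon-delimited. The sketch below uses a small in-memory sample (a simplified, hypothetical excerpt with only three columns, not the real file) to show what happens with and without it.

```python
import io

import pandas as pd

# Hypothetical, simplified excerpt in a semicolon-delimited layout.
raw = "fixed acidity;volatile acidity;quality\n7.4;0.7;5\n7.8;0.88;5\n"

# With the default comma separator, each line is parsed as one wide column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=";", the three columns are split as intended.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

Checking `data.shape` and `data.columns` right after loading is a cheap way to catch a wrong separator before it affects later steps.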


## 2- Simulating Production Data for Drift Detection  

In drift detection, we compare two datasets:  

- **📌 Reference Data (Historical Data)** → Set using the **training dataset**. It represents past, stable data that the model was trained on.  
- **🚀 Live/Batch Production Data** → Simulated in this notebook using the **test dataset**. This represents new, incoming data from a live production system (real-time) or batch inference (offline processing).  

💡 **Why does this matter?**  

If **production data** differs significantly from **reference data**, the model may experience **data drift**, which can degrade performance. Identifying this drift early is crucial for maintaining reliable predictions.
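
To make "differs significantly" concrete, here is a minimal sketch of one common heuristic, the Population Stability Index (PSI), computed with NumPy on synthetic samples. PSI is used here only as an illustration; it is not necessarily the statistic Deepchecks applies to numeric features, and the function name, bin count, and sample parameters are assumptions.

```python
import numpy as np


def population_stability_index(reference, production, n_bins=10):
    """PSI between two 1-D numeric samples, using reference-quantile bins."""
    # Interior bin edges are taken from the reference sample's quantiles.
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    # searchsorted maps each value to a bin index in [0, n_bins - 1].
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference), minlength=n_bins)
    prod_counts = np.bincount(np.searchsorted(inner_edges, production), minlength=n_bins)
    # Clip fractions away from zero so the logarithm stays finite.
    eps = 1e-6
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    prod_frac = np.clip(prod_counts / len(production), eps, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))


rng = np.random.default_rng(0)
reference = rng.normal(0.5, 0.18, size=2000)      # hypothetical reference feature
same = rng.normal(0.5, 0.18, size=2000)           # production without drift
shifted = rng.normal(0.5, 0.18, size=2000) * 1.5  # production with scaled values

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

A PSI near zero means the two samples fill the reference bins in similar proportions; larger values indicate that production data has moved into different bins.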


In [4]:
# Split the data into training and test sets (75% / 25%).
train, test = train_test_split(data, test_size=0.25, random_state=42)
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]

In [5]:
# Create and fit a random forest regressor (later cells refer to it as `lr`)
lr = RandomForestRegressor()
lr.fit(train_x, train_y.values.flatten())

## 3- Inducing Synthetic Drift

In this step, we artificially induce drift in the test dataset by multiplying the **volatile acidity** feature by **1.5**. The training data is left unchanged, so it still serves as the reference.

In [6]:
# Induce synthetic drift 
test_x['volatile acidity']=test_x['volatile acidity']*1.5
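
As a quick sanity check on what this multiplication does to a distribution, the sketch below applies the same factor to a synthetic column (a hypothetical stand-in for volatile acidity, not the real data). Scaling by a constant scales the mean, standard deviation, and range by that same constant.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for a positive-valued feature column.
original = rng.uniform(0.2, 1.0, size=400)
drifted = original * 1.5

# Mean, standard deviation, and minimum all grow by the factor 1.5.
print(original.mean(), drifted.mean())
print(original.std(), drifted.std())
print(original.min(), drifted.min())
```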

## 4- Detecting Drift

In [7]:
# Wrap the data in Deepchecks datasets (no categorical features)
train_dataset = Dataset(df=train_x, cat_features=[])
test_dataset = Dataset(df=test_x, cat_features=[])

### Feature Drift

In [8]:
# Feature drift check
check = FeatureDrift()
# Condition: the drift score of every numeric feature must stay below 0.2
check.add_condition_drift_score_less_than(max_allowed_numeric_score=0.2)
# Run the check (the model is optional; it is used to show feature importance alongside the drift score)
result = check.run(train_dataset=train_dataset, test_dataset=test_dataset, model=lr)
result

[Output: interactive Deepchecks widget showing the Feature Drift report]

🚨 Induced drift successfully detected 🚨
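
To give a feel for what a per-feature drift score with a 0.2 threshold looks like, here is a minimal sketch that computes the two-sample Kolmogorov-Smirnov statistic for each column with SciPy. This is an illustration on synthetic data, not Deepchecks' implementation; the drift method Deepchecks uses for numeric features may differ by version and configuration, and the column values below are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical reference data with two numeric features.
reference = pd.DataFrame({
    "volatile acidity": rng.uniform(0.2, 1.0, size=1000),
    "pH": rng.normal(3.3, 0.15, size=1000),
})
# Hypothetical production data: same distributions, with one feature scaled by 1.5.
production = pd.DataFrame({
    "volatile acidity": rng.uniform(0.2, 1.0, size=400) * 1.5,
    "pH": rng.normal(3.3, 0.15, size=400),
})

threshold = 0.2
# KS statistic per feature: the largest gap between the two empirical CDFs.
scores = {
    col: ks_2samp(reference[col], production[col]).statistic
    for col in reference.columns
}
# A "less than" condition fails for any feature whose score reaches the threshold.
flagged = [col for col, score in scores.items() if score >= threshold]
print(scores)
print(flagged)
```

Only the scaled column should be flagged, since the other feature is drawn from the same distribution in both samples.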

### Dataset Drift

In [9]:
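# Multivariate drift check: looks at all features together rather than one at a time
dataset_drift_check = MultivariateDrift()
# Condition: the overall drift value must stay below 0.2
dataset_drift_check.add_condition_overall_drift_value_less_than(max_drift_value=0.2)
ds_drift_result = dataset_drift_check.run(train_dataset=train_dataset, test_dataset=test_dataset)
ds_drift_result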

[Output: interactive Deepchecks widget showing the Multivariate Drift report]

🚨 Induced Drift successfully detected again 🚨 
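
A common way to measure dataset-level drift is a domain classifier: a model trained to tell reference rows from production rows, where a better-than-chance AUC signals drift. The sketch below illustrates that idea with scikit-learn on synthetic data. The logistic regression model, the additive shift, and the `max(2 * auc - 1, 0)` rescaling are assumptions chosen for illustration, and Deepchecks' own model and scoring may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Hypothetical reference and production samples with three features each.
reference = rng.normal(0.0, 1.0, size=(1000, 3))
production = rng.normal(0.0, 1.0, size=(1000, 3))
production[:, 0] += 1.0  # shift one feature to simulate drift

# Label each row by its origin: 0 = reference, 1 = production.
X = np.vstack([reference, production])
origin = np.concatenate([np.zeros(len(reference)), np.ones(len(production))])

# Out-of-fold probabilities, so the AUC is not inflated by overfitting.
proba = cross_val_predict(
    LogisticRegression(), X, origin, cv=5, method="predict_proba"
)[:, 1]
auc = roc_auc_score(origin, proba)

# Rescale so an AUC of 0.5 (indistinguishable) maps to 0 and 1.0 maps to 1.
drift_score = max(2 * auc - 1, 0.0)
print(round(drift_score, 3))
```

If the classifier cannot separate the two samples, the score stays near zero; the further the AUC rises above 0.5, the stronger the evidence of drift.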

### Prediction Drift

In [11]:
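# Prediction drift check: compares the model's predictions on the two datasets
pred_drift_check = PredictionDrift()
# Condition: the prediction drift score must stay below 0.2
pred_drift_check.add_condition_drift_score_less_than(max_allowed_drift_score=0.2)
# Run the check (the model is required here)
pred_drift_result = pred_drift_check.run(train_dataset, test_dataset, model=lr)
pred_drift_result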

[Output: interactive Deepchecks widget showing the Prediction Drift report]

✅ No prediction drift detected, as expected!
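
Feature drift does not always translate into prediction drift: how much the predictions move depends on how strongly the model relies on the drifted feature. The sketch below illustrates this on synthetic data, using the KS statistic between prediction distributions as a stand-in drift score. The data, model settings, and scoring are assumptions for illustration, and whether this explains the result above depends on the fitted model.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical data: the target depends on feature 0 only; feature 1 is noise.
X_train = rng.uniform(0.0, 1.0, size=(1000, 2))
y_train = 2.0 * X_train[:, 0] + rng.normal(0.0, 0.1, size=1000)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

reference = rng.uniform(0.0, 1.0, size=(500, 2))
drift_relevant = reference.copy()
drift_relevant[:, 0] *= 1.5    # drift in the feature the model relies on
drift_irrelevant = reference.copy()
drift_irrelevant[:, 1] *= 1.5  # drift in the feature that carries no signal

# Compare the prediction distributions before and after each kind of drift.
ref_pred = model.predict(reference)
relevant_score = ks_2samp(ref_pred, model.predict(drift_relevant)).statistic
irrelevant_score = ks_2samp(ref_pred, model.predict(drift_irrelevant)).statistic
print(relevant_score, irrelevant_score)
```

Running the feature-level and prediction-level checks together helps separate drift that changes model output from drift that the model is largely insensitive to.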

## 🎉 Conclusion

In this experiment, we introduced synthetic drift in the volatile acidity feature of the test data, and Deepchecks flagged it in both the feature drift check and the multivariate (dataset) drift check. The prediction drift check, by contrast, did not report drift.

Each check produces a drift score and can be paired with a pass/fail condition (0.2 in this notebook), which makes it straightforward to monitor production data against a reference dataset over time. That combination of per-feature, dataset-level, and prediction-level checks is what makes Deepchecks a useful component in a machine learning pipeline.