# Drift detection

In the context of machine learning, data drift refers to a change in the statistical properties of the input data over time compared to the data on which the model was originally trained. This can result in the model's performance deteriorating, as it may no longer generalise well to new, unseen data distributions.

There are several types of data drift, including:

- **Covariate Drift (Feature Drift)**: The distribution of the input features changes over time, but the relationship between features and target remains the same. For example, if a feature covered a specific range of values in the training data and its values later shift well outside that range, the model is asked to predict in regions it has never seen.

- **Prior Probability Drift (Label Drift)**: The distribution of the target variable changes. For instance, a model trained to predict energy consumption during a specific season may face a different distribution of consumption values if the seasonal patterns change due to external factors.

- **Concept Drift**: The relationship between the input features and the target variable changes. For example, a model trained to predict energy consumption from weather data may degrade if new technologies or behavioural changes alter how weather affects consumption.

Detecting and addressing data drift is crucial for maintaining the performance of machine learning models in production. Techniques for handling data drift include:

- Monitoring input data in the prediction phase.

- Monitoring model performance (accuracy, precision, recall, etc.) over time.
  
- Periodically retraining the model with new data to adapt to changes.
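As a rough illustration of the second technique, the sketch below tracks a rolling mean absolute error and raises a flag when it exceeds a threshold. The class name `RollingMAEMonitor`, the window size, the threshold and the toy values are all hypothetical choices made for this example; it is plain Python and not part of skforecast.

```python
from collections import deque


class RollingMAEMonitor:
    """Track the mean absolute error over the last `window` predictions."""

    def __init__(self, window: int, threshold: float):
        # A bounded deque drops the oldest error automatically.
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def update(self, y_true: float, y_pred: float) -> bool:
        """Record one observation; return True if the rolling MAE exceeds the threshold."""
        self.errors.append(abs(y_true - y_pred))
        rolling_mae = sum(self.errors) / len(self.errors)
        return rolling_mae > self.threshold


# Toy values: the last two predictions are far from the observed values.
monitor = RollingMAEMonitor(window=3, threshold=1.0)
observed = [10.0, 10.5, 11.0, 14.0, 15.0]
predicted = [10.2, 10.4, 10.8, 11.0, 11.0]

alerts = [monitor.update(t, p) for t, p in zip(observed, predicted)]
print(alerts)  # [False, False, False, True, True]
```

A monitor like this only reacts once true values become available, which is why checking the input data at prediction time is a useful complement.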

Skforecast provides the class `RangeDriftDetector` to detect covariate drift in both single and multiple time series, as well as in exogenous variables. It checks whether the input data (lags and exogenous variables) used to predict new values falls within the range of the data used to train the model. Its API follows the same design as the forecasters: the data used to train a forecaster can also be used to fit the `RangeDriftDetector`, and the data passed for prediction can be used to check for drift.
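The idea of a range check can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the implementation used by `RangeDriftDetector`: the helper names `fit_ranges` and `out_of_range` and the toy data are assumptions made for this example. Numeric columns are summarised by their minimum and maximum, and non-numeric columns by the set of values seen during training.

```python
from numbers import Number


def fit_ranges(columns: dict) -> dict:
    """Store (min, max) for numeric columns and the set of seen values otherwise."""
    ranges = {}
    for name, values in columns.items():
        if all(isinstance(v, Number) for v in values):
            ranges[name] = (min(values), max(values))
        else:
            ranges[name] = set(values)
    return ranges


def out_of_range(ranges: dict, new_columns: dict) -> dict:
    """Return, per column, the new values that fall outside the fitted range."""
    flagged = {}
    for name, values in new_columns.items():
        fitted = ranges[name]
        if isinstance(fitted, tuple):
            low, high = fitted
            bad = [v for v in values if v < low or v > high]
        else:
            bad = [v for v in values if v not in fitted]
        if bad:
            flagged[name] = bad
    return flagged


# Toy data, invented for illustration only.
train = {"y": [8.0, 9.5, 12.5, 10.4], "exog_2": ["A", "B", "A", "C"]}
new = {"y": [9.0, 100.0, 11.0], "exog_2": ["A", "W", "C"]}

ranges = fit_ranges(train)
flagged = out_of_range(ranges, new)
print(flagged)  # {'y': [100.0], 'exog_2': ['W']}
```

A check of this kind is cheap and needs no true target values, but it only catches values outside the observed limits. A shift in distribution that stays within the training range would go unnoticed.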

<div role="note"
    style="background: rgba(0,184,212,.08); border-left: 6px solid #00b8d4;
          border-radius: 6px; padding: 10px 12px; margin: 1em 0;">

<p style="display:flex; align-items:center; font-size:1rem; color:#00b8d4;
          margin:0 0 6px 0; font-weight:600;">
  <span style="margin-right:6px; font-size:1.125em;">✏️</span>
  <strong style="font-size:18px;">Note</strong>
</p>

<p style="margin:0; color:inherit;">
  This module is in active development. We expect to add more features and improvements in future releases.
</p>

</div>

## Libraries and data

This user guide uses simulated data for the first examples and the `h2o_exog` dataset for the last one. The latter contains the monthly expenditure on corticosteroid drugs of the Australian health system, together with two simulated exogenous variables, `exog_1` and `exog_2`.

In [24]:
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
from skforecast.recursive import ForecasterRecursive
from skforecast.drift_detection import RangeDriftDetector
from skforecast.datasets import fetch_dataset
from sklearn.ensemble import HistGradientBoostingRegressor

## Out of range detection for single series

In [25]:
# Simulated data
# ==============================================================================
rgn = np.random.default_rng(123)
series = pd.Series(
    rgn.normal(loc=10, scale=2, size=100),
    index=pd.date_range(start="2020-01-01", periods=100),
    name="y",
)
exog = pd.DataFrame({
        'exog_1': rgn.normal(loc=10, scale=2, size=100),
        'exog_2': rgn.choice(['A', 'B', 'C', 'D', 'E'], size=100)
    }, index=series.index
)
display(series.head())
display(exog.head())

2020-01-01     8.021757
2020-01-02     9.264427
2020-01-03    12.575851
2020-01-04    10.387949
2020-01-05    11.840462
Freq: D, Name: y, dtype: float64

| | exog_1 | exog_2 |
|---|---|---|
| 2020-01-01 | 8.968465 | B |
| 2020-01-02 | 13.316227 | B |
| 2020-01-03 | 9.405475 | A |
| 2020-01-04 | 7.233246 | A |
| 2020-01-05 | 9.437591 | A |


In [26]:
# Train RangeDriftDetector
# ==============================================================================
detector = RangeDriftDetector()
detector.fit(series=series, exog=exog)
detector

RangeDriftDetector 
Series value ranges: {'y': (np.float64(5.5850578036003915), np.float64(14.579819894629157))} 
Exogenous value ranges: {'exog_1': (np.float64(4.5430286262543085), np.float64(14.531041199734418)), 'exog_2': {'D', 'A', 'B', 'E', 'C'}} 
Fitted series: ['y'] 
Fitted exogenous: ['exog_1', 'exog_2'] 

Let's assume the model is deployed in production and new data is being used to forecast future values. We simulate covariate drift in the target series and in the exogenous variables to illustrate how the `RangeDriftDetector` class detects it.

In [27]:
last_window = pd.Series([6.6, 7.5, 100, 9.3, 10.2], name='y') # Value 100 is out of range
last_window_exog = pd.DataFrame({
        'exog_1': [8, 9, 10, 70, 12], # Value 70 is out of range
        'exog_2': ['A', 'B', 'C', 'D', 'W'] # Value 'W' is out of range
    }
)
any_out_of_range, series_out_of_range, exog_out_of_range = detector.predict(
    last_window       = last_window,
    exog              = last_window_exog,
    verbose           = True,
    suppress_warnings = False
)

## Out of range detection for multiple series

The same process can be applied when modeling multiple time series. In this case, if exogenous variables are used, the drift detector checks them grouped by series.

In [28]:
# Simulated data - Multiple time series
# ==============================================================================
index = pd.MultiIndex.from_product(
    [['series_1', 'series_2', 'series_3'], pd.date_range(start="2020-01-01", periods=3)],
    names=['series_id', 'datetime']
)
series = pd.DataFrame({
        'y': [1, 2, 3, 10, 20, 30, 100, 200, 300]
    }, index=index
)
exog = pd.DataFrame({
        'exog_1': [5.0, 6.0, 7.0, 15.0, 25.0, 35.0, 150.0, 250.0, 350.0],
        'exog_2': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
    }, index=index
)
display(series)
display(exog)


| series_id | datetime | y |
|---|---|---|
| series_1 | 2020-01-01 | 1 |
| series_1 | 2020-01-02 | 2 |
| series_1 | 2020-01-03 | 3 |
| series_2 | 2020-01-01 | 10 |
| series_2 | 2020-01-02 | 20 |
| series_2 | 2020-01-03 | 30 |
| series_3 | 2020-01-01 | 100 |
| series_3 | 2020-01-02 | 200 |
| series_3 | 2020-01-03 | 300 |


| series_id | datetime | exog_1 | exog_2 |
|---|---|---|---|
| series_1 | 2020-01-01 | 5.0 | A |
| series_1 | 2020-01-02 | 6.0 | B |
| series_1 | 2020-01-03 | 7.0 | C |
| series_2 | 2020-01-01 | 15.0 | D |
| series_2 | 2020-01-02 | 25.0 | E |
| series_2 | 2020-01-03 | 35.0 | F |
| series_3 | 2020-01-01 | 150.0 | G |
| series_3 | 2020-01-02 | 250.0 | H |
| series_3 | 2020-01-03 | 350.0 | I |


In [29]:
# Train RangeDriftDetector - Multiple time series
# ==============================================================================
detector = RangeDriftDetector()
detector.fit(series=series, exog=exog)
detector

RangeDriftDetector 
Series value ranges: {'series_1': (np.int64(1), np.int64(3)), 'series_2': (np.int64(10), np.int64(30)), 'series_3': (np.int64(100), np.int64(300))} 
Exogenous value ranges: {'series_1': {'exog_1': (np.float64(5.0), np.float64(7.0)), 'exog_2': {'A', 'C', 'B'}}, 'series_2': {'exog_1': (np.float64(15.0), np.float64(35.0)), 'exog_2': {'F', 'D', 'E'}}, 'series_3': {'exog_1': (np.float64(150.0), np.float64(350.0)), 'exog_2': {'I', 'H', 'G'}}} 
Fitted series: ['series_1', 'series_2', 'series_3'] 
Fitted exogenous: ['exog_1', 'exog_2'] 

In [30]:
# New data - Multiple time series
# ==============================================================================
rgn = np.random.default_rng(123)
index = pd.MultiIndex.from_product(
    [['series_1', 'series_2', 'series_3'], pd.date_range(start="2020-01-06", periods=2)],
    names=['series_id', 'datetime']
)
last_window = pd.DataFrame({
        'y': [1.5, 2.3, 100, 20, 110, 200]
    }, index=index
) # Value 100 is out of range for series_2
last_window_exog = pd.DataFrame({
        'exog_1': [5.0, 6.1, 10, 70, 220, 290], # Values 10 and 70 are out of range for series_2
        'exog_2': ['A', 'B', 'D', 'F', 'W', 'E'] # Values 'W' and 'E' were not seen for series_3
    }, index=index
)
any_out_of_range, series_out_of_range, exog_out_of_range = detector.predict(
    last_window       = last_window,
    exog              = last_window_exog,
    verbose           = True,
    suppress_warnings = False
)
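To make the per-series grouping concrete, the sketch below repeats the comparison by hand in plain Python. The ranges are copied from the fitted detector printed above and the new values from `last_window` and `last_window_exog`. The helper `outside` and the dictionary layout are assumptions made for this example, and the result does not reproduce the detector's own output format or messages.

```python
# Ranges copied from the fitted detector shown above.
series_ranges = {
    "series_1": (1, 3),
    "series_2": (10, 30),
    "series_3": (100, 300),
}
exog_ranges = {
    "series_1": {"exog_1": (5.0, 7.0), "exog_2": {"A", "B", "C"}},
    "series_2": {"exog_1": (15.0, 35.0), "exog_2": {"D", "E", "F"}},
    "series_3": {"exog_1": (150.0, 350.0), "exog_2": {"G", "H", "I"}},
}

# Same values as `last_window` and `last_window_exog`, grouped by series id.
new_values = {
    "series_1": [1.5, 2.3],
    "series_2": [100, 20],
    "series_3": [110, 200],
}
new_exog = {
    "series_1": {"exog_1": [5.0, 6.1], "exog_2": ["A", "B"]},
    "series_2": {"exog_1": [10, 70], "exog_2": ["D", "F"]},
    "series_3": {"exog_1": [220, 290], "exog_2": ["W", "E"]},
}


def outside(fitted, values):
    """Values outside a (min, max) tuple, or not in a set of known categories."""
    if isinstance(fitted, tuple):
        low, high = fitted
        return [v for v in values if v < low or v > high]
    return [v for v in values if v not in fitted]


flagged_series = {}
flagged_exog = {}
for name in series_ranges:
    bad = outside(series_ranges[name], new_values[name])
    if bad:
        flagged_series[name] = bad
    for col, fitted in exog_ranges[name].items():
        bad = outside(fitted, new_exog[name][col])
        if bad:
            flagged_exog.setdefault(name, {})[col] = bad

print(flagged_series)  # {'series_2': [100]}
print(flagged_exog)    # {'series_2': {'exog_1': [10, 70]}, 'series_3': {'exog_2': ['W', 'E']}}
```

Note that the value `'E'` is a known category for `series_2` but not for `series_3`, so it is only flagged because the check is grouped by series.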

## Combine RangeDriftDetector with forecasters

When deploying a forecaster in production, it is useful to pair it with a drift detector. Fitting both on the same training data allows the drift detector to verify the input data before predictions are made.

In [31]:
# Data
# ==============================================================================
data = fetch_dataset(name='h2o_exog')
data.index.name = 'datetime'
data.head(3)

h2o_exog
--------
Monthly expenditure ($AUD) on corticosteroid drugs that the Australian health
system had between 1991 and 2008. Two additional variables (exog_1, exog_2) are
simulated.
Hyndman R (2023). fpp3: Data for Forecasting: Principles and Practice (3rd
Edition). http://pkg.robjhyndman.com/fpp3package/,
https://github.com/robjhyndman/fpp3package, http://OTexts.com/fpp3.
Shape of the dataset: (195, 3)


| datetime | y | exog_1 | exog_2 |
|---|---|---|---|
| 1992-04-01 | 0.379808 | 0.958792 | 1.166029 |
| 1992-05-01 | 0.361801 | 0.951993 | 1.117859 |
| 1992-06-01 | 0.410534 | 0.952955 | 1.067942 |


In [32]:
# Train Forecaster and RangeDriftDetector
# ==============================================================================
steps = 36
data_train = data.iloc[:-steps, :]
data_test  = data.iloc[-steps:, :]

forecaster = ForecasterRecursive(
                 regressor = HistGradientBoostingRegressor(random_state=123),
                 lags      = 15
             )
detector = RangeDriftDetector()
forecaster.fit(
    y    = data_train['y'],
    exog = data_train[['exog_1', 'exog_2']]
)
detector.fit(
    series = data_train['y'],
    exog   = data_train[['exog_1', 'exog_2']]
)

In [33]:
data_train['y'].iloc[-forecaster.max_lag:]

datetime
2004-04-01    0.739986
2004-05-01    0.795129
2004-06-01    0.856803
2004-07-01    1.001593
2004-08-01    0.994864
2004-09-01    1.134432
2004-10-01    1.181011
2004-11-01    1.216037
2004-12-01    1.257238
2005-01-01    1.170690
2005-02-01    0.597639
2005-03-01    0.652590
2005-04-01    0.670505
2005-05-01    0.695248
2005-06-01    0.842263
Freq: MS, Name: y, dtype: float64

In [34]:
# Predict with Forecaster and check data with RangeDriftDetector
# ==============================================================================
detector.predict(
    last_window       = data_train['y'].iloc[-forecaster.max_lag:],
    exog              = data_test[['exog_1', 'exog_2']],
    verbose           = True,
    suppress_warnings = False
)
predictions = forecaster.predict(
                  steps = steps,
                  exog  = data_test[['exog_1', 'exog_2']]
              )
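In the cell above, the forecaster predicts regardless of what the detector reports. One possible way to act on the check is to gate the prediction on the detector's flag, as in the sketch below. It assumes the first value returned by `detector.predict` is a boolean, as the variable name `any_out_of_range` used earlier suggests. The helper `predict_if_in_range` and the stand-in `fake_predict` are hypothetical, so the example runs without a fitted forecaster.

```python
from typing import Callable, Optional, Sequence


def predict_if_in_range(
    any_out_of_range: bool,
    predict_fn: Callable[[], Sequence[float]],
    allow_drift: bool = False,
) -> Optional[Sequence[float]]:
    """Call `predict_fn` only when no drift was flagged, or when drift is explicitly allowed."""
    if any_out_of_range and not allow_drift:
        # The caller decides what to do next: alert, use a baseline, retrain...
        return None
    return predict_fn()


def fake_predict() -> Sequence[float]:
    """Stand-in for a call such as `forecaster.predict(steps=steps, exog=...)`."""
    return [0.9, 1.0, 1.1]


blocked = predict_if_in_range(True, fake_predict)
passed = predict_if_in_range(False, fake_predict)
print(blocked)  # None
print(passed)   # [0.9, 1.0, 1.1]
```

Whether to block, warn or predict anyway is a design decision that depends on the cost of a bad forecast. Returning `None` here is just one option.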