<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Concept Drift Detectors - Solution
---

Dynamic environments are challenging for machine learning methods because data changes over time.

For this example, we will generate a synthetic data stream by concatenating 1000 samples from each of 3 uniform distributions:
- $dist_a$: values from $0.0$ to $0.5$ for the first 750 samples, then from $0.0$ to $0.75$ for the remaining 250
- $dist_b$: values from $0.6$ to $1.0$
- $dist_c$: values from $0.0$ to $0.25$

In [1]:
import numpy as np
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.layouts import gridplot
from bokeh.palettes import Pastel1
from bokeh.models import Span

In [2]:
def plot_data(dist_a, dist_b, dist_c, drifts=None, warnings=None):
    output_notebook()
    color_0 = Pastel1[3][0]
    color_1 = Pastel1[3][1]
    color_2 = Pastel1[3][2]

    left = figure(plot_width=600, plot_height=400,
                  tools="pan,box_zoom,reset,save",
                  title="drift stream",
                  x_axis_label='samples', y_axis_label='value',
                  background_fill_color="#fafafa"
                  )
    # add some renderers
    left.circle(range(1000), dist_a, legend_label=r"dist_a",
                fill_color=color_0, line_color=color_0, size=4)    
    left.circle(range(1000, 2000, 1), dist_b, legend_label=r"dist_b",
                fill_color=color_1, line_color=color_1, size=4)
    left.circle(range(2000, 3000, 1), dist_c, legend_label=r"dist_c",
                fill_color=color_2, line_color=color_2, size=4)

    right = figure(plot_width=300, plot_height=400,
                   tools="pan,box_zoom,reset,save",
                   title="distributions",
                   background_fill_color="#fafafa"
                   )
    hist, edges = np.histogram(dist_a, density=True, bins=50)
    right.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
               fill_color=color_0, line_color=color_0, legend_label='dist_a')
    hist, edges = np.histogram(dist_b, density=True, bins=50)
    right.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
               fill_color=color_1, line_color=color_1, legend_label='dist_b')
    hist, edges = np.histogram(dist_c, density=True, bins=50)
    right.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
               fill_color=color_2, line_color=color_2, legend_label='dist_c')
    
    if drifts is not None:
        for drift_loc in drifts:
            drift_line = Span(location=drift_loc, dimension='height',
                              line_color='red', line_width=2)
            left.add_layout(drift_line)
    
    if warnings is not None:
        for warning_loc in warnings:
            warning_line = Span(location=warning_loc, dimension='height',
                                line_color='blue', line_width=2, line_dash='dashed')
            left.add_layout(warning_line)
    
    p = gridplot([[left, right]])
    show(p)

In [4]:
np.random.seed(42)
dist_a1 = np.random.uniform(0,0.5,750)
dist_a2 = np.random.uniform(0,0.75,250)
dist_a = np.concatenate((dist_a1, dist_a2))
dist_b = np.random.uniform(0.6,1,1000)
dist_c = np.random.uniform(0,0.25,1000)

data_stream = np.concatenate((dist_a, dist_b, dist_c))

In [5]:
plot_data(dist_a,dist_b,dist_c)

As observed above, the synthetic data stream has **1 gradual drift** and **1 abrupt drift**.

The goal is to detect the drifts that occur after samples **1000** and **2000** in the synthetic data stream.

## ADWIN
---
In this example, we will use the [ADaptive WINdowing (`ADWIN`)](https://riverml.xyz/latest/api/drift/ADWIN/) drift detection method.

In [6]:
from river.drift import ADWIN

drift_detector = ADWIN()
drifts = []
warnings = []
warning = -1

for i, val in enumerate(data_stream):
    drift_detector.update(val)           # Data is processed one sample at a time
    if drift_detector.warning_detected:
        warning = i
    if drift_detector.change_detected:
        if warning != -1:
            print(f'Warning detected at index {warning} and Change detected at index {i}')
            warnings.append(warning)
            warning = -1
        else: 
            print(f'Change detected at index {i}')
        drifts.append(i)
        drift_detector.reset()           # As a best practice, we reset the detector

Change detected at index 1023
Change detected at index 1055
Change detected at index 1087
Change detected at index 1119
Change detected at index 1151
Change detected at index 2047
Change detected at index 2079


In [7]:
plot_data(dist_a,dist_b,dist_c,drifts,warnings)

## Page Hinkley
---
In this example, we will use the [Page Hinkley](https://riverml.xyz/latest/api/drift/PageHinkley/) drift detection method. It tracks the observed values and their running mean up to the current moment and applies the Page-Hinkley test: broadly, a concept drift is signalled when the cumulative deviation of the observed values from that mean exceeds a threshold value $\lambda$. Page-Hinkley does not signal warning zones, only change detections.
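
To make the test concrete, here is a minimal pure-Python sketch of a textbook Page-Hinkley test for upward shifts of the mean. It is not river's implementation: the function name `page_hinkley` and its parameters `delta`, `lam` and `min_obs` are hypothetical, and the default values are simply borrowed from the CUSUM exercise later in this notebook.

```python
import random


def page_hinkley(stream, delta=0.005, lam=50.0, min_obs=30):
    """Return the index of the first detected upward mean shift, or None.

    Illustrative sketch only; parameter names and defaults are assumptions.
    """
    mean = 0.0      # running mean of the values seen so far
    cum = 0.0       # cumulative deviation from the running mean
    cum_min = 0.0   # smallest cumulative deviation seen so far
    for n, x in enumerate(stream, start=1):
        mean += (x - mean) / n
        cum += x - mean - delta
        cum_min = min(cum_min, cum)
        # Alarm when the cumulative deviation rises far above its minimum
        if n >= min_obs and cum - cum_min > lam:
            return n - 1
    return None


random.seed(0)
stable = [random.uniform(0.0, 0.5) for _ in range(1000)]
shifted = [random.uniform(0.6, 1.0) for _ in range(1000)]

print(page_hinkley(stable + shifted))   # index of the alarm after the shift
print(page_hinkley(stable))             # None: no shift, no alarm
```

Because each sample contributes less than 1 to the cumulative deviation, a threshold of 50 cannot be crossed immediately, so the alarm lags the true change point by several dozen samples.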

In [8]:
from river.drift import PageHinkley

drift_detector = PageHinkley()
drifts = []
warnings = []
warning = -1

for i, val in enumerate(data_stream):
    drift_detector.update(val)           # Data is processed one sample at a time
    if drift_detector.warning_detected:
        warning = i
    if drift_detector.change_detected:
        if warning != -1:
            print(f'Warning detected at index {warning} and Change detected at index {i}')
            warnings.append(warning)
            warning = -1
        else: 
            print(f'Change detected at index {i}')
        drifts.append(i)
        drift_detector.reset()           # As a best practice, we reset the detector

Change detected at index 1051


In [9]:
plot_data(dist_a,dist_b,dist_c,drifts,warnings)

## DDM
---
In this example, we will use the [DDM](https://riverml.xyz/latest/api/drift/DDM/) drift detection method. It is based on the PAC learning model premise that the learner's error rate will decrease as the number of analysed samples increases, as long as the data distribution is stationary.

If the algorithm detects an increase in the error rate that surpasses a calculated threshold, it either signals a change or warns the user that a change may occur in the near future, which is called the warning zone.
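
The rule can be sketched in a few lines of pure Python. DDM is defined on a stream of 0/1 prediction errors rather than on raw values, so the sketch below feeds it a deterministic toy error stream whose error rate jumps from 20% to 50% at sample 1000. The class name `SimpleDDM` and its parameters are hypothetical, and the levels (warning at 2 standard deviations, change at 3, after at least 30 samples) are the commonly cited DDM defaults, not values taken from river's code.

```python
import math


class SimpleDDM:
    """Toy DDM-style detector over a 0/1 error stream (illustrative only)."""

    def __init__(self, min_obs=30, warning_level=2.0, drift_level=3.0):
        self.min_obs = min_obs
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0                    # samples seen
        self.errors = 0               # errors seen
        self.p_min = float("inf")     # error rate at the best point so far
        self.s_min = float("inf")     # its standard deviation
        self.warning_detected = False
        self.change_detected = False

    def update(self, error):
        self.n += 1
        self.errors += error
        p = self.errors / self.n                 # current error rate
        s = math.sqrt(p * (1 - p) / self.n)      # its standard deviation
        self.warning_detected = False
        self.change_detected = False
        if self.n < self.min_obs:
            return
        if p + s <= self.p_min + self.s_min:     # new best point
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.change_detected = True
        elif p + s > self.p_min + self.warning_level * self.s_min:
            self.warning_detected = True


# Toy error stream: 1 error every 5 samples, then 1 error every 2 samples
errors = [1 if i % 5 == 0 else 0 for i in range(1000)]
errors += [1 if i % 2 == 0 else 0 for i in range(1000)]

detector = SimpleDDM()
first_warning = first_drift = None
for i, err in enumerate(errors):
    detector.update(err)
    if detector.warning_detected and first_warning is None:
        first_warning = i
    if detector.change_detected:
        first_drift = i
        break

print(f"Warning at index {first_warning}, change at index {first_drift}")
```

In practice the 0/1 stream would come from comparing a model's predictions with the true labels, one sample at a time.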

In [10]:
from river.drift import DDM

drift_detector = DDM()
drifts = []
warnings = []
warning = -1

for i, val in enumerate(data_stream):
    drift_detector.update(val)           # Data is processed one sample at a time
    if drift_detector.warning_detected:
        warning = i
    if drift_detector.change_detected:
        if warning != -1:
            print(f'Warning detected at index {warning} and Change detected at index {i}')
            warnings.append(warning)
            warning = -1
        else: 
            print(f'Change detected at index {i}')
        drifts.append(i)
        drift_detector.reset()             # As a best practice, we reset the detector



In [11]:
plot_data(dist_a,dist_b,dist_c,drifts,warnings)

## EDDM
---
In this example, we will use the [EDDM](https://riverml.xyz/latest/api/drift/EDDM/) drift detection method. It works by keeping track of the average distance between two errors instead of only the error rate. For this, it is necessary to keep track of the running average distance and the running standard deviation, as well as the maximum distance and the maximum standard deviation.
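
The bookkeeping described above can be sketched as follows. This is a simplified, hypothetical `SimpleEDDM` class, not river's implementation: it keeps the running mean and standard deviation of the distance between consecutive errors, remembers the largest value of mean + 2·std, and compares the current value with that maximum. The ratio thresholds 0.95 (warning) and 0.90 (change) and the minimum of 30 distances are the commonly cited EDDM defaults; the helper `stream_from_gaps` and the toy stream are assumptions made for illustration.

```python
import math


class SimpleEDDM:
    """Toy EDDM-style detector over a 0/1 error stream (illustrative only)."""

    def __init__(self, min_distances=30, warning_level=0.95, drift_level=0.90):
        self.min_distances = min_distances
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0               # samples seen
        self.last_error = None   # position of the previous error
        self.k = 0               # number of error-to-error distances seen
        self.mean = 0.0          # running mean distance
        self.m2 = 0.0            # running sum of squared deviations
        self.max_score = 0.0     # largest mean + 2 * std seen so far
        self.warning_detected = False
        self.change_detected = False

    def update(self, error):
        self.n += 1
        self.warning_detected = False
        self.change_detected = False
        if not error:
            return
        if self.last_error is None:      # first error: no distance yet
            self.last_error = self.n
            return
        distance = self.n - self.last_error
        self.last_error = self.n
        self.k += 1
        diff = distance - self.mean      # Welford's online update
        self.mean += diff / self.k
        self.m2 += diff * (distance - self.mean)
        score = self.mean + 2 * math.sqrt(self.m2 / self.k)
        if self.k < self.min_distances:
            return
        if score > self.max_score:       # errors are as far apart as ever
            self.max_score = score
            return
        ratio = score / self.max_score
        if ratio < self.drift_level:
            self.change_detected = True
        elif ratio < self.warning_level:
            self.warning_detected = True


def stream_from_gaps(gaps):
    """Build a 0/1 error stream where each gap ends with one error."""
    stream = []
    for gap in gaps:
        stream.extend([0] * (gap - 1) + [1])
    return stream


# 1000 samples with errors 2 or 8 samples apart, then 1000 with errors 2 apart
errors = stream_from_gaps([2, 8] * 100) + stream_from_gaps([2] * 500)

detector = SimpleEDDM()
first_warning = first_drift = None
for i, err in enumerate(errors):
    detector.update(err)
    if detector.warning_detected and first_warning is None:
        first_warning = i
    if detector.change_detected:
        first_drift = i
        break

print(f"Warning at index {first_warning}, change at index {first_drift}")
```

Since the statistics only change when an error arrives, the detector reacts to errors becoming more frequent (shorter distances), which is what makes the approach suited to slow, gradual drifts.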

In [12]:
from river.drift import EDDM

drift_detector = EDDM()
drifts = []
warnings = []
warning = -1

for i, val in enumerate(data_stream):
    drift_detector.update(val)           # Data is processed one sample at a time
    if drift_detector.warning_detected:
        warning = i
    if drift_detector.change_detected:
        if warning != -1:
            print(f'Warning detected at index {warning} and Change detected at index {i}')
            warnings.append(warning)
            warning = -1
        else: 
            print(f'Change detected at index {i}')
        drifts.append(i)
        drift_detector.reset()          # As a best practice, we reset the detector

In [13]:
plot_data(dist_a,dist_b,dist_c,drifts,warnings)

## CUSUM
---
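CUSUM gives an alarm when the deviations $x_t - \hat{x}$ of the input data from its running mean $\hat{x}$ are, on average, significantly greater than zero. For each new value $x_t$:
- initialize $sum_0 = 0$
- update the running mean $\hat{x}$ with $x_t$
- $sum_t = \max(0, sum_{t-1} + (x_t - \hat{x}) - \delta)$
- $n \leftarrow n + 1$
- if $n > min_{obs}$ and $sum_t > \lambda$: signal a change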

Use $\delta=0.005$, $\lambda=50$, and $min_{obs}=30$

In [14]:
class CUSUM:
    """Cumulative-sum detector for upward shifts of the stream mean."""

    def __init__(self, delta, lamb, min_obs):
        self._delta = delta          # slack subtracted from every deviation
        self._lambda = lamb          # alarm threshold on the cumulative sum
        self._min_obs = min_obs      # observations required before an alarm
        self.reset()

    def update(self, value):
        # Update the running mean, then accumulate the deviation from it
        self._x_mean += (value - self._x_mean) / self._n
        self._sum = max(0, self._sum + value - self._x_mean - self._delta)
        self._n += 1

        if self._n >= self._min_obs and self._sum > self._lambda:
            self.change_detected = True

    def reset(self):
        # Forget the statistics collected so far
        self._n = 1
        self._x_mean = 0.0
        self._sum = 0.0
        self.warning_detected = False    # CUSUM has no warning zone
        self.change_detected = False

In [15]:
drift_detector = CUSUM(delta=0.005,lamb=50,min_obs=30)
drifts = []
warnings = []
warning = -1

for i, val in enumerate(data_stream):
    drift_detector.update(val)           # Data is processed one sample at a time
    if drift_detector.warning_detected:
        warning = i
    if drift_detector.change_detected:
        if warning != -1:
            print(f'Warning detected at index {warning} and Change detected at index {i}')
            warnings.append(warning)
            warning = -1
        else: 
            print(f'Change detected at index {i}')
        drifts.append(i)
        drift_detector.reset()          # As a best practice, we reset the detector

Change detected at index 1050


In [16]:
plot_data(dist_a,dist_b,dist_c,drifts,warnings)

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) Quantia Consulting, srl. All rights reserved.