<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Stream Classification
---

## `NEweather` dataset

**Description:** The National Oceanic and Atmospheric Administration (NOAA)
has compiled a database of weather measurements from over 7,000 weather
stations worldwide. Records date back to the mid-1900s. Daily measurements
include a variety of features (temperature, pressure, wind speed, etc.) as
well as a series of indicators for precipitation and other weather-related
events. The `NEweather` dataset contains data from this database, specifically
from Offutt Air Force Base in Bellevue, Nebraska, covering a period of over
50 years (1949-1999).

**Features:** 8 daily weather measurements
 
|        Attribute         | Description                  |
|:------------------------:|:-----------------------------|
| `temp`                   | Temperature                  |
| `dew_pnt`                | Dew Point                    |
| `sea_lvl_press`          | Sea Level Pressure           |
| `visibility`             | Visibility                   |
| `avg_wind_spd`           | Average Wind Speed           |
| `max_sustained_wind_spd` | Maximum Sustained Wind Speed |
| `max_temp`               | Maximum Temperature          |
| `min_temp`               | Minimum Temperature          |


**Class:** `rain` | 0: no rain, 1: rain
 
**Samples:** 18,159


In [1]:
import pandas as pd
from river.stream import iter_pandas
from river.metrics import Accuracy
from river.evaluate import progressive_val_score

In [2]:
data = pd.read_csv("../datasets/NEweather.csv")
features = data.columns[:-1]

In this example, we load the data from a CSV file with `pandas.read_csv` and use the [iter_pandas](https://riverml.xyz/latest/api/stream/iter-pandas/) utility function to iterate over the `DataFrame` one sample at a time.

In [3]:
stream = iter_pandas(X=data[features], y=data['rain'])
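
The stream created above yields `(x, y)` pairs, where `x` is a `dict` of feature values and `y` is the label. The evaluations in the following cells consume these pairs in a test-then-train fashion: each sample is first used to test the model and only then to train it. The sketch below illustrates that loop with a hypothetical `MajorityClass` toy model and a hand-made `toy_stream`; it is not river's implementation, and details such as how a missing first prediction is scored may differ.

```python
from collections import Counter


class MajorityClass:
    """Hypothetical toy model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test-then-train loop: predict on each sample before learning from it."""
    correct = 0
    total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # test first ...
        correct += int(y_pred == y)    # a missing prediction counts as an error here
        total += 1
        model.learn_one(x, y)          # ... then train
    return correct / total if total else 0.0


# Made-up samples in the same (dict, label) shape that the stream yields.
toy_stream = [
    ({"temp": 30.1}, 0),
    ({"temp": 28.4}, 0),
    ({"temp": 41.0}, 1),
    ({"temp": 29.9}, 0),
]
acc = prequential_accuracy(toy_stream, MajorityClass())
print(f"Accuracy: {acc:.2%}")
```

Because every sample is scored before the model sees its label, the reported accuracy reflects performance on unseen data at each point in the stream.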

## Naïve Bayes
---
[GaussianNB](https://riverml.xyz/latest/api/naive-bayes/GaussianNB/) maintains a Gaussian distribution $G_{cf}$ for each class $c$ and each feature $f$. Each Gaussian is updated with the value observed for its feature; the details can be found in `proba.Gaussian`. The joint log-likelihood is then obtained by summing the log probabilities of each feature associated with each class.
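
To make the description above concrete, here is a minimal pure-Python sketch of the idea. The class names `RunningGaussian` and `TinyGaussianNB` are hypothetical, the incremental mean and variance use Welford's method, and adding the log class prior to the sum is an assumption of this sketch; river's own implementation may differ in details.

```python
import math
from collections import defaultdict


class RunningGaussian:
    """Mean and variance updated one value at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def log_pdf(self, value, eps=1e-9):
        var = self.variance + eps  # eps avoids division by zero
        return -0.5 * (math.log(2 * math.pi * var) + (value - self.mean) ** 2 / var)


class TinyGaussianNB:
    """One Gaussian per (class, feature) pair, as in the description above."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.gaussians = defaultdict(lambda: defaultdict(RunningGaussian))

    def learn_one(self, x, y):
        self.class_counts[y] += 1
        for feature, value in x.items():
            self.gaussians[y][feature].update(value)

    def joint_log_likelihood(self, x):
        total = sum(self.class_counts.values())
        return {
            # Assumption: log prior of the class plus the per-feature log densities.
            c: math.log(n / total)
            + sum(self.gaussians[c][f].log_pdf(v) for f, v in x.items())
            for c, n in self.class_counts.items()
        }

    def predict_one(self, x):
        jll = self.joint_log_likelihood(x)
        return max(jll, key=jll.get) if jll else None


# Made-up one-feature samples, only to exercise the sketch.
nb = TinyGaussianNB()
for value, label in [(10, 0), (12, 0), (11, 0), (30, 1), (31, 1), (29, 1)]:
    nb.learn_one({"temp": value}, label)

pred_low = nb.predict_one({"temp": 11.5})
pred_high = nb.predict_one({"temp": 30.2})
```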

In [4]:
from river.naive_bayes import GaussianNB

model = GaussianNB()
metric = Accuracy()

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metric,
                      print_every=1000)

[1,000] Accuracy: 71.27%
[2,000] Accuracy: 69.88%
[3,000] Accuracy: 68.99%
[4,000] Accuracy: 68.82%
[5,000] Accuracy: 69.09%
[6,000] Accuracy: 69.13%
[7,000] Accuracy: 69.15%
[8,000] Accuracy: 68.50%
[9,000] Accuracy: 68.65%
[10,000] Accuracy: 69.04%
[11,000] Accuracy: 69.52%
[12,000] Accuracy: 69.74%
[13,000] Accuracy: 69.79%
[14,000] Accuracy: 69.88%
[15,000] Accuracy: 70.14%
[16,000] Accuracy: 70.05%
[17,000] Accuracy: 69.70%
[18,000] Accuracy: 69.36%


Accuracy: 69.21%

## K-Nearest Neighbors
---
[KNN](https://riverml.xyz/latest/api/neighbors/KNNClassifier/) is a non-parametric classification method that keeps track of the last `window_size` training samples. The predicted class label for a given query sample is obtained in two steps:

- Find the closest `n_neighbors` to the query sample in the data window.
- Aggregate the class labels of the `n_neighbors` to define the predicted class for the query sample.
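
The two steps above can be sketched in a few lines of plain Python. `TinyWindowKNN` is a hypothetical name, and the choices of Euclidean distance and an unweighted majority vote are assumptions of this sketch rather than a description of river's internals.

```python
import math
from collections import Counter, deque


class TinyWindowKNN:
    """Sliding-window KNN sketch: only the last `window_size` samples are kept."""

    def __init__(self, n_neighbors=5, window_size=1000):
        self.n_neighbors = n_neighbors
        self.window = deque(maxlen=window_size)  # oldest samples drop out automatically

    def learn_one(self, x, y):
        self.window.append((x, y))

    def predict_one(self, x):
        if not self.window:
            return None

        def distance(stored):
            # Assumption: Euclidean distance over the query's features.
            return math.sqrt(sum((x[f] - stored[f]) ** 2 for f in x))

        # Step 1: find the closest n_neighbors in the window.
        nearest = sorted(self.window, key=lambda item: distance(item[0]))
        nearest = nearest[: self.n_neighbors]
        # Step 2: aggregate their labels with a simple majority vote.
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Made-up samples; with window_size=2 the first one is forgotten.
knn = TinyWindowKNN(n_neighbors=1, window_size=2)
knn.learn_one({"a": 0.0}, "old")
knn.learn_one({"a": 10.0}, "x")
knn.learn_one({"a": 11.0}, "y")
pred = knn.predict_one({"a": 0.0})
```

The query sits exactly on the forgotten sample, yet the prediction comes from the nearest sample still in the window, which shows how the window bounds both memory and the influence of old data.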

In [5]:
from river.neighbors import KNNClassifier

model = KNNClassifier(n_neighbors=5, window_size=1000)
metric = Accuracy()
stream = iter_pandas(X=data[features], y=data['rain'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metric,
                      print_every=1000)

[1,000] Accuracy: 77.18%
[2,000] Accuracy: 78.34%
[3,000] Accuracy: 78.86%
[4,000] Accuracy: 78.29%
[5,000] Accuracy: 78.06%
[6,000] Accuracy: 77.95%
[7,000] Accuracy: 78.24%
[8,000] Accuracy: 77.96%
[9,000] Accuracy: 78.12%
[10,000] Accuracy: 78.16%
[11,000] Accuracy: 78.35%
[12,000] Accuracy: 78.47%
[13,000] Accuracy: 78.36%
[14,000] Accuracy: 78.26%
[15,000] Accuracy: 78.36%
[16,000] Accuracy: 78.24%
[17,000] Accuracy: 78.10%
[18,000] Accuracy: 77.90%


Accuracy: 77.91%

## Hoeffding Tree
---

[Hoeffding Tree](https://riverml.xyz/latest/api/tree/HoeffdingTreeClassifier/) 

In [6]:
from river.tree import HoeffdingTreeClassifier

model = HoeffdingTreeClassifier()
metric = Accuracy()
stream = iter_pandas(X=data[features], y=data['rain'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metric,
                      print_every=1000)

[1,000] Accuracy: 70.87%
[2,000] Accuracy: 69.73%
[3,000] Accuracy: 70.89%
[4,000] Accuracy: 71.29%
[5,000] Accuracy: 71.79%
[6,000] Accuracy: 72.13%
[7,000] Accuracy: 72.82%
[8,000] Accuracy: 72.58%
[9,000] Accuracy: 72.80%
[10,000] Accuracy: 72.85%
[11,000] Accuracy: 73.30%
[12,000] Accuracy: 73.55%
[13,000] Accuracy: 73.80%
[14,000] Accuracy: 73.73%
[15,000] Accuracy: 73.99%
[16,000] Accuracy: 74.03%
[17,000] Accuracy: 73.93%
[18,000] Accuracy: 73.58%


Accuracy: 73.55%

Tree-based models are popular due to their interpretability. They use a tree data structure to model the data. When a sample arrives, it traverses the tree until it reaches a leaf node. Internal nodes define the path for a data sample based on the values of its features. Leaf nodes are models that provide predictions for unlabeled samples and can update their internal state using the labels from labeled samples.
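
The traversal described above can be illustrated with a small sketch. The `Split` and `Leaf` classes, the majority-class leaf model, and the threshold on `dew_pnt` are all hypothetical; the sketch leaves out how a Hoeffding Tree decides when and where to split.

```python
from collections import Counter


class Leaf:
    """Leaf model: counts the labels it has seen and predicts the majority."""

    def __init__(self):
        self.class_counts = Counter()

    def predict(self):
        if not self.class_counts:
            return None
        return self.class_counts.most_common(1)[0][0]

    def learn(self, y):
        self.class_counts[y] += 1


class Split:
    """Internal node: routes a sample left or right based on one feature."""

    def __init__(self, feature, threshold, left, right):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right

    def route(self, x):
        return self.left if x[self.feature] <= self.threshold else self.right


def find_leaf(node, x):
    """Traverse from the root until a leaf is reached."""
    while isinstance(node, Split):
        node = node.route(x)
    return node


# A hand-built tree with a single, made-up split.
root = Split("dew_pnt", 50.0, Leaf(), Leaf())

# Labeled samples update the leaf they reach.
for x, y in [({"dew_pnt": 60.0}, 1), ({"dew_pnt": 62.0}, 1), ({"dew_pnt": 20.0}, 0)]:
    find_leaf(root, x).learn(y)

# Unlabeled samples get the prediction of the leaf they reach.
pred_humid = find_leaf(root, {"dew_pnt": 61.0}).predict()
pred_dry = find_leaf(root, {"dew_pnt": 25.0}).predict()
```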

## Hoeffding Adaptive Tree
---
The [HAT](https://riverml.xyz/latest/api/tree/HoeffdingAdaptiveTreeClassifier/) model uses `ADWIN` to detect changes. If a change is detected in a given branch, an alternate branch is created; it eventually replaces the original branch if it shows better performance on new data.
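
The change-detection idea can be illustrated with a much simpler detector than `ADWIN`. The sketch below is explicitly not `ADWIN`, which adapts its window length and uses a statistical bound to decide where to cut. It is a hypothetical fixed two-window check on a stream of 0/1 prediction errors, and the `window` and `threshold` parameters are made up for illustration.

```python
from collections import deque


class TwoWindowDetector:
    """Simplified stand-in for a change detector (not ADWIN).

    Compares the mean error of the older half of a fixed-size buffer with
    the mean error of the newer half and flags a change when the newer
    half is worse by more than `threshold`.
    """

    def __init__(self, window=50, threshold=0.3):
        self.window = window
        self.threshold = threshold
        self.errors = deque(maxlen=2 * window)

    def update(self, error):
        self.errors.append(error)
        if len(self.errors) < 2 * self.window:
            return False  # not enough data yet
        values = list(self.errors)
        old_mean = sum(values[: self.window]) / self.window
        new_mean = sum(values[self.window:]) / self.window
        return new_mean - old_mean > self.threshold


# Synthetic error stream: no errors for 100 samples, then only errors.
detector = TwoWindowDetector(window=50, threshold=0.3)
error_stream = [0] * 100 + [1] * 100

first_drift = None
for i, error in enumerate(error_stream):
    if detector.update(error) and first_drift is None:
        first_drift = i

print(f"Change first flagged at sample {first_drift}")
```

In an adaptive tree, a signal like this at a given node would be the trigger for growing an alternate branch, which is then compared against the original on the samples that follow.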

In [7]:
from river.tree import HoeffdingAdaptiveTreeClassifier

model = HoeffdingAdaptiveTreeClassifier(seed=42)
metric = Accuracy()
stream = iter_pandas(X=data[features], y=data['rain'])

progressive_val_score(dataset=stream, 
                      model=model, 
                      metric=metric, 
                      print_every=1000)

[1,000] Accuracy: 68.37%
[2,000] Accuracy: 69.48%
[3,000] Accuracy: 71.09%
[4,000] Accuracy: 72.02%
[5,000] Accuracy: 72.85%
[6,000] Accuracy: 73.33%
[7,000] Accuracy: 73.91%
[8,000] Accuracy: 73.51%
[9,000] Accuracy: 73.81%
[10,000] Accuracy: 73.85%
[11,000] Accuracy: 74.03%
[12,000] Accuracy: 74.16%
[13,000] Accuracy: 74.14%
[14,000] Accuracy: 73.96%
[15,000] Accuracy: 74.28%
[16,000] Accuracy: 74.34%
[17,000] Accuracy: 74.12%
[18,000] Accuracy: 73.60%


Accuracy: 73.59%

## Concept Drift Impact

Concept drift can negatively impact learning methods if not properly handled. Many real-world applications suffer from **model degradation** because the models cannot adapt to changes in the data.

---
## `AGRAWAL` dataset

We will load the data from a CSV file. The data was generated using the `AGRAWAL` data generator with 3 **gradual drifts** at the 5k, 10k, and 15k marks. It contains 9 features: 6 numeric and 3 categorical.

There are 10 functions for generating binary class labels from the features. These functions determine whether a **loan** should be approved.

| Feature    | Description            | Values                                                                |
|------------|------------------------|-----------------------------------------------------------------------|
| `salary`     | salary                 | uniformly distributed from 20k to 150k                                |
| `commission` | commission             | if (salary <   75k) then 0 else uniformly distributed from 10k to 75k |
| `age`        | age                    | uniformly distributed from 20 to 80                                   |
| `elevel`     | education level        | uniformly chosen from 0 to 4                                          |
| `car`        | car maker              | uniformly chosen from 1 to 20                                         |
| `zipcode`    | zip code of the town   | uniformly chosen from 0 to 8                                          |
| `hvalue`     | value of the house     | uniformly distributed from 50k x zipcode to 100k x zipcode            |
| `hyears`     | years house owned      | uniformly distributed from 1 to 30                                    |
| `loan`       | total loan amount      | uniformly distributed from 0 to 500k                                  |

**Class:** `class` | 0: no loan, 1: loan
 
**Samples:** 20,000

`elevel`, `car`, and `zipcode` are categorical features.
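
As an illustration of the feature table, the sketch below draws one sample by following the table literally. The function name `sample_features` is hypothetical, treating `age` and `hyears` as integers is an assumption, and the labeling functions are omitted because they are not specified here. The actual generator may differ in details.

```python
import random


def sample_features(rng):
    """Draw one feature dict following the ranges listed in the table."""
    salary = rng.uniform(20_000, 150_000)
    # Commission is zero below a salary of 75k, otherwise uniform in [10k, 75k].
    commission = 0.0 if salary < 75_000 else rng.uniform(10_000, 75_000)
    zipcode = rng.randint(0, 8)
    return {
        "salary": salary,
        "commission": commission,
        "age": rng.randint(20, 80),        # assumption: integer ages
        "elevel": rng.randint(0, 4),       # categorical
        "car": rng.randint(1, 20),         # categorical
        "zipcode": zipcode,                # categorical
        # House value depends on the zip code, as stated in the table.
        "hvalue": rng.uniform(50_000 * zipcode, 100_000 * zipcode),
        "hyears": rng.randint(1, 30),      # assumption: integer years
        "loan": rng.uniform(0, 500_000),
    }


rng = random.Random(42)
samples = [sample_features(rng) for _ in range(1000)]
print(sorted(samples[0]))
```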

In [8]:
data = pd.read_csv("../datasets/agr_a_20k.csv")
features = data.columns[:-1]

## Naïve Bayes

In [9]:
from river.naive_bayes import GaussianNB

model = GaussianNB()
metric = Accuracy()
stream = iter_pandas(X=data[features], y=data['class'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metric,
                      print_every=1000)

[1,000] Accuracy: 83.98%
[2,000] Accuracy: 86.29%
[3,000] Accuracy: 87.00%
[4,000] Accuracy: 87.55%
[5,000] Accuracy: 87.42%
[6,000] Accuracy: 80.50%
[7,000] Accuracy: 74.71%
[8,000] Accuracy: 70.87%
[9,000] Accuracy: 68.01%
[10,000] Accuracy: 66.25%
[11,000] Accuracy: 66.75%
[12,000] Accuracy: 67.30%
[13,000] Accuracy: 67.96%
[14,000] Accuracy: 68.74%
[15,000] Accuracy: 69.29%
[16,000] Accuracy: 68.33%
[17,000] Accuracy: 67.45%
[18,000] Accuracy: 66.90%
[19,000] Accuracy: 66.32%
[20,000] Accuracy: 65.94%


Accuracy: 65.94%

## KNN with ADWIN
---

This classifier is an improvement over the regular KNN method, as it is resistant to concept drift. It uses the `ADWIN` change detector to decide which samples to keep and which ones to forget, and by doing so it regulates the sample window size.

In [10]:
from river.neighbors import KNNADWINClassifier
from river import compose

model = (
    compose.Discard('elevel', 'car', 'zipcode') |
    KNNADWINClassifier(n_neighbors=5, window_size=1000)
)
metric = Accuracy()
stream = iter_pandas(X=data[features], y=data['class'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metric,
                      print_every=1000)

[1,000] Accuracy: 58.16%
[2,000] Accuracy: 58.08%
[3,000] Accuracy: 58.72%
[4,000] Accuracy: 59.56%
[5,000] Accuracy: 59.99%
[6,000] Accuracy: 59.46%
[7,000] Accuracy: 60.55%
[8,000] Accuracy: 61.30%
[9,000] Accuracy: 61.98%
[10,000] Accuracy: 62.32%
[11,000] Accuracy: 61.23%
[12,000] Accuracy: 60.97%
[13,000] Accuracy: 60.88%
[14,000] Accuracy: 60.97%
[15,000] Accuracy: 61.00%
[16,000] Accuracy: 61.25%
[17,000] Accuracy: 62.22%
[18,000] Accuracy: 63.09%
[19,000] Accuracy: 63.80%
[20,000] Accuracy: 64.41%


Accuracy: 64.41%

## Hoeffding Tree

In [11]:
from river.tree import HoeffdingTreeClassifier

model = HoeffdingTreeClassifier(nominal_attributes=['elevel', 'car', 'zipcode'])
metric = Accuracy()
stream = iter_pandas(X=data[features], y=data['class'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metric,
                      print_every=1000)

[1,000] Accuracy: 82.18%
[2,000] Accuracy: 82.79%
[3,000] Accuracy: 84.63%
[4,000] Accuracy: 86.27%
[5,000] Accuracy: 87.08%
[6,000] Accuracy: 80.76%
[7,000] Accuracy: 76.87%
[8,000] Accuracy: 74.67%
[9,000] Accuracy: 74.14%
[10,000] Accuracy: 74.41%
[11,000] Accuracy: 73.54%
[12,000] Accuracy: 73.48%
[13,000] Accuracy: 73.84%
[14,000] Accuracy: 74.56%
[15,000] Accuracy: 75.55%
[16,000] Accuracy: 74.16%
[17,000] Accuracy: 73.17%
[18,000] Accuracy: 72.73%
[19,000] Accuracy: 72.42%
[20,000] Accuracy: 72.28%


Accuracy: 72.28%

## Hoeffding Adaptive Tree

In [12]:
from river.tree import HoeffdingAdaptiveTreeClassifier

model = HoeffdingAdaptiveTreeClassifier(nominal_attributes=['elevel', 'car', 'zipcode'], seed=42)
metric = Accuracy()
stream = iter_pandas(X=data[features], y=data['class'])

progressive_val_score(dataset=stream, 
                      model=model, 
                      metric=metric, 
                      print_every=1000)

[1,000] Accuracy: 84.38%
[2,000] Accuracy: 87.84%
[3,000] Accuracy: 89.03%
[4,000] Accuracy: 90.30%
[5,000] Accuracy: 90.74%
[6,000] Accuracy: 84.38%
[7,000] Accuracy: 81.33%
[8,000] Accuracy: 79.51%
[9,000] Accuracy: 78.25%
[10,000] Accuracy: 77.10%
[11,000] Accuracy: 75.24%
[12,000] Accuracy: 74.58%
[13,000] Accuracy: 75.38%
[14,000] Accuracy: 76.71%
[15,000] Accuracy: 77.75%
[16,000] Accuracy: 76.36%
[17,000] Accuracy: 76.38%
[18,000] Accuracy: 76.60%
[19,000] Accuracy: 76.79%
[20,000] Accuracy: 76.91%


Accuracy: 76.91%

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) Quantia Consulting, srl. All rights reserved.