<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Extending functionality - Solution

## Implement a simple estimator


A core design element in `river` is that it should be easy to create new methods or extend existing methods.

In the following example, we show how to implement the `MajorityClassClassifier`.

The Majority Class classifier is one of the simplest classifiers: it predicts that a new sample belongs to the class seen most frequently so far in the stream.

It is used mostly as a baseline, but also as a default classifier at the leaves of decision trees.

`river` provides a set of mixins for different learning tasks. Here we want to create a classifier, so we extend the `base.Classifier` mixin as follows:

In [4]:
from river import base
from collections import Counter

class MajorityClassClassifier(base.Classifier):
    def __init__(self):
        # Initialization
        self._counts = Counter()
    
    def learn_one(self, x:dict, y):
        # Learn one sample
        self._counts.update([y])

    def predict_one(self, x:dict):
        # Predict class for one sample
        mc = self._counts.most_common()
        if mc:
            return mc[0][0]
        return 0   # Counter is empty

    def predict_proba_one(self, x:dict):
        # Predict class probability for one sample
        total = sum(self._counts.values())   # Total number of samples seen
        y_proba = {}
        if total > 0:    # Protect division by zero
            for label, cnt in self._counts.items():
                y_proba[label] = cnt / total
        return y_proba
    
    @property
    def _multiclass(self):
        return True
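
The counting logic can be checked in isolation, without `river`. The following is a minimal sketch on a hand-made list of labels (the list itself is an assumption made for illustration): the majority class is the most common key of the `Counter`, and the class probabilities are the counts divided by the number of samples seen. Note that iterating over a `Counter` yields its keys, so the total has to be taken over `.values()`.

```python
from collections import Counter

# Hypothetical labels, for illustration only
labels = [True, False, True, True, False]

counts = Counter()
for y in labels:
    counts.update([y])          # same update as in learn_one

# Most frequent label seen so far
majority = counts.most_common(1)[0][0]

# Relative frequency of each label
total = sum(counts.values())
proba = {label: cnt / total for label, cnt in counts.items()}

print(majority)   # True
print(proba)      # {True: 0.6, False: 0.4}
```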

## SEA Stream
---
Each observation produced by the [SEA](https://riverml.xyz/latest/api/synth/SEA/) generator is composed of 3 features, of which only the first two are relevant. The target is binary, and is positive if the sum of the two relevant features exceeds a certain threshold.

Synthetic data generators are useful since they do not store the data but generate it on demand. Although data generators are infinite, for this example, we limit the number of samples generated using the `take()` method.
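
To make the idea of an on-demand, infinite generator concrete, here is a rough plain-Python sketch. It is not `river`'s implementation: `sea_like_stream` is a hypothetical helper, and the uniform feature range of 0 to 10 and the threshold of 8.0 are assumptions. `itertools.islice` plays the role that `take()` plays for the `river` generator.

```python
import itertools
import random

def sea_like_stream(seed=42, threshold=8.0):
    """Infinite SEA-like generator (hypothetical, for illustration only)."""
    rng = random.Random(seed)
    while True:
        # Three features; only the first two influence the target
        x = {i: rng.uniform(0, 10) for i in range(3)}
        y = x[0] + x[1] > threshold   # assumed labelling rule
        yield x, y

# Nothing is stored until samples are requested; islice caps the stream
sample = list(itertools.islice(sea_like_stream(), 20000))

positive_share = sum(y for _, y in sample) / len(sample)
print(len(sample), round(positive_share, 2))
```

Under these assumptions roughly 68% of the samples are positive, since the region where the first two features sum to at most 8 covers 32% of the feature square.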

In [2]:
from river import synth

stream = synth.SEA(seed=42).take(20000)

In [5]:
from river.metrics import Accuracy
from river.evaluate import progressive_val_score

model = MajorityClassClassifier()
metric = Accuracy()

progressive_val_score(dataset=stream, model=model, metric=metric, print_every=1000)

[1,000] Accuracy: 69.70%
[2,000] Accuracy: 69.30%
[3,000] Accuracy: 69.23%
[4,000] Accuracy: 69.17%
[5,000] Accuracy: 69.12%
[6,000] Accuracy: 68.80%
[7,000] Accuracy: 68.61%
[8,000] Accuracy: 68.45%
[9,000] Accuracy: 68.47%
[10,000] Accuracy: 68.34%
[11,000] Accuracy: 68.37%
[12,000] Accuracy: 68.30%
[13,000] Accuracy: 68.32%
[14,000] Accuracy: 68.32%
[15,000] Accuracy: 68.27%
[16,000] Accuracy: 68.34%
[17,000] Accuracy: 68.34%
[18,000] Accuracy: 68.21%
[19,000] Accuracy: 68.18%
[20,000] Accuracy: 68.20%


Accuracy: 68.20%

This is consistent with the class distribution: the data is imbalanced in favor of the `True` class, so a classifier that always predicts `True` is correct about as often as `True` occurs.

In [6]:
model._counts[True] / sum(model._counts.values())

0.6821
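
The evaluation above is prequential (test-then-train): each sample is first used to score a prediction and only afterwards to update the model. The sketch below mimics that loop in plain Python for a majority-class learner. `prequential_accuracy` is a hypothetical helper, not `river`'s implementation, and the synthetic labels with a 68% positive rate are an assumption chosen to resemble the stream above.

```python
import random
from collections import Counter

def prequential_accuracy(labels, default=False):
    """Test-then-train accuracy of a majority-class learner (illustrative)."""
    counts = Counter()
    correct = 0
    for y in labels:
        # 1. Predict with what has been learned so far
        y_pred = counts.most_common(1)[0][0] if counts else default
        # 2. Score the prediction against the true label
        correct += (y_pred == y)
        # 3. Only then learn from the sample
        counts[y] += 1
    return correct / len(labels)

rng = random.Random(42)
labels = [rng.random() < 0.68 for _ in range(20000)]   # assumed label stream

accuracy = prequential_accuracy(labels)
true_share = sum(labels) / len(labels)
print(f"accuracy={accuracy:.4f}  share of True={true_share:.4f}")
```

Once `True` is firmly in the lead, every prediction is `True`, so the accuracy settles near the share of `True` labels.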

As mentioned, simple methods such as the majority class classifier serve mainly as a baseline against which other models are compared.

For example, we can compare it against the Gaussian Naive Bayes classifier:

In [7]:
from river.naive_bayes import GaussianNB

stream = synth.SEA(seed=42).take(20000)

model = GaussianNB()
metric = Accuracy()

progressive_val_score(dataset=stream, model=model, metric=metric, print_every=1000)

[1,000] Accuracy: 92.99%
[2,000] Accuracy: 93.70%
[3,000] Accuracy: 94.23%
[4,000] Accuracy: 94.37%
[5,000] Accuracy: 94.40%
[6,000] Accuracy: 94.05%
[7,000] Accuracy: 94.00%
[8,000] Accuracy: 93.96%
[9,000] Accuracy: 94.02%
[10,000] Accuracy: 94.03%
[11,000] Accuracy: 94.16%
[12,000] Accuracy: 94.27%
[13,000] Accuracy: 94.25%
[14,000] Accuracy: 94.23%
[15,000] Accuracy: 94.14%
[16,000] Accuracy: 94.16%
[17,000] Accuracy: 94.10%
[18,000] Accuracy: 94.08%
[19,000] Accuracy: 94.08%
[20,000] Accuracy: 94.11%


Accuracy: 94.11%

## No Change Classifier

---
Implement the `NoChangeClassifier`, which predicts the label of a new instance to be the true label of the previous instance. Like the Majority Class classifier, it does not use the instance features, so it is very easy to implement. In an intrusion detection setting, where long passages of “no intrusion” are followed by briefer periods of “intrusion,” this classifier makes mistakes only on the boundary cases, adjusting quickly to the consistent pattern of labels.
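
The intrusion detection argument can be illustrated on a toy label sequence. The sketch below uses plain Python functions rather than `river` classes; the label sequence and both helper functions are assumptions made for illustration. Labels come in long runs of `0` ("no intrusion") interrupted by short runs of `1` ("intrusion").

```python
from collections import Counter

# Hypothetical bursty labels: long quiet periods, short intrusions
labels = [0] * 50 + [1] * 5 + [0] * 40 + [1] * 5

def no_change_accuracy(labels, initial=0):
    """Predict the previous true label, then observe the current one."""
    prev, correct = initial, 0
    for y in labels:
        correct += (prev == y)
        prev = y
    return correct / len(labels)

def majority_accuracy(labels, default=0):
    """Predict the most frequent label seen so far, then observe."""
    counts, correct = Counter(), 0
    for y in labels:
        y_pred = counts.most_common(1)[0][0] if counts else default
        correct += (y_pred == y)
        counts[y] += 1
    return correct / len(labels)

print(no_change_accuracy(labels))   # 0.97
print(majority_accuracy(labels))    # 0.9
```

The no-change rule is wrong only at the three points where the label switches, whereas the majority rule misses every intrusion sample.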

In [11]:
from river import base

class NoChangeClassifier(base.Classifier):
    def __init__(self):
        # Initialization
        self._last_y = 0
        self._classes = []
    
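    def learn_one(self, x:dict, y):
        # Learn one sample: remember the latest true label
        # (one possible completion of the exercise)
        self._last_y = y
        if y not in self._classes:
            self._classes.append(y)

    def predict_one(self, x:dict):
        # Predict class for one sample: repeat the last label seen
        return self._last_y

    def predict_proba_one(self, x:dict):
        # Predict class probability for one sample:
        # all probability mass goes to the last label seen
        y_proba = {}
        for c in self._classes:
            y_proba[c] = 1.0 if c == self._last_y else 0.0
        return y_proba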
    
    @property
    def _multiclass(self):
        return True

In [12]:
model = NoChangeClassifier()
metric = Accuracy()
stream = synth.SEA(seed=42).take(20000)

progressive_val_score(dataset=stream, model=model, metric=metric, print_every=1000)

[1,000] Accuracy: 58.40%
[2,000] Accuracy: 57.80%
[3,000] Accuracy: 57.87%
[4,000] Accuracy: 57.53%
[5,000] Accuracy: 57.14%
[6,000] Accuracy: 57.08%
[7,000] Accuracy: 56.67%
[8,000] Accuracy: 56.51%
[9,000] Accuracy: 56.66%
[10,000] Accuracy: 56.99%
[11,000] Accuracy: 57.09%
[12,000] Accuracy: 57.04%
[13,000] Accuracy: 57.02%
[14,000] Accuracy: 56.92%
[15,000] Accuracy: 56.71%
[16,000] Accuracy: 56.78%
[17,000] Accuracy: 56.75%
[18,000] Accuracy: 56.75%
[19,000] Accuracy: 56.62%
[20,000] Accuracy: 56.61%


Accuracy: 56.61%
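
This figure can be sanity-checked with a back-of-the-envelope calculation, assuming consecutive labels are independent with a positive rate `p` equal to the 0.6821 measured earlier. The no-change prediction is then correct whenever two consecutive labels agree, which happens with probability `p**2 + (1 - p)**2`.

```python
# Share of True labels measured on the stream above
p = 0.6821

# Two consecutive independent labels agree if both are True or both are False
expected_accuracy = p ** 2 + (1 - p) ** 2

print(f"{expected_accuracy:.4f}")   # 0.5663
```

The estimate is close to the observed accuracy, which suggests that on a stream without temporal dependence between labels the no-change rule falls below the majority class baseline.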