This notebook is part of the supplementary material of the books "Online Machine Learning - Eine praxisorientierte Einführung",  
https://link.springer.com/book/9783658425043 and "Online Machine Learning - A Practical Guide with Examples in Python" https://link.springer.com/book/9789819970063
The contents are open source and published under the "BSD 3-Clause License".
This software is provided "as is" without warranty of any kind, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose. The author or authors assume no liability for any damages or liability, whether in contract, tort, or otherwise, arising out of or in connection with the software or the use or other dealings with the software.

# Chapter 2: Supervised Learning: Classification and Regression

# Linear online regression with River

* Linear online regression can be performed in `river` with the `LinearRegression` class 
from the `linear_model` module. 
* In the following example, the mean absolute error (MAE) is measured during training.

## Loading the data

* First, we create a dataset using the `make_classification` function from `sklearn.datasets`. The binary class labels (0 and 1) serve as the regression target.

In [None]:
## Generate a classification dataset with sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, n_informative=2, n_redundant=2, random_state=42)

## Convert X to a pandas dataframe
import pandas as pd
X = pd.DataFrame(X)
y = pd.Series(y)


## Selecting the metric and the data preprocessing

* The Mean Absolute Error (MAE) is selected as the metric.
* In addition, the scaling of the data is specified:
  * By selecting the `StandardScaler` class, the data is scaled to have zero mean and unit variance.
  * Internally, a running mean and a running variance are maintained. This scaling differs slightly from batch scaling because the exact mean and variance of the full dataset are not known in advance.
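
The running statistics described above can be illustrated in plain Python. This is a minimal sketch based on Welford's update and is not taken from `river`, whose internal implementation may differ. The class name `RunningStandardizer` and the sample values are hypothetical and only serve the illustration.

```python
class RunningStandardizer:
    """Hypothetical helper: scales a single numeric feature with running statistics."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def learn_one(self, x):
        # Welford's update of the mean and the sum of squared deviations
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform_one(self, x):
        # Population variance estimated from the values seen so far
        var = self.m2 / self.n if self.n > 0 else 0.0
        return (x - self.mean) / var ** 0.5 if var > 0 else 0.0


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data

# Online scaling: each value is scaled with the statistics known at that time
sketch = RunningStandardizer()
online_scaled = []
for v in values:
    sketch.learn_one(v)
    online_scaled.append(sketch.transform_one(v))

# Batch scaling: mean and standard deviation of the full dataset are known
batch_mean = sum(values) / len(values)
batch_std = (sum((v - batch_mean) ** 2 for v in values) / len(values)) ** 0.5
batch_scaled = [(v - batch_mean) / batch_std for v in values]

print(online_scaled[0], batch_scaled[0])    # early samples are scaled differently
print(online_scaled[-1], batch_scaled[-1])  # the last sample sees the full statistics
```

The early samples are scaled with statistics from only a few observations, which is why the online result deviates from the batch result at the beginning of the stream.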

In [None]:
from river import metrics
from river import preprocessing

metric = metrics.MAE()
scaler = preprocessing.StandardScaler()

## Selection of the model

* In the next step, the model (here `LinearRegression`) is selected.
* For the sequential online optimization of the model coefficients, stochastic gradient descent (`SGD`) is used.
  * The learning rate for `SGD` is set to 0.01. 
  * Choosing a suitable learning rate is crucial for the optimization.
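
The effect of the learning rate can be illustrated with a hand-written SGD step for a linear model. This is a minimal sketch under stated assumptions: it uses half the squared error as the loss, the functions `sgd_step` and `run` are hypothetical, and the data follows the noise-free line $y = 2x + 1$. The loss and update used inside `river` may differ, for example by a constant factor.

```python
def sgd_step(weights, bias, x, y, lr):
    """One SGD step for a linear model with the loss 0.5 * (y_pred - y)**2."""
    y_pred = sum(weights.get(k, 0.0) * v for k, v in x.items()) + bias
    error = y_pred - y
    # The gradient w.r.t. a weight is error * feature value, w.r.t. the bias it is error
    new_weights = dict(weights)
    for k, v in x.items():
        new_weights[k] = weights.get(k, 0.0) - lr * error * v
    new_bias = bias - lr * error
    return new_weights, new_bias


def run(lr, n_steps=200):
    """Process a stream from y = 2*x + 1 and return the learned weight and bias."""
    weights, bias = {}, 0.0
    for i in range(n_steps):
        x_val = (i % 10) / 10.0  # illustrative feature values in [0, 0.9]
        weights, bias = sgd_step(weights, bias, {"x": x_val}, 2.0 * x_val + 1.0, lr)
    return weights["x"], bias


def distance_to_truth(w, b):
    # Euclidean distance of (w, b) to the true coefficients (2, 1)
    return ((w - 2.0) ** 2 + (b - 1.0) ** 2) ** 0.5


err_small_lr = distance_to_truth(*run(lr=0.01))
err_large_lr = distance_to_truth(*run(lr=0.5))
print(err_small_lr, err_large_lr)
```

On this small example, the larger learning rate gets closer to the true coefficients within the same number of samples. On noisy data or with unscaled features, a learning rate that is too large can make the updates unstable, which is one reason why scaling and the learning rate are chosen together.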

In [None]:
from river import optim
from river import linear_model

optimizer = optim.SGD(lr=0.01)
lin_reg = linear_model.LinearRegression(optimizer)

## Test-then-train 

* Now each sample of the dataset is first used to test the current model and then to train it. 
  * For each sample, the metric is updated.
* At the end, the metric that was computed incrementally over the entire stream is available.
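
The incremental update of the metric can be illustrated without any library. This is a minimal sketch, not the `river` implementation; the class `RunningMAE` and the example pairs of true and predicted values are hypothetical.

```python
class RunningMAE:
    """Hypothetical helper: mean absolute error updated one sample at a time."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.value = 0.0  # current mean absolute error

    def update(self, y_true, y_pred):
        # Incremental mean: move the current value towards the new absolute error
        self.n += 1
        self.value += (abs(y_true - y_pred) - self.value) / self.n

    def get(self):
        return self.value


# Illustrative (truth, prediction) pairs as they would arrive from a stream
pairs = [(1.0, 0.5), (0.0, 0.5), (1.0, 1.0), (0.0, 1.0)]

mae_sketch = RunningMAE()
history = []
for y_true_i, y_pred_i in pairs:
    mae_sketch.update(y_true_i, y_pred_i)
    history.append(mae_sketch.get())

print(history)
```

Because only the count and the current value are stored, the metric needs constant memory regardless of the length of the stream.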

In [None]:
from river import stream

y_true = []
y_pred = []
metric_list = []

for xi, yi in stream.iter_pandas(X, y):

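    # Update the running statistics of the scaler, then scale the features.
    # The calls are kept separate because `learn_one` may return None,
    # depending on the river version.
    scaler.learn_one(xi)
    xi_scaled = scaler.transform_one(xi)

    # Test the current model on the new "unobserved" sample
    yi_pred = lin_reg.predict_one(xi_scaled)
    # Train the model with the new sample
    lin_reg.learn_one(xi_scaled, yi)

    # Store the truth and the prediction
    y_true.append(yi)
    y_pred.append(yi_pred)
    # Update the metric in place (the return value of `update` is not used)
    metric.update(yi, yi_pred)
    # Store the metric after each sample in a list
    metric_list.append(metric.get())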

print(metric)


## Plotting the metric values versus the sample number

In [None]:
import matplotlib.pyplot as plt
# Plot the values of the metric_list versus the number of samples
plt.plot(metric_list)
plt.xlabel("Number of samples")
plt.ylabel("MAE")
plt.show()

* Because `SGD` is non-deterministic, each run can lead to a (slightly) different result.

# SVM (ALMA) Classification for Synthetic Data

## Loading the data 

* Synthetic classification data is generated using the `make_classification` function from the `sklearn` package, see [sklearn.datasets.make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html).
  * This function allows the creation of classification problems with $n$ classes.
  * In our example, $n=2$ is chosen (this is the default).  
* The data is then split into the training and test datasets using the `train_test_split` function.

In [None]:
from sklearn.datasets import make_classification
from river import linear_model

X, y = make_classification(shuffle=True, n_samples=2000)

# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


## Model Selection and Model Fit

In [None]:
model = linear_model.ALMAClassifier()

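# Fit the model: river expects each sample as a dict,
# so every NumPy row is converted to a dict of named features first
for x_i, y_i in zip(X_train, y_train):
    x_json = {'val' + str(i): x for i, x in enumerate(x_i)}
    model.learn_one(x_json, y_i)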


## Predict on the test set and compute the accuracy

In [None]:
preds = []
for x_i in X_test:
    x_json = {'val'+str(i): x for i,x in enumerate(x_i)}
    preds.append(model.predict_one(x_json))

# compute accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, preds)

## Show the scikit-learn "classification report"

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, preds))

# SVM (ALMA) classification for a sklearn classification dataset

In [None]:
## Generate a classification dataset with sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, n_informative=2, n_redundant=2, random_state=42)

scaler = preprocessing.StandardScaler()


## Convert X to a pandas dataframe
import pandas as pd
X = pd.DataFrame(X)
y = pd.Series(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

* The model is fitted on the training data:

In [None]:
model = linear_model.ALMAClassifier()
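for xi, yi in stream.iter_pandas(X_train, y_train):
    # Update the scaler and scale the features in two separate calls,
    # because `learn_one` may return None, depending on the river version
    scaler.learn_one(xi)
    xi_scaled = scaler.transform_one(xi)
    model.learn_one(xi_scaled, yi)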

*  Prediction on the test data:

In [None]:
preds = []
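for xi, _ in stream.iter_pandas(X_test):
    # The scaler keeps updating its running statistics on the test features
    # (no labels are used); the calls are separate because `learn_one`
    # may return None, depending on the river version
    scaler.learn_one(xi)
    xi_scaled = scaler.transform_one(xi)
    preds.append(model.predict_one(xi_scaled))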

* Computation of the accuracy metric:

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, preds)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, preds))

# Passive-Aggressive Classification for Synthetic Data

In [None]:
from sklearn.datasets import make_classification
X, y = make_classification(shuffle=True, n_samples=2000)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from river import linear_model

model = linear_model.PAClassifier()

# Fit the model one sample at a time; each NumPy row is converted to a dict
for x_i, y_i in zip(X_train, y_train):
    x_json = {'val' + str(i): x for i, x in enumerate(x_i)}
    model.learn_one(x_json, y_i)

preds = []
for x_i in X_test:
    x_json = {'val'+str(i): x for i,x in enumerate(x_i)}
    preds.append(model.predict_one(x_json))

from sklearn.metrics import accuracy_score
accuracy_score(y_test, preds)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, preds))

# Classification Tree

In [None]:
import matplotlib.pyplot as plt
import datetime as dt

from river import datasets
from river import evaluate
from river import metrics
from river import preprocessing
from river.datasets import synth
from river import tree

dataset = synth.SEA(variant=0, seed=42)
#dataset = datasets.Phishing()
data = dataset.take(10000)

model = tree.HoeffdingTreeClassifier(grace_period=50)

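for i, (x, y) in enumerate(data):
    # Prefix the feature keys with 'x_' so that they become descriptive strings
    x = {f'x_{key}': value for key, value in x.items()}
    # Show only the first few samples instead of printing all 10000
    if i < 3:
        print(x, y)
    model.learn_one(x, y)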

model

model.summary

# Drawing the tree requires the `graphviz` package
model.draw()
