<a href="https://colab.research.google.com/github/quartermaine/Machine-Learning-Notebooks/blob/main/Incremental_learning/Incremental_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Google Colab notebook implements incremental mini-batch learning on the [car_evaluation](https://www.kaggle.com/datasets/elikplim/car-evaluation-data-set) dataset. Two CSV files are prepared in advance: one with training data and one with test data. The notebook proceeds in two steps:


**STEP 1: Initial Model Creation and Cross-Validation**

First, we create an initial model for each algorithm and cross-validate it on the training data, using accuracy as the metric. The algorithms come from both the sklearn and river packages.

**STEP 2: Incremental Batch Learning and Accuracy Evaluation**

Once the initial models are fitted, we update them incrementally with the test data and evaluate accuracy again.

**NOTE**

In the second step we report two accuracy values for the river algorithms: one measured before the incremental update and one measured after it. For the sklearn models, only the accuracy after the update is reported.
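
The difference between the two river scores can be seen in a toy sketch. `MajorityClassLearner` below is a hypothetical stand-in, not a river class; it only mimics the `predict_one` / `learn_one` interface. The first pass scores the model before it has seen the new samples, the second pass scores it while it learns from them. The labels are made up for illustration.

```python
class MajorityClassLearner:
    """Hypothetical toy online learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict_one(self, x):
        # No label seen yet, so there is nothing to predict
        if not self.counts:
            return None
        # Ties are broken by label name so the result is deterministic
        return max(self.counts, key=lambda label: (self.counts[label], label))

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def accuracy_before_and_after(model, samples):
    """Score `model` on `samples` before learning, then again while learning from them."""
    # Pass 1: predictions only, the model is left unchanged
    hits_before = sum(model.predict_one(x) == y for x, y in samples)

    # Pass 2: learn from each sample, then predict that same (now observed) sample
    hits_after = 0
    for x, y in samples:
        model.learn_one(x, y)
        hits_after += model.predict_one(x) == y

    n = len(samples)
    return hits_before / n, hits_after / n


# Made-up labels; this toy learner ignores the features entirely
model = MajorityClassLearner()
for label in ["unacc", "unacc", "acc"]:
    model.learn_one({}, label)

new_samples = [({}, "acc"), ({}, "acc"), ({}, "acc"), ({}, "unacc")]
before, after = accuracy_before_and_after(model, new_samples)
print(f"before: {before:.2f}, after: {after:.2f}")
```

Because the second pass predicts samples the model has just learned from, the "after" score is an optimistic estimate rather than a held-out one.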

## Install and import libraries

In [1]:
try:
    import river
except ImportError:
    !pip install river


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn import model_selection, metrics as sk_metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier, Perceptron, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from river import tree, compose, compat, preprocessing, multiclass, metrics as riv_metrics, evaluate, imblearn, ensemble, forest


In [3]:
# set random seed
# np.random.seed(1234)


## Helper functions and classes

In [4]:
class RiverModel:
  """
    A class representing a River model.

    Attributes:
    - model_name (str): The name of the River model.
    - model (object): The instance of the River model.

    Methods:
    - __init__(model_name): Initializes a RiverModel instance.
    - create_model(): Creates and returns a River model based on the specified model_name.
    - fit_model(X, y): Fits the River model on the given training data X and labels y.
    - get_cv_score(X, y): Performs cross-validation and returns the accuracy scores.

    Example:
    ```python
    # Create a RiverModel instance for a decision tree
    tree_model = RiverModel(model_name='tree')

    # Fit the model on training data
    tree_model.fit_model(X_train, y_train)

    # Get cross-validation accuracy scores
    cv_scores = tree_model.get_cv_score(X_train, y_train)
    ```
  """

  def __init__(self, model_name):
    """
    Initializes a RiverModel instance.

    Parameters:
    - model_name (str): The name of the River model.
    """
    self.model_name = model_name
    self.model = self.create_model()

  def create_model(self):
    """
    Creates and returns a River model based on the specified model_name.

    Returns:
    - model (object): The instance of the River model.
    """

    classifier_tree = tree.HoeffdingTreeClassifier(
        split_criterion='gini',
        grace_period=1000,
        max_depth=150
    )

    classifier_forest = forest.AMFClassifier(
        n_estimators=100,
        use_aggregation=True,
        dirichlet=0.5
    )

    if self.model_name == 'tree':
        return multiclass.OneVsRestClassifier(classifier_tree)

    if self.model_name == 'forest':
        return multiclass.OneVsRestClassifier(classifier_forest)

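    if self.model_name == 'boost':
        return ensemble.AdaBoostClassifier(model=classifier_tree, n_models=5)

    # Fail early instead of silently returning None for an unknown name
    raise ValueError(f"Unknown river model name: {self.model_name!r}")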

  def fit_model(self, X, y):
    """
    Fits the River model on the given training data X and labels y.

    Parameters:
    - X (pd.DataFrame): The input features.
    - y (pd.Series): The target labels.
    """

    # Wrap the river model in a scikit-learn compatible estimator
    self.model = compat.convert_river_to_sklearn(self.model)
    self.model.fit(X, y)

  def get_cv_score(self, X, y):
    """
    Performs cross-validation and returns the accuracy scores.

    Parameters:
    - X (pd.DataFrame): The input features.
    - y (pd.Series): The target labels.

    Returns:
    - scores (np.ndarray): Accuracy score for each fold.
    """
    cv = model_selection.KFold(n_splits=10, shuffle=True)
    scores = model_selection.cross_val_score(self.model,
                                              X,
                                              y,
                                              scoring='accuracy',
                                              cv=cv,
                                              error_score='raise'
                                              )
    print(f'Accuracy ± Std for {self.model_name}: {scores.mean():.3f} (± {scores.std():.3f})')
    return scores


In [5]:
class SklearnModel:
  """
  A class representing a Sklearn model.

  Attributes:
  - model_name (str): The name of the Sklearn model.
  - model (object): The instance of the Sklearn model.

  Methods:
  - __init__(model_name): Initializes a SklearnModel instance.
  - create_model(): Creates and returns a Sklearn model based on the specified model_name.
  - fit_model(X, y): Fits the Sklearn model on the given training data X and labels y.
  - get_cv_score(X, y): Performs cross-validation and returns the accuracy scores.

  Example:
  ```python
  # Create a SklearnModel instance for a SGDClassifier
  sgd_model = SklearnModel(model_name='SGDClassifier')

  # Fit the model on training data
  sgd_model.fit_model(X_train, y_train)

  # Get cross-validation accuracy scores
  cv_scores = sgd_model.get_cv_score(X_train, y_train)
  ```
  """
  def __init__(self, model_name):
    """
    Initializes a SklearnModel instance.

    Parameters:
    - model_name (str): The name of the Sklearn model.
    """
    self.model_name = model_name
    self.model = self.create_model()

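  def create_model(self):
    """
    Creates and returns a Sklearn model based on the specified model_name.

    Returns:
    - model (object): The instance of the Sklearn model.
    """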
    if self.model_name == 'SGDClassifier':
        return SGDClassifier(loss="modified_huber",
                              max_iter=500,
                              warm_start=True
                              )

    if self.model_name == 'MultinomialNB':
        return MultinomialNB()

    if self.model_name == 'Perceptron':
        return Perceptron()

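    if self.model_name == 'PassiveAggressiveClassifier':
        return PassiveAggressiveClassifier()

    # Fail early instead of silently returning None for an unknown name
    raise ValueError(f"Unknown sklearn model name: {self.model_name!r}")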

  def fit_model(self, X, y):
    """
    Fits the Sklearn model on the given training data X and labels y.

    Parameters:
    - X (pd.DataFrame): The input features.
    - y (pd.Series): The target labels.
    """
    self.model.fit(X, y)

  def get_cv_score(self, X, y):
    """
    Performs cross-validation and returns the accuracy scores.

    Parameters:
    - X (pd.DataFrame): The input features.
    - y (pd.Series): The target labels.

    Returns:
    - scores (np.ndarray): Accuracy score for each fold.
    """
    cv = model_selection.KFold(n_splits=10, shuffle=True)
    scores = model_selection.cross_val_score(self.model,
                                              X,
                                              y,
                                              scoring='accuracy',
                                              cv=cv,
                                              error_score='raise'
                                              )
    print(f'Accuracy ± Std for {self.model_name}: {scores.mean():.3f} (± {scores.std():.3f})')
    return scores


## Import data and check for class imbalances

In [6]:
df = pd.read_csv('/content/car_evaluation.csv')


In [7]:
df.head()


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [8]:
# shuffle the rows (frac=1 returns every row, in random order)
data = df.sample(frac=1)


In [9]:
# Calculate the percentage of each class
class_percentages = data['class'].value_counts(normalize=True) * 100

# Display the result
print(class_percentages)


class
unacc    69.468268
acc      22.241281
good      4.345340
vgood     3.945111
Name: proportion, dtype: float64
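
With roughly 69% of rows labelled `unacc`, accuracy alone can flatter a model. One quick reference point is the accuracy of a trivial classifier that always predicts the most frequent class. The sketch below uses the percentages printed above; `majority_baseline` is an illustrative helper, not part of the notebook.

```python
# Class shares (in percent) as printed by the cell above
class_percentages = {
    "unacc": 69.468268,
    "acc": 22.241281,
    "good": 4.345340,
    "vgood": 3.945111,
}


def majority_baseline(percentages):
    """Return the most frequent class and the accuracy of always predicting it."""
    label = max(percentages, key=percentages.get)
    return label, percentages[label] / 100


label, baseline_accuracy = majority_baseline(class_percentages)
print(f"Always predicting '{label}' gives accuracy {baseline_accuracy:.3f}")
```

A model whose accuracy sits close to this value is doing little more than predicting the majority class.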


## Initialize models and cross validation

In [10]:
# Transform features and target
pp_feat = compose.Select('buying',
                         'maint',
                         'doors',
                         'persons',
                         'lug_boot',
                         'safety') | preprocessing.OneHotEncoder()

pp_target = compose.Select('class')

X_transf = pp_feat.transform_many(data).sparse.to_dense()
y_transf = pp_target.transform_many(data).values.ravel() # no actual transformation
                                                         # performed
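
To make the encoding step concrete, here is a rough, library-free sketch of one-hot encoding applied to a single row. It is not river's implementation: `one_hot_row` is a hypothetical helper, and the `column_value` naming of the indicator features is an assumption chosen for readability. The row is the first one shown by `df.head()`, with every value kept as a string.

```python
def one_hot_row(row, columns):
    """One-hot encode the selected columns of a single row given as a dict."""
    # Each (column, value) pair becomes its own indicator feature set to 1;
    # indicators for values absent from this row are simply omitted
    return {f"{col}_{row[col]}": 1 for col in columns}


feature_columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]

# First row of the dataset as shown by df.head()
row = {
    "buying": "vhigh", "maint": "vhigh", "doors": "2",
    "persons": "2", "lug_boot": "small", "safety": "low",
    "class": "unacc",
}

encoded = one_hot_row(row, feature_columns)
print(encoded)
```

The target column is not among the selected features, so it never reaches the encoder. In a dense table, the indicators omitted here would appear as zeros.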


In [11]:
# X_transf


In [12]:
# y_transf


In [13]:
# combine river and sklearn models
river_models = ['tree', 'forest', 'boost']
sk_models = ['SGDClassifier', 'MultinomialNB', 'Perceptron', 'PassiveAggressiveClassifier']
all_models = river_models + sk_models

results = {'model': [], 'accuracy_mean': [], 'accuracy_std': [], 'fitted_model': []}

# Fit river models on the full training data
for model_name in river_models:
    model_instance = RiverModel(model_name)
    model_instance.fit_model(X_transf, y_transf)
    results['fitted_model'].append(model_instance)

# Fit sklearn models on the full training data
for model_name in sk_models:
    model_instance = SklearnModel(model_name)
    model_instance.fit_model(X_transf, y_transf)
    results['fitted_model'].append(model_instance)

# Cross-validate every model and record the results
for model_instance in results['fitted_model']:
    cv_scores = model_instance.get_cv_score(X_transf, y_transf)
    results['model'].append(model_instance.model_name)
    results['accuracy_mean'].append(cv_scores.mean())
    results['accuracy_std'].append(cv_scores.std())


Accuracy ± Std for tree: 0.796 (± 0.034)
Accuracy ± Std for forest: 0.758 (± 0.037)
Accuracy ± Std for boost: 0.695 (± 0.036)
Accuracy ± Std for SGDClassifier: 0.879 (± 0.029)
Accuracy ± Std for MultinomialNB: 0.854 (± 0.025)
Accuracy ± Std for Perceptron: 0.866 (± 0.032)
Accuracy ± Std for PassiveAggressiveClassifier: 0.851 (± 0.025)
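
The scores above come from 10-fold cross-validation with shuffling. As a rough, library-free sketch of the idea (not scikit-learn's implementation), the hypothetical `kfold_indices` below shuffles the row indices once, splits them into k folds, and yields each fold in turn as the held-out set. Passing a fixed `seed` makes the split repeatable; the notebook's `KFold` call sets no `random_state`, so its scores can differ slightly between runs.

```python
import random


def kfold_indices(n_samples, n_splits=10, seed=None):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]

    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 25 rows is an arbitrary size chosen for illustration
folds = list(kfold_indices(n_samples=25, n_splits=10, seed=1234))
for train_idx, test_idx in folds[:3]:
    print(len(train_idx), len(test_idx))
```

Every row lands in exactly one held-out fold, so each sample contributes to the reported mean accuracy exactly once.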


In [14]:
# Convert the results dictionary to a DataFrame
results_df = pd.DataFrame(results)

# Create an interactive bar plot using Plotly Express with a darker color palette
fig = px.bar(results_df,
             x='model',
             y='accuracy_mean',
             error_y='accuracy_std',
             color='accuracy_mean',
             color_continuous_scale='darkmint',
             labels={'accuracy_mean': 'Accuracy'},
             title='Cross-Validation Accuracy for Different Models'
             )

# Customize the layout
fig.update_layout(xaxis_title='Model',
                  yaxis_title='Accuracy',
                  xaxis_tickangle=-45
                  )

# Show the plot
fig.show()


## Import new data and perform incremental mini batch learning

In [15]:
# import new data
new_data = pd.read_csv('/content/car_evaluation_newdata.csv')
new_data.head()


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,low,low,3,more,small,low,unacc
1,low,low,2,2,small,low,unacc
2,low,med,5more,2,big,med,unacc
3,low,med,5more,4,small,low,unacc
4,low,med,3,more,med,med,good


In [16]:
# transform new data
X_transf_new = pp_feat.transform_many(new_data).sparse.to_dense()
y_transf_new = pp_target.transform_many(new_data).values.ravel()


In [17]:
# X_transf_new.head(1)


In [18]:
# y_transf_new


In [19]:
# Initialize dictionaries to store evaluation results
sk_evaluation_results = {'model': [], 'accuracy': []}

# Iterate over sklearn models
for model_instance in results['fitted_model']:
    if model_instance.model_name in sk_models:
        model_sk = model_instance.model  # Retrieve the sklearn model

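        # Perform partial fit; reuse the classes seen during the initial fit,
        # because partial_fit rejects a class list that differs from them
        model_sk.partial_fit(X_transf_new[list(model_sk.feature_names_in_)],
                              y_transf_new,
                              classes=model_sk.classes_)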

        # Make predictions
        preds_sk = model_sk.predict(X_transf_new[list(model_sk.feature_names_in_)])

        # Calculate accuracy
        accuracy_sk = sk_metrics.accuracy_score(y_transf_new, preds_sk)

        # Store results
        sk_evaluation_results['model'].append(model_instance.model_name)
        sk_evaluation_results['accuracy'].append(accuracy_sk)

        print(f'Accuracy {model_instance.model_name}: {accuracy_sk:.3f}')


Accuracy SGDClassifier: 0.800
Accuracy MultinomialNB: 0.800
Accuracy Perceptron: 0.867
Accuracy PassiveAggressiveClassifier: 0.867
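
The cell above passes all new rows to `partial_fit` in a single call. When new data is larger or arrives in pieces, the same method can be called once per mini-batch. The generator below is a minimal sketch of that batching step; `iter_minibatches` and `batch_size` are illustrative names, and the toy lists stand in for the encoded features and labels.

```python
def iter_minibatches(X, y, batch_size):
    """Yield consecutive (X_batch, y_batch) slices of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


# Toy data standing in for the encoded features and labels
X_toy = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
y_toy = ["unacc", "acc", "unacc", "good", "unacc"]

batches = list(iter_minibatches(X_toy, y_toy, batch_size=2))
for X_batch, y_batch in batches:
    print(len(X_batch), y_batch)
    # With an already fitted scikit-learn estimator `clf` (assumed to exist),
    # each batch could be passed to: clf.partial_fit(X_batch, y_batch)
```

The last batch may be shorter than `batch_size`. For a pandas DataFrame, slicing rows with `.iloc[start:start + batch_size]` makes the positional intent explicit.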


In [20]:
from river import metrics as riv_metrics

riv_evaluation_results_before = {'model': [], 'accuracy': []}
riv_evaluation_results_after = {'model': [], 'accuracy': []}

# Iterate over river models
for model_instance in results['fitted_model']:
    if model_instance.model_name in river_models:
        model_riv_sk = model_instance.model  # Retrieve the model
        # Convert the scikit-learn compatible wrapper back into a river model
        model_riv = compat.convert_sklearn_to_river(model_riv_sk,
                                                    classes=np.unique(y_transf_new)
                                                    )

        # Metrics for BEFORE ONLINE TRAINING
        acc_old = riv_metrics.Accuracy()

        # Test the current model on the new "unobserved" sample
        # (zip pairs rows with labels by position, independent of the index)
        for (_, row), y in zip(X_transf_new.iterrows(), y_transf_new):
            x_dict = row.to_dict()
            y_pred_old = model_riv.predict_one(x_dict)
            acc_old.update(y, y_pred_old)

        # Metrics for AFTER ONLINE TRAINING
        acc_new = riv_metrics.Accuracy()

        # Train the model with the new sample
        # (zip pairs rows with labels by position, independent of the index)
        for (_, row), y in zip(X_transf_new.iterrows(), y_transf_new):
            x_dict = row.to_dict()
            model_riv.learn_one(x_dict, y)
            # Test the current new model on the new "observed" sample
            y_pred_new = model_riv.predict_one(x_dict)
            acc_new.update(y, y_pred_new)

        # Store results
        riv_evaluation_results_before['model'].append(f"{model_instance.model_name}")
        riv_evaluation_results_before['accuracy'].append(acc_old.get())
        riv_evaluation_results_after['model'].append(f"{model_instance.model_name}")
        riv_evaluation_results_after['accuracy'].append(acc_new.get())

        print(f'Accuracy Score (for {model_instance.model_name} / before online): {acc_old.get():.3f}')
        print(f'Accuracy Score (for {model_instance.model_name} / after online): {acc_new.get():.3f}')


Accuracy Score (for tree / before online): 0.533
Accuracy Score (for tree / after online): 0.533
Accuracy Score (for forest / before online): 0.533
Accuracy Score (for forest / after online): 0.667
Accuracy Score (for boost / before online): 0.467
Accuracy Score (for boost / after online): 0.467


In [21]:
###############################################################################
# Matplotlib Plot for Model Evaluation Before and After Incremental Learning  #
###############################################################################

# # Convert dictionaries to DataFrames
# sk_evaluation_df = pd.DataFrame(sk_evaluation_results)
# riv_evaluation_before_df = pd.DataFrame(riv_evaluation_results_before)
# riv_evaluation_after_df = pd.DataFrame(riv_evaluation_results_after)

# # Add a column to each dataframe to indicate the evaluation type
# sk_evaluation_df['evaluation'] = 'after'
# riv_evaluation_before_df['evaluation'] = 'before'
# riv_evaluation_after_df['evaluation'] = 'after'

# # Concatenate the results for all models
# combined_results = pd.concat([riv_evaluation_before_df, riv_evaluation_after_df, sk_evaluation_df])

# # Make a plot using Seaborn's barplot
# fig, ax = plt.subplots(figsize=(14, 8))

# # Plot for all models
# sns.barplot(x='model', y='accuracy', hue='evaluation', data=combined_results, palette='viridis', ax=ax)

# # Set plot labels and title
# ax.set_ylabel('Accuracy')
# ax.set_title('Model Evaluation Before and After Online')

# # Rotate x-axis labels for better visibility
# plt.xticks(rotation=45, ha='right')

# # Display the legend
# plt.legend(title='Evaluation')

# plt.tight_layout()
# plt.show()


In [22]:
# Convert dictionaries to DataFrames
sk_evaluation_df = pd.DataFrame(sk_evaluation_results)
riv_evaluation_before_df = pd.DataFrame(riv_evaluation_results_before)
riv_evaluation_after_df = pd.DataFrame(riv_evaluation_results_after)

# Add a column to each dataframe to indicate the evaluation type
sk_evaluation_df['evaluation'] = 'after'
riv_evaluation_before_df['evaluation'] = 'before'
riv_evaluation_after_df['evaluation'] = 'after'

# Concatenate the results for all models
combined_results = pd.concat([riv_evaluation_before_df,
                              riv_evaluation_after_df,
                              sk_evaluation_df]
                             )

# Make a plot using Plotly Express
fig = px.bar(
    combined_results,
    x='model',
    y='accuracy',
    color='evaluation',
    barmode='group',  # 'group' for side-by-side bars
    labels={'accuracy': 'Accuracy', 'model': 'Model'},
    title='Model Evaluation Before and After Incremental Learning',
)

# Rotate x-axis labels for better visibility
fig.update_layout(xaxis=dict(tickangle=45,
                             tickmode='array',
                             tickvals=combined_results['model'])
                  )

fig.show()






In [23]:
# combined_results