
**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!

# 6.2.2 Activity

## Objective:

Create a simple AdaBoost implementation and test it on the data set used in the demonstration video.

## Activity guidance:

* Create your own simple implementation of AdaBoost based on the steps provided in the lecture notes and demonstration video.
* Use the provided data set, which is the same as the one used in the demonstration.
* Create a DataFrame that stores your predicted values at each iteration.
* Compare your predictions against your training set to check for overfitting.
* Compare the results against the demonstration to verify the correct implementation.
* (Optional) Implement your algorithm against the Iris data set.


In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Data

In [21]:
# Dataset
X = dict()
X["a"] = [2, 3, 5, 6, 7, 9, 10, 12, 13, 14]
X["b"] = [5, 1, 8, 7, 2, 12, 6, 11, 3, 10]
X["target"] = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]

df = pd.DataFrame(X, columns=X.keys())

# Features and target
x = df[["a", "b"]]
y = df["target"]

# Initialize equal weights
df["weights_1"] = 1 / len(df)

In [23]:

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)


# AdaBoost function
def AdaBoost(X, y, n_estimators=5):
    n_samples = X.shape[0]
    weights = np.ones(n_samples) / n_samples  # Initialize equal weights

    models = []
    alphas = []
    predictions_at_each_iteration = pd.DataFrame(index=range(n_samples))

    for i in range(n_estimators):
        # Train a weak classifier (Decision Stump)
        stump = DecisionTreeClassifier(max_depth=1)  # Decision stump (a tree with max depth of 1)
        stump.fit(X, y, sample_weight=weights)

        # Make predictions
        predictions = stump.predict(X)
        predictions_at_each_iteration[f"Iteration_{i+1}"] = predictions

        # Calculate weighted error rate
        error = np.sum(weights * (predictions != y)) / np.sum(weights)

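        # Calculate alpha (classifier weight)
        alpha = 0.5 * np.log((1 - error) / (error + 1e-10))  # The small constant avoids division by zero when error == 0

        # Update weights. With 0/1 labels the factor `y` zeroes the exponent for
        # class-0 rows, so only class-1 rows are re-weighted here; class-0 weights
        # change only through the normalization below. Multiplying by the Series
        # `y` also turns `weights` into a pandas Series, as the printout shows.
        weights *= np.exp(-alpha * y * (2 * (predictions == y) - 1))
        weights /= np.sum(weights)  # Normalize weights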

        # Store the stump and alpha
        models.append(stump)
        alphas.append(alpha)

        print(f"Iteration {i+1}:")
        print(f"  Error: {error}")
        print(f"  Alpha: {alpha}")
        print(f"  Weights: {weights}")
        print(f"  Predictions: {predictions}")
        print()

    return models, alphas, predictions_at_each_iteration

# Run AdaBoost on the training data
models, alphas, train_predictions = AdaBoost(x_train, y_train, n_estimators=5)



Iteration 1:
  Error: 0.14285714285714288
  Alpha: 0.8958797342640274
  Weights: 0    0.056186
7    0.337117
2    0.056186
9    0.137628
4    0.137628
3    0.137628
6    0.137628
Name: target, dtype: float64
  Predictions: [1 0 1 0 0 0 0]

Iteration 2:
  Error: 0.11237243574396417
  Alpha: 1.0333667853567925
  Weights: 0    0.160108
7    0.121617
2    0.160108
9    0.139542
4    0.139542
3    0.139542
6    0.139542
Name: target, dtype: float64
  Predictions: [0 1 0 0 0 0 0]

Iteration 3:
  Error: 0.12161691309846019
  Alpha: 0.9886033837342585
  Weights: 0    0.059329
7    0.325489
2    0.059329
9    0.138963
4    0.138963
3    0.138963
6    0.138963
Name: target, dtype: float64
  Predictions: [1 0 1 0 0 0 0]

Iteration 4:
  Error: 0.11865758115876851
  Alpha: 1.0026021721224692
  Weights: 0    0.161908
7    0.119589
2    0.161908
9    0.139149
4    0.139149
3    0.139149
6    0.139149
Name: target, dtype: float64
  Predictions: [0 1 0 0 0 0 0]

Iteration 5:
  Error: 0.1195886217187714
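
After iteration 1 in the printout above, the correctly classified class-1 rows drop to about 0.056 while the correctly classified class-0 rows sit at about 0.138: the exponent in the update is multiplied by `y`, which is 0 for class 0, so those rows change only through normalisation. A minimal sketch of the textbook update is below, assuming re-weighting depends only on whether prediction and target agree. `update_weights` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def update_weights(weights, y_true, y_pred, alpha):
    """Textbook AdaBoost re-weighting (hypothetical helper).

    Correct predictions are scaled by exp(-alpha) and mistakes by exp(+alpha),
    regardless of the class label; the result is normalised to sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    agreement = np.where(np.asarray(y_pred) == np.asarray(y_true), 1.0, -1.0)
    new_weights = weights * np.exp(-alpha * agreement)
    return new_weights / new_weights.sum()

# Iteration 1 of the run above: seven equal weights, one misclassified sample.
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0])
weights = np.full(7, 1 / 7)
error = np.sum(weights[y_pred != y_true])   # weighted error, 1/7
alpha = 0.5 * np.log((1 - error) / error)   # 0.5 * ln(6)
new_weights = update_weights(weights, y_true, y_pred, alpha)
print(new_weights)
```

With this update the misclassified sample ends up with weight 0.5 and each of the six correct samples with 1/12, so the next stump sees mistakes and correct predictions with equal total weight.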

# DataFrame with predictions

In [26]:
# Final combined model prediction
def predict_boosted(models, alphas, X):
    final_predictions = np.zeros(X.shape[0])
    for model, alpha in zip(models, alphas):
        predictions = model.predict(X)
        final_predictions += alpha * (2 * (predictions == 1) - 1)
    return (final_predictions > 0).astype(int)

# Predict on training and test data
y_train_pred = predict_boosted(models, alphas, x_train)
y_test_pred = predict_boosted(models, alphas, x_test)

# Evaluate on training data
print("\nTraining Data Evaluation:")
print(f"Accuracy: {accuracy_score(y_train, y_train_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_train, y_train_pred)}")
print(f"Classification Report:\n{classification_report(y_train, y_train_pred)}")

# Evaluate on test data
print("\nTest Data Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_test_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_test_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_test_pred)}")


Training Data Evaluation:
Accuracy: 0.8571428571428571
Confusion Matrix:
[[4 0]
 [1 2]]
Classification Report:
              precision    recall  f1-score   support

           0       0.80      1.00      0.89         4
           1       1.00      0.67      0.80         3

    accuracy                           0.86         7
   macro avg       0.90      0.83      0.84         7
weighted avg       0.89      0.86      0.85         7


Test Data Evaluation:
Accuracy: 0.6666666666666666
Confusion Matrix:
[[1 0]
 [1 1]]
Classification Report:
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       1.00      0.50      0.67         2

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3



# Check for overfitting

In [27]:
# Training accuracy of each individual stump (not of the cumulative ensemble)
for i in range(1, 6):
    print(f"Iteration {i} - Training Accuracy: {accuracy_score(y_train, train_predictions[f'Iteration_{i}'])}")

# Accuracy on test set using final boosted model
print(f"Final Test Set Accuracy: {accuracy_score(y_test, y_test_pred)}")


Iteration 1 - Training Accuracy: 0.8571428571428571
Iteration 2 - Training Accuracy: 0.7142857142857143
Iteration 3 - Training Accuracy: 0.8571428571428571
Iteration 4 - Training Accuracy: 0.7142857142857143
Iteration 5 - Training Accuracy: 0.8571428571428571
Final Test Set Accuracy: 0.6666666666666666
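
The per-iteration figures above score each stump on its own. A minimal sketch of how the cumulative ensemble could be scored after each round is below. `fit_stumps` and `staged_accuracy` are hypothetical helpers, and `fit_stumps` assumes the textbook re-weighting (scaling by agreement between prediction and target), so its numbers are not expected to reproduce the printed run. It is fitted on all ten samples purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_stumps(X, y, n_estimators=5):
    """Hypothetical helper: boost depth-1 trees on 0/1 labels."""
    X, y = np.asarray(X), np.asarray(y)
    weights = np.full(len(y), 1 / len(y))
    models, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        # Clip so the logarithm stays finite when a stump is perfect
        error = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - error) / error)
        # Mistakes are scaled up, correct predictions scaled down
        weights = weights * np.exp(np.where(pred == y, -alpha, alpha))
        weights /= weights.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def staged_accuracy(models, alphas, X, y):
    """Accuracy of the weighted vote after each boosting round."""
    X, y = np.asarray(X), np.asarray(y)
    score = np.zeros(len(y))
    accs = []
    for model, alpha in zip(models, alphas):
        score += alpha * (2 * model.predict(X) - 1)  # map 0/1 votes to -1/+1
        accs.append(float(np.mean((score > 0).astype(int) == y)))
    return accs

# The ten-sample data set from the Data section
X = np.array([[2, 5], [3, 1], [5, 8], [6, 7], [7, 2],
              [9, 12], [10, 6], [12, 11], [13, 3], [14, 10]])
y = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])

models, alphas = fit_stumps(X, y, n_estimators=5)
accs = staged_accuracy(models, alphas, X, y)
for i, acc in enumerate(accs, start=1):
    print(f"Ensemble after round {i}: training accuracy {acc:.3f}")
```

Tracking the ensemble rather than each stump shows whether additional rounds are still helping the combined vote.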



In this AdaBoost experiment, I tracked training accuracy over five iterations. The per-iteration figures are the accuracies of the individual stumps, not of the combined ensemble: they alternate between about 86% (6 of 7 correct) and about 71% (5 of 7 correct). This alternation is expected in AdaBoost, because each round re-weights the samples so that the next stump concentrates on the examples the previous one misclassified.

The combined model reaches 86% accuracy on the training set but only 66.67% on the test set. This gap is consistent with overfitting, where the model becomes overly specialised to the training data and struggles on unseen data. With only 7 training and 3 test samples, however, it is weak evidence: a single test sample accounts for 33 percentage points of accuracy.

The alternating stump accuracy and the lower test accuracy highlight a trade-off in AdaBoost: its iterative focus on difficult examples can improve performance on the training set but may come at the cost of generalisation.

In future, I would address this by increasing the dataset size, experimenting with regularisation techniques, or tuning parameters such as the number of iterations and the depth of the base learner.
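
One way to make a parameter comparison less dependent on a single 7/3 split is leave-one-out cross-validation. The sketch below assumes scikit-learn's `AdaBoostClassifier` (whose default base learner is a depth-1 decision tree) as a stand-in for the hand-written implementation. The values 1, 5 and 20 for `n_estimators` are arbitrary illustrations.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# The ten-sample data set from the Data section
X = np.array([[2, 5], [3, 1], [5, 8], [6, 7], [7, 2],
              [9, 12], [10, 6], [12, 11], [13, 3], [14, 10]])
y = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])

loo_scores = {}
for n in (1, 5, 20):
    clf = AdaBoostClassifier(n_estimators=n, random_state=0)
    # Each fold trains on nine samples and tests on the one held out
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    loo_scores[n] = float(scores.mean())
    print(f"n_estimators={n}: leave-one-out accuracy {loo_scores[n]:.2f}")
```

Every sample is used for testing exactly once, so the estimate rests on ten predictions rather than three.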

# (Optional) Run your algorithm against the Iris dataset

Make sure to do the appropriate preprocessing of the data before running it through your algorithm.

In [None]:
from sklearn.datasets import load_iris

x, y = load_iris(return_X_y=True, as_frame=True)
y
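
Iris has three classes, whereas `predict_boosted` thresholds a weighted vote at zero and so only handles a 0/1 target. A minimal preprocessing sketch is below, assuming the problem is reduced to versicolor versus virginica. That choice of classes and the names `x_bin`, `y_bin`, `x_tr`, `x_te`, `y_tr` and `y_te` are illustrative, not part of the activity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x_iris, y_iris = load_iris(return_X_y=True, as_frame=True)

# Assumption: keep versicolor (label 1) and virginica (label 2) only
mask = y_iris.isin([1, 2])
x_bin = x_iris[mask].reset_index(drop=True)
# Relabel to 0/1 so the target matches what the boosted vote expects
y_bin = (y_iris[mask] == 2).astype(int).reset_index(drop=True)

# Stratify so both classes keep the same proportion in each split
x_tr, x_te, y_tr, y_te = train_test_split(
    x_bin, y_bin, test_size=0.3, random_state=42, stratify=y_bin
)
print(x_tr.shape, x_te.shape)
print(y_tr.value_counts().to_dict())
```

Decision stumps split on thresholds, so feature scaling is not required here. The main preprocessing step is getting the target into binary form.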