# Unit 3 Boosting with AdaBoost in Machine Learning

Welcome! In today's lesson, we'll explore boosting, focusing on AdaBoost. Boosting improves accuracy by combining many weak models into a single strong one. By the end, you'll understand how AdaBoost works and how to use it to improve your machine learning models.

## Introduction to Boosting and AdaBoost

Boosting trains weak models one after another, each concentrating on the mistakes of the ones before it, and combines them into a stronger model. Think of a group of not-so-great basketball players; individually, they may not win, but together they can be strong.

**AdaBoost** (Adaptive Boosting) combines several weak classifiers into a strong one. A weak classifier is one that performs only slightly better than random guessing. AdaBoost focuses on correcting errors made by previous classifiers. Here's how it works:

1.  **Initialize Weights:** Assign equal weights to all training samples.
2.  **Train Weak Classifier:** Train a weak classifier on the weighted data.
3.  **Calculate Error:** Compute the weighted classification error of the weak classifier, so that mistakes on heavily weighted samples count more.
4.  **Update Weights:** Increase the weights of misclassified samples and decrease the weights of correctly classified samples. This ensures that subsequent classifiers focus more on the difficult samples.
5.  **Combine Classifiers:** Combine all the weak classifiers to form a strong classifier, with each classifier's vote weighted according to its accuracy.
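
The five steps above can be sketched from scratch. This is a minimal illustration, not scikit-learn's implementation. It assumes binary labels in {-1, +1}, one-dimensional inputs, threshold stumps as the weak classifiers, and the classic vote weight `alpha = 0.5 * ln((1 - err) / err)`. The toy data and the function names (`train_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) are made up for this example.

```python
import math

def stump_predict(x, thr, polarity):
    """A threshold stump: predicts `polarity` left of thr, `-polarity` otherwise."""
    return polarity if x < thr else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    points = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for thr in thresholds:
        for polarity in (1, -1):
            # Step 3: the weighted error sums the weights of misclassified samples
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, thr, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                      # Step 1: equal weights
    ensemble = []
    for _ in range(n_rounds):
        err, thr, polarity = train_stump(xs, ys, weights)   # Steps 2 and 3
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, thr, polarity))
        # Step 4: raise weights of misclassified samples, lower the rest
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    # Step 5: weighted vote of all stumps
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single threshold separates the two classes
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
accuracy = sum(p == y for p, y in zip(predictions, ys)) / len(ys)
print("vote weights:", [round(alpha, 4) for alpha, _, _ in ensemble])
print("training accuracy:", accuracy)
```

Each individual stump misclassifies several points on this data; it is the weighted vote across rounds that lets the ensemble fit the pattern.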

## Loading and Splitting the Dataset

Before training our model, we need data. We'll use the **wine dataset**, which contains 13 chemical measurements for 178 wines from three cultivars. We'll use this data to train and test our model.

To load the dataset, call `load_wine` from `sklearn.datasets` with `return_X_y=True`, which returns features `X` and labels `y`. Features describe the chemical properties, while labels indicate the type of wine.

```python
from sklearn.datasets import load_wine

# Load dataset
X, y = load_wine(return_X_y=True)
```
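
As a quick sanity check, it can help to look at what was loaded. The sketch below only inspects the arrays returned by `load_wine`; using `np.bincount` to count samples per class is a choice made for this example.

```python
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# Number of samples and features
print(X.shape)         # (178, 13)

# Number of samples in each of the three wine classes (labels 0, 1, 2)
print(np.bincount(y))  # [59 71 48]
```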

Next, we split the data into training and testing sets using `train_test_split` from `sklearn.model_selection`. We use 80% for training and 20% for testing.

```python
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
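
To see what the split produces, the sketch below prints the resulting shapes and checks that a fixed `random_state` makes the split reproducible. The second call to `train_test_split` and the variable name `X_train_again` are there only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 178 samples are divided into 142 for training and 36 for testing
print(X_train.shape, X_test.shape)  # (142, 13) (36, 13)

# Repeating the split with the same random_state yields the same partition
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_train, X_train_again)
print(same_split)  # True
```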

## Training an AdaBoost Classifier

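Now, let's train our AdaBoost model using `AdaBoostClassifier` from `sklearn.ensemble`, with `DecisionTreeClassifier` from `sklearn.tree` as the base classifier. Note that `DecisionTreeClassifier()` with default settings grows a full, unrestricted tree rather than a one-split stump; to boost true decision stumps, pass `DecisionTreeClassifier(max_depth=1)` instead.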

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Train AdaBoost classifier
ada_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, algorithm='SAMME')
ada_clf.fit(X_train, y_train)

# Make predictions
y_pred_ada = ada_clf.predict(X_test)
```

In the code:

  * `estimator=DecisionTreeClassifier()` specifies the base classifier that gets boosted.
  * `n_estimators=100` combines 100 weak classifiers.
  * `algorithm='SAMME'` selects the boosting variant. In the scikit-learn release this lesson was written against, leaving it unset falls back to `'SAMME.R'`, which is deprecated and scheduled for removal, so pass `algorithm='SAMME'` explicitly. Later releases may change or drop this parameter, so check the documentation for your installed version.
  * `fit(X_train, y_train)` trains the model.
  * `predict(X_test)` makes predictions on the test set.

## Comparing AdaBoost with RandomForest

To understand the effectiveness of AdaBoost, let’s compare it with `RandomForestClassifier` from `sklearn.ensemble`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train RandomForest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Make predictions with RandomForest
y_pred_rf = rf_clf.predict(X_test)

# Calculate and compare accuracies
accuracy_ada = accuracy_score(y_test, y_pred_ada)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print(f"AdaBoost accuracy: {accuracy_ada}")  # 0.94
print(f"RandomForest accuracy: {accuracy_rf}")  # 1.0
```

In this code, we initialize the `RandomForestClassifier` with 100 trees, which we already know to perform perfectly on this dataset. Then, we make predictions with Random Forest and compare accuracies of the Random Forest and the AdaBoost models.

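On this split, AdaBoost scores slightly lower than Random Forest (about 0.94 versus 1.0), though its accuracy is still high. With only 36 test samples, that gap amounts to two misclassified wines, so treat the comparison as indicative rather than conclusive.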

## Lesson Summary

Great job! You've learned about Boosting and how AdaBoost uses weak classifiers to create a strong model. We covered:

  * What Boosting and AdaBoost are.
  * Loading the wine dataset.
  * Splitting the dataset.
  * Training an AdaBoost classifier with decision trees.
  * Comparing the accuracies of AdaBoost and RandomForest.

Next, you'll practice by loading data, splitting it, and training your own AdaBoost classifier. Ready to boost your skills? Let's dive in!

## Change the Weak Classifier in AdaBoost

Galactic Pioneer, it's time to change the weak classifier! Replace the DecisionTreeClassifier with a RandomForestClassifier as the base estimator for the AdaBoostClassifier. This will help you see how different weak classifiers perform in boosting.

Let's code!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost classifier with DecisionTreeClassifier as the base estimator
dt_clf = DecisionTreeClassifier()
ada_clf = AdaBoostClassifier(estimator=dt_clf, n_estimators=100, algorithm='SAMME')
ada_clf.fit(X_train, y_train)

# Make predictions
y_pred_ada = ada_clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Classifier accuracy: {accuracy}")

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost classifier with RandomForestClassifier as the base estimator
rf_clf_weak = RandomForestClassifier(n_estimators=10, random_state=42) # Using a small number of estimators for the weak learner
ada_clf_rf = AdaBoostClassifier(estimator=rf_clf_weak, n_estimators=100, algorithm='SAMME', random_state=42)
ada_clf_rf.fit(X_train, y_train)

# Make predictions
y_pred_ada_rf = ada_clf_rf.predict(X_test)

# Calculate accuracy
accuracy_rf_base = accuracy_score(y_test, y_pred_ada_rf)
print(f"AdaBoost Classifier with RandomForest base estimator accuracy: {accuracy_rf_base}")

# Original AdaBoost Classifier with DecisionTreeClassifier (for comparison)
dt_clf = DecisionTreeClassifier()
ada_clf_dt = AdaBoostClassifier(estimator=dt_clf, n_estimators=100, algorithm='SAMME', random_state=42)
ada_clf_dt.fit(X_train, y_train)

# Make predictions
y_pred_ada_dt = ada_clf_dt.predict(X_test)

# Calculate accuracy
accuracy_dt_base = accuracy_score(y_test, y_pred_ada_dt)
print(f"AdaBoost Classifier with DecisionTree base estimator accuracy: {accuracy_dt_base}")
```

## Train and Predict with AdaBoost

Hey Galactic Pioneer, ready to boost your skills?

Here's your mission: complete the code to train an AdaBoost classifier and make predictions. Can you also calculate the accuracy of the model? Fill in the TODO comments and make the code work!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Initialize AdaBoostClassifier with DecisionTreeClassifier as a base model, 100 estimators, and 'SAMME' algorithm.

# TODO: Fit the AdaBoost classifier with the training data.

# TODO: Make predictions on test dataset and calculate accuracy.

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Initialize AdaBoostClassifier with DecisionTreeClassifier as a base model, 100 estimators, and 'SAMME' algorithm.
ada_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, algorithm='SAMME', random_state=42)

# TODO: Fit the AdaBoost classifier with the training data.
ada_clf.fit(X_train, y_train)

# TODO: Make predictions on test dataset and calculate accuracy.
y_pred_ada = ada_clf.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)

print(f"AdaBoost Classifier accuracy: {accuracy_ada}")
```

## AdaBoost vs RandomForest

Great job, Space Explorer!

Let's make it more engaging. Fill in the missing pieces of code to train an AdaBoost classifier on a synthetically generated dataset. The generated dataset is imbalanced (roughly 90% of samples in one class) and has no redundant features. Boosting often does well on data like this, though the outcome depends on the dataset and the hyperparameters, so check the numbers yourself.

After completing the AdaBoost classifier, also train a RandomForest classifier on the same dataset and compare their performances.

May the stars guide your way!

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize AdaBoost with DecisionTree as the base estimator
ada_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3), n_estimators=100, algorithm='SAMME')
# TODO: Train the AdaBoost classifier

# TODO: Make predictions and calculate accuracy for AdaBoost

# TODO: Initialize and train the RandomForest classifier

# TODO: Make predictions and calculate accuracy for RandomForest

```

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize AdaBoost with DecisionTree as the base estimator
ada_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3), n_estimators=100, algorithm='SAMME', random_state=42)
# TODO: Train the AdaBoost classifier
ada_clf.fit(X_train, y_train)

# TODO: Make predictions and calculate accuracy for AdaBoost
y_pred_ada = ada_clf.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Classifier Accuracy: {accuracy_ada:.4f}")

# TODO: Initialize and train the RandomForest classifier
# Using the same number of estimators as AdaBoost, and class_weight='balanced' to account for the imbalanced classes
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf_clf.fit(X_train, y_train)

# TODO: Make predictions and calculate accuracy for RandomForest
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"RandomForest Classifier Accuracy: {accuracy_rf:.4f}")
```