# Unit 1 Bagging in Machine Learning

Hello! In this lesson, we're diving into a powerful technique in machine learning called **Bagging**. Bagging stands for **Bootstrap Aggregating**. Imagine making important decisions by averaging the opinions of a large group rather than relying on just one individual. This collaborative approach generally leads to better and more stable decisions. The idea behind all ensemble methods is to combine predictions from multiple models to produce a single prediction. Our goal is to understand **Bagging**, how it works, and how to implement it using Python's **scikit-learn** library.

Bagging is an ensemble method. It improves the stability and accuracy of machine learning models by training multiple copies of a model, each on a different random sample of the dataset, and combining their results. Think of it as working with a panel of experts rather than a single adviser.

### How Bagging Works: An Example

Let's break it down with a simple example:

Suppose you have a dataset of different types of flowers and you want to classify them. Instead of training just one decision tree, which might overfit to your training data, you can train multiple decision trees on different subsets of your data. Each subset (a bootstrap sample) is created by randomly selecting samples from the original dataset with replacement and has the same size as the original dataset, so some rows appear more than once and others are left out. Then, you aggregate the predictions from all the trees, for example by majority vote. This process reduces overfitting and leads to a more robust model.

It is important to note that a decision tree is just an example. You can use any model with bagging.
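
To make the steps above concrete, here is a minimal hand-rolled sketch of bagging with NumPy and decision trees. The names `n_trees`, `trees`, and `majority_vote`, as well as the choice of 25 trees, are our own illustrative assumptions. The hard majority vote is also a simplification: scikit-learn's `BaggingClassifier` may aggregate differently (for example, by averaging predicted probabilities).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(42)
n_samples = len(X)
n_trees = 25  # illustrative choice, not a recommendation

trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw n_samples row indices WITH replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate: every tree votes on every sample, the most common class wins
all_votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority_vote = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_votes.T]
)

# Fraction of distinct rows in the last bootstrap sample (usually well below 1.0)
unique_fraction = len(np.unique(idx)) / n_samples
print(f"Distinct rows in one bootstrap sample: {unique_fraction:.2f}")
```

Because each tree sees a slightly different version of the data, their individual mistakes tend to differ, and the vote smooths them out.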

### Loading a Dataset and Splitting the Data

Let's start by loading a dataset. Think of it as a table of data where each row is an example we're learning from, and each column is a feature or quality about the examples. For today, we'll use a dataset about wine. This dataset comes with `scikit-learn`, so it's easy to load.

Here's the code to load the dataset:

```python
from sklearn.datasets import load_wine

# Load dataset
X, y = load_wine(return_X_y=True)
# Note: The output is a tuple of feature matrix X and target vector y
```

In this code, `X` represents the features of the dataset, and `y` represents the labels (the class of wine).

To test our model properly, we split our data into training and testing parts, like studying for a test and then taking it. Use `train_test_split` from `scikit-learn` to do this:

```python
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# X_train, X_test, y_train, and y_test are arrays
# For instance, len(X_train) would be 142, which is 80% of 178 samples
```

  * `test_size=0.2` uses 20% of the data for testing and 80% for training.
  * `random_state=42` ensures the split is the same each time you run the code.
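
As a quick sanity check, the sketch below prints the shapes produced by the split. It also shows an optional variant that is not part of this lesson's code: passing `stratify=y` so both parts keep similar class proportions. The `_s` suffixed variable names are our own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split as in the lesson
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (142, 13) (36, 13)

# Optional variant (an assumption, not used elsewhere in this lesson):
# stratify=y keeps the class proportions similar in the train and test parts
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Class counts (train):", np.bincount(y_train_s))
print("Class counts (test): ", np.bincount(y_test_s))
```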

### Building and Training a Single Decision Tree Classifier

Before we dive into Bagging, let's first build a simple decision tree to see its performance:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a single decision tree classifier
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred_tree = tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, y_pred_tree)
print(f"Accuracy of single Decision Tree: {tree_accuracy:.2f}")  # 0.94
```

### Building and Training a Bagging Classifier: Part 1

Now let's create our Bagging classifier. We’ll start by defining the Bagging classifier and specifying its parameters:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Define a bagging classifier (it is not trained until we call fit)
bag_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
```

In this code:

  * We create a `BaggingClassifier`, our team captain organizing the mini-models.
  * `estimator=DecisionTreeClassifier()` means each mini-model is a decision tree.
  * `n_estimators=100` means we’ll have 100 mini-models (or decision trees) on our team.

### Building and Training a Bagging Classifier: Part 2

Let's continue by training the classifier with our training data:

```python
bag_clf.fit(X_train, y_train)
```

Finally, let's make predictions with our Bagging classifier and evaluate its performance:

```python
# Predict and calculate accuracy
y_pred_bag = bag_clf.predict(X_test)
bag_accuracy = accuracy_score(y_test, y_pred_bag)
print(f"Accuracy of Bagging Classifier: {bag_accuracy:.2f}")  # 0.97
```

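On this split, bagging improved the accuracy from 0.94 to 0.97. Keep in mind that the test set holds only 36 samples, so this gap corresponds to a single additional correct prediction (34 versus 35 out of 36).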

### Advantages and Disadvantages of Bagging

While bagging offers numerous benefits, it also has some drawbacks:

**Advantages:**

  * **Reduced Overfitting:** By aggregating the results from multiple models, bagging helps to reduce overfitting.
  * **Improved Accuracy:** The overall performance of the ensemble is often better than that of a single model, especially when the base model is prone to overfitting.
  * **Stability:** Bagging provides more stable predictions by reducing the variance in the model's output.

**Disadvantages:**

  * **Increased Computational Cost:** Training multiple models can be computationally expensive and time-consuming.
  * **Complexity:** Combining multiple models can make the model more complex and harder to interpret.

### Lesson Summary and Practice Introduction

Well done! You've learned what **Bagging** is, why it's useful, and how it works through an example. You also learned how to load a dataset, split it, and build both a single Decision Tree and a Bagging classifier using `scikit-learn`. In our example, the Bagging classifier outperformed the single tree by combining the results of multiple decision trees.

Now, it’s time for hands-on practice! Apply what you've learned by writing the code yourself. Ready? Let's get started!

## Change the Base Estimator

Hey Space Voyager!

It's time to tweak our wine classifier a bit. Change the base estimator in the `BaggingClassifier` from `DecisionTreeClassifier` to `KNeighborsClassifier` to see how it impacts the accuracy.

Let's code!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a bagging classifier
bag_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
bag_clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred_bag = bag_clf.predict(X_test)
bag_accuracy = accuracy_score(y_test, y_pred_bag)

print(f"Bagging Classifier Accuracy with DecisionTreeClassifier: {bag_accuracy:.2f}")

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_wine(return_X_y=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a bagging classifier with DecisionTreeClassifier (original)
bag_clf_dt = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)
bag_clf_dt.fit(X_train, y_train)

# Predict and calculate accuracy for DecisionTreeClassifier
y_pred_bag_dt = bag_clf_dt.predict(X_test)
bag_accuracy_dt = accuracy_score(y_test, y_pred_bag_dt)

print(f"Bagging Classifier Accuracy with DecisionTreeClassifier: {bag_accuracy_dt:.2f}")

# Train a bagging classifier with KNeighborsClassifier
bag_clf_knn = BaggingClassifier(estimator=KNeighborsClassifier(), n_estimators=100, random_state=42)
bag_clf_knn.fit(X_train, y_train)

# Predict and calculate accuracy for KNeighborsClassifier
y_pred_bag_knn = bag_clf_knn.predict(X_test)
bag_accuracy_knn = accuracy_score(y_test, y_pred_bag_knn)

print(f"Bagging Classifier Accuracy with KNeighborsClassifier: {bag_accuracy_knn:.2f}")
```

## Train and Evaluate Bagging Classifier

Alright, young data scientist! It's time to add a missing piece. Train and evaluate a Bagging Classifier to classify wines based on their features. Make sure to use `DecisionTreeClassifier` as the base model.

Fill in the missing pieces, and let's see how high you can score!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset and split it
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: initialize and train the Bagging classifier as `bag_clf`, using a Decision Tree as the base estimator

# Predict and calculate accuracy on testing set
y_pred = bag_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier Test Accuracy: {accuracy:.2f}")

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset and split it
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Bagging classifier, use `Decision Tree` as a base estimator
# Create an instance of DecisionTreeClassifier to be used as the base estimator
base_estimator = DecisionTreeClassifier(random_state=42)

# Initialize BaggingClassifier with the DecisionTreeClassifier as the estimator
# n_estimators=100 is a common choice for the number of base estimators
# random_state ensures reproducibility
bag_clf = BaggingClassifier(estimator=base_estimator, n_estimators=100, random_state=42)

# Train the Bagging classifier on the training data
bag_clf.fit(X_train, y_train)

# Predict and calculate accuracy on testing set
y_pred = bag_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier Test Accuracy: {accuracy:.2f}")
```

## Optimize Bagging Classifier for Wine Classification

Hey there, Space Wanderer!

Let's optimize our Bagging classifier for classifying different types of wine. Follow the TODO comments to fill in the code for initializing and training the classifier, and evaluate its performance with varying numbers of decision trees (`n_estimators`).

May the celestial forces be with you!

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

best_accuracy = 0
best_n_estimators = 0

for n in range(10, 110, 10):
    # TODO: Initialize a BaggingClassifier with DecisionTreeClassifier and given n_estimators (n) and put it into the bag_clf variable
    # TODO: Fit the BaggingClassifier to the training data
    
    y_pred = bag_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_n_estimators = n

print(f"Best n_estimators: {best_n_estimators}, Best Accuracy: {best_accuracy:.2f}")

```

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

best_accuracy = 0
best_n_estimators = 0

for n in range(10, 110, 10):
    # Initialize a BaggingClassifier with DecisionTreeClassifier and given n_estimators (n)
    bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n, random_state=42)
    
    # Fit the BaggingClassifier to the training data
    bag_clf.fit(X_train, y_train)
    
    y_pred = bag_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_n_estimators = n

print(f"Best n_estimators: {best_n_estimators}, Best Accuracy: {best_accuracy:.2f}")
```

## Enhance Your Bagging Classifier

Great job, Galactic Pioneer! Now, let's make it a bit more challenging. You'll need to fill in the missing pieces to complete our Bagging classifier code.

  * Load the wine dataset.
  * Split the dataset into training and test sets.
  * Train a Bagging classifier with different numbers of estimators (50 to 150 with a step of 10) and different base models (`DecisionTreeClassifier`, `KNeighborsClassifier`, and `GaussianNB`).

Let's lock in those missing pieces!

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset and split into training and testing sets
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a bagging classifier with different numbers of estimators and different base models
best_accuracy = 0
best_n_estimators = 0
best_base_model = None
base_models = [DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()]

for model in base_models:
    for n in range(50, 151, 10):
        # TODO: train bagging classifier with parameters estimator=model and n_estimators=n
        # TODO: calculate the accuracy on the testing data and put it into the bag_accuracy variable
        if bag_accuracy > best_accuracy:
            best_accuracy = bag_accuracy
            best_n_estimators = n
            best_base_model = model.__class__.__name__

print(f"Best accuracy achieved: {best_accuracy:.2f} with {best_n_estimators} n_estimators and {best_base_model} as the base model")

```

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset and split into training and testing sets
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a bagging classifier with different numbers of estimators and different base models
best_accuracy = 0
best_n_estimators = 0
best_base_model = None
base_models = [DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()]

for model in base_models:
    for n in range(50, 151, 10):
        # train bagging classifier with parameters estimator=model and n_estimators=n
        bag_clf = BaggingClassifier(estimator=model, n_estimators=n, random_state=42)
        bag_clf.fit(X_train, y_train)
        
        # calculate the accuracy on the testing data and put it into the bag_accuracy variable
        y_pred = bag_clf.predict(X_test)
        bag_accuracy = accuracy_score(y_test, y_pred)
        
        if bag_accuracy > best_accuracy:
            best_accuracy = bag_accuracy
            best_n_estimators = n
            best_base_model = model.__class__.__name__

print(f"Best accuracy achieved: {best_accuracy:.2f} with {best_n_estimators} n_estimators and {best_base_model} as the base model")
```