### Ensemble learning

In ensemble learning, multiple models (often of the same type) are trained to solve the same problem, and their predictions are combined to get better results than any single model would give.

**Bootstrapping:** A resampling technique that estimates statistics of a dataset by sampling from it with replacement. Each bootstrap sample is drawn at random, usually with the same size as the original dataset, so some points appear more than once and others are left out. The process is repeated many times.

Each bootstrap sample can be used to train a separate model, and the models' predictions can be averaged (for regression) or put to a vote (for classification) to produce a final prediction.
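
To make "sampling with replacement" concrete, here is a minimal sketch that draws one bootstrap sample and measures how many distinct points it contains. The dataset size `n` and the seed are arbitrary choices for illustration. For large `n`, the expected share of distinct points is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                # illustrative dataset size
indices = np.arange(n)    # stand-in for the rows of a dataset

# One bootstrap sample: n draws with replacement from n items.
sample = rng.choice(indices, size=n, replace=True)

# Share of distinct original points that made it into the sample.
unique_fraction = np.unique(sample).size / n
expected_fraction = 1 - (1 - 1 / n) ** n

print(f"Unique fraction in sample: {unique_fraction:.3f}")
print(f"Expected fraction:         {expected_fraction:.3f}")
```

The points that are left out of a given sample are what make the bootstrap samples differ from one another, which is what gives an ensemble trained on them its diversity.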

Applications:
- dataset imbalance: oversample rare classes by resampling their examples with replacement
- ensemble learning: bagging
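
The imbalance use case can be sketched with `sklearn.utils.resample`. The dataset below is synthetic and hypothetical (90 majority and 10 minority samples), and balancing to exactly equal class counts is just one possible choice. Note that this kind of oversampling repeats existing minority examples rather than creating new ones.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 of class 1.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

X_major, X_minor = X[y == 0], X[y == 1]

# Resample the minority class with replacement until it matches the majority.
X_minor_up = resample(
    X_minor, replace=True, n_samples=len(X_major), random_state=0
)

X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.concatenate([
    np.zeros(len(X_major), dtype=int),
    np.ones(len(X_minor_up), dtype=int),
])

print(np.bincount(y_balanced))  # class counts after oversampling
```

In practice this would typically be applied to the training split only, so that duplicated examples do not leak into the evaluation data.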



**Bagging:** Short for Bootstrap Aggregating, bagging is an extension of the bootstrapping technique and is used to improve the stability and accuracy of machine learning algorithms. 

It involves creating multiple versions of a predictor and combining them into an aggregated predictor. Each model in a bagging ensemble is trained on a sample drawn at random, with replacement, from the training dataset (i.e., a bootstrap sample).

After training, predictions from each model are combined (typically by averaging or majority vote) to form a final prediction. 
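
The procedure described above can be written out by hand in a few lines. This is a rough sketch rather than how any library implements it: the number of models (`n_models = 25`), the seeds, and the choice of decision trees as the base model are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

rng = np.random.default_rng(0)
n_models = 25  # illustrative choice
models = []

for _ in range(n_models):
    # Bootstrap: draw row indices with replacement, same size as the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate: stack every model's predictions, then take a majority vote per test row.
all_preds = np.stack([m.predict(X_test) for m in models])  # (n_models, n_test)
n_classes = len(np.unique(y_train))
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=n_classes)
bagged_pred = votes.argmax(axis=0)  # class with the most votes for each row

print(f"Manual bagging accuracy: {(bagged_pred == y_test).mean():.2f}")
```

For regression the only change would be the aggregation step: average the stacked predictions instead of voting.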

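Bagging is effective because averaging many models trained on different bootstrap samples reduces the variance of the prediction, which in turn makes overfitting less likely. It helps most with high-variance base models such as deep decision trees.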
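
The variance claim can be checked empirically. The sketch below is an illustration on synthetic data, with everything (the `sin` target, noise level, 30 repetitions, 25 estimators) chosen arbitrarily: it repeatedly draws a fresh training set, fits a single tree and a bagged ensemble, and compares how much their predictions at fixed points fluctuate from one training set to the next.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_eval = np.linspace(0, 6, 50).reshape(-1, 1)  # fixed evaluation points

def noisy_training_set(n=200):
    """Draw a fresh synthetic training set: y = sin(x) + Gaussian noise."""
    X = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

single_preds, bagged_preds = [], []
for seed in range(30):
    X, y = noisy_training_set()
    single = DecisionTreeRegressor(random_state=seed).fit(X, y)
    # BaggingRegressor uses a decision tree as its default base estimator.
    bagged = BaggingRegressor(n_estimators=25, random_state=seed).fit(X, y)
    single_preds.append(single.predict(X_eval))
    bagged_preds.append(bagged.predict(X_eval))

# Variance of the prediction at each point across training sets, averaged over points.
single_var = np.var(single_preds, axis=0).mean()
bagged_var = np.var(bagged_preds, axis=0).mean()

print(f"Single tree variance:  {single_var:.4f}")
print(f"Bagged trees variance: {bagged_var:.4f}")
```

On this setup the bagged ensemble should show a clearly smaller variance than the single tree; the exact numbers depend on the seeds and the noise level.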

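Application:
- random forest (bagged decision trees, with a random subset of features considered at each split)

AdaBoost and XGBoost are boosting methods, not bagging: they train models sequentially, each one correcting the errors of the previous ones, rather than training independent models on bootstrap samples.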

**In essence, bootstrapping is random sampling with replacement from the available training data. Bagging (= bootstrap aggregation) repeats this many times, trains an estimator on each bootstrapped dataset, and aggregates their predictions.**

```python
import numpy as np

# Sample data
data = np.array([4, 7, 2, 9, 5, 10, 3, 6, 8, 11])

# Number of bootstrap samples
n_bootstrap_samples = 1000

# Function to draw bootstrap samples and compute their mean
def bootstrap(data, n_bootstrap_samples):
    np.random.seed(0)  # For reproducibility
    sample_means = []
    for _ in range(n_bootstrap_samples):
        # Resample with replacement, keeping the original sample size
        sample = np.random.choice(data, size=len(data), replace=True)
        sample_means.append(np.mean(sample))
    return np.array(sample_means)

# Compute the mean of each bootstrap sample
bootstrap_means = bootstrap(data, n_bootstrap_samples)

# Compute the 95% percentile confidence interval
lower_bound = np.percentile(bootstrap_means, 2.5)
upper_bound = np.percentile(bootstrap_means, 97.5)

print(f"95% confidence interval for the mean: {lower_bound:.2f} to {upper_bound:.2f}")
```


Output:

```
95% confidence interval for the mean: 4.80 to 8.40
```


```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create a Bagging Classifier with Decision Trees
# Note: scikit-learn >= 1.2 names this parameter `estimator`;
# older releases call it `base_estimator`.
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=0
)

# Train the classifier (each tree is fit on its own bootstrap sample)
bagging_clf.fit(X_train, y_train)

# Make predictions and evaluate
predictions = bagging_clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Accuracy: {accuracy:.2f}")
```


Output:

```
Accuracy: 0.96
```
