# Theoretical

1. Yes, Bagging can be used for regression problems. In this case, the predictions from multiple models are averaged to produce a final prediction.

2. Multiple model training involves training several models (e.g., in ensemble methods) and combining their predictions, while single model training involves training just one model to make predictions.

3. In Random Forest, feature randomness refers to the practice of selecting a random subset of features for each split in the decision trees. This helps to reduce correlation among the trees and improves the model's generalization.

4. The OOB score is an internal validation method for Bagging models. It uses the predictions from trees that did not see a particular sample during training to estimate the model's performance.

6. A Bagging Classifier works by creating multiple subsets of the training data through bootstrapping (sampling with replacement), training a base model (e.g., decision tree) on each subset, and then aggregating the predictions (e.g., by majority voting) to make the final prediction.

7. The performance of a Bagging Classifier can be evaluated using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC, depending on the nature of the classification problem.

8. A Bagging Regressor works similarly to a Bagging Classifier but focuses on regression tasks. It creates multiple bootstrapped datasets, trains a regressor on each, and averages the predictions to produce the final output.

9. The main advantage of ensemble techniques is that they often provide better predictive performance than individual models by reducing variance (in Bagging) or bias (in Boosting).

10. The main challenge of ensemble methods is increased computational complexity and the potential for overfitting if the base models are too complex or if the ensemble is not properly tuned.

11. The key idea behind ensemble techniques is to combine multiple models to improve overall performance, leveraging the strengths of each model while mitigating their weaknesses.

12. A Random Forest Classifier is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of their predictions for classification tasks.

13. The main types of ensemble techniques include Bagging, Boosting, Stacking, and Voting.

14. Ensemble learning is a technique that combines the predictions of multiple models into a single predictor that typically generalizes better than its individual members.

15. Ensemble methods should be avoided when the dataset is very small, as they may lead to overfitting, or when interpretability is crucial, as ensembles can be more complex to interpret than single models.

16. Bagging reduces overfitting by averaging the predictions of multiple models trained on different subsets of the data, which helps to smooth out the noise and variance in the predictions.

17. Random Forest is better than a single Decision Tree because it reduces overfitting by averaging multiple trees, which leads to better generalization on unseen data.

18. Bootstrap sampling involves creating multiple subsets of the training data by sampling with replacement. This allows each model in the ensemble to learn from different data points, enhancing diversity and robustness.

19. Real-world applications of ensemble techniques include credit scoring, fraud detection, medical diagnosis, image classification, and natural language processing tasks.

20. Bagging builds multiple models independently and combines their predictions, while Boosting builds models sequentially, where each model attempts to correct the errors of the previous one.
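The bootstrap sampling mentioned in answers 4 and 18 can be sketched with NumPy alone. This is a minimal illustration rather than anything scikit-learn does internally: the sample count `n_samples` and the seed are arbitrary assumptions. It draws one bootstrap sample and marks the indices that were never drawn, which are the out-of-bag samples for a model trained on that draw. For large datasets, roughly 1/e (about 37%) of the samples are expected to be out of bag.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only for reproducibility
n_samples = 1000                 # hypothetical dataset size

# One bootstrap sample: n_samples indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Indices that were never drawn are "out-of-bag" for this sample.
in_bag = np.unique(bootstrap_idx)
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[in_bag] = False

oob_fraction = oob_mask.mean()
print(f"Unique in-bag samples: {in_bag.size}")
print(f"Out-of-bag fraction: {oob_fraction:.3f}")
```

Because each model leaves out a different set of samples, every training point ends up with some models that never saw it, and those models are the ones used for the OOB estimate.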
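The procedure in answer 6 (bootstrap, fit, majority vote) can also be written by hand. The sketch below is only an illustration under stated assumptions: `n_models`, the seeds, and the use of the Iris dataset are arbitrary choices, and in practice scikit-learn's `BaggingClassifier` would normally be used instead.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_models = 25                      # hypothetical ensemble size
n_classes = len(np.unique(y_train))

all_preds = []
for _ in range(n_models):
    # Bootstrap: sample training indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)    # shape: (n_models, n_test_samples)

# Majority vote: the most frequent predicted class for each test sample.
majority = np.array([np.bincount(col, minlength=n_classes).argmax() for col in all_preds.T])
accuracy = (majority == y_test).mean()
print(f"Hand-rolled bagging accuracy: {accuracy:.2f}")
```

For regression (answer 8), the vote would be replaced by `all_preds.mean(axis=0)`.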
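The contrast drawn in answer 20 can be tried out side by side. In this sketch, AdaBoost stands in as one representative boosting algorithm (an assumption, since the answer does not name one), and the dataset and hyperparameters are illustrative choices only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: full decision trees, each fit independently on its own bootstrap sample.
# The base estimator is passed positionally so the call works across scikit-learn versions.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
bagging_acc = accuracy_score(y_test, bagging.predict(X_test))

# Boosting: weak learners fit one after another, each reweighting
# the samples the previous learners got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)
boosting.fit(X_train, y_train)
boosting_acc = accuracy_score(y_test, boosting.predict(X_test))

print(f"Bagging accuracy:  {bagging_acc:.2f}")
print(f"Boosting accuracy: {boosting_acc:.2f}")
```

Which of the two scores higher depends on the dataset and settings, so a single split like this one should not be read as a general ranking.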

# Practical

21.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Classifier (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = bagging_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

22.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset (load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset stands in)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Regressor (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
bagging_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = bagging_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

23.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Print feature importance scores
feature_importances = rf_clf.feature_importances_
for i, score in enumerate(feature_importances):
    print(f"Feature {i}: {score:.4f}")

24.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset (load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset stands in)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# Train Decision Tree Regressor
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)
dt_pred = dt_reg.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_pred)

print(f"Random Forest MSE: {rf_mse:.2f}")
print(f"Decision Tree MSE: {dt_mse:.2f}")

25.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Train Random Forest Classifier with OOB
rf_clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_clf.fit(X, y)

# Print OOB score
print(f"OOB Score: {rf_clf.oob_score_:.2f}")

26.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Classifier with SVM (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_svm = BaggingClassifier(estimator=SVC(), n_estimators=50, random_state=42)
bagging_svm.fit(X_train, y_train)

# Predict and evaluate
y_pred = bagging_svm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

27.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifiers with different numbers of trees
for n_trees in [10, 50, 100, 200]:
    rf_clf = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy with {n_trees} trees: {accuracy:.2f}")

28.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

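# Train Bagging Classifier with Logistic Regression (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
# max_iter is raised so the solver has room to converge on unscaled features
bagging_log_reg = BaggingClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=50, random_state=42)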
bagging_log_reg.fit(X_train, y_train)

# Predict and evaluate AUC (Iris has three classes, so use one-vs-rest on the full probability matrix)
y_pred_proba = bagging_log_reg.predict_proba(X_test)
auc_score = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
print(f"AUC Score: {auc_score:.2f}")

29.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load dataset (load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset stands in)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)

# Analyze feature importance scores
feature_importances = rf_reg.feature_importances_
for i, score in enumerate(feature_importances):
    print(f"Feature {i}: {score:.4f}")

30.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Classifier
bagging_clf = BaggingClassifier(n_estimators=50, random_state=42)
bagging_clf.fit(X_train, y_train)
bagging_pred = bagging_clf.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)

print(f"Bagging Classifier Accuracy: {bagging_accuracy:.2f}")
print(f"Random Forest Classifier Accuracy: {rf_accuracy:.2f}")

31.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_:.2f}")

32.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset (load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset stands in)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Regressors with different numbers of base estimators
# (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
for n_estimators in [10, 50, 100, 200]:
    bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=n_estimators, random_state=42)
    bagging_reg.fit(X_train, y_train)
    y_pred = bagging_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"MSE with {n_estimators} estimators: {mse:.2f}")

33.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict and find misclassified samples
y_pred = rf_clf.predict(X_test)
misclassified = X_test[y_pred != y_test]

print(f"Number of misclassified samples: {len(misclassified)}")

34.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Classifier
bagging_clf = BaggingClassifier(n_estimators=50, random_state=42)
bagging_clf.fit(X_train, y_train)
bagging_pred = bagging_clf.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# Train Decision Tree Classifier
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

print(f"Bagging Classifier Accuracy: {bagging_accuracy:.2f}")
print(f"Decision Tree Classifier Accuracy: {dt_accuracy:.2f}")

35.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict and compute confusion matrix
y_pred = rf_clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title("Confusion Matrix")
plt.show()

36.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base estimators
base_estimators = [
    ('dt', DecisionTreeClassifier()),
    ('svm', SVC(probability=True)),
    ('log_reg', LogisticRegression())
]

# Train Stacking Classifier
stacking_clf = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression())
stacking_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = stacking_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")

37.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Get feature importances and sort them
feature_importances = rf_clf.feature_importances_
indices = np.argsort(feature_importances)[::-1]

# Print top 5 most important features
print("Top 5 Most Important Features:")
for i in range(5):
    print(f"Feature {indices[i]}: {feature_importances[indices[i]]:.4f}")

38.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Classifier (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = bagging_clf.predict(X_test)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

39.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifiers with different max_depth values
for max_depth in [None, 5, 10, 20]:
    rf_clf = RandomForestClassifier(n_estimators=100, max_depth=max_depth, random_state=42)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy with max_depth={max_depth}: {accuracy:.2f}")

40.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset (load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset stands in)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Regressor with Decision Tree (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_dt = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
bagging_dt.fit(X_train, y_train)
dt_pred = bagging_dt.predict(X_test)
dt_mse = mean_squared_error(y_test, dt_pred)

# Train Bagging Regressor with KNeighbors
bagging_knn = BaggingRegressor(estimator=KNeighborsRegressor(), n_estimators=50, random_state=42)
bagging_knn.fit(X_train, y_train)
knn_pred = bagging_knn.predict(X_test)
knn_mse = mean_squared_error(y_test, knn_pred)

print(f"Decision Tree Bagging MSE: {dt_mse:.2f}")
print(f"KNeighbors Bagging MSE: {knn_mse:.2f}")

41.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict probabilities and evaluate ROC-AUC
y_pred_proba = rf_clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.2f}")

42.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define Bagging Classifier (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Evaluate performance using cross-validation
cv_scores = cross_val_score(bagging_clf, X, y, cv=5)
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean Cross-Validation Score: {cv_scores.mean():.2f}")

43.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict probabilities
y_scores = rf_clf.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_scores)

# Plot Precision-Recall curve
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()

44.

In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base estimators
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=100)),
    ('log_reg', LogisticRegression())
]

# Train Stacking Classifier
stacking_clf = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression())
stacking_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = stacking_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")

45.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset (load_boston was removed in scikit-learn 1.2; the bundled diabetes dataset stands in)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Bagging Regressors with and without bootstrap sampling
# (`base_estimator` was renamed to `estimator` in scikit-learn 1.2)
for bootstrap in [True, False]:
    bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, bootstrap=bootstrap, random_state=42)
    bagging_reg.fit(X_train, y_train)
    y_pred = bagging_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"MSE with bootstrap={bootstrap}: {mse:.2f}")