## Random Forests, Bagging and Boosting
In this Jupyter Notebook we'll introduce the concepts of bagging and boosting, and we'll look at random forests, which are a form of bagging.

The idea of bagging is to create a model that is an average of a bunch of other smaller models.  We do this by bootstrapping the data (resampling the rows with replacement) and building a smaller model on that sample: think trees without much depth, or logistic regression models with only a small number of predictors, say 2 or 3.  We then repeat that process to get lots of small models and average all of them to get our predictions.
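The bagging recipe described above can be sketched by hand. This is only an illustration, not how *BaggingClassifier* is implemented internally; the names `X_demo`, `y_demo`, `small_models` and the choice of 25 depth-2 trees are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; the *_demo names are hypothetical and used only in this sketch
X_demo, y_demo = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_obs = X_demo.shape[0]

small_models = []
for _ in range(25):
    # Bootstrap: draw row indices with replacement
    boot_idx = rng.integers(0, n_obs, size=n_obs)
    # A deliberately small model: a tree with at most two levels of splits
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X_demo[boot_idx], y_demo[boot_idx])
    small_models.append(tree)

# Average the predicted class probabilities, then pick the most probable class
avg_proba = np.mean([m.predict_proba(X_demo) for m in small_models], axis=0)
bagged_preds = avg_proba.argmax(axis=1)
print("Training accuracy of the hand-rolled bagged model:", (bagged_preds == y_demo).mean())
```

Averaging probabilities is one reasonable way to combine the small models; a majority vote over the predicted classes is another.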

A random forest is a particular form of bagging in which you build each tree on a random sample of the observations and consider only a random subset of the features at each split.
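As a rough illustration, the two sources of randomness in a random forest map onto the *bootstrap* and *max_features* arguments of scikit-learn's *RandomForestClassifier*. The values below (50 trees, depth 3, 2 features per split) and the `*_demo` names are assumptions chosen only for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative data; the *_demo names are hypothetical
X_demo, y_demo = load_iris(return_X_y=True)

rf_demo = RandomForestClassifier(
    n_estimators=50,   # number of trees in the forest
    max_depth=3,       # keep each tree shallow for this example
    max_features=2,    # features considered at each split
    bootstrap=True,    # each tree sees a bootstrap sample of the rows
    random_state=42,
)
rf_demo.fit(X_demo, y_demo)

# Inspect one of the fitted trees
first_tree = rf_demo.estimators_[0]
print("Trees in the forest:", len(rf_demo.estimators_))
print("Features considered per split:", first_tree.max_features_)
print("Depth of the first tree:", first_tree.get_depth())
```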

Boosting is a way to improve model performance by fitting models in sequence, where each new model effectively bumps up or 'boosts' the value placed on the residuals or misclassified observations left by the models before it.
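To make the idea of fitting to residuals concrete, here is a minimal sketch of boosting for a regression problem with squared error. It is a simplified stand-in, not the algorithm XGBoost actually runs; the synthetic data, the learning rate of 0.5, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_demo = np.sin(x_demo).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
# Start from a constant prediction: the mean of the response
boosted_pred = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - boosted_pred) ** 2)]

for _ in range(20):
    # Each round fits a very small tree to what the current model gets wrong
    residuals = y_demo - boosted_pred
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(x_demo, residuals)
    # Add a damped version of the new tree's predictions to the running model
    boosted_pred += learning_rate * stump.predict(x_demo)
    mse_history.append(np.mean((y_demo - boosted_pred) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.3f}, after: {mse_history[-1]:.3f}")
```

Each round only has to explain what is left over from the previous rounds, which is why many weak models can add up to a strong one.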

These methods can be computationally intensive for larger data sets.

In [None]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import accuracy_score
import xgboost as xgb
from xgboost import XGBClassifier

from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression


We'll start with the _iris_ data.

In [None]:
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

### Bagging, Boosting and Random Forests

Below are three different approaches.

The first is bagging using Logistic Regression so that the basic classifier is a logistic regression.  That estimator can be replaced with other
types of classifiers.  The default is a decision tree classifier.

The second is a classifier that uses XGBoost, so that is boosting.  The third is a random forest.

We'll start by looking at performance on the test set generated above for the _iris_ data.

The argument *n_estimators* is the number of models being generated by the bagging process.  For a Random Forest, it is the number of trees in the forest. 
Making this a large number can drastically increase the amount of time your code will take to run.
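A quick way to see the cost of *n_estimators* is to time the fit for a few values. The sketch below is only indicative: the timings depend on your machine, and the values 10, 100 and 300 and the name `fit_summary` are assumptions for this example.

```python
import time
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative data; the *_demo names are hypothetical
X_demo, y_demo = load_iris(return_X_y=True)

fit_summary = {}
for n_trees in (10, 100, 300):
    start = time.perf_counter()
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    forest.fit(X_demo, y_demo)
    elapsed = time.perf_counter() - start
    # Record how many trees were actually built and how long the fit took
    fit_summary[n_trees] = (len(forest.estimators_), elapsed)
    print(f"n_estimators={n_trees}: {len(forest.estimators_)} trees, {elapsed:.3f} seconds")
```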

In [None]:

# Train Bagging Classifier (using Logistic Regression as the base estimator)
bagging = BaggingClassifier(estimator=LogisticRegression(max_iter=200), n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)
bagging_preds = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_preds)
print("Bagging Classifier Accuracy:", bagging_accuracy)

In [None]:
# Train XGBoost Classifier
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)
xgb_accuracy = accuracy_score(y_test, xgb_preds)
print("XGBoost Accuracy:", xgb_accuracy)


In [None]:

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_preds)
print("Random Forest Accuracy:", rf_accuracy)

These classifiers do really well on these data.

So let's move to the Penguins data.  We are going to try classifying whether or not a penguin is female.

In [None]:
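penguins = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2301/Data/penguins.csv", na_values=['NA'])
# Remove rows with missing data
penguins.dropna(inplace=True)
# Encode the target: 1 = female, 0 = not female
penguins['sex01'] = (penguins['sex']=="female").astype(int)
# Show the first rows (the last expression in a cell is the one that gets displayed)
penguins.head()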

In [None]:
# Split the data into training and test sets
np.random.seed(4242)
feature_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[feature_names]
y = penguins['sex01']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:

# Train Bagging Classifier (using Logistic Regression as the base estimator)
bagging = BaggingClassifier(estimator=LogisticRegression(max_iter=200), n_estimators=100, random_state=42)
bagging.fit(X_train, y_train)
bagging_preds = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_preds)
print("Bagging Classifier Accuracy:", bagging_accuracy)

In [None]:
# Train XGBoost Classifier
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)
xgb_accuracy = accuracy_score(y_test, xgb_preds)
print("XGBoost Accuracy:", xgb_accuracy)

In [None]:
# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_preds)
print("Random Forest Accuracy:", rf_accuracy)

Now, of course, we have learned not to judge a model's performance on how it does on a single train/test split.  So let's do some cross-validation.

For *cross_val_score*, if we pass an _int_ or _None_ to *cv* as an input and the estimator is a classifier and y is either binary or multiclass, _StratifiedKFold_ is used. 

What is _StratifiedKFold_ you ask?  _StratifiedKFold_ is a way of generating the folds by stratified sampling so that the proportion from each class in every fold is approximately the same as in the full data.  For example,
suppose that we had 70% female penguins in our data.  Using stratified sampling, _StratifiedKFold_ would ensure that each of our folds for cross-validation would also have about 70% female penguins.  Slick, right?
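The 70% example can be checked directly. The sketch below uses made-up labels (70 ones and 30 zeros) and a placeholder feature matrix; `y_demo`, `X_demo` and `fold_shares` are hypothetical names used only here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 70% of the observations are in class 1
y_demo = np.array([1] * 70 + [0] * 30)
# Placeholder features; only the labels matter for building the folds
X_demo = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5)
# Share of class 1 in the held-out part of each fold
fold_shares = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print("Share of class 1 in each fold:", fold_shares)
```

With class counts that divide evenly by the number of folds, as here, every fold matches the overall proportion; otherwise the shares are as close as the counts allow.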

Note that the cross-validation below will take some time because, for each fold, we are fitting 100 different models.


In [None]:
np.random.seed(123)
bagging_pens = BaggingClassifier(estimator=LogisticRegression(max_iter=200), n_estimators=100, random_state=42)
cv_scores_bagging = cross_val_score(bagging_pens, X, y, cv=5)  # 5-fold cross-validation

print(f"Logistic Bagging cross-validation accuracy: {cv_scores_bagging.mean() * 100:.2f}%")

In [None]:
xgb_pens = xgb.XGBClassifier(n_estimators=100)
cv_scores_xgb = cross_val_score(xgb_pens, X, y, cv=5)  # 5-fold cross-validation

print(f"XGBoost cross-validation accuracy: {cv_scores_xgb.mean() * 100:.2f}%")

In [None]:
rf_pens = RandomForestClassifier(n_estimators=100)

cv_scores_rf = cross_val_score(rf_pens, X, y, cv=5)  # 5-fold cross-validation

print(f"Random Forest cross-validation accuracy: {cv_scores_rf.mean() * 100:.2f}%")

So the bagged LogisticRegression takes a while to fit 100 models, but it does slightly better than the other two models in this case.

Let's look at the importance of the variables for the bagging logistic model using permutation importance.

In [None]:
# Set a seed so we have similar results each time this cell is run
np.random.seed(250402)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
#Training the Model
bagging_pens.fit(X_train_scaled, y_train)
#Predicting on Test Set and Evaluating the Model
y_pred = bagging_pens.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the BaggingClassifier: {accuracy*100:.2f}%")

In [None]:
#Using Permutation Importance
perm_importance = permutation_importance(bagging_pens, X_test_scaled, y_test, n_repeats=100, random_state=42)

In [None]:
# Visualizing the Feature Importance

# Names of the penguin features, in the same column order used to fit the model
features = list(X.columns)

# Plotting the permutation importance results
plt.figure(figsize=(8, 6))
plt.barh(features, perm_importance.importances_mean, color='orange')
plt.title("Permutation Importance from Logistic BaggingClassifier")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.show()


Now recall that the permutation importance is based upon a single test set, so the values may change with a different train/test split.
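One way to see how much the values move is to repeat the calculation over several train/test splits. The sketch below does this on the _iris_ data with a plain scaled logistic regression rather than the bagged model above, purely to keep it fast; the five split seeds and the name `importance_by_split` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; the *_demo names are hypothetical
iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

importance_by_split = []
for split_seed in range(5):
    # A different train/test split for each seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=split_seed
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
    model.fit(X_tr, y_tr)
    result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
    importance_by_split.append(result.importances_mean)

# Rows are splits, columns are features
importance_by_split = np.array(importance_by_split)
summary = zip(
    iris_demo.feature_names,
    importance_by_split.mean(axis=0),
    importance_by_split.std(axis=0),
)
for name, mean, spread in summary:
    print(f"{name}: mean importance {mean:.3f}, std across splits {spread:.3f}")
```

A large spread relative to the mean suggests that the ranking of features from any single split should be read with caution.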

### Tasks

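1. Change the line above that is 'np.random.seed(250402)' and replace that with another seed.  Because the split, the bagging model and the permutation importance call all set *random_state=42*, change those values as well.  How does that change the permutation importance values?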

2. Repeat the above code on permutation importance for the XGBoost Model and the Random Forest Model.  How do your results differ?

3. Open the breast cancer data, and fit an XGBoost and a Random Forest model to those data.  How well do those models do with 8-fold cross validation?  Which is better?

4. Determine the permutation importance of the features from the XGBoost and the Random Forest.  

