<a href="https://colab.research.google.com/github/thabetAljbreen/RepoT5/blob/main/Bagging_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bagging Exercise

In this exercise, you will explore the concept of Bagging (Bootstrap Aggregating) and implement it using a random forest model. Bagging is an ensemble technique mainly used for reducing the variance of a predictive model and preventing overfitting. The main idea behind bagging is to combine multiple learners in a way that the ensemble model performs better than an individual model.

## Dataset
We will use the Iris dataset for this exercise. The Iris dataset is a classic dataset from the field of machine learning, containing measurements for iris flowers of three different species. **Feel free to use another dataset!!**

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement Bagging models.
4. Evaluate the models performance.

Please fill in the following code blocks to complete the exercise.


In [2]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Load the dataset


In [3]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation',
           'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
data = pd.read_csv(url, header=None, names=columns, na_values=' ?')

# Preprocess the data (if necessary)

In [5]:
data.isnull().sum()
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [4]:
data = data.dropna()

In [5]:
data = pd.get_dummies(data, drop_first=True)

# Split the Dataset

In [6]:
X = data.drop('income_ >50K', axis=1)
y = data['income_ >50K']
X
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


# Initialize and Train the Classifiers

## Random Forest
Initialize and train a Random Forest classifier.

In [9]:
# from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier

base_estimator=RandomForestClassifier()
bagging_classifier=BaggingClassifier(base_estimator, n_estimators=50, random_state=42)

bagging_classifier.fit(X_train, y_train)


### Evaluate the model performance

In [16]:
predictions = bagging_classifier.predict(X_test)
accuracy=accuracy_score(y_test, predictions)
print(f'Random forest Classifier Model Accuracy: {accuracy * 100:.2f}%')

Random forest Classifier Model Accuracy: 77.02%


## Bagging Meta-estimator
Initialize a K-Nearest Neighbors classifier and use it as the base estimator for the Bagging classifier.

In [10]:
from sklearn.neighbors import KNeighborsClassifier
base_estimator = KNeighborsClassifier()
bagging_classifier = BaggingClassifier(base_estimator, n_estimators=50, random_state=42)
bagging_classifier.fit(X_train, y_train)


### Evaluate the model performance

In [15]:
predictions = bagging_classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Bagging Classifier Model Accuracy: {accuracy * 100:.2f}%')

Bagging Classifier Model Accuracy: 77.02%


## Pasting
Initialize a Decision Tree classifier and use it as the base estimator for a Bagging classifier with Pasting (without replacement).

In [11]:
from sklearn.tree import DecisionTreeClassifier
base_estimator = DecisionTreeClassifier()
pasting_classifier = BaggingClassifier(base_estimator, n_estimators=50, max_samples=0.7, bootstrap=False, random_state=42)
pasting_classifier.fit(X_train, y_train)


### Evaluate the model performance

In [18]:
predictions = pasting_classifier.predict(X_test)
print(f'Pasting Classifier Model Accuracy: {accuracy * 100:.2f}%')


Pasting Classifier Model Accuracy: 77.02%


## Roughly Balanced Bagging (RBB)
Implement Roughly Balanced Bagging by manually creating balanced bootstrap samples and aggregating predictions from multiple Decision Tree classifiers.

In [7]:
import numpy as np

n_estimators = 100
ensemble_preds = np.zeros((n_estimators, len(X_test)))
ensemble_models = []

In [12]:
for i in range(n_estimators):

    pos_indices = np.where(y_train == 1)[0]
    neg_indices = np.where(y_train == 0)[0]
    chosen_pos_indices = np.random.choice(pos_indices, size=len(pos_indices), replace=True)

    chosen_neg_indices = np.random.choice(neg_indices, size=len(pos_indices), replace=True)
    balanced_sample_indices = np.concatenate([chosen_pos_indices, chosen_neg_indices])
    np.random.shuffle(balanced_sample_indices)

    X_train_balanced = X_train.iloc[balanced_sample_indices]
    y_train_balanced = y_train.iloc[balanced_sample_indices]
    tree_clf = DecisionTreeClassifier(random_state=i)
    tree_clf.fit(X_train_balanced, y_train_balanced)
    ensemble_models.append(tree_clf)
    ensemble_preds[i] = tree_clf.predict(X_test)

### Evaluate the model performance

In [13]:
final_preds = np.round(np.mean(ensemble_preds, axis=0))

accuracy = accuracy_score(y_test, final_preds)
print(f'Roughly Balanced Bagging Model Accuracy: {accuracy:.2f}')

Roughly Balanced Bagging Model Accuracy: 0.84
