<h2 align='center'>Ensemble Learning: Bagging Tutorial</h2>

**We will use a heart disease dataset (`heart.csv`) to predict whether a person has heart disease based on features such as resting blood pressure, cholesterol, and age. We will first train a standalone model and then apply the bagging ensemble technique to see how much it improves the model's performance.**

Dataset credit (Pima Indians Diabetes): https://www.kaggle.com/gargmanas/pima-indians-diabetes

In [8]:
import pandas as pd

# df = pd.read_csv("diabetes.csv")
# df = pd.read_excel("diabetes.xlsx")
df = pd.read_csv("../../Datasets/heart.csv")
df.head()
df.isnull().sum()
df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [9]:
df.HeartDisease.value_counts()

HeartDisease
1    508
0    410
Name: count, dtype: int64

In [10]:
# convert object (string) columns to the pandas category dtype
for col, col_type in df.dtypes.items():
    if col_type == 'O':
        df[col] = df[col].astype('category')
        
# one-hot encode categorical columns
df = pd.get_dummies(df, drop_first=True)

df.head()


Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_M,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ExerciseAngina_Y,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,0,True,True,False,False,True,False,False,False,True
1,49,160,180,0,156,1.0,1,False,False,True,False,True,False,False,True,False
2,37,130,283,0,98,0.0,0,True,True,False,False,False,True,False,False,True
3,48,138,214,0,108,1.5,1,False,False,False,False,True,False,True,True,False
4,54,150,195,0,122,0.0,0,True,False,True,False,True,False,False,False,True


The classes are slightly imbalanced (508 positive vs. 410 negative cases), but the gap is small enough that we will not correct for it.
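To put a number on that imbalance, the class shares can be computed with `value_counts(normalize=True)`. The snippet below is a minimal sketch that rebuilds a stand-in target Series from the counts printed earlier (508 and 410); the names `target`, `class_share`, and `imbalance_ratio` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Stand-in for df.HeartDisease, rebuilt from the value counts shown earlier
# (508 positives, 410 negatives). With the real data, use df.HeartDisease directly.
target = pd.Series([1] * 508 + [0] * 410, name="HeartDisease")

# normalize=True returns class shares instead of raw counts
class_share = target.value_counts(normalize=True)

# Ratio of the majority class to the minority class
imbalance_ratio = class_share.max() / class_share.min()

print(class_share.round(3))
print("majority/minority ratio:", round(imbalance_ratio, 2))
```

A ratio this close to 1 is why the split later in the notebook only needs `stratify=y` to keep the class proportions the same in the train and test sets.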

<h3>Train test split</h3>

In [11]:
X = df.drop("HeartDisease",axis="columns")
y = df.HeartDisease

In [12]:
X

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,Sex_M,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ExerciseAngina_Y,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,True,True,False,False,True,False,False,False,True
1,49,160,180,0,156,1.0,False,False,True,False,True,False,False,True,False
2,37,130,283,0,98,0.0,True,True,False,False,False,True,False,False,True
3,48,138,214,0,108,1.5,False,False,False,False,True,False,True,True,False
4,54,150,195,0,122,0.0,True,False,True,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,110,264,0,132,1.2,True,False,False,True,True,False,False,True,False
914,68,144,193,1,141,3.4,True,False,False,False,True,False,False,True,False
915,57,130,131,0,115,1.2,True,False,False,False,True,False,True,True,False
916,57,130,236,0,174,0.0,False,True,False,False,False,False,False,True,False


In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled[:1]

array([[-1.4331398 ,  0.41090889,  0.82507026, -0.55134134,  1.38292822,
        -0.83243239,  0.51595242,  2.07517671, -0.53283777, -0.22967867,
         0.81427482, -0.49044933, -0.8235563 , -1.00218103,  1.15067399]])

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=10)

In [17]:
X_train.shape
X_test.shape
y_train.value_counts()
y_test.value_counts()

HeartDisease
1    127
0    103
Name: count, dtype: int64

<h3>Train using a standalone model</h3>

In [18]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
scores

array([0.74456522, 0.7173913 , 0.77717391, 0.65027322, 0.66120219])

In [97]:
scores.mean()

0.7123334182157711

<h3>Train using Bagging</h3>

In [24]:
from sklearn.ensemble import BaggingClassifier
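# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; in releases where the old
# name has been removed, passing `base_estimator` raises
# TypeError: BaggingClassifier.__init__() got an unexpected keyword argument 'base_estimator'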

bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(), 
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)
bag_model.fit(X_train, y_train)
bag_model.oob_score_

0.9464285714285714

In [99]:
bag_model.score(X_test, y_test)

0.7760416666666666

In [100]:
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(), 
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)
scores = cross_val_score(bag_model, X, y, cv=5)
scores

array([0.75324675, 0.72727273, 0.74675325, 0.82352941, 0.73856209])

In [101]:
scores.mean()

0.7578728461081402

Bagging gives a somewhat higher mean cross-validation score than the standalone decision tree (about 0.758 vs. 0.712).
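The mechanism behind bagging can be sketched by hand: each tree is fit on a bootstrap sample (rows drawn with replacement), and the ensemble predicts by majority vote. The code below is an illustrative sketch on synthetic data from `make_classification`, not the tutorial's dataset and not the internals of `BaggingClassifier` (which averages predicted probabilities when the base estimator provides them). `manual_bagging_predict` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

def manual_bagging_predict(X_train, y_train, X_test,
                           n_estimators=25, max_samples=0.8, seed=0):
    """Fit trees on bootstrap samples and combine them by majority vote."""
    rng = np.random.default_rng(seed)
    n_draw = int(max_samples * len(X_train))
    votes = np.zeros((n_estimators, len(X_test)), dtype=int)
    for i in range(n_estimators):
        # Bootstrap: sample row indices with replacement
        idx = rng.integers(0, len(X_train), size=n_draw)
        tree = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
        votes[i] = tree.predict(X_test)
    # Binary 0/1 labels: predict 1 when more than half of the trees vote 1
    # (an odd n_estimators avoids ties)
    return (votes.mean(axis=0) > 0.5).astype(int)

y_pred = manual_bagging_predict(Xtr, ytr, Xte)
single_tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)

print("single tree accuracy:   ", (single_tree.predict(Xte) == yte).mean())
print("manual bagging accuracy:", (y_pred == yte).mean())
```

Because every tree sees a different resample, their individual errors are partly uncorrelated, and voting tends to cancel them out. That is the variance reduction bagging relies on, although the size of the gain depends on the data.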

<h3>Train using Random Forest</h3>

In [102]:
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=5)
scores.mean()

0.7617689500042442

<h3>Bagging vs Random Forest</h3>

In [23]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 1. Generic Bagging
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=10,
    random_state=42
)
bag.fit(X_train, y_train)
print("Bagging Accuracy:", accuracy_score(y_test, bag.predict(X_test)))

# 2. Random Forest
rf = RandomForestClassifier(
    n_estimators=10,
    random_state=42
)
rf.fit(X_train, y_train)
print("Random Forest Accuracy:", accuracy_score(y_test, rf.predict(X_test)))

Bagging Accuracy: 0.9210526315789473
Random Forest Accuracy: 0.8947368421052632
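With a 38-row test set, the gap between 0.921 and 0.895 is a single misclassified sample, so one split is a noisy basis for comparison. The main structural difference is that a random forest also draws a random subset of features at each split (`max_features="sqrt"` by default in `RandomForestClassifier`), while a bagged default decision tree considers every feature. A hedged sketch of a cross-validated comparison follows; the names `X_iris`, `y_iris`, `models`, and `cv_means` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

models = {
    # The base estimator is passed positionally, which sidesteps the
    # base_estimator/estimator keyword rename between scikit-learn versions.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42),
    # Random forest: bootstrap rows plus a random feature subset at every split
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=42),
}

# 5-fold cross-validation averages over several train/test splits
cv_means = {
    name: cross_val_score(model, X_iris, y_iris, cv=5).mean()
    for name, model in models.items()
}

for name, score in cv_means.items():
    print(f"{name}: {score:.3f}")
```

On a dataset as small and easy as iris, the two methods usually land close together. The feature subsampling in a random forest tends to matter more when there are many correlated features.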
