# Ensemble Learning: Bagging

Link to the YouTube tutorial video: https://www.youtube.com/watch?v=RtrBtAKwcxQ&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=22&t=361s

# Load the dataset

In [61]:
import pandas as pd

df = pd.read_csv("diabetes.csv")

df.head()

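|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |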


# Data exploration

In [62]:
# check whether any column of the dataset contains missing values (NA/NaN)
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [63]:
# show the basic statistics for each of the columns of the dataset
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [64]:
# check whether the dependent variable (Outcome column) is imbalanced, i.e. whether the number of 0s differs greatly from the number of 1s
print(df.Outcome.value_counts())

# the output shows 500 samples with an outcome of 0 and 268 samples with an outcome of 1

ratio = df.Outcome.value_counts()[1]/df.Outcome.value_counts()[0]
print('The ratio of dependent variable data of the dataset: '+str(ratio))
# the ratio of 1s to 0s is 0.536, so the majority-to-minority ratio is about 2:1.
# This is a slight imbalance, not a major one. A major imbalance would be a ratio like 10:1 or 100:1.

Outcome
0    500
1    268
Name: count, dtype: int64
The ratio of dependent variable data of the dataset: 0.536
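
To put the 0.536 ratio in context, it can help to look at class proportions as well as raw counts. The following is a minimal sketch that rebuilds the label column from the counts reported above rather than reading `diabetes.csv`; the names `outcome`, `minority_to_majority` and `majority_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical label column rebuilt from the reported counts (500 zeros, 268 ones).
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()                      # absolute count per class
proportions = outcome.value_counts(normalize=True)   # share of each class

# Same ratio as in the cell above: minority count divided by majority count.
minority_to_majority = counts.min() / counts.max()

# Share of the majority class: the accuracy of a model that always predicts 0.
majority_share = proportions.max()

print(counts.to_dict())
print(round(minority_to_majority, 3))
print(round(majority_share, 3))
```

The majority share is a useful baseline: any accuracy reported later should be read against what a trivial "always predict 0" model would achieve on the same data.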


# Data preprocessing

## Load the independent and dependent variables

In [65]:
# store the independent variables (features) of the dataset in X
X = df.drop('Outcome',axis='columns')

# store the dependent variable (target) of the dataset in Y
Y = df.Outcome

## Scale the features (independent variables)

In [66]:
# The features (independent variables) have very different ranges (compare their maximum values above), so they are on different scales. To be on the safe side, scale them.

from sklearn.preprocessing import StandardScaler

# create a StandardScaler as the scaler
scaler = StandardScaler()

# scale the features and save them to X_scaled variable
X_scaled = scaler.fit_transform(X)

# show the first 3 rows of the X_scaled variable
X_scaled[:3]


array([[ 0.63994726,  0.84832379,  0.14964075,  0.90726993, -0.69289057,
         0.20401277,  0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575,  0.53090156, -0.69289057,
        -0.68442195, -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, -1.28821221, -0.69289057,
        -1.10325546,  0.60439732, -0.10558415]])
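
As a rough illustration of what `StandardScaler` does, the sketch below compares its output with a manual z-score computed column by column. The small array `X_demo` is made up for the example (it only borrows a few values from the first two columns shown earlier) and is not the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative two-column array; not the real dataset.
X_demo = np.array([[6.0, 148.0], [1.0, 85.0], [8.0, 183.0], [1.0, 89.0]])

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# np.std uses the population standard deviation (ddof=0) by default, as StandardScaler does.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_demo_scaled, manual))
print(X_demo_scaled.mean(axis=0).round(6), X_demo_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the scaled values above are small numbers centred around zero.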

## Split the dataset into train and test sets using train_test_split method

In [67]:
from sklearn.model_selection import train_test_split

'''
Since the dependent variable of the dataset is slightly imbalanced, use stratify
to ensure that both the train set and the test set keep the class ratio calculated above.
random_state is specified for reproducibility (every time you run train_test_split, you get the same train and test sets).
'''
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, stratify=Y, random_state=10)

print('The train set consists of '+str(Y_train.shape[0])+' samples\nThe test set consists of '+str(Y_test.shape[0])+' samples\n')

print('The ratio of dependent variable data in train set: '+str(Y_train.value_counts()[1]/Y_train.value_counts()[0]))
print('The ratio of dependent variable data in test set: '+str(Y_test.value_counts()[1]/Y_test.value_counts()[0]))


The train set consists of 576 samples
The test set consists of 192 samples

The ratio of dependent variable data in train set: 0.536
The ratio of dependent variable data in test set: 0.536
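
The effect of `stratify` can be checked on synthetic labels without the CSV file. This is a small sketch under the assumption that the labels have the same 500/268 split as above; `X_demo`, `y_demo` and the ratio variables are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (500 zeros, 268 ones).
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=10
)

# Ratio of 1s to 0s in each split; with stratify both should stay close to 0.536.
train_ratio = (y_tr == 1).sum() / (y_tr == 0).sum()
test_ratio = (y_te == 1).sum() / (y_te == 0).sum()

print(len(y_tr), len(y_te))
print(round(train_ratio, 3), round(test_ratio, 3))
```

Without `stratify`, the two ratios would generally drift apart from split to split, which matters more as the imbalance grows.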


# Develop the machine learning model

## Using Decision Tree Classifier

### Alone, with cross validation

In [68]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(), X_scaled, Y, cv=5)
print('The accuracy of the model at each iteration of cross validation: ',scores)

print('The final/mean accuracy of the model from the cross validation: ',scores.mean())

The accuracy of the model at each iteration of cross validation:  [0.68181818 0.68181818 0.66883117 0.79084967 0.69281046]
The final/mean accuracy of the model from the cross validation:  0.7032255326372973
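
For a classifier, passing an integer such as `cv=5` to `cross_val_score` uses stratified k-fold splitting. The sketch below spells out that loop by hand on synthetic data from `make_classification` (an assumption made to keep the example self-contained) and compares it with `cross_val_score`. A fixed `random_state` is set on the tree here only so that both runs are deterministic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Manual 5-fold loop: fit on four folds, score on the held-out fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(tree.score(X_demo[test_idx], y_demo[test_idx]))

# The same evaluation in one call.
auto_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5)

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

Each score is the accuracy on one held-out fold, and the mean over the five folds is the figure reported as the final accuracy.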


### Using Bagging Classifier, with Decision Tree Classifier as the base estimator

Explanation of the Bagging Classifier parameters (16:35 -> 20:07): https://www.youtube.com/watch?v=RtrBtAKwcxQ&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=22&t=361s


#### Without cross validation

In [69]:
from sklearn.ensemble import BaggingClassifier

# create the Bagging Classifier
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,
    random_state=0
)

# train the Bagging Classifier
bag_model.fit(X_train,Y_train)

# The OOB score is computed from the train set only, not from X_test and Y_test:
# each training sample is predicted using only the estimators whose bootstrap subset did not contain that sample.
print('The Out-of-Bag (OOB) score of the trained Bagging Classifier: ',bag_model.oob_score_)

print('The score of the trained Bagging Classifier: ',bag_model.score(X_test,Y_test))

# Findings:
# The single decision tree reached a mean cross-validation accuracy of about 0.70, while the bagged
# model (100 decision trees as base estimators) reaches about 0.75 (OOB) and 0.78 (test set).
# These scores come from different evaluation schemes, so treat the gap as indicative; the
# cross-validated run in the next section gives the like-for-like comparison.
# A decision tree is an unstable (high-variance) classifier: small changes in the training data can
# produce a very different tree. Bagging combines the predictions of many such trees, which reduces the variance.
# According to the tutorial, a dataset with many missing values can also lead to a high-variance model.
# Whenever the model has high variance, it makes sense to try a bagging classifier.

The Out-of-Bag (OOB) score of the trained Bagging Classifier:  0.7534722222222222
The score of the trained Bagging Classifier:  0.7760416666666666
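
To make the bagging and OOB ideas more concrete, here is a simplified hand-rolled sketch, not scikit-learn's actual implementation. It assumes synthetic data from `make_classification`, 25 trees, and hard majority voting (`BaggingClassifier` itself averages predicted probabilities when the base estimator supports them). All names in it are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

n_samples = len(y_demo)
n_estimators = 25
subset_size = int(0.8 * n_samples)  # mirrors max_samples=0.8

# oob_votes[i, c]: number of trees that did NOT see sample i and predicted class c for it.
oob_votes = np.zeros((n_samples, 2), dtype=int)

for _ in range(n_estimators):
    # Bootstrap: draw row indices with replacement.
    idx = rng.integers(0, n_samples, size=subset_size)
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])

    # Rows never drawn for this tree are its out-of-bag samples.
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[idx] = False
    preds = tree.predict(X_demo[oob_mask])
    oob_votes[np.flatnonzero(oob_mask), preds] += 1

# Score only samples that were out-of-bag for at least one tree.
has_vote = oob_votes.sum(axis=1) > 0
oob_pred = oob_votes.argmax(axis=1)
oob_accuracy = (oob_pred[has_vote] == y_demo[has_vote]).mean()

print(round(oob_accuracy, 3))
```

Because every prediction comes from trees that never saw the sample, the OOB accuracy behaves like a built-in validation score and needs no separate hold-out set.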

#### With cross validation

In [70]:
# create the Bagging Classifier
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,
    random_state=0
)

scores = cross_val_score(bag_model, X_scaled, Y, cv=5)
print('The accuracy of the model at each iteration of cross validation: ',scores)

print('The final/mean accuracy of the model from the cross validation: ',scores.mean())

The accuracy of the model at each iteration of cross validation:  [0.75324675 0.72727273 0.74675325 0.82352941 0.74509804]
The final/mean accuracy of the model from the cross validation:  0.7591800356506239


## Using Random Forest Classifier alone, with cross validation

In [71]:
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(), X_scaled, Y, cv=5)

print('The final/mean accuracy of the model from the cross validation: ',scores.mean())

# Findings:
# The random forest classifier gives a score (about 0.76) similar to that of the bagging classifier
# with a decision tree classifier as the base estimator (about 0.76).
# This is because a random forest uses the bagging technique internally: it trains each tree on a
# bootstrap sample, and additionally considers only a random subset of the features at each split.

The final/mean accuracy of the model from the cross validation:  0.7617944147355913