# <span style="color:Brown"> Classification Using Bagging </span>

#### <span style="color:black"> Author : Sivaprasad Puthumadthil rameshan nair </span>

<br>

## <span style="color:blue"> Aim: </span>

Implement a basic bagging ensemble, apply it to a classification problem, and compare the performance of the base model with that of the bagging ensemble.


<br>


#### What is bagging?

Bagging (short for bootstrap aggregating) is an ensemble machine learning technique that trains multiple models independently on different random subsets of the training data, typically bootstrap samples drawn with replacement, and then combines their predictions (by majority vote in the case of classification). The main idea of bagging is to reduce overfitting and improve the performance and robustness of the model.
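
The idea can be sketched by hand in a few lines. This is only a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the heart disease data, the helper names `fit_bagging` and `predict_bagging` are hypothetical, and each tree sees a full-size bootstrap sample (the notebook's own `BaggingClassifier` call further down uses different settings).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart disease dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)


def fit_bagging(X, y, n_estimators=10, seed=0):
    """Train n_estimators trees, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        # Bootstrap: draw len(X) row indices with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models


def predict_bagging(models, X):
    """Combine the trees' predictions by majority vote."""
    # votes has shape (n_estimators, n_samples)
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    # For each sample (one column of votes), keep the most frequent label.
    return np.array([np.bincount(col).argmax() for col in votes.T])


models = fit_bagging(X_tr, y_tr)
y_hat = predict_bagging(models, X_te)
print("hand-rolled bagging accuracy:", np.mean(y_hat == y_te))
```

The two functions correspond to the two halves of the name: `fit_bagging` does the bootstrap part, `predict_bagging` does the aggregating part.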


<br>

#### The various steps involved in the project:

- Data loading and analysis
- Prediction using a base classifier model
- Prediction using a bagging classifier

<br>

#### Concepts used :

Bagging.

<br>
    
## Data :

Heart Disease UCI Dataset (https://archive.ics.uci.edu/dataset/45/heart+disease)

<br>
<br>





## <span style="color:blue"> Data loading and analysis </span>

Import all the required packages for the project.

In [40]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score , confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
from sklearn.utils import column_or_1d
import seaborn as sns

#### Fetch data from repo

In [4]:
pip install ucimlrepo

Note: you may need to restart the kernel to use updated packages.


In [5]:
from ucimlrepo import fetch_ucirepo 

heart_disease = fetch_ucirepo(id=45) 

## Analysing the Data :
<br>
<br>

In this section we check the number of rows and columns and get a first look at the data.


In [6]:
# get feature matrix (input data)
X = heart_disease.data.features 

# get target variable (output)
y = heart_disease.data.targets 
print("shape of X", X.shape)
print("shape of y", y.shape)

shape of X (303, 13)
shape of y (303, 1)


In [7]:
num_rows, num_columns = X.shape
print(f"The number of rows {num_rows} and number of columns {num_columns}")

The number of rows 303 and number of columns 13


In [8]:
print("The first 5 rows in data set :\n",X[0:5])

The first 5 rows in data set :
    age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   1       145   233    1        2      150      0      2.3      3   
1   67    1   4       160   286    0        2      108      1      1.5      2   
2   67    1   4       120   229    0        2      129      1      2.6      2   
3   37    1   3       130   250    0        0      187      0      3.5      3   
4   41    0   2       130   204    0        2      172      0      1.4      1   

    ca  thal  
0  0.0   6.0  
1  3.0   3.0  
2  2.0   7.0  
3  0.0   3.0  
4  0.0   3.0  


In [9]:
X.info()

y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
dtypes: float64(3), int64(10)
memory usage: 30.9 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   num     303 non-null    int64
dtypes: int64(1)
memory usage: 2.5 KB


In [10]:
X.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 299.0 | 301.0 |
| mean | 54.438944 | 0.679868 | 3.158416 | 131.689769 | 246.693069 | 0.148515 | 0.990099 | 149.607261 | 0.326733 | 1.039604 | 1.60066 | 0.672241 | 4.734219 |
| std | 9.038662 | 0.467299 | 0.960126 | 17.599748 | 51.776918 | 0.356198 | 0.994971 | 22.875003 | 0.469794 | 1.161075 | 0.616226 | 0.937438 | 1.939706 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 50% | 56.0 | 1.0 | 3.0 | 130.0 | 241.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 2.0 | 0.0 | 3.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 275.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 7.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 3.0 | 7.0 |


<br>
Check for null entries. If there are any, let's remove the affected rows.
<br>

In [11]:
print("shape of X", X.shape)
print("shape of y", y.shape)

# flag every row of X that contains at least one null (NaN) value
rows_with_null_X = X.isna().any(axis=1)

# remove the flagged rows from both X and y so that the two stay aligned
X_cleaned = X[~rows_with_null_X]
y_cleaned = y[~rows_with_null_X]
print("shape of X after cleaning", X_cleaned.shape)
print("shape of y after cleaning", y_cleaned.shape)

shape of X (303, 13)
shape of y (303, 1)
shape of X after cleaning (297, 13)
shape of y after cleaning (297, 1)


As you can see, we removed 6 rows that had null values (4 in `ca` and 2 in `thal`, matching the non-null counts reported by `X.info()`).
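
To see how such a row mask behaves, here is a small sketch on a made-up toy frame (the values and the names `X_toy`, `y_toy` are illustrative assumptions, not taken from the dataset). It counts nulls per column first, then drops the affected rows from features and target together.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the feature and target frames (values are made up).
X_toy = pd.DataFrame({"ca": [0.0, np.nan, 2.0, 1.0],
                      "thal": [3.0, 7.0, np.nan, 6.0]})
y_toy = pd.DataFrame({"num": [0, 1, 2, 0]})

# Number of missing values in each column.
nulls_per_column = X_toy.isna().sum()
print(nulls_per_column)

# True for every row that has at least one missing feature value.
has_null = X_toy.isna().any(axis=1)

# Apply the same mask to both frames so rows stay paired.
X_toy_clean = X_toy[~has_null]
y_toy_clean = y_toy[~has_null]

print("rows kept:", X_toy_clean.index.tolist())
```

Using one shared mask matters: calling `dropna()` on the features alone would leave the target frame with rows that no longer have a matching feature row.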

## <span style="color:blue"> Prediction using Base classifier model </span>

<br>

Now let's split the data into training and test sets.

In [12]:
# split the dataset to train and test data for training and testing the model
# test size = 20% and train size will be 80%
X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)

print(f"shape of X train {X_train.shape} and shape of y train {y_train.shape}")
print(f"shape of X test {X_test.shape} and shape of y test {y_test.shape}")

shape of X train (237, 13) and shape of y train (237, 1)
shape of X test (60, 13) and shape of y test (60, 1)



Now let's train a basic classifier (`DecisionTreeClassifier`) and use it to predict the test data.

In [13]:
# let's define the base classifier and train the model; random_state is set to 42 so that
# the random decisions made while building the tree remain the same on every run

base_classifier = DecisionTreeClassifier(random_state=42)
base_classifier.fit(X_train, y_train)

#now let's test the model and get the predicted y value.

y_predicted = base_classifier.predict(X_test)

# let's calculate the accuracy score 
accuracy_score_base_classifier = accuracy_score(y_test, y_predicted)
print(f'Accuracy of Base Classifier: {accuracy_score_base_classifier}')


Accuracy of Base Classifier: 0.48333333333333334
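
A single accuracy number hides which classes are being confused. The notebook imports `confusion_matrix` but does not call it; the sketch below shows one way it could be used, on synthetic three-class data (an assumption made so the example runs without downloading anything, so the numbers say nothing about the heart disease data).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class stand-in data (an assumption; not the heart disease dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=13, n_informative=5,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred = tree.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred, labels=[0, 1, 2])
acc = accuracy_score(y_te, pred)

print(cm)
# Accuracy is the share of samples on the diagonal of the matrix.
print("accuracy:", acc, "diagonal share:", cm.trace() / cm.sum())
```

Off-diagonal cells show which true class is mistaken for which predicted class, which a single accuracy value cannot.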


##  <span style="color:blue"> Prediction using Bagging classifier </span>
<br>

Define a bagging classifier on top of the base classifier. scikit-learn provides this through its built-in `BaggingClassifier` class.

In [34]:
# define the bagging classifier with the base classifier (DecisionTree):
# 10 trees, each trained on a random sample of 80% of the training rows
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=10, max_samples=0.8, random_state=422)

# reshape the y to be a 1d data
y_train_reshaped = y_train.to_numpy().ravel()
y_test_reshaped = y_test.to_numpy().ravel()

# train the model
bagging_classifier.fit(X_train, y_train_reshaped)
print("reshaped y train ",y_train_reshaped.shape)

# predict the test values
y_pred_bagging = bagging_classifier.predict(X_test)

#calculate the accuracy of the bagging model
accuracy_bagging = accuracy_score(y_test_reshaped, y_pred_bagging)
print(f'Accuracy after bagging : {accuracy_bagging}')


reshaped y train  (237,)
Accuracy after bagging : 0.5166666666666667


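As we can see, the accuracy increased after bagging, from about 0.483 (29 of 60 test samples correct) to about 0.517 (31 of 60). Both scores are low in absolute terms, which is at least partly because the target `num` takes five values (0 to 4), so this is a five-class problem rather than a yes/no prediction.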
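
With only 60 test rows, each sample moves the accuracy by more than a point and a half. One inexpensive cross-check that bagging offers is the out-of-bag (OOB) score: every tree is evaluated on the training rows its bootstrap sample left out. The sketch below is an illustration on synthetic data, with `n_estimators=50` chosen arbitrarily; it is not a rerun of the notebook's experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart disease dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=13, n_informative=5,
                                   random_state=0)

# oob_score=True asks scikit-learn to score each training row using only
# the trees that did not see that row during fitting.
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=50, oob_score=True, random_state=0)
bag.fit(X_syn, y_syn)

print("number of trees:", len(bag.estimators_))
print("out-of-bag accuracy estimate:", bag.oob_score_)
```

The OOB estimate needs no separate hold-out set, so it can be read alongside the test accuracy rather than in place of it.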

### How can the bagging classifier increase the prediction accuracy?

1. Variance reduction: by training multiple base classifiers on different subsets of the data and combining them, bagging reduces the variability of the predictions.
2. Less dependence on any one subset: aggregating over many subsets mitigates the quirks a model picks up from a specific training data subset.
3. Better generalization: a single decision tree tends to overfit, and aggregating many trees reduces that overfitting.
4. Stability: small changes in the training data are less likely to significantly impact the overall predictions.
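
The variance and stability points can be probed empirically. The sketch below perturbs the training set several times by resampling it, refits a model each time, and measures how much the predictions for a fixed test set change. Everything here is an assumption made for illustration: the synthetic data, the helper name `prediction_spread`, and the choice of 10 repeats and 25 trees. A lower spread for the ensemble is the typical outcome but is not guaranteed on every dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary stand-in data (an assumption; not the heart disease dataset).
X_syn, y_syn = make_classification(n_samples=400, n_features=13, n_informative=5,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)


def prediction_spread(make_model, n_repeats=10, seed=0):
    """Mean per-sample variance of test predictions over resampled training sets."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_repeats):
        # Perturb the training data by resampling rows with replacement.
        idx = rng.integers(0, len(X_tr), size=len(X_tr))
        preds.append(make_model().fit(X_tr[idx], y_tr[idx]).predict(X_te))
    # For 0/1 labels the variance per test sample lies between 0 and 0.25.
    return np.var(np.stack(preds), axis=0).mean()


tree_spread = prediction_spread(lambda: DecisionTreeClassifier(random_state=0))
bag_spread = prediction_spread(
    lambda: BaggingClassifier(DecisionTreeClassifier(random_state=0),
                              n_estimators=25, random_state=0))

print(f"single tree spread : {tree_spread:.4f}")
print(f"bagged trees spread: {bag_spread:.4f}")
```

A spread of 0 would mean every refit model predicts the same label for every test sample; larger values mean the predictions depend more on which training rows happened to be drawn.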

##  <span style="color:blue"> Conclusion </span>
<br>

In this project, we loaded and analyzed the Heart Disease UCI dataset to understand its structure and features, and removed the 6 rows that contained null values. We started with a base classifier, a Decision Tree Classifier, to establish a baseline for comparison; it reached an accuracy of about 0.483 on the test set. We then trained a Bagging Classifier made of 10 such trees, each fitted on a different random subset of the training data, which reduces the variability of individual predictions. On the same test set the bagging ensemble reached an accuracy of about 0.517, a modest improvement over the base classifier.