### Objective :

The objective of this case study is to demonstrate an **ensembling** technique known as **Bagging**, or **Bootstrap Aggregation**. Ensembling combines multiple ***weak classifiers*** in order to obtain a ***meta-classifier*** that is substantially more powerful than any of the individual classifiers.

Bagging involves the following steps :

1) We choose the type and number of classifiers to build the ensemble.

2) For each classifier in the ensemble, we draw a fixed number of instances from the training data with replacement. The dataset thus obtained is known as a bootstrapped dataset.

3) Each classifier is trained on the corresponding bootstrapped dataset.

4) Instances from the test set are fed to the ensemble. Each instance is assigned the class label predicted by the largest number of classifiers in the ensemble (a majority vote).
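
The four steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation used later in this case study: `MeanThresholdStump`, `fit_bagging`, `predict_bagging` and the toy data are hypothetical names and values introduced here, and the weak learner is a simple one-feature threshold rule rather than a decision tree.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, n_samples, rng):
    """Step 2: draw n_samples (x, label) pairs with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(n_samples)]
    return [X[i] for i in idx], [y[i] for i in idx]


class MeanThresholdStump:
    """Hypothetical weak learner for 1-D data with labels 0 and 1."""

    def fit(self, X, y):
        zeros = [x for x, label in zip(X, y) if label == 0]
        ones = [x for x, label in zip(X, y) if label == 1]
        if not zeros or not ones:
            # Degenerate bootstrap sample with a single class: always predict it.
            self.constant = y[0]
            return self
        self.constant = None
        mean_zeros = sum(zeros) / len(zeros)
        mean_ones = sum(ones) / len(ones)
        # Split halfway between the two class means.
        self.threshold = (mean_zeros + mean_ones) / 2
        self.high_label = 1 if mean_ones >= mean_zeros else 0
        return self

    def predict_one(self, x):
        if self.constant is not None:
            return self.constant
        return self.high_label if x > self.threshold else 1 - self.high_label


def fit_bagging(X, y, n_estimators, max_samples, seed=0):
    """Steps 1-3: train each classifier on its own bootstrapped dataset."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_estimators):
        X_boot, y_boot = bootstrap_sample(X, y, max_samples, rng)
        ensemble.append(MeanThresholdStump().fit(X_boot, y_boot))
    return ensemble


def predict_bagging(ensemble, x):
    """Step 4: return the label predicted by the most classifiers."""
    votes = Counter(clf.predict_one(x) for clf in ensemble)
    return votes.most_common(1)[0][0]


# Toy data (made up): class 0 lies in [0, 1), class 1 lies in [2, 3).
data_rng = random.Random(42)
X_toy = [data_rng.random() for _ in range(20)] + [2 + data_rng.random() for _ in range(20)]
y_toy = [0] * 20 + [1] * 20

ensemble = fit_bagging(X_toy, y_toy, n_estimators=25, max_samples=30)
print(predict_bagging(ensemble, 0.5), predict_bagging(ensemble, 2.5))
```

Note that scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them and falls back to voting otherwise, so the hard vote shown here follows the description above rather than the library's exact behaviour.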

### Data :

**Data Source** : https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data

**About the Data** : The dataset is known in the machine learning community as the 'Abalone Dataset'. Abalones are small to very large sea snails. In this case study, the dataset is used for classification: predicting the sex (male, female or infant) of an abalone from attributes such as length, diameter and weight.

**Input Attributes**:

1) Length : continuous 

2) Diameter: continuous

3) Height : continuous

4) Whole Weight : continuous

5) Shucked weight : continuous : weight of the meat

6) Viscera weight : continuous : gut weight (after bleeding)

7) Shell weight : continuous : weight after being dried

8) Rings : integer : number of rings


**Target Attribute**:

9) Sex : categorical: (male,female,infant)

#### 1) Importing the relevant libraries :

In [1]:
import pandas as pd
import numpy as np

#### 2) Loading the dataset :

In [2]:
columns = ['sex', 'length', 'diameter', 'height', 'whole_weight',
           'shuckled_weight', 'viscerea_weight', 'shell_weight', 'rings']
# The file has no header row, so the column names are supplied explicitly.
abalone_data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
    header=None, names=columns)

In [3]:
abalone_data.head()

|   | sex | length | diameter | height | whole_weight | shuckled_weight | viscerea_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


#### 3) Label encoding the categorical attributes :

In [4]:
from sklearn.preprocessing import LabelEncoder 
encoder=LabelEncoder()
abalone_data['sex']=encoder.fit_transform(abalone_data['sex'])
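
`LabelEncoder` assigns integer codes to the distinct labels in sorted order, so here `F`, `I` and `M` become 0, 1 and 2. The plain-Python sketch below mimics that mapping so the codes are easy to read back later, for example in a confusion matrix. `sample_labels` simply copies the five labels printed by `head()` above; the variable names are illustrative.

```python
# The five 'sex' values shown by head(), before encoding.
sample_labels = ['M', 'M', 'F', 'M', 'I']

# Sorted distinct labels play the role of encoder.classes_.
classes = sorted(set(sample_labels))
label_to_code = {label: code for code, label in enumerate(classes)}

encoded = [label_to_code[label] for label in sample_labels]
decoded = [classes[code] for code in encoded]  # same idea as inverse_transform

print(label_to_code)  # {'F': 0, 'I': 1, 'M': 2}
print(encoded)        # [2, 2, 0, 2, 1]
```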

#### 4) Segregating the input features and target feature :

In [5]:
X=abalone_data.loc[:,'length':'rings'].values
Y=abalone_data.loc[:,'sex'].values

#### 5) Splitting the dataframe into training set and testing set :

In [6]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=300,random_state=42)

#### 6) Standardizing the training data and testing data :

In [7]:
from sklearn.preprocessing import StandardScaler
standardizer=StandardScaler()
X_train=standardizer.fit_transform(X_train)
X_test=standardizer.transform(X_test)
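
The scaler is fitted on the training data only and then reused on the test data, so no statistics from the test set leak into preprocessing. A minimal sketch of that idea for a single feature is shown below. The five `length` values printed earlier stand in for a training column, while `test_values` is made up for illustration.

```python
from statistics import mean, pstdev

train_values = [0.455, 0.350, 0.530, 0.440, 0.330]
test_values = [0.400, 0.600]  # hypothetical test column

# "fit": learn the mean and population standard deviation from training data only
mu = mean(train_values)
sigma = pstdev(train_values)

# "transform": apply the same training statistics to both sets
train_scaled = [(v - mu) / sigma for v in train_values]
test_scaled = [(v - mu) / sigma for v in test_values]

# The training column now has mean ~0 and standard deviation ~1;
# the test column generally does not, and that is expected.
print(round(mean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(v, 3) for v in test_scaled])
```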

#### 7) Reducing the dimensionality of the data using Principal Component Analysis (PCA) :

In [8]:
from sklearn.decomposition import PCA
pca_object=PCA(0.95)
X_train=pca_object.fit_transform(X_train)
X_test=pca_object.transform(X_test)
print('number of components :',pca_object.n_components_)
print('explained variance ratio :',pca_object.explained_variance_ratio_)

number of components : 3
explained variance ratio : [0.83756569 0.08773401 0.03301425]
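
Passing a fraction such as `0.95` to `PCA` asks for the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below reproduces the reasoning from the ratios printed above; it is a plain-Python illustration of the selection rule, not scikit-learn's internal code, and `target` and the other names are illustrative.

```python
from itertools import accumulate

# Explained variance ratios printed above (only the retained components are shown).
explained_variance_ratio = [0.83756569, 0.08773401, 0.03301425]
target = 0.95

cumulative = list(accumulate(explained_variance_ratio))
# First position where the running total reaches the target.
n_components = next(i + 1 for i, total in enumerate(cumulative) if total >= target)

print([round(c, 4) for c in cumulative])
print(n_components)  # 3
```

Two components explain about 92.5% of the variance, which falls short of the target, so a third component is kept.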


#### 8) Fitting the Decision Tree Classifier on the training data :

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,confusion_matrix
tree_clf=DecisionTreeClassifier()
tree_clf.fit(X_train,Y_train)
Y_pred=tree_clf.predict(X_test)
print('Accuracy using DecisionTreeClassifier:',100* accuracy_score(Y_test,Y_pred))
print('Confusion Matrix:\n',confusion_matrix(Y_test,Y_pred))

Accuracy using DecisionTreeClassifier: 44.0
Confusion Matrix:
 [[33 11 40]
 [23 52 19]
 [58 17 47]]
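
The reported accuracy can be recovered from the confusion matrix itself. In scikit-learn's convention, rows are true classes and columns are predicted classes, and the class order follows the label encoding (0 = F, 1 = I, 2 = M). The sketch below recomputes the accuracy and adds per-class recall; the variable names are illustrative.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
confusion = [[33, 11, 40],
             [23, 52, 19],
             [58, 17, 47]]
class_names = ['F', 'I', 'M']

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

# Recall per class: correct predictions divided by the row total.
recall = {name: confusion[i][i] / sum(confusion[i])
          for i, name in enumerate(class_names)}

print(correct, total, accuracy)
print({name: round(r, 3) for name, r in recall.items()})
```

Most of the errors come from females and males being mistaken for each other (40 and 58 instances respectively), while infants are recognised more reliably.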


#### 9) Tuning the hyperparameters of Bagging Classifier:

n_estimators : The number of base estimators (here, decision trees) in the ensemble.

max_samples : The number of instances drawn from the training set, with replacement, to train each base estimator.
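
Because sampling is done with replacement, a bootstrapped dataset usually contains repeated instances. For a training set of `n` instances and `max_samples = m`, the expected fraction of distinct training instances that appear in one bootstrap sample is `1 - (1 - 1/n)^m`. The sketch below evaluates this expression; `n = 1000` is an arbitrary illustrative size, not the size of the abalone training set, and `expected_unique_fraction` is a hypothetical helper.

```python
def expected_unique_fraction(n, m):
    """Expected share of the n training instances that appear at least once
    in a bootstrap sample of size m drawn with replacement."""
    return 1 - (1 - 1 / n) ** m


n = 1000  # illustrative training-set size (assumption)
for m in (150, 450, 1000):
    print(m, round(expected_unique_fraction(n, m), 3))
```

Smaller values of `max_samples` make each tree see a smaller and more distinct slice of the data, which tends to make the individual trees weaker but less correlated with one another.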


In [10]:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

hyperparams = {'n_estimators': [150, 300, 450, 600, 750, 900, 1050],
               'max_samples': [150, 300, 450, 600, 750, 900, 1050]}
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; newer
# releases no longer accept the old name.
grid_object = GridSearchCV(
    estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(), bootstrap=True),
    param_grid=hyperparams, cv=5, scoring='accuracy', verbose=3, n_jobs=5)
grid_object.fit(X_train, Y_train)

Fitting 5 folds for each of 49 candidates, totalling 245 fits


[Parallel(n_jobs=5)]: Done  22 tasks      | elapsed:    5.6s
[Parallel(n_jobs=5)]: Done 118 tasks      | elapsed:   40.4s
[Parallel(n_jobs=5)]: Done 245 out of 245 | elapsed:  2.1min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=5,
       param_grid={'n_estimators': [150, 300, 450, 600, 750, 900, 1050], 'max_samples': [150, 300, 450, 600, 750, 900, 1050]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=3)
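
The fit count reported in the log above follows directly from the grid: every combination of one value per hyperparameter is a candidate, and each candidate is trained once per cross-validation fold. A short sketch of that arithmetic, with illustrative variable names:

```python
from itertools import product

hyperparams = {'n_estimators': [150, 300, 450, 600, 750, 900, 1050],
               'max_samples': [150, 300, 450, 600, 750, 900, 1050]}
cv_folds = 5

# Every combination of one value per hyperparameter.
candidates = [dict(zip(hyperparams, values))
              for values in product(*hyperparams.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 49 245
```

With `refit=True`, the search performs one additional fit of the best candidate on the whole training set after these 245 fits.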

#### 10) Determining the best parameters :

In [11]:
grid_object.best_params_

{'max_samples': 450, 'n_estimators': 750}
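
Since the search was run with `refit=True`, the fitted search object also exposes the retrained best model as `best_estimator_` and its mean cross-validated score as `best_score_`. The sketch below shows these attributes on a small synthetic dataset from `make_classification`, with a deliberately tiny grid so that it runs in seconds. The data, grid values and variable names are illustrative assumptions and are not taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the real training set (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_classes=3, random_state=0)

small_grid = {'n_estimators': [10, 20], 'max_samples': [50, 100]}
# The tree is passed positionally so the call does not depend on whether the
# installed scikit-learn names this argument `estimator` or `base_estimator`.
bagger = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)
search = GridSearchCV(bagger, param_grid=small_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))

# best_estimator_ is already refitted on all of X_demo and can predict directly.
demo_pred = search.best_estimator_.predict(X_demo[:5])
print(demo_pred)
```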

#### 11) Making predictions with the tuned ensemble :

In [13]:
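# max_features=3 lets every tree use all three PCA components.
# On scikit-learn >= 1.2, pass the tree as `estimator=` instead of `base_estimator=`.
bagging_clf=BaggingClassifier(base_estimator=DecisionTreeClassifier(),max_samples=450,n_estimators=750,max_features=3)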
bagging_clf.fit(X_train,Y_train)
Y_pred=bagging_clf.predict(X_test)
print('Accuracy Using Bagging Classifier :',100*accuracy_score(Y_test,Y_pred))

Accuracy Using Bagging Classifier : 54.333333333333336


#### 12) Conclusion

Accuracy using DecisionTreeClassifier : 44%

Accuracy using Bagging Classifier : 54.33%

When an ensemble of DecisionTreeClassifiers, each trained on a bootstrapped sample of the training set, replaces a single DecisionTreeClassifier trained on the full training set, accuracy on the 300-instance test set rises by roughly 10 percentage points. Accuracy here is the proportion of correctly predicted instances out of the total number of instances in the test set.
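
Both accuracies are measured on only 300 test instances, so each carries some sampling noise. As a rough check, the sketch below computes the binomial standard error of each accuracy. This is a simplified estimate that treats test instances as independent and ignores run-to-run variation of the classifiers; `accuracy_std_error` is a hypothetical helper introduced here.

```python
from math import sqrt


def accuracy_std_error(accuracy, n_instances):
    """Binomial standard error of an accuracy estimated on n_instances."""
    return sqrt(accuracy * (1 - accuracy) / n_instances)


n_test = 300
accuracies = {'DecisionTreeClassifier': 0.44, 'BaggingClassifier': 0.5433}

for name, acc in accuracies.items():
    std_err = accuracy_std_error(acc, n_test)
    print(name, round(100 * acc, 2), '+/-', round(100 * std_err, 2))
```

Each standard error is a little under 3 percentage points, so the gap of about 10 points is larger than this sampling noise, although a single train/test split cannot rule out some variation between runs.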