# Ensemble


Ensemble learning improves machine learning results by combining several models, which often gives better predictive performance than any single model.

Most ensemble methods use a single base learning algorithm, i.e. learners of the same type, which leads to homogeneous ensembles.

Some methods instead use learners of different types, which leads to heterogeneous ensembles. For an ensemble to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible.
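To see why accuracy and diversity both matter, the sketch below computes the accuracy of a majority vote over `n` classifiers on a two-class problem. It rests on an idealised assumption: every classifier is correct with the same probability `p`, and their errors are independent. The function name `majority_vote_accuracy` and the chosen numbers are illustrative only; real base learners are correlated, so the gain is usually smaller.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters are correct."""
    majority = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(majority, n + 1))

single = 0.7  # assumed accuracy of each individual classifier
for n in (1, 5, 15):
    print(n, round(majority_vote_accuracy(single, n), 4))
```

With `p` above 0.5 the vote becomes more reliable as classifiers are added. With `p` below 0.5 it becomes less reliable, which is why the base learners need to be reasonably accurate in the first place.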



### Bagging

Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different trees on different subsets of the data (chosen randomly with replacement).

Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression.
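The two ingredients described above can be sketched with the standard library alone. This is a minimal illustration, not the scikit-learn implementation; the helper names `bootstrap_sample`, `aggregate_votes` and `aggregate_mean` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def aggregate_votes(labels):
    """Classification: return the most common predicted label."""
    return Counter(labels).most_common(1)[0][0]

def aggregate_mean(values):
    """Regression: return the average of the predicted values."""
    return mean(values)

rng = random.Random(30)
rows = list(range(10))
subsets = [bootstrap_sample(rows, rng) for _ in range(3)]
for subset in subsets:
    # Sampling with replacement usually repeats some rows and omits others
    print(sorted(subset))

print(aggregate_votes(["A", "B", "A"]))
print(aggregate_mean([2.0, 3.0, 4.0]))
```

Each base learner would be trained on one such subset, and the aggregation functions would be applied to the learners' predictions for a new sample.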

### Boosting

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first. Models are added until the training set is predicted perfectly or a maximum number of models is reached.

AdaBoost is one of the most successful boosting algorithms developed for binary classification.
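As a rough illustration of how a boosting round shifts attention to the errors of the previous model, the sketch below performs the sample-weight update of discrete AdaBoost for a binary problem. It assumes the weak learner's predictions are already known and only shows the bookkeeping; the function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, is_correct):
    """One round of discrete AdaBoost bookkeeping for a binary problem.

    weights    -- current sample weights, summing to 1
    is_correct -- whether the weak learner classified each sample correctly
    Returns (alpha, new_weights), where alpha is the weak learner's vote weight.
    """
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("the weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Shrink the weight of correctly classified samples, grow the others
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = adaboost_round(weights, [True, True, True, False])
print(alpha, new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.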

### Libraries used in the ensemble examples

### Import all the libraries required

In [None]:
import numpy as np  # needed for np.random.choice in Q3 and Q4
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

### Load the "letter-recognition" data

In [2]:
# import dataset
import pandas as pd
df = pd.read_csv("letter-recognition.data.txt", header = None)
df.head()

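|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T | 2 | 8 | 3 | 5 | 1 | 8 | 13 | 0 | 6 | 6 | 10 | 8 | 0 | 8 | 0 | 8 |
| 1 | I | 5 | 12 | 3 | 7 | 2 | 10 | 5 | 5 | 4 | 13 | 3 | 9 | 2 | 8 | 4 | 10 |
| 2 | D | 4 | 11 | 6 | 8 | 6 | 10 | 6 | 2 | 6 | 10 | 3 | 7 | 3 | 7 | 3 | 9 |
| 3 | N | 7 | 11 | 6 | 6 | 3 | 5 | 9 | 4 | 6 | 4 | 4 | 10 | 6 | 10 | 2 | 8 |
| 4 | G | 2 | 1 | 3 | 1 | 1 | 8 | 6 | 6 | 6 | 6 | 5 | 9 | 1 | 7 | 5 | 10 |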


### Split the dataset into training and testing parts (70-30 ratio with a random state value 30)

In [3]:
# Select the independent variables and the target attribute
X = df[df.columns[1:]]  # independent variables (the 16 numeric feature columns)
Y = df[df.columns[0]]   # target column (the letter label)
X.head()

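|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 8 | 3 | 5 | 1 | 8 | 13 | 0 | 6 | 6 | 10 | 8 | 0 | 8 | 0 | 8 |
| 1 | 5 | 12 | 3 | 7 | 2 | 10 | 5 | 5 | 4 | 13 | 3 | 9 | 2 | 8 | 4 | 10 |
| 2 | 4 | 11 | 6 | 8 | 6 | 10 | 6 | 2 | 6 | 10 | 3 | 7 | 3 | 7 | 3 | 9 |
| 3 | 7 | 11 | 6 | 6 | 3 | 5 | 9 | 4 | 6 | 4 | 4 | 10 | 6 | 10 | 2 | 8 |
| 4 | 2 | 1 | 3 | 1 | 1 | 8 | 6 | 6 | 6 | 6 | 5 | 9 | 1 | 7 | 5 | 10 |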


In [4]:
# Divide the dataset into training and testing partition
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

### Q1. Ensemble Method by manipulation of Dataset (Bagged Decision Trees)

Bagging performs best with algorithms that have high variance. A popular example is the decision tree, often constructed without pruning.

We will create decision tree classifiers with and without the bagging ensemble method and compare their performance.

In [5]:
# Implement the decision tree classifier using entropy and random state value as 30
from sklearn.tree import DecisionTreeClassifier
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30) 

In [7]:
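# Use k-fold cross validation with k=5
from sklearn.model_selection import cross_val_score
# cross_val_score fits fresh clones on each fold; this explicit fit trains the
# model on the full training set for the test-set predictions further below
dtree_entropy = dtree_entropy.fit(X_train,Y_train)
scores = cross_val_score(dtree_entropy, X_train, Y_train, cv=5, scoring='accuracy')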
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.865      0.86964286 0.85214286 0.86       0.87      ]
mean score:  0.8633571428571427


### Prediction and Evaluation

In [9]:
# Predict results on the testing part
Y_pred = dtree_entropy.predict(X_test)
Y_pred

array(['A', 'L', 'O', ..., 'N', 'T', 'C'], dtype=object)

In [11]:
# Calculate and print confusion matrix and other performance measures 
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score
print(classification_report(Y_test,Y_pred))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    

### Comparison with Bagged Decision Tree

In [12]:
# Create a model using bagging with 5 decision tree classifiers
from sklearn.ensemble import BaggingClassifier

seed = 30
dtree = DecisionTreeClassifier(criterion='entropy', random_state=seed)
num_trees = 5
# The base estimator is passed positionally: its keyword was `base_estimator`
# in older scikit-learn releases and is `estimator` from version 1.2 onward.
model = BaggingClassifier(dtree, n_estimators=num_trees, random_state=seed)

In [25]:
# Use k-fold cross validation with k=5
scores_ensemble = cross_val_score(model, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores_ensemble)
print('mean score: ', scores_ensemble.mean())

scores:  [0.88714286 0.89035714 0.89       0.88285714 0.88714286]
mean score:  0.8875


### Prediction and Evaluation

In [14]:
# Predict results on the testing part
model.fit(X_train, Y_train)
Y_pred_ensemble = model.predict(X_test)

In [15]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,Y_pred_ensemble))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred_ensemble))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred_ensemble))

              precision    recall  f1-score   support

           A       0.93      0.98      0.95       229
           B       0.78      0.92      0.84       228
           C       0.88      0.89      0.89       220
           D       0.78      0.90      0.84       219
           E       0.84      0.91      0.87       232
           F       0.87      0.81      0.84       225
           G       0.85      0.82      0.84       234
           H       0.82      0.86      0.84       206
           I       0.90      0.93      0.91       236
           J       0.93      0.89      0.91       209
           K       0.87      0.92      0.89       213
           L       0.94      0.92      0.93       239
           M       0.93      0.93      0.93       240
           N       0.96      0.90      0.92       239
           O       0.87      0.81      0.84       243
           P       0.89      0.93      0.91       243
           Q       0.89      0.89      0.89       228
           R       0.89    

### Q2. Ensemble Method by manipulation of Classifiers (using Voting Classifier)

The `VotingClassifier` takes a list of (name, estimator) pairs and a voting method. The **hard** voting method predicts the label chosen by the majority of the estimators, while the **soft** voting method predicts the label with the largest sum of predicted probabilities across the estimators.

After we provide the desired classifiers, we need to fit the resulting ensemble classifier object. We can then get predictions and use accuracy metrics.
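The difference between the two voting methods can be shown on a single sample. The sketch below uses made-up probabilities from three hypothetical classifiers; the helper names `hard_vote` and `soft_vote` are illustrative and are not scikit-learn functions.

```python
from collections import Counter

# Hypothetical predicted probabilities from three classifiers for one sample
classes = ["A", "B"]
probabilities = [
    [0.55, 0.45],
    [0.60, 0.40],
    [0.10, 0.90],
]

def hard_vote(probabilities, classes):
    """Each classifier votes for its most probable label; the majority wins."""
    votes = [classes[row.index(max(row))] for row in probabilities]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probabilities, classes):
    """Sum the probabilities per class and pick the class with the largest sum."""
    totals = [sum(column) for column in zip(*probabilities)]
    return classes[totals.index(max(totals))]

print(hard_vote(probabilities, classes))  # A
print(soft_vote(probabilities, classes))  # B
```

Two weakly confident votes for A win the hard vote, while one very confident vote for B wins the soft vote. In scikit-learn, `voting='soft'` requires every estimator to provide `predict_proba`.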

In [17]:
# Import the required classifiers
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [19]:
# Implement the different classifiers
dt = DecisionTreeClassifier(criterion='gini', random_state = 30)
knn1 = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn2 = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn3 = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
nb = GaussianNB()

In [20]:
# Build Voting Classifier using above estimators and hard voting method
# Function to be used: VotingClassifier(estimators,voting)
# Estimators represent the base classifiers used taken as ('base classifier name', variable_name)
from sklearn.ensemble import VotingClassifier
model_voting = VotingClassifier(estimators=[('m1_dt', dt),('m2_knn1', knn1),('m3_knn2',knn2),('m4_knn3',knn3),('m5_nb',nb)],voting='hard')

In [28]:
# Fit the voting classifier model and print scores using k-fold cross validation with k=5
model_voting.fit(X_train, Y_train)
scores_voting = cross_val_score(model_voting, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores_voting)
print('mean score: ', scores_voting.mean())

scores:  [0.94178571 0.945      0.94357143 0.94464286 0.94357143]
mean score:  0.9437142857142857


### Prediction and Evaluation

In [23]:
# Predict results on the testing part
Y_pred_voting = model_voting.predict(X_test)

In [24]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,Y_pred_voting))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred_voting))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred_voting))

              precision    recall  f1-score   support

           A       0.98      1.00      0.99       229
           B       0.84      0.98      0.90       228
           C       0.96      0.95      0.96       220
           D       0.89      0.98      0.93       219
           E       0.94      0.93      0.93       232
           F       0.93      0.93      0.93       225
           G       0.94      0.91      0.93       234
           H       0.89      0.92      0.90       206
           I       0.94      0.97      0.96       236
           J       0.97      0.93      0.95       209
           K       0.93      0.91      0.92       213
           L       0.99      0.95      0.97       239
           M       0.97      0.98      0.98       240
           N       0.98      0.95      0.96       239
           O       0.92      0.95      0.93       243
           P       0.95      0.93      0.94       243
           Q       0.97      0.96      0.97       228
           R       0.94    

### Q3. Manipulating the features

In [66]:
# Make five independent copies of the data, one per feature-subset model
df1 = df.copy(deep=True)
df2 = df.copy(deep=True)
df3 = df.copy(deep=True)
df4 = df.copy(deep=True)
df5 = df.copy(deep=True)

In [67]:
# Model 1
# Pick 10 of the 16 feature columns at random (without replacement),
# use column 0 as the target, and train a decision tree on that subset
x1 = np.random.choice(np.arange(1,17),10,replace=False)
x1.sort()
print(x1)
X1 = df1[df1.columns[x1]]
Y1 = df1[0]
X1.head() 
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, test_size=0.30, random_state = 30)
dt1 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt1.fit(X1_train, Y1_train)

[ 1  3  4  5  7  8 11 12 13 15]


DecisionTreeClassifier(criterion='entropy', random_state=30)

In [68]:
# Model 2
# Pick 10 of the 16 feature columns at random (without replacement),
# use column 0 as the target, and train a decision tree on that subset
x2 = np.random.choice(np.arange(1,17),10,replace=False)
x2.sort()
print(x2)
X2 = df2[df2.columns[x2]]
Y2 = df2[0]
X2.head() 
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, Y2, test_size=0.30, random_state = 30)
dt2 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt2.fit(X2_train, Y2_train)

[ 1  3  4  5  6  7 12 13 14 15]


DecisionTreeClassifier(criterion='entropy', random_state=30)

In [69]:
# Model 3
# Pick 10 of the 16 feature columns at random (without replacement),
# use column 0 as the target, and train a decision tree on that subset
x3 = np.random.choice(np.arange(1,17),10,replace=False)
x3.sort()
print(x3)
X3 = df3[df3.columns[x3]]
Y3 = df3[0]
X3.head() 
X3_train, X3_test, Y3_train, Y3_test = train_test_split(X3, Y3, test_size=0.30, random_state = 30)
dt3 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt3.fit(X3_train, Y3_train)

[ 3  4  5  6  8  9 10 11 13 15]


DecisionTreeClassifier(criterion='entropy', random_state=30)

In [70]:
# Model 4
# Pick 10 of the 16 feature columns at random (without replacement),
# use column 0 as the target, and train a decision tree on that subset
x4 = np.random.choice(np.arange(1,17),10,replace=False)
x4.sort()
print(x4)
X4 = df4[df4.columns[x4]]
Y4 = df4[0]
X4.head() 
X4_train, X4_test, Y4_train, Y4_test = train_test_split(X4, Y4, test_size=0.30, random_state = 30)
dt4 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt4.fit(X4_train, Y4_train)

[ 2  4  5  7  9 11 12 13 14 15]


DecisionTreeClassifier(criterion='entropy', random_state=30)

In [71]:
# Model 5
# Pick 10 of the 16 feature columns at random (without replacement),
# use column 0 as the target, and train a decision tree on that subset
x5 = np.random.choice(np.arange(1,17),10,replace=False)
x5.sort()
print(x5)
X5 = df5[df5.columns[x5]]
Y5 = df5[0]
X5.head() 
X5_train, X5_test, Y5_train, Y5_test = train_test_split(X5, Y5, test_size=0.30, random_state = 30)
dt5 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt5.fit(X5_train, Y5_train)

[ 1  3  5  6  8 10 11 12 14 15]


DecisionTreeClassifier(criterion='entropy', random_state=30)

In [72]:
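# Apply Voting Classifier
# Note: VotingClassifier.fit clones dt1..dt5 and refits each clone on all 16
# columns of X_train, so the random feature subsets chosen above are not used
# here and the five trees end up identical. This is why the report below
# matches the single decision tree from Q1.
from sklearn.ensemble import VotingClassifier
model_feature_subset = VotingClassifier(estimators=[('m1', dt1),('m2', dt2),('m3',dt3),('m4',dt4),('m5',dt5)],voting='hard')
model_feature_subset.fit(X_train, Y_train)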
Y_pred_feature_subset = model_feature_subset.predict(X_test)
Y_pred_feature_subset

array(['A', 'L', 'O', ..., 'N', 'T', 'C'], dtype=object)

In [73]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,Y_pred_feature_subset))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred_feature_subset))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred_feature_subset))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    

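Because `VotingClassifier.fit` trains every estimator on the same input columns, one way to keep a separate random feature subset per tree is to train the trees and combine their votes by hand. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` in place of the letter data, and the helper names `fit_feature_subset_trees` and `predict_majority` are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_feature_subset_trees(X, y, n_models=5, n_features=10, seed=30):
    """Train one tree per random feature subset; return (columns, tree) pairs."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        cols = np.sort(rng.choice(X.shape[1], size=n_features, replace=False))
        tree = DecisionTreeClassifier(criterion="entropy", random_state=seed)
        tree.fit(X[:, cols], y)
        models.append((cols, tree))
    return models

def predict_majority(models, X):
    """Let each tree predict from its own columns, then take a majority vote."""
    all_preds = np.array([tree.predict(X[:, cols]) for cols, tree in models])
    # all_preds has shape (n_models, n_samples); vote column by column
    return np.array([Counter(votes).most_common(1)[0][0] for votes in all_preds.T])

# Synthetic stand-in for the letter data (assumption: 16 numeric features)
X_demo, y_demo = make_classification(n_samples=600, n_features=16, n_informative=8,
                                     n_classes=3, random_state=30)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=30)

models = fit_feature_subset_trees(Xd_train, yd_train)
yd_pred = predict_majority(models, Xd_test)
print("accuracy:", (yd_pred == yd_test).mean())
```

With the letter data, `X_train.to_numpy()` and `Y_train.to_numpy()` could be passed in the same way. `BaggingClassifier` with `max_features` set and `bootstrap=False` is a built-in alternative that also trains each estimator on a random feature subset.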
### Q4. Manipulating the classes

In [86]:
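# Generate 5 two-class representations of the 26 letters.
# Each row of `features` holds 13 letter indices (A=1, ..., Z=26); letters in
# the row are relabelled "0" and all other letters "1".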
features = np.zeros((5,13))
for i in range(5):
    features[i] = np.random.choice(np.arange(1,27),13,replace=False)
    
print(features)
df1 = df.copy(deep = True)
df2 = df.copy(deep = True)
df3 = df.copy(deep = True)
df4 = df.copy(deep = True)
df5 = df.copy(deep = True)

for i in range(len(df)):
    col = ord(df.iloc[i, 0]) - 64
    
    if col not in features[0]:
        df1.iloc[i, 0] = "1"
    else:
        df1.iloc[i, 0] = "0"
        
    if col not in features[1]:
        df2.iloc[i, 0] = "1"
    else:
        df2.iloc[i, 0] = "0"
        
    if col not in features[2]:
        df3.iloc[i, 0] = "1"
    else:
        df3.iloc[i, 0] = "0"
        
    if col not in features[3]:
        df4.iloc[i, 0] = "1"
    else:
        df4.iloc[i, 0] = "0"
        
    if col not in features[4]:
        df5.iloc[i, 0] = "1"
    else:
        df5.iloc[i, 0] = "0"

[[ 6. 18. 17. 16. 13. 25. 14. 12.  4. 19. 10. 11. 22.]
 [24.  5. 15.  9.  2.  1. 20. 21.  4. 10. 12.  3. 18.]
 [ 3. 17. 20. 11.  6.  9.  1. 25. 13.  7. 18. 26. 23.]
 [18.  5. 12.  8.  6. 13.  4.  1. 14. 26. 19. 21. 17.]
 [ 1. 12.  6. 22.  4. 10. 24. 18.  3. 13.  8.  9. 19.]]


In [87]:
# Model 1
# Use all 16 feature columns with the two-class target from df1,
# then train a decision tree on the relabelled data
X1 = df1[df1.columns[1:]]
Y1 = df1[df1.columns[0]]
X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, test_size=0.30, random_state = 30)
dt1 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt1 = dt1.fit(X1_train, Y1_train)

In [88]:
# Model 2
# Use all 16 feature columns with the two-class target from df2,
# then train a decision tree on the relabelled data
X2 = df2[df2.columns[1:]]
Y2 = df2[df2.columns[0]]
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, Y2, test_size=0.30, random_state = 30)
dt2 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt2 = dt2.fit(X2_train, Y2_train)

In [89]:
# Model 3
# Use all 16 feature columns with the two-class target from df3,
# then train a decision tree on the relabelled data
X3 = df3[df3.columns[1:]]
Y3 = df3[df3.columns[0]]
X3_train, X3_test, Y3_train, Y3_test = train_test_split(X3, Y3, test_size=0.30, random_state = 30)
dt3 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt3 = dt3.fit(X3_train, Y3_train)

In [90]:
# Model 4
# Use all 16 feature columns with the two-class target from df4,
# then train a decision tree on the relabelled data
X4 = df4[df4.columns[1:]]
Y4 = df4[df4.columns[0]]
X4_train, X4_test, Y4_train, Y4_test = train_test_split(X4, Y4, test_size=0.30, random_state = 30)
dt4 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt4 = dt4.fit(X4_train, Y4_train)

In [91]:
# Model 5
# Use all 16 feature columns with the two-class target from df5,
# then train a decision tree on the relabelled data
X5 = df5[df5.columns[1:]]
Y5 = df5[df5.columns[0]]
X5_train, X5_test, Y5_train, Y5_test = train_test_split(X5, Y5, test_size=0.30, random_state = 30)
dt5 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dt5 = dt5.fit(X5_train, Y5_train)

In [92]:
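# Apply Voting Classifier
# Note: VotingClassifier.fit clones dt1..dt5 and refits each clone on
# (X_train, Y_train), i.e. on the original 26 letter labels, so the two-class
# relabelling above is not used here and the five trees end up identical.
model_two_class = VotingClassifier(estimators=[('m1', dt1),('m2', dt2),('m3',dt3),('m4',dt4),('m5',dt5)],voting='hard')
model_two_class.fit(X_train, Y_train)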
scores_two_class = cross_val_score(model_two_class, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores_two_class)
print('mean score: ', scores_two_class.mean())

scores:  [0.865      0.86964286 0.85214286 0.86       0.87      ]
mean score:  0.8633571428571427


In [95]:
# Calculate and print confusion matrix and other performance measures 
Y_pred_two_class = model_two_class.predict(X_test)
print(classification_report(Y_test,Y_pred_two_class))
print("Confusion Matrix")
print(confusion_matrix(Y_test,Y_pred_two_class))
print("\n Accuracy")
print(accuracy_score(Y_test,Y_pred_two_class))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    

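To make use of the two-class models themselves, their "0"/"1" predictions have to be decoded back into a letter. One possible scheme, sketched below, gives every class one vote per model whose prediction is consistent with that class and returns the class with the most votes. This decoding step is an assumption and is not part of the cells above; the function name `decode_two_class_votes` and the four-letter example are hypothetical.

```python
def decode_two_class_votes(binary_predictions, groups, classes):
    """Combine two-class predictions into one multi-class label.

    binary_predictions[i] is "0" if model i says the sample belongs to
    groups[i] and "1" otherwise, the same convention as the relabelling above.
    """
    votes = {label: 0 for label in classes}
    for prediction, group in zip(binary_predictions, groups):
        for label in classes:
            # A class gets a vote when the prediction agrees with its membership
            if (prediction == "0") == (label in group):
                votes[label] += 1
    # Ties are broken in favour of the class listed first
    return max(classes, key=lambda label: votes[label])

classes = ["A", "B", "C", "D"]
groups = [{"A", "B"}, {"A", "C"}, {"A", "D"}]

print(decode_two_class_votes(["0", "0", "0"], groups, classes))  # A
print(decode_two_class_votes(["0", "1", "1"], groups, classes))  # B
print(decode_two_class_votes(["1", "1", "0"], groups, classes))  # D
```

For the letter data, `groups` would be the five rows of `features` mapped back to letters, and `binary_predictions` would hold the outputs of the five two-class trees for one test sample.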
### Q5. Which method performs the best?