## ENSEMBLE
* Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). 
* Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier. 
* Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?
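
To make the difference between the two voting schemes concrete, here is a minimal sketch on made-up class probabilities. The numbers and the helper names `hard_vote` and `soft_vote` are illustrative assumptions, not part of scikit-learn. Hard voting counts each classifier's top choice, while soft voting averages the predicted probabilities and picks the most likely class, so a single very confident classifier can outweigh two hesitant ones.

```python
import numpy as np

# Hypothetical predicted probabilities from three classifiers for ONE sample
# over three classes (rows: classifiers, columns: classes 0, 1, 2).
probas = np.array([
    [0.90, 0.10, 0.00],   # classifier A is very confident in class 0
    [0.40, 0.60, 0.00],   # classifier B narrowly prefers class 1
    [0.45, 0.55, 0.00],   # classifier C narrowly prefers class 1
])

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = probas.argmax(axis=1)
    return int(np.bincount(votes, minlength=probas.shape[1]).argmax())

def soft_vote(probas):
    """Average the class probabilities, then pick the most likely class."""
    return int(probas.mean(axis=0).argmax())

print("hard vote:", hard_vote(probas))  # hard vote: 1
print("soft vote:", soft_vote(probas))  # soft vote: 0
```

In this toy case the two schemes disagree, which is why it is worth trying both on the validation set.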


In [1]:
import numpy as np

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

from sklearn.datasets import fetch_openml 


In [2]:
%%time
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

CPU times: user 13.9 s, sys: 300 ms, total: 14.2 s
Wall time: 14.2 s


dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [3]:
X = mnist["data"]
y = mnist["target"]

In [4]:
np.unique(y)

array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)

In [5]:
X_train, X_val, X_test = X[:50000], X[50000:60000], X[60000:]
y_train, y_val, y_test = y[:50000], y[50000:60000], y[60000:]


In [6]:
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape

((50000, 784), (10000, 784), (10000, 784), (50000,), (10000,), (10000,))
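
Because the split relies on positional slicing, it can be worth checking that each split contains the digits in similar proportions. The sketch below is one possible check. The helper `class_proportions` and the toy label lists are hypothetical stand-ins, and in the notebook you would pass `y_train`, `y_val`, and `y_test` instead.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of samples} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy stand-ins for the label arrays of two splits.
toy_train = ['0', '1', '1', '2', '2', '2', '0', '1']
toy_val = ['0', '1', '2', '2']

print(class_proportions(toy_train))
print(class_proportions(toy_val))
```

If the proportions differ a lot between splits, validation accuracy becomes a less reliable guide to test accuracy.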

#### Scaling 

In [7]:
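# Standardize each feature to zero mean and unit variance. RBF-kernel SVMs are sensitive
# to feature scale; Random Forest and Extra-Trees are not hurt by the scaling.

# NB: Fit the scaler on the training set only, then apply the same transform to the
# validation and test sets. Otherwise the models see inputs on a different scale at prediction time.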
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_val = sc.transform(X_val)
X_test = sc.transform(X_test)
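
As a rough illustration of what the scaling cell does, the following NumPy sketch computes the statistics on the training data only and reuses them for other splits. The function names and toy arrays are assumptions made for this example. It mirrors the default behaviour of `StandardScaler` only approximately, including the handling of constant features, which MNIST has in the form of border pixels that are always zero.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation from the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Constant features have std == 0; use 1 so they map to 0 instead of NaN.
    std[std == 0] = 1.0
    return mean, std

def standardize(X, mean, std):
    """Apply previously fitted statistics to any split."""
    return (X - mean) / std

rng = np.random.default_rng(0)
toy_train = rng.uniform(0, 255, size=(100, 3))
toy_train[:, 2] = 0.0          # a pixel that is always black
toy_test = rng.uniform(0, 255, size=(20, 3))
toy_test[:, 2] = 0.0

mean, std = fit_standardizer(toy_train)            # statistics come from train only
toy_train_scaled = standardize(toy_train, mean, std)
toy_test_scaled = standardize(toy_test, mean, std)  # same statistics reused

print(toy_train_scaled.mean(axis=0).round(6))
print(toy_train_scaled.std(axis=0).round(6))
```

The test split is not expected to have exactly zero mean after the transform, because its statistics were never used.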

#### SVM

In [8]:
%%time
svm_clf = SVC(kernel="rbf", random_state=42) #Random State for reproducibility of results 
svm_clf.fit(X_train, y_train)

CPU times: user 7min 59s, sys: 1.74 s, total: 8min 1s
Wall time: 8min 1s


SVC(random_state=42)

In [9]:
%%time
svm_acc = accuracy_score(y_val,svm_clf.predict(X_val))
print(f"The svm accuracy on the validation set is {np.round(svm_acc*100,2)}%")

The svm accuracy on the validation set is 96.87%
CPU times: user 2min 12s, sys: 71.7 ms, total: 2min 12s
Wall time: 2min 12s


#### Random Forest Classifier

In [10]:
%%time
rnd_clf = RandomForestClassifier(n_jobs=-1) #n_jobs = -1 enables the rnd_clf to use all available CPU resources
rnd_clf.fit(X_train, y_train)

CPU times: user 1min 6s, sys: 541 ms, total: 1min 7s
Wall time: 4.76 s


RandomForestClassifier(n_jobs=-1)

In [11]:
%%time
rnd_acc = accuracy_score(y_val,rnd_clf.predict(X_val))
print(f"The Random Forest accuracy on the validation set is {np.round(rnd_acc*100,2)}%")

The Random Forest accuracy on the validation set is 97.23%
CPU times: user 634 ms, sys: 48.6 ms, total: 683 ms
Wall time: 102 ms


#### Extra Trees Classifier

In [12]:
%%time
ext_clf = ExtraTreesClassifier(n_jobs=-1)
ext_clf.fit(X_train, y_train)

CPU times: user 1min 37s, sys: 1.32 s, total: 1min 38s
Wall time: 6.73 s


ExtraTreesClassifier(n_jobs=-1)

In [13]:
%%time
ext_acc = accuracy_score(y_val,ext_clf.predict(X_val))
print(f"The Extra tree accuracy on the validation set is {np.round(ext_acc*100,2)}%")

The Extra tree accuracy on the validation set is 97.38%
CPU times: user 841 ms, sys: 48.7 ms, total: 890 ms
Wall time: 114 ms


#### Ensemble with Hard voting

In [14]:
%%time
from sklearn.ensemble import VotingClassifier

rnd_clfV = RandomForestClassifier(n_jobs=-1) 
svm_clfV = SVC(kernel="rbf", random_state=42)
ext_clfV = ExtraTreesClassifier(n_jobs=-1)
voting_clf = VotingClassifier(estimators=[('ext', ext_clfV), 
                                          ('rf', rnd_clfV), 
                                          ('svc', svm_clfV)],    
                                          voting='hard') 
voting_clf.fit(X_train, y_train)

CPU times: user 8min 1s, sys: 4.15 s, total: 8min 6s
Wall time: 8min 19s


VotingClassifier(estimators=[('ext', ExtraTreesClassifier(n_jobs=-1)),
                             ('rf', RandomForestClassifier(n_jobs=-1)),
                             ('svc', SVC(random_state=42))])

In [15]:
%%time
voting_clf_acc = accuracy_score(y_val,voting_clf.predict(X_val))
print(f"The ensemble using hard (majority) voting gave a validation accuracy of {np.round(voting_clf_acc*100,2)}%")

The ensemble using hard (majority) voting gave a validation accuracy of 97.65%
CPU times: user 2min 13s, sys: 216 ms, total: 2min 13s
Wall time: 2min 12s
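
Fitting the `VotingClassifier` retrains three fresh models, which accounts for most of the eight minutes above. One possible shortcut, sketched here under the assumption that the predictions of the already fitted classifiers are kept in arrays, is to take the majority vote directly. The function `hard_vote_from_predictions` and the toy arrays are hypothetical. In the notebook the inputs would come from calls such as `svm_clf.predict(X_val)`.

```python
import numpy as np

def hard_vote_from_predictions(predictions):
    """Majority vote over label predictions.

    predictions: sequence of shape (n_classifiers, n_samples).
    Ties go to the label that sorts first.
    """
    preds = np.asarray(predictions)
    # Encode labels (strings such as '0'..'9') as integer indices.
    classes, encoded = np.unique(preds, return_inverse=True)
    encoded = encoded.reshape(preds.shape)
    # Count votes per class for every sample: shape (n_classes, n_samples).
    counts = np.apply_along_axis(np.bincount, 0, encoded, minlength=len(classes))
    return classes[counts.argmax(axis=0)]

# Toy predictions from three classifiers for three samples.
toy_preds = [
    ['1', '2', '3'],
    ['1', '2', '4'],
    ['7', '5', '3'],
]
print(hard_vote_from_predictions(toy_preds))  # ['1' '2' '3']
```

The result will not necessarily match `voting_clf` exactly, since the forests inside the ensemble are retrained without a fixed seed.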


#### Ensemble with Soft voting

In [16]:
%%time
from sklearn.ensemble import VotingClassifier

rnd_clfVs = RandomForestClassifier(n_jobs=-1) 
svm_clfVs = SVC(kernel="rbf", random_state=42, probability=True)
ext_clfVs = ExtraTreesClassifier(n_jobs=-1)
voting_clf2 = VotingClassifier(estimators=[('ext', ext_clfVs), 
                                          ('rf', rnd_clfVs), 
                                          ('svc', svm_clfVs)],    
                                          voting='soft') 
voting_clf2.fit(X_train, y_train)

CPU times: user 41min 28s, sys: 6.41 s, total: 41min 34s
Wall time: 44min


VotingClassifier(estimators=[('ext', ExtraTreesClassifier(n_jobs=-1)),
                             ('rf', RandomForestClassifier(n_jobs=-1)),
                             ('svc', SVC(probability=True, random_state=42))],
                 voting='soft')

In [17]:
%%time
voting_clf2_acc = accuracy_score(y_val,voting_clf2.predict(X_val))
print(f"The ensemble using prediction probability gave a validation accuracy of {np.round(voting_clf2_acc*100,2)}%")

The ensemble using prediction probability gave a validation accuracy of 97.98%
CPU times: user 2min 13s, sys: 996 ms, total: 2min 14s
Wall time: 2min 12s


##### How much better does it perform compared to the individual classifiers?


In [18]:
print(f"The difference between the best ensemble classifier and svm is {(voting_clf2_acc-svm_acc)*100:.2f} percentage points")
print(f"The difference between the best ensemble classifier and random forest is {(voting_clf2_acc-rnd_acc)*100:.2f} percentage points")
print(f"The difference between the best ensemble classifier and extra trees is {(voting_clf2_acc-ext_acc)*100:.2f} percentage points")

The difference between the best ensemble classifier and svm is 1.11 percentage points
The difference between the best ensemble classifier and random forest is 0.75 percentage points
The difference between the best ensemble classifier and extra trees is 0.60 percentage points
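
Gains of about one percentage point look small when accuracy is already near 97%, so it can help to express them as a share of the remaining errors. The sketch below uses the validation accuracies reported above. The helper `relative_error_reduction` is a name introduced for this example.

```python
# Validation accuracies reported above, as fractions.
individual = {"SVM": 0.9687, "Random Forest": 0.9723, "Extra-Trees": 0.9738}
ensemble_acc = 0.9798  # soft-voting ensemble

def relative_error_reduction(base_acc, new_acc):
    """Fraction of the baseline's errors that the new model eliminates."""
    base_err = 1.0 - base_acc
    new_err = 1.0 - new_acc
    return (base_err - new_err) / base_err

for name, acc in individual.items():
    gain_pp = (ensemble_acc - acc) * 100
    rel = relative_error_reduction(acc, ensemble_acc) * 100
    print(f"{name}: +{gain_pp:.2f} points, {rel:.1f}% fewer errors")
```

On these numbers the soft-voting ensemble removes roughly 23% to 35% of the validation errors made by the individual classifiers.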


### Evaluating on test set

In [19]:
y_pred = voting_clf2.predict(X_test)

acc_score = accuracy_score(y_test,y_pred)
pre_score = precision_score(y_test,y_pred, average='macro')
rec_score = recall_score(y_test,y_pred, average='macro')
f1 = f1_score(y_test,y_pred, average='macro')
cm = confusion_matrix(y_test,y_pred)



In [20]:
# Display results
print('THE BEST CLASSIFIER EVALUATED ON THE TEST SET GAVE:')
print(f'An overall accuracy of {np.round(acc_score*100,2)}%')
print(f'A macro-averaged precision of {np.round(pre_score*100,2)}%')
print(f'A macro-averaged recall of {np.round(rec_score*100,2)}%')
print(f'A macro-averaged F1 score of {np.round(f1*100,2)}%')
print(f'The confusion matrix is \n {cm}')

THE BEST CLASSIFIER EVALUATED ON THE TEST SET GAVE:
An overall accuracy of 97.34%
A macro-averaged precision of 97.32%
A macro-averaged recall of 97.32%
A macro-averaged F1 score of 97.32%
The confusion matrix is 
 [[ 972    0    1    0    0    2    2    1    2    0]
 [   0 1127    2    1    0    1    3    0    1    0]
 [   4    1 1001    4    2    0    3    8    8    1]
 [   0    0    2  985    0    6    0    8    7    2]
 [   1    0    3    0  962    0    4    0    2   10]
 [   2    0    0    8    1  866    7    1    6    1]
 [   7    3    0    0    4    7  932    0    5    0]
 [   2    7   14    2    1    0    0  991    1   10]
 [   4    0    4    6    3    7    1    6  939    4]
 [   5    5    4    8   13    2    1    5    7  959]]
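
The confusion matrix carries more detail than the aggregate scores. As a small sketch, the per-digit recall and precision and the most frequent confusion can be read off it with NumPy. The matrix is copied from the output above, and the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from the output above (rows: true digit, columns: predicted digit).
cm = np.array([
    [972,    0,    1,   0,   0,   2,   2,   1,   2,   0],
    [  0, 1127,    2,   1,   0,   1,   3,   0,   1,   0],
    [  4,    1, 1001,   4,   2,   0,   3,   8,   8,   1],
    [  0,    0,    2, 985,   0,   6,   0,   8,   7,   2],
    [  1,    0,    3,   0, 962,   0,   4,   0,   2,  10],
    [  2,    0,    0,   8,   1, 866,   7,   1,   6,   1],
    [  7,    3,    0,   0,   4,   7, 932,   0,   5,   0],
    [  2,    7,   14,   2,   1,   0,   0, 991,   1,  10],
    [  4,    0,    4,   6,   3,   7,   1,   6, 939,   4],
    [  5,    5,    4,   8,  13,   2,   1,   5,   7, 959],
])

recall_per_class = cm.diagonal() / cm.sum(axis=1)     # correct / all true instances
precision_per_class = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of that digit
accuracy = cm.trace() / cm.sum()

# Find the largest off-diagonal entry, i.e. the most frequent confusion.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_digit, predicted_digit = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(f"accuracy: {accuracy:.4f}")
print("recall per digit:", recall_per_class.round(4))
print(f"most common confusion: true {true_digit} predicted as {predicted_digit}")
```

On this matrix the digit 9 has the lowest recall, and the most frequent single confusion is a true 7 predicted as 2.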
