
### Ensemble Learning

Load the MNIST data and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing).
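The split can also be written as a small reusable helper. The sketch below is only an illustration: the function name `split_train_valid`, the seed, and the toy arrays are assumptions, not part of the exercise. It shuffles before slicing, which is worth doing when a dataset is not already shuffled.

```python
import numpy as np

def split_train_valid(x, y, n_valid, seed=42):
    """Shuffle x and y together, then hold out the first n_valid rows for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    return (x[n_valid:], y[n_valid:]), (x[:n_valid], y[:n_valid])

# Toy stand-in for MNIST: 60 flattened "images" with 784 features each
x_demo = np.zeros((60, 784))
y_demo = np.arange(60) % 10

(x_tr, y_tr), (x_va, y_va) = split_train_valid(x_demo, y_demo, n_valid=10)
print(x_tr.shape, x_va.shape)  # (50, 784) (10, 784)
```

With the real data, the same call with `n_valid=10000` on the 60,000 training images would give the 50,000 / 10,000 split described above.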

In [None]:
from tensorflow import keras
import numpy as np

In [None]:
# Load the MNIST dataset
mnist = keras.datasets.mnist
(x_train_full, y_train_full), (x_test, y_test) = mnist.load_data()

x_train_full = x_train_full / 255
x_train_full = np.reshape(x_train_full, (x_train_full.shape[0], 28 * 28))

x_valid, x_train = x_train_full[:10000], x_train_full[10000:]
y_valid, y_train = y_train_full[:10000], y_train_full[10000:]

In [None]:
x_train.shape

(50000, 784)

In [None]:
x_test = x_test / 255
x_test = np.reshape(x_test, (x_test.shape[0], 28 * 28))

In [None]:
x_test.shape

(10000, 784)

Then train various classifiers, such as a random forest classifier, an extra-trees classifier, and an SVM classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC

In [None]:
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(x_train, y_train)

etc = ExtraTreesClassifier(n_estimators=100, random_state=42)
etc.fit(x_train, y_train)

svm = SVC(kernel='linear', C=1, random_state=42)
svm.fit(x_train, y_train)

In [None]:
rfc.score(x_valid, y_valid)

0.9704

In [None]:
etc.score(x_valid, y_valid)

0.9715

In [None]:
svm.score(x_valid, y_valid)

0.9365

Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting.
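To make the difference between the two voting modes concrete, here is a minimal NumPy sketch. The probabilities are made up for illustration and do not come from the classifiers in this notebook. Hard voting counts one vote per classifier, while soft voting averages the predicted class probabilities, so a single very confident classifier can outweigh two hesitant ones.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample
# (rows: classifiers, columns: classes 0 and 1); the values are invented
probas = np.array([
    [0.51, 0.49],
    [0.51, 0.49],
    [0.05, 0.95],
])

# Hard voting: each classifier votes for its most likely class, majority wins
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class
soft_prediction = probas.mean(axis=0).argmax()

print("hard:", hard_prediction, "soft:", soft_prediction)  # hard: 0 soft: 1
```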

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

*Hard voting:*

In [None]:
voting_clf = VotingClassifier(
    estimators=[('rfc', rfc), ('etc', etc), ('svm', svm)],
    voting='hard'
)

# Train the voting classifier on the training set
voting_clf.fit(x_train, y_train)

In [None]:
# Evaluate the accuracy of each classifier and the ensemble on the validation set

classifiers = [('Random Forest Classifier', rfc),
               ('Extra-Trees Classifier', etc),
               ('Linear SVM Classifier', svm),
               ('Voting Classifier', voting_clf)]

for name, clf in classifiers:
    y_pred = clf.predict(x_valid)
    accuracy = accuracy_score(y_valid, y_pred)
    print(name, "accuracy:", accuracy)

Random Forest Classifier accuracy: 0.9704
Extra-Trees Classifier accuracy: 0.9715
Linear SVM Classifier accuracy: 0.9365
Voting Classifier accuracy: 0.9715


*Soft voting:*

In [None]:
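# Soft voting needs predict_proba(), which SVC only provides with probability=True
voting_clf.voting = "soft"
voting_clf.named_estimators["svm"].probability = True

# Refit so the ensemble's internal copies are trained with the new settings
voting_clf.fit(x_train, y_train)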
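The `probability = True` setting on the SVM is what makes soft voting possible here. As a rough sketch, assuming a 300-sample slice of scikit-learn's small `load_digits` dataset as a fast stand-in for MNIST, the following compares an `SVC` fitted with and without that flag.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

# Assumption: a small slice of the digits dataset stands in for MNIST
X, y = load_digits(return_X_y=True)
X, y = X[:300] / 16.0, y[:300]

# Default SVC: class predictions only, no probability estimates
plain_svm = SVC(kernel="linear", C=1, random_state=42).fit(X, y)
print(hasattr(plain_svm, "predict_proba"))

# probability=True adds an internal cross-validated calibration step to fit()
proba_svm = SVC(kernel="linear", C=1, probability=True, random_state=42).fit(X, y)
probas = proba_svm.predict_proba(X[:5])
print(probas.shape)
```

The calibration step makes fitting noticeably slower, which is one reason it is off by default.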

In [None]:
classifiers = [('Random Forest Classifier', rfc),
               ('Extra-Trees Classifier', etc),
               ('Linear SVM Classifier', svm),
               ('Voting Classifier (soft)', voting_clf)]

for name, clf in classifiers:
    y_pred = clf.predict(x_valid)
    accuracy = accuracy_score(y_valid, y_pred)
    print(name, "accuracy:", accuracy)

Random Forest Classifier accuracy: 0.9704
Extra-Trees Classifier accuracy: 0.9715
Linear SVM Classifier accuracy: 0.9365
Voting Classifier (soft) accuracy: 0.9672


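*Hard voting has better validation accuracy than soft voting (0.9715 vs. 0.9672), but it only matches the best individual classifier (Extra-Trees, 0.9715) rather than outperforming it.*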
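Since the linear SVM is clearly the weakest member on the validation set (0.9365), one possible follow-up is to check whether the ensemble does better without it. The sketch below is not part of the original experiment. It assumes the small `load_digits` dataset and 50 trees per forest so that it runs in seconds, and it uses `set_params(svm="drop")` to remove the SVM from a `VotingClassifier`. The numbers it prints will differ from the MNIST results above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the digits dataset is a fast stand-in for MNIST
X, y = load_digits(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X / 16.0, y, test_size=0.3, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("rfc", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("etc", ExtraTreesClassifier(n_estimators=50, random_state=42)),
        ("svm", SVC(kernel="linear", C=1, random_state=42)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
acc_with_svm = ensemble.score(X_valid, y_valid)

# Drop the SVM and refit so that only the two tree ensembles vote
ensemble.set_params(svm="drop")
ensemble.fit(X_train, y_train)
acc_without_svm = ensemble.score(X_valid, y_valid)

print("with SVM:", acc_with_svm, "without SVM:", acc_without_svm)
```

Whether dropping a member helps depends on the data, so any such comparison should be made on the validation set, not the test set.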

In [None]:
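# Switch back to hard voting; the mode is read at prediction time, so no refit is needed
voting_clf.voting = "hard"

# Evaluate the ensemble on the held-out test set
voting_clf.score(x_test, y_test)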

0.9705

In [None]:
rfc.score(x_test, y_test)

0.9689

In [None]:
etc.score(x_test, y_test)

0.9709

In [None]:
svm.score(x_test, y_test)

0.9393