# Ensemble Methods

**Overview**<br>

1. Voting Ensemble
    *  Load MNIST data
    *  Split into train, val and test set
    *  Train various classifiers on train set
    *  Combine classifiers into an ensemble that outperforms all the individual classifiers, using hard or soft voting
    *  Measure performance on test set. How much better does it do than the individual classifiers?
<br><br>
2. Stacking Ensemble
    *  Run the previous individual classifiers on the val set and greate a new training set with the predictions
    *  Train classifier (blender) on new train set (vectors of the individual classifier's predictions)
    *  Evaluate the full ensemble on the test set
    *  How does it compare to the previous voting classifier?

In [315]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

%reset

Nothing done.


## Voting Ensemble
**Step 1: Load data**

In [316]:
digits = load_digits()
X = digits['data']
y = digits['target']

**Step 2: Train, val, and test sets**

In [317]:
train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.15

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio)


X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 

**Step 3: Train various classifiers on train set**

In [318]:
# Random Forest
rf_clf = RandomForestClassifier(max_depth=20, random_state=42)
rf_clf.fit(X_train, y_train)
predictions = rf_clf.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
print(f'Random Forest accuracy: {accuracy*100}%')

# Extra Trees
extra_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
extra_clf.fit(X_train, y_train)
predictions = extra_clf.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
print(f'Extra Trees accuracy: {accuracy*100}%')

# SVC
svc = SVC(kernel='poly', probability=True)
svc.fit(X_train, y_train)
predictions = svc.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
print(f'SVC accuracy: {accuracy*100}%')

Random Forest accuracy: 96.44444444444444%
Extra Trees accuracy: 97.33333333333334%
SVC accuracy: 98.22222222222223%


**Step 4: Combine classifiers into an ensemble**

In [319]:
voting_clf = VotingClassifier(
    estimators=[('rf', rf_clf), ('xtra', extra_clf), ('svc', svc)],
    voting='soft'
)

voting_clf.fit(X_train, y_train)

for clf in (rf_clf, extra_clf, svc, voting_clf):
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print(f'{clf.__class__.__name__} {(accuracy_score(y_test, predictions)*100)}%')

RandomForestClassifier 98.22222222222223%
ExtraTreesClassifier 98.22222222222223%
SVC 99.11111111111111%
VotingClassifier 99.11111111111111%


## Stacking Ensemble
**Step 1: Generate new train set with the individual predictor's predictions**

In [323]:


def get_predictions(data):
    data_predictions = []

    for instance in X_val:
        instance_predictions = []

        for clf in (rf_clf, extra_clf, svc):
            prediction = clf.predict([instance])
            instance_predictions.append(prediction[0])
        
        data_predictions.append(instance_predictions)

        instance_predictions = []

    return data_predictions


# new train set
X_train_new = get_predictions(X_val)
y_train_new = y_val

**Step 2: Train blender on new train set**

In [324]:
blender = RandomForestClassifier(max_depth=20, random_state=42)
blender.fit(X_train_new, y_train_new)

RandomForestClassifier(max_depth=20, random_state=42)

**Step 3: Evaluate the full ensemble on the test set**

In [325]:
# this is the predictions of the individual classifiers on the test set
clf_predictions = get_predictions(X_test)

predictions = blender.predict(clf_predictions)
accuracy = accuracy_score(y_val, predictions)
print(f'Stacking ensemble accuracy: {accuracy*100}%')

Stacking ensemble accuracy: 99.55555555555556%
