<h1>
<center>Classifying Digits</center>
</h1>

<h1>
<center>Avery Lee</center>
</h1>

# SVM Classification and Regression 

Let's explore using SVM classification for the MNIST dataset. 

In [None]:
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False, cache=True)
mnist.target = mnist.target.astype(np.int8)
X_train = mnist["data"][:60000]
X_test  = mnist["data"][60000:]
y_train = mnist["target"][:60000]
y_test  = mnist["target"][60000:]

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

N = 2000
split_obj = StratifiedShuffleSplit(n_splits=1,
                               test_size=N/60000, random_state=42)
for other_idx, subsample_idx in split_obj.split(X_train, y_train):
    X = X_train[subsample_idx]
    y = y_train[subsample_idx]
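
Let's sanity-check what the stratified subsample is meant to guarantee: the class proportions in the subsample should track those of the full set. The sketch below is only an illustration on synthetic labels; `X_demo`, `y_demo`, and `n_sub` are made-up names and are not part of the MNIST data above. It passes the subsample size as an integer `test_size`, an assumption made here to sidestep floating-point rounding of a fractional size.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in data (hypothetical): 6000 labels drawn from 10 classes.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 10, size=6000)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant to the split

n_sub = 200
splitter = StratifiedShuffleSplit(n_splits=1, test_size=n_sub, random_state=42)
_, sub_idx = next(splitter.split(X_demo, y_demo))

# Compare class fractions in the full set and in the subsample.
full_frac = np.bincount(y_demo, minlength=10) / len(y_demo)
sub_frac = np.bincount(y_demo[sub_idx], minlength=10) / n_sub
max_gap = np.abs(full_frac - sub_frac).max()
print(f"largest class-fraction gap: {max_gap:.4f}")
```

A small gap indicates that every digit is represented in the subsample at roughly its original rate.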

Fit the linear SVM classifier (`LinearSVC`) with `max_iter=50000`. For this model, optimize the hyperparameter $C$ using 3-fold CV over the values $10^{-k}$, $k=0,1,\dots,9$, where the performance measure is accuracy. 

The best C is 1e-07 and the accuracy is 0.862497.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

svm_clf = LinearSVC(random_state=42, max_iter=50000)
param_distributions = [{'C':[10**(-k) for k in range(10)]}]
rnd_search_cv = GridSearchCV(svm_clf, param_distributions, cv=3, scoring='accuracy', n_jobs=-1)
rnd_search_cv.fit(X, y)
rnd_best_estim = rnd_search_cv.best_estimator_
rnd_accuracy = rnd_search_cv.best_score_
print(f'best estimator, accuracy: {rnd_best_estim}, {rnd_accuracy}')

best estimator, accuracy: LinearSVC(C=1e-07, max_iter=50000, random_state=42), 0.8624974299636969
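
The printout above only shows the winning value of $C$. To see how accuracy changes across the whole grid, the fitted search object's `cv_results_` can be tabulated. The sketch below is a minimal illustration on scikit-learn's small built-in `load_digits` data with a shortened grid, both chosen here only to keep it fast; `X_demo`, `y_demo`, `grid`, and `scores` are hypothetical names, and its numbers are unrelated to the MNIST results.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Small stand-in dataset (8x8 digits), not the MNIST subsample used above.
X_demo, y_demo = load_digits(return_X_y=True)

grid = GridSearchCV(
    LinearSVC(random_state=42, max_iter=5000),
    {"C": [1e-4, 1e-2, 1.0]},  # shortened grid for illustration
    cv=3,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

# One row per candidate C with its mean and spread of fold accuracies.
scores = pd.DataFrame(grid.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score"]
]
print(scores.sort_values("mean_test_score", ascending=False))
```

Looking at the full table helps confirm that the best value sits inside the grid rather than at one of its ends.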


Now let's try fitting an SVM with a Gaussian RBF kernel and `max_iter=50000`. For this model, optimize the hyperparameter $C$ over the distribution `uniform(1,10)` and $\gamma$ over the distribution `reciprocal(0.001, 0.1)` with 10 random samples. Again, use 3-fold CV with accuracy as the performance measure.

The best hyperparameters are C=4.745401188473625 and gamma=0.07969454818643928, and the accuracy is 0.11250005627, which is barely above the 0.1 chance level for ten roughly balanced classes.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform 
from sklearn.svm import SVC

rbfc_param_distributions = {"C": uniform(1, 10), "gamma": reciprocal(0.001, 0.1)}
rbfc_svm_clf = SVC(kernel='rbf', max_iter=50000, random_state=42)
rbfc_rnd_search_cv = RandomizedSearchCV(rbfc_svm_clf, 
                                        rbfc_param_distributions, 
                                        n_iter=10, 
                                        cv=3,
                                        scoring='accuracy',
                                        n_jobs=-1,
                                        random_state=42)
rbfc_rnd_search_cv.fit(X, y)
rbfc_best_estim = rbfc_rnd_search_cv.best_estimator_
rbfc_accuracy = rbfc_rnd_search_cv.best_score_
print(f'best estimator, accuracy: {rbfc_best_estim}, {rbfc_accuracy}')

best estimator, accuracy: SVC(C=4.745401188473625, gamma=0.07969454818643928, max_iter=50000,
    random_state=42), 0.11250005627816723
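
A note on the search distributions: SciPy's `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, so `uniform(1, 10)` draws $C$ from $[1, 11]$ rather than $[1, 10]$, while `reciprocal(a, b)` is log-uniform on $[a, b]$. The short check below reads the interval end points and the median from the frozen distributions; the variable names are illustrative only.

```python
from scipy.stats import reciprocal, uniform

c_dist = uniform(1, 10)              # loc=1, scale=10
gamma_dist = reciprocal(0.001, 0.1)  # log-uniform between 0.001 and 0.1

# ppf(0) and ppf(1) give the lower and upper ends of the support.
c_low, c_high = c_dist.ppf(0), c_dist.ppf(1)
# The median of a log-uniform distribution is the geometric mean of its ends.
gamma_median = gamma_dist.ppf(0.5)

print(c_low, c_high, gamma_median)
```

If a range of exactly $[1, 10]$ is intended, `uniform(1, 9)` would be the matching call.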


From the results above, we can see that the better model is the LinearSVC (CV accuracy 0.862 versus 0.113 for the RBF SVC). Its accuracy on the test set is 0.8873.

In [None]:
from sklearn.metrics import accuracy_score

# LinearSVC had better CV accuracy than the RBF SVC, so evaluate it on the test set
y_pred1 = rnd_best_estim.predict(X_test)
better_accuracy = accuracy_score(y_test, y_pred1)
better_accuracy

0.8873

# Voting Classifiers 

Now let's explore voting classifiers as well. To save computational time, split the data into a smaller training set (the first 5000 observations) and a validation set (the next 1000 observations) as given by the following code.

In [None]:
N = 5000
M = 6000
X_train = mnist["data"][:N]
X_val  = mnist["data"][N:M]
y_train = mnist["target"][:N]
y_val = mnist["target"][N:M]

Train the following classifiers on the training set:

(i) a random forest classifier with arguments `n_estimators=100, n_jobs=-1, random_state=42`,

(ii) an extra-trees classifier with arguments `n_estimators=100, n_jobs=-1, random_state=42`,

(iii) an AdaBoost classifier with arguments `n_estimators=50, learning_rate=0.2, random_state=42`,

(iv) a gradient boosting classifier using the class `GradientBoostingClassifier()` with arguments `max_depth=2, n_estimators=10, learning_rate=0.25, random_state=42`.

In [None]:
def print_accuracy(classifiers):
    for clf in classifiers:
        clf.fit(X_train, y_train)
        clf_y_pred = clf.predict(X_val)
        print(clf.__class__.__name__, accuracy_score(y_val, clf_y_pred))

In [None]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier

# (i)
rf_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# (ii)
extra_clf = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# (iii)
ada_clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.2, random_state=42)

# (iv)
gb_clf = GradientBoostingClassifier(max_depth=2, n_estimators=10, learning_rate=0.25, random_state=42)

We can see from the results below that the ExtraTreesClassifier had the best validation accuracy, 0.947.

In [None]:
# fit, predict, accuracy 
print_accuracy([rf_clf, extra_clf, ada_clf, gb_clf])

RandomForestClassifier 0.939
ExtraTreesClassifier 0.947
AdaBoostClassifier 0.736
GradientBoostingClassifier 0.834


Now let's train a hard-voting and a soft-voting ensemble classifier based on the four models above. Evaluate each voting classifier on the validation set.

The hard-voting and soft-voting classifiers reached validation accuracies of 0.923 and 0.926, respectively. Both fall between the individual models: higher than AdaBoost (0.736) and gradient boosting (0.834) but lower than random forest (0.939) and extra-trees (0.947). This makes sense because each ensemble takes votes from all four classifiers, so the two weaker models can pull the combined prediction away from the two stronger ones on some observations.
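
To make the difference between the two voting rules concrete, here is a minimal NumPy sketch for a single observation. The probabilities are made up for illustration and do not come from the models in this notebook: hard voting counts one vote per classifier, while soft voting averages the predicted class probabilities, so one confident classifier can outweigh two hesitant ones.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE observation
# (rows: classifiers, columns: classes 0, 1, 2).
probas = np.array([
    [0.90, 0.10, 0.00],  # very confident in class 0
    [0.40, 0.60, 0.00],  # weakly prefers class 1
    [0.45, 0.55, 0.00],  # weakly prefers class 1
])

# Hard voting: each classifier votes for its most probable class.
votes = probas.argmax(axis=1)
hard_choice = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then take the most probable class.
mean_proba = probas.mean(axis=0)
soft_choice = mean_proba.argmax()

print("hard vote:", hard_choice, "soft vote:", soft_choice)
```

Here the majority picks class 1 under hard voting, while the averaged probabilities favor class 0 under soft voting.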

In [None]:
from sklearn.ensemble import VotingClassifier

hard_voting_clf = VotingClassifier(estimators=[('rf', rf_clf), 
                                               ('extra', extra_clf), 
                                               ('ada', ada_clf), 
                                               ('gb', gb_clf)], 
                                   voting='hard')

soft_voting_clf = VotingClassifier(estimators=[('rf', rf_clf), 
                                               ('extra', extra_clf), 
                                               ('ada', ada_clf), 
                                               ('gb', gb_clf)], 
                                   voting='soft')

# fit, predict, accuracy 
hard_voting_clf.fit(X_train, y_train)
soft_voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('rf',
                              RandomForestClassifier(n_jobs=-1,
                                                     random_state=42)),
                             ('extra',
                              ExtraTreesClassifier(n_jobs=-1, random_state=42)),
                             ('ada',
                              AdaBoostClassifier(learning_rate=0.2,
                                                 random_state=42)),
                             ('gb',
                              GradientBoostingClassifier(learning_rate=0.25,
                                                         max_depth=2,
                                                         n_estimators=10,
                                                         random_state=42))],
                 voting='soft')

In [None]:
print_accuracy([hard_voting_clf, soft_voting_clf])

VotingClassifier 0.923
VotingClassifier 0.926


# Stacking 

For each of the four individual classifiers, let's make clean (out-of-fold) predictions for the 5000 training observations with 3-fold cross validation using `sklearn.model_selection.cross_val_predict()`. You should end up with four predictions per observation. Next, apply one-hot encoding to these predictions (`train_preds` in the code below), since they are class labels.

In [None]:
from sklearn.model_selection import cross_val_predict

# cross validation predict
rf_y_train_pred = cross_val_predict(rf_clf, X_train, y_train, cv=3, n_jobs=-1)
extra_y_train_pred = cross_val_predict(extra_clf, X_train, y_train, cv=3, n_jobs=-1)
ada_y_train_pred = cross_val_predict(ada_clf, X_train, y_train, cv=3, n_jobs=-1)
gb_y_train_pred = cross_val_predict(gb_clf, X_train, y_train, cv=3, n_jobs=-1)

train_preds = pd.DataFrame({'rf':rf_y_train_pred, 
                            'extra':extra_y_train_pred, 
                            'ada boost':ada_y_train_pred, 
                            'gradient boost':gb_y_train_pred})

In [None]:
train_preds[0:5]

Unnamed: 0,rf,extra,ada boost,gradient boost
0,5,5,3,3
1,0,0,5,0
2,4,4,4,4
3,1,1,1,1
4,9,9,9,9


In [None]:
from sklearn.preprocessing import OneHotEncoder

# one hot encoding
ohenc = OneHotEncoder()
ohencX_train = pd.DataFrame(ohenc.fit_transform(train_preds).toarray())
ohencX_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4996,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4997,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
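
To read the 40 columns above: each of the four classifiers contributes a block of ten indicator columns, one per digit. The NumPy sketch below reproduces that layout by hand for the first two rows of `train_preds` shown earlier. It assumes every classifier predicts all ten digits at least once, which is what makes the encoder produce exactly ten columns per classifier; `preds`, `one_hot`, and `active_columns` are illustrative names.

```python
import numpy as np

n_classes = 10
# First two rows of the predictions table (rf, extra, ada boost, gradient boost).
preds = np.array([[5, 5, 3, 3],
                  [0, 0, 5, 0]])

# np.eye(n_classes)[preds] has shape (rows, 4, 10); flattening the last two
# axes lays the four ten-column blocks side by side.
one_hot = np.eye(n_classes)[preds].reshape(len(preds), -1)

# Indices of the columns set to 1 in each row.
active_columns = [np.flatnonzero(row).tolist() for row in one_hot]
print(one_hot.shape, active_columns)
```

Column `10 * j + d` is 1 when classifier `j` predicted digit `d`, which agrees with the 1.0 entries in columns 5 and 33 of row 0 in the table above.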


Use the predictions as features and the actual label of the observations as the target. Train a random forest classifier on the training set with the parameters `n_estimators=100, random_state=42`. This classifier is a blender. 

In [None]:
rf_blender = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_blender.fit(ohencX_train, y_train)

RandomForestClassifier(n_jobs=-1, random_state=42)

Obtain the predictions of the blender on the validation set by feeding the validation-set predictions from the four individual classifiers into the trained blender. These are called stacking predictions. Report the accuracy of your stacking predictions on the validation set and compare to the original predictions.

The stacking predictor had an accuracy of 0.947, which is better than both the hard and soft voting classifiers (0.923 and 0.926) and matches the best individual model, the ExtraTreesClassifier (0.947).

In [None]:
rf_val_pred = rf_clf.predict(X_val)
extra_val_pred = extra_clf.predict(X_val)
ada_val_pred = ada_clf.predict(X_val)
gb_val_pred = gb_clf.predict(X_val)

val_preds = pd.DataFrame({'rf':rf_val_pred, 
                          'extra':extra_val_pred, 
                          'ada boost':ada_val_pred, 
                          'gradient boost':gb_val_pred})
                          
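# Reuse the encoder already fitted on the training predictions so the
# validation columns line up with the features the blender was trained on.
ohencX_val = pd.DataFrame(ohenc.transform(val_preds).toarray())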
blender_val_preds = rf_blender.predict(ohencX_val)
blender_val_accuracy = accuracy_score(y_val, blender_val_preds)
blender_val_accuracy

0.947
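
For reference, the whole stacking recipe can be condensed into a few lines. The sketch below is an illustration under stated assumptions, not the exact pipeline above: it uses scikit-learn's small `load_digits` data, only two base models, and fixes the encoder's categories to the ten digits so that training and validation features always share the same columns. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.preprocessing import OneHotEncoder

# Small stand-in dataset (8x8 digits), not the MNIST split used above.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

base_models = [RandomForestClassifier(n_estimators=50, random_state=42),
               ExtraTreesClassifier(n_estimators=50, random_state=42)]

# Out-of-fold label predictions: one column per base model.
oof = np.column_stack(
    [cross_val_predict(m, X_tr, y_tr, cv=3) for m in base_models])

# Fit the encoder once; fixed categories give ten columns per model.
encoder = OneHotEncoder(categories=[np.arange(10)] * len(base_models))
blender = RandomForestClassifier(n_estimators=50, random_state=42)
blender.fit(encoder.fit_transform(oof), y_tr)

# Refit the base models on the full training set, then stack on validation.
for m in base_models:
    m.fit(X_tr, y_tr)
val_labels = np.column_stack([m.predict(X_va) for m in base_models])
val_features = encoder.transform(val_labels)  # transform only, no refit
stack_acc = accuracy_score(y_va, blender.predict(val_features))
print(f"stacking accuracy on the demo validation set: {stack_acc:.3f}")
```

The key design choice is that the encoder is fitted on the training predictions and only applied to the validation predictions, so the blender always sees the feature layout it was trained on.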