# Preamble
This problem set is an extension of Problem Set 6.  You will need the following artifacts from last week:

* The MNIST 784 dataset from OpenML, with dimensionality reduced by PCA so that the retained components explain about 75\% of the variance.
* The Support Vector Machine classifier.
* Recoded target variables, such that the target variable is 1 if the digit is less than 5, and 0 otherwise.

(You may copy these artifacts from the posted solutions if needed.)  As with last week, please use the first 60,000 observations as training data, and the remaining 10,000 images as test data.

In [6]:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import (cross_val_predict, RandomizedSearchCV)
from sklearn.metrics import f1_score
import numpy as np


In [7]:

#fetch OpenML data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

#split into test/training sets
N=60000
X_train, y_train = X[:N, :], y[:N]
X_test, y_test = X[N:, :], y[N:]

#scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# PCA
pca = PCA(n_components = 0.75)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

#SVM classifier from last week
#May vary based on best classifier from last week
svm = SVC(coef0=0.3864834922270453, degree=4, kernel='poly')

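# recoded target variable: 1 if the digit is less than 5, 0 otherwise
# (the labels are strings here, hence the comparison against list("01234"))
recode_fn = lambda y: np.isin(y, list("01234")).astype(int)
y_test_rcd, y_train_rcd = (recode_fn(y) for y in [y_test, y_train])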

We can create a smaller subset (the first 10,000 training observations) for hyperparameter optimization.

In [8]:
N_hpo = int(1e4)
hpo_subset = lambda K : np.take(K, np.arange(0, N_hpo), 0)
y_hpo, X_hpo = map(hpo_subset, [y_train_rcd, X_train_pca])

# Problem 1 -- Classifiers

Construct 3 classifiers using different algorithms, not including the SVM model built last week, that classify the MNIST dataset with an $F_1$ score of at least 0.9.  At least one classifier must use boosting (AdaBoost, gradient boosting, or XGBoost).  Show the $F_1$ score and classification report for each model.

In [9]:
classifiers = {'svm': svm}

In [10]:
from scipy.stats import loguniform, randint, uniform

def test_classifier(clf, params, n_iter=20):
  # randomized hyperparameter search on the smaller HPO subset
  search = RandomizedSearchCV(clf, params,
                              scoring='f1', cv=3,
                              verbose=3, n_iter=n_iter)
  search.fit(X_hpo, y_hpo)
  # refit the best configuration on the full training set
  search.best_estimator_.fit(X_train_pca, y_train_rcd)
  # best_score_ is the mean cross-validated F1 on the HPO subset
  return search.best_estimator_, search.best_score_

## Gradient boosted tree model

In [6]:
from xgboost import XGBClassifier
params = {
          'learning_rate': loguniform(1e-2, 0.2),
          'n_estimators': randint(50,500),
          'max_depth': randint(2, 10)
}
xgclf = XGBClassifier()
classifiers['xgboost'] = test_classifier(xgclf, params)
classifiers['xgboost']

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END learning_rate=0.08116387404614266, max_depth=4, n_estimators=436;, score=0.942 total time=  25.0s
[CV 2/3] END learning_rate=0.08116387404614266, max_depth=4, n_estimators=436;, score=0.954 total time=  24.2s
[CV 3/3] END learning_rate=0.08116387404614266, max_depth=4, n_estimators=436;, score=0.928 total time=  24.6s
[CV 1/3] END learning_rate=0.17901563859157918, max_depth=4, n_estimators=455;, score=0.944 total time=  24.7s
[CV 2/3] END learning_rate=0.17901563859157918, max_depth=4, n_estimators=455;, score=0.950 total time=  25.0s
[CV 3/3] END learning_rate=0.17901563859157918, max_depth=4, n_estimators=455;, score=0.935 total time=  24.4s
[CV 1/3] END learning_rate=0.013249149636252563, max_depth=2, n_estimators=90;, score=0.800 total time=   2.5s
[CV 2/3] END learning_rate=0.013249149636252563, max_depth=2, n_estimators=90;, score=0.815 total time=   2.6s
[CV 3/3] END learning_rate=0.013249149636252563, ma

(XGBClassifier(learning_rate=0.19299343762455148, max_depth=9, n_estimators=409),
 0.9485250880423896)

## Random forest model

In [7]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
params = {'max_features': uniform(),
          'max_depth': randint(1, 20),
          'min_samples_leaf': randint(1, 20)}
classifiers['rfc'] = test_classifier(rfc, params)
classifiers['rfc']

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END max_depth=2, max_features=0.65016555342138, min_samples_leaf=6;, score=0.748 total time=   9.2s
[CV 2/3] END max_depth=2, max_features=0.65016555342138, min_samples_leaf=6;, score=0.772 total time=   9.1s
[CV 3/3] END max_depth=2, max_features=0.65016555342138, min_samples_leaf=6;, score=0.758 total time=   9.2s
[CV 1/3] END max_depth=3, max_features=0.9452290487135915, min_samples_leaf=17;, score=0.782 total time=  18.9s
[CV 2/3] END max_depth=3, max_features=0.9452290487135915, min_samples_leaf=17;, score=0.810 total time=  18.6s
[CV 3/3] END max_depth=3, max_features=0.9452290487135915, min_samples_leaf=17;, score=0.778 total time=  20.0s
[CV 1/3] END max_depth=5, max_features=0.6793287538531652, min_samples_leaf=17;, score=0.858 total time=  21.1s
[CV 2/3] END max_depth=5, max_features=0.6793287538531652, min_samples_leaf=17;, score=0.864 total time=  21.3s
[CV 3/3] END max_depth=5, max_features=0.67932875385

(RandomForestClassifier(max_depth=18, max_features=0.26213576122784343,
                        min_samples_leaf=6), 0.9252254322144949)

## Logistic regression model
A logistic regression model by itself produces, at best, an accuracy in the high 80s (percent).  As we saw in Week 4, we can add dimensionality by adding polynomial features.  Last week, we may have observed that the polynomial kernel performed well, which suggests that adding polynomial features can improve predictive accuracy.

Note: ideally, we would test multiple degrees; however, we exhaust the available memory in Colab when using a polynomial degree greater than 2.
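
To see why higher degrees become infeasible, it helps to count the expanded features. The sketch below checks the closed-form count $\binom{n+d}{d}$ against `PolynomialFeatures` on a tiny dummy array, then applies it to a hypothetical input of 100 components. The figure of 100 and the helper name `n_poly_features` are assumptions for illustration; the actual number of PCA components depends on the data.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features, degree):
    """Number of columns PolynomialFeatures(degree) produces, bias included."""
    return comb(n_features + degree, degree)


# sanity check against scikit-learn on a tiny dummy array
X_demo = np.zeros((1, 5))
for degree in (1, 2, 3):
    fitted = PolynomialFeatures(degree).fit(X_demo)
    assert fitted.n_output_features_ == n_poly_features(5, degree)

# ASSUMPTION: 100 input components, 60,000 rows stored densely as float64
n_components_assumed = 100
for degree in (2, 3):
    n_out = n_poly_features(n_components_assumed, degree)
    gigabytes = 60_000 * n_out * 8 / 1e9
    print(f"degree={degree}: {n_out} features, about {gigabytes:.1f} GB dense")
```

Under that assumption, moving from degree 2 to degree 3 multiplies the number of columns by more than 30, which is consistent with running out of memory.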

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

lrclf = make_pipeline(PolynomialFeatures(2),
                      LogisticRegression(max_iter=2000))

params = {'logisticregression__C': loguniform(1e-4,1e4)}
classifiers['logreg'] = test_classifier(lrclf, params)
classifiers['logreg']

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END logisticregression__C=52.13339337121985;, score=0.949 total time=  22.9s
[CV 2/3] END logisticregression__C=52.13339337121985;, score=0.956 total time=  31.1s
[CV 3/3] END logisticregression__C=52.13339337121985;, score=0.940 total time=  27.8s
[CV 1/3] END logisticregression__C=0.00013211684329604987;, score=0.961 total time=   4.8s
[CV 2/3] END logisticregression__C=0.00013211684329604987;, score=0.967 total time=   7.2s
[CV 3/3] END logisticregression__C=0.00013211684329604987;, score=0.948 total time=   9.3s
[CV 1/3] END logisticregression__C=11.940107503366715;, score=0.949 total time=  21.9s
[CV 2/3] END logisticregression__C=11.940107503366715;, score=0.957 total time=  27.7s
[CV 3/3] END logisticregression__C=11.940107503366715;, score=0.941 total time=  32.8s
[CV 1/3] END logisticregression__C=0.0005352251994702991;, score=0.960 total time=   6.2s
[CV 2/3] END logisticregression__C=0.0005352251994702991;

(Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                 ('logisticregression',
                  LogisticRegression(C=0.00013211684329604987, max_iter=2000))]),
 0.9586893948324802)
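
The problem also asks for a classification report for each model. A minimal sketch of how that could be printed follows, using a small synthetic dataset and a plain random forest as stand-ins. Both are assumptions for illustration, as is the helper name `report`; in the notebook one would pass each fitted estimator together with held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split


def report(name, clf, X, y):
    """Print the F1 score and classification report; return the F1 score."""
    y_pred = clf.predict(X)
    score = f1_score(y, y_pred)
    print(f"{name}: F1 = {score:.3f}")
    print(classification_report(y, y_pred))
    return score


# ASSUMPTION: synthetic binary data stands in for the recoded MNIST labels
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
demo_f1 = report("rfc (synthetic demo)", demo_clf, X_val, y_val)
```

Note that `test_classifier` returns an `(estimator, score)` tuple, so the fitted estimator is the first element of each entry it stores in `classifiers`.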

# Problem 2 -- Voting ensemble model

Build a voting ensemble model that combines the three classifiers from the previous problem, in addition to the SVM model developed last week.  What is the $F_1$ score of the ensemble model?

In [None]:
from sklearn.ensemble import VotingClassifier
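
One way the voting ensemble could be assembled is sketched below. It is not a worked solution: it runs on a small synthetic dataset, uses untuned estimators, and swaps in scikit-learn's `GradientBoostingClassifier` for xgboost to stay self-contained (all assumptions). Hard voting is used here because `SVC` does not expose `predict_proba` unless it is built with `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# ASSUMPTION: synthetic binary data stands in for the PCA-reduced MNIST set
X_demo, y_demo = make_classification(n_samples=400, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

voter = VotingClassifier(
    estimators=[
        ('svm', SVC(kernel='poly', degree=4)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=0)),
        ('rfc', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('logreg', LogisticRegression(max_iter=2000)),
    ],
    voting='hard',  # majority vote on the predicted labels
)

# VotingClassifier clones each estimator and fits the clones
voter.fit(X_tr, y_tr)
voting_f1 = f1_score(y_val, voter.predict(X_val))
print(f"voting F1 = {voting_f1:.3f}")
```

With four voters a 2-2 tie is possible; scikit-learn then picks the class that comes first in ascending label order, so an odd number of voters or soft voting may be worth considering.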



# Problem 3 -- Stacking ensemble model
Stacking trains a final classifier (often a logistic regression) on the outputs of the base predictors, so the rule for combining them is learned rather than fixed. Repeat the previous problem using a `StackingClassifier` rather than voting to compute the final prediction.  What is the $F_1$ score of the stacking classifier?
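
A minimal sketch of the stacking setup, again on synthetic data with two untuned base estimators (assumptions for illustration, not the assignment's solution). With `cv=3`, the final logistic regression is fitted on out-of-fold predictions of the base estimators, which keeps it from simply learning their training-set fit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# ASSUMPTION: synthetic binary data stands in for the PCA-reduced MNIST set
X_demo, y_demo = make_classification(n_samples=400, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('svm', SVC(kernel='poly', degree=4)),
        ('rfc', RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # learns how to weight the base outputs
    cv=3,  # base-estimator outputs for the final estimator come from 3-fold CV
)

stack.fit(X_tr, y_tr)
stacking_f1 = f1_score(y_val, stack.predict(X_val))
print(f"stacking F1 = {stacking_f1:.3f}")
```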


# Problem 4 -- Evaluation

At this point in the assignment, you have six classifiers:

* the support vector classifier from last week,
* the three classifiers from problem 1,
* the voting classifier from problem 2, and
* the stacking classifier from problem 3.

Identify the model with the highest $F_1$ score, and train this model with the full training dataset.  Finally, score the test data against this model.  Does the model demonstrate predictive validity (i.e., is the $F_1$ score on the test data comparable to the score on the training data)?
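
One way to frame the comparison is sketched below on synthetic data, with a random forest standing in for whichever model scores highest (both assumptions). The training-side $F_1$ is estimated from out-of-fold predictions, which is a design choice made here rather than something the assignment prescribes; the model is then refitted on all training data and the test set is scored once.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split

# ASSUMPTION: synthetic data and a random forest stand in for MNIST
# and for the best of the six classifiers
X_demo, y_demo = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

best_model = RandomForestClassifier(n_estimators=50, random_state=0)

# out-of-fold predictions: each training row is predicted by a model
# that did not see it during fitting
train_f1 = f1_score(y_tr, cross_val_predict(best_model, X_tr, y_tr, cv=3))

# refit on the full training set, then score the held-out test set once
best_model.fit(X_tr, y_tr)
test_f1 = f1_score(y_te, best_model.predict(X_te))

print(f"cross-validated training F1 = {train_f1:.3f}")
print(f"test F1                     = {test_f1:.3f}")
```

A test score close to the cross-validated training score suggests the model generalizes; a large drop would point to overfitting.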