# Lab 7

Soroush Famili, James Lu, Nithanth Ram

## Problem 1: Revisiting Logistic Regression and MNIST.

### In Lab 5, you solved the handwriting recognition problem (the MNIST data set) using multi-class Logistic Regression.

1. Use Random Forests to try to get the best possible test accuracy on MNIST. This involves getting acquainted with how Random Forests work, understanding their parameters, and therefore using Cross Validation to find the best settings. How well can you do? You should use the accuracy metric, since this is what you used in Lab 5 – therefore this will allow you to compare your results from Random Forests with your results from L1- and L2- Regularized Logistic Regression. What are the hyperparameters of your best model?


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
import xgboost as xgb
import time

In [3]:
start_time = time.time()
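# as_frame=False returns NumPy arrays, so positional indexing such as X_test[i, :] works
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)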
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("%s minutes" % ((time.time() - start_time)/60))

1.5792430559794108 minutes


In [4]:
param_grid = {
    'n_estimators': list(range(10, 80, 10)),
    'criterion':['gini', 'entropy']
}

In [5]:
forest = RandomForestClassifier(max_features='log2', max_depth=20)

In [None]:
# Takes approximately 16 minutes to run
forest_CV = GridSearchCV(estimator=forest, param_grid=param_grid, cv=10)
forest_CV.fit(X_train, y_train)
print(forest_CV.best_params_)
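
The run time of the search above is driven by how many models it has to fit. A rough sketch of that count, using only the standard library and assuming `GridSearchCV` keeps its default `refit=True` (one extra fit of the best setting on the full training set):

```python
from itertools import product

# Same grid and fold count as the Random Forest search above.
param_grid = {
    'n_estimators': list(range(10, 80, 10)),
    'criterion': ['gini', 'entropy'],
}
cv = 10

# Every combination of grid values is one candidate model.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)   # 7 values of n_estimators x 2 criteria
n_cv_fits = n_candidates * cv    # each candidate is fit once per fold
total_fits = n_cv_fits + 1       # plus the final refit (assumes refit=True)

print(n_candidates, n_cv_fits, total_fits)
```

Adding one more hyperparameter with three values would triple the cost, so it helps to estimate this number before widening the grid.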

In [None]:
# Predict the whole test set in one vectorized call instead of one row at a time
predictions = forest_CV.predict(X_test)

print(accuracy_score(y_test, predictions))
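
Accuracy is the fraction of test samples whose predicted label equals the true label. A minimal pure-Python sketch of that computation, using made-up labels rather than real model output (the helper name `accuracy` is hypothetical):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Illustrative labels only; MNIST labels are digit strings such as '7'.
y_true = ['7', '2', '1', '0', '4']
y_pred = ['7', '2', '1', '0', '9']

print(accuracy(y_true, y_pred))  # 4 of 5 correct
```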

2. Use Boosting to do the same. Take the time to understand how XGBoost works (and/or other boosting packages available). Try your best to tune your hyper-parameters. As added motivation: typically the winners and near-winners of the Kaggle competition are those that are best able to tune and cross-validate XGBoost. What are the hyperparameters of your best model?

In [None]:
forest2 = xgb.XGBClassifier(n_jobs=-1)

In [None]:
param_grid_Boost = {
    'n_estimators': list(range(10, 40, 10)),
    'max_depth': list(range(1, 5)),
    'learning_rate':[.05],
    'subsample':[.2]
}
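
To build intuition for how `n_estimators` and `learning_rate` interact, here is a toy sketch of plain gradient boosting for squared-error regression on 1-D data, with depth-1 trees (stumps). It is not XGBoost's actual algorithm, which adds regularization, second-order information, and subsampling. All names and data below are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # (threshold, left value, right value)


def boost(xs, ys, n_estimators, learning_rate):
    """Start from the mean, then repeatedly add a shrunken stump fit to the residuals."""
    preds = [sum(ys) / len(ys)] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data: a noisy increasing trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1]

# Training error for 0, 10, and 30 boosting rounds at learning_rate=0.05.
errors = {n: mse(ys, boost(xs, ys, n, 0.05)) for n in (0, 10, 30)}
for n, e in errors.items():
    print(n, round(e, 4))
```

With a small learning rate each round removes only a small part of the remaining error, which is why a low `learning_rate` usually needs a larger `n_estimators` to reach the same training fit.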

In [None]:
start_time = time.time()
forest_GS = GridSearchCV(estimator=forest2, param_grid=param_grid_Boost, cv=5)
# No eval_metric here: it only applies to an eval_set, and GridSearchCV already scores each fold by accuracy
forest_GS.fit(X_train, y_train)
print("%s minutes" % ((time.time() - start_time)/60))
print(forest_GS.best_params_)
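
One practical caveat, stated as an assumption rather than a verified fact about this environment: `fetch_openml` returns the MNIST labels as strings such as `'7'`, and recent XGBoost releases appear to expect class labels encoded as integers `0..K-1`. A minimal pure-Python sketch of such an encoding (the helper name `encode_labels` is hypothetical):

```python
def encode_labels(labels):
    """Map arbitrary class labels to integers 0..K-1, in sorted class order."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[c] for c in labels], classes


# Made-up sample of string labels.
raw = ['3', '0', '9', '3', '1']
encoded, classes = encode_labels(raw)
print(encoded)   # integer codes suitable as training targets
print(classes)   # classes[i] recovers the original label for code i

# Decoding predictions back to the original labels.
decoded = [classes[i] for i in encoded]
```

In a real pipeline, `sklearn.preprocessing.LabelEncoder` does the same job and would be fit on `y_train` before calling `fit`.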

In [None]:
print(accuracy_score(y_test, forest_GS.predict(X_test)))

## Problem 2: Revisiting Logistic Regression and CIFAR-10

### Now that you have your pipeline set up, it should be easy to apply the above procedure to CIFAR-10. If you did something that takes significant computation time, keep in mind that CIFAR-10 is a few times larger.

1. What is the best accuracy you can get on the test data, by tuning Random Forests? What are the hyperparameters of your best model?

In [None]:
start_time = time.time()
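# as_frame=False returns NumPy arrays, so positional indexing such as X_test[i, :] works
X, y = fetch_openml('cifar-10-small', return_X_y=True, as_frame=False)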
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("%s minutes" % ((time.time() - start_time)/60))

In [None]:
param_grid = {
    'n_estimators': list(range(10, 80, 10)),
    'criterion':['gini', 'entropy']
}

In [None]:
forest = RandomForestClassifier(max_features='log2', max_depth=20)
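
Because `max_features='log2'` depends on the number of input features, the same setting behaves differently on the two data sets. A small sketch of how many features each split considers, assuming scikit-learn's rule of `max(1, int(log2(n_features)))` and flattened 32x32 RGB CIFAR-10 images (the helper name `features_per_split` is hypothetical):

```python
import math


def features_per_split(n_features, rule):
    """Number of candidate features examined at each split under the given rule."""
    if rule == 'log2':
        return max(1, int(math.log2(n_features)))
    if rule == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    raise ValueError(f"unknown rule: {rule}")


mnist_features = 28 * 28       # 784 grayscale pixels
cifar_features = 32 * 32 * 3   # 3072 values: 1024 pixels x 3 color channels

for name, n in [('MNIST', mnist_features), ('CIFAR-10', cifar_features)]:
    print(name, n,
          features_per_split(n, 'log2'),
          features_per_split(n, 'sqrt'))
```

Under `'log2'` each split sees only a handful of the available pixels on either data set, so adding `max_features` (for example `'sqrt'`) to the grid may be worth the extra compute.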

In [None]:
# Takes approximately 16 minutes to run
forest_CV = GridSearchCV(estimator=forest, param_grid=param_grid, cv=10)
forest_CV.fit(X_train, y_train)
print(forest_CV.best_params_)

In [None]:
# Predict the whole test set in one vectorized call instead of one row at a time
predictions = forest_CV.predict(X_test)

print(accuracy_score(y_test, predictions))

2. What is the best accuracy you can get on the test data, by tuning XGBoost? What are the hyperparameters of your best model?

In [None]:
forest2 = xgb.XGBClassifier(n_jobs=-1)

In [None]:
param_grid_Boost = {
    'n_estimators': list(range(10, 40, 10)),
    'max_depth': list(range(1, 5)),
    'learning_rate':[.05],
    'subsample':[.2]
}

In [None]:
start_time = time.time()
forest_GS = GridSearchCV(estimator=forest2, param_grid=param_grid_Boost, cv=5)
# No eval_metric here: it only applies to an eval_set, and GridSearchCV already scores each fold by accuracy
forest_GS.fit(X_train, y_train)
print("%s minutes" % ((time.time() - start_time)/60))
print(forest_GS.best_params_)

In [None]:
print(accuracy_score(y_test, forest_GS.predict(X_test)))