In [9]:
# import numpy as np
# import pandas as pd
# from sklearn.model_selection import GridSearchCV
# from keras.models import Sequential
# from keras.layers import Dense
# from keras.wrappers.scikit_learn import KerasClassifier
# from keras.optimizers import Adam
# from keras.callbacks import LearningRateScheduler

## Logistic Regression - Comparing *ADAM* and *AMSGrad* on MNIST

Here, I tune and train logistic regression models, to recreate the empirical results of section 5 of the paper.

As specified in the paper, I fix the parameter $\beta_1$ at .99, and tune the learning rate $\alpha$ and the hyperparameter $\beta_2$ using a gridsearch, as done in the paper. 

Note that to fit a logistic regression model, I'm using a one-layer feedforward neural network (they're equivalent). This is so that I can use the nice tools (including the ADAM optimizer) already implemented in the deep learning framework Keras. 

## 0. Load MNIST Dataset

I've already created train and test splits for the MNIST dataset. They are conviniently stored as compressed numpy arrays.

In [2]:
def load_np_file(path, mode = "rb"):
    with open(path, mode) as handle:
        return(np.load(path))

In [6]:
X_train = load_np_file("../data/MNIST/X_train.npy") 
X_test = load_np_file("../data/MNIST/X_test.npy") 
y_train = load_np_file("../data/MNIST/y_train.npy") 
y_test = load_np_file("../data/MNIST/y_test.npy") 

In [7]:
# sanity check - did all the shapes get preserved? 
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(60000, 784) (60000,)
(10000, 784) (10000,)


## 1. A framework for exhaustive gridsearch

The hyperpameters that I'll need to tune by gridsearch are: 

- $\beta_2$
- $\alpha$.

To do so in a neat fashion, and make use of all my cores (CPU training :( ) , I'll use the `GridSearchCV` class from `sklearn`, with the `KerasClassifier` wrapper.

The interface of this wrapper requres that I define a function that can be called with a set of hyperparameter options and create a `Sequential` model that can be compiled and trained. This is what I do here. 

Note the hyperparameters that I do not tune, as they are fixed by the authors:

- $\beta_1 = .9$
- Discount rate: $\alpha_t$ = $\frac{\alpha}{\sqrt{t}}$
- Batch size = 128

In [55]:
# A function that, when passed with hyperparameter options, returns a compiled model
# Note that if `amsgrad = True`, the method in the paper is used.
def create_model(lr=0.01, beta_2 = .99, amsgrad = False):
    # create model
    model = Sequential()
    model.add(Dense(10, input_dim=784, activation='sigmoid'))
    """
    Create a learning rate schedule, 
    so that alpha_t = alpha/sqrt(t), as specified in the paper.
    """

    
    # Compile model
    optimizer = Adam(lr=lr, beta_2 = beta_2, amsgrad = amsgrad, decay = .14)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

## 2. Gridsearch: Adam optimizer

Here I specify the ranges I'll want to look over for $\alpha$ and $\beta_2$

In [47]:
beta2_range = np.arange(.99, .999, .0015)
print( beta2_range)

[0.99   0.9915 0.993  0.9945 0.996  0.9975 0.999 ]


In [28]:
alpha_range = [.00001*5**i for i in range(6)]
print (alpha_range)

[1e-05, 5e-05, 0.00025, 0.00125, 0.00625, 0.03125]


In [48]:
param_grid = dict(lr=alpha_range, beta_2=beta2_range)

In [64]:
adam_model = KerasClassifier(build_fn=create_model, epochs=100, batch_size=128, verbose=2)

In [65]:
adam_model

<keras.wrappers.scikit_learn.KerasClassifier at 0x1131d4240>

Finally, run the gridsearch. I'll do 3-fold cross validation to choose the hyperparameters.

In [66]:
adam_grid = GridSearchCV(estimator=adam_model, param_grid=param_grid, n_jobs=-1)

In [None]:
# adam_grid = GridSearchCV(estimator=adam_model, param_grid=param_grid, n_jobs=-1)
adam_grid_result = adam_grid.fit(X_train, y_train)









Epoch 1/100




Epoch 1/100




Epoch 1/100
Epoch 1/100
 - 2s - loss: 8.6858 - acc: 0.1332
Epoch 2/100
 - 2s - loss: 8.7259 - acc: 0.1345
Epoch 2/100
 - 2s - loss: 8.1767 - acc: 0.1352
Epoch 2/100
 - 3s - loss: 8.7753 - acc: 0.1310
Epoch 2/100
 - 2s - loss: 8.6256 - acc: 0.1334
Epoch 3/100
 - 2s - loss: 8.6639 - acc: 0.1348
Epoch 3/100
 - 2s - loss: 7.8773 - acc: 0.1357
Epoch 3/100
 - 2s - loss: 8.7135 - acc: 0.1312
Epoch 3/100
 - 1s - loss: 8.6019 - acc: 0.1334
Epoch 4/100
 - 2s - loss: 8.6391 - acc: 0.1349
Epoch 4/100
 - 2s - loss: 7.7596 - acc: 0.1361
Epoch 4/100
 - 2s - loss: 8.6892 - acc: 0.1313
Epoch 4/100
 - 2s - loss: 8.5866 - acc: 0.1333
Epoch 5/100
 - 2s - loss: 8.6231 - acc: 0.1348
Epoch 5/100
 - 2s - loss: 7.6837 - acc: 0.1361
Epoch 5/100
 - 2s - loss: 8.6733 - acc: 0.1312
Epoch 5/100
 - 2s - loss: 8.5752 - acc: 0.1334
Epoch 6/100
 - 2s - loss: 8.6112 - acc: 0.1347
Epoch 6/100
 - 2s - loss: 7.6264 - acc: 0.1363
Epoch 6/100
 - 2s - loss: 8.6615 - acc: 0.1313
Epoch 6/100
 - 2s - loss: 8.5660 - acc: 0.1333
E