** The Plot script for all optimizers and comparisons **

In [39]:
import os, pickle
%matplotlib notebook
import matplotlib.pyplot as plt
import torch
import numpy as np
optim_list = ['sgd', 'adam']

In [40]:
def get_curve_data(optimizer, dataset, model):
    folder_path = "./curves"
    model_name = "{}-{}-{}".format(optimizer, dataset, model)
    file_path = os.path.join(folder_path, model_name)
    #return {key: torch.load(fp) for key, fp in zip(keys, paths)}
    print (file_path)
    with open(file_path, "rb") as f:
        return pickle.load(f)

In [41]:
def plot(model, dataset, optimizers=None, curve_type='test'):
    assert model in ['lstm', 'resnet'], 'Invalid model name: {}'.format(model)
    assert curve_type in ['train', 'test'], 'Invalid curve type: {}'.format(curve_type)
    
    plt.figure()
    plt.title('{} Accuracy for {}-{}'.format(curve_type.capitalize(), model, dataset))
    plt.xlabel('Epoch')
    plt.ylabel('{} Accuracy %'.format(curve_type.capitalize()))
    plt.ylim(50, 101)
    
    for optim in optimizers:
        linestyle = '--' if 'clip' in optim else '-'
        curve_data = get_curve_data(optim, dataset=dataset, model=model)
        print (optim, curve_data)
        accuracies = np.array(curve_data['{}_acc'.format(curve_type)])
        plt.plot(accuracies, label=optim, ls=linestyle)
        
    plt.grid(ls='--')
    plt.legend()
    plt.show()

In [43]:
plot_acc(model='lstm', dataset='imdb', optimizers=optim_list, curve_type='test')

<IPython.core.display.Javascript object>

./curves/sgd-imdb-lstm
sgd {'train_loss': [0.692633470516972, 0.6711409819845308, 0.6282795626241067, 0.5594661044573217, 0.4803819325114081], 'test_loss': [0.6874748866608802, 0.6672026657043619, 0.63828229257401, 0.5580748400789626, 0.5255366994345442], 'test_acc': 75.31914893617021}


KeyError: 'sgd'

We can see that adaptive methods such as AdaGrad, Adam and AMSGrad appear to perform better than the non-adaptive ones early in training. 
But by epoch 150 when the learning rates are decayed, SGD begins to outperform those adaptive methods. 
As for our methods, AdaBound and AMSBound, they converge as fast as adaptive ones and achieve a higher accuracy than SGD on the test set at the end of training. 
In addition, compared with their prototypes, their performances on the unseen data (test set) are enhanced evidently.

You may observe a slight decline of the test accuracy with SGD.
Since the decline also occurs in training, this may not be caused by overfitting but some other reasons.
This behavior indicates the unstable performance of SGD to some extent, which is similar to the observation reported in [He et al. (2016)](https://arxiv.org/abs/1512.03385).

## DensetNet

Here're the results with DenseNet-121.

In [6]:
plot(use_pretrained=True, model='DenseNet', optimizers=LABELS, curve_type='train')
plot(use_pretrained=True, model='DenseNet', optimizers=LABELS, curve_type='test')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

As is expected, the overall performance of each algorithm on DenseNet-121 is similar to that on ResNet-34. 
Despite the relative bad generalization ability of adaptive methods, our proposed methods overcome this drawback by allocating bounds for their learning rates and obtain almost the best accuracy on the test set for both DenseNet and ResNet on CIFAR-10.