# Problem Set 4
Designed by Ben Usman, Kun He, and Sarah Adel Bargal, with help from Kate Saenko and Brian Kulis.

This assignment will introduce you to:
1. Building and training a convolutional network
2. Saving snapshots of your trained model
3. Reloading weights from a saved model
4. Fine-tuning a pre-trained network
5. Visualizations using Tensorboard

This code has been tested and should for Python 3.5 and 2.7 with tensorflow 1.0.*. Since recently, you can update to recent tensorflow version just by doing `pip install tensorflow`,  or `pip install tensorflow-gpu` if you want to use GPU.

**Note:** This notebook contains problem descriptions and demo/starter code. However, you're welcome to implement and submit .py files directly, if that's easier for you. Starter .py files are provided in the same `pset4/` directory.

## Part 1: Building and Training a ConvNet on SVHN
(25 points)

First we provide demo code that trains a convolutional network on the [SVHN Dataset](http://ufldl.stanford.edu/housenumbers/).. 

You will need to download   __Format 2__ from the link above.
- Create a directory named `svhn_mat/` in the working directory. Or, you can create it anywhere you want, but change the path in `svhn_dataset_generator` to match it.
- Download `train_32x32.mat` and `test_32x32.mat` to this directory.
- `extra_32x32.mat` is NOT needed.
- You may find the `wget` command useful for downloading on linux. 



The following defines a generator for the SVHN Dataset, yielding the next batch every time next is invoked.

In [1]:
import copy
import os
import math
import numpy as np
import scipy
import scipy.io
from six.moves import range
import read_data
from time import time

@read_data.restartable
def svhn_dataset_generator(dataset_name, batch_size):
    assert dataset_name in ['train', 'test']
    assert batch_size > 0 or batch_size == -1  # -1 for entire dataset
    
    path = './svhn_mat/' # path to the SVHN dataset you will download in Q1.1
    file_name = '%s_32x32.mat' % dataset_name
    file_dict = scipy.io.loadmat(os.path.join(path, file_name))
    X_all = file_dict['X'].transpose((3, 0, 1, 2))
    y_all = file_dict['y']
    data_len = X_all.shape[0]
    batch_size = batch_size if batch_size > 0 else data_len
    
    X_all_padded = np.concatenate([X_all, X_all[:batch_size]], axis=0)
    y_all_padded = np.concatenate([y_all, y_all[:batch_size]], axis=0)
    y_all_padded[y_all_padded == 10] = 0
    
    for slice_i in range(int(math.ceil(data_len / batch_size))):
        idx = slice_i * batch_size
        X_batch = X_all_padded[idx:idx + batch_size]
        y_batch = np.ravel(y_all_padded[idx:idx + batch_size])
        yield X_batch, y_batch
        
import tensorflow as tf

def cnn_map(x_):
    conv1 = tf.layers.conv2d(
            inputs=x_,
            filters=32,
            kernel_size=[5, 5],
            padding="same",
            activation=tf.nn.relu,
            name='conv1')
    
    pool1 = tf.layers.max_pooling2d(inputs=conv1, 
                                    pool_size=[2, 2], 
                                    strides=2)
    
    conv2 = tf.layers.conv2d(
            inputs=pool1,
            filters=32,
            kernel_size=[5, 5],
            padding="same",
            activation=tf.nn.relu,
            name='conv2')
    
    pool2 = tf.layers.max_pooling2d(inputs=conv2, 
                                    pool_size=[2, 2], 
                                    strides=2)
        
    pool_flat = tf.contrib.layers.flatten(pool2, scope='pool2flat')
    dense = tf.layers.dense(inputs=pool_flat, units=500, activation=tf.nn.relu)
    logits = tf.layers.dense(inputs=dense, units=10)
    return logits


def apply_classification_loss(model_function):
    with tf.Graph().as_default() as g:
        with tf.device("/gpu:0"):
            x_ = tf.placeholder(tf.float32, [None, 32, 32, 3])
            y_ = tf.placeholder(tf.int32, [None])
            y_logits = model_function(x_)
            
            y_dict = dict(labels=y_, logits=y_logits)
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(**y_dict)
            cross_entropy_loss = tf.reduce_mean(losses)
            trainer = tf.train.AdamOptimizer()
            train_op = trainer.minimize(cross_entropy_loss)
            
            y_pred = tf.argmax(tf.nn.softmax(y_logits), dimension=1)
            correct_prediction = tf.equal(tf.cast(y_pred, tf.int32), y_)
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name = 'accuracy')
    
    model_dict = {'graph': g, 'inputs': [x_, y_], 'train_op': train_op,
                  'accuracy': accuracy, 'loss': cross_entropy_loss}
    
    return model_dict

def train_model(model_dict, dataset_generators, epoch_n, print_every=287):
    with model_dict['graph'].as_default(), tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        
        acc_tmp = 0.0
        for epoch_i in range(epoch_n):
            for iter_i, data_batch in enumerate(dataset_generators['train']):
                train_feed_dict = dict(zip(model_dict['inputs'], data_batch))
                sess.run(model_dict['train_op'], feed_dict=train_feed_dict)
                
                if iter_i % print_every == print_every-1:
                    collect_arr = []
                    for test_batch in dataset_generators['test']:
                        test_feed_dict = dict(zip(model_dict['inputs'], test_batch))
                        to_compute = [model_dict['loss'], model_dict['accuracy']]
                        collect_arr.append(sess.run(to_compute, test_feed_dict))
                    averages = np.mean(collect_arr, axis=0)
                    fmt = (epoch_i+1, print_every) + tuple(averages)
                    print('epoch {:d}, iter: {:d}, loss: {:.3f}, '
                          'accuracy: {:.3f}'.format(*fmt))
                    
            # Early stopping with patience of 3 epoches
            if averages[1] < acc_tmp:
                patience_ct += 1
                if patience_ct == 3:
                    print('Early stopping!'); break
            else: patience_ct = 0
            loss_tmp = averages[0]; acc_tmp = averages[1]
            
dataset_generators = {
        'train': svhn_dataset_generator('train', 256),
        'test': svhn_dataset_generator('test', 256)
}

In [17]:
model_dict = apply_classification_loss(cnn_map)
# I used early stopping with patience of 3 epoches on both loss and accuracy.
train_model(model_dict, dataset_generators, epoch_n=50)

epoch 1, iter: 287, loss: 1.155, accuracy: 0.648
epoch 2, iter: 287, loss: 0.805, accuracy: 0.765
epoch 3, iter: 287, loss: 0.700, accuracy: 0.804
epoch 4, iter: 287, loss: 0.689, accuracy: 0.812
epoch 5, iter: 287, loss: 0.740, accuracy: 0.805
epoch 6, iter: 287, loss: 0.733, accuracy: 0.812
epoch 7, iter: 287, loss: 0.741, accuracy: 0.818
epoch 8, iter: 287, loss: 0.784, accuracy: 0.817
epoch 9, iter: 287, loss: 0.814, accuracy: 0.823
epoch 10, iter: 287, loss: 0.818, accuracy: 0.825
epoch 11, iter: 287, loss: 0.939, accuracy: 0.808
epoch 12, iter: 287, loss: 1.042, accuracy: 0.797
epoch 13, iter: 287, loss: 0.977, accuracy: 0.815
epoch 14, iter: 287, loss: 1.046, accuracy: 0.823
epoch 15, iter: 287, loss: 1.143, accuracy: 0.807
epoch 16, iter: 287, loss: 1.092, accuracy: 0.819
epoch 17, iter: 287, loss: 1.085, accuracy: 0.826
epoch 18, iter: 287, loss: 1.272, accuracy: 0.820
epoch 19, iter: 287, loss: 1.296, accuracy: 0.820
epoch 20, iter: 287, loss: 1.346, accuracy: 0.820
epoch 21,

### Q1.3 SVHN Net Variations
Now we vary the structure of the network. To keep things simple, we still use  two identical conv layers, but vary their parameters. 

Report the final test accuracy on 3 different number of filters, and 3 different number of strides. Each time when you vary one parameter, keep the other fixed at the original value.

|Stride|Accuracy|
|--|-------------------------------|
| 3 | 0.834 |
| 4 | 0.825 |
| 5 | 0.807 |

|Filters|Accuracy|
|--|-------------------------------|
| 28 | 0.850 |
| 36 | 0.772 |
| 40 | 0.826 |

A template for one sample modification is given below. 

**Note:** you're welcome to decide how many training epochs to use, if that gets you the same results but faster.

In [18]:
def cnn_modification(x_, filters, strides):
    conv1 = tf.layers.conv2d(
            inputs=x_,
            filters=filters,
            kernel_size=[5, 5],
            padding="same",
            activation=tf.nn.relu,
            name='conv1')
    
    pool1 = tf.layers.max_pooling2d(inputs=conv1, 
                                    pool_size=[2, 2], 
                                    strides=strides)
    
    conv2 = tf.layers.conv2d(
            inputs=pool1,
            filters=filters,
            kernel_size=[5, 5],
            padding="same",
            activation=tf.nn.relu,
            name='conv2')
    
    pool2 = tf.layers.max_pooling2d(inputs=conv2, 
                                    pool_size=[2, 2], 
                                    strides=strides)
        
    pool_flat = tf.contrib.layers.flatten(pool2, scope='pool2flat')
    dense = tf.layers.dense(inputs=pool_flat, units=500, activation=tf.nn.relu)
    logits = tf.layers.dense(inputs=dense, units=10)
    return logits

def apply_classification_loss_modification(model_function, cnn_input):
    with tf.Graph().as_default() as g:
        with tf.device("/gpu:0"):
            x_ = tf.placeholder(tf.float32, [None, 32, 32, 3])
            y_ = tf.placeholder(tf.int32, [None])
            y_logits = model_function(x_, *cnn_input)
            
            y_dict = dict(labels=y_, logits=y_logits)
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(**y_dict)
            cross_entropy_loss = tf.reduce_mean(losses)
            trainer = tf.train.AdamOptimizer()
            train_op = trainer.minimize(cross_entropy_loss)
            
            y_pred = tf.argmax(tf.nn.softmax(y_logits), dimension=1)
            correct_prediction = tf.equal(tf.cast(y_pred, tf.int32), y_)
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name = 'accuracy')
    
    model_dict = {'graph': g, 'inputs': [x_, y_], 'train_op': train_op,
                  'accuracy': accuracy, 'loss': cross_entropy_loss}
    
    return model_dict

In [19]:
print('Filter size = {}, stride size = {}'.format(32,3))
modified_model_dict = apply_classification_loss_modification(cnn_modification, [32,3])
train_model(modified_model_dict, dataset_generators, epoch_n=100)

Filter size = 32, stride size = 3
epoch 1, iter: 287, loss: 1.491, accuracy: 0.519
epoch 2, iter: 287, loss: 1.015, accuracy: 0.693
epoch 3, iter: 287, loss: 0.910, accuracy: 0.732
epoch 4, iter: 287, loss: 0.813, accuracy: 0.763
epoch 5, iter: 287, loss: 0.782, accuracy: 0.771
epoch 6, iter: 287, loss: 0.732, accuracy: 0.791
epoch 7, iter: 287, loss: 0.716, accuracy: 0.800
epoch 8, iter: 287, loss: 0.731, accuracy: 0.800
epoch 9, iter: 287, loss: 0.712, accuracy: 0.806
epoch 10, iter: 287, loss: 0.711, accuracy: 0.814
epoch 11, iter: 287, loss: 0.788, accuracy: 0.803
epoch 12, iter: 287, loss: 0.831, accuracy: 0.797
epoch 13, iter: 287, loss: 0.829, accuracy: 0.801
epoch 14, iter: 287, loss: 0.875, accuracy: 0.804
epoch 15, iter: 287, loss: 0.859, accuracy: 0.804
epoch 16, iter: 287, loss: 0.825, accuracy: 0.824
epoch 17, iter: 287, loss: 0.828, accuracy: 0.827
epoch 18, iter: 287, loss: 0.904, accuracy: 0.824
epoch 19, iter: 287, loss: 0.889, accuracy: 0.826
epoch 20, iter: 287, loss

In [20]:
print('Filter size = {}, stride size = {}'.format(32,4))
modified_model_dict = apply_classification_loss_modification(cnn_modification, [32,4])
train_model(modified_model_dict, dataset_generators, epoch_n=100)

Filter size = 32, stride size = 4
epoch 1, iter: 287, loss: 1.397, accuracy: 0.556
epoch 2, iter: 287, loss: 1.046, accuracy: 0.684
epoch 3, iter: 287, loss: 0.899, accuracy: 0.733
epoch 4, iter: 287, loss: 0.820, accuracy: 0.759
epoch 5, iter: 287, loss: 0.752, accuracy: 0.782
epoch 6, iter: 287, loss: 0.752, accuracy: 0.786
epoch 7, iter: 287, loss: 0.718, accuracy: 0.794
epoch 8, iter: 287, loss: 0.705, accuracy: 0.800
epoch 9, iter: 287, loss: 0.733, accuracy: 0.797
epoch 10, iter: 287, loss: 0.662, accuracy: 0.816
epoch 11, iter: 287, loss: 0.663, accuracy: 0.819
epoch 12, iter: 287, loss: 0.644, accuracy: 0.823
epoch 13, iter: 287, loss: 0.659, accuracy: 0.826
epoch 14, iter: 287, loss: 0.685, accuracy: 0.820
epoch 15, iter: 287, loss: 0.702, accuracy: 0.821
epoch 16, iter: 287, loss: 0.716, accuracy: 0.816
epoch 17, iter: 287, loss: 0.695, accuracy: 0.820
epoch 18, iter: 287, loss: 0.692, accuracy: 0.825
epoch 19, iter: 287, loss: 0.739, accuracy: 0.820
epoch 20, iter: 287, loss

In [21]:
print('Filter size = {}, stride size = {}'.format(32,5))
modified_model_dict = apply_classification_loss_modification(cnn_modification, [32,5])
train_model(modified_model_dict, dataset_generators, epoch_n=100)

Filter size = 32, stride size = 5
epoch 1, iter: 287, loss: 1.440, accuracy: 0.545
epoch 2, iter: 287, loss: 1.046, accuracy: 0.686
epoch 3, iter: 287, loss: 0.871, accuracy: 0.742
epoch 4, iter: 287, loss: 0.797, accuracy: 0.768
epoch 5, iter: 287, loss: 0.797, accuracy: 0.769
epoch 6, iter: 287, loss: 0.746, accuracy: 0.786
epoch 7, iter: 287, loss: 0.747, accuracy: 0.785
epoch 8, iter: 287, loss: 0.719, accuracy: 0.794
epoch 9, iter: 287, loss: 0.704, accuracy: 0.802
epoch 10, iter: 287, loss: 0.728, accuracy: 0.797
epoch 11, iter: 287, loss: 0.737, accuracy: 0.795
epoch 12, iter: 287, loss: 0.727, accuracy: 0.799
epoch 13, iter: 287, loss: 0.716, accuracy: 0.800
epoch 14, iter: 287, loss: 0.725, accuracy: 0.804
epoch 15, iter: 287, loss: 0.742, accuracy: 0.804
epoch 16, iter: 287, loss: 0.726, accuracy: 0.811
epoch 17, iter: 287, loss: 0.720, accuracy: 0.817
epoch 18, iter: 287, loss: 0.779, accuracy: 0.802
epoch 19, iter: 287, loss: 0.803, accuracy: 0.799
epoch 20, iter: 287, loss

In [22]:
print('Filter size = {}, stride size = {}'.format(28,2))
modified_model_dict = apply_classification_loss_modification(cnn_modification, [28,2])
train_model(modified_model_dict, dataset_generators, epoch_n=100)

Filter size = 28, stride size = 2
epoch 1, iter: 287, loss: 2.237, accuracy: 0.197
epoch 2, iter: 287, loss: 2.199, accuracy: 0.211
epoch 3, iter: 287, loss: 2.062, accuracy: 0.278
epoch 4, iter: 287, loss: 1.753, accuracy: 0.398
epoch 5, iter: 287, loss: 0.809, accuracy: 0.765
epoch 6, iter: 287, loss: 0.677, accuracy: 0.816
epoch 7, iter: 287, loss: 0.785, accuracy: 0.802
epoch 8, iter: 287, loss: 0.671, accuracy: 0.841
epoch 9, iter: 287, loss: 0.685, accuracy: 0.846
epoch 10, iter: 287, loss: 0.753, accuracy: 0.838
epoch 11, iter: 287, loss: 0.793, accuracy: 0.841
epoch 12, iter: 287, loss: 0.856, accuracy: 0.843
epoch 13, iter: 287, loss: 0.946, accuracy: 0.843
epoch 14, iter: 287, loss: 1.005, accuracy: 0.845
epoch 15, iter: 287, loss: 1.061, accuracy: 0.845
epoch 16, iter: 287, loss: 1.121, accuracy: 0.846
epoch 17, iter: 287, loss: 1.202, accuracy: 0.840
epoch 18, iter: 287, loss: 1.347, accuracy: 0.829
epoch 19, iter: 287, loss: 1.472, accuracy: 0.840
epoch 20, iter: 287, loss

In [23]:
print('Filter size = {}, stride size = {}'.format(36,2))
modified_model_dict = apply_classification_loss_modification(cnn_modification, [36,2])
train_model(modified_model_dict, dataset_generators, epoch_n=100)

Filter size = 36, stride size = 2
epoch 1, iter: 287, loss: 2.224, accuracy: 0.196
epoch 2, iter: 287, loss: 2.224, accuracy: 0.196
epoch 3, iter: 287, loss: 2.222, accuracy: 0.197
epoch 4, iter: 287, loss: 2.222, accuracy: 0.196
epoch 5, iter: 287, loss: 1.133, accuracy: 0.648
epoch 6, iter: 287, loss: 0.964, accuracy: 0.710
epoch 7, iter: 287, loss: 0.918, accuracy: 0.733
epoch 8, iter: 287, loss: 0.896, accuracy: 0.741
epoch 9, iter: 287, loss: 0.853, accuracy: 0.755
epoch 10, iter: 287, loss: 0.891, accuracy: 0.746
epoch 11, iter: 287, loss: 0.884, accuracy: 0.747
epoch 12, iter: 287, loss: 0.879, accuracy: 0.751
epoch 13, iter: 287, loss: 0.915, accuracy: 0.745
epoch 14, iter: 287, loss: 0.866, accuracy: 0.766
epoch 15, iter: 287, loss: 0.880, accuracy: 0.759
epoch 16, iter: 287, loss: 0.903, accuracy: 0.760
epoch 17, iter: 287, loss: 0.920, accuracy: 0.762
epoch 18, iter: 287, loss: 0.984, accuracy: 0.753
epoch 19, iter: 287, loss: 0.933, accuracy: 0.775
epoch 20, iter: 287, loss

In [24]:
print('Filter size = {}, stride size = {}'.format(40,2))
modified_model_dict = apply_classification_loss_modification(cnn_modification, [40,2])
train_model(modified_model_dict, dataset_generators, epoch_n=100)

Filter size = 40, stride size = 2
epoch 1, iter: 287, loss: 1.205, accuracy: 0.620
epoch 2, iter: 287, loss: 0.742, accuracy: 0.786
epoch 3, iter: 287, loss: 0.698, accuracy: 0.804
epoch 4, iter: 287, loss: 0.663, accuracy: 0.816
epoch 5, iter: 287, loss: 0.659, accuracy: 0.822
epoch 6, iter: 287, loss: 0.724, accuracy: 0.806
epoch 7, iter: 287, loss: 0.707, accuracy: 0.828
epoch 8, iter: 287, loss: 0.727, accuracy: 0.827
epoch 9, iter: 287, loss: 0.760, accuracy: 0.821
epoch 10, iter: 287, loss: 0.786, accuracy: 0.828
epoch 11, iter: 287, loss: 0.866, accuracy: 0.820
epoch 12, iter: 287, loss: 0.975, accuracy: 0.816
epoch 13, iter: 287, loss: 0.945, accuracy: 0.821
epoch 14, iter: 287, loss: 0.987, accuracy: 0.822
epoch 15, iter: 287, loss: 1.047, accuracy: 0.825
epoch 16, iter: 287, loss: 1.093, accuracy: 0.822
epoch 17, iter: 287, loss: 1.122, accuracy: 0.825
epoch 18, iter: 287, loss: 1.137, accuracy: 0.823
epoch 19, iter: 287, loss: 1.187, accuracy: 0.826
epoch 20, iter: 287, loss

## Part 2: Saving and Reloading Model Weights
(25 points)

In this section you learn to save the weights of a trained model, and to load the weights of a saved model. This is really useful when we would like to load an already trained model in order to continue training or to fine-tune it. Often times we save “snapshots” of the trained model as training progresses in case the training is interrupted, or in case we would like to fall back to an earlier model, this is called snapshot saving.

### Q2.1 Defining another network
Define a network with a slightly different structure in `def cnn_expanded(x_)` below. `cnn_expanded` is an expanded version of `cnn_model`. 
It should have: 
- a different size of kernel for the last convolutional layer, 
- followed by one additional convolutional layer, and 
- followed by one additional pooling layer.

The last fully-connected layer will stay the same.

In [2]:
# Define the new model (see cnn_map(x_) above for an example)
def cnn_expanded(x_):
    conv1 = tf.layers.conv2d(
            inputs=x_,
            filters=32,  # number of filters
            kernel_size=[5, 5],
            padding="same",
            activation=tf.nn.relu,
            name='conv1')
    
    pool1 = tf.layers.max_pooling2d(inputs=conv1, 
                                    pool_size=[2, 2], 
                                    strides=2)
    
    conv2 = tf.layers.conv2d(
            inputs=pool1,
            filters=32, # number of filters
            kernel_size=[6, 6],
            padding="same",
            activation=tf.nn.relu,
            name='conv2')
    
    pool2 = tf.layers.max_pooling2d(inputs=conv2, 
                                    pool_size=[2, 2], 
                                    strides=2)
    
    conv3 = tf.layers.conv2d(
            inputs=pool2,
            filters=32, # number of filters
            kernel_size=[7, 7],
            padding="same",
            activation=tf.nn.relu,
            name='conv3')
    
    pool3 = tf.layers.max_pooling2d(inputs=conv3, 
                                    pool_size=[2, 2], 
                                    strides=2)
        
    pool_flat = tf.contrib.layers.flatten(pool3, scope='pool2flat')
    dense = tf.layers.dense(inputs=pool_flat, units=500, activation=tf.nn.relu)
    logits = tf.layers.dense(inputs=dense, units=10)
    return logits


### Q2.2 Saving and Loading Weights
`new_train_model()` below has two additional parameters `save_model=False, load_model=False` than `train_model` defined previously. Modify `new_train_model()` such that it would 
- save weights after the training is complete if `save_model` is `True`, and
- load weights on start-up before training if `load_model` is `True`.

*Hint:*  `tf.train.Saver()`.

Note: if you are unable to load weights into `cnn_expanded` network, use `cnn_map` in order to continue the assingment.

In [3]:
def new_train_model(model_dict, dataset_generators, epoch_n, print_every=287,
                    save_model=False, load_model=False):
    
    with model_dict['graph'].as_default(), tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
#         writer = tf.summary.FileWriter('./graph/', graph=tf.get_default_graph())
        saver = tf.train.Saver(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1'))
            
        if load_model:
            saver.restore(sess, 'checkpoints/checkpoint.ckpt')
            print('Model loaded')
        
        loss_tmp = 1000; acc_tmp = 0; patience_ct = 0
        for epoch_i in range(epoch_n):
            for iter_i, data_batch in enumerate(dataset_generators['train']):
                train_feed_dict = dict(zip(model_dict['inputs'], data_batch))
                sess.run(model_dict['train_op'], feed_dict=train_feed_dict)
                
                if iter_i % print_every == print_every-1:
                    collect_arr = []
                    for test_batch in dataset_generators['test']:
                        test_feed_dict = dict(zip(model_dict['inputs'], test_batch))
                        to_compute = [model_dict['loss'], model_dict['accuracy']]
                        collect_arr.append(sess.run(to_compute, test_feed_dict))
                    averages = np.mean(collect_arr, axis=0)
                    fmt = (epoch_i+1, print_every, ) + tuple(averages)
                    print('iteration {:d} {:d}\t loss: {:.3f}, '
                          'accuracy: {:.3f}'.format(*fmt))
                    
            # Early stopping with patience of 4 epoches
            if averages[0] > loss_tmp and averages[1] < acc_tmp:
                patience_ct += 1
                if patience_ct == 3:
                    print('Early stopping!'); break
            else: patience_ct = 0
            loss_tmp = averages[0]; acc_tmp = averages[1]
        
        if save_model:
            saver.save(sess, 'checkpoints/checkpoint.ckpt')
            print('Model saved')
    

def test_saving():
    model_dict = apply_classification_loss(cnn_map)
    new_train_model(model_dict, dataset_generators, epoch_n=100, print_every=287, save_model=True)
    cnn_expanded_dict = apply_classification_loss(cnn_expanded)
    new_train_model(cnn_expanded_dict, dataset_generators, epoch_n=10,print_every=287, load_model=True)

In [12]:
# Early stopping with patience of 4 epoches
test_saving()

iteration 1 287	 loss: 2.179, accuracy: 0.226
iteration 2 287	 loss: 1.197, accuracy: 0.629
iteration 3 287	 loss: 0.963, accuracy: 0.716
iteration 4 287	 loss: 0.872, accuracy: 0.747
iteration 5 287	 loss: 0.863, accuracy: 0.753
iteration 6 287	 loss: 0.851, accuracy: 0.758
iteration 7 287	 loss: 0.814, accuracy: 0.772
iteration 8 287	 loss: 0.796, accuracy: 0.780
iteration 9 287	 loss: 0.821, accuracy: 0.778
iteration 10 287	 loss: 0.834, accuracy: 0.785
iteration 11 287	 loss: 0.867, accuracy: 0.785
iteration 12 287	 loss: 0.882, accuracy: 0.781
iteration 13 287	 loss: 0.881, accuracy: 0.791
iteration 14 287	 loss: 0.921, accuracy: 0.798
iteration 15 287	 loss: 0.944, accuracy: 0.799
iteration 16 287	 loss: 0.961, accuracy: 0.802
iteration 17 287	 loss: 1.137, accuracy: 0.785
iteration 18 287	 loss: 1.111, accuracy: 0.788
iteration 19 287	 loss: 1.104, accuracy: 0.794
iteration 20 287	 loss: 1.010, accuracy: 0.811
iteration 21 287	 loss: 1.094, accuracy: 0.813
iteration 22 287	 loss

## Part 3: Fine-tuning a Pre-trained Network on CIFAR-10
(20 points)

[CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) is another popular benchmark for image classification.
We provide you with modified verstion of the file cifar10.py from [https://github.com/Hvass-Labs/TensorFlow-Tutorials](https://github.com/Hvass-Labs/TensorFlow-Tutorials).


In [5]:
import read_cifar10 as cf10

We also provide a generator for the CIFAR-10 Dataset, yielding the next batch every time next is invoked.

In [6]:
@read_data.restartable
def cifar10_dataset_generator(dataset_name, batch_size, restrict_size=1000):
    assert dataset_name in ['train', 'test']
    assert batch_size > 0 or batch_size == -1  # -1 for entire dataset
    
    X_all_unrestricted, y_all = (cf10.load_training_data() if dataset_name == 'train'
                                 else cf10.load_test_data())
    
    actual_restrict_size = restrict_size if dataset_name == 'train' else int(1e10)
    X_all = X_all_unrestricted[:actual_restrict_size]
    data_len = X_all.shape[0]
    batch_size = batch_size if batch_size > 0 else data_len
    
    X_all_padded = np.concatenate([X_all, X_all[:batch_size]], axis=0)
    y_all_padded = np.concatenate([y_all, y_all[:batch_size]], axis=0)
    
    for slice_i in range(math.ceil(data_len / batch_size)):
        idx = slice_i * batch_size
        X_batch = X_all_padded[idx:idx + batch_size]*255
        y_batch = np.ravel(y_all_padded[idx:idx + batch_size])
        yield X_batch.astype(np.uint8), y_batch.astype(np.uint8)

cifar10_dataset_generators = {
    'train': cifar10_dataset_generator('train', 1000),
    'test': cifar10_dataset_generator('test', -1)
}


### Q3.1 Fine-tuning
Let's fine-tune SVHN net on **1000 examples** from CIFAR-10. 
Compare test accuracies of the following scenarios: 
  - Training `cnn_map` from scratch on the 1000 CIFAR-10 examples
  - Fine-tuning SVHN net (`cnn_map` trained on SVHN dataset) on 1000 exampes from CIFAR-10. Use `new_train_model()` defined above to load SVHN net weights, but train on the CIFAR-10 examples.
  
**Important:** please do not change the `restrict_size=1000` parameter.

In [9]:
## train Cifar-10 from scratch and save first conv layer's weight
cnn_expanded_dict = apply_classification_loss(cnn_map)
new_train_model(cnn_expanded_dict, cifar10_dataset_generators, epoch_n=100, print_every=1, save_model=True)

iteration 1 1	 loss: 41.097, accuracy: 0.101
iteration 2 1	 loss: 31.596, accuracy: 0.105
iteration 3 1	 loss: 22.548, accuracy: 0.122
iteration 4 1	 loss: 13.890, accuracy: 0.102
iteration 5 1	 loss: 7.505, accuracy: 0.101
iteration 6 1	 loss: 5.358, accuracy: 0.100
iteration 7 1	 loss: 4.123, accuracy: 0.108
iteration 8 1	 loss: 3.340, accuracy: 0.104
iteration 9 1	 loss: 2.878, accuracy: 0.101
iteration 10 1	 loss: 2.561, accuracy: 0.105
iteration 11 1	 loss: 2.386, accuracy: 0.116
iteration 12 1	 loss: 2.327, accuracy: 0.123
iteration 13 1	 loss: 2.306, accuracy: 0.125
iteration 14 1	 loss: 2.288, accuracy: 0.133
iteration 15 1	 loss: 2.281, accuracy: 0.139
iteration 16 1	 loss: 2.278, accuracy: 0.142
iteration 17 1	 loss: 2.265, accuracy: 0.146
iteration 18 1	 loss: 2.251, accuracy: 0.152
iteration 19 1	 loss: 2.239, accuracy: 0.161
iteration 20 1	 loss: 2.228, accuracy: 0.165
iteration 21 1	 loss: 2.208, accuracy: 0.180
iteration 22 1	 loss: 2.186, accuracy: 0.193
iteration 23 1	

In [16]:
# fine-tuning SVHN Net using Cifar-10 weights saved above
new_train_model(cnn_expanded_dict, dataset_generators, epoch_n=100, print_every=287, load_model=True)

Model loaded
iteration 1 287	 loss: 0.664, accuracy: 0.810
iteration 2 287	 loss: 0.538, accuracy: 0.851
iteration 3 287	 loss: 0.494, accuracy: 0.864
iteration 4 287	 loss: 0.475, accuracy: 0.870
iteration 5 287	 loss: 0.468, accuracy: 0.879
iteration 6 287	 loss: 0.483, accuracy: 0.878
iteration 7 287	 loss: 0.529, accuracy: 0.874
iteration 8 287	 loss: 0.515, accuracy: 0.878
iteration 9 287	 loss: 0.521, accuracy: 0.885
iteration 10 287	 loss: 0.596, accuracy: 0.870
iteration 11 287	 loss: 0.631, accuracy: 0.873
iteration 12 287	 loss: 0.642, accuracy: 0.870
iteration 13 287	 loss: 0.669, accuracy: 0.871
iteration 14 287	 loss: 0.694, accuracy: 0.876
iteration 15 287	 loss: 0.686, accuracy: 0.883
iteration 16 287	 loss: 0.727, accuracy: 0.877
iteration 17 287	 loss: 0.820, accuracy: 0.879
iteration 18 287	 loss: 0.770, accuracy: 0.886
iteration 19 287	 loss: 0.787, accuracy: 0.886
iteration 20 287	 loss: 0.888, accuracy: 0.881
iteration 21 287	 loss: 0.863, accuracy: 0.883
iteration

In [12]:
# fine-tuning Cifar-10 using Cifar-10 weights saved above
new_train_model(cnn_expanded_dict, cifar10_dataset_generators, epoch_n=100, print_every=1, load_model=True)

Model loaded
iteration 1 1	 loss: 14.328, accuracy: 0.115
iteration 2 1	 loss: 10.007, accuracy: 0.115
iteration 3 1	 loss: 7.000, accuracy: 0.117
iteration 4 1	 loss: 4.704, accuracy: 0.157
iteration 5 1	 loss: 3.532, accuracy: 0.147
iteration 6 1	 loss: 2.832, accuracy: 0.144
iteration 7 1	 loss: 2.485, accuracy: 0.151
iteration 8 1	 loss: 2.349, accuracy: 0.149
iteration 9 1	 loss: 2.305, accuracy: 0.135
iteration 10 1	 loss: 2.290, accuracy: 0.127
iteration 11 1	 loss: 2.278, accuracy: 0.128
iteration 12 1	 loss: 2.267, accuracy: 0.128
iteration 13 1	 loss: 2.255, accuracy: 0.132
iteration 14 1	 loss: 2.243, accuracy: 0.137
iteration 15 1	 loss: 2.228, accuracy: 0.148
iteration 16 1	 loss: 2.213, accuracy: 0.160
iteration 17 1	 loss: 2.198, accuracy: 0.171
iteration 18 1	 loss: 2.176, accuracy: 0.195
iteration 19 1	 loss: 2.149, accuracy: 0.221
iteration 20 1	 loss: 2.120, accuracy: 0.235
iteration 21 1	 loss: 2.089, accuracy: 0.249
iteration 22 1	 loss: 2.056, accuracy: 0.263
iter

In [15]:
# fine-tuning Cifar-10 using SVHN's pretrain weights from Q2
new_train_model(cnn_expanded_dict, cifar10_dataset_generators, epoch_n=100, print_every=1, load_model=True)

Model loaded
iteration 1 1	 loss: 45.584, accuracy: 0.127
iteration 2 1	 loss: 45.845, accuracy: 0.107
iteration 3 1	 loss: 44.642, accuracy: 0.111
iteration 4 1	 loss: 31.459, accuracy: 0.113
iteration 5 1	 loss: 19.414, accuracy: 0.132
iteration 6 1	 loss: 11.665, accuracy: 0.130
iteration 7 1	 loss: 6.654, accuracy: 0.126
iteration 8 1	 loss: 4.165, accuracy: 0.120
iteration 9 1	 loss: 3.077, accuracy: 0.114
iteration 10 1	 loss: 2.629, accuracy: 0.111
iteration 11 1	 loss: 2.448, accuracy: 0.107
iteration 12 1	 loss: 2.374, accuracy: 0.101
iteration 13 1	 loss: 2.341, accuracy: 0.101
iteration 14 1	 loss: 2.326, accuracy: 0.103
iteration 15 1	 loss: 2.318, accuracy: 0.105
iteration 16 1	 loss: 2.313, accuracy: 0.107
iteration 17 1	 loss: 2.309, accuracy: 0.108
iteration 18 1	 loss: 2.307, accuracy: 0.112
iteration 19 1	 loss: 2.306, accuracy: 0.109
iteration 20 1	 loss: 2.304, accuracy: 0.108
iteration 21 1	 loss: 2.303, accuracy: 0.109
iteration 22 1	 loss: 2.302, accuracy: 0.111


| Pretraining on | Fine tunning on | Accuracy |
|----------------|----------------|----------|
| SVHN | SVHN | 0.874 |
| Cifar-10 | SVHN | 0.868 |
| Cifar-10 | Cifar-10 | 0.375 |
| SVHN | Cifar-10 | 0.241 |

__Since SVHN and Cifar-10 are very different image set, fine tunning either of them using the pretrained weight from the other actually lower the accuaracy. However, using Cifar-10 pretrained first layer to initialize Cifer-10 network helps narrowing down the search direction of the network, therefore, the much better accuracy is achieved.__

## Part 4: TensorBoard
(30 points)

[TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) is a very helpful tool for visualization of neural networks. 

### Q4.1 Plotting
Present at least one visualization for each of the following:
  - Filters
  - Loss
  - Accuracy

Modify code you have wrote above to also have summary writers. To  run tensorboard, the command is

tensorboard --logdir=./graph --port 6006

In [5]:
''' 
Thsi code is from: 
https://gist.github.com/kukuruza/03731dc494603ceab0c5
'''

from math import sqrt

def put_kernels_on_grid (kernel, pad = 2):

    '''Visualize conv. features as an image (mostly for the 1st layer).
    Place kernel into a grid, with some paddings between adjacent filters.
    Args:
      kernel:            tensor of shape [Y, X, NumChannels, NumKernels]
      (grid_Y, grid_X):  shape of the grid. Require: NumKernels == grid_Y * grid_X
                           User is responsible of how to break into two multiples.
      pad:               number of black pixels around each filter (between them)
    Return:
      Tensor of shape [(Y+2*pad)*grid_Y, (X+2*pad)*grid_X, NumChannels, 1].
    '''
    # get shape of the grid. NumKernels == grid_Y * grid_X
    def factorization(n):
        for i in range(int(sqrt(float(n))), 0, -1):
            if n % i == 0:
                if i == 1: print('Who would enter a prime number of filters')
                return (i, int(n / i))
    (grid_Y, grid_X) = factorization (kernel.get_shape()[3].value)
    print ('grid: %d = (%d, %d)' % (kernel.get_shape()[3].value, grid_Y, grid_X))

    x_min = tf.reduce_min(kernel)
    x_max = tf.reduce_max(kernel)

    kernel1 = (kernel - x_min) / (x_max - x_min)

    # pad X and Y
    x1 = tf.pad(kernel1, tf.constant( [[pad,pad],[pad, pad],[0,0],[0,0]] ), mode = 'CONSTANT')

    # X and Y dimensions, w.r.t. padding
    Y = kernel1.get_shape()[0] + 2 * pad
    X = kernel1.get_shape()[1] + 2 * pad

    channels = kernel1.get_shape()[2]

    # put NumKernels to the 1st dimension
    x2 = tf.transpose(x1, (3, 0, 1, 2))
    # organize grid on Y axis
    x3 = tf.reshape(x2, tf.stack([grid_X, Y * grid_Y, X, channels]))

    # switch X and Y axes
    x4 = tf.transpose(x3, (0, 2, 1, 3))
    # organize grid on X axis
    x5 = tf.reshape(x4, tf.stack([1, X * grid_X, Y * grid_Y, channels]))

    # back to normal order (not combining with the next step for clarity)
    x6 = tf.transpose(x5, (2, 1, 3, 0))

    # to tf.image_summary order [batch_size, height, width, channels],
    #   where in this case batch_size == 1
    x7 = tf.transpose(x6, (3, 0, 1, 2))

    # scaling to [0, 255] is not necessary for tensorboard
    return x7

In [6]:
def visualize(dataset_generators, epoch_n, print_every=287):
    model_dict = apply_classification_loss(cnn_map)
    
    with model_dict['graph'].as_default(), tf.Session() as sess:
        # Define sumaries
        tf.summary.scalar('accuracy', model_dict['accuracy'])
        tf.summary.scalar('loss', model_dict['loss'])
        
        kernel = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[0]
        grid = put_kernels_on_grid (kernel)
        tf.summary.image('conv1/features', grid, max_outputs=1)
    
        tf.summary.histogram('filter_weight', tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[0])
        tf.summary.histogram('filter_bias', tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[1])
        merged = tf.summary.merge_all()
        # Initial writer
        train_writer = tf.summary.FileWriter('./graph' + '/train',
                                      sess.graph)
        test_writer = tf.summary.FileWriter('./graph' + '/test')
        
        sess.run(tf.global_variables_initializer())
        
        loss_tmp = 1000; acc_tmp = 0.0
        for epoch_i in range(epoch_n):
            for iter_i, data_batch in enumerate(dataset_generators['train']):
                train_feed_dict = dict(zip(model_dict['inputs'], data_batch))
                sess.run([model_dict['train_op'], ], feed_dict=train_feed_dict)
                
                # Vistualize training
                summary_train = sess.run(merged, feed_dict=train_feed_dict)
                train_writer.add_summary(summary_train, epoch_i) 
                
                if iter_i % print_every == print_every-1:
                    collect_arr = []
                    for test_batch in dataset_generators['test']:
                        test_feed_dict = dict(zip(model_dict['inputs'], test_batch))
                        to_compute = [model_dict['loss'], model_dict['accuracy']]
                        collect_arr.append(sess.run(to_compute, test_feed_dict)) 
                        
                        # Vistualize testing
                        summary_test = sess.run(merged, feed_dict=test_feed_dict)
                        test_writer.add_summary(summary_test, epoch_i)
                        
                    averages = np.mean(collect_arr, axis=0)
                    fmt = (epoch_i+1, print_every) + tuple(averages)
                    print('epoch {:d}, iter: {:d}, loss: {:.3f}, '
                          'accuracy: {:.3f}'.format(*fmt))

In [7]:
visualize(dataset_generators, epoch_n=10)

grid: 32 = (4, 8)
epoch 1, iter: 287, loss: 0.880, accuracy: 0.741
epoch 2, iter: 287, loss: 0.806, accuracy: 0.769
epoch 3, iter: 287, loss: 0.750, accuracy: 0.792
epoch 4, iter: 287, loss: 0.700, accuracy: 0.808
epoch 5, iter: 287, loss: 0.748, accuracy: 0.798
epoch 6, iter: 287, loss: 0.793, accuracy: 0.796
epoch 7, iter: 287, loss: 0.815, accuracy: 0.801
epoch 8, iter: 287, loss: 0.847, accuracy: 0.797
epoch 9, iter: 287, loss: 0.903, accuracy: 0.783
epoch 10, iter: 287, loss: 0.847, accuracy: 0.802


__Loss and accuracy:__

<img width='600px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vanila%20loss%20accuracy.jpeg">

__Graph:__

<img width='800px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vanilla%20graph.jpeg">

__Conv1 image:__

<img width='800px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vanilla%20image.jpeg">

__Weights and bias histogram:__

<img width='400px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vanilla%20hist.jpeg">

__Weights and bias distribution:__

<img width='400px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vanilla%20dist.jpeg">


__The result shows significant overfitting after only about 6 epoch. Weight change is not much in conv 1, but the bias is increasing.__

## Part 5: Bonus
(20 points)

__I tried both RESNet-like and VGGNet-like architecture, and use SVHN pretraining from question 2 to narrow down the search direction of the network, giving it a kick start.__

### 5.1 shallow non-bottle-necked ResNet

<img width='400px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/res%20graph%201.jpeg">

<img width='400px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/res%20graph%202.jpeg">

In [40]:
def cnn_map_ResNet(x_, keep_prob, reg, res_layers, dense_layers):
    conv1 = tf.layers.conv2d(
            inputs=x_,
            filters=32,
            kernel_size=[5, 5],
            padding="same",
            kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
            activation=tf.nn.relu,
            name='conv1')
    
    maxpool = tf.layers.max_pooling2d(inputs=conv1, 
                                    pool_size=[2, 2], 
                                    strides=2)
    
    # Block 1
    with tf.variable_scope('block1'): 
        conv2 = tf.layers.conv2d(
                inputs=maxpool,
                filters=res_layers[0],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv3 = tf.layers.conv2d(
                inputs=conv2,
                filters=res_layers[0],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv4 = tf.layers.conv2d(
                inputs=conv3,
                filters=res_layers[0],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine1 = tf.nn.relu(tf.add(conv4, conv2))

        conv5 = tf.layers.conv2d(
                inputs=conv4,
                filters=res_layers[0],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv6 = tf.layers.conv2d(
                inputs=conv5,
                filters=res_layers[0],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine2 = tf.nn.relu(tf.add(conv6, combine1))
    
    # Block 2
    with tf.variable_scope('block2'):
        conv7 = tf.layers.conv2d(
                inputs=combine2,
                filters=res_layers[1],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv8 = tf.layers.conv2d(
                inputs=conv7,
                filters=res_layers[1],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv9 = tf.layers.conv2d(
                inputs=conv8,
                filters=res_layers[1],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine3 = tf.nn.relu(tf.add(conv7, conv9))

        conv10 = tf.layers.conv2d(
                inputs=combine3,
                filters=res_layers[1],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv11 = tf.layers.conv2d(
                inputs=conv10,
                filters=res_layers[1],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine4 = tf.nn.relu(tf.add(conv11, combine3))
    
    # Block 3
    with tf.variable_scope('block3'):
        conv12 = tf.layers.conv2d(
                inputs=combine4,
                filters=res_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv13 = tf.layers.conv2d(
                inputs=conv12,
                filters=res_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv14 = tf.layers.conv2d(
                inputs=conv13,
                filters=res_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine5 = tf.nn.relu(tf.add(conv14, conv12))

        conv15 = tf.layers.conv2d(
                inputs=combine5,
                filters=res_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv16 = tf.layers.conv2d(
                inputs=conv15,
                filters=res_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine6 = tf.nn.relu(tf.add(conv16, combine5))

        conv17 = tf.layers.conv2d(
                inputs=combine6,
                filters=res_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv18 = tf.layers.conv2d(
                inputs=conv17,
                filters=res_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine7 = tf.nn.relu(tf.add(conv18, combine6))
    
    # Block 4
    with tf.variable_scope('block4'):
        conv19 = tf.layers.conv2d(
                inputs=combine7,
                filters=res_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv20 = tf.layers.conv2d(
                inputs=conv19,
                filters=res_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv21 = tf.layers.conv2d(
                inputs=conv20,
                filters=res_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine8 = tf.nn.relu(tf.add(conv19, conv21))

        conv22 = tf.layers.conv2d(
                inputs=combine8,
                filters=res_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv23 = tf.layers.conv2d(
                inputs=conv22,
                filters=res_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None)

        combine9 = tf.nn.relu(tf.add(conv23, combine8))

        conv24 = tf.layers.conv2d(
                inputs=combine9,
                filters=res_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv25 = tf.layers.conv2d(
                inputs=conv24,
                filters=res_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=None,
                name='final_layer')

        combine10 = tf.nn.relu(tf.add(conv25, combine9))
    
    avepool = tf.layers.average_pooling2d(inputs=combine10, 
                                       pool_size=[2, 2], 
                                       strides=2)
        
    pool_flat = tf.contrib.layers.flatten(avepool, scope='pool2flat')
    dense1 = tf.layers.dense(inputs=pool_flat, units=dense_layers, 
                             kernel_regularizer=tf.contrib.layers.l2_regularizer(reg), activation=tf.nn.relu)
    dense1 = tf.nn.dropout(dense1, keep_prob, seed=2)
    logits = tf.layers.dense(inputs=dense1, units=10)
    return logits


def apply_classification_loss_ResNet(model_function, reg, res_layers, dense_layers, lr, epsi):
    with tf.Graph().as_default() as g:
        with tf.device("/gpu:0"):
            x_ = tf.placeholder(tf.float32, [None, 32, 32, 3])
            y_ = tf.placeholder(tf.int32, [None])
            keep_prob = tf.placeholder(tf.float32)
            
            y_logits = model_function(x_, keep_prob, reg, res_layers, dense_layers)
            
            y_dict = dict(labels=y_, logits=y_logits)
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(**y_dict)
            cross_entropy_loss = tf.reduce_mean(losses)
            trainer = tf.train.AdamOptimizer(learning_rate=lr, beta1=0.9, beta2=0.999, epsilon=epsi)
            train_op = trainer.minimize(cross_entropy_loss)
            
            y_pred = tf.argmax(tf.nn.softmax(y_logits), dimension=1)
            correct_prediction = tf.equal(tf.cast(y_pred, tf.int32), y_)
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name = 'accuracy')
    
    model_dict = {'graph': g, 'inputs': [x_, y_, keep_prob], 
                  'train_op': train_op, 'accuracy': accuracy, 'loss': cross_entropy_loss}
    
    return model_dict

def train_model_ResNet(model_dict, dataset_generators, epoch_n, 
                         keep_prob, print_every=287,save_model=False, load_model=False):
    
    with model_dict['graph'].as_default(), tf.Session() as sess:
        # Define sumaries
        tf.summary.scalar('accuracy', model_dict['accuracy'])
        tf.summary.scalar('loss', model_dict['loss'])
        
        kernel = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[0]
        grid = put_kernels_on_grid (kernel)
        tf.summary.image('conv1/features', grid, max_outputs=1)
    
        tf.summary.histogram('filter_weight', tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[0])
        tf.summary.histogram('filter_bias', tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[1])
        merged = tf.summary.merge_all()
        # Initial writer
        train_writer = tf.summary.FileWriter('./graph' + '/train',
                                      sess.graph)
        test_writer = tf.summary.FileWriter('./graph' + '/test')
    
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1'))
            
        if load_model:
            saver.restore(sess, 'checkpoints/checkpoint.ckpt')
            print('Model loaded')
        
        loss_tmp = 1000; acc_tmp = 0; patience_ct = 0
        for epoch_i in range(epoch_n):
            for iter_i, data_batch in enumerate(dataset_generators['train']):
                train_feed_dict = dict(zip(model_dict['inputs'][:2], data_batch))
                train_feed_dict[model_dict['inputs'][2]] = keep_prob
                sess.run(model_dict['train_op'], feed_dict=train_feed_dict)
                
                # Vistualize training
                summary_train = sess.run(merged, feed_dict=train_feed_dict)
                train_writer.add_summary(summary_train, epoch_i) 
                
                if iter_i % print_every == print_every-1:
                    collect_arr = []
                    for test_batch in dataset_generators['test']:
                        test_feed_dict = dict(zip(model_dict['inputs'][:2], test_batch))
                        test_feed_dict[model_dict['inputs'][2]] = 1
                        to_compute = [model_dict['loss'], model_dict['accuracy']]
                        collect_arr.append(sess.run(to_compute, test_feed_dict)) 
                        
                        # Vistualize testing
                        summary_test = sess.run(merged, feed_dict=test_feed_dict)
                        test_writer.add_summary(summary_test, epoch_i)
                        
                    averages = np.mean(collect_arr, axis=0)
                    fmt = (epoch_i+1, print_every, ) + tuple(averages)
                    print('iteration {:d} {:d}\t loss: {:.3f}, '
                          'accuracy: {:.3f}'.format(*fmt))
                    
            # Early stopping with patience of 3 epoches
            if averages[1] < acc_tmp:
                patience_ct += 1
                if patience_ct == 2:
                    print('Early stopping!'); break
            else: patience_ct = 0
            loss_tmp = averages[0]; acc_tmp = averages[1]

def SVHN_plusplus_ResNet(keep_prob, reg, res_layers, dense_layers, lr, epsi):
    model_dict = apply_classification_loss_ResNet(cnn_map_ResNet, reg, res_layers, dense_layers, lr, epsi)
    train_model_ResNet(model_dict, dataset_generators, 30, # set to 5 for tunning 
                         keep_prob, load_model=True)

In [46]:
SVHN_plusplus_ResNet(0.5, # dropout
                     0.0005, # L2-regularizer
                     [64,96,128,160], # Resnet filter size
                     400, # dense layer width
                     0.0005, # learning rate
                     1e-8) # Adam epsilon

grid: 32 = (4, 8)
Model loaded
iteration 1 287	 loss: 0.594, accuracy: 0.822
iteration 2 287	 loss: 0.441, accuracy: 0.869
iteration 3 287	 loss: 0.340, accuracy: 0.904
iteration 4 287	 loss: 0.289, accuracy: 0.918
iteration 5 287	 loss: 0.273, accuracy: 0.925
iteration 6 287	 loss: 0.264, accuracy: 0.931
iteration 7 287	 loss: 0.261, accuracy: 0.935
iteration 8 287	 loss: 0.281, accuracy: 0.933
iteration 9 287	 loss: 0.285, accuracy: 0.935
iteration 10 287	 loss: 0.260, accuracy: 0.938
iteration 11 287	 loss: 0.303, accuracy: 0.933
iteration 12 287	 loss: 0.298, accuracy: 0.935
iteration 13 287	 loss: 0.301, accuracy: 0.939
iteration 14 287	 loss: 0.294, accuracy: 0.936
iteration 15 287	 loss: 0.327, accuracy: 0.937
iteration 16 287	 loss: 0.329, accuracy: 0.939
iteration 17 287	 loss: 0.327, accuracy: 0.938
iteration 18 287	 loss: 0.360, accuracy: 0.936
Early stopping!


__Even though it has much higher accuray than 3-layer cnn, but it was very time consuming to train. And overfitting might prevent it to learn effectively. However, it has great potential if tuned right, as proven in the paper
 " Deep residual learning for image recognition".__
 
<img width='500px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/res%20loss%20accuracy.jpeg">

<img width='600px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/res%20image.jpeg">

<img width='300px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/res%20hist.jpeg">

### 5.2 shallow VGGNet

<img width='400px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vgg%20graph%201.jpeg">

<img width='400px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vgg%20graph%202.jpeg">

In [34]:
def cnn_map_VGGNet(x_, keep_prob, reg, vgg_layers, dense_layers):
    # Block 1
    conv1 = tf.layers.conv2d(
            inputs=x_,
            filters=32,
            kernel_size=[5, 5],
            padding="same",
            kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
            activation=tf.nn.relu,
            name='conv1')

    conv2 = tf.layers.conv2d(
            inputs=conv1,
            filters=32,
            kernel_size=[3, 3],
            padding="same",
            kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
            activation=tf.nn.relu)

    pool1 = tf.layers.max_pooling2d(inputs=conv2, 
                                    pool_size=[2, 2], 
                                    strides=2)
    
    # Block 2
    with tf.variable_scope('block2'):
        conv3 = tf.layers.conv2d(
                inputs=pool1,
                filters=vgg_layers[0],
                kernel_size=[5, 5],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv4 = tf.layers.conv2d(
                inputs=conv3,
                filters=vgg_layers[0],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        pool2 = tf.layers.max_pooling2d(inputs=conv4, 
                                        pool_size=[2, 2], 
                                        strides=2)
    
    # Block 3
    with tf.variable_scope('block3'):
        conv5 = tf.layers.conv2d(
                inputs=pool2,
                filters=vgg_layers[1],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv6 = tf.layers.conv2d(
                inputs=conv5,
                filters=vgg_layers[1],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv7 = tf.layers.conv2d(
                inputs=conv6,
                filters=vgg_layers[1],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        pool3 = tf.layers.max_pooling2d(inputs=conv7, 
                                        pool_size=[2, 2], 
                                        strides=2)
    
    # Block 4
    with tf.variable_scope('block4'):
        conv8 = tf.layers.conv2d(
                inputs=pool3,
                filters=vgg_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv9 = tf.layers.conv2d(
                inputs=conv8,
                filters=vgg_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv10 = tf.layers.conv2d(
                inputs=conv9,
                filters=vgg_layers[2],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        pool4 = tf.layers.max_pooling2d(inputs=conv10, 
                                        pool_size=[2, 2], 
                                        strides=2)
    
    # Block 5
    with tf.variable_scope('block5'):
        conv11 = tf.layers.conv2d(
                inputs=pool4,
                filters=vgg_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv12 = tf.layers.conv2d(
                inputs=conv11,
                filters=vgg_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu)

        conv13 = tf.layers.conv2d(
                inputs=conv12,
                filters=vgg_layers[3],
                kernel_size=[3, 3],
                padding="same",
                kernel_regularizer = tf.contrib.layers.l2_regularizer(reg),
                activation=tf.nn.relu,
                name='final_layer')

        pool5 = tf.layers.max_pooling2d(inputs=conv13, 
                                        pool_size=[2, 2], 
                                        strides=2)
        
    pool_flat = tf.contrib.layers.flatten(pool5, scope='pool2flat')
    dense1 = tf.layers.dense(inputs=pool_flat, units=dense_layers[0], 
                             kernel_initializer=tf.random_uniform_initializer(minval=-6*np.sqrt(1.0/(2000+dense_layers[0])), 
                                                                             maxval=6*np.sqrt(1.0/(2000+dense_layers[0])), seed=21), 
                             bias_initializer=tf.truncated_normal_initializer(mean=0.1, stddev=1e-4),
                             kernel_regularizer=tf.contrib.layers.l2_regularizer(reg), activation=tf.nn.relu)
    dense1 = tf.nn.dropout(dense1, keep_prob, seed=2)
    dense2 = tf.layers.dense(inputs=dense1, units=dense_layers[1], 
                             kernel_initializer=tf.random_uniform_initializer(minval=-6*np.sqrt(1.0/(dense_layers[1]+dense_layers[0])), 
                                                                             maxval=6*np.sqrt(1.0/(dense_layers[1]+dense_layers[0])), seed=11),
                             bias_initializer=tf.truncated_normal_initializer(mean=0.1, stddev=1e-4),
                             kernel_regularizer=tf.contrib.layers.l2_regularizer(reg), activation=tf.nn.relu)
    dense3 = tf.layers.dense(inputs=dense2, units=dense_layers[2], 
                             kernel_initializer=tf.random_uniform_initializer(minval=-6*np.sqrt(1.0/(dense_layers[1]+dense_layers[2])), 
                                                                             maxval=6*np.sqrt(1.0/(dense_layers[1]+dense_layers[2])), seed=1),
                             kernel_regularizer=tf.contrib.layers.l2_regularizer(reg), activation=tf.nn.relu)
    logits = tf.layers.dense(inputs=dense3, units=10)
    return logits


def apply_classification_loss_VGGNet(model_function, reg, vgg_layers, dense_layers, lr, epsi):
    with tf.Graph().as_default() as g:
        with tf.device("/gpu:0"):
            x_ = tf.placeholder(tf.float32, [None, 32, 32, 3])
            y_ = tf.placeholder(tf.int32, [None])
            keep_prob = tf.placeholder(tf.float32)
            y_logits = model_function(x_, keep_prob, reg, vgg_layers, dense_layers)
            
            y_dict = dict(labels=y_, logits=y_logits)
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(**y_dict)
            cross_entropy_loss = tf.reduce_mean(losses)
            trainer = tf.train.AdamOptimizer(learning_rate=lr, beta1=0.9, beta2=0.999, epsilon=epsi)
            train_op = trainer.minimize(cross_entropy_loss)
            
            y_pred = tf.argmax(tf.nn.softmax(y_logits), dimension=1)
            correct_prediction = tf.equal(tf.cast(y_pred, tf.int32), y_)
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name = 'accuracy')
    
    model_dict = {'graph': g, 'inputs': [x_, y_, keep_prob], 'train_op': train_op,
                  'accuracy': accuracy, 'loss': cross_entropy_loss}
    
    return model_dict

def train_model_VGGNet(model_dict, dataset_generators, epoch_n, keep_prob, print_every=287,
                    save_model=False, load_model=False):
    
    with model_dict['graph'].as_default(), tf.Session() as sess:
        # Define sumaries
        tf.summary.scalar('accuracy', model_dict['accuracy'])
        tf.summary.scalar('loss', model_dict['loss'])
        
        kernel = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[0]
        grid = put_kernels_on_grid (kernel)
        tf.summary.image('conv1/features', grid, max_outputs=1)
    
        tf.summary.histogram('filter_weight', tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[0])
        tf.summary.histogram('filter_bias', tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')[1])
        merged = tf.summary.merge_all()
        # Initial writer
        train_writer = tf.summary.FileWriter('./graph' + '/train',
                                      sess.graph)
        test_writer = tf.summary.FileWriter('./graph' + '/test')
        
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1'))
            
        if load_model:
            saver.restore(sess, 'checkpoints/checkpoint.ckpt')
            print('Model loaded')
        
        loss_tmp = 1000; acc_tmp = 0; patience_ct = 0
        for epoch_i in range(epoch_n):
            for iter_i, data_batch in enumerate(dataset_generators['train']):
                train_feed_dict = dict(zip(model_dict['inputs'][:2], data_batch))
                train_feed_dict[model_dict['inputs'][2]] = keep_prob
                sess.run(model_dict['train_op'], feed_dict=train_feed_dict)
                
                # Vistualize training
                summary_train = sess.run(merged, feed_dict=train_feed_dict)
                train_writer.add_summary(summary_train, epoch_i) 
                
                if iter_i % print_every == print_every-1:
                    collect_arr = []
                    for test_batch in dataset_generators['test']:
                        test_feed_dict = dict(zip(model_dict['inputs'][:2], test_batch))
                        test_feed_dict[model_dict['inputs'][2]] = 1
                        to_compute = [model_dict['loss'], model_dict['accuracy']]
                        collect_arr.append(sess.run(to_compute, test_feed_dict)) 
                        
                        # Vistualize testing
                        summary_test = sess.run(merged, feed_dict=test_feed_dict)
                        test_writer.add_summary(summary_test, epoch_i)
                        
                    averages = np.mean(collect_arr, axis=0)
                    fmt = (epoch_i+1, print_every, ) + tuple(averages)
                    print('iteration {:d} {:d}\t loss: {:.3f}, '
                          'accuracy: {:.3f}'.format(*fmt))
                    
            # Early stopping with patience of 2 epoches
            if averages[1] < acc_tmp:
                patience_ct += 1
                if patience_ct == 2:
                    print('Early stopping!'); break
            else: patience_ct = 0
            loss_tmp = averages[0]; acc_tmp = averages[1]

def SVHN_plusplus_VGGNet(keep_prob, reg, vgg_layers, dense_layers, lr, epsi):
    model_dict = apply_classification_loss_VGGNet(cnn_map_VGGNet, reg, vgg_layers, dense_layers, lr, epsi)
    train_model_VGGNet(model_dict, dataset_generators, 50, keep_prob,
                       print_every=287, save_model=False, load_model=True)

In [37]:
SVHN_plusplus_VGGNet(0.7, # dropout
                     0.005, # L2-regularizer
                     [64,128,256,512], # # VGGnet blocks' filter size
                     [1000,500,100], # 3 dense layers' width
                     0.0006, # learning rate
                     1e-8) # Adam epsilon

grid: 32 = (4, 8)
Model loaded
iteration 1 287	 loss: 0.504, accuracy: 0.849
iteration 2 287	 loss: 0.381, accuracy: 0.895
iteration 3 287	 loss: 0.348, accuracy: 0.904
iteration 4 287	 loss: 0.304, accuracy: 0.920
iteration 5 287	 loss: 0.326, accuracy: 0.911
iteration 6 287	 loss: 0.302, accuracy: 0.920
iteration 7 287	 loss: 0.308, accuracy: 0.920
iteration 8 287	 loss: 0.277, accuracy: 0.927
iteration 9 287	 loss: 0.288, accuracy: 0.928
iteration 10 287	 loss: 0.315, accuracy: 0.921
iteration 11 287	 loss: 0.302, accuracy: 0.930
iteration 12 287	 loss: 0.320, accuracy: 0.927
iteration 13 287	 loss: 0.316, accuracy: 0.933
iteration 14 287	 loss: 0.345, accuracy: 0.930
iteration 15 287	 loss: 0.337, accuracy: 0.929
Early stopping!


__This VGGNet was much faster to train and less memory consuming than RESNet, and yielded less accuracy. But it was still an improvement over vanilla CNN and AlexNet.__

<img width='400px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vgg%20loss%20accuracy.jpeg">

<img width='600px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vgg%20image.jpeg">

<img width='300px' src="https://raw.githubusercontent.com/GordonCai/BU-EC500K-Deep-Learning/master/Homework/hw4/Vistualization/vgg%20hist.jpeg">