## Part I: Tensorflow

In [86]:
import os
import tensorflow as tf
import numpy as np
import math
import timeit
import matplotlib.pyplot as plt

%matplotlib inline

In [87]:
def load_cifar10(num_training=49000, num_validation=1000, num_test=10000):
    """
    Fetch the CIFAR-10 dataset from the web and perform preprocessing to prepare
    it for the two-layer neural net classifier.
    """
    # Load the raw CIFAR-10 dataset and use appropriate data types and shapes
    cifar10 = tf.keras.datasets.cifar10.load_data()
    (X_train, y_train), (X_test, y_test) = cifar10
    X_train = np.asarray(X_train, dtype=np.float32)
    y_train = np.asarray(y_train, dtype=np.int32).flatten()
    X_test = np.asarray(X_test, dtype=np.float32)
    y_test = np.asarray(y_test, dtype=np.int32).flatten()

    # Subsample the data
    mask = range(num_training, num_training + num_validation)
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = range(num_training)
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = range(num_test)
    X_test = X_test[mask]
    y_test = y_test[mask]

    # Normalize the data: subtract the mean pixel and divide by std
    mean_pixel = X_train.mean(axis=(0, 1, 2), keepdims=True)
    std_pixel = X_train.std(axis=(0, 1, 2), keepdims=True)
    X_train = (X_train - mean_pixel) / std_pixel
    X_val = (X_val - mean_pixel) / std_pixel
    X_test = (X_test - mean_pixel) / std_pixel

    return X_train, y_train, X_val, y_val, X_test, y_test

# If there are errors with SSL downloading involving self-signed certificates,
# it may be that your Python version was recently installed on the current machine.
# See: https://github.com/tensorflow/tensorflow/issues/10779
# To fix, run the command: /Applications/Python\ 3.7/Install\ Certificates.command
#   ...replacing paths as necessary.

# Invoke the above function to get our data.
NHW = (0, 1, 2)
X_train, y_train, X_val, y_val, X_test, y_test = load_cifar10()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape, y_train.dtype)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Train data shape:  (49000, 32, 32, 3)
Train labels shape:  (49000,) int32
Validation data shape:  (1000, 32, 32, 3)
Validation labels shape:  (1000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)


In [88]:
class Dataset(object):
    def __init__(self, X, y, batch_size, shuffle=False):
        """
        Construct a Dataset object to iterate over data X and labels y

        Inputs:
        - X: Numpy array of data, of any shape
        - y: Numpy array of labels, of any shape but with y.shape[0] == X.shape[0]
        - batch_size: Integer giving number of elements per minibatch
        - shuffle: (optional) Boolean, whether to shuffle the data on each epoch
        """
        assert X.shape[0] == y.shape[0], 'Got different numbers of data and labels'
        self.X, self.y = X, y
        self.batch_size, self.shuffle = batch_size, shuffle

    def __iter__(self):
        N, B = self.X.shape[0], self.batch_size
        idxs = np.arange(N)
        if self.shuffle:
            np.random.shuffle(idxs)
        return iter((self.X[i:i+B], self.y[i:i+B]) for i in range(0, N, B))


train_dset = Dataset(X_train, y_train, batch_size=64, shuffle=True)
val_dset = Dataset(X_val, y_val, batch_size=64, shuffle=False)
test_dset = Dataset(X_test, y_test, batch_size=64)

In [89]:
# We can iterate through a dataset like this:
for t, (x, y) in enumerate(train_dset):
    print(t, x.shape, y.shape)
    if t > 5: break

0 (64, 32, 32, 3) (64,)
1 (64, 32, 32, 3) (64,)
2 (64, 32, 32, 3) (64,)
3 (64, 32, 32, 3) (64,)
4 (64, 32, 32, 3) (64,)
5 (64, 32, 32, 3) (64,)
6 (64, 32, 32, 3) (64,)


In [90]:
# Set up some global variables
USE_GPU = False

if USE_GPU:
    device = '/device:GPU:0'
else:
    device = '/cpu:0'

# Constant to control how often we print when training models
print_every = 100

print('Using device: ', device)

Using device:  /cpu:0


In [91]:
## Helper Functions
def flatten(x):
    """
    Input:
    - TensorFlow Tensor of shape (N, D1, ..., DM)

    Output:
    - TensorFlow Tensor of shape (N, D1 * ... * DM)
    """
    N = tf.shape(x)[0]
    return tf.reshape(x, (N, -1))

def test_flatten():
    # Construct concrete values of the input data x using numpy
    x_np = np.arange(24).reshape((2, 3, 4))
    print('x_np:\n', x_np, '\n')
    # Compute a concrete output value.
    x_flat_np = flatten(x_np)
    print('x_flat_np:\n', x_flat_np, '\n')

test_flatten()

x_np:
 [[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]] 

x_flat_np:
 tf.Tensor(
[[ 0  1  2  3  4  5  6  7  8  9 10 11]
 [12 13 14 15 16 17 18 19 20 21 22 23]], shape=(2, 12), dtype=int32) 



### Define a TwoLayer Network

We will now implement our first neural network with TensorFlow: a fully-connected ReLU network with two hidden layers and no biases on the CIFAR10 dataset. For now we will use only low-level TensorFlow operators to define the network; later we will see how to use the higher-level abstractions provided by tf.keras to simplify the process.

We will define the forward pass of the network in the function TwoLayerFCNeuralNetwork; this will accept TensorFlow Tensors for the inputs and weights of the network, and return a TensorFlow Tensor for the scores.

After defining the network architecture in the TwoLayerFCNeuralNetwork function, we will test the implementation by checking the shape of the output.

It's important that you read and understand this implementation.

In [92]:
def TwoLayerFCNeuralNetwork(x, params):
    """
    A fully-connected neural network; the architecture is:
    fully-connected layer -> ReLU -> fully connected layer.
    Note that we only need to define the forward pass here; TensorFlow will take
    care of computing the gradients for us.

    The input to the network will be a minibatch of data, of shape
    (N, d1, ..., dM) where d1 * ... * dM = D. The hidden layer will have H units,
    and the output layer will produce scores for C classes.

    Inputs:
    - x: A TensorFlow Tensor of shape (N, d1, ..., dM) giving a minibatch of
      input data.
    - params: A list [w1, w2] of TensorFlow Tensors giving weights for the
      network, where w1 has shape (D, H) and w2 has shape (H, C).

    Returns:
    - scores: A TensorFlow Tensor of shape (N, C) giving classification scores
      for the input data x.
    """
    w1, w2 = params                   # Unpack the parameters
    x = flatten(x)                    # Flatten the input; now x has shape (N, D)
    h = tf.nn.relu(tf.matmul(x, w1))  # Hidden layer: h has shape (N, H)
    scores = tf.matmul(h, w2)         # Compute scores of shape (N, C)
    return scores

In [93]:
def TwoLayerFCNeuralNetwork_test():
    hidden_layer_size = 42

    # Scoping our TF operations under a tf.device context manager
    # lets us tell TensorFlow where we want these Tensors to be
    # multiplied and/or operated on, e.g. on a CPU or a GPU.
    with tf.device(device):
        x = tf.zeros((64, 32, 32, 3))
        w1 = tf.zeros((32 * 32 * 3, hidden_layer_size))
        w2 = tf.zeros((hidden_layer_size, 10))

        # Call our TwoLayerFCNeuralNetwork function for the forward pass of the network.
        scores = TwoLayerFCNeuralNetwork(x, [w1, w2])

    print(scores.shape)

TwoLayerFCNeuralNetwork_test()

(64, 10)


## Three-Layer ConvNet
Here you will complete the implementation of the function ThreeLayerConvNet which will perform the forward pass of a three-layer convolutional network. The network should have the following architecture:

A convolutional layer (with bias) with channel_1 filters, each with shape KW1 x KH1, and zero-padding of two
ReLU nonlinearity
A convolutional layer (with bias) with channel_2 filters, each with shape KW2 x KH2, and zero-padding of one
ReLU nonlinearity
Fully-connected layer with bias, producing scores for C classes.

HINT: For convolutions: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/conv2d; be careful with padding!

HINT: For biases: https://www.tensorflow.org/performance/xla/broadcasting

In [94]:
def ThreeLayerConvNet(x, params):
    """
    A three-layer convolutional network with the architecture described above.

    Inputs:
    - x: A TensorFlow Tensor of shape (N, H, W, 3) giving a minibatch of images
    - params: A list of TensorFlow Tensors giving the weights and biases for the
      network; should contain the following:
      - conv_w1: TensorFlow Tensor of shape (KH1, KW1, 3, channel_1) giving
        weights for the first convolutional layer.
      - conv_b1: TensorFlow Tensor of shape (channel_1,) giving biases for the
        first convolutional layer.
      - conv_w2: TensorFlow Tensor of shape (KH2, KW2, channel_1, channel_2)
        giving weights for the second convolutional layer
      - conv_b2: TensorFlow Tensor of shape (channel_2,) giving biases for the
        second convolutional layer.
      - fc_w: TensorFlow Tensor giving weights for the fully-connected layer.
        Can you figure out what the shape should be?
      - fc_b: TensorFlow Tensor giving biases for the fully-connected layer.
        Can you figure out what the shape should be?
    """
    conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b = params
    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the three-layer ConvNet.            #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    conv1 = tf.nn.conv2d(x, conv_w1, strides=[1, 1, 1, 1], padding='SAME') + conv_b1
    relu1 = tf.nn.relu(conv1)
    #pool1 = tf.nn.max_pool2d(relu1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

    # Second convolutional layer: Conv -> ReLU -> Pool
    conv2 = tf.nn.conv2d(relu1, conv_w2, strides=[1, 1, 1, 1], padding='SAME') + conv_b2
    relu2 = tf.nn.relu(conv2)
    #pool2 = tf.nn.max_pool2d(relu2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

    # Flatten the output from the second pooling layer
    flat = flatten(relu2)
    N = tf.shape(flat)[0]  # batch size
    flat_shape = tf.shape(flat)[1]  # Compute the size of the flattened vector
    
    # Reshape the flattened output to match the fully connected layer weight matrix
    flat = tf.reshape(flat, (tf.shape(flat)[0], flat_shape))  # Ensure the shape is [batch_size, flat_shape]

    # Fully connected layer: Affine transformation (matrix multiplication)
    scores = tf.matmul(flat, fc_w) + fc_b
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                              END OF YOUR CODE                            #
    ############################################################################
    return scores

After defing the forward pass of the three-layer ConvNet above, run the following cell to test your implementation. Like the two-layer network, we run the graph on a batch of zeros just to make sure the function doesn't crash, and produces outputs of the correct shape.

When you run this function, scores_np should have shape (64, 10).

In [95]:
def ThreeLayerConvNetTest():

    with tf.device(device):
        x = tf.zeros((64, 32, 32, 3))
        conv_w1 = tf.zeros((5, 5, 3, 6))
        conv_b1 = tf.zeros((6,))
        conv_w2 = tf.zeros((3, 3, 6, 9))
        conv_b2 = tf.zeros((9,))
        fc_w = tf.zeros((32 * 32 * 9, 10))
        fc_b = tf.zeros((10,))
        params = [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]
        scores = ThreeLayerConvNet(x, params)

    # Inputs to convolutional layers are 4-dimensional arrays with shape
    # [batch_size, height, width, channels]
    print('scores_np has shape: ', scores.shape)

ThreeLayerConvNetTest()

scores_np has shape:  (64, 10)


### Training Step
We now define the training_step function performs a single training step. This will take three basic steps:

Compute the loss
Compute the gradient of the loss with respect to all network weights
Make a weight update step using (stochastic) gradient descent.
We need to use a few new TensorFlow functions to do all of this:

For computing the cross-entropy loss we'll use tf.nn.
sparse_softmax_cross_entropy_with_logits: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits

For averaging the loss across a minibatch of data we'll use tf.reduce_mean:

https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/reduce_mean

For computing gradients of the loss with respect to the weights we'll use tf.GradientTape (useful for Eager execution): https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/GradientTape

We'll mutate the weight values stored in a TensorFlow Tensor using tf.compat.v1.assign_sub ("sub" is for subtraction): https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/compat/v1/assign_sub

In [96]:
def trainingStep(model_fn, x, y, params, learning_rate):
    with tf.GradientTape() as tape:
        scores = model_fn(x, params) # Forward pass of the model
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=scores)
        total_loss = tf.reduce_mean(loss)
        grad_params = tape.gradient(total_loss, params)

        # Make a vanilla gradient descent step on all of the model parameters
        # Manually update the weights using assign_sub()
        for w, grad_w in zip(params, grad_params):
            w.assign_sub(learning_rate * grad_w)

        return total_loss

In [97]:
def trainingStep2(model_fn, init_fn, learning_rate):
    """
    Train a model on CIFAR-10.

    Inputs:
    - model_fn: A Python function that performs the forward pass of the model
      using TensorFlow; it should have the following signature:
      scores = model_fn(x, params) where x is a TensorFlow Tensor giving a
      minibatch of image data, params is a list of TensorFlow Tensors holding
      the model weights, and scores is a TensorFlow Tensor of shape (N, C)
      giving scores for all elements of x.
    - init_fn: A Python function that initializes the parameters of the model.
      It should have the signature params = init_fn() where params is a list
      of TensorFlow Tensors holding the (randomly initialized) weights of the
      model.
    - learning_rate: Python float giving the learning rate to use for SGD.
    """
    params = init_fn()  # Initialize the model parameters

    for t, (x_np, y_np) in enumerate(train_dset):
        # Run the graph on a batch of training data.
        loss = trainingStep(model_fn, x_np, y_np, params, learning_rate)

        # Periodically print the loss and check accuracy on the val set.
        if t % print_every == 0:
            print('Iteration %d, loss = %.4f' % (t, loss))
            check_accuracy(val_dset, x_np, model_fn, params)

In [98]:
def check_accuracy(dset, x, model_fn, params):
    """
    Check accuracy on a classification model, e.g. for validation.

    Inputs:
    - dset: A Dataset object against which to check accuracy
    - x: A TensorFlow placeholder Tensor where input images should be fed
    - model_fn: the Model we will be calling to make predictions on x
    - params: parameters for the model_fn to work with

    Returns: Nothing, but prints the accuracy of the model
    """
    num_correct, num_samples = 0, 0
    for x_batch, y_batch in dset:
        scores_np = model_fn(x_batch, params).numpy()
        y_pred = scores_np.argmax(axis=1)
        num_samples += x_batch.shape[0]
        num_correct += (y_pred == y_batch).sum()
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))

### Initialization
We'll use the following utility method to initialize the weight matrices for our models using Kaiming's normalization method.

[1] He et al, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , ICCV 2015, https://arxiv.org/abs/1502.01852

In [99]:
def create_matrix_with_kaiming_normal(shape):
    if len(shape) == 2:
        fan_in, fan_out = shape[0], shape[1]
    elif len(shape) == 4:
        fan_in, fan_out = np.prod(shape[:3]), shape[3]
    return tf.keras.backend.random_normal(shape) * np.sqrt(2.0 / fan_in)

### Train a Two-Layer Network
We are finally ready to use all of the pieces defined above to train a two-layer fully-connected network on CIFAR-10.

We just need to define a function to initialize the weights of the model, and call train_part2.

Defining the weights of the network introduces another important piece of TensorFlow API: tf.Variable. A TensorFlow Variable is a Tensor whose value is stored in the graph and persists across runs of the computational graph; however unlike constants defined with tf.zeros or tf.random_normal, the values of a Variable can be mutated as the graph runs; these mutations will persist across graph runs. Learnable parameters of the network are usually stored in Variables.

You don't need to tune any hyperparameters, but you should achieve validation accuracies above 40% after one epoch of training.

In [100]:
def TwoLayerFCNeuralNetwork_init():
    """
    Initialize the weights of a two-layer network, for use with the
    TwoLayerFCNeuralNetwork function defined above.
    You can use the `create_matrix_with_kaiming_normal` helper!

    Inputs: None

    Returns: A list of:
    - w1: TensorFlow tf.Variable giving the weights for the first layer
    - w2: TensorFlow tf.Variable giving the weights for the second layer
    """
    hidden_layer_size = 4000
    w1 = tf.Variable(create_matrix_with_kaiming_normal((3 * 32 * 32, 4000)))
    w2 = tf.Variable(create_matrix_with_kaiming_normal((4000, 10)))
    return [w1, w2]

learning_rate = 1e-2
trainingStep2(TwoLayerFCNeuralNetwork, TwoLayerFCNeuralNetwork_init, learning_rate)

Iteration 0, loss = 2.8848
Got 149 / 1000 correct (14.90%)
Iteration 100, loss = 2.0034
Got 378 / 1000 correct (37.80%)
Iteration 200, loss = 1.5548
Got 415 / 1000 correct (41.50%)
Iteration 300, loss = 1.9043
Got 377 / 1000 correct (37.70%)
Iteration 400, loss = 1.7129
Got 405 / 1000 correct (40.50%)
Iteration 500, loss = 1.7573
Got 429 / 1000 correct (42.90%)
Iteration 600, loss = 1.8939
Got 427 / 1000 correct (42.70%)
Iteration 700, loss = 1.9229
Got 447 / 1000 correct (44.70%)


### Train a three-layer ConvNet
We will now use TensorFlow to train a three-layer ConvNet on CIFAR-10.

You need to implement the ThreeLayerConvNet_init function. Recall that the architecture of the network is:

Convolutional layer (with bias) with 32 5x5 filters, with zero-padding 2
ReLU
Convolutional layer (with bias) with 16 3x3 filters, with zero-padding 1
ReLU
Fully-connected layer (with bias) to compute scores for 10 classes
You don't need to do any hyperparameter tuning, but you should see validation accuracies above 43% after one epoch of training.


In [102]:
def ThreeLayerConvNet_init():
    """
    Initialize the weights of a Three-Layer ConvNet, for use with the
    ThreeLayerConvNet function defined above.
    You can use the `create_matrix_with_kaiming_normal` helper!

    Inputs: None

    Returns a list containing:
    - conv_w1: TensorFlow tf.Variable giving weights for the first conv layer
    - conv_b1: TensorFlow tf.Variable giving biases for the first conv layer
    - conv_w2: TensorFlow tf.Variable giving weights for the second conv layer
    - conv_b2: TensorFlow tf.Variable giving biases for the second conv layer
    - fc_w: TensorFlow tf.Variable giving weights for the fully-connected layer
    - fc_b: TensorFlow tf.Variable giving biases for the fully-connected layer
    """
    params = None
    ############################################################################
    # TODO: Initialize the parameters of the three-layer network.              #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    conv_w1 = tf.Variable(create_matrix_with_kaiming_normal([5, 5, 3, 32]))  
    conv_b1 = tf.Variable(tf.zeros(32))  
    
    # Second convolutional layer: (KH2, KW2, channel_1, channel_2)
    conv_w2 = tf.Variable(create_matrix_with_kaiming_normal([3, 3, 32, 16]))  
    conv_b2 = tf.Variable(tf.zeros(16))  
    
    # Fully connected layer: Size needs to match the output from the final pooling layer
    fc_w = tf.Variable(create_matrix_with_kaiming_normal([32 * 32 * 16, 10])) 
    fc_b = tf.Variable(tf.zeros(10)) 
    params = [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]


    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
    return params

learning_rate = 3e-3
trainingStep2(ThreeLayerConvNet, ThreeLayerConvNet_init, learning_rate)

Iteration 0, loss = 2.9589
Got 133 / 1000 correct (13.30%)
Iteration 100, loss = 1.7472
Got 362 / 1000 correct (36.20%)
Iteration 200, loss = 1.5707
Got 393 / 1000 correct (39.30%)
Iteration 300, loss = 1.6880
Got 403 / 1000 correct (40.30%)
Iteration 400, loss = 1.7265
Got 443 / 1000 correct (44.30%)
Iteration 500, loss = 1.6660
Got 447 / 1000 correct (44.70%)
Iteration 600, loss = 1.6297
Got 467 / 1000 correct (46.70%)
Iteration 700, loss = 1.6675
Got 474 / 1000 correct (47.40%)


## Part III: Keras Model Subclassing API
Implementing a neural network using the low-level TensorFlow API is a good way to understand how TensorFlow works, but it's a little inconvenient - we had to manually keep track of all Tensors holding learnable parameters. This was fine for a small network, but could quickly become unweildy for a large complex model.

Fortunately TensorFlow 2.0 provides higher-level APIs such as tf.keras which make it easy to build models out of modular, object-oriented layers. Further, TensorFlow 2.0 uses eager execution that evaluates operations immediately, without explicitly constructing any computational graphs. This makes it easy to write and debug models, and reduces the boilerplate code.

In this part of the notebook we will define neural network models using the tf.keras.Model API. To implement your own model, you need to do the following:

Define a new class which subclasses tf.keras.Model. Give your class an intuitive name that describes it, like twoLayerFC or threeLayerConvNet.
In the initializer __init__() for your new class, define all the layers you need as class attributes. The tf.keras.layers package provides many common neural-network layers, like tf.keras.layers.Dense for fully-connected layers and tf.keras.layers.Conv2D for convolutional layers. Under the hood, these layers will construct Variable Tensors for any learnable parameters. Warning: Don't forget to call super(YourModelName, self).__init__() as the first line in your initializer!
Implement the call() method for your class; this implements the forward pass of your model, and defines the connectivity of your network. Layers defined in __init__() implement __call__() so they can be used as function objects that transform input Tensors into output Tensors. Don't define any new layers in call(); any layers you want to use in the forward pass should be defined in __init__().
After you define your tf.keras.Model subclass, you can instantiate it and use it like the model functions from Part II.

Keras Model Subclassing API: Two-Layer Network
Here is a concrete example of using the tf.keras.Model API to define a two-layer network. There are a few new bits of API to be aware of here:

We use an Initializer object to set up the initial values of the learnable parameters of the layers; in particular tf.initializers.VarianceScaling gives behavior similar to the Kaiming initialization method we used in Part II. You can read more about it here: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/initializers/VarianceScaling

We construct tf.keras.layers.Dense objects to represent the two fully-connected layers of the model. In addition to multiplying their input by a weight matrix and adding a bias vector, these layer can also apply a nonlinearity for you. For the first layer we specify a ReLU activation function by passing activation='relu' to the constructor; the second layer uses softmax activation function. Finally, we use tf.keras.layers.Flatten to flatten the output from the previous fully-connected layer.

In [103]:
class twoLayerFC(tf.keras.Model):
    def __init__(self, hidden_size, num_classes):
        super(twoLayerFC, self).__init__()
        initializer = tf.initializers.VarianceScaling(scale=2.0)
        self.fc1 = tf.keras.layers.Dense(hidden_size, activation='relu',
                                   kernel_initializer=initializer)
        self.fc2 = tf.keras.layers.Dense(num_classes, activation='softmax',
                                   kernel_initializer=initializer)
        self.flatten = tf.keras.layers.Flatten()

    def call(self, x, training=False):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.fc2(x)
        return x


def twoLayerFC_test():
    """ A small unit test to exercise the TwoLayerFC model above. """
    input_size, hidden_size, num_classes = 50, 42, 10
    x = tf.zeros((64, input_size))
    model = twoLayerFC(hidden_size, num_classes)
    with tf.device(device):
        scores = model(x)
        print(scores.shape)

twoLayerFC_test()

(64, 10)


### Keras Model Subclassing API: Three-Layer ConvNet
Now it's your turn to implement a three-layer ConvNet using the tf.keras.Model API. Your model should have the same architecture used in Part II:

Convolutional layer with 5 x 5 kernels, with zero-padding of 2
ReLU nonlinearity
Convolutional layer with 3 x 3 kernels, with zero-padding of 1
ReLU nonlinearity
Fully-connected layer to give class scores
Softmax nonlinearity
You should initialize the weights of your network using the same initialization method as was used in the two-layer network above.

Hint: Refer to the documentation for tf.keras.layers.Conv2D and tf.keras.layers.Dense:

https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Conv2D

https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Dense

In [106]:
class threeLayerConvNet(tf.keras.Model):
    def __init__(self, channel_1, channel_2, num_classes):
        super(threeLayerConvNet, self).__init__()
        ########################################################################
        # TODO: Implement the __init__ method for a three-layer ConvNet. You   #
        # should instantiate layer objects to be used in the forward pass.     #
        ########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        self.conv1 = tf.keras.layers.Conv2D(channel_1, (5, 5), padding='same', activation='relu')
        self.conv2 = tf.keras.layers.Conv2D(channel_2, (3, 3), padding='same', activation='relu')
        self.flatten = tf.keras.layers.Flatten()
        self.fc = tf.keras.layers.Dense(num_classes)



        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################

    def call(self, x, training=False):
        scores = None
        ########################################################################
        # TODO: Implement the forward pass for a three-layer ConvNet. You      #
        # should use the layer objects defined in the __init__ method.         #
        ########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        x = self.conv1(x)
        x = self.conv2(x)
        x = self.flatten(x)
        scores = self.fc(x)




        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################
        return scores

Once you complete the implementation of the threeLayerConvNet above you can run the following to ensure that your implementation does not crash and produces outputs of the expected shape.

In [107]:
def threeLayerConvNet_test():
    channel_1, channel_2, num_classes = 12, 8, 10
    model = threeLayerConvNet(channel_1, channel_2, num_classes)
    with tf.device(device):
        x = tf.zeros((64, 32, 32, 3))
        scores = model(x)
        print(scores.shape)

threeLayerConvNet_test()

(64, 10)


### Keras Model Subclassing API: Eager Training
While keras models have a builtin training loop (using the model.fit), sometimes you need more customization. Here's an example, of a training loop implemented with eager execution.

In particular, notice tf.GradientTape. Automatic differentiation is used in the backend for implementing backpropagation in frameworks like TensorFlow. During eager execution, tf.GradientTape is used to trace operations for computing gradients later. A particular tf.GradientTape can only compute one gradient; subsequent calls to tape will throw a runtime error.

TensorFlow 2.0 ships with easy-to-use built-in metrics under tf.keras.metrics module. Each metric is an object, and we can use update_state() to add observations and reset_state() to clear all observations. We can get the current result of a metric by calling result() on the metric object.

In [112]:
def training_part3(model_init_fn, optimizer_init_fn, num_epochs=1, is_training=False):
    """
    Simple training loop for use with models defined using tf.keras. It trains
    a model for one epoch on the CIFAR-10 training set and periodically checks
    accuracy on the CIFAR-10 validation set.

    Inputs:
    - model_init_fn: A function that takes no parameters; when called it
      constructs the model we want to train: model = model_init_fn()
    - optimizer_init_fn: A function which takes no parameters; when called it
      constructs the Optimizer object we will use to optimize the model:
      optimizer = optimizer_init_fn()
    - num_epochs: The number of epochs to train for

    Returns: Nothing, but prints progress during trainingn
    """
    with tf.device(device):

        # Compute the loss like we did in Part II
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

        model = model_init_fn()
        optimizer = optimizer_init_fn()

        train_loss = tf.keras.metrics.Mean(name='train_loss')
        train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

        val_loss = tf.keras.metrics.Mean(name='val_loss')
        val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='val_accuracy')

        t = 0
        for epoch in range(num_epochs):

            # Reset the metrics - https://www.tensorflow.org/alpha/guide/migration_guide#new-style_metrics
            train_loss.reset_state()
            train_accuracy.reset_state()

            for x_np, y_np in train_dset:
                with tf.GradientTape() as tape:

                    # Use the model function to build the forward pass.
                    scores = model(x_np, training=is_training)
                    loss = loss_fn(y_np, scores)

                    gradients = tape.gradient(loss, model.trainable_variables)
                    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

                    # Update the metrics
                    train_loss.update_state(loss)
                    train_accuracy.update_state(y_np, scores)

                    if t % print_every == 0:
                        val_loss.reset_state()
                        val_accuracy.reset_state()
                        for test_x, test_y in val_dset:
                            # During validation at end of epoch, training set to False
                            prediction = model(test_x, training=False)
                            t_loss = loss_fn(test_y, prediction)

                            val_loss.update_state(t_loss)
                            val_accuracy.update_state(test_y, prediction)

                        template = 'Iteration {}, Epoch {}, Loss: {:.4f}, Accuracy: {:.4f}, Val Loss: {:.4f}, Val Accuracy: {:.4f}'
                        print (template.format(t, epoch+1,
                                             train_loss.result(),
                                             train_accuracy.result()*100,
                                             val_loss.result(),
                                             val_accuracy.result()*100))
                    t += 1

### Keras Model Subclassing API: Train a Two-Layer Network
We can now use the tools defined above to train a two-layer network on CIFAR-10. We define the model_init_fn and optimizer_init_fn that construct the model and optimizer respectively when called. Here we want to train the model using stochastic gradient descent with no momentum, so we construct a tf.keras.optimizers.SGD function; you can read about it here.

You don't need to tune any hyperparameters here, but you should achieve validation accuracies above 40% after one epoch of training.

In [113]:
hidden_size, num_classes = 4000, 10
learning_rate = 1e-2

def model_init_fn():
    return twoLayerFC(hidden_size, num_classes)

def optimizer_init_fn():
    return tf.keras.optimizers.SGD(learning_rate=learning_rate)

training_part3(model_init_fn, optimizer_init_fn)

Iteration 0, Epoch 1, Loss: 3.2726, Accuracy: 6.2500, Val Loss: 2.8284, Val Accuracy: 12.8000
Iteration 100, Epoch 1, Loss: 2.2496, Accuracy: 28.4035, Val Loss: 1.8974, Val Accuracy: 37.3000
Iteration 200, Epoch 1, Loss: 2.0893, Accuracy: 32.1828, Val Loss: 1.8206, Val Accuracy: 39.8000
Iteration 300, Epoch 1, Loss: 2.0108, Accuracy: 33.9909, Val Loss: 1.8763, Val Accuracy: 38.5000
Iteration 400, Epoch 1, Loss: 1.9413, Accuracy: 35.7894, Val Loss: 1.7407, Val Accuracy: 41.2000
Iteration 500, Epoch 1, Loss: 1.8973, Accuracy: 36.9137, Val Loss: 1.6783, Val Accuracy: 43.4000
Iteration 600, Epoch 1, Loss: 1.8669, Accuracy: 37.7444, Val Loss: 1.7031, Val Accuracy: 43.1000
Iteration 700, Epoch 1, Loss: 1.8401, Accuracy: 38.4272, Val Loss: 1.6418, Val Accuracy: 44.8000


### Keras Model Subclassing API: Train a Three-Layer ConvNet
Here you should use the tools we've defined above to train a three-layer ConvNet on CIFAR-10. Your ConvNet should use 32 filters in the first convolutional layer and 16 filters in the second layer.

To train the model you should use gradient descent with Nesterov momentum 0.9.

HINT: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/optimizers/SGD

You don't need to perform any hyperparameter tuning, but you should achieve validation accuracies above 50% after training for one epoch.

In [114]:
learning_rate = 3e-3
channel_1, channel_2, num_classes = 32, 16, 10

def model_init_fn():
    model = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    model = threeLayerConvNet(channel_1, channel_2, num_classes)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return model

def optimizer_init_fn():
    optimizer = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    learning_rate = 3e-3  # As specified
    momentum = 0.9  # Nesterov momentum
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum, nesterov=True)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return optimizer

training_part3(model_init_fn, optimizer_init_fn)

Iteration 0, Epoch 1, Loss: 12.0450, Accuracy: 4.6875, Val Loss: 11.8503, Val Accuracy: 10.5000
Iteration 100, Epoch 1, Loss: 3.3134, Accuracy: 10.5972, Val Loss: 2.3026, Val Accuracy: 10.1000
Iteration 200, Epoch 1, Loss: 2.8105, Accuracy: 10.5955, Val Loss: 2.3026, Val Accuracy: 10.1000
Iteration 300, Epoch 1, Loss: 2.6418, Accuracy: 10.4288, Val Loss: 2.3026, Val Accuracy: 10.1000
Iteration 400, Epoch 1, Loss: 2.5572, Accuracy: 10.4465, Val Loss: 2.3026, Val Accuracy: 10.1000
Iteration 500, Epoch 1, Loss: 2.5064, Accuracy: 10.2888, Val Loss: 2.3026, Val Accuracy: 10.1000
Iteration 600, Epoch 1, Loss: 2.4725, Accuracy: 10.2277, Val Loss: 2.3026, Val Accuracy: 10.1000
Iteration 700, Epoch 1, Loss: 2.4482, Accuracy: 10.1574, Val Loss: 2.3026, Val Accuracy: 10.1000


## Part IV: Keras Sequential API
In Part III we introduced the tf.keras.Model API, which allows you to define models with any number of learnable layers and with arbitrary connectivity between layers.

However for many models you don't need such flexibility - a lot of models can be expressed as a sequential stack of layers, with the output of each layer fed to the next layer as input. If your model fits this pattern, then there is an even easier way to define your model: using tf.keras.Sequential. You don't need to write any custom classes; you simply call the tf.keras.Sequential constructor with a list containing a sequence of layer objects.

One complication with tf.keras.Sequential is that you must define the shape of the input to the model by passing a value to the input_shape of the first layer in your model.

Keras Sequential API: Two-Layer Network
In this subsection, we will rewrite the two-layer fully-connected network using tf.keras.Sequential, and train it using the training loop defined above.

You don't need to perform any hyperparameter tuning here, but you should see validation accuracies above 40% after training for one epoch.

In [115]:
learning_rate = 1e-2

def model_init_fn():
    input_shape = (32, 32, 3)
    hidden_layer_size, num_classes = 4000, 10
    initializer = tf.initializers.VarianceScaling(scale=2.0)
    layers = [
        tf.keras.layers.Flatten(input_shape=input_shape),
        tf.keras.layers.Dense(hidden_layer_size, activation='relu',
                              kernel_initializer=initializer),
        tf.keras.layers.Dense(num_classes, activation='softmax',
                              kernel_initializer=initializer),
    ]
    model = tf.keras.Sequential(layers)
    return model

def optimizer_init_fn():
    return tf.keras.optimizers.SGD(learning_rate=learning_rate)

training_part3(model_init_fn, optimizer_init_fn)

Iteration 0, Epoch 1, Loss: 3.1494, Accuracy: 7.8125, Val Loss: 2.9321, Val Accuracy: 12.4000
Iteration 100, Epoch 1, Loss: 2.2213, Accuracy: 28.4189, Val Loss: 1.9004, Val Accuracy: 37.5000
Iteration 200, Epoch 1, Loss: 2.0733, Accuracy: 32.2761, Val Loss: 1.8821, Val Accuracy: 38.0000
Iteration 300, Epoch 1, Loss: 1.9968, Accuracy: 34.2452, Val Loss: 1.8683, Val Accuracy: 37.3000
Iteration 400, Epoch 1, Loss: 1.9276, Accuracy: 36.1245, Val Loss: 1.7215, Val Accuracy: 42.3000
Iteration 500, Epoch 1, Loss: 1.8857, Accuracy: 37.1039, Val Loss: 1.6550, Val Accuracy: 43.3000
Iteration 600, Epoch 1, Loss: 1.8569, Accuracy: 37.9498, Val Loss: 1.6994, Val Accuracy: 41.8000
Iteration 700, Epoch 1, Loss: 1.8302, Accuracy: 38.6145, Val Loss: 1.6378, Val Accuracy: 43.6000


### Abstracting Away the Training Loop
In the previous examples, we used a customised training loop to train models (e.g. train_part34). Writing your own training loop is only required if you need more flexibility and control during training your model. Alternately, you can also use built-in APIs like tf.keras.Model.fit() and tf.keras.Model.evaluate to train and evaluate a model. Also remember to configure your model for training by calling `tf.keras.Model.compile.

You don't need to perform any hyperparameter tuning here, but you should see validation and test accuracies above 42% after training for one epoch.

In [116]:
model = model_init_fn()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])
model.fit(X_train, y_train, batch_size=64, epochs=1, validation_data=(X_val, y_val))
model.evaluate(X_test, y_test)

[1m766/766[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m17s[0m 22ms/step - loss: 2.0068 - sparse_categorical_accuracy: 0.3404 - val_loss: 1.6609 - val_sparse_categorical_accuracy: 0.4130
[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - loss: 1.6507 - sparse_categorical_accuracy: 0.4322


[1.6566872596740723, 0.42750000953674316]

### Keras Sequential API: Three-Layer ConvNet
Here you should use tf.keras.Sequential to reimplement the same three-layer ConvNet architecture used in Part II and Part III. As a reminder, your model should have the following architecture:

Convolutional layer with 32 5x5 kernels, using zero padding of 2
ReLU nonlinearity
Convolutional layer with 16 3x3 kernels, using zero padding of 1
ReLU nonlinearity
Fully-connected layer giving class scores
Softmax nonlinearity
You should initialize the weights of the model using a tf.initializers.VarianceScaling as above.

You should train the model using Nesterov momentum 0.9.

You don't need to perform any hyperparameter search, but you should achieve accuracy above 45% after training for one epoch.

In [117]:
def model_init_fn():
    model = None
    ############################################################################
    # TODO: Construct a three-layer ConvNet using tf.keras.Sequential.         #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****


    model = tf.keras.Sequential([
        # First convolutional layer: 32 filters, 5x5 kernel, padding=2, ReLU
        tf.keras.layers.Conv2D(32, (5, 5), padding='same', activation='relu', 
                               kernel_initializer=tf.initializers.VarianceScaling()),

        # Second convolutional layer: 16 filters, 3x3 kernel, padding=1, ReLU
        tf.keras.layers.Conv2D(16, (3, 3), padding='same', activation='relu', 
                               kernel_initializer=tf.initializers.VarianceScaling()),

        # Flatten before the fully connected layer
        tf.keras.layers.Flatten(),

        # Fully connected layer: output 10 classes (CIFAR-10)
        tf.keras.layers.Dense(10, activation='softmax',
                              kernel_initializer=tf.initializers.VarianceScaling())
    ])


    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                            END OF YOUR CODE                              #
    ############################################################################
    return model

learning_rate = 5e-4
def optimizer_init_fn():
    optimizer = None
    ############################################################################
    # TODO: Complete the implementation of model_fn.                           #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    optimizer = tf.keras.optimizers.SGD(learning_rate=5e-4, momentum=0.9, nesterov=True)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                           END OF YOUR CODE                               #
    ############################################################################
    return optimizer

training_part3(model_init_fn, optimizer_init_fn)

Iteration 0, Epoch 1, Loss: 2.3888, Accuracy: 14.0625, Val Loss: 2.3621, Val Accuracy: 11.1000
Iteration 100, Epoch 1, Loss: 2.0370, Accuracy: 28.0786, Val Loss: 1.8745, Val Accuracy: 35.2000
Iteration 200, Epoch 1, Loss: 1.9268, Accuracy: 32.2683, Val Loss: 1.7288, Val Accuracy: 40.7000
Iteration 300, Epoch 1, Loss: 1.8600, Accuracy: 34.7280, Val Loss: 1.6847, Val Accuracy: 42.4000
Iteration 400, Epoch 1, Loss: 1.7965, Accuracy: 36.9623, Val Loss: 1.6233, Val Accuracy: 44.6000
Iteration 500, Epoch 1, Loss: 1.7533, Accuracy: 38.3078, Val Loss: 1.5804, Val Accuracy: 46.6000
Iteration 600, Epoch 1, Loss: 1.7249, Accuracy: 39.4239, Val Loss: 1.5434, Val Accuracy: 47.0000
Iteration 700, Epoch 1, Loss: 1.6986, Accuracy: 40.4289, Val Loss: 1.5085, Val Accuracy: 47.9000


We will also train this model with the built-in training loop APIs provided by TensorFlow.

In [118]:
model = model_init_fn()
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])
model.fit(X_train, y_train, batch_size=64, epochs=1, validation_data=(X_val, y_val))
model.evaluate(X_test, y_test)

[1m766/766[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 10ms/step - loss: 1.7753 - sparse_categorical_accuracy: 0.3707 - val_loss: 1.3596 - val_sparse_categorical_accuracy: 0.5240
[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 1.3568 - sparse_categorical_accuracy: 0.5251


[1.3711076974868774, 0.5151000022888184]

### Part V: BONUS POINT - CIFAR-10 open-ended challenge
In this section you can experiment with whatever ConvNet architecture you'd like on CIFAR-10.

You should experiment with architectures, hyperparameters, loss functions, regularization, or anything else you can think of to train a model that achieves at least 70% accuracy on the validation set within 10 epochs. You can use the built-in train function, the training_part3 function from above, or implement your own training loop.

Describe what you did at the end of the notebook.

Some things you can try:
Filter size: Above we used 5x5 and 3x3; is this optimal?
Number of filters: Above we used 16 and 32 filters. Would more or fewer do better?
Pooling: We didn't use any pooling above. Would this improve the model?
Normalization: Would your model be improved with batch normalization, layer normalization, group normalization, or some other normalization strategy?
Network architecture: The ConvNet above has only three layers of trainable parameters. Would a deeper model do better?
Global average pooling: Instead of flattening after the final convolutional layer, would global average pooling do better? This strategy is used for example in Google's Inception network and in Residual Networks.
Regularization: Would some kind of regularization improve performance? Maybe weight decay or dropout?
NOTE: Batch Normalization / Dropout
If you are using Batch Normalization and Dropout, remember to pass is_training=True if you use the training_part3() function. BatchNorm and Dropout layers have different behaviors at training and inference time. training is a specific keyword argument reserved for this purpose in any tf.keras.Model's call() function. Read more about this here : https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/BatchNormalization#methods https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Dropout#methods

Tips for training
For each network architecture that you try, you should tune the learning rate and other hyperparameters. When doing this there are a couple important things to keep in mind:

If the parameters are working well, you should see improvement within a few hundred iterations
Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.
You should use the validation set for hyperparameter search, and save your test set for evaluating your architecture on the best parameters as selected by the validation set.
Going above and beyond
If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are not required to implement any of these, but don't miss the fun if you have time!

Alternative optimizers: you can try Adam, Adagrad, RMSprop, etc.
Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.
Model ensembles
Data augmentation
New Architectures
ResNets where the input from the previous layer is added to the output.
DenseNets where inputs into previous layers are concatenated together.
Have fun and happy training!