## Part 2: Introduction to Feed Forward Networks

### 1. What is a neural network?

#### 1.1 Neurons

A neuron is software that is roughly modeled after the neuons in your brain. In software, we model it with an _affine function_ and an _activation function_. 

One type of neuron is the perceptron, which outputs a binary output 0 or 1 given an input [7]:

<img src="perceptron.jpg" width="600" height="480" />

You can add an activation function to the end isntead of simply thresholding values to clip values from 0 to 1. One common activiation function is the logistic function.

<img src="sigmoid_neuron.jpg" width="600" height="480" />

The most common activation function used nowadays is the rectified linear unit, which is simply max(0, z) where z = w * x + b, or the neurons output.

#### 1.2 Hidden layers and multi-layer perceptrons

A multi-layer perceptron (MLP) is quite simply layers on these perceptrons that are wired together. The layers between the input layer and the output layer are known as the hidden layers. The below is a four layer network with two hidden layers [7]:


<img src="hidden_layers.jpg" width="600" height="480" />

### 2. Tensorflow
Tensorflow (https://www.tensorflow.org/install/) is an extremely popular deep learning library built by Google and will be the main library used for of the rest of these notebooks (in the last lesson, we briefly used numpy, a numerical computation library that's useful but does not have deep learning functionality). NOTE: Other popular deep learning libraries include Pytorch and Caffe2. Keras is another popular one, but its API has since been absorbed into Tensorflow. Tensorflow is chosen here because:

* it has the most active community on Github
* it's well supported by Google in terms of core features
* it has Tensorflow serving, which allows you to serve your models online (something we'll see in a future notebook)
* it has Tensorboard for visualization (which we will use in this lesson)

Let's train our first model to get a sense of how powerful Tensorflow can be!

In [1]:
# Some initial setup. Borrowed from:
# https://github.com/ageron/handson-ml/blob/master/09_up_and_running_with_tensorflow.ipynb

# Common imports
import numpy as np
import os
import tensorflow as tf

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "tensorflow"

def save_fig(fig_id):
  path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
  print("Saving figure", fig_id)
  plt.tight_layout()
  plt.savefig(path, format='png', dpi=300)

def stabilize_output():
  tf.reset_default_graph()
  # needed to avoid the following error: https://github.com/RasaHQ/rasa_core/issues/80
  tf.keras.backend.clear_session()
  tf.set_random_seed(seed=42)
  np.random.seed(seed=42)

print "Done"

Done


Below we will train our first model using the example from the Tensorflow tutorial: https://www.tensorflow.org/tutorials/

This will show you the basics of training a model!

In [2]:
# The example below is also in https://colab.research.google.com/github/tensorflow/models/blob/master/samples/core/get_started/_index.ipynb

# to ensure relatively stable output across sessions
stabilize_output()

mnist = tf.keras.datasets.mnist
# load data (requires Internet connection)
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# build a model
model = tf.keras.models.Sequential([
  # flattens the input
  tf.keras.layers.Flatten(),
  # 1 "hidden" layer with 512 units - more on this in the next notebook
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  # example of regularization - dropout is a way of dropping hidden units at a certain factor
  # this essentially results in a model averaging across a large set of possible configurations of the hidden layer above
  # and results in model that should generalize better
  tf.keras.layers.Dropout(0.2),
  # 10 because there's possible didigts - 0 to 9
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# train a model (using 5 epochs -> notice the accuracy improving with each epoch)
model.fit(x_train, y_train, epochs=5)

print model.metrics_names  # see https://keras.io/models/model/ for the full API
# evaluate model accuracy
model.evaluate(x_test, y_test)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
['loss', 'acc']


[0.0647361186181428, 0.9807]

You should see something similar to [0.06788356024027743, 0.9806]. The first number is the final loss and the second number is the accuracy.

Congratulations, it means you've trained a classifier that classifies digit images in the MNIST Dataset with __98% accuracy__! We'll break down how the model is optimizing to achieve this accuracy below.

### 3.  More Training of Neural Networks in Tensorflow

#### 3.1: Data Preparation

We load the CIFAR-10 dataset using the tf.keras API. 

In [3]:
# Borrowed from http://cs231n.github.io/assignments2018/assignment2/
def load_cifar10(num_training=49000, num_validation=1000, num_test=10000):
    """
    Fetch the CIFAR-10 dataset from the web and perform preprocessing to prepare
    it for the two-layer neural net classifier. These are the same steps as
    we used for the SVM, but condensed to a single function.
    """
    # Load the raw CIFAR-10 dataset and use appropriate data types and shapes
    # NOTE: Download will take a few minutes but once downloaded, it should be cached.
    cifar10 = tf.keras.datasets.cifar10.load_data()
    (X_train, y_train), (X_test, y_test) = cifar10
    X_train = np.asarray(X_train, dtype=np.float32)
    y_train = np.asarray(y_train, dtype=np.int32).flatten()
    X_test = np.asarray(X_test, dtype=np.float32)
    y_test = np.asarray(y_test, dtype=np.int32).flatten()

    # Subsample the data
    mask = range(num_training, num_training + num_validation)
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = range(num_training)
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = range(num_test)
    X_test = X_test[mask]
    y_test = y_test[mask]

    # Normalize the data: subtract the mean pixel and divide by std
    mean_pixel = X_train.mean(axis=(0, 1, 2), keepdims=True)
    std_pixel = X_train.std(axis=(0, 1, 2), keepdims=True)
    X_train = (X_train - mean_pixel) / std_pixel
    X_val = (X_val - mean_pixel) / std_pixel
    X_test = (X_test - mean_pixel) / std_pixel

    return X_train, y_train, X_val, y_val, X_test, y_test


# Invoke the above function to get our data.
# N - index of the number of datapoints (minibatch size)
# H - index of the the height of the feature map
# W - index of the width of the feature map
NHW = (0, 1, 2)
X_train, y_train, X_val, y_val, X_test, y_test = load_cifar10()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape, y_train.dtype)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

('Train data shape: ', (49000, 32, 32, 3))
('Train labels shape: ', (49000,), dtype('int32'))
('Validation data shape: ', (1000, 32, 32, 3))
('Validation labels shape: ', (1000,))
('Test data shape: ', (10000, 32, 32, 3))
('Test labels shape: ', (10000,))


#### 3.2 Preparation: Dataset object

Borr CS231N [2], we will define a lightweight `Dataset` class which lets us iterate over data and labels. This is not the most flexible or most efficient way to iterate through data, but it will serve our purposes."

In [5]:
class Dataset(object):
    def __init__(self, X, y, batch_size, shuffle=False):
        """
        Construct a Dataset object to iterate over data X and labels y
        
        Inputs:
        - X: Numpy array of data, of any shape
        - y: Numpy array of labels, of any shape but with y.shape[0] == X.shape[0]
        - batch_size: Integer giving number of elements per minibatch
        - shuffle: (optional) Boolean, whether to shuffle the data on each epoch
        """
        assert X.shape[0] == y.shape[0], 'Got different numbers of data and labels'
        self.X, self.y = X, y
        self.batch_size, self.shuffle = batch_size, shuffle

    def __iter__(self):
        N, B = self.X.shape[0], self.batch_size
        idxs = np.arange(N)
        if self.shuffle:
            np.random.shuffle(idxs)
        return iter((self.X[i:i+B], self.y[i:i+B]) for i in range(0, N, B))


train_dset = Dataset(X_train, y_train, batch_size=64, shuffle=True)
val_dset = Dataset(X_val, y_val, batch_size=64, shuffle=False)
test_dset = Dataset(X_test, y_test, batch_size=64)
print "Done"

Done


In [6]:
# We can iterate through a dataset like this:
for t, (x, y) in enumerate(train_dset):
    print(t, x.shape, y.shape)
    if t > 5: break
        
# You can also optionally set GPU to true if you are working on AWS/Google Cloud (more on that later). For now,
# we to false

# Set up some global variables
USE_GPU = False

if USE_GPU:
    device = '/device:GPU:0'
else:
    device = '/cpu:0'

# Constant to control how often we print when training models
print_every = 100

print('Using device: ', device)

(0, (64, 32, 32, 3), (64,))
(1, (64, 32, 32, 3), (64,))
(2, (64, 32, 32, 3), (64,))
(3, (64, 32, 32, 3), (64,))
(4, (64, 32, 32, 3), (64,))
(5, (64, 32, 32, 3), (64,))
(6, (64, 32, 32, 3), (64,))
('Using device: ', '/cpu:0')


In [9]:
# Borrowed fromcs231n.github.io/assignments2018/assignment2/
# We define a flatten utility function to help us flatten our image data - the 32x32x3 (or 32 x 32 image size with three
# channels for RGB) flattens into 3072 
def flatten(x):
    """    
    Input:
    - TensorFlow Tensor of shape (N, D1, ..., DM)
    
    Output:
    - TensorFlow Tensor of shape (N, D1 * ... * DM)
    """
    N = tf.shape(x)[0]
    return tf.reshape(x, (N, -1))

def two_layer_fc(x, params):
    """
    A fully-connected neural network; the architecture is:
    fully-connected layer -> ReLU -> fully connected layer.
    Note that we only need to define the forward pass here; TensorFlow will take
    care of computing the gradients for us.
    
    The input to the network will be a minibatch of data, of shape
    (N, d1, ..., dM) where d1 * ... * dM = D. The hidden layer will have H units,
    and the output layer will produce scores for C classes.

    Inputs:
    - x: A TensorFlow Tensor of shape (N, d1, ..., dM) giving a minibatch of
      input data.
    - params: A list [w1, w2] of TensorFlow Tensors giving weights for the
      network, where w1 has shape (D, H) and w2 has shape (H, C).
    
    Returns:
    - scores: A TensorFlow Tensor of shape (N, C) giving classification scores
      for the input data x.
    """
    w1, w2 = params  # Unpack the parameters
    x = flatten(x)   # Flatten the input; now x has shape (N, D)
    h = tf.nn.relu(tf.matmul(x, w1)) # Hidden layer: h has shape (N, H)
    scores = tf.matmul(h, w2)        # Compute scores of shape (N, C)
    return scores

def two_layer_fc_test():
    # TensorFlow's default computational graph is essentially a hidden global
    # variable. To avoid adding to this default graph when you rerun this cell,
    # we clear the default graph before constructing the graph we care about.
    tf.reset_default_graph()
    hidden_layer_size = 42

    # Scoping our computational graph setup code under a tf.device context
    # manager lets us tell TensorFlow where we want these Tensors to be
    # placed.
    with tf.device(device):
        # Set up a placehoder for the input of the network, and constant
        # zero Tensors for the network weights. Here we declare w1 and w2
        # using tf.zeros instead of tf.placeholder as we've seen before - this
        # means that the values of w1 and w2 will be stored in the computational
        # graph itself and will persist across multiple runs of the graph; in
        # particular this means that we don't have to pass values for w1 and w2
        # using a feed_dict when we eventually run the graph.
        x = tf.placeholder(tf.float32)
        w1 = tf.zeros((32 * 32 * 3, hidden_layer_size))
        w2 = tf.zeros((hidden_layer_size, 10))
        
        # Call our two_layer_fc function to set up the computational
        # graph for the forward pass of the network.
        scores = two_layer_fc(x, [w1, w2])
    
    # Use numpy to create some concrete data that we will pass to the
    # computational graph for the x placeholder.
    x_np = np.zeros((64, 32, 32, 3))
    with tf.Session() as sess:
        # The calls to tf.zeros above do not actually instantiate the values
        # for w1 and w2; the following line tells TensorFlow to instantiate
        # the values of all Tensors (like w1 and w2) that live in the graph.
        sess.run(tf.global_variables_initializer())
        
        # Here we actually run the graph, using the feed_dict to pass the
        # value to bind to the placeholder for x; we ask TensorFlow to compute
        # the value of the scores Tensor, which it returns as a numpy array.
        scores_np = sess.run(scores, feed_dict={x: x_np})
        print scores_np
        print(scores_np.shape)

two_layer_fc_test()
# should print a bunch of zeros
# should print {64, 10}
print "Done"

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0.

#### 3.3 Training

We will now train using the gradient descent algorithm explained in the previous notebook. The check_accuracy function below lets us check the accuracy of our neural network.

In [11]:
# Borrowed from cs231n.github.io/assignments2018/assignment2/
def training_step(scores, y, params, learning_rate):
    """
    Set up the part of the computational graph which makes a training step.

    Inputs:
    - scores: TensorFlow Tensor of shape (N, C) giving classification scores for
      the model.
    - y: TensorFlow Tensor of shape (N,) giving ground-truth labels for scores;
      y[i] == c means that c is the correct class for scores[i].
    - params: List of TensorFlow Tensors giving the weights of the model
    - learning_rate: Python scalar giving the learning rate to use for gradient
      descent step.
      
    Returns:
    - loss: A TensorFlow Tensor of shape () (scalar) giving the loss for this
      batch of data; evaluating the loss also performs a gradient descent step
      on params (see above).
    """
    # First compute the loss; the first line gives losses for each example in
    # the minibatch, and the second averages the losses acros the batch
    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=scores)
    loss = tf.reduce_mean(losses)

    # Compute the gradient of the loss with respect to each parameter of the the
    # network. This is a very magical function call: TensorFlow internally
    # traverses the computational graph starting at loss backward to each element
    # of params, and uses backpropagation to figure out how to compute gradients;
    # it then adds new operations to the computational graph which compute the
    # requested gradients, and returns a list of TensorFlow Tensors that will
    # contain the requested gradients when evaluated.
    grad_params = tf.gradients(loss, params)
    
    # Make a gradient descent step on all of the model parameters.
    new_weights = []   
    for w, grad_w in zip(params, grad_params):
        new_w = tf.assign_sub(w, learning_rate * grad_w)
        new_weights.append(new_w)

    # Insert a control dependency so that evaluting the loss causes a weight
    # update to happen; see the discussion above.
    with tf.control_dependencies(new_weights):
        return tf.identity(loss)

def train_part2(model_fn, init_fn, learning_rate):
    """
    Train a model on CIFAR-10.
    
    Inputs:
    - model_fn: A Python function that performs the forward pass of the model
      using TensorFlow; it should have the following signature:
      scores = model_fn(x, params) where x is a TensorFlow Tensor giving a
      minibatch of image data, params is a list of TensorFlow Tensors holding
      the model weights, and scores is a TensorFlow Tensor of shape (N, C)
      giving scores for all elements of x.
    - init_fn: A Python function that initializes the parameters of the model.
      It should have the signature params = init_fn() where params is a list
      of TensorFlow Tensors holding the (randomly initialized) weights of the
      model.
    - learning_rate: Python float giving the learning rate to use for SGD.
    """
    # First clear the default graph
    tf.reset_default_graph()
    is_training = tf.placeholder(tf.bool, name='is_training')
    # Set up the computational graph for performing forward and backward passes,
    # and weight updates.
    with tf.device(device):
        # Set up placeholders for the data and labels
        x = tf.placeholder(tf.float32, [None, 32, 32, 3])
        y = tf.placeholder(tf.int32, [None])
        params = init_fn()           # Initialize the model parameters
        scores = model_fn(x, params) # Forward pass of the model
        loss = training_step(scores, y, params, learning_rate)

    # Now we actually run the graph many times using the training data
    with tf.Session() as sess:
        # Initialize variables that will live in the graph
        sess.run(tf.global_variables_initializer())
        for t, (x_np, y_np) in enumerate(train_dset):
            # Run the graph on a batch of training data; recall that asking
            # TensorFlow to evaluate loss will cause an SGD step to happen.
            feed_dict = {x: x_np, y: y_np}
            loss_np = sess.run(loss, feed_dict=feed_dict)
            
            # Periodically print the loss and check accuracy on the val set
            if t % print_every == 0:
                print('Iteration %d, loss = %.4f' % (t, loss_np))
                check_accuracy(sess, val_dset, x, scores, is_training)
                
def check_accuracy(sess, dset, x, scores, is_training=None):
    """
    Check accuracy on a classification model.
    
    Inputs:
    - sess: A TensorFlow Session that will be used to run the graph
    - dset: A Dataset object on which to check accuracy
    - x: A TensorFlow placeholder Tensor where input images should be fed
    - scores: A TensorFlow Tensor representing the scores output from the
      model; this is the Tensor we will ask TensorFlow to evaluate.
      
    Returns: Nothing, but prints the accuracy of the model
    """
    num_correct, num_samples = 0, 0
    for x_batch, y_batch in dset:
        feed_dict = {x: x_batch, is_training: 0}
        scores_np = sess.run(scores, feed_dict=feed_dict)
        y_pred = scores_np.argmax(axis=1)
        num_samples += x_batch.shape[0]
        num_correct += (y_pred == y_batch).sum()
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))
print "Done"

Done


In [12]:
# Borrowed from cs231n.github.io/assignments2018/assignment2/
# We initialize the weight matrices for our models using a method known as Kaiming's normalization method [8]
def kaiming_normal(shape):
    if len(shape) == 2:
        fan_in, fan_out = shape[0], shape[1]
    elif len(shape) == 4:
        fan_in, fan_out = np.prod(shape[:3]), shape[3]
    return tf.random_normal(shape) * np.sqrt(2.0 / fan_in)


def two_layer_fc_init():
    """
    Initialize the weights of a two-layer network (one hidden layer), for use with the
    two_layer_network function defined above.
    
    Inputs: None
    
    Returns: A list of:
    - w1: TensorFlow Variable giving the weights for the first layer
    - w2: TensorFlow Variable giving the weights for the second layer
    """
    # Numer of neurons in hidden layer
    hidden_layer_size = 4000
    # Now we initialize the weights of our two layer network using tf.Variable
    # "A TensorFlow Variable is a Tensor whose value is stored in the graph and persists across runs of the 
    # computational graph; however unlike constants defined with `tf.zeros` or `tf.random_normal`, 
    # the values of a Variable can be mutated as the graph runs; these mutations will persist across graph runs. 
   # Learnable parameters of the network are usually stored in Variables."
    w1 = tf.Variable(kaiming_normal((3 * 32 * 32, hidden_layer_size)))
    w2 = tf.Variable(kaiming_normal((hidden_layer_size, 10)))
    return [w1, w2]
print "Done"

Done


In [14]:
# Now we actually train our model with one *epoch* (an epoch in this case consists of 700 iterations
# of gradient descent but can be tuned)! We use a learning rate of 0.01
learning_rate = 1e-2
train_part2(two_layer_fc, two_layer_fc_init, learning_rate)

# You should see an accuracy of >40% with just one epoch

Iteration 0, loss = 2.8786
Got 139 / 1000 correct (13.90%)
Iteration 100, loss = 1.9534
Got 381 / 1000 correct (38.10%)
Iteration 200, loss = 1.4373
Got 384 / 1000 correct (38.40%)
Iteration 300, loss = 1.8520
Got 360 / 1000 correct (36.00%)
Iteration 400, loss = 1.8234
Got 422 / 1000 correct (42.20%)
Iteration 500, loss = 1.7975
Got 443 / 1000 correct (44.30%)
Iteration 600, loss = 1.8558
Got 426 / 1000 correct (42.60%)
Iteration 700, loss = 2.0239
Got 453 / 1000 correct (45.30%)


#### 3.3 Keras

Note in the first cell, we used the tf.keras Sequential API to make a neural network but here we use "barebones" Tensorflow. One of the good (and possibly bad) things about Tensorflow is that there are several ways to create a neural network and train it. Here are some possible ways:
* Barebones tensorflow
* tf.keras Model API
* tf.keras Sequential API

Here is a table of comparison borrowed from [2]:

| API           | Flexibility | Convenience |
|---------------|-------------|-------------|
| Barebone      | High        | Low         |
| `tf.keras.Model`     | High        | Medium      |
| `tf.keras.Sequential` | Low         | High        |


Note that with the tf.keras Model API, you have the options of using the **object-oriented API**, where each layer of the neural network is represented as a Python object (like `tf.layers.Dense`) or the **functional API**, where each layer is a Python function (like `tf.layers.dense`). We will only use the Sequential API and skip the Model API in the cells below because we will simply trade off lots of flexiblity for convenience.

In [15]:
# Now we will train the same model using the Sequential API. 
# First we set up our training and model initializiation functions
def train_part34(model_init_fn, optimizer_init_fn, num_epochs=1):
    """
    Simple training loop for use with models defined using tf.keras. It trains
    a model for one epoch on the CIFAR-10 training set and periodically checks
    accuracy on the CIFAR-10 validation set.
    
    Inputs:
    - model_init_fn: A function that takes no parameters; when called it
      constructs the model we want to train: model = model_init_fn()
    - optimizer_init_fn: A function which takes no parameters; when called it
      constructs the Optimizer object we will use to optimize the model:
      optimizer = optimizer_init_fn()
    - num_epochs: The number of epochs to train for
    
    Returns: Nothing, but prints progress during trainingn
    """
    tf.reset_default_graph()    
    with tf.device(device):
        # Construct the computational graph we will use to train the model. We
        # use the model_init_fn to construct the model, declare placeholders for
        # the data and labels
        x = tf.placeholder(tf.float32, [None, 32, 32, 3])
        y = tf.placeholder(tf.int32, [None])
        
        # We need a place holder to explicitly specify if the model is in the training
        # phase or not. This is because a number of layers behaves differently in
        # training and in testing, e.g., dropout and batch normalization.
        # We pass this variable to the computation graph through feed_dict as shown below.
        is_training = tf.placeholder(tf.bool, name='is_training')
        
        # Use the model function to build the forward pass.
        scores = model_init_fn(x, is_training)

        # Compute the loss like we did in Part II
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=scores)
        loss = tf.reduce_mean(loss)

        # Use the optimizer_fn to construct an Optimizer, then use the optimizer
        # to set up the training step. Asking TensorFlow to evaluate the
        # train_op returned by optimizer.minimize(loss) will cause us to make a
        # single update step using the current minibatch of data.
        
        # Note that we use tf.control_dependencies to force the model to run
        # the tf.GraphKeys.UPDATE_OPS at each training step. tf.GraphKeys.UPDATE_OPS
        # holds the operators that update the states of the network.
        # For example, the tf.layers.batch_normalization function adds the running mean
        # and variance update operators to tf.GraphKeys.UPDATE_OPS.
        optimizer = optimizer_init_fn()
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.minimize(loss)

    # Now we can run the computational graph many times to train the model.
    # When we call sess.run we ask it to evaluate train_op, which causes the
    # model to update.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        t = 0
        for epoch in range(num_epochs):
            print('Starting epoch %d' % epoch)
            for x_np, y_np in train_dset:
                feed_dict = {x: x_np, y: y_np, is_training:1}
                loss_np, _ = sess.run([loss, train_op], feed_dict=feed_dict)
                if t % print_every == 0:
                    print('Iteration %d, loss = %.4f' % (t, loss_np))
                    check_accuracy(sess, val_dset, x, scores, is_training=is_training)
                    print()
                t += 1
                
def model_init_fn(inputs, is_training):
    input_shape = (32, 32, 3)
    hidden_layer_size, num_classes = 4000, 10
    initializer = tf.variance_scaling_initializer(scale=2.0)
    layers = [
        tf.layers.Flatten(input_shape=input_shape),
        tf.layers.Dense(hidden_layer_size, activation=tf.nn.relu,
                        kernel_initializer=initializer),
        tf.layers.Dense(num_classes, kernel_initializer=initializer),
    ]
    model = tf.keras.Sequential(layers)
    return model(inputs)

def optimizer_init_fn():
    return tf.train.GradientDescentOptimizer(learning_rate)
print "Done"

Done


In [16]:
# Now the actual training
learning_rate = 1e-2
train_part34(model_init_fn, optimizer_init_fn)

# Again, you should see accuracy > 40% after one epoch (700 iterations) of gradient descent

Starting epoch 0
Iteration 0, loss = 2.6382
Got 120 / 1000 correct (12.00%)
()
Iteration 100, loss = 1.8520
Got 384 / 1000 correct (38.40%)
()
Iteration 200, loss = 1.4591
Got 411 / 1000 correct (41.10%)
()
Iteration 300, loss = 1.7857
Got 383 / 1000 correct (38.30%)
()
Iteration 400, loss = 1.8122
Got 423 / 1000 correct (42.30%)
()
Iteration 500, loss = 1.7382
Got 455 / 1000 correct (45.50%)
()
Iteration 600, loss = 1.7749
Got 426 / 1000 correct (42.60%)
()
Iteration 700, loss = 1.8358
Got 447 / 1000 correct (44.70%)
()


### 4. Backpropagation

You'll often hear the term "backpropagation" or "backprop," which is a way of updating a neural network. Google has a great demo that walks you through the backpropagation algorithm in detail. I encourage you to check it out!

https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/

See also this seminar by Geoffrey Hinton, a premier deep learning researcher, on whether the brain can do back-propagation. It's an interesting lecture with relatively : https://www.youtube.com/watch?v=VIRCybGgHts



### 5. References

<pre>
  [1] Fast.ai (http://course.fast.ai/)  
  [2] CS231N (http://cs231n.github.io/)  
  [3] CS224D (http://cs224d.stanford.edu/syllabus.html)  
  [4] Hands on Machine Learning (https://github.com/ageron/handson-ml)  
  [5] Deep learning with Python Notebooks (https://github.com/fchollet/deep-learning-with-python-notebooks)  
  [6] Deep learning by Goodfellow et. al (http://www.deeplearningbook.org/)  
  [7] Neural networks online book (http://neuralnetworksanddeeplearning.com/)
  [8] He et al, *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
*, ICCV 2015, https://arxiv.org/abs/1502.01852
</pre>