# Training Networks: Backpropagation

### Recall Stochastic Gradient Descent

---
> __SGD and Friends__

> * Define error in terms of some params: ${Err = f(p_1, p_2, p_3, ...)}$

> * Look at error gradient, ${\nabla Err}$ with components ${{\partial Err \over \partial p_1},{\partial Err \over \partial p_2}, ...}$

> * Update params per learning rate (${\eta}$)

> * Repeat...

---
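
The loop in the box above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production optimizer: the target line (slope 2, intercept 6), the learning rate `eta`, the step count, and the function name `sgd_fit_line` are all assumptions chosen for the example.

```python
import numpy as np

def sgd_fit_line(n_steps=5000, eta=0.05, seed=0):
    """Fit y = m*x + b by SGD, one random sample per step (illustrative only)."""
    rng = np.random.default_rng(seed)
    m, b = 1.0, 1.0                      # the params p_1, p_2
    for _ in range(n_steps):
        x = rng.random()                 # one training sample in [0, 1)
        y = 2.0 * x + 6.0                # assumed "true" relationship
        residual = y - (m * x + b)       # Err = residual**2
        grad_m = -2.0 * residual * x     # dErr/dm
        grad_b = -2.0 * residual         # dErr/db
        m -= eta * grad_m                # step against the gradient
        b -= eta * grad_b
    return m, b

m_fit, b_fit = sgd_fit_line()
print("Model: {:.3f}x + {:.3f}".format(m_fit, b_fit))
```

Each step uses a single sample, which is what makes the descent "stochastic"; averaging the gradient over a small batch of samples is a common variation.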

With a 1-layer network, we could use SGD or a related algorithm to derive the weights, since the error depends directly on those weights.

With a deeper network, we have a couple of challenges:

* The error is computed from the final layer, so the gradient of the error doesn't tell us immediately about problems in other-layer weights
* There are -- even in our 2-layer diamonds model -- thousands of weights. Each of those weights may need to move a little at a time, and we have to watch out for numerical underflow and loss of significance.

__In a deep network, the nth layer errors are "caused" by errors in the (n-1)th layer and are detected in the errors in the (n+1)th layer__

### The insight is to iteratively calculate errors, one layer at a time, starting at the output. This is called Backpropagation. It is neither magical nor surprising. The challenge is just doing it fast.

<img src="http://i.imgur.com/bjlYwjM.jpg" width=800>

# Since we are differentiating a composition of functions,<br>this is just the "Chain Rule" of high school calculus
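
To make the chain rule concrete, here is a hedged sketch of backprop through a tiny two-layer network, checked against a finite-difference estimate. The layer sizes, the squared-error loss, and all function names are assumptions for illustration. The activation is passed in as a pair of functions (`act`, `act_grad`) so that nothing depends on a particular choice; `tanh` is used only to run the check.

```python
import numpy as np

def forward(x, W1, W2, act):
    z1 = W1 @ x            # pre-activation of the hidden layer
    a1 = act(z1)           # hidden activations
    y_hat = W2 @ a1        # linear output layer
    return z1, a1, y_hat

def backprop(x, y, W1, W2, act, act_grad):
    """Gradients of Err = 0.5 * ||y_hat - y||^2 with respect to W1 and W2."""
    z1, a1, y_hat = forward(x, W1, W2, act)
    delta2 = y_hat - y                       # dErr/dy_hat at the output
    grad_W2 = np.outer(delta2, a1)           # dErr/dW2
    delta1 = (W2.T @ delta2) * act_grad(z1)  # error pushed back one layer
    grad_W1 = np.outer(delta1, x)            # dErr/dW1
    return grad_W1, grad_W2

def numeric_grad_W1(x, y, W1, W2, act, eps=1e-6):
    """Central-difference estimate of dErr/dW1, used only as a check."""
    def err(W1_):
        return 0.5 * np.sum((forward(x, W1_, W2, act)[2] - y) ** 2)
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        step = np.zeros_like(W1)
        step[idx] = eps
        grad[idx] = (err(W1 + step) - err(W1 - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
act = np.tanh
act_grad = lambda z: 1.0 - np.tanh(z) ** 2

grad_W1, grad_W2 = backprop(x, y, W1, W2, act, act_grad)
max_diff = np.max(np.abs(grad_W1 - numeric_grad_W1(x, y, W1, W2, act)))
print("max |backprop - numeric| =", max_diff)
```

The line computing `delta1` is the whole idea: the output-layer error is multiplied back through `W2` and through the local derivative of the activation, one layer at a time.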

## Don't be waylaid by backpropagation!

There's *less* here than meets the eye. I've included some links to more detailed posts below, so that you can help solidify your intuition by making the calculations concrete. However, don't let any individual article or explanation distract you from the very simple concept.

Many presentations about backprop are confused or cluttered by one or more of the following problems -- don't let them catch you off guard :)

* Playing fast/loose with vector calculus and matrix notation (don't get caught up in the notation, it's not the most important thing)
* Assuming a specific neuron activation function, then using concrete formulas based on differentiating that particular function (don't worry about the specific function -- backprop needs to work in general, and inserting a specific derivative confuses things)
* Mixing implementation techniques (how to calculate these derivatives conveniently or quickly, which is another issue)

#### With the warnings out of the way, here are some resources on backprop:
* https://sebastianraschka.com/faq/docs/visual-backpropagation.html (Overview)
* http://neuralnetworksanddeeplearning.com/chap2.html (Detail)
* https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ (Concrete example with numbers)
* http://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/ (Coding it)

*You will likely never need to code backprop, let alone implement an optimized general version. Do it for fun if you like, but don't let it keep you out of the game.*

## If this Works Now, Why Didn't it Work 20 Years Ago?

After all, backprop is decades old: it dates back at least to the 1970s, and Geoff Hinton used it successfully in the 1980s.

The general challenges were:

* Diffusion of information through the sheer quantity of parameters
* Butterfly effect of small fluctuations through the system
* Flat/vanishing/unstable gradients
* Saturating units

The improvements are largely not changes to theory but a ton of incremental, practical fixes starting with lots of horsepower and lots of data.

Beyond those obvious pieces, we discuss major learnings next.

## How TensorFlow et al. Make this Work Well

There are a few approaches to differentiation:

* Symbolic differentiation -- machine version of what we learned in high school
* Numeric differentiation -- calculate approximate slopes
* __*Autodifferentiation*__ -- use the chain rule to build derivatives as we build our output function
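
As a small illustration of the numeric approach, the sketch below estimates a slope with a central difference and compares it with the known derivative. The test function (`math.sin`), the evaluation point, and the step size `h` are assumptions picked for the example.

```python
import math

def central_difference(f, x, h=1e-5):
    """Approximate f'(x) from two nearby evaluations of f."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x0 = 1.0
approx = central_difference(math.sin, x0)
exact = math.cos(x0)                     # the derivative we know from calculus
print(approx, exact, abs(approx - exact))
```

The result is only approximate, and every parameter needs its own pair of function evaluations, which gets expensive with thousands of weights.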

Consider the calculation $${\begin{aligned}z&=f(x_{1},x_{2})\\&=x_{1}x_{2}+\sin x_{1}\\&=w_{1}w_{2}+\sin w_{1}\\&=w_{3}+w_{4}\\&=w_{5}\end{aligned}}$$

We can track the derivatives as we combine and compose functions:

<img src="images/autodiff.svg">
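
A hedged sketch of forward-mode autodifferentiation for the same function, $z = x_1 x_2 + \sin x_1$, using "dual numbers" that carry a value and a derivative together. The class name `Dual`, the helper `sin`, and the evaluation point $(2, 3)$ are assumptions for illustration; this is not how TensorFlow implements it.

```python
import math

class Dual:
    """A value together with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule: derivatives add
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(d):
    # Chain rule for sin
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

def f(x1, x2):
    return x1 * x2 + sin(x1)

# Seed one input with derivative 1 to get the derivative with respect to it.
# Forward mode needs one pass per input.
z_dx1 = f(Dual(2.0, 1.0), Dual(3.0, 0.0))   # dz/dx1 = x2 + cos(x1)
z_dx2 = f(Dual(2.0, 0.0), Dual(3.0, 1.0))   # dz/dx2 = x1
print(z_dx1.value, z_dx1.deriv, z_dx2.deriv)
```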

One additional trick is that we can do this forward or backward. For our models, it requires less computation to do this backward, which is called "reverse autodifferentiation":

<img src="images/reverse.svg">

(Why is there less computation in reverse? This is not true in general. It holds for our machine learning models, where we have many parameters going in and only a few outputs, or just one, coming out. If we were computing a function ${ \Bbb R^n \to \Bbb R^m }$ with ${ m \gg n}$, we would want forward-mode autodiff instead.)
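
For comparison, here is a minimal reverse-mode sketch on $z = x_1 x_2 + \sin x_1$. It records the graph during the forward pass, then walks it once from the output back to the inputs, and that single backward pass yields the derivative with respect to every input. The `Node` class, the `backward` function, and the evaluation point $(2, 3)$ are assumptions for illustration only.

```python
import math

class Node:
    """A value in the computation graph plus links to the values it was built from."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents    # pairs of (parent_node, local_derivative)
        self.grad = 0.0           # will hold d(output)/d(this node)

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

def sin(n):
    return Node(math.sin(n.value), ((n, math.cos(n.value)),))

def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    # Build a topological order so that a node's grad is complete
    # before it is pushed back to its parents.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule, one edge at a time

x1, x2 = Node(2.0), Node(3.0)
z = x1 * x2 + sin(x1)
backward(z)
print(x1.grad, x2.grad)
```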

#### In TensorFlow, this code has been moved to C++ and you can find it here:

* https://github.com/tensorflow/tensorflow/tree/master/tensorflow/cc/gradients
* https://github.com/tensorflow/tensorflow/tree/master/tensorflow/cc/framework

You'll probably never need this detail unless you are adding a new custom Op to TF (in which case you'll probably want to add gradient support).

In [None]:
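# This cell uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session);
# in TensorFlow 2.x the same calls live under tf.compat.v1.
import tensorflow as tf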
import numpy as np

x = tf.placeholder("float")
y = tf.placeholder("float")

m = tf.Variable([1.0], name="m-slope-coefficient")
b = tf.Variable([1.0], name="b-intercept")

y_model = tf.multiply(x, m) + b

error = tf.square(y - y_model)

opt = tf.train.GradientDescentOptimizer(0.01)
train_op = opt.minimize(error)

# Build the gradient ops once, outside the training loop, so that we
# don't add new nodes to the graph on every iteration.
grads_and_vars = opt.compute_gradients(error, [m, b])

model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    for i in range(10):
        x_value = np.random.rand()
        y_value = x_value * 2 + 6
        session.run(train_op, feed_dict={x: x_value, y: y_value})

        # Print d(error)/dm and d(error)/db, evaluated after the update
        for grad, var in grads_and_vars:
            print(grad.eval(feed_dict={x: x_value, y: y_value}), var.name)
        print("------")

    out = session.run([m, b])
    print(out)
    print("Model: {r:.3f}x + {s:.3f}".format(r=out[0][0], s=out[1][0]))

---
> __ASIDE__ The current resurgence of neural networks and deep learning started in the mid-2000s, when Geoffrey Hinton demonstrated an alternative approach to training multiple layers in a deep neural network.

> Geoff Hinton used __Restricted Boltzmann Machines__ to pre-train weights one layer at a time. Each RBM learns weights that produce a distribution for layer n+1 at minimum cost relative to a given distribution in layer n. Hinton trained them with a procedure called contrastive divergence.

> By doing this one layer at a time, reasonable weights could be derived for the network as a whole, or used as a starting point that turns backpropagation into a tractable fine-tuning step.

---

# Training Practicalities (Part I)

## High-capacity topologies (deep vs. wide)
* Goal is to make training time shorter/tractable

In the terminal, start `training-error-plot.py` 

This script is similar to the last one, but has some code on the end to plot the error when we're finished.

---

## Suitable Activation Functions

---

## Increasing Data Set Size
### "Unreasonable Effectiveness of Data"
- Peter Norvig 
(https://www.youtube.com/watch?v=yvDCzhbjYWs)

* More data
* Slightly different data
    * Can we increase our data set in a way that parallels the sort of datasets humans learn to work with?
    * Images: Translate, Rotate, Skew, Stretch, Blur...
    * Sound: Faster, slower, pitch change...
    * Self-Driving Cars and Grand Theft Auto
    * Noise...
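
A hedged sketch of the "slightly different data" idea for images, using only a random translation and additive noise. The function name `augment`, the shift range, the noise level, and the random stand-in image are assumptions for illustration; real pipelines usually also rotate, skew, and blur.

```python
import numpy as np

def augment(image, rng, max_shift=2, noise_std=0.05):
    """Return a randomly shifted, noisy copy of a 2-D image with values in [0, 1]."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(image, shift=(dy, dx), axis=(0, 1))        # translate (wraps around)
    out = out + rng.normal(0.0, noise_std, size=out.shape)   # additive noise
    return np.clip(out, 0.0, 1.0)                            # keep a valid pixel range

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # stand-in for a real grayscale image
batch = np.stack([augment(image, rng) for _ in range(8)])
print(batch.shape)
```

Each call produces a new training example with the same label as the original image.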

---

## Proper Weight Initialization

Lab: in the terminal, change the initialization of weights in your network. Start with 'zero'.

While it's running, take a look at the options in Keras: https://keras.io/initializations/

* Why does the weight initialization matter?
* Where do we start in our activations? Where do we need to "move to"?
* What happens to the magnitude of the gradient as we backpropagate?

How is this connected to the number of weights (and neurons)?

*By now you should have some empirical observations from the 'zero' initialization. Try it again with 'uniform'.*
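
To explore the questions above outside the lab script, the sketch below pushes a random input through a stack of `tanh` layers and records how large the activations are after each layer. The layer width, the depth, the Gaussian weights, and the three scales are assumptions for illustration; they are not the named Keras initializers.

```python
import numpy as np

def activation_scale(init_std, n_layers=10, width=256, seed=0):
    """Std of the activations after each tanh layer, for Gaussian weights of a given std."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=width)                    # random input vector
    scales = []
    for _ in range(n_layers):
        W = rng.normal(0.0, init_std, size=(width, width))
        a = np.tanh(W @ a)
        scales.append(float(a.std()))
    return scales

candidates = [("zero", 0.0), ("tiny", 0.001), ("1/sqrt(n)", 1.0 / np.sqrt(256))]
results = {name: activation_scale(std) for name, std in candidates}
for name, scales in results.items():
    print(name, ["{:.1e}".format(s) for s in scales])
```

With zero or very small weights the signal dies out within a few layers, and the gradients flowing back are scaled by the same weights, so they shrink as well. Scaling the weights by the layer width keeps the activations at a usable magnitude for longer.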

---


## Hardware Support

E.g., fast, low-precision GPU math https://www.theregister.co.uk/2016/09/13/nvidia_p4_p40_gpu_ai/



## Overfitting and Regularizing

### Early Stopping

---

### Weight Decay / L2 

Lagrangian formulation
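
One way to read the Lagrangian view, as a hedged sketch: constraining the size of the weights is traded for a penalty term, and the penalty's gradient shrinks every weight a little on each update. The constraint level $c$ and the multiplier $\lambda \ge 0$ are symbols introduced here for illustration.

$${\begin{aligned}&\min_{w}\ Err(w)\quad \text{subject to}\quad \lVert w\rVert_2^2 \le c\\&\quad\longrightarrow\quad \min_{w}\ Err(w)+\lambda \lVert w\rVert_2^2\\[4pt]w&\leftarrow w-\eta\left(\nabla Err(w)+2\lambda w\right)\\&=(1-2\eta\lambda)\,w-\eta\,\nabla Err(w)\end{aligned}}$$

The factor $(1-2\eta\lambda)$ is where the name "weight decay" comes from.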

---

### L1 / Lasso

Does this exacerbate challenges with backprop?

---

## Dropout and DropConnect

Randomly remove units (Dropout) or individual connections (DropConnect) during training -- this results in "less specialization" of neurons.

We can add this in Keras by adding a `Dropout` object with a fraction of the units to drop.

Add a 50% dropout before the last hidden layer.

There are a variety of interpretations of Dropout/DropConnect, including the idea that it effectively trains an ensemble of many sub-networks, as well as the idea that, at large scale, it just changes the weights. For more detail, see http://www.deeplearningbook.org/contents/regularization.html
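
For intuition about what a dropout layer does to the activations, here is a minimal NumPy sketch of one common variant ("inverted" dropout, which rescales the surviving units during training). The function name `dropout` and the toy activations are assumptions; this is not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of units and rescale the rest (training only)."""
    if not training or rate == 0.0:
        return activations                       # no-op at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob        # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))                      # toy hidden-layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0.0))
print("mean activation:", dropped.mean())
```

A fresh mask is drawn on every call, so each training step sees a different thinned network.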

In your lab experiment, if you look at the resulting error as well as the error plot, you should see that dropout:
* made things worse
* made the training less smooth

<img src="images/dropout.png" width=500>

Why? When might this help instead of hinder?

(We'll use this again in another module soon ... so if you're not convinced yet, you'll get to try it)

---

## Batch normalization

Address skew across training data batches, which gets amplified through deep networks.
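
A hedged sketch of the training-time computation: each feature is normalized using the statistics of the current batch, then rescaled by learned parameters. The names `gamma`, `beta`, and `eps`, and the toy batch, are assumptions for illustration; the running averages that are normally kept for prediction time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply a learned scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)      # roughly zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # shifted, stretched batch
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```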


---

## Adversarial Training

Dataset augmentation with intentionally problematic data samples.

This may help, but it turns out not to solve certain key robustness and security concerns -- we'll revisit those aspects later.
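
The notes above do not say how the problematic samples are generated. One widely used choice is the fast gradient sign method (FGSM), sketched here for a logistic-regression model; the model, its random weights, and `epsilon` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(x, y, w, b):
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_example(x, y, w, b, epsilon):
    """Nudge each input feature by epsilon in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w                 # d(log loss)/dx for logistic regression
    return x + epsilon * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.1           # stand-in for a trained model
x, y = rng.normal(size=5), 1.0           # one labeled sample
x_adv = fgsm_example(x, y, w, b, epsilon=0.1)
print(log_loss(x, y, w, b), log_loss(x_adv, y, w, b))
```

Adversarial training would then add `(x_adv, y)` to the training set alongside the original sample.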