# L2 Deep Neural Networks


### Linear Model Complexity
Calculate the number of parameters in the linear model in the previous lesson (each input is a 28x28 pixel image and there are 10 classes: letters from A to J.)
* Answer: 28x28x10 (weights W) + 1x10 (biases b) = 7850
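A quick check of the count above, as a minimal sketch. The function name `linear_param_count` is made up for illustration; it just evaluates (N+1)\*K for N inputs and K outputs.

```python
def linear_param_count(n_inputs: int, n_outputs: int) -> int:
    """Weights (n_inputs * n_outputs) plus one bias per output."""
    return (n_inputs + 1) * n_outputs

# 28x28 pixel inputs, 10 classes (letters A to J)
print(linear_param_count(28 * 28, 10))  # 7850
```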

Limitations of linear models:

* Small number of parameters: generally (N+1)\*K, where N = number of inputs and K = number of outputs.
* Interactions between inputs are limited because the model is linear.

But they are
* efficient (cheap and fast to run), and
* stable: you can show mathematically that small changes in the input can never yield big changes in the output (|W| is bounded). The derivatives are also constant.

Want to keep parameters in linear functions but want entire function to be non-linear. Cannot just multiply $W_1W_2W_3$ because that's equivalent to one linear function. So we have to introduce **non-linearities**.
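A small numerical check of the claim that stacked matrices collapse into one linear function. This is a sketch with arbitrary, made-up shapes and random values; only the equality matters.

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary illustrative shapes: 4 inputs -> 5 -> 3 -> 2 outputs
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 3))
W3 = rng.normal(size=(3, 2))
x = rng.normal(size=(1, 4))

stacked = x @ W1 @ W2 @ W3   # three "layers" with no non-linearity
W = W1 @ W2 @ W3             # the single equivalent matrix
single = x @ W

print(np.allclose(stacked, single))  # True
```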

### Rectified Linear Units (RELU)
Simplest non-linear function: y = 0 when x < 0, y = x when x >= 0, i.e. $y = \max(0, x)$.

* How to use this (refer to our linear classifier process): Insert a RELU in the middle. You now have two matrices: one from the input to the RELU and one from the RELU to the output.
* New parameter **H**: the number of RELUs you insert.
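The two-matrix setup above could look roughly like this in NumPy. This is a sketch under assumptions: the names (`init_params`, `forward`), the value H = 1024, the bias terms and the small random initialisation are illustrative choices, not taken from the lesson.

```python
import numpy as np

def relu(x):
    # y = 0 when x < 0, y = x when x >= 0
    return np.maximum(x, 0.0)

def init_params(n_inputs, n_hidden, n_outputs, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.01, size=(n_inputs, n_hidden)),   # input -> RELU
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(scale=0.01, size=(n_hidden, n_outputs)),  # RELU -> output
        "b2": np.zeros(n_outputs),
    }

def forward(params, X):
    hidden = relu(X @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]  # logits, one row per example

H = 1024  # assumed number of RELUs in the hidden layer
params = init_params(28 * 28, H, 10)
logits = forward(params, np.zeros((32, 28 * 28)))
print(logits.shape)  # (32, 10)
```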

Build network by stacking up simple operations to make the maths simple (**Chain Rule**).

Can write the chain rule in a way that is computationally efficient.

### How to compute derivatives: Back-Propagation 

Stochastic Gradient Descent: 
1. For each batch of data run forward prop and then back prop.
2. That will give you gradients for each of the weights in your model.
3. Apply the gradients, scaled by the learning rate, to the original weights and update them.
4. Repeat 1-3 many times to optimise your models.
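The four steps above, sketched for a one-hidden-layer RELU network with a softmax cross-entropy loss. Everything specific here is an assumption for illustration: the toy data, the layer sizes, the learning rate, and the helper name `forward_backward`.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward_backward(params, X, y):
    """Return mean cross-entropy loss and gradients for one batch."""
    W1, b1, W2, b2 = params
    n = X.shape[0]
    # Forward prop
    z1 = X @ W1 + b1
    h = np.maximum(z1, 0.0)
    probs = softmax(h @ W2 + b2)
    loss = -np.log(probs[np.arange(n), y] + 1e-12).mean()
    # Back prop: chain rule applied from the output back to the input
    dlogits = probs.copy()
    dlogits[np.arange(n), y] -= 1.0
    dlogits /= n
    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ W2.T
    dz1 = dh * (z1 > 0)  # RELU passes gradient only where its input was positive
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]

# Toy data (made up): the class is 1 when the first feature is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
y = (X[:, 0] > 0).astype(int)

params = [rng.normal(scale=0.1, size=(4, 16)), np.zeros(16),
          rng.normal(scale=0.1, size=(16, 2)), np.zeros(2)]
learning_rate, batch_size = 0.2, 32
losses = []
for step in range(300):                                              # step 4: repeat
    idx = rng.choice(len(X), size=batch_size, replace=False)         # pick a batch
    loss, grads = forward_backward(params, X[idx], y[idx])           # steps 1-2
    params = [p - learning_rate * g for p, g in zip(params, grads)]  # step 3
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```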

Note: Each block of the backprop typically takes twice the memory and computation of the forward prop blocks. -> Important for sizing your model and fitting it in memory.

### Training a Deep Neural Network
Increasing H is not efficient: you need to make it very big and then it gets hard to train.

Instead, you can add more layers. A deep model is often preferred for two reasons:

1. Parameter efficiency: You can generally get better performance with fewer parameters by going deeper rather than wider.
2. Many natural phenomena have a hierarchical structure. (E.g. Lines and edges -> Geometric shapes -> Objects in image recognition. Model matches abstractions you see in your data.)

**Why did deep networks only become popular recently?** 
Deep models only really shine if you have large amounts of data to train them with. 

## Regularisation

Analogy: Skinny jeans are hard to get into, so people usually wear jeans that are a bit too big. Similarly, networks that are just the right size for your data are hard to train. (**Why?**) So in practice we train networks that are way too big for our data and then try our best to prevent them from overfitting.

### Ways to prevent overfitting
1. Early termination: Look at performance on the validation set and stop training once it stops improving.
2. Regularisation: Putting artificial constraints on your network that implicitly reduce the number of free parameters without making it more difficult to optimise. // Stretch pants.
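Early termination (item 1 above) can be sketched as a loop that tracks the best validation score. The function and its `patience` parameter are hypothetical: the notes only say to stop once performance stops improving, and waiting a few epochs before giving up is one common way to decide that.

```python
def train_with_early_stopping(train_one_epoch, validation_score,
                              max_epochs=100, patience=5):
    """Stop once the validation score has not improved for `patience` epochs."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_one_epoch()           # caller-supplied: one pass over the training data
        score = validation_score()  # caller-supplied: e.g. validation accuracy
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score

# Usage with made-up validation scores standing in for a real model
scores = iter([0.50, 0.60, 0.70, 0.69, 0.68, 0.67, 0.90])
best_epoch, best_score = train_with_early_stopping(
    lambda: None, lambda: next(scores), patience=2)
print(best_epoch, best_score)  # 2 0.7
```

In practice you would also save the weights at the best epoch and restore them after stopping.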

Two methods of regularisation:
1. L2 Regularisation
2. Dropout

### L2 Regularisation
Add a term to the loss that penalises large weights, typically the squared L2 norm of your weights multiplied by a small constant. -> Additional hyperparameter ($\beta$) to tune.
* Squared L2 norm: Sum of the squares of the elements in a vector.

$$ L' = L + \beta\frac{1}{2}||W||^2_2 $$
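The penalty term in the formula above, sketched in NumPy. The helper names, the example matrix, `data_loss` and `beta = 0.01` are all made up for illustration. The factor of 1/2 makes the gradient of the penalty with respect to W simply $\beta W$.

```python
import numpy as np

def l2_penalty(weights, beta):
    """beta * 1/2 * sum of squared entries, over a list of weight arrays."""
    return beta * 0.5 * sum(np.sum(W ** 2) for W in weights)

def l2_penalty_grads(weights, beta):
    """Gradient of the penalty with respect to each weight array: beta * W."""
    return [beta * W for W in weights]

W = np.array([[1.0, -2.0],
              [3.0, 0.0]])
data_loss = 0.25  # hypothetical value of the unregularised loss L
beta = 0.01       # the extra hyperparameter to tune

# Sum of squares is 1 + 4 + 9 = 14, so the penalty is 0.01 * 0.5 * 14 = 0.07
total_loss = data_loss + l2_penalty([W], beta)
print(round(total_loss, 6))  # 0.32
```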

Pros:
* Simple because you just add it to the loss. You don't need to change the structure of your network. 

### Dropout
A relatively recent regularisation technique.
* Imagine if you have one layer connected to another layer. The values that go from one layer to the next are often called **activations**.
* Take activations and for every example you train your network on, set half of them to zero. I.e. Randomly destroy half the data that's flowing through your network. Do this over and over. 
* -> Network cannot rely on any given activation to be present because it may get destroyed at any moment.
* -> This forces network to learn redundant representations to ensure that at least some information remains. -> Seems inefficient but this makes things more robust and prevents overfitting. It also makes your network act as though it's taking a consensus over an ensemble of networks.

Op: If dropout doesn't work for you, you should probably be using a bigger network.

### Evaluating a Dropout-Trained Network
When you evaluate the network that's been trained with dropout, you don't want randomness - you want something deterministic. You want the consensus. You get this by averaging activations. 

How do you get this? 
* During training, don't only zero out the dropped activations: also scale the remaining activations by a factor of two.
* When evaluating, simply remove the dropout and scaling operations. The result is an average of the activations that is properly scaled.
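The train/evaluate behaviour described above, as a minimal sketch. The function name `dropout` and its `keep_prob` and `training` parameters are assumptions for illustration; with `keep_prob = 0.5` the surviving activations are scaled by 1 / 0.5 = 2, matching the factor of two in the notes.

```python
import numpy as np

def dropout(activations, keep_prob=0.5, training=True, rng=None):
    if not training:
        # Evaluation: no dropout, no scaling, fully deterministic
        return activations
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) < keep_prob  # keep each value with prob keep_prob
    return activations * mask / keep_prob             # zero the rest, scale survivors

rng = np.random.default_rng(0)
a = np.ones((1000, 100))  # dummy activations, all equal to 1

train_out = dropout(a, training=True, rng=rng)
eval_out = dropout(a, training=False)

print(np.unique(train_out))  # [0. 2.]
print(train_out.mean())      # close to 1.0, the same scale as eval_out
```

Because the scaling happens during training, the expected value of each activation is the same in both modes, so nothing needs adjusting at evaluation time.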