# Optimizing Neural Networks

---

## Overfitting:
- Use regularization techniques to reduce overfitting
- Increase *training data* to reduce overfitting
- Early Stopping: keep track of accuracy on the *validation data* as the network trains, and stop training once it no longer improves
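
As a minimal sketch of early stopping, the loop below remembers the best validation accuracy seen so far and stops once it has gone a fixed number of epochs without improving. The names `train_one_epoch`, `validation_accuracy` and `patience` are hypothetical and stand in for whatever the training code actually provides.

```python
def train_with_early_stopping(train_one_epoch, validation_accuracy,
                              max_epochs=100, patience=10):
    """Train until validation accuracy stops improving.

    train_one_epoch: callable that runs one epoch of training (hypothetical).
    validation_accuracy: callable returning accuracy on the validation data.
    patience: epochs to wait without improvement before stopping (assumed).
    """
    best_accuracy = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_one_epoch()
        accuracy = validation_accuracy()
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_accuracy
```

A larger `patience` makes the rule more lenient, which can help when the accuracy plateaus for a while before improving again.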

---

## Regularization:
- Adds an extra term to the Cost Function as a weight decay to make the network learn small weights:

$$ C = C_0 + \frac{\lambda}{2n} \sum_w w^2 $$
<div style="text-align: center; font-size: 10px"> where $C_0$ is the original Cost Function </div>

### L2 regularization:
- The sum of the squares of all the weights, scaled by a factor $\frac{\lambda}{2n}$, where $\lambda>0$ is the regularization parameter
- A small $\lambda$ favors minimizing the original *Cost function* $C_0$, while a large $\lambda$ favors small weights
- The partial derivatives of the *Cost function* (calculated using backpropagation) become:

$$
\begin{aligned}
\frac{\partial C}{\partial w} &= \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \\
\frac{\partial C}{\partial b} &= \frac{\partial C_0}{\partial b}
\end{aligned}
$$

- The update rule for the bias stays unchanged, while each weight is first rescaled by the factor $1 - \frac{\eta \lambda}{n}$, which is called **weight decay**:

$$
\begin{aligned}
w &\rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w \\
&= \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}
\end{aligned}
$$

- Increase the *regularization parameter* $\lambda$ by the same factor as the increase in the training data to keep the **weight decay** the same
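
The update rules above can be sketched for a single weight and bias as follows. The function names are hypothetical, and `grad_c0` stands for the gradient of the original cost $C_0$ that backpropagation would supply.

```python
def l2_update(w, grad_c0, eta, lmbda, n):
    """One gradient-descent step on a weight with L2 regularization.

    Implements w -> (1 - eta * lmbda / n) * w - eta * grad_c0.
    """
    decay = 1.0 - eta * lmbda / n  # weight-decay factor
    return decay * w - eta * grad_c0


def bias_update(b, grad_c0, eta):
    """The bias update is not affected by the regularization term."""
    return b - eta * grad_c0
```

With `grad_c0 = 0` the weight simply shrinks by the decay factor on every step, which is where the name weight decay comes from.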

### L1 regularization:
- The sum of the absolute values of the weights:

$$ C = C_0 + \frac{\lambda}{n} \sum_w |w| $$

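- Penalizes large weights by shrinking them toward zero by a constant amount per update, whereas L2 shrinks each weight by an amount proportional to $w$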
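
Differentiating the L1 cost gives $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w)$, so the update subtracts a fixed amount in the direction of the sign of $w$. A minimal sketch of that step, with hypothetical function names and the common convention $\mathrm{sgn}(0) = 0$ assumed:

```python
def sign(w):
    """Return 1, -1 or 0 according to the sign of w (sgn(0) = 0 assumed)."""
    return (w > 0) - (w < 0)


def l1_update(w, grad_c0, eta, lmbda, n):
    """One gradient-descent step on a weight with L1 regularization.

    Implements w -> w - (eta * lmbda / n) * sgn(w) - eta * grad_c0.
    """
    return w - (eta * lmbda / n) * sign(w) - eta * grad_c0
```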

### Dropout:
- Instead of modifying the *Cost function*, **Dropout** modifies the Network
- Removes some nodes from the Hidden Layer while training
- When testing, all nodes are active, so the weights should be scaled by multiplying by $(1-p)$, where $p$ is the dropout rate, to account for the dropout used in training
- Similar in principle to using ensembles, where the effects of different networks are averaged
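
A minimal sketch of the two phases, using plain Python lists for the hidden activations and weights. The function names are hypothetical. Note that many libraries use a variant ("inverted dropout") that rescales during training instead; the sketch follows the test-time scaling described above.

```python
import random


def dropout_train(activations, p, rng=random):
    """Zero each hidden activation with probability p (training only)."""
    return [0.0 if rng.random() < p else a for a in activations]


def scale_weights_for_test(weights, p):
    """At test time all nodes are active, so scale the weights by (1 - p)."""
    return [w * (1.0 - p) for w in weights]
```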

### Expanding Training Data:
- Artificially expand the training data by manipulating the data: rotate images, add background noise to sound data...
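
The two manipulations mentioned above can be sketched on toy data: an image stored as a list of rows and a sound clip stored as a list of samples. All function names are hypothetical, and a 90-degree rotation is used only because it is easy to write without an image library; small rotations are more typical in practice.

```python
import random


def rotate_90(image):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def add_noise(samples, scale, rng=random):
    """Add zero-mean Gaussian noise to a 1-D signal such as audio."""
    return [s + rng.gauss(0.0, scale) for s in samples]


def expand_images(dataset):
    """Return the dataset plus a rotated copy of every (image, label) pair."""
    return dataset + [(rotate_90(image), label) for image, label in dataset]
```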

---

## Hyper-parameters:
- Start with a simpler model and build up later
- Use *Grid Search* to search systematically through a grid in hyper-parameter space
- Monitor the validation accuracy more often as the Network is learning
- When choosing *hyper-parameters*, use *early stopping* loosely: terminate **only** if the best classification accuracy has not improved for some time, so that the effect of the chosen hyper-parameter can be seen
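
A grid search over hyper-parameters can be sketched in a few lines. Here `evaluate` is a hypothetical callable that trains a network with the given hyper-parameters and returns its validation accuracy; the grid values are up to the caller.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return the best one.

    evaluate: callable taking keyword hyper-parameters and returning
              validation accuracy (hypothetical, supplied by the caller).
    grid: dict mapping each hyper-parameter name to a list of candidates.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, `grid_search(evaluate, {"eta": [0.01, 0.1, 1.0], "lmbda": [0.1, 1.0, 10.0]})` would call `evaluate` nine times, once per combination.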

### Learning Rate $\eta$:
- *Stochastic gradient descent*: step gradually down into a valley of the cost function without overshooting the minimum
- Start with a value of $\eta$ that decreases the *Cost function* in the first few epochs, then increase it by factors of 10 until the *Cost* starts to oscillate or increase: $\eta = 0.01 \rightarrow 0.1 \rightarrow 1.0...$
- Decrease $\eta$ if the initial value causes the *Cost function* to increase: $\eta = 1.0 \rightarrow 0.1 \rightarrow 0.01...$
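
The search for a workable $\eta$ can be illustrated on a toy problem. The sketch below assumes the cost $C(w) = w^2$ (gradient $2w$) in place of a real network, and the function names are hypothetical. It tries increasing values of $\eta$ and keeps the last one for which the cost still went down.

```python
def cost_after_training(eta, steps=20, w0=1.0):
    """Run plain gradient descent on the toy cost C(w) = w**2."""
    w = w0
    for _ in range(steps):
        w -= eta * 2.0 * w  # the gradient of w**2 is 2w
    return w * w


def find_learning_rate(candidates=(0.01, 0.1, 1.0, 10.0), w0=1.0):
    """Return the largest candidate (tried in order) that lowers the cost."""
    initial_cost = w0 * w0
    last_good = None
    for eta in candidates:
        if cost_after_training(eta, w0=w0) >= initial_cost:
            break  # the cost oscillates or increases, so stop here
        last_good = eta
    return last_good
```

On this toy cost, $\eta = 1.0$ makes $w$ flip sign on every step without the cost ever decreasing, so the search settles on the previous candidate.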

### Learning Rate Schedule:
- It is advantageous to vary the *learning rate*: start out large to change weights quickly, and decrease it as the Network learns in order to make finer adjustments
- Decrease the *learning rate* when the validation accuracy starts to get worse: decrease by a factor of 10 each time, until the *learning rate* is a certain factor (e.g. 1,000) lower than its initial value
- Start out with a constant *learning rate* and once the Network is optimized, select a *learning rate schedule* to optimize the Network even further
- **Adagrad** provides adaptive learning rate by incorporating knowledge of the geometry of past observations
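
One way to sketch the step-wise schedule described above: divide the learning rate by 10 whenever the validation accuracy fails to beat its best value so far, and stop once the total reduction would exceed 1,000. The function name, and the choice of "fails to beat the best so far" as the trigger, are assumptions made for illustration.

```python
def schedule_learning_rate(validation_accuracies, eta0, factor=10.0,
                           max_reduction=1000.0):
    """Return the learning rate used in each epoch.

    validation_accuracies: accuracy measured after each epoch.
    The rate is divided by `factor` whenever accuracy does not improve,
    and training stops once the reduction would exceed `max_reduction`.
    """
    reduction = 1.0
    best = float("-inf")
    etas = []
    for accuracy in validation_accuracies:
        etas.append(eta0 / reduction)  # rate used during this epoch
        if accuracy > best:
            best = accuracy
        else:
            reduction *= factor
            if reduction > max_reduction:
                break  # the rate is already 1,000 times lower: stop
    return etas
```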

### Regularization Parameter $\lambda$:
- Start with $\lambda = 0$, which eliminates the regularization, then use the validation data to select a good value for $\lambda$, increasing or decreasing it by a factor of 10
- Once $\lambda$ is picked, rescale it by the same factor whenever the size of the *Training Data* changes
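
The reason for rescaling $\lambda$ is that the weight-decay factor $1 - \frac{\eta \lambda}{n}$ depends on the ratio $\lambda / n$. A small sketch, with hypothetical function names, that keeps this ratio fixed when the training set grows:

```python
def rescale_lambda(lmbda, old_n, new_n):
    """Scale lambda by the same factor as the training set size."""
    return lmbda * new_n / old_n


def decay_factor(eta, lmbda, n):
    """Weight-decay factor 1 - eta * lambda / n from the L2 update rule."""
    return 1.0 - eta * lmbda / n
```

For instance, going from 1,000 to 50,000 training examples with $\lambda = 5$ gives `rescale_lambda(5.0, 1000, 50000)`, i.e. $\lambda = 250$, and the decay factor is unchanged.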

### Adam Optimization - Adaptive Moment Estimation:
- **Adam** combines the advantages of two other extensions of stochastic gradient descent:
    - Adaptive Gradient Algorithm (AdaGrad), which maintains a per-parameter learning rate
    - Root Mean Square Propagation (RMSProp), which also maintains per-parameter learning rates
- It does so by keeping running averages of both the gradients and the squared gradients for each parameter
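
A minimal sketch of a single Adam update for one parameter, assuming the commonly used default values for `eta`, `beta1`, `beta2` and `eps` (they are not given in these notes). `m` and `v` are the running averages of the gradient and of the squared gradient, and `t` is the step count starting at 1.

```python
import math


def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (w, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad         # average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

A usage sketch on the toy cost $C(w) = w^2$: start with `w, m, v = 1.0, 0.0, 0.0` and call `w, m, v = adam_step(w, 2.0 * w, m, v, t)` for `t = 1, 2, 3, ...`.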

---