# Tuning key gradient descent parameters

- Lowering or increasing the learning rate. A learning rate that is too high may lead to updates that vastly overshoot a proper fit, and a learning rate that is too low may make training so slow that it appears to stall.

- Increasing the batch size. A batch with more samples leads to gradients that are more informative and less noisy (lower variance).
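As a minimal sketch of both effects, the snippet below runs plain gradient descent on the toy function f(w) = w^2 with three learning rates, and separately measures how the spread of mini-batch gradient estimates shrinks as the batch grows. The function names, the toy objective, and the Gaussian noise model for per-sample gradients are all assumptions made for illustration.

```python
import random
import statistics


def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize the toy function f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w -= learning_rate * grad
    return w


def batch_gradient_variance(batch_size, n_batches=500, seed=0):
    """Variance of mini-batch gradient estimates for a given batch size.

    Assumption: each per-sample gradient is the true gradient (1.0)
    plus Gaussian noise with standard deviation 1.0.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_batches):
        batch = [rng.gauss(1.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.pvariance(estimates)


too_high = gradient_descent(learning_rate=1.1)    # every step overshoots, |w| grows
too_low = gradient_descent(learning_rate=0.001)   # w barely moves away from 5.0
reasonable = gradient_descent(learning_rate=0.3)  # w ends very close to 0

small_batch_var = batch_gradient_variance(batch_size=4)
large_batch_var = batch_gradient_variance(batch_size=64)  # noticeably smaller
```

On this toy problem each update multiplies w by (1 - 2 * learning_rate), so a rate above 1.0 makes the iterate grow in magnitude instead of shrinking.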

## Leveraging better architecture priors

Using a model that makes the right assumptions about the problem is essential to achieving generalization, so you should leverage the right architecture priors.

## Increasing model capacity

Stalling means the loss stops decreasing: training hits a plateau, which indicates that learning has saturated and the model can't improve further. This could be due to:

- learning rate too low
- poor init or vanishing gradients
- inadequate model capacity or bad data

Overfitting happens when a model memorizes the training data instead of learning generalizable patterns. It is sometimes allowed or even encouraged as a test of capacity (whether the model has enough capacity to represent the training data): if a model can't overfit, it is underpowered. If a model can't overfit even a small dataset, something is wrong, maybe with the architecture, the data pipeline, or the loss function. Pretraining: in transfer learning, we might overfit on a source task and then fine-tune on the target task, since overfitting on the source task can still help the model learn useful representations.
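The "can it overfit a small dataset" check can be sketched with a deliberately tiny model. The example below is an illustration under assumptions: `try_to_overfit` is a hypothetical helper, the model is a one-variable linear regression trained with plain gradient descent, and the four-point datasets are made up.

```python
def try_to_overfit(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent and return the final training MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n


xs = [0.0, 1.0, 2.0, 3.0]

# A linear target: the linear model has enough capacity to memorize it.
loss_linear = try_to_overfit(xs, [1.0, 3.0, 5.0, 7.0])

# A quadratic target: the same model cannot drive the training loss to zero,
# which is the signal that it is underpowered for this data.
loss_quadratic = try_to_overfit(xs, [0.0, 1.0, 4.0, 9.0])
```

A training loss that refuses to approach zero on a handful of samples points at the model, the data pipeline, or the loss rather than at a lack of data.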

# Improving generalization

## Dataset curation

Sometimes the problem is not with the model but with the data. Spending more time on data almost always yields a much greater return on investment than spending the same time on developing a better model.

An important way to improve the generalization potential of your data is feature engineering.


# Feature Engineering

Feature engineering means making a problem easier by expressing it in a simpler way: making the latent manifold smoother, simpler, and better organized.
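One way to picture "expressing the problem in a simpler way" is reading the time from a clock face. This scenario is an assumption added for illustration, not an example from these notes: instead of feeding a model raw pixels, the hand positions are re-expressed as angles, after which the time follows from simple arithmetic. The helper names below are hypothetical.

```python
import math


def hand_angle(x, y):
    """Angle of a clock hand in degrees, measured clockwise from 12 o'clock.

    (x, y) is the tip of the hand relative to the clock center,
    with y pointing up.
    """
    return math.degrees(math.atan2(x, y)) % 360.0


def read_time(hour_tip, minute_tip):
    """Turn two hand-tip coordinates into (hour, minute)."""
    hour = int(hand_angle(*hour_tip) // 30) % 12            # 30 degrees per hour
    minute = int(round(hand_angle(*minute_tip) / 6)) % 60   # 6 degrees per minute
    return hour, minute


# Hands positioned at 4:30: hour hand halfway between 4 and 5,
# minute hand pointing straight down.
time_read = read_time(hour_tip=(1.0, -1.0), minute_tip=(0.0, -1.0))
```

With angles as features, no learning is needed at all for this toy task, which is the extreme case of a well-organized representation.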

## Using early stopping

In DL, we always use models that are overparameterized. This is not an issue because we never fully fit a DL model: such a fit wouldn't generalize at all, so we always interrupt training long before reaching the minimum possible training loss. We can save the model at the end of each epoch and, once the best epoch has been found, reuse the closest saved model we have. In Keras, it is typical to do this with an EarlyStopping callback, which interrupts training as soon as validation metrics have stopped improving while remembering the best known model state.
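The stopping rule itself can be sketched in a few lines of plain Python. This is not the Keras implementation: `find_stopping_point` and its `patience` parameter are hypothetical names, and the validation losses are made-up numbers. The sketch only shows the logic of tracking the best epoch and giving up after a fixed number of epochs without improvement.

```python
def find_stopping_point(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    Training is considered stopped once `patience` epochs in a row
    fail to improve on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch  # this is where a checkpoint would be saved
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Validation loss improves for three epochs, then starts rising.
val_losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.80]
best_epoch, stopped_epoch = find_stopping_point(val_losses, patience=2)
```

Here the best state is the one saved at epoch index 2, and training stops at epoch index 4 without ever running the last epoch.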

## Regularizing the model

Regularization techniques are a set of best practices that actively impede the model's ability to fit perfectly to the training data, with the goal of making the model perform better during validation. This is called "regularizing" the model because it tends to make the model simpler, more regular, with a smoother curve, more generic, less specific to the training set, and better able to generalize by more closely approximating the latent manifold of the data.

## Reducing the network's size

A model that is too small won't overfit, so the simplest way to mitigate overfitting is to reduce the size of the model. If the model has limited memorization resources, it won't be able to simply memorize its training data; thus, in order to minimize its loss, it will have to resort to learning compressed representations that have predictive power regarding the targets. There is a compromise to be found between too much capacity and not enough capacity.
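Parameter count is one rough proxy for these "memorization resources". The sketch below counts the trainable parameters of a stack of fully connected layers; `count_dense_params`, the 10,000-dimensional input, and the layer widths are all hypothetical choices for illustration.

```python
def count_dense_params(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total


# Same depth, different widths: the narrower model has about a quarter
# of the parameters available to memorize the training set.
large_model = count_dense_params(10_000, [16, 16, 1])
small_model = count_dense_params(10_000, [4, 4, 1])
```

Comparing validation loss curves for a few such sizes is one way to look for the compromise between too much and too little capacity.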

## Adding weight regularization

Occam's razor: given two explanations for something, the explanation most likely to be correct is the simplest one, the one that makes fewer assumptions.

A simpler model, in this context, is a model where the distribution of parameter values has less entropy. A common way to mitigate overfitting is to put constraints on the complexity of a model by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it is done by adding to the loss function of the model a cost associated with having large weights. The cost comes in two flavors:

- L1 regularization: the cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).

- L2 regularization: the cost added is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights). It is also called weight decay in the context of NNs.

In Keras, weight regularization is added by passing weight regularizer instances to layers.
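To make the two penalties concrete, here is a library-free sketch of the extra cost each one adds to the loss. It is not the Keras API: the function names, the coefficient values, and the example weights are assumptions chosen for illustration.

```python
def l1_penalty(weights, coeff):
    """Cost proportional to the sum of absolute weight values."""
    return coeff * sum(abs(w) for w in weights)


def l2_penalty(weights, coeff):
    """Cost proportional to the sum of squared weight values."""
    return coeff * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Total loss = loss on the data + penalties for large weights."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)


weights = [0.5, -2.0, 1.5]

loss_plain = regularized_loss(1.0, weights)            # no penalty added
loss_l1 = regularized_loss(1.0, weights, l1=0.01)      # adds 0.01 * 4.0
loss_l2 = regularized_loss(1.0, weights, l2=0.01)      # adds 0.01 * 6.5
```

Because the L2 term squares each weight, the single large weight (-2.0) dominates its penalty, whereas the L1 term charges all weights in proportion to their magnitude.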

## Adding dropout

Weight regularization is more typically used for smaller DL models. Large DL models tend to be so overparameterized that imposing constraints on weight values doesn't have much impact on model capacity and generalization. In these cases, a different technique is preferred:

Dropout is one of the most effective and most commonly used regularization techniques for NNs, developed by Geoff Hinton and his students at the University of Toronto.

Dropout applied to a layer consists of randomly dropping out (setting to zero) a number of output features of the layer during training.
The dropout rate is the fraction of features that are zeroed out; it is usually set between 0.2 and 0.5.

At test time, no units are dropped out. Instead, the layer's output values are scaled down by a factor of 1 - rate (the fraction of units kept, which equals the dropout rate only when the rate is 0.5), to balance for the fact that more units are active than at training time.

The core idea is that introducing noise in the output values of a layer can break up happenstance patterns that aren't significant (Hinton refers to them as conspiracies), which the model will start memorizing if no noise is present. It can be introduced in a model via the Dropout layer, applied to the output of the layer right before it.
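The training-time and test-time behavior of dropout can be sketched without any framework. The functions below are hypothetical illustrations, not the Keras Dropout layer; they assume test-time outputs are scaled by the keep probability, 1 - rate, so that the expected output matches what the next layer saw during training.

```python
import random


def dropout_train(outputs, rate, rng):
    """Training time: zero each output independently with probability `rate`."""
    return [0.0 if rng.random() < rate else value for value in outputs]


def dropout_test(outputs, rate):
    """Test time: keep every output, scaled by the keep probability 1 - rate."""
    return [value * (1.0 - rate) for value in outputs]


rng = random.Random(0)
layer_output = [1.0] * 10_000
rate = 0.3

train_out = dropout_train(layer_output, rate, rng)
test_out = dropout_test(layer_output, rate)

dropped_fraction = train_out.count(0.0) / len(train_out)  # close to `rate`
train_mean = sum(train_out) / len(train_out)               # close to 1 - rate
test_mean = sum(test_out) / len(test_out)                  # 1 - rate
```

Many implementations instead scale the kept outputs up by 1 / (1 - rate) during training and leave test-time outputs untouched; the two variants agree in expectation.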

# RECAP

- get more training data, or better training data
- develop better features
- reduce the capacity of the model
- add weight regularization (for smaller models)
- add dropout
