<a href="https://colab.research.google.com/github/isa-ulisboa/greends-pml/blob/main/notebooks/T8_techniques_to_improve_DP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Techniques to improve deep learning

**Regularization through the loss function** is a technique used in machine learning to reduce overfitting by adding a penalty term to the loss function, typically a norm of the model weights (for example, L1 or L2). This penalty encourages the model to learn a simpler representation of the data and reduces its capacity to memorize the training data.
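
As a rough illustration of adding a penalty term to a loss, the sketch below assumes a mean squared error data term and an L2 penalty on the weights. The function names, the toy numbers, and the penalty strength `lam` are illustrative assumptions, not part of any specific library.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights (lam is a hypothetical strength)."""
    return lam * sum(w ** 2 for w in weights)


def regularized_loss(predictions, targets, weights, lam=0.01):
    """Data-fit term plus the penalty term."""
    return mse(predictions, targets) + l2_penalty(weights, lam)


# Toy values, chosen only for illustration.
weights = [0.5, -2.0, 3.0]
predictions = [1.0, 2.0, 3.0]
targets = [1.0, 2.5, 2.0]

plain = mse(predictions, targets)
penalized = regularized_loss(predictions, targets, weights, lam=0.01)
print(f"loss without penalty: {plain:.4f}")
print(f"loss with L2 penalty: {penalized:.4f}")
```

Because the penalty grows with the squared magnitude of the weights, minimizing the combined loss pushes the optimizer toward smaller weights unless larger ones clearly improve the fit.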

**Dropout** is also a regularization technique used in deep learning to prevent overfitting. It works by randomly “dropping out” or deactivating some of the neurons in a neural network during training. This means that during each forward pass, some of the neurons are temporarily removed from the network, along with all their incoming and outgoing connections.

The idea behind dropout is to introduce randomness and prevent the model from relying too heavily on any single neuron or feature. By randomly dropping out neurons during training, the model is forced to learn a more robust representation of the data that is less sensitive to small changes in the input.

Dropout is typically applied to the hidden layers of a neural network and can be controlled by a hyperparameter called the dropout rate, which specifies the probability that any given neuron will be dropped out during training. A common value for the dropout rate is 0.5, meaning that on average, half of the neurons in a given layer will be dropped out during each forward pass.

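During testing or inference, dropout is not applied and all neurons are active. To keep the expected activations consistent between training and inference, the original formulation multiplies the outputs at inference by the keep probability (one minus the dropout rate). Most modern implementations instead use “inverted dropout”, which scales the surviving activations up by 1 / (1 - dropout rate) during training, so that no rescaling is needed at inference.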
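
The following is a minimal sketch of the “inverted” dropout variant, in which surviving activations are rescaled during training so that inference needs no adjustment. It works on a plain Python list rather than a tensor, and the `dropout` function and its parameters are hypothetical names chosen for this example.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations.

    During training, each value is zeroed with probability `rate` and the
    survivors are divided by the keep probability. At inference the input
    is returned unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is reproducible
activations = [1.0] * 10000

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(f"fraction dropped in training: {dropped_fraction:.3f}")
print(f"mean activation in training:  {train_mean:.3f}")
print(f"mean activation at inference: {sum(eval_out) / len(eval_out):.3f}")
```

The mean activation stays close to the same value in both modes, which is the purpose of the rescaling.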

**Self-regularized activation functions** can help improve the generalization performance of a neural network by introducing an implicit form of regularization. This can be achieved through various mechanisms, such as controlling the distribution of the activations or the gradients. For example, the authors of the *Mish activation function* report a self-regularizing effect that they attribute to its smooth, non-monotonic shape, which may ease gradient flow and improve the training dynamics of deep neural networks.

The *Mish activation function* (https://arxiv.org/abs/1908.08681) is an alternative to *ReLU*. It is a smooth, continuous, self-regularized, non-monotonic activation function, mathematically defined as

$$f(x) = x \, \tanh\left(\ln\left(1 + e^x\right)\right).$$
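
The formula can be evaluated directly. The sketch below is a scalar, pure-Python version for illustration only; the helper names `softplus`, `mish`, and `relu` are choices made for this example, and the softplus term is written in a numerically stable form to avoid overflow for large inputs.

```python
import math


def softplus(x):
    """ln(1 + e^x), rearranged so that exp never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(ln(1 + e^x))."""
    return x * math.tanh(softplus(x))


def relu(x):
    """ReLU, shown for comparison."""
    return max(x, 0.0)


for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"x = {x:5.1f}   relu = {relu(x):7.4f}   mish = {mish(x):7.4f}")
```

Unlike ReLU, Mish lets small negative values through: it dips below zero for moderately negative inputs and returns toward zero as the input becomes more negative, which is the non-monotonic behaviour mentioned above.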

**Momentum** is a technique used in deep learning to accelerate the training of neural networks. It modifies gradient descent by adding a fraction of the previous weight update to the current weight update, which helps the model converge faster.

In gradient descent, the weights of a neural network are updated by taking a step in the direction of the negative gradient of the loss function with respect to the weights. This can sometimes result in slow convergence or getting stuck in local minima. Momentum addresses these issues by introducing a “momentum” term that takes into account the previous weight updates.

The idea behind momentum is to add a fraction of the previous weight update to the current weight update, effectively “smoothing out” the updates and helping the model converge faster. This can be controlled by a hyperparameter called the momentum coefficient, which specifies how much of the previous weight update should be added to the current weight update. A common value for the momentum coefficient is 0.9.
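
To make the update rule concrete, here is a small sketch on a toy one-dimensional loss, f(w) = 0.5 w². The loss, the learning rate of 0.01, the step count, and the function names are assumptions made for this example; the momentum coefficient of 0.9 is the common value mentioned above.

```python
def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * w**2."""
    return w


def gradient_descent(w, lr=0.01, momentum=0.0, steps=100):
    """Gradient descent where each update adds a fraction of the previous one."""
    velocity = 0.0  # the previous weight update
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(w)
        w = w + velocity
    return w


w_plain = gradient_descent(1.0, momentum=0.0)
w_momentum = gradient_descent(1.0, momentum=0.9)
print(f"distance from minimum without momentum: {abs(w_plain):.4f}")
print(f"distance from minimum with momentum:    {abs(w_momentum):.4f}")
```

With the same learning rate and number of steps, the run with momentum ends much closer to the minimum at w = 0 on this toy problem. On other problems the benefit depends on the learning rate and the shape of the loss.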

Momentum is most commonly combined with stochastic gradient descent (SGD). Adaptive optimizers such as Adam already include a momentum-like term through their running average of past gradients.

**Adam** (short for Adaptive Moment Estimation) is an optimization algorithm commonly used in deep learning to train neural networks. It is an extension of stochastic gradient descent (SGD) that incorporates ideas from other optimization algorithms such as AdaGrad and RMSProp.

Adam works by maintaining running estimates of the first and second moments of the gradients (i.e., the mean and uncentered variance) and using these estimates to adaptively adjust the effective learning rate for each weight in the network. This often lets the algorithm converge faster than plain SGD, although that is not guaranteed on every problem.

One of the key advantages of Adam is that it requires little tuning of its hyperparameters. The algorithm has three main hyperparameters: the learning rate, the first moment decay rate (beta1), and the second moment decay rate (beta2). The default values for these hyperparameters (0.001, 0.9, and 0.999, respectively) usually work well in practice.
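
The moment estimates and the default hyperparameters can be sketched for a single scalar weight as follows. This is a simplified illustration rather than a reference implementation: the function name `adam_minimize`, the toy loss (w - 3)², and the small constant `eps` (a commonly used value of 1e-8 that avoids division by zero) are assumptions made for this example.

```python
import math


def adam_minimize(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam updates for one scalar weight."""
    m = 0.0  # first moment estimate (running mean of gradients)
    v = 0.0  # second moment estimate (running mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)**2."""
    return 2.0 * (w - 3.0)


def scaled_grad(w):
    """Same loss multiplied by 1000, so gradients are 1000 times larger."""
    return 1000.0 * toy_grad(w)


# After one step, both versions move by roughly lr despite the gradient scale.
first_small = adam_minimize(toy_grad, 5.0, steps=1)
first_large = adam_minimize(scaled_grad, 5.0, steps=1)
w_final = adam_minimize(toy_grad, 5.0, steps=1000)
print(first_small, first_large, w_final)
```

Dividing by the square root of the second moment makes the step size largely insensitive to the scale of the gradient, which is one reason the default learning rate tends to transfer across problems.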

Adam has been shown to work well on a wide range of deep learning problems and is a common default choice of optimizer in practice.