In [8]:
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm

## Regularization Using the Frobenius Norm

In a neural network, the weight matrix $W^{\ell}$ for a layer $\ell$ has the following dimensions:

\begin{align*}
\underbrace{W^{\ell}}_{n^{\ell} \times n^{\ell - 1}}
\end{align*}

where 

* $n^{\ell}$ is the number of neurons or units in the current layer $\ell$ 
* $n^{\ell - 1}$ is the number of neurons or units in the previous layer $\ell - 1$, whose outputs are the inputs to the current layer $\ell$

The bias vector $\vec{b}^{\ell}$ for a layer $\ell$ has the following dimensions:

\begin{align*}
\underbrace{\vec{b}^{\ell}}_{n^{\ell} \times 1}
\end{align*}

The cost function of a neural network is a function of all parameters:

\begin{align*}
J(W^{1}, \vec{b}^{1}, W^{2}, \vec{b}^{2}, ..., W^{\mathcal{L}}, \vec{b}^{\mathcal{L}}) = \frac{1}{m}\sum^{m}_{i=1}L(\hat{y}^{i}, y^{i})
\end{align*}

where $\mathcal{L}$ is the number of layers in the network. In words, the cost function is the average of the losses over all $m$ training examples, that is, their sum scaled by a factor $\frac{1}{m}$. To regularize the weights, we add an additional term to the equation as follows:

\begin{align*}
J_{\text{regularized}}(W^{1}, \vec{b}^{1}, W^{2}, \vec{b}^{2}, ..., W^{\mathcal{L}}, \vec{b}^{\mathcal{L}}) = \frac{1}{m}\sum^{m}_{i=1}L(\hat{y}^{i}, y^{i}) + \frac{\lambda}{2m}\sum^{\mathcal{L}}_{\ell=1}||W^{\ell}||^{2}
\end{align*}

where 

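* $||W^{\ell}||^{2}=\sum^{n^{\ell}}_{i=1}\sum^{n^{\ell - 1}}_{j=1} (W^{\ell}_{ij})^2$ is the squared Frobenius norm of the $\ell^{th}$ layer weight matrix
* $\lambda$ is the regularization hyperparameter that controls the strength of the penalty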
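
As a rough numerical sketch of the penalty term, the squared Frobenius norms can be summed over the layers with NumPy. The helper name `frobenius_penalty` and the small example matrices are illustrative assumptions, not part of the notebook's model:

```python
import numpy as np

def frobenius_penalty(weights, lam, m):
    """Return (lambda / (2m)) * sum over layers of the squared Frobenius norm."""
    # np.sum(W ** 2) adds up every squared entry of W, i.e. ||W||^2
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

# Hypothetical two-layer network: W1 is 3 x 2, W2 is 1 x 3
W1 = np.ones((3, 2))        # squared Frobenius norm = 6
W2 = 2 * np.ones((1, 3))    # squared Frobenius norm = 12
penalty = frobenius_penalty([W1, W2], lam=0.1, m=10)  # (0.1 / 20) * 18, about 0.09
```

The same quantity per layer can be obtained from `np.linalg.norm(W, 'fro') ** 2`; summing the squared entries directly avoids taking a square root only to square it again.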

### Gradient Descent Update Rule with Regularization

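Backpropagation gives the partial derivative of the unregularized cost $J$ w.r.t. each weight matrix. Differentiating the penalty $\frac{\lambda}{2m}||W^{\ell}||^{2}$ adds the term $\frac{\lambda}{m}W^{\ell}$, so the gradient of the regularized cost is:

\begin{align*}
\frac{\partial{J_{\text{regularized}}}}{\partial{W^{\ell}}} = \frac{\partial{J}}{\partial{W^{\ell}}} + \frac{\lambda}{m}W^{\ell}
\end{align*}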

The update rule is then:

\begin{align*}
W^{\ell} &:= W^{\ell} - \alpha \left(\frac{\partial{J}}{\partial{W^{\ell}}} + \frac{\lambda}{m}W^{\ell}\right) \\
&= W^{\ell} - \alpha \frac{\partial{J}}{\partial{W^{\ell}}} \textcolor{blue}{- \frac{\alpha \lambda}{m}W^{\ell}} \\
&= W^{\ell} \textcolor{blue}{- \frac{\alpha \lambda}{m}W^{\ell}} - \alpha \frac{\partial{J}}{\partial{W^{\ell}}} \\
&= \textcolor{blue}{\left(1 - \frac{\alpha \lambda}{m}\right)}W^{\ell} - \alpha \frac{\partial{J}}{\partial{W^{\ell}}}
\end{align*}

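This is known as "weight decay" since, at each step of the algorithm, the matrix $W^{\ell}$ is first multiplied by the factor $(1 - \frac{\alpha \lambda}{m})$, which is slightly smaller than 1 when $\frac{\alpha \lambda}{m}$ is small and positive, before the usual gradient step is applied. This factor is the term highlighted in blue in the last line.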
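
To check the algebra numerically, here is a minimal NumPy sketch comparing the two forms of the update. The random matrices stand in for a real weight matrix and its backpropagated gradient, and the values of `alpha`, `lam` and `m` are arbitrary assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))     # hypothetical weight matrix
dW = rng.normal(size=(3, 2))    # stand-in for the backprop gradient dJ/dW
alpha, lam, m = 0.1, 0.5, 20    # illustrative learning rate, lambda, and number of examples

# Form 1: add the regularization gradient, then take the gradient step
W_direct = W - alpha * (dW + (lam / m) * W)

# Form 2: shrink W by the decay factor, then apply the unregularized step
decay = 1 - alpha * lam / m     # 0.9975 for these values
W_decay = decay * W - alpha * dW
```

Both forms produce the same matrix up to floating-point rounding, which is why L2 regularization and weight decay coincide for plain gradient descent.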

In [13]:
# The regularization technique formulated above can be implemented in Keras by using the kernel_regularizer parameter of the Dense layer
model_l2_regularize = Sequential(
    [               
        Input(shape=(400,)),   
        Dense(units=25, activation="relu", kernel_regularizer=regularizers.L2(l2=0.01),  use_bias=True, name='layer_1'), # Lambda = 0.01 
        Dense(units=15, activation="relu", kernel_regularizer=regularizers.L2(l2=0.03), use_bias=True, name='layer_2'), # Lambda = 0.03
        Dense(units=1, activation="sigmoid", use_bias=True, name='output_layer')
    ], name = "my_model" 
)   

Alternatively, we can use other types of regularization:

* L1 regularization `regularizers.l1(l1=0.001)`: The cost added is proportional to the sum of the absolute values of the weight coefficients (the L1 norm of the weights).
* L2 regularization `regularizers.l2(l2=0.001)`: The cost added is proportional to the sum of the squared weight coefficients (the squared L2 norm of the weights).
* Using both L1 and L2 simultaneously: `regularizers.l1_l2(l1=0.001, l2=0.001)`.

The lambda values `l1` and `l2` are hyperparameters that we can tune.
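
The penalties themselves are simple to write down. The sketch below uses plain NumPy and assumes the Keras convention, as we understand it, of `l1 * sum(|w|)` and `l2 * sum(w^2)` with no division by $2m$; under that assumption, the Keras `l2` value plays the role of $\frac{\lambda}{2m}$ in the formula above rather than of $\lambda$ itself. The function names and the example matrix are hypothetical:

```python
import numpy as np

def l1_term(w, l1):
    # L1 penalty: coefficient times the sum of absolute values
    return l1 * np.sum(np.abs(w))

def l2_term(w, l2):
    # L2 penalty: coefficient times the sum of squared values
    return l2 * np.sum(np.square(w))

w = np.array([[0.5, -1.0],
              [2.0,  0.0]])

p1 = l1_term(w, l1=0.001)    # 0.001 * 3.5
p2 = l2_term(w, l2=0.001)    # 0.001 * 5.25
p12 = p1 + p2                # combined L1 + L2 penalty
```

The L1 term grows linearly with each weight, so it tends to push small weights all the way to zero, while the L2 term penalizes large weights much more heavily than small ones.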

## Inverted Drop-out Regularization

Dropout, applied to a layer (usually a hidden layer), consists of randomly dropping out (setting to zero) a number of the layer's activation values during training. Suppose a given layer would normally return an activation vector of `[0.2, 0.5, 1.3, 0.8, 1.1]` for a given training example. After applying dropout, this vector will have a few zero entries distributed at random: for example, `[0, 0.5, 1.3, 0, 1.1]`. The **dropout rate** is the fraction of the activations that are zeroed out; it’s usually set between 0.2 and 0.5, and the keep probability is `keep_prob = 1 - rate`. At test time, no units are dropped out. To compensate for the fact that more units are active at test time than at training time, standard dropout scales the layer’s outputs down by `keep_prob` at test time. *Inverted* dropout instead scales the surviving activations up by `1 / keep_prob` during training, so the expected activation is unchanged and the layer can be used as-is at test time.


```python
# Multiply (element-wise) the activation matrix by a random mask of 0's and 1's with the same dimensions;
# each element is kept with probability 'keep_prob' and zeroed out otherwise
layer_output *= np.random.rand(*layer_output.shape) < keep_prob
# Scale the surviving activations up by a factor of 1 / keep_prob so the expected value is unchanged
layer_output /= keep_prob
```

<center> <img  src="images/drop_out.png" width="600" />   </center>
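
The inverted dropout step described above can be wrapped in a small self-contained function. This is a minimal sketch rather than what Keras does internally; the function name `inverted_dropout`, the `training` flag and the all-ones activation matrix are assumptions made for illustration:

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng, training=True):
    """Apply inverted dropout to the activation array `a`."""
    if not training:
        return a                                  # test time: nothing is dropped or rescaled
    mask = rng.random(a.shape) < keep_prob        # True with probability keep_prob
    return a * mask / keep_prob                   # zero out dropped units, scale up the rest

rng = np.random.default_rng(0)
a = np.ones((1000, 50))                           # hypothetical activations, all equal to 1
keep_prob = 0.8                                   # corresponds to a dropout rate of 0.2

a_train = inverted_dropout(a, keep_prob, rng, training=True)
a_test = inverted_dropout(a, keep_prob, rng, training=False)
```

Each entry of `a_train` is either 0 or `1 / keep_prob`, and its mean stays close to 1, the mean of the original activations. That is the point of the rescaling: the next layer sees inputs of the same expected magnitude during training and at test time.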

In [17]:
model_dropout = Sequential(
    [               
        Input(shape=(400,)),   
        Dense(units=25, activation="relu", kernel_constraint=MaxNorm(max_value=2, axis=0),  use_bias=True, name='layer_1'), 
        # Applies to the output activation values of the layer right above 'layer_1'
        Dropout(rate=0.5, name='dropout_layer_1'),
        Dense(units=15, activation="relu", kernel_constraint=MaxNorm(max_value=3, axis=0), use_bias=True, name='layer_2'), 
        # Applies to the output activation values of the layer right above 'layer_2'
        Dropout(rate=0.2, name='dropout_layer_2'),
        Dense(units=1, activation="sigmoid", use_bias=True, name='output_layer')
    ], name = "my_model" 
) 

In [18]:
model_dropout.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer_1 (Dense)             (None, 25)                10025     
                                                                 
 dropout_layer_1 (Dropout)   (None, 25)                0         
                                                                 
 layer_2 (Dense)             (None, 15)                390       
                                                                 
 dropout_layer_2 (Dropout)   (None, 15)                0         
                                                                 
 output_layer (Dense)        (None, 1)                 16        
                                                                 
Total params: 10,431
Trainable params: 10,431
Non-trainable params: 0
_________________________________________________________________
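
The parameter counts in the summary can be reproduced by hand: a `Dense` layer with bias has one weight per input-unit pair plus one bias per unit, and `Dropout` layers have no parameters. A small sketch, where the helper name `dense_params` is hypothetical:

```python
def dense_params(n_in, n_out, use_bias=True):
    # n_in * n_out weights, plus n_out biases when use_bias is True
    return n_in * n_out + (n_out if use_bias else 0)

layer_sizes = [400, 25, 15, 1]   # input, layer_1, layer_2, output_layer
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)

print(counts)  # [10025, 390, 16]
print(total)   # 10431
```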


Combining dropout with a max-norm weight constraint, as in the model above, is recommended in the literature:

> Though large momentum and learning rate speed up learning, they sometimes cause the network weights to grow very large. To prevent this, we can use max-norm regularization. This constrains the norm of the vector of incoming weights at each hidden unit to be bound by a constant c. Typical values of c range from 3 to 4.

[Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://jmlr.org/papers/v15/srivastava14a.html), 2014.
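
To make the constraint concrete, here is a minimal NumPy sketch of a max-norm projection. It assumes a Keras-style kernel of shape `(n_inputs, n_units)`, so that each column holds the incoming weights of one unit and `axis=0` selects those columns; note that this is the transpose of the $n^{\ell} \times n^{\ell - 1}$ convention used earlier. The function name `max_norm`, the `eps` guard and the example matrix are assumptions for illustration, not the library's code:

```python
import numpy as np

def max_norm(W, c, axis=0, eps=1e-7):
    """Rescale weight vectors along `axis` so that their L2 norm is at most c."""
    norms = np.sqrt(np.sum(W ** 2, axis=axis, keepdims=True))
    desired = np.clip(norms, 0, c)        # norms above c are clipped to c
    return W * (desired / (eps + norms))  # eps avoids division by zero

# Column 0 has norm 5 (too large for c = 2); column 1 has norm 0.5 (already fine)
W = np.array([[3.0, 0.3],
              [4.0, 0.4]])
W_c = max_norm(W, c=2.0)
col_norms = np.linalg.norm(W_c, axis=0)
```

The first column is shrunk to norm 2 while keeping its direction, and the second column is left essentially unchanged. In training, such a projection would be applied to the weights after each gradient update.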

The intuition for why randomly removing a different subset of neurons on each training example reduces overfitting comes down to noise. Introducing noise in the output values of a hidden layer can break up happenstance patterns that aren’t significant (what Hinton, the originator of this technique, refers to as conspiracies), which the network would otherwise start memorizing.