# Chapter 14 : L1 and L2 Regularisation

- Generalisation loosely means that the model learns the underlying patterns in the data, in contrast to overfitting, where it memorises the training data.
- **Regularisation methods** are those which reduce generalisation error. 
- L1 and L2 regularization calculate a number (the **penalty**) that is added to the loss value to penalize the model for large weights and biases.
- Large weights might indicate that a neuron is attempting to memorize a data element.
- Generally, it is believed that it would be better to have many neurons contributing to a model’s output, rather than a select few.

## Forward Pass
- L1 regularization’s penalty is the sum of the absolute values of all the weights and biases. This is a linear penalty, as the regularization loss returned by this function is directly proportional to the absolute parameter values.
- L2 regularization’s penalty is the sum of the squared weights and biases. This non-linear approach penalizes larger weights and biases more than smaller ones because of the square function used to calculate the result.
- In other words, L2 regularization is commonly used as it does not affect small parameter values substantially and does not allow the model to grow weights too large by heavily penalizing relatively big values. 
- L1 regularization, because of its linear nature, penalizes small weights more than L2 regularization does, so the model tends to become invariant to small inputs and sensitive only to the larger ones. That is why L1 regularization is rarely used alone; when it is used at all, it is usually combined with L2 regularization.
- Regularization functions of this type drive the weights and biases towards 0, which can also help in cases of exploding gradients (model instability that can cause weights to become very large).
- Beyond this, we also want to dictate how much impact this regularization penalty carries. We use a value referred to as lambda ($\lambda$) in the equations, where a higher value means a more significant penalty.
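
The difference between the linear and the squared penalty is easiest to see on a few numbers. The following is a minimal sketch, with parameter values chosen purely for illustration, that prints the per-parameter contribution to each penalty (before scaling by lambda):

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
params = np.array([0.1, 0.5, 1.0, 2.0])

l1_terms = np.abs(params)  # L1 contribution of each parameter: |w|
l2_terms = params ** 2     # L2 contribution of each parameter: w^2

for w, l1, l2 in zip(params, l1_terms, l2_terms):
    print(f"w={w:.1f}  L1 term={l1:.2f}  L2 term={l2:.2f}")
```

For values below 1 the squared term is smaller than the absolute value, and for values above 1 it is larger, which is why L2 is gentle on small parameters and harsh on large ones.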

**L1 weight and bias regularisation :** <br>
$$L_{1w} = \lambda \sum_{m} |w_m|$$
$$L_{1b} = \lambda \sum_{n} |b_n|$$ 

**L2 weight and bias regularisation :** <br>
$$L_{2w} = \lambda \sum_{m} w_{m}^2$$
$$L_{2b} = \lambda \sum_{n} b_{n}^2$$ 

**Overall Loss:**<br>
$$Loss = DataLoss + L_{1w} + L_{1b} + L_{2w} + L_{2b} $$
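
As a worked example of the formulas above, here is a small sketch that evaluates all four penalties for one tiny layer and adds them to a data loss. The weights, biases, lambda value and data loss are made-up numbers, and a single lambda is reused for every term only to keep the example short:

```python
import numpy as np

# Hypothetical values for one small layer (2 inputs, 3 neurons)
weights = np.array([[0.2, -0.5, 1.0],
                    [-0.3, 0.4, -2.0]])
biases = np.array([[0.1, -0.2, 0.0]])
lam = 0.01        # regularization strength (lambda), assumed value
data_loss = 0.75  # stand-in for the loss computed from predictions

l1_w = lam * np.sum(np.abs(weights))  # L1 penalty on weights
l1_b = lam * np.sum(np.abs(biases))   # L1 penalty on biases
l2_w = lam * np.sum(weights ** 2)     # L2 penalty on weights
l2_b = lam * np.sum(biases ** 2)      # L2 penalty on biases

# Overall loss is the data loss plus every penalty term
loss = data_loss + l1_w + l1_b + l2_w + l2_b
print(l1_w, l1_b, l2_w, l2_b, loss)
```

In practice each term would have its own lambda, and most of them would usually be left at 0.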

Here are the modifications to the code. We'll start with the dense layer class and set the lambda values there, since these can be set separately for every layer.

In [1]:
import numpy as np

In [2]:
 
# Dense layer 
class Layer_Dense: 
 
    # Layer initialization 
    def __init__(self, n_inputs, n_neurons,
                weight_regularizer_l1=0, weight_regularizer_l2=0, 
                bias_regularizer_l1=0, bias_regularizer_l2=0): 
        
        # Initialize weights and biases 
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons) 
        self.biases = np.zeros((1, n_neurons)) 
        # set regularisation strength
        self.weight_regularizer_l1 = weight_regularizer_l1
        self.bias_regularizer_l1 = bias_regularizer_l1
        self.weight_regularizer_l2 = weight_regularizer_l2
        self.bias_regularizer_l2 = bias_regularizer_l2
        
 
    # Forward pass 
    def forward(self, inputs): 
        # Remember input values 
        self.inputs = inputs 
        # Calculate output values from inputs, weights and biases 
        self.output = np.dot(inputs, self.weights) + self.biases 
 
    # Backward pass 
    def backward(self, dvalues): 
        # Gradients on parameters 
        self.dweights = np.dot(self.inputs.T, dvalues) 
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True) 
        # Gradient on values 
        self.dinputs = np.dot(dvalues, self.weights.T) 

Now we update our loss class to include the additional penalty whenever a lambda hyperparameter is set for any of the regularizers in a layer’s initialization. We implement this in the Loss class because the regularization calculation is the same regardless of the type of loss used: it is only a penalty that is summed with the data loss value to give the final, overall loss value. For this reason, we add a new method to the general loss class, which is inherited by all of our specific loss functions (such as our existing Loss_CategoricalCrossentropy).

In [3]:
# Common loss class 
class Loss: 
 
    # Regularization loss calculation 
    def regularization_loss(self, layer): 
 
        # 0 by default 
        regularization_loss = 0 
 
        # L1 regularization - weights 
        # calculate only when factor greater than 0 
        if layer.weight_regularizer_l1 > 0: 
            regularization_loss += layer.weight_regularizer_l1*np.sum(np.abs(layer.weights)) 
 
        # L2 regularization - weights 
        if layer.weight_regularizer_l2 > 0: 
            regularization_loss += layer.weight_regularizer_l2*np.sum(layer.weights*layer.weights) 
 
        # L1 regularization - biases 
        # calculate only when factor greater than 0 
        if layer.bias_regularizer_l1 > 0: 
            regularization_loss += layer.bias_regularizer_l1*np.sum(np.abs(layer.biases)) 
 
        # L2 regularization - biases 
        if layer.bias_regularizer_l2 > 0: 
            regularization_loss += layer.bias_regularizer_l2*np.sum(layer.biases*layer.biases) 
 
        return regularization_loss

    # Calculates the data loss
    # given model output and ground truth values
    def calculate(self, output, y): 
 
        # Calculate sample losses 
        sample_losses = self.forward(output, y) 
 
        # Calculate mean loss 
        data_loss = np.mean(sample_losses)
 
        # Return loss 
        return data_loss

Then we’ll calculate the regularization loss and add it to our calculated loss in the training loop: 
```
    # Calculate mean data loss from the output of activation2 (softmax activation) 
    data_loss = loss_function.calculate(activation2.output, y) 
 
    # Calculate regularization penalty 
    regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2) 
 
    # Calculate overall loss 
    loss = data_loss + regularization_loss
``` 

We created a new regularization_loss variable and added every layer’s regularization loss to it. This completes the forward pass for regularization, but it also means our overall loss has changed: part of it can now come from regularization, which must be accounted for when backpropagating the gradients.
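
With more than two layers, the same sum can be written as a loop. The sketch below is an illustration only: `regularization_loss` is a standalone copy of the method's logic, and `make_layer` is a hypothetical helper that builds stand-in objects carrying just the attributes the calculation reads.

```python
import numpy as np
from types import SimpleNamespace

def regularization_loss(layer):
    """Standalone copy of the penalty calculation for one layer."""
    loss = 0.0
    if layer.weight_regularizer_l1 > 0:
        loss += layer.weight_regularizer_l1 * np.sum(np.abs(layer.weights))
    if layer.weight_regularizer_l2 > 0:
        loss += layer.weight_regularizer_l2 * np.sum(layer.weights ** 2)
    if layer.bias_regularizer_l1 > 0:
        loss += layer.bias_regularizer_l1 * np.sum(np.abs(layer.biases))
    if layer.bias_regularizer_l2 > 0:
        loss += layer.bias_regularizer_l2 * np.sum(layer.biases ** 2)
    return loss

def make_layer(weights, biases, w_l1=0, w_l2=0, b_l1=0, b_l2=0):
    """Hypothetical stand-in for a dense layer (attributes only)."""
    return SimpleNamespace(
        weights=np.array(weights), biases=np.array(biases),
        weight_regularizer_l1=w_l1, weight_regularizer_l2=w_l2,
        bias_regularizer_l1=b_l1, bias_regularizer_l2=b_l2)

# A regularized hidden layer and an unregularized output layer
hidden = make_layer([[1.0, -2.0]], [[0.5, 0.0]], w_l2=0.1, b_l2=0.1)
output = make_layer([[3.0], [4.0]], [[1.0]])

layers = [hidden, output]
total_penalty = sum(regularization_loss(layer) for layer in layers)
print(total_penalty)
```

The output layer contributes nothing here because all of its lambda values are 0, so including it in the loop is harmless.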

## Backward Pass

The derivatives of the L1 and L2 regularization penalties are:
$$
\frac{\partial L_{1w}}{\partial w_m} = \lambda
\begin{cases} 
1 & \text{if } w_m > 0, \\
-1 & \text{if } w_m < 0.
\end{cases}
$$
$$\frac{\partial L_{2w}}{\partial w_m} = 2 \lambda w_m$$
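
One way to convince yourself of these derivatives is to compare them with a finite-difference estimate. This is a minimal sketch under assumed values (`lam = 0.1` and the three example weights are arbitrary); `numerical_gradient` is a helper written for this check, not part of the model code. The weights are kept away from 0, where the L1 penalty is not differentiable.

```python
import numpy as np

lam = 0.1  # assumed regularization strength
weights = np.array([0.2, 0.8, -0.5])

def l1_penalty(w):
    return lam * np.sum(np.abs(w))

def l2_penalty(w):
    return lam * np.sum(w ** 2)

# Analytic derivatives: lambda * sign(w) for L1, 2 * lambda * w for L2
dl1 = lam * np.where(weights >= 0, 1.0, -1.0)
dl2 = 2 * lam * weights

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

print(np.allclose(dl1, numerical_gradient(l1_penalty, weights)))
print(np.allclose(dl2, numerical_gradient(l2_penalty, weights)))
```

Both comparisons should print `True`, which confirms that the L1 gradient has constant magnitude while the L2 gradient grows with the weight.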

In [4]:
weights = [0.2, 0.8, -0.5]  # weights of one neuron
dL1 = []  # partial derivatives of the L1 penalty (lambda factor omitted)
for w in weights:
    if w >= 0:  # w == 0 is treated as positive here
        dL1.append(1)
    else:
        dL1.append(-1)

print(dL1)

[1, 1, -1]


In [6]:
weights =  np.array([[0.2, 0.8, -0.5, 1], 
                    [0.5, -0.91, 0.26, -0.5], 
                    [-0.26, -0.27, 0.17, 0.87]])

# np.where builds a new array, so no copy of weights is needed
dL1 = np.where(weights >= 0, 1., -1.)
print(dL1)

[[ 1.  1. -1.  1.]
 [ 1. -1.  1. -1.]
 [-1. -1.  1.  1.]]


Now update the dense layer class with this:

In [7]:
 
# Dense layer 
class Layer_Dense: 
 
    # Layer initialization 
    def __init__(self, n_inputs, n_neurons,
                weight_regularizer_l1=0, weight_regularizer_l2=0, 
                bias_regularizer_l1=0, bias_regularizer_l2=0): 
        
        # Initialize weights and biases 
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons) 
        self.biases = np.zeros((1, n_neurons)) 
        # set regularisation strength
        self.weight_regularizer_l1 = weight_regularizer_l1
        self.bias_regularizer_l1 = bias_regularizer_l1
        self.weight_regularizer_l2 = weight_regularizer_l2
        self.bias_regularizer_l2 = bias_regularizer_l2
        
 
    # Forward pass 
    def forward(self, inputs): 
        # Remember input values 
        self.inputs = inputs 
        # Calculate output values from inputs, weights and biases 
        self.output = np.dot(inputs, self.weights) + self.biases 
 
    # Backward pass 
    def backward(self, dvalues): 
        # Gradients on parameters 
        self.dweights = np.dot(self.inputs.T, dvalues) 
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True) 

        # Gradients on regularization 
        # L1 on weights 
        if self.weight_regularizer_l1 > 0: 
            # +1 where the weight is non-negative, -1 where it is negative
            dL1 = np.where(self.weights >= 0., 1., -1.)
            self.dweights += self.weight_regularizer_l1 * dL1 
        
        # L2 on weights 
        if self.weight_regularizer_l2 > 0: 
            self.dweights += 2 * self.weight_regularizer_l2*self.weights
         
        # L1 on biases 
        if self.bias_regularizer_l1 > 0: 
            dL1 = np.ones_like(self.biases) 
            dL1[self.biases < 0] = -1 
            self.dbiases += self.bias_regularizer_l1 * dL1 

        # L2 on biases 
        if self.bias_regularizer_l2 > 0: 
            self.dbiases += 2 * self.bias_regularizer_l2*self.biases 
 
        # Gradient on values 
        self.dinputs = np.dot(dvalues, self.weights.T)

Then we can add weight and bias regularizer parameters when defining a layer:
```
# Create Dense layer with 2 input features and 64 output values 
dense1 = Layer_Dense(2, 64, weight_regularizer_l2=5e-4, 
                            bias_regularizer_l2=5e-4) 
```

We usually add regularization terms to the hidden layers only. Even if we call the 
regularization method on the output layer as well, it won’t modify the gradients unless we set the 
lambda hyperparameters to values other than 0. <br>
I have made these updates in the full model code file.
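
To see what the extra gradient terms do during training, the sketch below applies one plain gradient-descent step with the data gradient set to zero, so that only the regularization term acts. All values (weights, lambdas, learning rate) are assumptions chosen for illustration, and the update rule is vanilla SGD rather than any particular optimizer from the model code.

```python
import numpy as np

weights = np.array([[0.5, -1.5],
                    [2.0, -0.1]])
data_grad = np.zeros_like(weights)  # pretend the data loss is flat here
lam_l1, lam_l2 = 0.01, 0.01         # assumed lambda values
learning_rate = 0.1                 # assumed step size

# Same regularization terms that the layer's backward pass adds
grad_l1 = data_grad + lam_l1 * np.where(weights >= 0, 1.0, -1.0)
grad_l2 = data_grad + 2 * lam_l2 * weights

after_l1 = weights - learning_rate * grad_l1
after_l2 = weights - learning_rate * grad_l2

print(after_l1)  # every weight moves towards 0 by the same fixed amount
print(after_l2)  # every weight is scaled by the same factor below 1
```

L1 subtracts a constant amount from the magnitude of each weight, while L2 shrinks each weight in proportion to its size, which matches the linear and squared penalties described in the forward pass.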