## Weight decay in PyTorch and its relation with Learning Rate | L2 Regularization



Weight decay is a regularization technique that adds a small penalty, usually the squared L2 norm of the weights (all the weights of the model), to the loss function.

```
loss = loss + weight_decay * (squared L2 norm of the weights)
```

![Imgur](https://imgur.com/262wI66.png)

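The weights themselves are then updated as follows. This is what `torch.optim.SGD` does without momentum; note that the decay term is also scaled by the learning rate, and the factor of 2 from differentiating the squared norm is absorbed into `weight_decay`.

```
w[t+1] = w[t] - learning_rate * (dw + weight_decay * w[t])
```
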
We start from our loss function, add the sum of the squared norms of our weight matrices, and multiply that sum by a constant denoted lambda. Lambda is called the regularization parameter, and it is another hyperparameter that we have to choose.

If we set lambda to a relatively large number, the model is incentivized to push the weights close to 0. The objective of SGD is to minimize the loss function, and our original loss is now summed with the squared weight norms scaled by lambda.
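
To make the effect of lambda concrete, here is a minimal pure-Python sketch with a single weight. The toy loss `(w - 3)**2`, the helper name `minimize` and the step settings are assumptions chosen only for illustration; for this loss the penalized minimum is at `3 / (1 + lam)`, so a larger lambda pulls the weight closer to 0.

```python
def minimize(lam, lr=0.01, steps=2000):
    """Gradient descent on loss(w) = (w - 3)**2 + lam * w**2 for one weight."""
    w = 0.0
    for _ in range(steps):
        # gradient of the data term plus gradient of the L2 penalty
        grad = 2 * (w - 3) + 2 * lam * w
        w -= lr * grad
    return w


for lam in (0.0, 0.1, 1.0, 10.0):
    # the closed-form minimum of this toy loss is 3 / (1 + lam)
    print(f"lambda={lam}: w={minimize(lam):.4f}, closed form={3 / (1 + lam):.4f}")
```

With `lam=0` the weight settles at the unregularized optimum of 3, and it shrinks steadily as lambda grows.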

Some people prefer to apply weight decay only to the weights and not to the biases. When you pass `model.parameters()` to a PyTorch optimizer, weight decay is applied to both weights and biases.


---

### How do we use weight decay?

To use weight decay, simply pass the `weight_decay` argument to the `torch.optim.SGD` or `torch.optim.Adam` optimizer. Here we use 1e-4 as the value for `weight_decay` (PyTorch's own default for both optimizers is 0, meaning no decay).

```py
import torch

# `model` is any torch.nn.Module defined elsewhere; pick one of the two.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```
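
As a quick sanity check of what `weight_decay` does inside `torch.optim.SGD`, the sketch below compares one optimizer step against the update written out by hand. It assumes plain SGD (no momentum, no dampening); the tensor names and the values of `lr` and `wd` are made up for illustration.

```python
import torch

torch.manual_seed(0)
lr, wd = 0.1, 0.5

w = torch.randn(4, requires_grad=True)
x = torch.randn(4)
w_start = w.detach().clone()  # copy of the weights before the step

opt = torch.optim.SGD([w], lr=lr, weight_decay=wd)

loss = (w * x).sum()  # the gradient of this loss w.r.t. w is simply x
opt.zero_grad()
loss.backward()
grad = w.grad.detach().clone()  # copy the gradient before stepping
opt.step()

# Hand-written update: the decay term is added to the gradient,
# so it is scaled by the learning rate as well.
expected = w_start - lr * (grad + wd * w_start)
print(torch.allclose(w.detach(), expected))
```

Because the decay term sits inside the gradient, changing the learning rate also changes how strongly the weights are decayed per step.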

--------------------

### Why do we use weight decay?

1. To prevent overfitting.

2. To keep the weights small and avoid exploding gradients. Because the L2 norm of the weights is added to the loss, each training iteration tries to minimize the size of the model weights in addition to the original loss. This keeps the weights small, prevents them from growing out of control, and thus helps avoid exploding gradients.

---

### Disabling weight decay for some layers, or setting different values for different layers

When `weight_decay` is set, PyTorch applies it to every parameter passed to the optimizer, weights and biases alike.

But some think that decay should not be applied to biases, since those parameters are less likely to overfit. Furthermore, decay should also not be applied to one-dimensional parameters, that is, parameters that are vectors rather than matrices. These are common in normalization modules such as batch-norm, layer-norm or weight-norm.

The function `named_parameters()` returns a name along with each parameter. For standard layers, biases are named `bias`, so by combining the name with the shape we can create two parameter lists, one with weight decay and the other without it. Furthermore, we can use a `skip_list` of full parameter names to manually disable weight decay for some layers, such as embedding layers.

```py
def custom_weight_decay(net, l2_value, skip_list=()):
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # frozen weights
        # 1-D parameters (biases, norm scales) and skipped names get no decay
        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)

    return [{'params': no_decay, 'weight_decay': 0.}, {'params': decay, 'weight_decay': l2_value}]


# and the returned list is passed to the optimizer:

params = custom_weight_decay(pytorch_neural_net, 2e-5)
sgd = torch.optim.SGD(params, lr=0.05)

```

See the PyTorch documentation: https://pytorch.org/docs/stable/optim.html#per-parameter-options

Per-parameter options: Optimizers also support specifying per-parameter options. To do this, instead of passing an iterable of `Variable`s, pass in an iterable of `dict`s. Each of them will define a separate parameter group, and should contain a `params` key, containing a list of parameters belonging to it. Other keys should match the keyword arguments accepted by the optimizers, and will be used as optimization options for this group.
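
The parameter groups described above can be inspected through `optimizer.param_groups`. The following is a small sketch on a made-up three-layer model (the layer sizes and the split by `p.ndim` are assumptions for illustration); it shows that a group without its own `weight_decay` key falls back to the value given to the optimizer.

```python
import torch
from torch import nn

# Hypothetical toy model: two linear layers with a layer-norm in between.
model = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16), nn.Linear(16, 2))

decay = [p for p in model.parameters() if p.ndim > 1]      # weight matrices
no_decay = [p for p in model.parameters() if p.ndim <= 1]  # biases, norm params

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay},  # no key here: uses the default given below
    ],
    lr=0.05,
    weight_decay=0.0,
)

for i, group in enumerate(optimizer.param_groups):
    print(i, len(group["params"]), group["lr"], group["weight_decay"])
```

Both groups share `lr=0.05`, while only the first one (the two weight matrices) is decayed.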

---

### Custom weight decay operation that does not affect grad values

```py

import torch


def weight_decay(optimizer, wd):
    """
    Custom weight decay operation that does not affect grad values.
    https://www.fast.ai/2018/07/02/adam-weight-decay/
    """
    current_lr = None  # stays None if the optimizer has no param groups
    with torch.no_grad():
        for group in optimizer.param_groups:
            current_lr = group['lr']
            for param in group['params']:
                # w <- w * (1 - lr * wd), i.e. w - lr * wd * w, applied in place
                param.mul_(1 - wd * current_lr)
    return optimizer, current_lr

```
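
A decay step of this kind is typically applied right before `optimizer.step()`. The sketch below inlines the same loop on a single made-up tensor so that it runs on its own; it assumes the optimizer is created with its built-in `weight_decay` left at 0, otherwise the weights would be decayed twice.

```python
import torch

w = torch.ones(3, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)  # built-in weight_decay stays at 0
wd = 0.5

loss = (2.0 * w).sum()  # the gradient w.r.t. each weight is 2.0
optimizer.zero_grad()
loss.backward()

# Decoupled decay: shrink the weights directly; w.grad is not modified.
with torch.no_grad():
    for group in optimizer.param_groups:
        for param in group["params"]:
            param.mul_(1 - wd * group["lr"])

optimizer.step()
print(w.grad)  # unchanged by the decay step
print(w)       # decayed to 0.95, then moved by -lr * grad
```

Each weight goes from 1.0 to 0.95 through the decay and then to roughly 0.75 through the gradient step, while the stored gradient stays at 2.0.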

--------------------

## BONUS

What does the `model.parameters()` include?

`model.parameters()` returns an iterator over the learnable parameters of the model, that is, its weight and bias tensors. It is passed as an argument to an optimizer so that the optimizer can update all weight and bias values with one line of code: `optimizer.step()`.
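
As a small illustration, the sketch below lists what `parameters()` yields for a single linear layer. The layer sizes are arbitrary and chosen only for this example.

```python
import torch
from torch import nn

layer = nn.Linear(4, 2)  # 4 inputs, 2 outputs

params = list(layer.parameters())
print([tuple(p.shape) for p in params])  # [(2, 4), (2,)] -> weight, then bias
print(sum(p.numel() for p in params))    # 10 learnable values in total
```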

---

## Get the [names of parameters](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_parameters)

```py

import torch
from torchvision import models

# weights=None builds a randomly initialised model; older torchvision
# releases spelled this as pretrained=False
my_model = models.resnet50(weights=None)

for name, parameter in my_model.named_parameters():
    print(name)

```

Output

```
conv1.weight
bn1.weight
bn1.bias
layer1.0.conv1.weight
layer1.0.bn1.weight
layer1.0.bn1.bias

...
...
...

```

---


### Weight Decay Example

```py
import torch
import numpy as np

np.random.seed(123)
np.set_printoptions(8, suppress=True)

x_numpy = np.random.random((3, 4)).astype(np.double)
w_numpy = np.random.random((4, 5)).astype(np.double)

x_torch = torch.tensor(x_numpy, requires_grad=True)
w_torch = torch.tensor(w_numpy, requires_grad=True)

# --- one SGD step without weight decay ---

print('Original weights', w_torch)

lr = 0.1
sgd = torch.optim.SGD([w_torch], lr=lr, weight_decay=0)

y_torch = torch.matmul(x_torch, w_torch)
loss = y_torch.sum()

sgd.zero_grad()
loss.backward()
sgd.step()

w_grad = w_torch.grad.data.numpy()
print('0 weight decay', w_torch)

# --- same step from the same starting weights, with weight_decay=1 ---

w_torch = torch.tensor(w_numpy, requires_grad=True)

print('Reset Original weights', w_torch)

sgd = torch.optim.SGD([w_torch], lr=lr, weight_decay=1)

y_torch = torch.matmul(x_torch, w_torch)
loss = y_torch.sum()

sgd.zero_grad()
loss.backward()
sgd.step()

w_grad = w_torch.grad.data.numpy()
print('1 weight decay', w_torch)
```

Output

```
Original weights tensor([[0.4386, 0.0597, 0.3980, 0.7380, 0.1825],
        [0.1755, 0.5316, 0.5318, 0.6344, 0.8494],
        [0.7245, 0.6110, 0.7224, 0.3230, 0.3618],
        [0.2283, 0.2937, 0.6310, 0.0921, 0.4337]], dtype=torch.float64,
       requires_grad=True)
0 weight decay tensor([[ 0.2489, -0.1300,  0.2084,  0.5483, -0.0072],
        [ 0.0653,  0.4214,  0.4217,  0.5243,  0.7393],
        [ 0.5694,  0.4559,  0.5674,  0.1679,  0.2067],
        [ 0.0317,  0.0972,  0.4345, -0.1044,  0.2372]], dtype=torch.float64,
       requires_grad=True)
Reset Original weights tensor([[0.4386, 0.0597, 0.3980, 0.7380, 0.1825],
        [0.1755, 0.5316, 0.5318, 0.6344, 0.8494],
        [0.7245, 0.6110, 0.7224, 0.3230, 0.3618],
        [0.2283, 0.2937, 0.6310, 0.0921, 0.4337]], dtype=torch.float64,
       requires_grad=True)
1 weight decay tensor([[ 0.2050, -0.1360,  0.1686,  0.4745, -0.0254],
        [ 0.0478,  0.3683,  0.3685,  0.4608,  0.6544],
        [ 0.4969,  0.3948,  0.4951,  0.1356,  0.1705]
```

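As you can see, the weights are smaller when I use `weight_decay=1` compared to `weight_decay=0`. Each weight is lowered by an extra `lr * weight_decay * w`; for the first weight, 0.2489 - 0.1 * 1 * 0.4386 ≈ 0.2050.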
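
The gap between the two runs can be checked by hand against the learning rate. The sketch below copies the first row of each printed tensor (rounded to four decimals, so the comparison is only approximate) and assumes the plain SGD update in which the decay term is multiplied by `lr`.

```python
lr, weight_decay = 0.1, 1.0

# First row of each tensor, copied from the printed output above.
w_original = [0.4386, 0.0597, 0.3980, 0.7380, 0.1825]
w_no_decay = [0.2489, -0.1300, 0.2084, 0.5483, -0.0072]
w_decay = [0.2050, -0.1360, 0.1686, 0.4745, -0.0254]

for w0, plain, decayed in zip(w_original, w_no_decay, w_decay):
    # the decayed run should differ from the plain run by lr * weight_decay * w0
    predicted = plain - lr * weight_decay * w0
    print(f"predicted {predicted:.4f}   printed {decayed:.4f}")
```

The predicted and printed values agree up to rounding, which ties the amount of decay per step directly to the learning rate.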

