## Weight decay in PyTorch Neural Network

# [Link to my YouTube video explaining this whole notebook](https://youtu.be/hZE4Nja5zKA)

[![Imgur](https://imgur.com/ONcGrnS.png)](https://youtu.be/hZE4Nja5zKA)


Weight decay is a regularization technique that adds a small penalty to the loss function, usually the squared L2 norm of all the model's weights, scaled by a weight decay parameter.

```
loss = loss + weight_decay * (squared L2 norm of the weights)
```

![Imgur](https://imgur.com/fiIDhwe.png)
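
As a rough illustration of the penalty above, the sketch below adds a squared L2 term to the loss by hand. The toy data, the `nn.Linear` model and the value `1e-4` are assumptions chosen only for the example; they are not part of the original notebook.

```python
import torch

torch.manual_seed(0)

# Toy data and model, assumed purely for illustration
x = torch.randn(8, 4)
y = torch.randn(8, 1)
model = torch.nn.Linear(4, 1)

weight_decay = 1e-4  # assumed value for the weight decay parameter

# Ordinary data loss
data_loss = torch.nn.functional.mse_loss(model(x), y)

# Squared L2 norm summed over every parameter of the model
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

# Penalized loss: data loss plus the scaled penalty
loss = data_loss + weight_decay * l2_penalty
loss.backward()

print(data_loss.item(), loss.item())
```

Note that conventions differ by a constant factor: the `weight_decay` argument of `torch.optim.SGD` adds `weight_decay * w` to the gradient, which corresponds to a penalty of `(weight_decay / 2)` times the squared L2 norm.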

With `torch.optim.SGD`, the decay term is added to the gradient, so the weights are updated as follows:

```
w[t+1] = w[t] - learning_rate * (dw + weight_decay * w[t])
```

## How do we use weight decay ?

To use weight decay, pass the `weight_decay` argument to the optimizer, for example `torch.optim.SGD` or `torch.optim.Adam`. Both default to `weight_decay=0`; the examples here use `1e-4`.

In [None]:
import torch
import numpy as np

In [None]:
model = torch.nn.Linear(4, 5)  # placeholder model, defined only so this cell runs
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

### Get names of Parameters

In [1]:

import torch
import torchvision
from torch import nn
from torchvision import models


my_model = models.resnet50(weights=None)  # random init; older torchvision versions use pretrained=False

for name, parameter in my_model.named_parameters():
    print(name)

conv1.weight
bn1.weight
bn1.bias
layer1.0.conv1.weight
layer1.0.bn1.weight
layer1.0.bn1.bias
layer1.0.conv2.weight
layer1.0.bn2.weight
layer1.0.bn2.bias
layer1.0.conv3.weight
layer1.0.bn3.weight
layer1.0.bn3.bias
layer1.0.downsample.0.weight
layer1.0.downsample.1.weight
layer1.0.downsample.1.bias
layer1.1.conv1.weight
layer1.1.bn1.weight
layer1.1.bn1.bias
layer1.1.conv2.weight
layer1.1.bn2.weight
layer1.1.bn2.bias
layer1.1.conv3.weight
layer1.1.bn3.weight
layer1.1.bn3.bias
layer1.2.conv1.weight
layer1.2.bn1.weight
layer1.2.bn1.bias
layer1.2.conv2.weight
layer1.2.bn2.weight
layer1.2.bn2.bias
layer1.2.conv3.weight
layer1.2.bn3.weight
layer1.2.bn3.bias
layer2.0.conv1.weight
layer2.0.bn1.weight
layer2.0.bn1.bias
layer2.0.conv2.weight
layer2.0.bn2.weight
layer2.0.bn2.bias
layer2.0.conv3.weight
layer2.0.bn3.weight
layer2.0.bn3.bias
layer2.0.downsample.0.weight
layer2.0.downsample.1.weight
layer2.0.downsample.1.bias
layer2.1.conv1.weight
layer2.1.bn1.weight
layer2.1.bn1.bias
layer2.1.conv2.we
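
One common reason to list parameter names in this context is to apply weight decay to only some of them, typically leaving biases and BatchNorm parameters undecayed. The sketch below is one possible way to do that with optimizer parameter groups. The small `nn.Sequential` model and the "1-D parameters get no decay" rule are assumptions made for illustration, not something the notebook prescribes.

```python
import torch
from torch import nn

# Small stand-in model (assumed for illustration; ResNet-50 would work the same way)
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 6 * 6, 10),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Heuristic (an assumption): 1-D tensors are biases and BatchNorm scale/shift
    if param.ndim <= 1:
        no_decay.append(param)
        print(f"{name}: no weight decay")
    else:
        decay.append(param)
        print(f"{name}: weight decay")

# Each parameter group can carry its own weight_decay value
optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```

Splitting on `param.ndim` avoids hard-coding layer names, but matching on `name` (for example names containing `bn` or ending in `.bias`) would be an equally reasonable choice.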

## Check that weights are smaller when I apply weight_decay

In [3]:
np.random.seed(123)
np.set_printoptions(8, suppress=True)

x_np = np.random.random((3, 4)).astype(np.double)
weights_np = np.random.random((4, 5)).astype(np.double)

x_torch = torch.tensor(x_np, requires_grad=True)
weights_torch = torch.tensor(weights_np, requires_grad=True)

print('Original weights', weights_torch)


################ 0 weight decay  ##################


lr = 0.1
# Plain SGD, no weight decay: w <- w - lr * grad
sgd = torch.optim.SGD([weights_torch], lr=lr, weight_decay=0)

y_torch = torch.matmul(x_torch, weights_torch)
loss = y_torch.sum()

sgd.zero_grad()
loss.backward()
sgd.step()  # apply a single update step

w_grad = weights_torch.grad.detach().numpy()  # gradient used in the step
print('0 weight decay', weights_torch)


################ NOW 1 weight decay ######################

weights_torch = torch.tensor(weights_np, requires_grad=True)

print('Reset Original weights', weights_torch)

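# SGD with weight decay: w <- w - lr * (grad + weight_decay * w)
sgd = torch.optim.SGD([weights_torch], lr=lr, weight_decay=1)

y_torch = torch.matmul(x_torch, weights_torch)
loss = y_torch.sum()

sgd.zero_grad()
loss.backward()
sgd.step()  # apply a single update step

w_grad = weights_torch.grad.detach().numpy()  # gradient used in the step
print('1 weight decay', weights_torch)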

Original weights tensor([[0.4386, 0.0597, 0.3980, 0.7380, 0.1825],
        [0.1755, 0.5316, 0.5318, 0.6344, 0.8494],
        [0.7245, 0.6110, 0.7224, 0.3230, 0.3618],
        [0.2283, 0.2937, 0.6310, 0.0921, 0.4337]], dtype=torch.float64,
       requires_grad=True)
0 weight decay tensor([[ 0.2489, -0.1300,  0.2084,  0.5483, -0.0072],
        [ 0.0653,  0.4214,  0.4217,  0.5243,  0.7393],
        [ 0.5694,  0.4559,  0.5674,  0.1679,  0.2067],
        [ 0.0317,  0.0972,  0.4345, -0.1044,  0.2372]], dtype=torch.float64,
       requires_grad=True)
Reset Original weights tensor([[0.4386, 0.0597, 0.3980, 0.7380, 0.1825],
        [0.1755, 0.5316, 0.5318, 0.6344, 0.8494],
        [0.7245, 0.6110, 0.7224, 0.3230, 0.3618],
        [0.2283, 0.2937, 0.6310, 0.0921, 0.4337]], dtype=torch.float64,
       requires_grad=True)
1 weight decay tensor([[ 0.2050, -0.1360,  0.1686,  0.4745, -0.0254],
        [ 0.0478,  0.3683,  0.3685,  0.4608,  0.6544],
        [ 0.4969,  0.3948,  0.4951,  0.1356,  0.1705]

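#### As you can see, the weights are smaller with weight_decay=1 than with weight_decay=0

With `weight_decay=1` and `lr=0.1`, each weight is reduced by an additional `lr * weight_decay * w = 0.1 * w`. For example, the first weight (0.4386) becomes 0.2489 without weight decay and 0.2050 with it; the difference, 0.0439, is 0.1 × 0.4386.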
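
The same comparison can be checked numerically. The sketch below recomputes one `torch.optim.SGD` step by hand (no momentum, default settings) and compares it with the optimizer's result. The random tensors and the seed are assumptions for the example and differ from the values printed above.

```python
import torch

torch.manual_seed(0)
lr, weight_decay = 0.1, 1.0

# Random weights and inputs, assumed for illustration
w = torch.rand(4, 5, dtype=torch.float64, requires_grad=True)
x = torch.rand(3, 4, dtype=torch.float64)
w_before = w.detach().clone()  # keep a copy of the weights before the step

sgd = torch.optim.SGD([w], lr=lr, weight_decay=weight_decay)

loss = torch.matmul(x, w).sum()
sgd.zero_grad()
loss.backward()
grad = w.grad.detach().clone()  # raw gradient, without the decay term
sgd.step()

# Manual update: the decay term is added to the gradient, then scaled by lr
expected = w_before - lr * (grad + weight_decay * w_before)

print(torch.allclose(w.detach(), expected))
```

If the check passes, the optimizer's step matches `w - lr * (dw + weight_decay * w)`, so the extra shrinkage per step is `lr * weight_decay * w`.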

[Source](https://discuss.pytorch.org/t/how-does-sgd-weight-decay-work/33105)