## Learning rate scheduler
`Learning rate` is one of the important hyperparameter in deeplearning model. The proper value of learning rate is required to converge the model to the local minima.

- If we set learning rate to large value the model learn too much( rapid learning )
  - Unable to converge to a good local minima(overshoot the local lowest value)
- If we set learning rate to small value the model learn too little( slow learning )
  - May take too long or unable to convert to a good local minima


## Need for learning rate scheduler
 - 1) Faster Convergence
 - 2) High Accuracy

## Type of basic learning rate scheduler
- 1) Step-wise decay
- 2) Reduce on loss plateau decay

###  Step-wise decay

 step-wise learning rate decay at every epoch

#### Imports

In [1]:
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as transforms

In [2]:
torch.manual_seed(0)

<torch._C.Generator at 0x7f0b17675a70>

In [44]:
from torch.optim.lr_scheduler import StepLR

#### Loading dataset

In [45]:
train_set = torchvision.datasets.FashionMNIST(
    root="./data",
    train=True,
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor()
    ]))
test_set = torchvision.datasets.FashionMNIST(
    root="./data",
    train=False,
    download=True,
    transform=transforms.Compose([
        transforms.ToTensor()
    ]))

In [46]:
print(train_set)

Dataset FashionMNIST
    Number of datapoints: 60000
    Root location: ./data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
           )


In [47]:
print(test_set)

Dataset FashionMNIST
    Number of datapoints: 10000
    Root location: ./data
    Split: Test
    StandardTransform
Transform: Compose(
               ToTensor()
           )


In [82]:
batch_size = 256
num_epochs = 5

In [83]:
train_loader = torch.utils.data.DataLoader(dataset=train_set,
                                          batch_size=batch_size,
                                          shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_set,
                                         batch_size=batch_size,
                                         shuffle=True)

#### Creating model

In [84]:
class FeedForward(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedForward, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(in_features=input_dim, out_features=hidden_dim) 
        # Non-linearity
        self.relu = nn.ReLU()
        # Linear function (readout)
        self.fc2 = nn.Linear(in_features=hidden_dim, out_features=output_dim)  

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        # Linear function (readout)
        out = self.fc2(out)
        return out

In [85]:
input_dim = 28*28*1
hidden_dim = 100
output_dim = 100

In [86]:
net = FeedForward(input_dim, hidden_dim, output_dim)

In [87]:
print(net)

FeedForward(
  (fc1): Linear(in_features=784, out_features=100, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=100, out_features=100, bias=True)
)


In [88]:
learning_rate= 0.1
optimizer = torch.optim.SGD(net.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)
loss_obj = nn.CrossEntropyLoss()

#### Instantiate the step learning scheduler class
- step_size: at how many multiples of epoch you decay
- step_size = 1 , after every one epoch, new_lr = lr * gamma
- step_size = 2, after every two epoch, new_lr = lr * gamma

where gamma is decaying factor

In [89]:
scheduler = StepLR(optimizer, step_size=4, gamma=0.96)

#### Training the model

In [99]:
for epoch in range(num_epochs):
    scheduler.step()
    # Print learning rate
    print('Epoch:', epoch,'LR:', scheduler.get_lr())
    correct = 0    
    total = 0
    for i, (images, labels) in enumerate(train_loader):
        # load images as variable
        images = images.view(-1, 28*28).requires_grad_()
        
        # clear gradients w.r.t parameters
        optimizer.zero_grad()
        
        # forward pass to ger output logits
        outputs = net(images)
        
        # loss
        loss = loss_obj(outputs, labels)
        
        # getting gradients of loss with r t parameters
        loss.backward()
        
        # updating parameters
        optimizer.step()
        
         # Get predictions from the maximum value
        _, predicted = torch.max(outputs.data, 1)

        # Total number of labels
        total += labels.size(0)

        # Total correct predictions
        correct += (predicted == labels).sum().item()
        
    accuracy = 100 *  correct / total 
    
    print("Training Loss: ", loss.item(), "Training Accuracy: ", accuracy, "%")
    print("")
    print("")
    

Epoch: 0 LR: [0.07213895789838333]
Training Loss:  0.24062193930149078 Training Accuracy:  93.24166666666666 %


Epoch: 1 LR: [0.06648326359915008]
Training Loss:  0.18453781306743622 Training Accuracy:  93.45166666666667 %


Epoch: 2 LR: [0.069253399582448]
Training Loss:  0.26101574301719666 Training Accuracy:  93.49166666666666 %


Epoch: 3 LR: [0.069253399582448]
Training Loss:  0.20341770350933075 Training Accuracy:  93.555 %


Epoch: 4 LR: [0.069253399582448]
Training Loss:  0.2323964238166809 Training Accuracy:  93.61 %




In [100]:
with torch.no_grad():
    correct = 0    
    total = 0
    # Iterate through test dataset
    for images, labels in test_loader:
        # Load images to a Torch Variable
        images = images.view(-1, 28*28)

        # Forward pass only to get logits/output
        outputs = net(images)
        
        # loss
        loss = loss_obj(outputs, labels)

        # Get predictions from the maximum value
        _, predicted = torch.max(outputs.data, 1)

        # Total number of labels
        total += labels.size(0)

        # Total correct predictions
        correct += (predicted == labels).sum().item()
    accuracy = 100 *  correct / total 

    # Print Loss
    print('Testing_Loss: {}. Testing_Accuracy: {} %'.format(testing_loss.item(), accuracy))
    print("")
    print("")


Testing_Loss: 0.4492620527744293. Testing_Accuracy: 87.59 %




## Pointers on step-wise decay
 - You should want to decay your LR gradually when you are training more
    - Converge too fast , to a crappy loss, if you decay rapidly.
 - To decay slower
    - Larger gamma
    - Larger step size
    
 - You should always decay learning rate slowly.

### Reduce on Loss Plateau decay
- Reduce learning rate whenever loss plateaus
  - Patience: number of epochs with no improvement after which learning rate will be reduced
     - Patience = 0
  - Factor: multiplier to decrease the learning rate, lr = lr * factor
     - Factor = 0.1
    

##### lr = lr * factor 
##### mode='max': look for the maximum validation accuracy to track
##### patience: number of epochs - 1 where loss plateaus before decreasing LR
        - patience = 0, after 1 bad epoch, reduce LR
##### factor = decaying factor

In [116]:
from torch.optim.lr_scheduler import ReduceLROnPlateau
scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=1, verbose=True)

In [117]:
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images as Variable
        images = images.view(-1, 28*28).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = net(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = loss_obj(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1

        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Load images to a Torch Variable
                images = images.view(-1, 28*28)

                # Forward pass only to get logits/output
                outputs = net(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                 # Without .item(), it is a uint8 tensor which will not work when you pass this number to the scheduler
                correct += (predicted == labels).sum().item()

            accuracy = 100 * correct / total

            # Print Loss
            # print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.data[0], accuracy))

    # Decay Learning Rate, pass validation accuracy for tracking at every epoch
    print('Epoch {} completed'.format(epoch+1))
    print('Loss: {}. Accuracy: {}'.format(loss.item(), accuracy))
    print('-'*20)
    scheduler.step(accuracy)

Epoch 1 completed
Loss: 0.12008205056190491. Accuracy: 88.74
--------------------
Epoch 2 completed
Loss: 0.10750473290681839. Accuracy: 88.74
--------------------
Epoch 3 completed
Loss: 0.08275134861469269. Accuracy: 88.74
--------------------
Epoch     3: reducing learning rate of group 0 to 6.6045e-08.
Epoch 4 completed
Loss: 0.042819853872060776. Accuracy: 88.74
--------------------
Epoch 5 completed
Loss: 0.0938592255115509. Accuracy: 88.74
--------------------
Epoch     5: reducing learning rate of group 0 to 3.3023e-08.


- In these examples, we used patience=1 because we are running few epochs
  - You should look at a larger patience such as 5 if for example you ran 500 epochs.
- You should experiment with 2 properties
  - Patience
  - Decay factor