In [None]:
import torch
import torch.nn.functional as F

# Model training with Stochastic gradient descent

## Gradient Descent Algorithm

```py
For each training epoch:
  overall_loss L = 0
  For each training example:
    output = model_call(z)
    overall_loss L = overall_loss L + l(output, label) # l() is the loss function

  # after the inner loop, L holds the summed loss over the whole dataset
  overall_loss L = overall_loss L/n # normalization: n is the number of training examples; gives a reasonable scale and easier comparisons
  # now use this loss to compute the gradient of the loss with respect to the network parameters, in our case w and b
  # once we have these gradients, we use them to update the parameters (once per epoch)
  w = w - lr*gradient_of_loss_wrt_w
  b = b - lr*gradient_of_loss_wrt_b
  # lr is the learning rate (alpha), a small value that scales the gradient so that our updates are not too large
  # large updates might disturb the training
  # alpha is a hyperparameter that has to be tuned to find the right value
```

**Notes**

Gradient descent computes the loss over the whole training set before making an update. One epoch means doing one forward pass over the entire dataset. Because the loss is based on the whole epoch, the parameters are also updated only once per epoch.
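
The loop above can be sketched in PyTorch roughly as follows. This is a minimal sketch under assumptions, not the notebook's own training code: the toy dataset `X`, `y`, the learning rate `lr`, and `num_epochs` are made-up values for illustration, and the model is the single-feature logistic regression used later in this notebook.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy dataset: 8 examples with one feature each and binary labels
X = torch.tensor([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

w = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
lr = 0.5          # assumed learning rate
num_epochs = 20   # assumed number of epochs

epoch_losses = []
num_updates = 0
for epoch in range(num_epochs):
    overall_loss = torch.tensor(0.0)
    for x_i, y_i in zip(X, y):
        a = torch.sigmoid(x_i * w + b)  # forward pass for one example
        overall_loss = overall_loss + F.binary_cross_entropy(a, y_i.reshape(1))
    overall_loss = overall_loss / len(X)  # normalize by n

    # one gradient computation and one parameter update per epoch
    grad_w, grad_b = torch.autograd.grad(overall_loss, [w, b])
    with torch.no_grad():
        w -= lr * grad_w
        b -= lr * grad_b
    num_updates += 1
    epoch_losses.append(overall_loss.item())

print(num_updates, epoch_losses[0], epoch_losses[-1])
```

Note that `num_updates` equals `num_epochs`: no matter how many examples the dataset has, the parameters change only once per pass.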

## Stochastic Gradient Descent

Stochastic gradient descent is a flavor of gradient descent with more frequent updates.

**Stochastic gradient descent algorithm**
```py
For each training epoch:
  For each training example:
    output = model_call(z)
    L = loss_function(output, label)

    # compute gradients of the loss wrt the parameters & update the parameters
    w = w - lr*gradient_of_loss_wrt_w
    b = b - lr*gradient_of_loss_wrt_b
```

- It's similar to the previous method: we iterate for multiple training epochs, and in each epoch we iterate over each training example.
- Then we compute the output from the model.
- Then we compute the loss. In gradient descent we kept accumulating the loss over the whole training epoch before computing the gradients.
- **In contrast, in stochastic gradient descent we compute the gradient of the loss for a single training example**
- and using this gradient we update the model parameters right after that single training example.
- In conclusion, we are **updating the model after each training example**.
- So the difference is that we get more frequent updates as we iterate over the dataset.
- 🟥 However, updating the model parameters after each training example is also not ideal, because the single-example loss is only an approximation of the overall loss, and with just one training example that approximation can be pretty rough.
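
The per-example update described in the list above could look like this in PyTorch. It is a sketch under assumptions: the toy data `X`, `y`, the learning rate `lr`, the epoch count, and the helper `mean_loss` are hypothetical and only serve to make the example runnable.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy dataset: 8 examples with one feature each and binary labels
X = torch.tensor([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

w = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
lr = 0.1         # assumed learning rate
num_epochs = 5   # assumed number of epochs

def mean_loss():
    # hypothetical helper: average loss over the whole dataset, for monitoring only
    with torch.no_grad():
        return F.binary_cross_entropy(torch.sigmoid(X * w + b), y).item()

loss_before = mean_loss()
num_updates = 0
for epoch in range(num_epochs):
    # in practice the example order is usually shuffled each epoch; kept fixed here
    for x_i, y_i in zip(X, y):
        a = torch.sigmoid(x_i * w + b)
        loss = F.binary_cross_entropy(a, y_i.reshape(1))  # loss of ONE example
        grad_w, grad_b = torch.autograd.grad(loss, [w, b])
        with torch.no_grad():
            w -= lr * grad_w  # update immediately after this example
            b -= lr * grad_b
        num_updates += 1
loss_after = mean_loss()

print(num_updates, loss_before, loss_after)
```

Here `num_updates` is `num_epochs * len(X)`, that is n updates per epoch instead of one.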

To improve this rough approximation, and to use linear algebra to speed up training, there is a concept called **mini batch gradient descent**, which is a flavor of stochastic gradient descent.

**Stochastic Gradient Descent With Mini Batch**

Mini batch gradient descent is a hybrid of the gradient descent and stochastic gradient descent algorithms.

**GD**: 1 update per epoch

**SGD**: n updates per epoch, where n is the total number of training examples.

**MGD (Mini batch GD)**:
- It's a hybrid between GD and SGD.
- Here we form small groups, or batches, of training examples and make 1 update after each batch.
- What is the optimal mini batch size? The `mini batch size is a tuning parameter`. Typical mini batch sizes are powers of 2 (2^2 = 4, etc.), which has to do with GPU architectures and using our hardware efficiently.

```py
For each training epoch:
  For each minibatch: # new, iterating over mini batches
    overall_loss L = 0

    For each training example in minibatch:
      output = model_call(z)
      overall_loss L = overall_loss L + loss_funct(output, label)
    overall_loss L = overall_loss L / batch_size # normalize by the number of examples in this minibatch
    # compute gradients of the loss wrt the parameters & update the parameters once per batch
    w = w - lr*gradient_of_loss_wrt_w
    b = b - lr*gradient_of_loss_wrt_b
```
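
A runnable version of the mini batch loop might look like the sketch below. Several things here are assumptions rather than part of the original notes: the toy data, `lr`, `batch_size`, `num_epochs`, the random shuffling with `torch.randperm`, and the choice to process each batch in one vectorized call instead of looping over its examples. `F.binary_cross_entropy` averages over the batch by default, which plays the role of the division by the batch size.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy dataset: feature matrix of shape (8, 1) and binary labels
X = torch.tensor([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

w = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
lr = 0.5          # assumed learning rate
batch_size = 4    # assumed mini batch size
num_epochs = 10   # assumed number of epochs

num_updates = 0
for epoch in range(num_epochs):
    perm = torch.randperm(len(X))  # shuffle so batches mix both classes
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        x_batch, y_batch = X[idx], y[idx]

        # whole batch at once: (batch_size, 1) @ (1,) gives (batch_size,)
        a = torch.sigmoid(x_batch @ w + b)
        loss = F.binary_cross_entropy(a, y_batch)  # mean over the batch

        grad_w, grad_b = torch.autograd.grad(loss, [w, b])
        with torch.no_grad():
            w -= lr * grad_w  # one update per mini batch
            b -= lr * grad_b
        num_updates += 1

print(num_updates, w.item(), b.item())
```

With 8 examples and a batch size of 4 this gives 2 updates per epoch, which sits between GD (1 per epoch) and SGD (8 per epoch).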

## Advantages of Mini Batch GD

- The gradients from these mini batches are unbiased estimates of the full gradient: they don't systematically deviate from the gradient computed on the whole dataset.
- Updates are less noisy than with SGD.
- Faster than GD because there is more than 1 update per epoch, so the model learns faster.
- Much better GPU utilization than SGD: we can pass multiple training examples to the model at once, which lets it use matrix multiplication instead of processing training examples one at a time. For more, see the tensor notebook.
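
One way to get a feel for the first point is to compare mini batch gradients with the full-dataset gradient at the same parameter values. The sketch below uses hypothetical toy data and a hypothetical helper `loss_grads`; it splits the data into equally sized, non-overlapping batches, in which case the average of the batch gradients matches the full gradient, even though each individual batch gradient differs from it.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy dataset: feature matrix of shape (8, 1) and binary labels
X = torch.tensor([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0])

w = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)

def loss_grads(x_part, y_part):
    # hypothetical helper: gradient of the mean loss on the given examples
    a = torch.sigmoid(x_part @ w + b)
    loss = F.binary_cross_entropy(a, y_part)
    return torch.autograd.grad(loss, [w, b])

# gradient on the whole dataset
full_w, full_b = loss_grads(X, y)

# gradients on 4 disjoint mini batches of size 2
batch_size = 2
batch_grads = [
    loss_grads(X[i:i + batch_size], y[i:i + batch_size])
    for i in range(0, len(X), batch_size)
]
mean_w = torch.stack([g[0] for g in batch_grads]).mean(dim=0)
mean_b = torch.stack([g[1] for g in batch_grads]).mean(dim=0)

print("full gradient:      ", full_w, full_b)
print("mean of batch grads:", mean_w, mean_b)
print("individual batch grads wrt w:", [g[0].item() for g in batch_grads])
```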

# Hyperparameters to tune in GD
- learning rate
- mini batch size


## Calculate gradient using PyTorch


In [None]:
# Model Parameters
w_1 = torch.tensor([0.23], requires_grad=True) # initialize model parameters; requires_grad=True means we want the gradient of the loss wrt these parameters
b = torch.tensor([0.1], requires_grad=True)


In [None]:
# input and true label
x_1 = torch.tensor([1.23])
y = torch.tensor([1.])

In [None]:
# calculate the weighted sum
u = x_1 * w_1 # weighted sum
z = u + b # add bias
print(z)

tensor([0.3829], grad_fn=<AddBackward0>)


In [None]:
a = torch.sigmoid(z)
print(a)

tensor([0.5946], grad_fn=<SigmoidBackward0>)


In [None]:
l = F.binary_cross_entropy(a, y) # calculate the loss
print("loss:", l)


loss: tensor(0.5199, grad_fn=<BinaryCrossEntropyBackward0>)
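
To see what `F.binary_cross_entropy` computes here, the loss can be reproduced by hand. The sketch below assumes the standard binary cross-entropy formula, `-(y*log(a) + (1-y)*log(1-a))`, which is not written out in this notebook, and reuses the parameter and input values from the cells above.

```python
import torch
import torch.nn.functional as F

# same values as in the cells above
w_1 = torch.tensor([0.23])
b = torch.tensor([0.1])
x_1 = torch.tensor([1.23])
y = torch.tensor([1.])

a = torch.sigmoid(x_1 * w_1 + b)

# binary cross-entropy written out by hand, averaged over the (single) example
manual_loss = -(y * torch.log(a) + (1 - y) * torch.log(1 - a)).mean()
builtin_loss = F.binary_cross_entropy(a, y)

print(manual_loss, builtin_loss)
```

Since the label is 1, only the `log(a)` term contributes, so the loss is simply `-log(a)`.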


In [None]:
# now we want to compute the gradient of the loss, i.e. the partial derivatives wrt w_1 and b
# we can compute partial derivatives using PyTorch's autograd engine
from torch.autograd import grad

grad_L_w1 = grad(l, w_1, retain_graph=True) # computing gradients automatically
# retain_graph=True keeps the computation graph in memory; otherwise PyTorch frees the graph we built, and we still need it to compute the gradient wrt the bias b
grad_L_w1


(tensor([-0.4987]),)

In [None]:
grad_L_b = grad(l, b, retain_graph=True)
grad_L_b

(tensor([-0.4054]),)
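
As a sanity check, the two values above can be compared against the gradients worked out by hand. The sketch assumes the well-known closed form for a sigmoid followed by binary cross-entropy, where the derivative of the loss wrt `z` is `a - y`; that derivation is not part of this notebook. By the chain rule, the gradient wrt `b` is then `a - y` and the gradient wrt `w_1` is `(a - y) * x_1`.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

# same values as in the cells above
w_1 = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
x_1 = torch.tensor([1.23])
y = torch.tensor([1.])

a = torch.sigmoid(x_1 * w_1 + b)
l = F.binary_cross_entropy(a, y)

# gradients from autograd, both parameters in one call
auto_w, auto_b = grad(l, [w_1, b])

# gradients by hand via the chain rule (assumed closed form: dL/dz = a - y)
manual_b = (a - y).detach()
manual_w = manual_b * x_1

print(auto_w, manual_w)
print(auto_b, manual_b)
```

Passing a list of parameters to `grad` returns all gradients in one call, so `retain_graph=True` is not needed in this variant.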

## .backward() method
Instead of calling the grad function manually for each parameter, we can compute the partial derivatives wrt all model parameters with a single .backward() call.


In [None]:
l.backward()

In [None]:
w_1.grad

tensor([-0.4987])

In [None]:
b.grad

tensor([-0.4054])
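
The gradients stored in `.grad` are what the update rule `w = w - lr*gradient_of_loss_wrt_w` from the pseudocode needs. Below is a minimal sketch of a single update step, assuming a learning rate of `lr = 0.1` (a made-up value). The update is wrapped in `torch.no_grad()` so that it is not recorded in the computation graph, and the gradients are reset afterwards because PyTorch accumulates into `.grad` across `.backward()` calls.

```python
import torch
import torch.nn.functional as F

# same values as in the cells above
w_1 = torch.tensor([0.23], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
x_1 = torch.tensor([1.23])
y = torch.tensor([1.])

l = F.binary_cross_entropy(torch.sigmoid(x_1 * w_1 + b), y)
l.backward()  # fills w_1.grad and b.grad

lr = 0.1  # assumed learning rate
with torch.no_grad():
    w_1 -= lr * w_1.grad
    b -= lr * b.grad

# reset the accumulated gradients before the next backward pass
w_1.grad.zero_()
b.grad.zero_()

new_l = F.binary_cross_entropy(torch.sigmoid(x_1 * w_1 + b), y)
print("loss before:", l.item(), "loss after:", new_l.item())
```

Because both gradients are negative, the step increases `w_1` and `b`, which pushes the prediction towards the label 1 and lowers the loss on this example.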