# PyTorch basics and Linear Regression
### Back-propagation: differences between gradient descent, stochastic gradient descent and batch gradient descent, with examples

## PyTorch basics

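### Difference between a NumPy array and a PyTorch tensor
PyTorch tensors can be stored and processed on a GPU, which performs far more matrix operations in parallel than a CPU. NumPy arrays live in CPU memory only.
GPU code is normally written in CUDA, a C-like programming language, and the data has to be moved from the CPU to the GPU first.
PyTorch implements its tensor operations as CUDA kernels, so the same Python code runs on the GPU once the tensors have been moved there.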
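
As a minimal sketch of the CPU-to-GPU move described above (the names `device`, `t` and `result` are illustrative, and the code falls back to the CPU when no CUDA device is present):

```python
import torch

# Pick the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.tensor([[1., 2.], [3., 4.]])  # created in CPU memory
t_dev = t.to(device)                    # copy to the chosen device
result = t_dev @ t_dev                  # the matrix product runs on that device

print(result.device)
print(result.cpu())                     # bring the result back to CPU memory
```

The Python code is identical on both devices; only the `.to(device)` call changes where the work happens.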

In [1]:
# Enable tab completion in the interactive shell
import rlcompleter, readline
readline.parse_and_bind('tab:complete')

In [2]:
import torch

In [3]:
mat = torch.tensor([[[2,3],[3,4]] , [[2,5],[4,7]]])

In [4]:
mat

tensor([[[2, 3],
         [3, 4]],

        [[2, 5],
         [4, 7]]])

In [5]:
import numpy as np
ar1 = np.array([[[2,4],[3,5]],[[6,3],[1,2]]])
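# Ragged input (the last row has 3 elements): older NumPy silently built an object array here;
# recent NumPy releases (1.24 and later) raise a ValueError unless dtype=object is requested explicitly
ar = np.array([[[2,4],[3,5]],[[6,3],[1,2,6]]], dtype=object)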


In [6]:
mat = torch.tensor([[[2,4],[3,5]],[[6,3],[1,2,6]]]) #This  gives error

ValueError: expected sequence of length 2 at dim 2 (got 3)

In [7]:
w = torch.tensor(3., requires_grad=True)
u = torch.tensor(4., requires_grad=True)

In [8]:
y = u + w
print(y)
y.backward()
#derivative
print("dy/dw = " , w.grad )

tensor(7., grad_fn=<AddBackward0>)
dy/dw =  tensor(1.)


In [9]:
x = np.array([1.,2.])
z = torch.tensor(x)  # convert array to tensor (copies the data)
print(type(z))
y = torch.from_numpy(x)  # convert array to tensor (shares memory with x)
print(type(y))
m = y.numpy()  # convert tensor to ndarray (shares memory with y)
print(type(m))

<class 'torch.Tensor'>
<class 'torch.Tensor'>
<class 'numpy.ndarray'>


In [10]:
y.is_cuda # False: this tensor is stored in CPU memory, not on the GPU

False
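
The tensor `y` above was created with `torch.from_numpy`, so it lives in CPU memory and shares its buffer with the NumPy array, whereas `torch.tensor` makes a copy. A small sketch of the difference (the names `copied` and `shared` are illustrative):

```python
import numpy as np
import torch

x = np.array([1., 2.])
copied = torch.tensor(x)      # independent copy of the data
shared = torch.from_numpy(x)  # view onto the same memory as x

x[0] = 100.                   # modify the NumPy array in place

print(copied)  # first element is still 1.0
print(shared)  # first element is now 100.0
```

This matters when an array is modified after conversion: only the copy is safe from such changes.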

In [11]:
torch.tensor(3, dtype=torch.float32)
#torch.tensor(3, dtype='float32')#gives error
p = np.array(3, dtype='float32')

In [12]:
x1 = torch.tensor([1,2])
x2 = torch.tensor([3,1])
z = x1 @ x2
print("Dot product of two 1-D tensors using @",z)

p = x1*x2
print("Element-wise multiplication using *",p)

q = torch.dot(x1,x2)
print("Dot product using torch.dot", q)


Dot product of two 1-D tensors using @ tensor(5)
Element-wise multiplication using * tensor([3, 2])
Dot product using torch.dot tensor(5)


## Linear Regression using basic tensor operations

1. Shape = [number of rows, number of columns]

2. We need to predict the yield of apples and oranges -> target shape = (number of examples, number of targets) -> [5,2]

3. The features are temperature, rainfall and humidity -> input shape = (number of examples, number of features) -> [5,3]

4. Weight matrix = one weight for each (target, feature) pair. The weights are shared across all examples.

5. Shape of weight matrix = [number of targets, number of input features] = [2,3]

6. Bias term -> when all input features are zero the output need not be zero; the bias term sets this offset. The bias is also shared across all examples, so there is one bias term per target.

7. Shape of bias = [number of targets] = [2]; it is broadcast across the 5 rows when added to the [5,2] prediction matrix.

8. yield_apple  = w11 * temp + w12 * rainfall + w13 * humidity + b1
9. yield_orange = w21 * temp + w22 * rainfall + w23 * humidity + b2



Logic: Randomly initialise the weight matrix and the bias vector, then adjust them repeatedly to lower the loss, which measures the difference between the actual and predicted values. For this we need the gradients of the loss with respect to w and b.
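
The shapes listed above can be checked with a short sketch. The input values here are random placeholders rather than the real measurements, and `manual` is an illustrative name:

```python
import torch

torch.manual_seed(0)
x = torch.rand(5, 3)                       # 5 examples, 3 features (placeholder values)
w = torch.randn(2, 3, requires_grad=True)  # one row of weights per target
b = torch.randn(2, requires_grad=True)     # one bias per target

yhat = x @ w.t() + b                       # [5,3] @ [3,2] + [2] -> [5,2]
print(yhat.shape)

# First example, first target, written out as in the yield_apple formula
manual = w[0, 0] * x[0, 0] + w[0, 1] * x[0, 1] + w[0, 2] * x[0, 2] + b[0]
print(torch.isclose(yhat[0, 0], manual))
```

The single matrix expression computes both yield formulas for all five examples at once.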

In [13]:
# 5 examples, each example gives Temp,rainfall,humidity values. Eg: [73(temp),67(rainfall),43(humidity)]
inputs = np.array([[73, 67, 43], 
                   [91, 88, 64], 
                   [87, 134, 58], 
                   [102, 43, 37], 
                   [69, 96, 70]], dtype='float32')
print(inputs.shape)
x = torch.from_numpy(inputs) #convert to tensor

(5, 3)


In [14]:
targets = np.array([[56, 70], 
                    [81, 101], 
                    [119, 133], 
                    [22, 37], 
                    [103, 119]], dtype='float32')
print(targets.shape)
y = torch.from_numpy(targets)
print("Range of target ",np.min(targets)," - " , np.max(targets))

(5, 2)
Range of target  22.0  -  133.0


In [15]:
# Initialise the weight matrix with random values drawn from the standard normal distribution (mean=0, std deviation=1); about 68% of the values fall between -1 and +1
torch.manual_seed(3)
w =torch.randn(2,3, requires_grad=True)
#w = torch.empty((2,3)).normal_(mean=0,std=1).requires_grad_() #Other way of initialising the weights
b =torch.randn(2,requires_grad=True)
print(w)
print(b)

tensor([[ 0.8033,  0.1748,  0.0890],
        [-0.6137,  0.0462, -1.3683]], requires_grad=True)
tensor([0.3375, 1.0111], requires_grad=True)


In [16]:
def model(x):
    yhat = x @ w.t() + b
    return yhat

In [17]:
yhat = model(x)
print("Predicted matrix ", yhat)
print("Actual target ", y)

Predicted matrix  tensor([[  74.5165,  -99.5312],
        [  94.5155, -138.3418],
        [  98.8109, -125.5529],
        [  93.0817, -110.2279],
        [  78.7760, -132.6801]], grad_fn=<AddBackward0>)
Actual target  tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


### Calculate loss -> MSE
1. y - yhat
2. sqr = square(y - yhat)
3. sum(all elements in sqr) / totalElements

In [18]:
#Function : Cost function mse 
def loss_mse(y,yhat):
    diff = y- yhat
    sqr = diff*diff  # * -> elementwise multiplication
    mse = torch.sum(sqr)/sqr.numel()
    return mse
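
As a sanity check, the hand-written MSE can be compared with PyTorch's built-in `F.mse_loss` on a tiny example. The four values below are made up for illustration:

```python
import torch
import torch.nn.functional as F

def loss_mse(y, yhat):
    diff = y - yhat
    sqr = diff * diff                       # element-wise square
    return torch.sum(sqr) / sqr.numel()     # mean over all elements

y = torch.tensor([[56., 70.], [81., 101.]])
yhat = torch.tensor([[50., 75.], [80., 100.]])

manual = loss_mse(y, yhat)       # (36 + 25 + 1 + 1) / 4 = 15.75
builtin = F.mse_loss(yhat, y)    # default reduction is the mean
print(manual.item(), builtin.item())
```

Both calls average the squared differences over every element of the target matrix.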

### INITIAL LOSS BEFORE APPLYING GRADIENT DESCENT

In [19]:
print("INITIAL LOSS BEFORE APPLYING GRADIENT DESCENT")
loss = loss_mse(y,yhat)
print("Average Loss", loss)
elementwise_loss = torch.sqrt(loss)  # square root of the MSE, i.e. the root-mean-square error (RMSE)
print("Average loss in each element ",elementwise_loss)
print("On average, each element in the prediction differs from the actual target by about", elementwise_loss)

INITIAL LOSS BEFORE APPLYING GRADIENT DESCENT
Average Loss tensor(24446.6367, grad_fn=<DivBackward0>)
Average loss in each element  tensor(156.3542, grad_fn=<SqrtBackward>)
On average, each element in the prediction differs from the actual target by about tensor(156.3542, grad_fn=<SqrtBackward>)


### Compute gradients - 1 ITERATION/1 EPOCH

In [20]:
loss.backward()

In [21]:
print(w)
wgrad = w.grad
print(wgrad)
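# Inference
# The gradient is the local slope of the loss with respect to each weight.
# dL/dw11 = +1280.81: increasing w11 slightly increases the loss, so w11 should be decreased by a small step (learning rate * gradient), not by 1280.81 itself.
# dL/dw12 = +91.13: increasing w12 slightly increases the loss, though less steeply.
# dL/dw21 = -17806.63: increasing w21 slightly decreases the loss, so w21 should be increased by a small step.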

tensor([[ 0.8033,  0.1748,  0.0890],
        [-0.6137,  0.0462, -1.3683]], requires_grad=True)
tensor([[  1280.8098,     91.1297,    284.9165],
        [-17806.6328, -19511.7695, -12133.7656]])


In [22]:
print(b)
bgrad = b.grad
print(bgrad)
# Inference
# dL/db1 = +11.74: a small increase in b1 increases the loss
# dL/db2 = -213.27: a small increase in b2 decreases the loss

tensor([0.3375, 1.0111], requires_grad=True)
tensor([  11.7401, -213.2668])


In [23]:
# Update the weight and bias matrices
# The update itself must not be tracked by autograd, hence torch.no_grad()
# We multiply the gradient by a small number (learning rate, alpha) so that the weights do not change by a large amount in one step
# Afterwards the gradients are reset to zero, because backward() adds to .grad instead of overwriting it

with torch.no_grad():
    print("Initial weights",w)
    print("Initial bias",b)
    w-=wgrad * 1e-5
    b-=bgrad * 1e-5
    print("Updated weights ", w)
    print("Updated bias matrix", b)
    wgrad.zero_()
    bgrad.zero_()
    

Initial weights tensor([[ 0.8033,  0.1748,  0.0890],
        [-0.6137,  0.0462, -1.3683]], requires_grad=True)
Initial bias tensor([0.3375, 1.0111], requires_grad=True)
Updated weights  tensor([[ 0.7905,  0.1739,  0.0861],
        [-0.4357,  0.2413, -1.2469]], requires_grad=True)
Updated bias matrix tensor([0.3374, 1.0132], requires_grad=True)
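
The update rule used above, w <- w - alpha * dL/dw, is easier to follow on a single scalar weight. This is a minimal sketch with made-up numbers (`x`, `y`, `lr` and `grad_value` are illustrative):

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
x, y = 2.0, 10.0

loss = (w * x - y) ** 2        # (3*2 - 10)^2 = 16
loss.backward()
grad_value = w.grad.item()     # analytically 2*x*(w*x - y) = 2*2*(-4) = -16
print(grad_value)

lr = 0.1
with torch.no_grad():          # the update must not be tracked by autograd
    w -= lr * w.grad           # 3 - 0.1*(-16) = 4.6
    w.grad.zero_()             # reset before the next backward()

new_loss = (w * x - y) ** 2    # about (9.2 - 10)^2 = 0.64
print(w.item(), new_loss.item())
```

The gradient is negative, so subtracting it moves `w` upwards and the loss drops.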


### LOSS AFTER PERFORMING BACK-PROPAGATION

In [24]:
# Calculate loss with these new weights and biases
yhat = model(x)
print("Predicted output",yhat)
loss = loss_mse(y,yhat)
print("Average Loss", loss)
elementwise_loss = torch.sqrt(loss)
print("Average loss in each element ",elementwise_loss)
print("On average, each element in the prediction differs from the actual target by about", elementwise_loss)

Predicted output tensor([[ 73.3979, -68.2398],
        [ 93.0874, -97.1996],
        [ 97.4091, -76.8757],
        [ 91.6305, -79.1834],
        [ 77.6052, -93.1664]], grad_fn=<AddBackward0>)
Average Loss tensor(16736.2539, grad_fn=<DivBackward0>)
Average loss in each element  tensor(129.3687, grad_fn=<SqrtBackward>)
On average, each element in the prediction differs from the actual target by about tensor(129.3687, grad_fn=<SqrtBackward>)


### RESULT INTERPRETATION after applying GRADIENT DESCENT/BACK-PROPAGATION ONCE
1st iteration -> takes all 5 examples (x), calculates yhat, the loss and the gradients, and updates the weights.
2nd iteration -> does the same again, starting from the weights updated in the 1st iteration.

In [25]:
print("After first iteration loss reduced from 24446.6367 to 16736.2539")

print("After first iteration the RMSE reduced from 156.35 to 129.37, i.e. each predicted element differs from the actual value by about 129.37 on average")

After first iteration loss reduced from 24446.6367 to 16736.2539
After first iteration the RMSE reduced from 156.35 to 129.37, i.e. each predicted element differs from the actual value by about 129.37 on average


## Gradient descent: consider all examples at each epoch
### PERFORM BACK-PROPAGATION 100 times, i.e. epochs=100, considering all examples in each epoch

In [26]:
# Train for 100 epoch / say 100 iterations
# Hyperparameters ::  number of epoch(here,we assigned 100) and learning rate(here, we assigned 1e-5)
for i in range(100):
    yhat = model(x)
    loss = loss_mse(y,yhat)
    if(i % 10==0):
        print("Loss at iter " ,i ," = ",loss)
    loss.backward()
    with torch.no_grad():
        w-=w.grad * 1e-5
        b-=b.grad * 1e-5
        w.grad.zero_()  # If this is not done, the gradients from the next backward() call are added to the current ones (gradients accumulate)
        b.grad.zero_()

Loss at iter  0  =  tensor(16736.2539, grad_fn=<DivBackward0>)
Loss at iter  10  =  tensor(1046.5990, grad_fn=<DivBackward0>)
Loss at iter  20  =  tensor(672.2653, grad_fn=<DivBackward0>)
Loss at iter  30  =  tensor(601.9579, grad_fn=<DivBackward0>)
Loss at iter  40  =  tensor(544.7785, grad_fn=<DivBackward0>)
Loss at iter  50  =  tensor(494.2476, grad_fn=<DivBackward0>)
Loss at iter  60  =  tensor(449.4819, grad_fn=<DivBackward0>)
Loss at iter  70  =  tensor(409.7966, grad_fn=<DivBackward0>)
Loss at iter  80  =  tensor(374.5901, grad_fn=<DivBackward0>)
Loss at iter  90  =  tensor(343.3328, grad_fn=<DivBackward0>)


In [27]:
print("GRADIENT DESCENT ON 100 EPOCHS")
print("Loss at 99th iter", loss)
elementwise_loss = torch.sqrt(loss)
print("Average loss in each element ",elementwise_loss)
print("On average, each element in the prediction differs from the actual target by about", elementwise_loss.item())

GRADIENT DESCENT ON 100 EPOCHS
Loss at 99th iter tensor(318.1910, grad_fn=<DivBackward0>)
Average loss in each element  tensor(17.8379, grad_fn=<SqrtBackward>)
On average, each element in the prediction differs from the actual target by about 17.837909698486328


## Stochastic gradient descent: 1 example at a time
### Basic tensor operations for back-propagation

1. Iteration 1 / epoch 1: We take the 1st example and the current weights, calculate yhat, find the loss (for the 1st example, from y[0] - yhat[0]), find wgrad and bgrad, and update the weights.
We take the 2nd example, calculate yhat with the updated weights, find the loss (from y[1] - yhat[1]), find wgrad and bgrad, and update the weights.
We repeat this for all the examples.

2. Iteration 2 / epoch 2: We go through the examples again in the same way, starting from the weights left by epoch 1 (they are not re-initialised).

3. ..... epoch 100

Note: `w` and `b` are not re-initialised before the next cell, so it continues from the weights already trained by the 100 epochs of gradient descent above. This is why its first printed losses are already small.

In [28]:
for i in range(100):
    #print("At iteration ",i)
    for j in range(len(x)):
        yhat = model(x[j])
        #print(yhat)
        loss = loss_mse(y[j],yhat)
        #print(y[j])
        if(i % 10==0):
            print("Loss at iter " ,i ," = ",loss)
        loss.backward()
        with torch.no_grad():
            w-=w.grad * 1e-5
            b-=b.grad * 1e-5
            w.grad.zero_()  # If this is not done, the gradients from the next backward() call are added to the current ones (gradients accumulate)
            b.grad.zero_()



Loss at iter  0  =  tensor(33.8693, grad_fn=<DivBackward0>)
Loss at iter  0  =  tensor(53.8243, grad_fn=<DivBackward0>)
Loss at iter  0  =  tensor(260.7957, grad_fn=<DivBackward0>)
Loss at iter  0  =  tensor(762.4399, grad_fn=<DivBackward0>)
Loss at iter  0  =  tensor(737.5680, grad_fn=<DivBackward0>)
Loss at iter  10  =  tensor(30.5703, grad_fn=<DivBackward0>)
Loss at iter  10  =  tensor(40.6871, grad_fn=<DivBackward0>)
Loss at iter  10  =  tensor(215.7035, grad_fn=<DivBackward0>)
Loss at iter  10  =  tensor(418.0986, grad_fn=<DivBackward0>)
Loss at iter  10  =  tensor(549.4191, grad_fn=<DivBackward0>)
Loss at iter  20  =  tensor(20.5718, grad_fn=<DivBackward0>)
Loss at iter  20  =  tensor(37.9198, grad_fn=<DivBackward0>)
Loss at iter  20  =  tensor(193.7313, grad_fn=<DivBackward0>)
Loss at iter  20  =  tensor(224.8997, grad_fn=<DivBackward0>)
Loss at iter  20  =  tensor(444.4177, grad_fn=<DivBackward0>)
Loss at iter  30  =  tensor(14.6264, grad_fn=<DivBackward0>)
Loss at iter  30  = 

In [29]:
print("STOCHASTIC GRADIENT DESCENT RESULTS")
print("Loss at 99th iter", loss)  # loss of the last single example only, not of the whole dataset
elementwise_loss = torch.sqrt(loss)
print("Average loss in each element ",elementwise_loss)
print("On average, each element in the prediction differs from the actual target by about", elementwise_loss.item())

STOCHASTIC GRADIENT DESCENT RESULTS
Loss at 99th iter tensor(147.6412, grad_fn=<DivBackward0>)
Average loss in each element  tensor(12.1508, grad_fn=<SqrtBackward>)
On average, each element in the prediction differs from the actual target by about 12.150769233703613


## Linear Regression using PyTorch built-in functions -> neural network library (nn)

In [30]:
import torch.nn as nn

In [31]:
inputs = np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58], 
                   [102, 43, 37], [69, 96, 70], [72, 66, 42], 
                   [90, 87, 65], [85, 134, 56], [100, 42, 33], 
                   [68, 95, 74], [63, 77, 53], [81, 48, 54], 
                   [57, 124, 48], [142, 43, 37], [67, 96, 80],[50,70,30]],dtype='float32')
targets = np.array([[56, 70], [81, 101], [119, 133], 
                    [22, 37], [103, 119], [66, 70], 
                    [84, 101], [120, 133], [22, 47], 
                    [103, 139], [59, 70], [81, 151], 
                    [129, 135], [23, 39], [104, 159],[50,60]], 
                   dtype='float32')
x = torch.from_numpy(inputs)
y = torch.from_numpy(targets)
x[[1,3]]

tensor([[ 91.,  88.,  64.],
        [102.,  43.,  37.]])

16 examples -> create batches 

In [32]:
# TensorDataset -> pairs each input row with its target row and returns them as a tuple. Useful for taking small samples of the data
from torch.utils.data import TensorDataset
train_ds = TensorDataset(x,y)
train_ds[:3]
#train_ds[[1,3]] #1st tensor is input, 2nd tensor is the target(y)

(tensor([[ 73.,  67.,  43.],
         [ 91.,  88.,  64.],
         [ 87., 134.,  58.]]),
 tensor([[ 56.,  70.],
         [ 81., 101.],
         [119., 133.]]))

### DATALOADER


1. Split the data into batches -> number of batches = total_examples / batch_size.
Say batch_size = 4, i.e. 4 training examples are considered in 1 run; here number of batches = 16/4 = 4 batches. In plain Python we would have to slice the data ourselves.

2. Every data point is used exactly once per epoch; no tuple is repeated across the batches. DataLoader does this for us.

3. It can also shuffle the data and sample it randomly.
4. The input training set is the (x,y) tuple dataset obtained from TensorDataset (train_ds).

5. Shuffling randomizes the order in which examples reach the optimization algorithm, which can lead to a faster reduction in the loss.
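
The points above can be verified with a small sketch. It uses placeholder data (the integers 0 to 15) instead of the real features, and the names `ds`, `dl`, `seen` and `n_batches` are illustrative:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

x = torch.arange(16, dtype=torch.float32).reshape(16, 1)  # 16 placeholder examples
y = x * 2
ds = TensorDataset(x, y)
dl = DataLoader(ds, batch_size=4, shuffle=True)

seen = []
n_batches = 0
for xb, yb in dl:                      # one pass over dl = one epoch
    n_batches += 1
    seen.extend(xb.flatten().tolist())

print(n_batches)                        # 16 / 4 = 4 batches
print(sorted(seen) == list(range(16)))  # every example appears exactly once
```

Shuffling changes the order of the examples between epochs, but never which examples are included.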

In [33]:
from torch.utils.data import DataLoader

In [34]:
#? DataLoader

In [35]:
batch_size = 4
train_dl = DataLoader(train_ds,batch_size,shuffle=True)

In [36]:
for xb,yb in train_dl:
    print(xb)
    print(yb)
    break

tensor([[ 73.,  67.,  43.],
        [ 50.,  70.,  30.],
        [100.,  42.,  33.],
        [ 63.,  77.,  53.]])
tensor([[56., 70.],
        [50., 60.],
        [22., 47.],
        [59., 70.]])


## 1 ITERATION, CONSIDERING ALL EXAMPLES - GRADIENT DESCENT
### Using pytorch in-built optimizer (SGD)

In [37]:
import torch.nn.functional as F
torch.manual_seed(3)
model = nn.Linear(3,2)
#print("Weight :: ",model.weight)
#print("Bias :: ",model.bias)
loss_fn =  F.mse_loss
opt = torch.optim.SGD(list(model.parameters()), lr=1e-5)
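
`nn.Linear(3,2)` takes the place of the hand-written `model` function and its `w`, `b` tensors from the first half of the notebook. A short sketch of that equivalence, using the first training example as input (`manual` is an illustrative name):

```python
import torch
import torch.nn as nn

torch.manual_seed(3)
model = nn.Linear(3, 2)                 # 3 input features, 2 targets

print(model.weight.shape)               # torch.Size([2, 3]), same layout as w
print(model.bias.shape)                 # torch.Size([2]), same layout as b

x = torch.tensor([[73., 67., 43.]])
manual = x @ model.weight.t() + model.bias   # the hand-written formula
print(torch.allclose(model(x), manual))
```

The layer stores its weights in the same [targets, features] layout, so the earlier shape reasoning carries over unchanged.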

In [38]:
print(list(model.parameters())) # weights and bias matrices are the parameters
yhat = model(x)
loss = loss_fn(yhat,y)
loss.backward()
opt.step()
opt.zero_grad()
print("Loss",loss.item())  # loss was computed before opt.step(), so this is the loss of the initial weights
elementwise_loss = np.sqrt(loss.item())
print("Average loss in each element ",elementwise_loss)
print("On average, each element in the prediction differs from the actual target by about", elementwise_loss.item())
#print(list(model.parameters())) #UPDATED WEIGHTS AND BIASES AFTER OPTIMISATION

[Parameter containing:
tensor([[-0.5724, -0.4554, -0.2473],
        [-0.5462, -0.0328, -0.5079]], requires_grad=True), Parameter containing:
tensor([0.3139, 0.2814], requires_grad=True)]
Loss 31566.533203125
Average loss in each element  177.66973068906532
On average, each element in the prediction differs from the actual target by about 177.66973068906532


## 100 ITERATIONS(EPOCHS), CONSIDERING ALL EXAMPLES - GRADIENT DESCENT

In [39]:
import torch.nn.functional as F
torch.manual_seed(2) #We do this before creating model, since, model initialises the w, b matrices randomly
model = nn.Linear(3,2)
loss_fn = F.mse_loss
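# Note: lr=1e-15 is ten orders of magnitude smaller than the 1e-5 used earlier; steps this small are below float32 precision, so the weights stay practically at their initial values
opt = torch.optim.SGD(model.parameters(), lr= 1e-15)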

In [40]:
# At each epoch -> 
# we consider all 16 examples , predict output, calculate loss , grad descent and update the weights. 
# Thus, at each epoch we update the weights and bias matrix just once.
# With 100 epoch we will have 100 *1 = 100 losses calculated i.e weights will be updated 100 times (counter)


def gradDes(model, opt,loss_fn, x, epochs,counter=0):
    # Note: the targets y are read from the enclosing (global) scope
    for i in range(epochs):
        counter=counter+1
        # Predict the output from the input x and the model's current w, b
        yhat = model(x)
        # Compute the MSE loss between predicted and actual values
        loss = loss_fn(yhat,y)
        #print(loss)
        # Back-propagate: compute the gradients of the loss w.r.t. w and b
        loss.backward()

        # Update w and b from their gradients; the optimizer takes care of this
        opt.step()

        # Reset the gradients of w and b to zero
        opt.zero_grad()

        #if (i+1) % 10 == 0:
            #print("Loss at epoch {} is {:.4f}".format(i,loss.item()))

    return (loss.item(),counter)
        
        

In [41]:
loss,counter = gradDes(model,opt,loss_fn,x,100)
print("counter",counter)
print("Loss at 99th iter", loss)
elementwise_loss = np.sqrt(loss)
print("Average loss in each element ",elementwise_loss)
print("On average, each element in the prediction differs from the actual target by about", elementwise_loss.item())

counter 100
Loss at 99th iter 6332.97216796875
Average loss in each element  79.57997341020383
On average, each element in the prediction differs from the actual target by about 79.57997341020383


## 100 ITERATIONS, CONSIDERING 1 EXAMPLE AT A TIME - STOCHASTIC GRADIENT DESCENT

In [42]:
len(train_ds)

16

In [43]:
import torch.nn.functional as F
torch.manual_seed(2) #We do this before creating model, since, model initialises the w, b matrices randomly
model = nn.Linear(3,2)
loss_fn = F.mse_loss
# Note: lr=1e-15 (not 1e-5), so each update is negligibly small
opt = torch.optim.SGD(model.parameters(), lr= 1e-15)

In [44]:
# At each epoch ->
# first we take the 1st example, calculate its loss and update the weights.
# next we take the 2nd example, calculate its loss with the updated weights and update the weights again, and so on.
# This continues for all 16 examples. Thus each epoch calculates 16 losses.
# With 100 epochs we have 100 * 16 = 1600 losses calculated, i.e. the weights are updated 1600 times (counter)

def stochGradDes(model, opt,loss_fn, x, epochs,counter=0):
    # Note: train_ds and the targets y are read from the enclosing (global) scope
    for i in range(epochs):
        for j in range(len(train_ds)):
            counter= counter+1
            # Predict the output for the j-th example with the model's current w, b
            yhat= model(x[j])

            # Compute the MSE loss between predicted and actual values for this example
            loss = loss_fn(yhat,y[j])

            # Back-propagate: compute the gradients of the loss w.r.t. w and b
            loss.backward()

            # Update w and b from their gradients; the optimizer takes care of this
            opt.step()

            # Reset the gradients of w and b to zero
            opt.zero_grad()

            #if (i+1) % 10 == 0:
                #print("Loss at epoch {} is {:.4f}".format(i,loss.item()))

    return (loss.item(),counter)
        

In [45]:
(loss,counter) = stochGradDes(model,opt,loss_fn,x,100)
print("Loss at 99th iter", loss)
print("Counter = ",counter)
elementwise_loss = np.sqrt(loss)
print("Average loss in each element ",elementwise_loss)
print("On average, each element in the prediction differs from the actual target by about", elementwise_loss.item())

Loss at 99th iter 1989.31591796875
Counter =  1600
Average loss in each element  44.60174792503933
On average, each element in the prediction differs from the actual target by about 44.60174792503933


## 100 ITERATIONS, CONSIDERING 4 EXAMPLES AT A TIME - BATCH (MINI-BATCH) GRADIENT DESCENT

In [46]:
import torch.nn.functional as F
torch.manual_seed(2) #We do this before creating model, since, model initialises the w, b matrices randomly
model = nn.Linear(3,2)
loss_fn = F.mse_loss
# Note: lr=1e-15 (not 1e-5), so each update is negligibly small
opt = torch.optim.SGD(model.parameters(), lr= 1e-15)

In [47]:
# At each epoch ->
# first we take 4 examples, predict their outputs and calculate the batch loss (F.mse_loss averages over the batch),
# then perform gradient descent and update the weights.
# next we take the next 4 examples, calculate the loss and update the weights, and so on.
# This continues for all 16 examples, i.e. 4 batches of 4 shuffled examples each (train_dl). Thus each epoch calculates 4 losses.
# With 100 epochs we have 100 * 4 = 400 losses calculated, i.e. the weights are updated 400 times


def batchGradDes(model, opt,loss_fn, x, epochs,counter=0):
    # Note: the batches come from the global train_dl; the x argument is not used
    for i in range(epochs):
        for xb,yb in train_dl:
            counter = counter+1
            # Predict the output for this batch with the model's current w, b
            yhat= model(xb)
            # Compute the MSE loss between predicted and actual values for this batch
            loss = loss_fn(yhat,yb)

            # Back-propagate: compute the gradients of the loss w.r.t. w and b
            loss.backward()

            # Update w and b from their gradients; the optimizer takes care of this
            opt.step()

            # Reset the gradients of w and b to zero
            opt.zero_grad()

            #if (i+1) % 10 == 0:
                #print("Loss at epoch {} is {:.4f}".format(i,loss.item()))

    return (loss.item(),counter)
        

In [48]:
(loss,counter) = batchGradDes(model,opt,loss_fn,x,100)
print("Loss at 99th iter", loss)
print("counter",counter)
elementwise_loss = np.sqrt(loss)
print("Average loss in each element ",elementwise_loss)
print("On average, each element in the prediction differs from the actual target by about", elementwise_loss.item())

Loss at 99th iter 4408.11865234375
counter 400
Average loss in each element  66.39366424850905
On average, each element in the prediction differs from the actual target by about 66.39366424850905
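
For a like-for-like comparison of the three regimes, one option is to start every run from the same initial weights and measure the final loss on all 16 examples. The sketch below is not what the notebook ran: `train` and `results` are hypothetical names introduced here, and lr=1e-5 is an assumption carried over from the tensor-only section. No expected losses are shown, since they depend on the run.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

# The 16 examples used in the notebook
inputs = np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58],
                   [102, 43, 37], [69, 96, 70], [72, 66, 42],
                   [90, 87, 65], [85, 134, 56], [100, 42, 33],
                   [68, 95, 74], [63, 77, 53], [81, 48, 54],
                   [57, 124, 48], [142, 43, 37], [67, 96, 80], [50, 70, 30]], dtype='float32')
targets = np.array([[56, 70], [81, 101], [119, 133],
                    [22, 37], [103, 119], [66, 70],
                    [84, 101], [120, 133], [22, 47],
                    [103, 139], [59, 70], [81, 151],
                    [129, 135], [23, 39], [104, 159], [50, 60]], dtype='float32')
x = torch.from_numpy(inputs)
y = torch.from_numpy(targets)

def train(batch_size, epochs=100, lr=1e-5):
    """Hypothetical helper: train a fresh nn.Linear, return (MSE on all data, number of updates)."""
    torch.manual_seed(2)                      # identical initial weights for every run
    model = nn.Linear(3, 2)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    dl = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)
    updates = 0
    for _ in range(epochs):
        for xb, yb in dl:
            loss = F.mse_loss(model(xb), yb)  # mean squared error over this batch
            loss.backward()
            opt.step()
            opt.zero_grad()
            updates += 1
    with torch.no_grad():                     # evaluate on the whole dataset
        full_loss = F.mse_loss(model(x), y).item()
    return full_loss, updates

results = {
    "all 16 examples per update": train(batch_size=16),
    "4 examples per update": train(batch_size=4),
    "1 example per update": train(batch_size=1),
}
for name, (full_loss, updates) in results.items():
    print(name, "-> updates:", updates, " MSE on all data:", full_loss)
```

Only the batch size changes between runs, so any difference in the final loss comes from the update schedule (100, 400 or 1600 updates) rather than from the evaluation set.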


# RESULTS 

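### 1 iteration, all examples
Overall loss = 31566.533 (computed with the initial weights, before the first update; seed 3, lr=1e-5)
On average, each element in the prediction differs from the actual target by about 177.67


### 100 iterations, all examples
Overall loss at 99th iter = 6332.972 (all 16 examples)
On average, each element in the prediction differs from the actual target by about 79.58

### 100 iterations, 1 example at a time
Loss at 99th iter = 1989.316 (last single example only)
On average, each element in the prediction differs from the actual target by about 44.60

### 100 iterations with batch_size=4, i.e. considering 4 examples at a time
Loss at 99th iter = 4408.119 (last batch of 4 examples only)
On average, each element in the prediction differs from the actual target by about 66.39

Note: these figures are not directly comparable. The three 100-iteration runs use lr=1e-15, which leaves the weights practically unchanged, and each run reports the loss on a different subset of the data. The differences therefore mainly reflect which examples were evaluated rather than how well each method trained.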

