#### PyTorch has emerged as a major contender in the race to be the king of deep learning frameworks.

In this notebook I will go over some commonly used PyTorch snippets and techniques.

# Compute basic gradients from the sample tensors using PyTorch

#### First, some PyTorch basics

**Autograd**: This is PyTorch's engine for calculating derivatives (vector-Jacobian products, to be more precise). It records every operation performed on a gradient-enabled tensor and builds a directed acyclic graph called the dynamic computational graph. The leaves of this graph are the input tensors and the roots are the output tensors. Gradients are calculated by tracing the graph from the roots back to the leaves and multiplying the local gradients along the way using the chain rule.
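A minimal sketch of the graph described above, written against the tensor API directly. The names `a`, `b`, `prod`, and `out` are illustrative and do not come from the original notebook.

```python
import torch

# Leaf tensors: created by the user, with gradient tracking enabled.
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

# Each operation adds a node to the graph and records how it was produced.
prod = a * b          # intermediate result, has a grad_fn
out = prod + a        # root of this small graph

print(a.is_leaf, prod.is_leaf)   # leaves have no grad_fn; intermediate results do
print(out.grad_fn)

# Walk the graph from the root (out) back to the leaves via the chain rule.
out.backward()
print(a.grad, b.grad)  # d(out)/da = b + 1, d(out)/db = a
```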

The Variable class wraps a tensor. You can access the wrapped tensor through the Variable's `.data` attribute. (Since PyTorch 0.4, Variable and Tensor have been merged, so `Variable(t)` simply returns a tensor and the attributes described here are available on tensors directly.)

The Variable also stores the gradient of a scalar quantity (say, the loss) with respect to the parameter it holds. This gradient can be accessed through the `.grad` attribute. It is the gradient accumulated up to this particular node; the gradient at the next node along the backward pass is obtained by multiplying it by the local derivative of the connecting edge.

The third attribute a Variable holds is `grad_fn`, a Function object that records the operation which created the Variable.

**Variable**: The Variable, just like a Tensor, is a class that is used to hold data. It differs, however, in the way it’s meant to be used. Variables are specifically tailored to hold values that change during the training of a neural network, i.e. the learnable parameters of our network. Tensors, on the other hand, are used to store values that are not to be learned. For example, a Tensor may be used to store the values of the loss generated by each example.

Every **Variable** object has several members; one of them is **grad**:

**grad**: grad holds the value of gradient. If requires_grad is False it will hold a None value. Even if requires_grad is True, it will hold a None value unless .backward() function is called from some other node. For example, if you call out.backward() for some variable out that involved x in its calculations then x.grad will hold ∂out/∂x.

**Backward() function**
Backward is the function that actually calculates the gradient by passing its argument (a 1x1 unit tensor by default) through the backward graph all the way up to every leaf node traceable from the calling root tensor. The calculated gradients are then stored in `.grad` of every leaf node. Remember, the backward graph is already built dynamically during the forward pass. The backward function only calculates the gradients using the already built graph and stores them in the leaf nodes.
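The behaviour of `.grad` and `backward()` described above can be sketched as follows. The tensor `x` and the function `x ** 2` are made up for illustration; the point is that `.grad` is `None` until `backward()` runs, and that a non-scalar root needs an explicit argument.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
grad_before = x.grad          # None: backward() has not been called yet

y = x ** 2                    # non-scalar output, one value per element

# For a non-scalar root, backward() needs a tensor with the same shape as y;
# it plays the role of the vector in the vector-Jacobian product.
y.backward(torch.ones_like(y))

print(grad_before)            # None
print(x.grad)                 # equals 2 * x
```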

In [1]:
import torch
from torch.autograd import Variable
import torch as tch
import torch.nn as nn
import numpy as np
from sklearn.datasets import make_blobs
from matplotlib import pyplot

def forward(x):
    return x * w

w = Variable(torch.Tensor([1.0]), requires_grad=True)
# Setting requires_grad=True makes the tensor record every operation
# applied to it in a dynamic computation graph (DCG), which is later
# used to calculate the gradients.
# When you finish your computation you can call .backward() and have
# all the gradients computed automatically. The gradient for this tensor
# will be accumulated into its .grad attribute.

# Now create the data.
# By PyTorch’s design, gradients can only be calculated
# for floating point tensors, which is why the weight above and
# the values below are floats.
x_data = [11.0, 22.0, 33.0]
y_data = [21.0, 14.0, 64.0]

def loss_function(x, y):
    y_pred = forward(x)
    return (y_pred - y) * (y_pred - y)


# Now running the training loop
for epoch in range(10):
    for x_val, y_val in zip(x_data, y_data):
        l = loss_function(x_val, y_val)
        l.backward()
        print("\tgrad: ", x_val, y_val, w.grad.data[0])
        w.data = w.data - 0.01 * w.grad

        # Manually set the gradient to zero after updating weights
        w.grad.data.zero_()

        print('progress: ', epoch, l.data[0])

	grad:  11.0 21.0 tensor(-220.)
progress:  0 tensor(100.)
	grad:  22.0 14.0 tensor(2481.6001)
progress:  0 tensor(3180.9602)
	grad:  33.0 64.0 tensor(-51303.6484)
progress:  0 tensor(604238.8125)
	grad:  11.0 21.0 tensor(118461.7578)
progress:  1 tensor(28994192.)
	grad:  22.0 14.0 tensor(-671630.6875)
progress:  1 tensor(2.3300e+08)
	grad:  33.0 64.0 tensor(13114108.)
progress:  1 tensor(3.9481e+10)
	grad:  11.0 21.0 tensor(-30279010.)
progress:  2 tensor(1.8943e+12)
	grad:  22.0 14.0 tensor(1.7199e+08)
progress:  2 tensor(1.5279e+13)
	grad:  33.0 64.0 tensor(-3.3589e+09)
progress:  2 tensor(2.5900e+15)
	grad:  11.0 21.0 tensor(7.7553e+09)
progress:  3 tensor(1.2427e+17)
	grad:  22.0 14.0 tensor(-4.4050e+10)
progress:  3 tensor(1.0023e+18)
	grad:  33.0 64.0 tensor(8.6030e+11)
progress:  3 tensor(1.6991e+20)
	grad:  11.0 21.0 tensor(-1.9863e+12)
progress:  4 tensor(8.1519e+21)
	grad:  22.0 14.0 tensor(1.1282e+13)
progress:  4 tensor(6.5750e+22)
	grad:  33.0 64.0 tensor(-2.2034e+14)
pro
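The losses printed above grow instead of shrinking. For this one-parameter model each update multiplies the error by (1 - 2 * lr * x^2), and with lr = 0.01 and x = 33 that factor has magnitude of about 20.8, so every step overshoots. A minimal sketch of the same loop with a smaller step size follows; the value 1e-4 is an assumption chosen for this data, and the update uses `torch.no_grad()` instead of `.data`.

```python
import torch

x_data = [11.0, 22.0, 33.0]
y_data = [21.0, 14.0, 64.0]

w = torch.tensor([1.0], requires_grad=True)
learning_rate = 1e-4   # assumed value; small enough for inputs of this size

losses = []
for epoch in range(10):
    for x_val, y_val in zip(x_data, y_data):
        loss = (x_val * w - y_val) ** 2
        loss.backward()
        with torch.no_grad():          # update the weight without recording it in the graph
            w -= learning_rate * w.grad
        w.grad.zero_()                 # gradients accumulate, so reset them after each step
        losses.append(loss.item())

print(w.item(), max(losses))           # the weight and the losses stay bounded
```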

Weight initialization is an important task in training a neural network,
whether it is a convolutional neural network
(CNN), a deep neural network (DNN), or a recurrent neural network
(RNN). Let's look at some examples of initializing the weights.


Weight initialization can be done by using various methods, including
random weight initialization.
Weight initialization based on a distribution
can use the
- uniform distribution,
- Bernoulli distribution,
- multinomial distribution, and
- normal distribution.

To execute a neural network, a set of initial weights is needed so that
the forward pass can produce predictions, from which the loss (and hence
the accuracy) is computed and then backpropagated. The selection of a method depends on the
data type, the task, and the optimization required for the model.
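As a sketch of what initializing weights from a distribution can look like in practice, the snippet below overwrites the weights of a layer with `torch.nn.init`. The layer sizes, the bounds, and the standard deviation are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                 # fixed seed so the draws are repeatable
layer = nn.Linear(64, 256)           # hypothetical layer, sizes chosen for illustration

# Overwrite the default weights in place with draws from a uniform distribution.
nn.init.uniform_(layer.weight, a=-0.1, b=0.1)
u_min, u_max = layer.weight.min().item(), layer.weight.max().item()
print(u_min, u_max)

# Or draw them from a normal distribution with a small standard deviation.
nn.init.normal_(layer.weight, mean=0.0, std=0.02)
print(layer.weight.mean().item(), layer.weight.std().item())

nn.init.zeros_(layer.bias)           # start the biases at zero
```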

A Bernoulli trial is a random experiment that has only two outcomes (usually called “Success” and “Failure”), and the Bernoulli distribution is the discrete
probability distribution over those two outcomes. If the event happens, then the value is 1, and if the event does not happen, then the value is 0.

For discrete probability distribution, we calculate probability mass
function instead of probability density function. The probability mass
function looks like the following formula.

![](https://i.imgur.com/bz2dWtc.png)

To draw from the Bernoulli distribution, we first create a 4 × 4 matrix of probabilities sampled from the uniform distribution on [0, 1] and then pass it to `torch.bernoulli()`.

Specifically, `torch.bernoulli()` samples from the distribution and returns a binary value (i.e. either 0 or 1) for each element: 1 with probability p and 0 with probability 1-p.

```python
torch.bernoulli(input, *, generator=None, out=None)
```
It draws binary random numbers (0 or 1) from a Bernoulli distribution.


Parameters :

input (Tensor) – the input tensor of probability values for the Bernoulli distribution

generator (torch.Generator, optional) – a pseudorandom number generator for sampling

out (Tensor, optional) – out tensor only has values 0 or 1 and is of the same shape as input.

The input tensor should be a tensor containing probabilities to be used for drawing the binary random number. Hence, all values in input have to be in the range:

0 <= input_i <=1
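A quick way to see the role of the probabilities is to draw many samples with the same p and look at the fraction of ones. The value p = 0.3, the sample count, and the seed are assumptions made for this sketch.

```python
import torch

gen = torch.Generator().manual_seed(0)    # local generator so the draw is repeatable
probs = torch.full((10000,), 0.3)         # every element has success probability p = 0.3

samples = torch.bernoulli(probs, generator=gen)

print(samples.unique())                   # only 0. and 1. occur
print(samples.mean().item())              # fraction of ones, close to p
```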


In [2]:
torch.bernoulli(torch.Tensor(4, 4).uniform_(0, 1))


tensor([[0., 0., 0., 1.],
        [1., 0., 0., 1.],
        [0., 1., 0., 1.],
        [1., 1., 1., 0.]])

# Generation of sample random values from a multinomial distribution

Note the syntax of the multinomial function from the official documentation:

```python
torch.multinomial(input, num_samples, replacement=False, *, generator=None, out=None) → LongTensor
```
Returns a tensor where each row contains num_samples indices sampled from the multinomial probability distribution located in the corresponding row of tensor input.
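The row-wise behaviour can be sketched with a small two-row input. The weights below are made up; they do not need to sum to one, and an index with zero weight is never drawn.

```python
import torch

gen = torch.Generator().manual_seed(0)

# Two rows of unnormalised weights; index 0 in each row has zero weight.
weights = torch.tensor([[0.0, 10.0, 3.0, 1.0],
                        [0.0, 1.0, 1.0, 8.0]])

# Without replacement: 3 distinct indices per row.
idx = torch.multinomial(weights, 3, generator=gen)
print(idx.shape)   # one row of sampled indices per input row
print(idx)
```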



In [3]:
sample_tensor = torch.tensor([10., 10., 13., 10., 34., 45., 65., 67., 87., 89., 87., 34.])
torch.multinomial(sample_tensor, 3)

tensor([8, 7, 9])

Sampling from the multinomial distribution with replacement also returns index values into the tensor, but the same index may now appear more than once.

In [4]:
torch.multinomial(torch.tensor([10., 10., 13., 10., 34., 45., 65., 67., 87., 89., 87., 34.]), 5, replacement=True)

tensor([5, 5, 7, 7, 7])

And now, weight initialization from the normal distribution, a method
that is also used when fitting neural networks, including deep neural networks,
CNNs, and RNNs. Let’s have a look at the process of creating a set of random
weights generated from a normal distribution.

Syntax

```python
torch.normal(mean, std, *, generator=None, out=None) → Tensor
```
Returns a tensor of random numbers drawn from separate normal distributions whose mean and standard deviation are given.

The mean is a tensor with the mean of each output element’s normal distribution

The std is a tensor with the standard deviation of each output element’s normal distribution

The shapes of mean and std don’t need to match, but the total number of elements in each tensor needs to be the same.

In [5]:
torch.normal(mean=torch.arange(1., 11), std=torch.arange(1, 0, -0.1))

tensor([0.3873, 1.8538, 2.5595, 3.3766, 5.2688, 7.0845, 7.6231, 8.0542, 8.7442,
        9.9076])

In [6]:
torch.normal(mean=0.5, std=torch.arange(1.,6.))

tensor([ 3.2883, -0.2210,  1.6501,  1.4141, -4.3036])

In [7]:
torch.normal(mean=0.5, std=torch.arange(0.2, 0.6))


tensor([0.5485])

# What is a Variable in PyTorch and how is it defined? What is a random variable in PyTorch?

In PyTorch, the algorithms are represented as a computational graph.

A variable is a wrapper around the tensor object,
its corresponding gradients (the slope of the function), and a reference to the function from which it was
created.

The slope of the function can be computed by the derivative of the
function with respect to the parameters that are present in the function.

Basically, a PyTorch variable is a node in a computational graph, which
stores data and gradients. When training a neural network model, after
each iteration, we need to compute the gradient of the loss function with
respect to the parameters of the model, such as weights and biases. After
that, we usually update the weights using the gradient descent algorithm.

The figure below explains how the linear regression equation is deployed under
the hood using a neural network model in the PyTorch framework.
In a computational graph structure, the sequencing and ordering
of tasks is very important. The one-dimensional tensors are X, Y, W,
and alpha. The direction of the arrows changes when we
implement backpropagation to update the weights to match with Y, so that
the error or loss function between Y and predicted Y can be minimized.


![Imgur](https://imgur.com/6JOtOGb.png)

#### Let's see an example

An example of how a variable is used to create a computational graph is
displayed in the following script. There are three variable objects around
tensors (x1, x2, and x3), each filled with random values and of shape
(a, b), where a = 12 and b = 23. The graph computation involves only multiplication and addition,
and the final result with the gradient function is shown.

The partial derivative of the loss function with respect to the weights
and biases in a neural network model is computed in PyTorch using the
Autograd module. Variables are specifically designed to hold the values
that change while running backpropagation in a neural network model, when
the parameters of the model are updated. The variable type is just a wrapper
around the tensor. It has three properties: data, grad, and grad_fn.





In [8]:
from torch.autograd import Variable
Variable(torch.ones(2,2), requires_grad=True)


tensor([[1., 1.],
        [1., 1.]], requires_grad=True)

In [9]:
a, b = 12, 23
x1 = Variable(torch.randn(a, b), requires_grad=True )
x2 = Variable(torch.randn(a,b), requires_grad=True)
x3 = Variable(torch.randn(a,b), requires_grad=True)

In [10]:
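c = x1 * x2
# Note: d adds the integer a (12) to x3 and does not use c,
# so e depends only on x3; x1.grad and x2.grad stay None after backward().
d = a + x3
e = torch.sum(d)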

e.backward()

print(e)

tensor(3296.0845, grad_fn=<SumBackward0>)
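In the cell above, `d` is built from the integer `a` rather than from `c`, so `e` depends only on `x3`. The sketch below makes this visible and then shows a hypothetical variant, `e2`, that does use the product; `e2` is not part of the original notebook.

```python
import torch

a, b = 12, 23
x1 = torch.randn(a, b, requires_grad=True)
x2 = torch.randn(a, b, requires_grad=True)
x3 = torch.randn(a, b, requires_grad=True)

# As written in the cell above: d uses the integer a, so c is never part of e.
e = torch.sum(a + x3)
e.backward()
x1_grad_before = x1.grad
print(x1_grad_before)     # None, because e does not depend on x1
print(x3.grad[0, :3])     # ones: d(e)/d(x3) = 1 for every element

# Hypothetical variant that uses the product, so all three tensors get gradients.
e2 = torch.sum(x1 * x2 + x3)
e2.backward()
print(torch.allclose(x1.grad, x2.detach()))   # d(e2)/d(x1) = x2
print(x3.grad[0, :3])     # twos: gradients from both backward calls accumulate
```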


In [11]:
x1.data

tensor([[ 8.1828e-01, -1.6748e-01,  3.3572e-01,  1.3247e+00, -1.1390e-01,
         -1.1533e+00,  1.8396e-02,  1.1846e+00, -5.4557e-01,  2.6307e-01,
         -7.5235e-01, -1.0258e+00,  1.7264e+00,  3.1132e-01,  1.4500e+00,
          1.0881e+00,  6.6679e-01, -7.3055e-01,  1.0475e+00, -1.4899e+00,
          1.7605e+00,  3.0716e+00,  1.1979e+00],
        [-7.5272e-01, -1.6185e-01,  3.1758e-01, -1.1183e+00, -8.4021e-02,
         -1.5449e+00,  4.1204e-01,  4.0797e-01, -9.3155e-01,  3.6896e-01,
          6.0915e-01, -2.7323e-01,  1.5830e+00,  3.6732e-01, -4.4652e-02,
         -4.2154e-01,  5.9620e-01,  3.7424e-01,  5.9423e-01,  9.7377e-01,
         -2.4661e+00,  1.5546e+00,  1.2334e+00],
        [ 8.0186e-02,  2.4171e+00, -1.8963e+00, -6.1453e-01,  1.7587e-02,
         -2.2381e-01,  7.1807e-01,  4.1588e-01, -9.7389e-02, -7.9841e-02,
         -2.5912e-02, -7.5603e-01, -1.2318e+00, -9.9687e-01, -6.7872e-01,
         -1.1399e+00,  1.3165e+00,  7.0379e-01,  4.6876e-01, -3.2940e-01,
          1.84

In [12]:
x2.data

tensor([[-8.0376e-01,  1.4647e-01,  2.6804e-01,  3.5481e-01,  1.5795e-02,
          1.3872e-01,  3.4475e-01, -2.7923e-01,  9.9045e-01,  6.9297e-01,
         -5.5273e-01,  1.0695e+00,  1.4228e-01,  2.7997e-01, -3.2946e-01,
         -5.1557e-02, -6.0712e-01,  5.7552e-02, -2.2274e+00,  1.4848e+00,
         -2.7376e-01,  2.7362e-02,  1.5364e+00],
        [ 2.2755e-01,  7.2207e-01,  1.6892e+00,  9.3856e-01,  1.5081e+00,
         -2.8657e-02,  3.5874e-01, -4.7104e-01, -6.1006e-01, -1.4048e+00,
          4.8653e-01, -2.2149e-01,  1.8077e+00, -4.4378e-01, -9.5723e-02,
         -3.5231e-01,  1.0047e+00,  9.0307e-01, -1.5678e+00,  4.6605e-01,
          4.7620e-01,  3.9073e-01,  9.4730e-03],
        [-1.0358e+00,  1.7686e+00,  2.0033e+00,  4.9726e-01, -1.3719e+00,
          1.0420e+00,  1.6000e-01,  9.5809e-01, -5.9137e-01, -4.2875e-01,
         -1.3759e+00,  1.9779e+00, -5.9183e-01,  7.4635e-01, -8.1359e-01,
         -8.0251e-01,  2.4896e-01, -3.3909e-01,  2.0032e-01,  1.3580e+00,
          6.58

In [13]:
x3.data

tensor([[-4.8534e-01, -1.1212e+00,  2.4476e-01, -6.7335e-01,  2.9018e-01,
         -7.7006e-01,  3.8578e-01, -3.5645e-01, -4.8541e-01, -4.1493e-01,
         -5.3967e-01, -4.7828e-01,  8.5377e-01, -2.0661e-01,  1.3739e+00,
          1.9713e+00, -1.1111e-01, -1.5334e+00,  1.8913e-01,  5.4865e-01,
         -1.9779e+00, -6.4604e-01, -1.4073e+00],
        [-5.0758e-01,  1.5364e+00,  2.5612e+00, -3.0844e-01,  8.1289e-02,
         -1.3138e+00, -9.6680e-01,  4.3937e-01,  1.0111e+00,  1.2071e+00,
         -5.6007e-01,  1.5521e+00,  2.8156e-01, -6.9735e-01,  5.7842e-01,
          2.5003e-01,  7.7192e-01,  3.7598e-01,  4.9031e-01,  1.4758e+00,
          1.0504e+00,  1.7253e+00, -6.4296e-01],
        [-8.5327e-01, -4.7621e-01,  8.7918e-01, -1.1781e+00,  1.7463e-01,
         -5.0065e-02, -1.9840e+00,  2.8272e-01,  3.6420e-01,  6.0833e-01,
          1.4822e+00,  1.0106e+00, -1.3307e+00,  1.3743e+00, -5.1769e-01,
          1.1176e-01, -1.1042e+00,  2.8250e-01, -1.1586e-01,  1.9232e+00,
          2.06

# How do we set up a loss function and optimize it?

Choosing the right loss function increases the chances of model convergence.

We use another tensor as the update variable, introduce
the tensors to the sample model, and compute the error or loss. Then we
compute the rate of change of the loss function with respect to each
parameter, which tells us how to update that parameter.

In the following example, t_c and t_u are two tensors. They can be
constructed from any NumPy array.


In [14]:
torch.__version__

'1.7.1'

In [15]:
torch.tensor

<function _VariableFunctionsClass.tensor>

The sample model is just a linear equation, and the loss function
defined is the mean squared error (MSE),
shown next. For now, this is just a simple linear equation
computation.


In [16]:
#height of people
t_c = torch.tensor([58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0])

#weight of people
t_u = torch.tensor([115.0, 117.0, 120.0, 123.0, 126.0, 129.0, 132.0, 135.0, 139.0, 142.0, 146.0, 150.0, 154.0, 159.0,164.0])



Let’s now define the model. The w parameter is the weight tensor,
which is multiplied with the t_u tensor. The result is added to a constant
tensor, b. The loss function is a hand-written mean squared error; an equivalent is also available in PyTorch as `nn.MSELoss`.

In the following example, t_u is the input tensor, t_p
is the predicted tensor, and t_c is the precomputed target tensor, with which the
predicted tensor is compared to calculate the loss.

The formula $$w * t_u + b$$ is the linear equation representation of a
tensor-based computation.



In [17]:
def model(t_u, w, b):
    return w * t_u + b
  
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()
  
w = torch.ones(1)
b = torch.zeros(1)

t_p = model(t_u, w, b)
t_p

tensor([115., 117., 120., 123., 126., 129., 132., 135., 139., 142., 146., 150.,
        154., 159., 164.])

In [18]:
loss = loss_fn(t_p, t_c)
loss

tensor(5259.7334)

The initial loss value is 5259.7334, which is too high because of the
initial round of weights chosen. The error in the first round of iteration
is backpropagated to reduce the errors in the second round, for which
the initial set of weights needs to be updated. Therefore, the rate of
change in the loss function is essential in updating the weights in the
estimation process.


In [19]:
delta = 0.1

loss_rate_of_change_w = (loss_fn(model(t_u, 
                                       w + delta, b), 
                                 t_c) - loss_fn(model(t_u, w - delta, b), 
                                                t_c)) / (2.0 * delta)

In [20]:
learning_rate = 1e-2

w = w - learning_rate * loss_rate_of_change_w

Two values control this update: delta, the step used to estimate the
rate of change of the loss by a central finite difference, and the learning
rate, which scales how far the weight moves along that estimated slope.
If the loss still changes by more than a certain threshold between two
iterations, the weight tensor needs to be updated again; otherwise the model
can be considered to have converged. The preceding script shows the delta and
learning rate values. Currently, these are static values that the user has the
option to change.


In [21]:
loss_rate_of_change_b = (loss_fn(model(t_u, w, b + delta), t_c) - 
                         loss_fn(model(t_u, w, b - delta), t_c)) / (2.0 * delta)

b = b - learning_rate * loss_rate_of_change_b

b

tensor([544.])
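The finite-difference estimate used above can be cross-checked against autograd. The sketch below reuses the data and the `model` and `loss_fn` definitions from the earlier cells; using `float64` and `requires_grad=True` is a choice made here for the comparison, not something the original cells do. Because the loss is quadratic in w, the central difference matches the exact derivative up to rounding.

```python
import torch

t_c = torch.tensor([58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0,
                    67.0, 68.0, 69.0, 70.0, 71.0, 72.0], dtype=torch.float64)
t_u = torch.tensor([115.0, 117.0, 120.0, 123.0, 126.0, 129.0, 132.0, 135.0, 139.0,
                    142.0, 146.0, 150.0, 154.0, 159.0, 164.0], dtype=torch.float64)

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

w = torch.ones(1, dtype=torch.float64, requires_grad=True)
b = torch.zeros(1, dtype=torch.float64, requires_grad=True)

# Exact derivative of the loss with respect to w, computed by autograd.
loss_fn(model(t_u, w, b), t_c).backward()

# Central finite-difference estimate, as in the cell above.
delta = 0.1
with torch.no_grad():
    fd_w = (loss_fn(model(t_u, w + delta, b), t_c)
            - loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)

print(w.grad.item(), fd_w.item())
```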

This is how a simple mean squared error loss works.
Let’s look at the following example, which applies it to two-dimensional
tensors of size (10, 5). The MSELoss function is within the
neural network module of PyTorch.


In [22]:
from torch import nn
loss = nn.MSELoss()
input = torch.randn(10, 5, requires_grad=True)
target = torch.randn(10, 5)
output = loss(input, target)
output.backward()

When we look at the gradient function that is used for
backpropagation, it is shown as the backward node of MSELoss.


In [23]:
output.grad_fn

<MseLossBackward at 0x7f8e0049afa0>
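As a sanity check on what `nn.MSELoss` computes, the sketch below compares it with a hand-written mean of squared differences and with the analytic gradient. The names `pred`, `builtin`, and `manual` are introduced here for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
pred = torch.randn(10, 5, requires_grad=True)
target = torch.randn(10, 5)

builtin = nn.MSELoss()(pred, target)
manual = ((pred - target) ** 2).mean()
builtin.backward()

# For the mean of squared errors, d(loss)/d(pred) = 2 * (pred - target) / N.
expected_grad = 2 * (pred.detach() - target) / pred.numel()
print(torch.allclose(builtin, manual))
print(torch.allclose(pred.grad, expected_grad))
```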

# Tensor differentiation and its relevance in computational graph execution using the PyTorch framework

The computational graph is represented by nodes that are connected
through functions. There are two different kinds of nodes: dependent and
independent. Dependent nodes wait for results from other nodes
before they can process their input. Independent nodes do not wait on other
nodes; they are either constants or results that are already available. Tensor differentiation is an efficient method to
perform computation in a computational graph environment.

In a computational graph, tensor differentiation is very effective because
the tensors can be computed as parallel nodes, multiprocess nodes, or
multithreading nodes. The major deep learning and neural computation
frameworks include this tensor differentiation.

Autograd is the module that helps perform tensor differentiation,
which means calculating the gradients or slope of the error function,
and backpropagating errors through the neural network to fine-tune the
weights and biases. Combined with a learning rate and repeated iterations,
these gradients are used to reduce the error value or loss function.

To apply tensor differentiation, the `.backward()` method of the output
tensor needs to be called. Let’s take an example and see how the error gradients are
backpropagated. To find where the loss function is minimum and in which
direction it is moving, a derivative calculation is required. Tensor differentiation is a way
to compute the slope of the function in a computational graph.


In [24]:
# Make a sample tensor x, for which automatic gradient calculation needs to happen.
x = Variable(torch.ones(4, 4) * 12.5, requires_grad=True)
x

tensor([[12.5000, 12.5000, 12.5000, 12.5000],
        [12.5000, 12.5000, 12.5000, 12.5000],
        [12.5000, 12.5000, 12.5000, 12.5000],
        [12.5000, 12.5000, 12.5000, 12.5000]], requires_grad=True)

In [25]:
#  Create a quadratic function fn from the x variable.
fn = 2 * (x * x) + 5 * x + 6


#  Using the backward function, we can perform a backpropagation calculation.
#  fn is not a scalar, so backward() is given a tensor of ones with fn's shape.
fn.backward(torch.ones(4,4))

# The .grad attribute holds the final output from the tensor differentiation.
x.grad

tensor([[55., 55., 55., 55.],
        [55., 55., 55., 55.],
        [55., 55., 55., 55.],
        [55., 55., 55., 55.]])
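The value 55 can be checked by hand: the derivative of 2x² + 5x + 6 is 4x + 5, which is 55 at x = 12.5. The sketch below confirms this and uses `fn.sum().backward()`, which gives the same gradients as passing a tensor of ones to `backward()`.

```python
import torch

x = torch.full((4, 4), 12.5, requires_grad=True)
fn = 2 * (x * x) + 5 * x + 6

fn.sum().backward()              # same effect as fn.backward(torch.ones(4, 4))

analytic = 4 * x.detach() + 5    # d/dx (2x^2 + 5x + 6) = 4x + 5
print(torch.allclose(x.grad, analytic))
```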

# Define a feed forward neural network with a toy dataset

The toy dataset, with 5,000 samples each having 64 features, is divided
into 80% train and 20% test. Let’s create a class that defines the neural
network using PyTorch’s nn module.

Feed-forward neural networks were the earliest implementations within
deep learning. These networks are called feed-forward because the
information within them moves only in one direction (forward)—that is,
from the input nodes (units) towards the output units.

We will now implement a simple network using PyTorch, defining a neural network for the purpose of this exercise.

In [26]:
# Creating a toy dataset

samples = 5000

#Let’s divide the toy dataset into training (80%) and rest for validation.
train_split = int(samples*0.8)

#Create a dummy classification dataset
X, y = make_blobs(n_samples=samples, centers=2, n_features=64, cluster_std=10, random_state=2020)
y = y.reshape(-1,1)

#Convert the numpy datasets to Torch Tensors
X,y = tch.from_numpy(X),tch.from_numpy(y)
X,y =X.float(),y.float()

#Split the datasets into train and test (validation)
X_train, x_test = X[:train_split], X[train_split:]
Y_train, y_test = y[:train_split], y[train_split:]

#Print shapes of each dataset
print("X_train.shape:",X_train.shape)
print("x_test.shape:",x_test.shape)
print("Y_train.shape:",Y_train.shape)
print("y_test.shape:",y_test.shape)
print("X.dtype",X.dtype)
print("y.dtype",y.dtype)



X_train.shape: torch.Size([4000, 64])
x_test.shape: torch.Size([1000, 64])
Y_train.shape: torch.Size([4000, 1])
y_test.shape: torch.Size([1000, 1])
X.dtype torch.float32
y.dtype torch.float32


In [27]:
#Define a neural network with 2 hidden layers and 1 output layer
#The input has 64 features; the hidden layers have 256 and 1024 neurons
#The output layer has 1 neuron

class NeuralNetwork(nn.Module):
    
    def __init__(self):
        super().__init__()
        tch.manual_seed(2020)
        self.fc1 = nn.Linear(64, 256)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(256, 1024)
        self.relu2 = nn.ReLU()
        self.out = nn.Linear(1024, 1)
        self.final = nn.Sigmoid()
        
    def forward(self, x):
        op = self.fc1(x)
        op = self.relu1(op)        
        op = self.fc2(op)
        op = self.relu2(op)
        op = self.out(op)
        y = self.final(op)
        return y


The torch.nn module provides the essential means to define and train
neural networks. It contains all the necessary building blocks for creating
neural networks of various kinds, sizes, and complexity. We will create
a class for our neural network by inheriting from `nn.Module` and create an
initializing method as well as a forward pass method.


The `__init__` method creates the different pieces of the network
and keeps them ready every time we create an object of this class.
Essentially, we used the initialization method to create the hidden layers,
the output layer, and the activation for each layer.

`nn.Linear(64,256)` creates a layer with 64 input features and 256 output features.
The next layer, naturally, will have 256 input features, and so on. The `nn.ReLU()` and `nn.Sigmoid()` modules apply the activation function to the output of
the layer they are connected to. Each of the individual components created within the
initialization function is connected in the `forward()` method.


In the forward method, we connect the individual components of
the neural network. The first hidden layer, fc1, accepts input data and
produces 256 outputs for the next layer. The output of fc1 is passed to the
relu1 activation layer, which then passes the activated output to the next
layer, fc2, which repeats the same process. Finally, the output layer is followed
by the sigmoid activation function (since our toy dataset is crafted
for binary classification).


On creating an object of the class NeuralNetwork and calling it on input
data (which invokes the forward method), we get outputs from the network.
They are computed by multiplying the input matrix with a randomly initialized
weight matrix, passing the result through an activation function, and
repeating this for each hidden layer until the final output layer. At first,
the network will obviously generate junk predictions, which add no
value to our classification problem, at least not yet.
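To see what the untrained network returns, the sketch below pushes a small random batch through a stand-in with the same layer sizes. It uses `nn.Sequential` instead of the `NeuralNetwork` class only to keep the snippet self-contained, and the batch of 8 samples is made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(2020)

# Stand-in with the same layer sizes as the NeuralNetwork class above.
net = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 1024), nn.ReLU(),
    nn.Linear(1024, 1), nn.Sigmoid(),
)

batch = torch.randn(8, 64)           # 8 made-up samples with 64 features each
with torch.no_grad():                # no gradients needed just to inspect outputs
    preds = net(batch)

print(preds.shape)                   # one value per sample
print(preds.min().item(), preds.max().item())   # sigmoid keeps outputs between 0 and 1
```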


# Defining the Loss, Optimizer, and Training Function for the Neural Network

To get more accurate predictions for our given problem, we would
need to train the network—i.e., to backpropagate the loss and update the
weights with respect to the loss function. Fortunately, PyTorch provides
these essential building blocks in an extremely easy to use and intuitive
way. 


In [28]:
#Define function for training a network
def train_network(model,optimizer,loss_function, num_epochs,batch_size,X_train,Y_train):
    #Explicitly start model training
    model.train()

    loss_across_epochs = []
    for epoch in range(num_epochs):
        train_loss= 0.0


        for i in range(0,X_train.shape[0],batch_size):

            #Extract train batch from X and Y
            input_data = X_train[i:min(X_train.shape[0],i+batch_size)]
            labels = Y_train[i:min(X_train.shape[0],i+batch_size)]

            #Set the gradients to zero before starting backpropagation
            optimizer.zero_grad()

            #Forward pass
            output_data  = model(input_data)

            #Calculate loss
            loss = loss_function(output_data, labels)

            #Backpropagate
            loss.backward()

            #Update weights
            optimizer.step()

            train_loss += loss.item() * batch_size

        print("Epoch: {} - Loss:{:.4f}".format(epoch+1,train_loss ))
        loss_across_epochs.extend([train_loss])        

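    #Predict on the held-out split (x_test is the global tensor created earlier);
    #the thresholded predictions are computed for illustration only and are not returned
    y_test_pred = model(x_test)
    a = np.where(y_test_pred.detach().numpy() > 0.5, 1, 0)
    return loss_across_epochs
###------------END OF FUNCTION--------------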

#Create an object of the Neural Network class
model = NeuralNetwork()

#Define loss function
loss_function = nn.BCELoss()  #Binary Cross Entropy Loss

#Define Optimizer
adam_optimizer = tch.optim.Adam(model.parameters(),lr= 0.001)

#Define epochs and batch size
num_epochs = 10
batch_size=16


#Call the training function, passing the model, optimizer, loss and related parameters
adam_loss = train_network(model,adam_optimizer \
,loss_function,num_epochs,batch_size,X_train,Y_train)



Epoch: 1 - Loss:107.9976
Epoch: 2 - Loss:8.7378
Epoch: 3 - Loss:8.2710
Epoch: 4 - Loss:0.8969
Epoch: 5 - Loss:0.2221
Epoch: 6 - Loss:0.0017
Epoch: 7 - Loss:0.0016
Epoch: 8 - Loss:0.0014


Let’s look at the individual components we defined leveraging PyTorch’s readily provided building
blocks. We need to define a loss function that will be used to measure
the difference between our predictions and actual labels. PyTorch
provides a comprehensive list of loss functions for different outcomes.
These loss functions are available under torch.nn.*. Examples include
MSELoss (mean squared error loss), CrossEntropyLoss (for multi-class
classification), and BCELoss (binary cross-entropy loss), which is used
for binary classification. For our use case, we will leverage
binary cross-entropy loss.

This is defined as follows.

```py
loss_function = torch.nn.BCELoss()
```
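To make the loss concrete, the sketch below compares `nn.BCELoss` with the binary cross-entropy formula written out by hand. The probabilities and labels are made-up values.

```python
import torch
from torch import nn

probs = torch.tensor([0.9, 0.2, 0.7])     # made-up predicted probabilities
labels = torch.tensor([1.0, 0.0, 0.0])    # made-up binary labels

builtin = nn.BCELoss()(probs, labels)

# Mean over samples of -[y * log(p) + (1 - y) * log(1 - p)].
manual = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()

print(builtin.item(), manual.item())
```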



# Gradient-Based Optimization Techniques

### Gradient Descent with Momentum

Gradient descent with momentum leverages the past
gradients to calculate an exponentially weighted average of the gradients
to further smoothen the parameter updates.

![Imgur](https://imgur.com/ZNSJbau.png)

The update process can be simplified using the following equations.
First, we compute an exponentially weighted average of the past gradients
as $\nu_t$, where

$$\nu_t = \gamma \nu_{t-1} + \eta \nabla_\Theta J(\Theta)$$

and

$$\Theta = \Theta - \nu_t.$$


The γ here is a hyperparameter that takes values between 0 and 1.
Next, we use this exponentially weighted average in the updates of weights
instead of the gradients directly.

By leveraging the exponentially weighted averages of the gradients,
instead of directly using the gradients, the incremental steps are smoother and
faster and thus overcome the problems with oscillating around the minima.
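The two update equations can be sketched directly in code. The objective J(Θ) = 0.5 Θ², the starting point, and the values of γ and η are assumptions chosen for illustration. The second loop runs `torch.optim.SGD` with the same settings as a cross-check; PyTorch applies the learning rate at the parameter update rather than inside the velocity, which is equivalent for a constant learning rate.

```python
import torch

def grad_J(theta):
    # Gradient of the toy objective J(theta) = 0.5 * theta^2 (an assumed example).
    return theta

gamma, eta = 0.9, 0.1            # assumed hyperparameter values
theta = torch.tensor([5.0], dtype=torch.float64)
nu = torch.zeros_like(theta)     # nu_0 = 0

for _ in range(100):
    nu = gamma * nu + eta * grad_J(theta)   # nu_t = gamma * nu_{t-1} + eta * gradient
    theta = theta - nu                       # theta = theta - nu_t

# Cross-check against torch.optim.SGD with momentum.
p = torch.tensor([5.0], dtype=torch.float64, requires_grad=True)
opt = torch.optim.SGD([p], lr=eta, momentum=gamma)
for _ in range(100):
    opt.zero_grad()
    (0.5 * p ** 2).sum().backward()
    opt.step()

print(theta.item(), p.item())    # both runs end at the same point, near the minimum at 0
```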



### RMSprop

At the core, RMSprop computes the moving
average of the squared gradients for each weight and divides the gradient
by the square root of that mean square. This description should help
in decoding the name: root mean square propagation. Using an exponential
average gives recent gradients more weight than older
ones.
RMSprop can be represented as follows:

![Imgur](https://imgur.com/MaLGGt9.png)


where η is a hyperparameter that defines the initial learning rate, and
$g_t$ is the gradient at time t for a parameter/weight w in Θ. We add ε to the
denominator to avoid divide-by-zero situations.
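A minimal sketch of this update for a single weight, assuming the common form in which ε is added after taking the square root (the figure above may place it slightly differently). The toy loss 0.5 w² and the hyperparameter values are assumptions; `torch.optim.RMSprop` is run alongside as a cross-check.

```python
import torch

rho, eta, eps = 0.9, 0.01, 1e-8      # assumed hyperparameter values

w = torch.tensor([5.0], dtype=torch.float64)
sq_avg = torch.zeros_like(w)         # moving average of squared gradients

for _ in range(50):
    g = w                            # gradient of the toy loss 0.5 * w^2
    sq_avg = rho * sq_avg + (1 - rho) * g ** 2
    w = w - eta * g / (sq_avg.sqrt() + eps)

# Cross-check against torch.optim.RMSprop (alpha plays the role of rho).
p = torch.tensor([5.0], dtype=torch.float64, requires_grad=True)
opt = torch.optim.RMSprop([p], lr=eta, alpha=rho, eps=eps)
for _ in range(50):
    opt.zero_grad()
    (0.5 * p ** 2).sum().backward()
    opt.step()

print(w.item(), p.item())
```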


### Adam

Short for adaptive moment estimation, Adam has recently been the most
popular choice of optimizer in deep learning. Put simply,
Adam combines the best of RMSprop and stochastic gradient descent
with momentum. From RMSprop, it borrows the idea of using squared
gradients to scale the learning rate, and from SGD with momentum,
it takes the idea of using moving averages of the gradient instead of
the gradient directly.
Here, for each weight w in Θ, we have

![Imgur](https://imgur.com/7pt6keW.png)
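The Adam update can be sketched for a single weight as below, assuming the standard bias-corrected form. The toy loss 0.5 w² is an assumption, and the hyperparameters are the defaults of `torch.optim.Adam`, which is run alongside as a cross-check.

```python
import torch

beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8   # torch.optim.Adam defaults

w = torch.tensor([5.0], dtype=torch.float64)
m = torch.zeros_like(w)     # moving average of gradients (first moment)
v = torch.zeros_like(w)     # moving average of squared gradients (second moment)

for t in range(1, 51):
    g = w                                   # gradient of the toy loss 0.5 * w^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)            # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (v_hat.sqrt() + eps)

# Cross-check against torch.optim.Adam.
p = torch.tensor([5.0], dtype=torch.float64, requires_grad=True)
opt = torch.optim.Adam([p], lr=eta, betas=(beta1, beta2), eps=eps)
for _ in range(50):
    opt.zero_grad()
    (0.5 * p ** 2).sum().backward()
    opt.step()

print(w.item(), p.item())
```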

The preceding three types of optimization algorithms represent just a
few from the breadth of available options for different types of use cases
within deep learning. 

---

# Training Model with Various Optimizers

Next, we define an optimizer for our network.

PyTorch provides a comprehensive list of optimizers that can be used for building various
kinds of neural networks. All optimizers are organized under
`torch.optim.*` (e.g., `torch.optim.SGD`, for the SGD optimizer). For our use case,
we are using the Adam optimizer (a commonly recommended default for
the majority of use cases). While defining the optimizer, we also need
to define the parameters for which the gradient needs to be computed
during backpropagation. For the neural network, this list would be all
the weights in the feed-forward network. We can easily pass the entire
list of model weights to the optimizer by using `model.parameters()`
within the definition of the optimizer. We can then additionally define
hyperparameters for the selected optimizer. By default, PyTorch provides
fairly good values for all necessary hyperparameters. However, we can
further override them to tailor optimizers for our use case.

```py
adam_optimizer = tch.optim.Adam(model.parameters(), lr=0.001)
```
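
To see which hyperparameters an optimizer ends up with, one option is to inspect its `param_groups`. The sketch below uses a hypothetical stand-in model (`nn.Linear(4, 1)`) rather than the notebook's network, and the override values are arbitrary examples, not recommendations.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the notebook's model, used only for illustration.
model = nn.Linear(4, 1)

# Override some hyperparameters explicitly; unspecified ones keep PyTorch's defaults.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

# The effective hyperparameters are stored per parameter group.
for group in optimizer.param_groups:
    print(group["lr"], group["betas"], group["eps"], group["weight_decay"])
```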

Lastly, we need to define the batch size and the number of epochs
required to train our model. Batch size refers to the number of samples
within a batch in a mini-batch update. One forward and backward pass
for all the batches that cover all samples once is called an epoch. Finally,
we pass all these constructs to our function to train our model. Let’s take a
detailed look at the constructs within the function.


In our training function, we define a structure to train our network with
the provided optimizer, loss function, model object, and training data over
batches for the defined number of epochs. First, we put our model into
training mode with `model.train()`. Setting the mode explicitly
is essential, and the same applies when using the
model for evaluation, i.e., explicitly setting the model to evaluation mode
with `model.eval()`. The mode controls how layers such as dropout and batch
normalization behave, since they act differently during training and
inference; it does not by itself enable or disable parameter updates. In the preceding
example, we did not add the evaluation loop because it is a tiny toy dataset.
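
The effect of the two modes can be checked on a small example. The sketch below assumes a hypothetical two-layer model containing a dropout layer (not the notebook's network) and shows that the `training` flag flips and that evaluation-mode outputs are repeatable because dropout is switched off.

```python
import torch
import torch.nn as nn

# Hypothetical tiny model with a dropout layer, used only for illustration.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

model.train()                # dropout is active in training mode
train_flag = model.training

model.eval()                 # dropout is disabled in evaluation mode
with torch.no_grad():        # gradient tracking is not needed for inference
    out1 = model(x)
    out2 = model(x)

print(train_flag, model.training, torch.equal(out1, out2))
```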


We will train the network over mini-batches. The for loop divides the
training data into batches with our defined size. The training data, along
with the corresponding labels, is extracted for a batch using the following
code:



```py
input_data = X_train[i:min(X_train.shape[0], i + batch_size)]
labels = Y_train[i:min(X_train.shape[0], i + batch_size)]
```
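
To make the batching logic concrete, here is a small sketch of the surrounding loop. It uses plain Python lists as hypothetical toy data in place of the notebook's tensors, so `len(X_train)` plays the role of `X_train.shape[0]`.

```python
# Hypothetical toy data: 10 samples held in plain lists instead of tensors.
X_train = [[float(j)] for j in range(10)]
Y_train = [j % 2 for j in range(10)]
batch_size = 4
n_samples = len(X_train)  # plays the role of X_train.shape[0]

batch_sizes = []
for i in range(0, n_samples, batch_size):
    end = min(n_samples, i + batch_size)  # clamp the end index of the final batch
    input_data = X_train[i:end]
    labels = Y_train[i:end]
    batch_sizes.append(len(input_data))

print(batch_sizes)  # [4, 4, 2]
```

The final batch holds only the two leftover samples. Python slicing already stops at the end of a sequence, so the `min(...)` mainly makes the intent explicit.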

We then need to set the gradients to zero before starting
backpropagation, using `optimizer.zero_grad()`. Skipping this step causes
gradients to accumulate across subsequent backward passes, so each update
would mix in stale gradients from earlier batches. This accumulation is by design in PyTorch. Then, we
compute the forward pass using `output_data = model(input_data)`.
The forward pass is the execution of the `forward()` function in our class
definition. It connects the different layers we defined for the network
and finally outputs the prediction for each sample.
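
The accumulation behavior is easy to observe on a single parameter. The following sketch uses a hypothetical scalar tensor `w` and a throwaway SGD optimizer purely to show the effect; it is not part of the notebook's training code.

```python
import torch

# Hypothetical single parameter, used only to demonstrate gradient accumulation.
w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes without clearing: the gradients add up.
(w * 3.0).backward()
(w * 3.0).backward()
accumulated = w.grad.item()   # 3.0 + 3.0 = 6.0

optimizer.zero_grad()         # clear the stored gradient before the next pass
(w * 3.0).backward()
fresh = w.grad.item()         # 3.0

print(accumulated, fresh)
```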

Once we have the predictions, we can calculate their deviation from the actual labels using the
loss function, i.e., `loss = loss_function(output_data, labels)`.


To backpropagate our loss, PyTorch provides a built-in engine (autograd) that
does the heavy lifting of computing gradients of the loss with respect to
the weights. We simply call the `loss.backward()` method, and the entire
backpropagation is taken care of.

Once the gradients are computed, it is time to update our model weights. This is done by calling
`optimizer.step()`. The optimizer knows which parameters
need to be updated with the gradients, as we provided them while defining
our optimizer. Calling `optimizer.step()` updates the weights
of the network, automatically taking into account the hyperparameters
defined within the optimizer (in our case, the learning rate).
We repeat this process over batches for the entire training sample. The
training process is repeated for multiple epochs, and with each epoch
we expect the loss to decrease and the weights to move toward values that
give more accurate predictions.
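
Putting the steps above together, a training function could look roughly like the sketch below. The name `train_network_sketch`, the toy data, and the toy model are all hypothetical and introduced only for illustration; the notebook's own `train_network` may differ in its details.

```python
import torch
import torch.nn as nn

def train_network_sketch(model, optimizer, loss_function,
                         num_epochs, batch_size, X_train, Y_train):
    """Illustrative loop; the notebook's own train_network may differ."""
    model.train()
    epoch_losses = []
    for epoch in range(num_epochs):
        total_loss = 0.0
        for i in range(0, X_train.shape[0], batch_size):
            # Slicing stops at the end of the data, so the last batch may be smaller.
            input_data = X_train[i:i + batch_size]
            labels = Y_train[i:i + batch_size]

            optimizer.zero_grad()                       # clear old gradients
            output_data = model(input_data)             # forward pass
            loss = loss_function(output_data, labels)   # compare with labels
            loss.backward()                             # backpropagation
            optimizer.step()                            # update weights

            total_loss += loss.item()
        epoch_losses.append(total_loss)
    return epoch_losses

# Hypothetical toy data and model, for demonstration only.
torch.manual_seed(0)
X = torch.rand(64, 4)
Y = (X.sum(dim=1, keepdim=True) > 2.0).float()
toy_model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.01)
losses = train_network_sketch(toy_model, toy_optimizer, nn.BCELoss(), 5, 16, X, Y)
print(losses)
```

Returning one summed loss per epoch keeps the result easy to plot against the epoch index.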

The code below uses different optimizers to illustrate the training process
for the preceding neural network. Since the network is trained on a toy
dataset, we plot the total loss after each epoch for each optimizer
instead of plotting the validation accuracy. The plot at the bottom shows
how the loss evolves across epochs for each optimization variant.


```py
# Define the loss function
loss_function = nn.BCELoss()  # Binary Cross Entropy Loss
num_epochs = 10
batch_size = 16

# Define a model object from the class defined earlier
model = NeuralNetwork()

# Train the network using the RMSprop optimizer
rmsprp_optimizer = tch.optim.RMSprop(model.parameters(),
                                     lr=0.01, alpha=0.9,
                                     eps=1e-08, weight_decay=0.1,
                                     momentum=0.1, centered=True)
print("RMSProp...")
rmsprop_loss = train_network(model, rmsprp_optimizer, loss_function,
                             num_epochs, batch_size, X_train, Y_train)

# Train the network using the Adam optimizer (on a freshly initialized model)
model = NeuralNetwork()
adam_optimizer = tch.optim.Adam(model.parameters(), lr=0.001)
print("Adam...")
adam_loss = train_network(model, adam_optimizer, loss_function,
                          num_epochs, batch_size, X_train, Y_train)

# Train the network using the SGD optimizer (on a freshly initialized model)
model = NeuralNetwork()
sgd_optimizer = tch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
print("SGD...")
sgd_loss = train_network(model, sgd_optimizer, loss_function,
                         num_epochs, batch_size, X_train, Y_train)

# Plot the losses for each optimizer across epochs
import matplotlib.pyplot as plt
%matplotlib inline

ax = plt.subplot(111)
ax.plot(adam_loss, label="ADAM")
ax.plot(sgd_loss, label="SGD")
ax.plot(rmsprop_loss, label="RMSProp")
ax.legend()
plt.xlabel("Epochs")
plt.ylabel("Overall Loss")
plt.title("Loss across epochs for different optimizers")
plt.show()
```

