#### PyTorch has emerged as a major contender in the race to be the king of deep learning frameworks.

In this notebook I will go over some commonly used PyTorch snippets and techniques.

# Compute basic gradients from the sample tensors using PyTorch

#### First some basics of PyTorch here

**Autograd**: This class is an engine to calculate derivatives (Jacobian-vector products, to be more precise). It records all the operations performed on a gradient-enabled tensor and builds a directed acyclic graph called the dynamic computational graph. The leaves of this graph are input tensors and the roots are output tensors. Gradients are calculated by tracing the graph from the root to the leaves and multiplying every gradient along the way using the chain rule.

A Variable class wraps a tensor. You can access this tensor through the `.data` attribute of a Variable.

The Variable also stores the gradient of a scalar quantity (say, the loss) with respect to the parameter it holds. This gradient can be accessed through the `.grad` attribute. This is basically the gradient computed up to this particular node; the gradient of every subsequent node can be computed by multiplying the edge weight with the gradient computed at the node just before it.

The third attribute a Variable holds is `grad_fn`, a Function object that created the Variable.

**Variable**: The Variable, just like a Tensor, is a class that is used to hold data. It differs, however, in the way it’s meant to be used. Variables are specifically tailored to hold values which change during training of a neural network, i.e. the learnable parameters of our network. Tensors, on the other hand, are used to store values that are not to be learned. For example, a Tensor may be used to store the values of the loss generated by each example. Note that since PyTorch 0.4 the two classes have been merged: a Tensor created with `requires_grad=True` carries `.grad` and `.grad_fn` itself, and the `Variable` wrapper is kept only for backward compatibility.

Every **Variable** object has several members; one of them is **grad**:

**grad**: grad holds the value of gradient. If requires_grad is False it will hold a None value. Even if requires_grad is True, it will hold a None value unless .backward() function is called from some other node. For example, if you call out.backward() for some variable out that involved x in its calculations then x.grad will hold ∂out/∂x.

**Backward() function**
Backward is the function which actually calculates the gradient by passing its argument (a 1x1 unit tensor by default) through the backward graph all the way up to every leaf node traceable from the calling root tensor. The calculated gradients are then stored in `.grad` of every leaf node. Remember, the backward graph is already built dynamically during the forward pass. The backward function only calculates the gradients using the already built graph and stores them in the leaf nodes.
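As a minimal sketch of the behaviour described above (the tensor `x` and the function `out = sum(x**2)` are illustrative choices, not part of the original notebook), `.grad` starts out as `None` and is only filled in once `.backward()` is called on a scalar that depends on the tensor:

```python
import torch

# Illustrative leaf tensor with gradient tracking enabled
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: autograd records the graph while this runs
out = (x ** 2).sum()

grad_before = x.grad   # None, because backward() has not been called yet
out.backward()         # fills x.grad with d(out)/dx = 2 * x
grad_after = x.grad

print(grad_before)
print(grad_after)
print(out.grad_fn)     # the Function object that produced `out`
```

Here `out` is a root of the graph and `x` is a leaf, so the gradient ends up stored on `x` only.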

In [1]:
import torch
from torch.autograd import Variable

def forward(x):
    return x * w

w = Variable(torch.Tensor([1.0]), requires_grad=True)
# Setting requires_grad=True makes autograd track every operation applied
# to w, building a backward graph known as a dynamic computation graph (DCG)
# that is later used to calculate the gradients.
# When you finish your computation you can call .backward() and have
# all the gradients computed automatically. The gradient for this tensor
# will be accumulated into its .grad attribute.

# Now create the data.
# By PyTorch’s design, gradients can only be calculated
# for floating point tensors, which is why I’ve used float values
# throughout, including for the gradient-enabled tensor w above.
x_data = [11.0, 22.0, 33.0]
y_data = [21.0, 14.0, 64.0]

def loss_function(x, y):
    y_pred = forward(x)
    return (y_pred - y) * (y_pred - y)


# Now running the training loop
for epoch in range(10):
    for x_val, y_val in zip(x_data, y_data):
        l = loss_function(x_val, y_val)
        l.backward()
        print("\tgrad: ", x_val, y_val, w.grad.data[0])
        w.data = w.data - 0.01 * w.grad

        # Manually set the gradient to zero after updating weights
        w.grad.data.zero_()

        print('progress: ', epoch, l.data[0])

	grad:  11.0 21.0 tensor(-220.)
progress:  0 tensor(100.)
	grad:  22.0 14.0 tensor(2481.6001)
progress:  0 tensor(3180.9602)
	grad:  33.0 64.0 tensor(-51303.6484)
progress:  0 tensor(604238.8125)
	grad:  11.0 21.0 tensor(118461.7578)
progress:  1 tensor(28994192.)
	grad:  22.0 14.0 tensor(-671630.6875)
progress:  1 tensor(2.3300e+08)
	grad:  33.0 64.0 tensor(13114108.)
progress:  1 tensor(3.9481e+10)
	grad:  11.0 21.0 tensor(-30279010.)
progress:  2 tensor(1.8943e+12)
	grad:  22.0 14.0 tensor(1.7199e+08)
progress:  2 tensor(1.5279e+13)
	grad:  33.0 64.0 tensor(-3.3589e+09)
progress:  2 tensor(2.5900e+15)
	grad:  11.0 21.0 tensor(7.7553e+09)
progress:  3 tensor(1.2427e+17)
	grad:  22.0 14.0 tensor(-4.4050e+10)
progress:  3 tensor(1.0023e+18)
	grad:  33.0 64.0 tensor(8.6030e+11)
progress:  3 tensor(1.6991e+20)
	grad:  11.0 21.0 tensor(-1.9863e+12)
progress:  4 tensor(8.1519e+21)
	grad:  22.0 14.0 tensor(1.1282e+13)
progress:  4 tensor(6.5750e+22)
	grad:  33.0 64.0 tensor(-2.2034e+14)
pro
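
The loss in the output above grows at every step instead of shrinking. One plausible reading is that a step size of 0.01 is too large for inputs as big as 33, so each update overshoots. The sketch below reruns the same loop with a smaller, assumed learning rate of `1e-4` and a plain tensor in place of `Variable`; both choices are mine and not from the original notebook.

```python
import torch

x_data = [11.0, 22.0, 33.0]
y_data = [21.0, 14.0, 64.0]

w = torch.tensor([1.0], requires_grad=True)
learning_rate = 1e-4  # assumed value, chosen small enough for these inputs

for epoch in range(10):
    for x_val, y_val in zip(x_data, y_data):
        loss = (x_val * w - y_val) ** 2
        loss.backward()
        # Update the weight without recording the update in the graph
        with torch.no_grad():
            w -= learning_rate * w.grad
        # Reset the accumulated gradient before the next sample
        w.grad.zero_()

final_w = w.item()
print(final_w)
```

With this step size each update moves `w` only part of the way toward the value that fits the current sample, so the weight stays bounded.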

Weight initialization is an important task in training a neural network,
whether it’s a convolutional neural network
(CNN), a deep neural network (DNN), or a recurrent neural network
(RNN). Let’s look at some examples of initializing the weights.


Weight initialization can be done using various methods, including
random weight initialization.
Weight initialization based on a distribution
is done using
- Uniform distribution,
- Bernoulli distribution,
- Multinomial distribution, and
- Normal distribution.

To train a neural network, a set of initial weights is needed so that the
forward pass can produce predictions; the loss function (and hence the
accuracy) is computed from those predictions, and backpropagation then
adjusts the weights. The selection of an initialization method depends on the
data type, the task, and the optimization required for the model.
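
As a rough sketch of what distribution-based initialization looks like on an actual layer, the snippet below fills the weights of a small linear layer first from a uniform and then from a normal distribution using `torch.nn.init`. The layer size, the bounds, and the standard deviation are arbitrary values picked for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical layer: 4 input features, 3 output features
layer = nn.Linear(4, 3)

# Uniform initialization in [-0.1, 0.1]
nn.init.uniform_(layer.weight, a=-0.1, b=0.1)
uniform_weights = layer.weight.detach().clone()

# Normal initialization with mean 0 and standard deviation 0.02
nn.init.normal_(layer.weight, mean=0.0, std=0.02)
normal_weights = layer.weight.detach().clone()

print(uniform_weights)
print(normal_weights)
```

Both functions modify the weight tensor in place, which is why a copy is taken after each call.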

A Bernoulli trial is a random experiment that has only two outcomes (usually called a “Success” or a “Failure”), and the Bernoulli distribution is the discrete
probability distribution over those two outcomes. It is best used when a given event has exactly two outcomes. If the event happens, then the value is 1, and if the event does not happen, then the value is 0.

For a discrete probability distribution, we calculate a probability mass
function instead of a probability density function. The probability mass
function looks like the following formula.

![](https://i.imgur.com/bz2dWtc.png)
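
For reference, the standard Bernoulli probability mass function with success probability $p$ can be written as below. The symbols are the usual textbook ones and may differ from those used in the image.

$$P(X = x) = p^{x} (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$

Setting $x = 1$ gives $P(X = 1) = p$, and setting $x = 0$ gives $P(X = 0) = 1 - p$.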

To draw from the Bernoulli distribution, we first create a 4 x 4 matrix of probabilities sampled from a uniform distribution on [0, 1], and then pass it to `torch.bernoulli()`, as shown in cell `In [2]` below.

Specifically, `torch.bernoulli()` samples from the distribution and returns a binary value (i.e. either 0 or 1) for each element. Here, it returns 1 with probability p and returns 0 with probability 1-p.


Syntax

```python
torch.bernoulli(input, *, generator=None, out=None) → Tensor
```

Parameters:

- `input` (Tensor): the input tensor of probability values for the Bernoulli distribution
- `generator` (torch.Generator, optional): a pseudorandom number generator for sampling
- `out` (Tensor, optional): the output tensor; it only has values 0 or 1 and is of the same shape as `input`

The input tensor should contain the probabilities to be used for drawing the binary random numbers. Hence, all values in `input` have to be in the range:

$$0 \leq \text{input}_i \leq 1$$
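
To make the role of the probabilities concrete, here is a small sketch with hand-picked values (the probabilities 0, 1, and 0.3 and the sample size are assumptions chosen for illustration). Probabilities of exactly 0 or 1 give deterministic results, and a large sample with p = 0.3 should contain roughly 30% ones.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Probabilities of 0 and 1 leave no randomness in the outcome
always_zero = torch.bernoulli(torch.zeros(5))
always_one = torch.bernoulli(torch.ones(5))

# With p = 0.3 for every element, about 30% of the draws should be 1
p = torch.full((100_000,), 0.3)
samples = torch.bernoulli(p)
empirical_mean = samples.mean().item()

print(always_zero)
print(always_one)
print(empirical_mean)
```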


In [2]:
torch.bernoulli(torch.Tensor(4, 4).uniform_(0, 1))


tensor([[1., 1., 0., 0.],
        [1., 0., 1., 0.],
        [1., 1., 0., 0.],
        [1., 1., 0., 1.]])

# Generation of sample random values from a multinomial distribution

Note the syntax of the `multinomial` function from the official documentation:

```python
torch.multinomial(input, num_samples, replacement=False, *, generator=None, out=None) → LongTensor
```
Returns a tensor where each row contains num_samples indices sampled from the multinomial probability distribution located in the corresponding row of tensor input.



In [3]:
# Unnormalized weights; torch.multinomial treats each value as a relative weight
sample_tensor = torch.tensor([10., 10., 13., 10., 34., 45., 65., 67., 87., 89., 87., 34.])
# Draw 3 distinct indices (no replacement by default)
torch.multinomial(sample_tensor, 3)

tensor([8, 7, 6])

Sampling from a multinomial distribution returns index positions into the input tensor, not the values themselves. With `replacement=True`, the same index can be drawn more than once.

In [4]:
torch.multinomial(torch.tensor([10., 10., 13., 10., 34., 45., 65., 67., 87., 89., 87., 34.]), 5, replacement=True)

tensor([ 2, 10,  7,  7,  5])
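
Two properties of `torch.multinomial` can be checked with a tiny sketch. The weight vectors below are made-up values chosen so that the outcome is predictable: an entry with zero weight is never drawn, and sampling without replacement never repeats an index.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Only index 1 has a non-zero weight, so every draw must return index 1
weights = torch.tensor([0.0, 10.0, 0.0])
with_replacement = torch.multinomial(weights, 5, replacement=True)

# Without replacement, drawing as many samples as there are entries
# returns each index exactly once, in a random order
equal_weights = torch.tensor([1.0, 1.0, 1.0, 1.0])
without_replacement = torch.multinomial(equal_weights, 4)

print(with_replacement)
print(without_replacement)
```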

And now, weight initialization from the normal distribution, which is also a method
used when fitting neural networks, including deep neural networks,
CNNs, and RNNs. Let’s have a look at the process of creating a set of random
weights generated from a normal distribution.

Syntax

```python
torch.normal(mean, std, *, generator=None, out=None) → Tensor
```
Returns a tensor of random numbers drawn from separate normal distributions whose mean and standard deviation are given.

The `mean` is a tensor with the mean of each output element’s normal distribution.

The `std` is a tensor with the standard deviation of each output element’s normal distribution.

The shapes of `mean` and `std` don’t need to match, but the total number of elements in each tensor needs to be the same.

In [5]:
torch.normal(mean=torch.arange(1., 11), std=torch.arange(1, 0, -0.1))

tensor([ 2.2550,  2.3941,  4.1017,  3.8206,  3.8617,  6.7376,  6.9885,  8.2895,
         9.0641, 10.1435])

In [6]:
torch.normal(mean=0.5, std=torch.arange(1.,6.))

tensor([ 1.7196, -1.7331, -0.5258,  8.6843,  6.7171])

In [7]:
torch.normal(mean=0.5, std=torch.arange(0.2, 0.6))


tensor([0.1715])
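
The last call returns a single value because `torch.arange(0.2, 0.6)` uses the default step of 1 and therefore holds only one element. If several standard deviations were intended, one option (my assumption, not something the notebook states) is `torch.linspace`, which takes the number of points explicitly:

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# arange with the default step of 1 yields a single element here
single_std = torch.arange(0.2, 0.6)

# linspace gives four evenly spaced standard deviations from 0.2 to 0.5
several_stds = torch.linspace(0.2, 0.5, 4)
samples = torch.normal(mean=0.5, std=several_stds)

print(single_std.numel())
print(samples.shape)
```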

# What is a Variable in PyTorch and how is it defined? What is a random variable in PyTorch?

In PyTorch, algorithms are represented as a computational graph.

A variable is a wrapper around the tensor object, its
corresponding gradients (the slope of the function), and a reference to the function from which it was
created.

The slope of the function can be computed by the derivative of the
function with respect to the parameters that are present in the function.

Basically, a PyTorch variable is a node in a computational graph, which
stores data and gradients. When training a neural network model, after
each iteration, we need to compute the gradient of the loss function with
respect to the parameters of the model, such as weights and biases. After
that, we usually update the weights using the gradient descent algorithm.

The figure below shows how the linear regression equation is implemented under
the hood using a neural network model in the PyTorch framework.
In a computational graph structure, the sequencing and ordering
of tasks is very important. The one-dimensional tensors are X, Y, W,
and alpha. The direction of the arrows changes when we
implement backpropagation to update the weights to match Y, so that
the error or loss function between Y and the predicted Y can be minimized.


![Imgur](https://imgur.com/6JOtOGb.png)

#### Let’s see an example

An example of how a variable is used to create a computational graph is
displayed in the following script. There are three variable objects around
tensors (x1, x2, and x3), each filled with random values and shaped by a = 12 and
b = 23. The graph computation involves only multiplication and addition,
and the final result is shown together with its gradient function.

The partial derivatives of the loss function with respect to the weights
and biases in a neural network model are computed in PyTorch using the
Autograd module. Variables are specifically designed to hold the values
that change while running backpropagation in a neural network model, when
the parameters of the model change. The variable type is just a wrapper
around the tensor. It has three properties: `data`, `grad`, and `grad_fn` (the function that created it).





In [8]:
from torch.autograd import Variable
Variable(torch.ones(2,2), requires_grad=True)


tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
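
The output above prints as a plain `tensor(...)`. Since PyTorch 0.4 the `Variable` wrapper returns an ordinary tensor, so the short sketch below (the names `wrapped` and `plain` are mine) compares it with a tensor created directly with `requires_grad=True`:

```python
import torch
from torch.autograd import Variable

wrapped = Variable(torch.ones(2, 2), requires_grad=True)
plain = torch.ones(2, 2, requires_grad=True)

print(isinstance(wrapped, torch.Tensor))            # the wrapper yields a Tensor
print(wrapped.requires_grad, plain.requires_grad)   # both track gradients
print(wrapped.is_leaf, wrapped.grad_fn)             # a leaf has no grad_fn
```

In new code the second form is usually enough; the wrapper is kept in this notebook to match the original examples.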

In [9]:
a, b = 12, 23
x1 = Variable(torch.randn(a, b), requires_grad=True )
x2 = Variable(torch.randn(a,b), requires_grad=True)
x3 = Variable(torch.randn(a,b), requires_grad=True)

In [10]:
c = x1 * x2
d = a + x3
e = torch.sum(d)

e.backward()

print(e)

tensor(3308.0266, grad_fn=<SumBackward0>)
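
Note that in the cell above `d` is built from the integer `a` and `x3`, so the product `c = x1 * x2` never reaches `e`. The sketch below, which uses plain tensors and a fixed seed as assumptions of mine, shows the consequence for the gradients and contrasts it with a hypothetical variant where `c` is added instead of `a`:

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable
a, b = 12, 23
x1 = torch.randn(a, b, requires_grad=True)
x2 = torch.randn(a, b, requires_grad=True)
x3 = torch.randn(a, b, requires_grad=True)

# Same graph as the cell above: e depends on x3 only
e = torch.sum(a + x3)
e.backward()
x1_grad_unused = x1.grad       # None, because x1 never reached e
x3_grad = x3.grad.clone()      # cloned, since .grad accumulates across backward calls

# Hypothetical variant that routes the product c into the sum
c = x1 * x2
e_variant = torch.sum(c + x3)
e_variant.backward()
x1_grad_variant = x1.grad      # equals x2, since d(x1 * x2)/dx1 = x2

print(x1_grad_unused)
print(x3_grad[0, :3])
print(torch.equal(x1_grad_variant, x2.detach()))
```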


In [11]:
x1.data

tensor([[-1.1948e+00, -4.3463e-01,  1.4228e-02, -1.1072e+00, -2.5251e+00,
         -4.8485e-01,  7.0058e-01, -1.6862e+00,  7.9322e-01,  4.1430e-01,
          5.3667e-01,  8.9945e-01, -6.6198e-01,  6.2360e-01,  2.7276e-01,
          8.8737e-01, -9.3339e-01,  1.5435e+00, -1.0010e-01, -7.0054e-01,
         -1.2701e+00,  7.0417e-01, -1.2603e+00],
        [-1.4176e+00, -5.6853e-01, -4.7011e-01,  8.6958e-02,  1.0064e+00,
          7.6997e-01, -4.3721e-01, -1.6879e+00,  1.8929e+00,  9.0744e-01,
         -1.0529e+00, -8.6968e-01, -3.3842e-01,  2.1060e-01,  3.4340e-01,
          3.6989e-01, -1.8349e-02, -7.7136e-01,  9.9383e-02,  5.7914e-01,
          9.1266e-01,  6.1785e-01, -1.7764e-01],
        [-2.4128e-01, -6.0513e-01, -8.5285e-01, -1.1193e+00, -2.2175e-01,
         -8.9704e-01,  2.4373e+00, -6.8417e-01,  3.6260e-01,  1.5849e+00,
         -1.5271e+00,  4.1806e-01, -6.5407e-02, -1.2069e-01, -5.3221e-01,
          2.3037e-01,  1.1260e+00,  1.5921e+00, -7.1246e-01,  2.6244e-01,
         -2.79

In [13]:
x2.data

tensor([[-0.2323,  1.0593,  1.2485, -0.1623, -0.2469, -0.5072, -1.7480, -0.3025,
          0.3745, -0.2810,  1.3867, -1.6541,  0.7997,  0.2644,  0.3291,  0.5063,
          1.0440, -1.0739,  0.8740, -0.9745,  0.7733, -0.5117, -0.4838],
        [-2.1859,  1.0723,  0.2196, -0.8392, -0.0644, -0.3019, -0.0626, -0.6245,
          1.4117, -0.4612, -0.2708,  0.1783, -0.1824, -1.3890, -0.9597,  0.2676,
         -0.0511,  0.0432, -0.1359,  0.1001,  0.1300,  1.2376, -0.4733],
        [-0.4443, -0.5492, -1.1786,  0.4643, -1.1559, -1.0434,  0.7723,  0.1378,
          1.9992,  0.6532,  0.5285,  0.2646, -0.9390,  1.3218, -0.0312,  2.1481,
         -1.6313, -0.2006,  0.1719,  1.9600, -1.5404,  0.3777,  1.1420],
        [-0.5087, -1.2848,  0.0456,  0.1344, -0.2015, -1.5203, -0.3222, -0.3510,
         -1.4221,  2.0242,  0.8190, -1.4847, -0.4617, -0.1612, -0.2404, -2.4287,
         -0.4648,  1.0000,  0.4915, -1.5307,  1.0485, -0.5486,  0.1091],
        [-1.4989,  1.7670, -0.8108,  0.7540, -0.1301,  0.518

In [12]:
x3.data

tensor([[ 2.2685e+00, -1.6303e-01, -5.0734e-01,  2.5137e+00, -7.5152e-01,
          8.5568e-01, -1.5421e+00, -3.8581e-01,  1.3785e-01,  5.5638e-01,
         -5.4050e-01,  4.5599e-01, -1.3799e+00, -1.1523e+00,  1.1155e+00,
         -9.9180e-01,  1.2502e+00, -1.1242e+00,  1.8247e+00,  9.4384e-01,
         -6.4326e-01,  1.7480e-01, -5.1634e-01],
        [-3.2151e-01,  1.0037e+00,  2.2172e-01,  1.3565e+00,  1.2506e+00,
          8.4615e-01, -5.5272e-01,  2.2040e+00,  1.5468e+00, -4.5035e-01,
          6.0280e-01, -4.4529e-01,  2.0841e-01, -6.9552e-01,  1.6227e+00,
         -2.7073e+00, -2.9638e-01, -5.8008e-02,  9.4437e-01, -3.0339e-02,
         -3.3180e-01, -1.5900e+00, -1.0186e+00],
        [ 2.3928e-01,  2.9273e-01,  3.9009e-01, -4.7186e-01, -7.4832e-01,
          1.9697e-02,  1.0442e+00, -1.6350e+00, -7.9645e-01, -4.4036e-01,
         -7.8541e-01, -1.6881e+00,  4.4951e-01, -1.0841e+00,  4.4919e-01,
         -4.6066e-01, -5.7666e-01, -6.2507e-01, -5.5095e-01, -1.1946e+00,
         -3.28

# How do we set up a loss function and optimize it?

Choosing the right loss function increases the chances of model convergence.

We use another tensor as the update variable, pass
the tensors to the sample model, and compute the error or loss. Then we
compute the rate of change of the loss function with respect to each parameter,
which tells us how to adjust that parameter so the model moves toward convergence.

In the following example, t_c and t_u are two tensors. They can be
constructed from any NumPy array.


In [14]:
torch.__version__

'1.7.1'

In [15]:
torch.tensor

<function _VariableFunctionsClass.tensor>

The sample model is just a linear equation, and the loss function
is the mean squared error (MSE),
defined next. For now, this is just a simple linear equation
computation.


In [25]:
# height of people
t_c = torch.tensor([58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0])

# weight of people
t_u = torch.tensor([115.0, 117.0, 120.0, 123.0, 126.0, 129.0, 132.0, 135.0, 139.0, 142.0, 146.0, 150.0, 154.0, 159.0, 164.0])

# display t_u (this matches the output shown below)
t_u



tensor([115., 117., 120., 123., 126., 129., 132., 135., 139., 142., 146., 150.,
        154., 159., 164.])

Let’s now define the model. The w parameter is the weight tensor,
which is multiplied with the t_u tensor. The result is added to a constant
tensor, b. The loss function is written by hand here; an equivalent is also available in PyTorch as `nn.MSELoss`.

In the following example, t_u is the input tensor, t_p
is the predicted tensor, and t_c is the precomputed target tensor, against which the
predicted tensor is compared to calculate the loss.

The formula $$w * t_u + b$$ is the linear equation representation of a
tensor-based computation.



In [None]:
def model(t_u, w, b):
    return w * t_u + b
  
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()
  
w = torch.ones(1)
b = torch.zeros(1)

t_p = model(t_u, w, b)
t_p

In [17]:
loss = loss_fn(t_p, t_c)
loss

tensor(5259.7334)

The initial loss value is 5259.7334, which is high because of the
initial weights chosen. The error from the first iteration
is backpropagated to reduce the error in the second iteration, for which
the initial set of weights needs to be updated. Therefore, the rate of
change of the loss function with respect to each parameter is essential
for updating the weights in the estimation process.


In [18]:
delta = 0.1

loss_rate_of_change_w = (loss_fn(model(t_u, 
                                       w + delta, b), 
                                 t_c) - loss_fn(model(t_u, w - delta, b), 
                                                t_c)) / (2.0 * delta)

In [19]:
learning_rate = 1e-2

w = w - learning_rate * loss_rate_of_change_w
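
The rate of change above is a central finite difference. Because the MSE loss is quadratic in `w`, that estimate should agree with the exact derivative up to floating point error. The sketch below checks this against autograd; it redefines the data and functions so it can run on its own, and the comparison itself is my addition rather than part of the original notebook.

```python
import torch

t_c = torch.tensor([58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0,
                    66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0])
t_u = torch.tensor([115.0, 117.0, 120.0, 123.0, 126.0, 129.0, 132.0, 135.0,
                    139.0, 142.0, 146.0, 150.0, 154.0, 159.0, 164.0])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

delta = 0.1
w = torch.ones(1, requires_grad=True)
b = torch.zeros(1)

# Central finite difference, evaluated without tracking gradients
with torch.no_grad():
    numeric_grad = (loss_fn(model(t_u, w + delta, b), t_c)
                    - loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)

# The same derivative computed by autograd
loss_fn(model(t_u, w, b), t_c).backward()
autograd_grad = w.grad

print(numeric_grad, autograd_grad)
```

For losses that are not quadratic in the parameter, the finite difference is only an approximation and its quality depends on `delta`.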

Two values control this update: `delta`, the step used to estimate the
rate of change of the loss, and `learning_rate`, which scales how far the
weight moves. In an iterative procedure, if the loss still changes by more than a certain
threshold between two successive iterations, the weight tensor is updated again; otherwise the model
can be treated as converged. The preceding script shows the delta and
learning rate values. Currently, these are static values that the user has the
option to change.


In [21]:
loss_rate_of_change_b = (loss_fn(model(t_u, w, b + delta), t_c) - 
                         loss_fn(model(t_u, w, b - delta), t_c)) / (2.0 * delta)

b = b - learning_rate * loss_rate_of_change_b

b

tensor([1078.3999])
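
Repeating the two updates in a loop gives plain gradient descent. The sketch below is one way to write that loop with autograd supplying the gradients. The learning rate of `1e-5` and the 100 iterations are assumptions of mine, picked because the inputs are in the hundreds and larger steps overshoot.

```python
import torch

t_c = torch.tensor([58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0,
                    66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0])
t_u = torch.tensor([115.0, 117.0, 120.0, 123.0, 126.0, 129.0, 132.0, 135.0,
                    139.0, 142.0, 146.0, 150.0, 154.0, 159.0, 164.0])

# params[0] is w and params[1] is b, starting from the same values as above
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-5  # assumed value, small because t_u is large

losses = []
for step in range(100):
    t_p = params[0] * t_u + params[1]
    loss = ((t_p - t_c) ** 2).mean()
    losses.append(loss.item())
    loss.backward()
    # Gradient descent step, kept out of the autograd graph
    with torch.no_grad():
        params -= learning_rate * params.grad
    params.grad.zero_()

initial_loss, final_loss = losses[0], losses[-1]
print(initial_loss, final_loss, params.detach())
```

Rescaling `t_u` before training is another common way to make larger learning rates usable, but that is outside what the notebook does.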

The same mean squared error loss can be applied to a two-dimensional
tensor, here with a tensor size of 10 x 5.
Let’s look at the following example. The `MSELoss` class is within the
neural network module of PyTorch, `torch.nn`.


In [23]:
from torch import nn
loss = nn.MSELoss()
input = torch.randn(10, 5, requires_grad=True)
target = torch.randn(10, 5)
output = loss(input, target)
output.backward()

When we look at the gradient function that is used for
backpropagation, it is shown as `MseLossBackward`.


In [24]:
output.grad_fn

<MseLossBackward at 0x7f5df74a1ac0>
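
As a quick cross-check, the built-in `nn.MSELoss` should give the same value as the hand-written mean of squared differences used earlier, and its gradient with respect to the input should be `2 * (input - target) / N`, where `N` is the number of elements. The sketch below verifies both on random data; the variable names and the seed are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable
inputs = torch.randn(10, 5, requires_grad=True)
target = torch.randn(10, 5)

# Built-in loss versus the hand-written formula
builtin_loss = nn.MSELoss()(inputs, target)
manual_loss = ((inputs - target) ** 2).mean()

# Gradient from autograd versus the analytic expression
builtin_loss.backward()
expected_grad = 2.0 * (inputs.detach() - target) / inputs.numel()

print(builtin_loss.item(), manual_loss.item())
print(torch.allclose(inputs.grad, expected_grad))
```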