1.How does unsqueeze help us to solve certain broadcasting problems?

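Broadcasting compares the shapes of two tensors starting from the trailing dimension; each pair of sizes must either be equal or one of them must be 1. A vector of size 3 is therefore treated as a 1×3 row when it is combined with a 3×3 matrix. If we want to broadcast along the other dimension, we have to change the shape of the vector to 3×1. This is done with the unsqueeze method in PyTorch, which inserts a dimension of size 1 at the given position, as the example below shows.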

In [40]:
import torch
from torch import tensor
c = tensor([16.,25,36])
m = tensor([[1., 2.5, 3.6], [4,5,6], [36,43,108]])
c = c.unsqueeze(1)
m.shape,c.shape

(torch.Size([3, 3]), torch.Size([3, 1]))

In [41]:
#2.How can we use indexing to do the same operation as unsqueeze?

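#The unsqueeze command can be replaced by None indexing: c[None,:] matches c.unsqueeze(0) and c[:,None] matches c.unsqueeze(1).
#Note that c already has shape [3, 1] here, because it was unsqueezed in the previous cell.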

c.shape, c[None,:].shape,c[:,None].shape

(torch.Size([3, 1]), torch.Size([1, 3, 1]), torch.Size([3, 1, 1]))
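
Because the cell above starts from a vector that was already unsqueezed, the equivalence may be easier to see on a fresh 1-D vector. The following is a minimal sketch; the name `v` and its values are chosen only for illustration.

```python
import torch

# A fresh 1-D vector of size 3 (values are illustrative only).
v = torch.tensor([16., 25., 36.])

# Inserting a leading dimension: both forms give shape [1, 3].
row_a, row_b = v.unsqueeze(0), v[None, :]

# Inserting a trailing dimension: both forms give shape [3, 1].
col_a, col_b = v.unsqueeze(1), v[:, None]

same_row = torch.equal(row_a, row_b)
same_col = torch.equal(col_a, col_b)
print(row_b.shape, col_b.shape, same_row, same_col)
```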

3.How do we show the actual contents of the memory used for a tensor?

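In PyTorch, calling the storage() method on a tensor shows the actual contents of the memory it uses. The data is stored in a single array that is laid out as one contiguous block in memory. More concretely, a 3x3x3 tensor is stored simply as a single array of 27 values, one after the other, and the tensor's strides tell PyTorch how many elements to skip to move one step along each dimension.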
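
As a rough illustration of that flat layout, the sketch below builds a 3x3x3 tensor and inspects its underlying storage and strides. The tensor `t` is an illustrative assumption, and recent PyTorch versions may emit a deprecation warning when `storage()` is called.

```python
import torch

# A 3x3x3 tensor holding the values 0..26 (illustrative data).
t = torch.arange(27).reshape(3, 3, 3)

# The underlying memory is one flat array of 27 values.
flat = t.storage().tolist()

# Strides: how many elements to skip to advance one step in each dimension.
strides = t.stride()

# Element t[1, 2, 0] lives at flat index 1*9 + 2*3 + 0*1.
idx = 1 * strides[0] + 2 * strides[1] + 0 * strides[2]
print(flat)
print(strides, flat[idx], t[1, 2, 0].item())
```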

4.When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added to each row or each column of the matrix? (Be sure to check your answer by running this code in a notebook.)

In [10]:
import numpy as np

vector_row = np.array([1, 2, 3])  # a vector of size 3

matrix = np.array([[1, 2, 3],     # a 3x3 matrix
                   [4, 5, 6],
                   [7, 8, 9]])

# The vector is broadcast across the rows: it is added to every row of the matrix.
c = np.add(vector_row, matrix)
print(c)
print(type(c))

[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]
<class 'numpy.ndarray'>
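
For contrast, here is a small sketch of how the same vector could be added to each column instead, by giving it a trailing dimension of size 1. The variable names are illustrative.

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Default broadcasting: the vector acts as shape (1, 3) and is added to every row.
by_row = matrix + vector

# Reshaped to (3, 1): element i of the vector is added to the whole of row i,
# which means the vector is added to every column.
by_col = matrix + vector[:, None]

print(by_row)
print(by_col)
```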


5.Do broadcasting and expand_as result in increased memory use? Why or why not?

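Broadcasting and expand_as do not by themselves increase memory usage, because neither makes a full-size copy of the smaller tensor: the expanded tensor is a view that reuses the same stored values. We do, however, need to store the result of the operation.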
Here is an example.

import torch
from torch import tensor
x = torch.randn(30000, 1, 10).cuda()
y = torch.randn(20, 10).cuda()

print('before sub')
print('mem expected in MB: ', (x.nelement() + y.nelement()) * 4 / 1024**2)
print('mem allocated in MB: ', torch.cuda.memory_allocated() / 1024**2)
print('max mem allocated in MB: ', torch.cuda.max_memory_allocated() / 1024**2)

res = x - y
print('after sub')
print('mem expected in MB: ', (x.nelement() + y.nelement() + res.nelement()) * 4 / 1024**2)
print('mem allocated in MB: ', torch.cuda.memory_allocated() / 1024**2)
print('max mem allocated in MB: ', torch.cuda.max_memory_allocated() / 1024**2)

before sub
mem expected in MB:  1.145172119140625
mem allocated in MB:  1.1455078125
max mem allocated in MB:  1.1455078125
after sub
mem expected in MB:  24.033355712890625
mem allocated in MB:  24.03369140625
max mem allocated in MB:  24.03369140625

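From the above output, we can see that the expected memory matches the allocated and maximum allocated memory closely. The only extra memory after the subtraction is the result itself (30000 × 20 × 10 float32 values, about 22.9 MB); no expanded copies of x or y were created.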

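expand_as also returns a view rather than a copy, so it does not allocate new memory for the expanded tensor. Memory growth that is sometimes attributed to expand_as typically comes from the way the expanded tensor is used afterwards, for example an operation that materializes it at full size, and memory will then keep increasing if such tensors are kept around. It should be noted that even without using expand_as, memory usage can increase for a while, but it will eventually stabilize at a steady level.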
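
The following sketch shows one way to check that expand_as shares memory with the original tensor, and that a copy only appears when the expanded tensor is explicitly materialized. The tensors `c` and `m` are illustrative assumptions and the example runs on the CPU.

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.zeros(3, 3)

# expand_as returns a view with the shape of m.
t = c.expand_as(m)

# The expanded dimension has stride 0: every row points at the same 3 values.
shares_memory = t.data_ptr() == c.data_ptr()
print(t.shape, t.stride(), shares_memory)

# contiguous() materializes the view as a real 3x3 block, which does cost memory.
full = t.contiguous()
print(full.stride(), full.data_ptr() == c.data_ptr())
```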

In [5]:
#6.Implement matmul using Einstein summation.

import torch

a = torch.arange(9).reshape(3,3)
b = torch.arange(15).reshape(3,5)
torch.einsum('ik,kj->ij', a, b)  # k is repeated on the left and absent on the right, so it is summed over


tensor([[ 25,  28,  31,  34,  37],
        [ 70,  82,  94, 106, 118],
        [115, 136, 157, 178, 199]])

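7.What does a repeated index letter represent on the left-hand side of einsum?

The left-hand side lists the dimensions of the operands, separated by commas. Here we have two tensors that each have two dimensions (i,k and k,j). The letter k appears in both operands: a repeated index marks the dimension along which the operands are multiplied element by element, and because k is absent from the right-hand side, those products are summed over k.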
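
To make the role of the repeated index concrete, the sketch below keeps k on the right-hand side first (so nothing is summed) and then sums over it by hand. The float tensors mirror the shapes used in the matmul cell above; the variable names are illustrative.

```python
import torch

a = torch.arange(9.).reshape(3, 3)
b = torch.arange(15.).reshape(3, 5)

# Keeping k in the output gives every product a[i, k] * b[k, j], with no summation.
products = torch.einsum('ik,kj->ikj', a, b)   # shape [3, 3, 5]

# Summing those products over k by hand ...
summed = products.sum(dim=1)

# ... gives the same result as dropping k from the right-hand side.
matmul = torch.einsum('ik,kj->ij', a, b)

print(products.shape)
print(torch.allclose(summed, matmul), torch.allclose(matmul, a @ b))
```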

8.What are the three rules of Einstein summation notation? Why?

There are essentially three rules of Einstein summation notation, namely:

Repeated indices are implicitly summed over.
Each index can appear at most twice in any term.
Each term must contain identical non-repeated indices.

Einstein summation is a very practical way of expressing operations that involve indexing and sums of products. In PyTorch, the einsum function is often the fastest way to write custom operations without diving into C++ and CUDA.

9.What are the forward pass and backward pass of a neural network?

Forward propagation, or the forward pass, moves from the input layer (left) to the output layer (right) of the neural network and computes the model's output for a given input. Moving from right to left, i.e. backward from the output layer to the input layer, is called backward propagation or the backward pass; it computes the gradients of the loss with respect to the parameters.

10.Why do we need to store some of the activations calculated for intermediate layers in the forward pass?

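The activations are the outputs that each layer computes during the forward pass (the activation function is what decides how strongly a neuron is activated and what it passes on to the next layer). The backward pass applies the chain rule layer by layer, and the gradient of each layer depends on the values that flowed through it in the forward pass, for example the input to a linear layer or the sign of the input to a ReLU. Storing those intermediate activations lets the backward pass reuse them instead of recomputing them.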
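
A tiny hand-written example may help show where the stored values are used. This is a sketch under simple assumptions: a scalar two-layer network with a ReLU and a squared-error loss, with hypothetical helper names `forward` and `backward`.

```python
def relu(z):
    return max(z, 0.0)

def forward(x, w1, w2):
    """Forward pass: compute the output and keep the intermediate values."""
    h = w1 * x          # first layer
    a = relu(h)         # activation
    out = w2 * a        # second layer
    cache = {"x": x, "h": h, "a": a}
    return out, cache

def backward(out, y, w2, cache):
    """Backward pass for loss = (out - y)**2, using the cached activations."""
    d_out = 2.0 * (out - y)
    d_w2 = d_out * cache["a"]                      # needs the stored activation a
    d_a = d_out * w2
    d_h = d_a * (1.0 if cache["h"] > 0 else 0.0)   # needs the stored pre-activation h
    d_w1 = d_h * cache["x"]                        # needs the stored input x
    return d_w1, d_w2

x, y, w1, w2 = 2.0, 1.0, 0.5, 3.0
out, cache = forward(x, w1, w2)
d_w1, d_w2 = backward(out, y, w2, cache)
print(out, d_w1, d_w2)
```

Without the cache, each gradient would require rerunning part of the forward pass.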

11.What is the downside of having activations with a standard deviation too far away from 1?

A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. A low standard deviation means the data are clustered around the mean, and a high standard deviation means the data are more spread out. In a deep network each layer rescales its input, so if the activations of each layer have a standard deviation well above 1, they grow layer after layer until they overflow, and if it is well below 1, they shrink toward zero. In both cases the gradients follow the same pattern and the network cannot train effectively.

12.How can weight initialization help avoid this problem?

The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network. If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, if it is even able to do so at all.
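
The effect of the weight scale can be sketched with a stack of purely linear layers. This is an illustrative simulation, not a real network: the width of 100, the depth of 50, and the helper name `final_std` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_std(weight_scale, n_layers=50, width=100):
    """Push random inputs through n_layers random linear layers and
    return the standard deviation of the final activations."""
    x = rng.standard_normal((200, width))
    for _ in range(n_layers):
        w = rng.standard_normal((width, width)) * weight_scale
        x = x @ w
    return x.std()

too_big = final_std(1.0)      # activations explode
too_small = final_std(0.01)   # activations vanish
balanced = final_std(0.1)     # 1 / sqrt(width) keeps the scale roughly constant

print(too_big, too_small, balanced)
```

With a scale of 1/sqrt(width), each layer leaves the standard deviation of its input roughly unchanged, which is the property the initialization schemes aim for.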

There are different weight initialization techniques, such as zero initialization (all weights are set to 0) and random initialization (weights are drawn at random).
Best Practices for Weight Initialization.

Use ReLU or leaky ReLU as the activation function, as both are relatively robust to the vanishing and exploding gradient problems (especially for networks that are not too deep). Leaky ReLU units never have a zero gradient, so they never die and training continues.

Use Heuristics for weight initialization: For deep neural networks, we can use any of the following heuristics to initialize the weights depending on the chosen non-linear activation function. While these heuristics do not completely solve the exploding or vanishing gradients problems, they help to reduce it to a great extent. 
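
The text does not say which heuristics it has in mind. One widely used choice for ReLU networks is He (Kaiming) initialization, which draws weights with standard deviation sqrt(2 / fan_in). The sketch below assumes that heuristic and compares it with a plain 1 / sqrt(fan_in) scale; the layer sizes and the helper name `rms_after_relu_layers` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_after_relu_layers(std_fn, n_layers=20, width=256):
    """Run random inputs through n_layers of (linear -> ReLU) and
    return the root-mean-square of the final activations."""
    x = rng.standard_normal((512, width))
    for _ in range(n_layers):
        w = rng.standard_normal((width, width)) * std_fn(width)
        x = np.maximum(x @ w, 0.0)   # ReLU
    return np.sqrt((x ** 2).mean())

# He initialization (assumed heuristic): std = sqrt(2 / fan_in).
he_rms = rms_after_relu_layers(lambda fan_in: np.sqrt(2.0 / fan_in))

# Without the factor of 2, ReLU halves the signal power at every layer.
plain_rms = rms_after_relu_layers(lambda fan_in: np.sqrt(1.0 / fan_in))

print(he_rms, plain_rms)
```

The factor of 2 compensates for ReLU zeroing out roughly half of the pre-activations.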

Gradient Clipping: This is another way of dealing with the exploding gradient problem. In this technique, we set a threshold value, and if our chosen function of the gradient (typically its norm) is larger than this threshold, we rescale or clamp the gradient so that it no longer exceeds the threshold.
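
A minimal sketch of clipping by norm is shown below. The helper `clip_by_norm` is hypothetical and written only for illustration; in PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; leave it unchanged otherwise."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# Norm 50 exceeds the threshold, so the gradient is scaled down to norm 5.
big = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)

# Norm 0.5 is below the threshold, so the gradient is returned as is.
small = clip_by_norm(np.array([0.3, 0.4]), max_norm=5.0)

print(big, small)
```

Rescaling keeps the direction of the gradient and only limits its size.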