1. How does unsqueeze help us to solve certain broadcasting problems?



The unsqueeze() method in PyTorch inserts a dimension of size 1 at a specified position, increasing the number of dimensions of a tensor by one without changing its data. This is particularly useful in broadcasting, where element-wise operations between tensors of different shapes require the shapes to line up.

For example, let's say we have a 2D tensor A with shape (3, 4) and a 1D tensor B with shape (3,) that holds one value per row of A. Broadcasting aligns shapes starting from the trailing dimension, so A * B fails: the trailing sizes 4 and 3 do not match. Calling B.unsqueeze(1) gives B the shape (3, 1), which broadcasts against (3, 4) so that each row of A is multiplied by its own value. (A vector of shape (4,) would already broadcast against A without unsqueeze(), since it is treated as shape (1, 4).)
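As a minimal sketch of a case where unsqueeze() is actually required, the snippet below scales each row of a (3, 4) matrix by its own factor. The tensor values and the names direct_ok and scaled are illustrative assumptions, not part of the original question.

```python
import torch

A = torch.arange(12.0).reshape(3, 4)   # shape (3, 4)
B = torch.tensor([1.0, 10.0, 100.0])   # shape (3,): one scale factor per row

# Trailing dimensions are 4 and 3, so direct broadcasting is rejected.
try:
    A * B
    direct_ok = True
except RuntimeError:
    direct_ok = False

# Insert a size-1 dimension at position 1: (3,) -> (3, 1).
scaled = A * B.unsqueeze(1)

print(direct_ok)
print(scaled.shape)
print(scaled)
```

Each row of scaled is the matching row of A multiplied by one entry of B.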

2. How can we use indexing to do the same operation as unsqueeze?


We can get the same result as unsqueeze() by indexing with None, which inserts a dimension of size 1 at that position. (np.newaxis in NumPy is simply an alias for None.)

For example, let's say we have a 1D tensor A with shape (3,). Then A[None, :] has shape (1, 3), just like A.unsqueeze(0), and A[:, None] has shape (3, 1), just like A.unsqueeze(1).
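For a 1D tensor of shape (3,), the following sketch compares indexing with None against unsqueeze(). The names row and col are chosen here for illustration.

```python
import torch

A = torch.tensor([1, 2, 3])   # shape (3,)

row = A[None, :]   # new leading dimension: shape (1, 3)
col = A[:, None]   # new trailing dimension: shape (3, 1)

print(row.shape, col.shape)
print(torch.equal(row, A.unsqueeze(0)))
print(torch.equal(col, A.unsqueeze(1)))
```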

3. How do we show the actual contents of the memory used for a tensor?



To show the actual contents of the memory used for a tensor in PyTorch, we can call its storage() method, which returns the underlying one-dimensional buffer. The tensor's shape and strides only describe how to index into that buffer, so a view such as a transpose shares the same storage as the original tensor.
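One way to look at the underlying buffer is sketched below, assuming the Tensor.storage() method is available. Newer PyTorch versions may emit a deprecation warning for it and offer untyped_storage() as a byte-level alternative. The example tensor is illustrative.

```python
import torch

t = torch.tensor([[1, 2, 3], [4, 5, 6]])
tt = t.T  # transposed view with shape (3, 2)

# Both tensors are backed by the same flat buffer.
print(t.storage().tolist())
print(tt.storage().tolist())

# Only the strides differ between the two views.
print(t.stride(), tt.stride())
print(t.data_ptr() == tt.data_ptr())
```

The transpose changes how the buffer is indexed, not what the buffer contains.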

4. When adding a vector of size 3 to a matrix of size 3×3, are the elements of the vector added
to each row or each column of the matrix? (Be sure to check your answer by running this
code in a notebook.)


In [2]:
pip install torch



In [3]:
import torch

# Create a matrix and a vector
A = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = torch.tensor([10, 20, 30])

# Add the vector to the matrix
C = A + B.unsqueeze(0)

print(C)


tensor([[11, 22, 33],
        [14, 25, 36],
        [17, 28, 39]])


In this example, B.unsqueeze(0) adds a new dimension of size 1 at position 0 of the vector B, so that it has shape 1×3. When we add it to the matrix A, PyTorch broadcasts it across the rows of A, so the vector is added to each row: 10 is added to every element of the first column, 20 to the second, and 30 to the third. No temporary 3×3 copy of the vector is materialized; only the result C, a new 3×3 matrix, is allocated. The unsqueeze(0) call is optional here, because broadcasting already treats a vector of shape (3,) as if it had shape 1×3, so A + B gives the same result.

Note that if we wanted to add the vector to each column of the matrix instead, we would give it shape 3×1 with B.unsqueeze(1) (or B[:, None]) before adding. Transposing the matrix, adding the vector to each row of the transposed matrix, and transposing the result back gives the same answer but is more roundabout.
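The column-wise case can be sketched as follows, reusing A and B from the cell above. It checks that unsqueeze(1) and the transpose route agree; the names by_column and via_transpose are illustrative.

```python
import torch

A = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = torch.tensor([10, 20, 30])

# Shape (3, 1): B[i] is added to every element of row i,
# i.e. the vector is added to each column.
by_column = A + B.unsqueeze(1)

# Same result by transposing, adding to each row, and transposing back.
via_transpose = (A.T + B).T

print(by_column)
print(torch.equal(by_column, via_transpose))
```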

5. Do broadcasting and expand_as result in increased memory use? Why or why not?


Broadcasting and `expand_as` do not by themselves increase memory use, because neither one copies the data being expanded.

Broadcasting performs operations between tensors of different shapes by treating the smaller tensor as if it were replicated to match the shape of the larger tensor. The replication is only virtual: the operation reads the same elements repeatedly rather than building an expanded copy.

Similarly, `expand_as` returns a view of the original tensor with the shape of the specified tensor. Expanded dimensions are given a stride of 0, so every position along them refers to the same memory. Because the result is a view, it shares its data with the original tensor, and several elements of the view map to a single memory location, which is why in-place writes to an expanded tensor should be avoided.

However, the result of an operation on broadcast tensors is a new tensor of the full broadcast shape, and it needs as much memory as any other tensor of that size. Likewise, calling `contiguous()` on an expanded view materializes a full copy. The saving comes only from not duplicating the smaller input.
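A small sketch of how this can be checked: the expanded view reports a stride of 0 along the expanded dimension and points at the same memory as the original, while `contiguous()` produces a separate copy. The tensors and variable names here are illustrative assumptions.

```python
import torch

b = torch.tensor([10.0, 20.0, 30.0])   # 3 elements
a = torch.zeros(3, 3)

e = b.expand_as(a)                      # view with shape (3, 3)
print(e.shape, e.stride())              # stride 0 along the expanded dim
print(e.data_ptr() == b.data_ptr())     # same memory as b

c = e.contiguous()                      # materializes all 9 elements
print(c.stride(), c.data_ptr() == b.data_ptr())

s = a + b                               # the result is a new full-size tensor
print(s.shape)
```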

6. Implement matmul using Einstein summation.


In [4]:
import torch

# Create two matrices
A = torch.tensor([[1, 2, 3], [4, 5, 6]])
B = torch.tensor([[7, 8], [9, 10], [11, 12]])

# Compute matrix product using Einstein summation
C = torch.einsum('ij,jk->ik', A, B)

print(C)


tensor([[ 58,  64],
        [139, 154]])


7. What does a repeated index letter represent on the lefthand side of einsum?


When an index letter is repeated on the left-hand side of an `einsum` string, the dimensions labeled with that letter must have the same size, and elements are paired up along them: the same index value is used everywhere the letter appears. If the letter does not also appear on the right-hand side, the resulting products are summed over that index.

For example, in `'ij,jk->ik'` the letter `j` labels the columns of the first matrix and the rows of the second. It is absent from the output, so the products are summed over `j`, which gives matrix multiplication. In `'i,i->i'` the letter `i` is kept in the output, so the result is the element-wise product with no summation.

A letter can also be repeated within a single operand. The string `'ii->i'` takes a 2D input tensor with shape `(n, n)` and returns its diagonal as a 1D tensor with shape `(n,)`, while `'ii->'` sums those diagonal elements to give the trace. By contrast, `'ijk->jik'` contains no repeated letter: it only permutes the dimensions of a 3D tensor with shape `(n, m, p)` into shape `(m, n, p)`.

Note that in the output string of `einsum`, each index letter may appear at most once. Letters that appear on the left-hand side but not in the output are summed over.
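The following sketch shows what a repeated letter does in a few small cases, both within one operand and across two. The input values are illustrative.

```python
import torch

M = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
v = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([4.0, 5.0, 6.0])

diag = torch.einsum('ii->i', M)            # repeated in one operand, kept: diagonal
trace = torch.einsum('ii->', M)            # repeated in one operand, dropped: trace
dot = torch.einsum('i,i->', v, w)          # repeated across operands, dropped: dot product
elementwise = torch.einsum('i,i->i', v, w) # repeated across operands, kept: no sum

print(diag, trace, dot, elementwise)
```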

8. What are the three rules of Einstein summation notation? Why?


The three rules of Einstein summation notation, as commonly stated for `einsum` strings, are:

1. Repeated indices on the left-hand side are implicitly summed over if they do not appear on the right-hand side.

2. Each index can appear at most twice on the left-hand side.

3. The unrepeated indices on the left-hand side must appear on the right-hand side.

The convention was introduced by Albert Einstein to simplify the notation of tensor calculus. It allows complex tensor operations to be expressed as short algebraic expressions without writing out summation signs, which makes tensor manipulations easier to read and to carry out.

The first rule states that a repeated index is implicitly summed over. If an index appears twice on the left-hand side and is absent from the output, the products of the paired elements are summed over that index. For two matrices sharing an index, this is the dot product between rows of the first and columns of the second, which is matrix multiplication.

The second rule keeps the notation unambiguous: an index that appears twice identifies exactly one pair of dimensions to multiply along. (In the classical form of the convention used in physics, the two occurrences are one subscript and one superscript.)

The third rule ensures that every remaining (free) index is accounted for in the result, so the shape of the output can be read directly from the right-hand side. Note that `einsum` implementations are more permissive than these rules in some cases.
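To make the implicit summation concrete, the sketch below writes `'ij,jk->ik'` as explicit loops and compares the result with `torch.einsum`. The loop version is for illustration only and is far slower than the built-in call. The matrices are the ones from question 6, converted to floats.

```python
import torch

A = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
B = torch.tensor([[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]])

n, m = A.shape
_, p = B.shape

# i and k are free indices: they appear in the output and index the result.
# j is repeated on the left and absent on the right, so it is summed over.
C_loops = torch.zeros(n, p)
for i in range(n):
    for k in range(p):
        for j in range(m):
            C_loops[i, k] += A[i, j] * B[j, k]

C_einsum = torch.einsum('ij,jk->ik', A, B)

print(C_loops)
print(torch.equal(C_loops, C_einsum))
```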

9. What are the forward pass and backward pass of a neural network?


The forward pass and backward pass are the two main computations in training a neural network with backpropagation. Here's a brief explanation of each step:

1. Forward pass: During the forward pass, the input data is passed through the neural network layer by layer, and the output of each layer is computed. The output of the final layer is compared to the ground truth labels to compute the loss, which measures how well the network is performing on the given task.

2. Backward pass: During the backward pass, the gradients of the loss with respect to the parameters of the network are computed using the chain rule of differentiation, working backwards from the loss through each layer. This computation is known as backpropagation. The gradients are then used by an optimization algorithm such as stochastic gradient descent to update the parameters.

The backward pass is crucial for training a neural network, as it provides the information needed to adjust the parameters so that the loss decreases. The forward pass, backward pass, and parameter update are repeated for many batches, typically over several complete passes through the training data (or "epochs"), until the network converges to a satisfactory solution.

It's worth noting that with automatic differentiation libraries such as PyTorch or TensorFlow, only the forward pass needs to be written by hand. The library records the operations performed during the forward pass and uses that record to compute the gradients automatically in the backward pass, without the need for a manual implementation.
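A minimal sketch of one training step with PyTorch's autograd, using a single-weight model invented here for illustration (the values of x, y, w, and lr are assumptions):

```python
import torch

x = torch.tensor(3.0)                      # input
y = torch.tensor(10.0)                     # target
w = torch.tensor(2.0, requires_grad=True)  # the only parameter

# Forward pass: compute the prediction and the loss.
pred = w * x
loss = (pred - y) ** 2

# Backward pass: autograd applies the chain rule to fill in w.grad.
loss.backward()
grad = w.grad.item()   # d(loss)/dw = 2 * (pred - y) * x

# Parameter update: a single plain SGD step, done outside autograd.
lr = 0.01
with torch.no_grad():
    w -= lr * w.grad
w.grad.zero_()

print(loss.item(), grad, w.item())
```

The update step is kept separate from `backward()` to show that computing gradients and applying them are distinct operations.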


10. Why do we need to store some of the activations calculated for intermediate layers in the
forward pass?


Storing activations calculated for intermediate layers in the forward pass is needed mainly for backpropagation, and it is useful for a few other reasons as well:

1. Backpropagation: During the backward pass, we need to compute the gradients of the loss function with respect to the parameters of the network. To do this, we propagate the gradients backwards through the layers of the network using the chain rule of differentiation. The local derivative at each layer depends on values from the forward pass. For example, the gradient of a linear layer's weights is computed from that layer's input activations, and the gradient through a ReLU depends on which of its inputs were positive.

2. Reuse of activations: In some cases, we may need to reuse the activations of intermediate layers for other purposes. For example, we might want to extract features from the activations of a pre-trained neural network for use in a different model or application.

3. Debugging: Storing the activations of intermediate layers can be useful for debugging and visualizing the behavior of the network. By inspecting the activations at different layers, we can gain insight into how the network is processing the input data and how the features are being transformed and combined.

4. Efficient computation: Keeping the activations from the forward pass avoids recomputing them during the backward pass. The cost is memory: techniques such as gradient checkpointing deliberately discard some activations and recompute them later, trading extra computation for lower memory use.

Overall, storing the activations of intermediate layers is what makes the backward pass possible without repeating the forward computation, and it also gives us insight into the behavior of the network.
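The dependence of the backward pass on stored activations can be sketched with a tiny two-weight network, where the gradients are derived by hand from the saved values z and a and then compared with autograd. The network and all values are illustrative assumptions.

```python
import torch

x = torch.tensor(2.0)
y = torch.tensor(0.0)
w1 = torch.tensor(1.5, requires_grad=True)
w2 = torch.tensor(-3.0, requires_grad=True)

# Forward pass, keeping the intermediate values z and a.
z = w1 * x            # pre-activation
a = torch.relu(z)     # activation
out = w2 * a
loss = (out - y) ** 2

# Autograd's backward pass, for comparison.
loss.backward()

# Manual backward pass: each step reuses a value saved in the forward pass.
with torch.no_grad():
    d_out = 2 * (out - y)
    d_w2 = d_out * a                 # needs the stored activation a
    d_a = d_out * w2
    d_z = d_a * (z > 0).float()      # ReLU gradient needs the stored z
    d_w1 = d_z * x                   # needs the layer input x

print(d_w1.item(), w1.grad.item())
print(d_w2.item(), w2.grad.item())
```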

11. What is the downside of having activations with a standard deviation too far away from 1?


12. How can weight initialization help avoid this problem?