1. **Prove that the transpose of the transpose of a matrix is the matrix itself: $(\mathbf{A}^\top)^\top = \mathbf{A}$.**

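Let A have size (m,n). By definition, the (i,j)-entry of $A^T$ is the (j,i)-entry of A. So the (i,j)-entry of $(A^T)^T$ is the (j,i)-entry of $A^T$, which is the (i,j)-entry of A. Both matrices have size (m,n), hence $(A^T)^T=A$. See also this [link](https://mathinstructor.net/2012/04/we-have-matrix-a-how-to-prove-that-transpose-of-a-transpose-is-equal-to-matrix-a-i-e-att-a/).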

2. **Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that sum and transposition commute:** $\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$.

Suppose A and B both have size (m,n). Then $A^T+B^T$ and $(A+B)^T$ both have size (n,m). The (i,j)-entry of $A^T+B^T$ is the sum of the (i,j)-entries of $A^T$ and $B^T$, which are the (j,i)-entries of A and B, respectively. Thus the (i,j)-entry of $A^T+B^T$ is the (j,i)-entry of $A+B$, which is equal to the (i,j)-entry of the transpose $(A+B)^T$.

3. **Given any square matrix $\mathbf{A}$, is $\mathbf{A} + \mathbf{A}^\top$ always symmetric? Can you prove the result by using only the results of the previous two exercises?**

$(A+A^T)^T=A^T+(A^T)^T=A^T+A=A+A^T$. The first equality uses exercise 2 and the second uses exercise 1. So $A+A^T$ is always symmetric.
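The three identities above can also be checked numerically. This is only a sanity check on one random example, not a proof; the matrix size and the seed are arbitrary choices.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
A = torch.randn(4, 4)
B = torch.randn(4, 4)

# Exercise 1: transposing twice returns the original matrix.
double_transpose_ok = torch.equal(A.T.T, A)

# Exercise 2: transposition commutes with addition.
sum_transpose_ok = torch.equal(A.T + B.T, (A + B).T)

# Exercise 3: A + A^T equals its own transpose, i.e. it is symmetric.
S = A + A.T
symmetric_ok = torch.equal(S, S.T)

print(double_transpose_ok, sum_transpose_ok, symmetric_ok)
```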

4. **We defined the tensor `X` of shape (2, 3, 4) in this section. What is the output of `len(X)`? Write your answer without implementing any code, then check your answer using code.**

In [1]:
import torch
X = torch.arange(24).reshape(2, 3, 4)
len(X)

2

5. **For a tensor `X` of arbitrary shape, does `len(X)` always correspond to the length of a certain axis of `X`? What is that axis?**

Yes. `len(X)` is the size of the first axis (axis 0), i.e., `X.shape[0]`.
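A quick way to test this claim is to try a few shapes. The shapes below are arbitrary examples. A 0-d tensor is the one case where `len` is not defined, since it has no axis 0.

```python
import torch

# len(X) should match the size of axis 0 for any tensor with at least one axis.
shapes = [(5,), (2, 3), (2, 3, 4), (7, 1, 1, 2)]
lengths = [len(torch.zeros(shape)) for shape in shapes]
print(lengths)

# A 0-d tensor (a scalar) has no axis 0, so len() raises a TypeError.
try:
    len(torch.tensor(3.0))
    scalar_has_len = True
except TypeError:
    scalar_has_len = False
print(scalar_has_len)
```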

6. **Run `A / A.sum(axis=1)` and see what happens. Can you analyze the results?**

In [6]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
A / A.sum(axis=1)

RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 1

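The shape of `A` is (2, 3), while the shape of `A.sum(axis=1)` is (2,). Broadcasting aligns shapes starting from the last axis, so the 3 columns of `A` are matched against the 2 entries of the sum. The sizes differ and neither is 1, so the shapes cannot be broadcast and the division fails.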
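If the goal is to divide each row by its own sum, one possible fix is to keep the reduced axis so that the sum has shape (2, 1). A minimal sketch using the same `A`:

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# keepdims=True keeps the reduced axis with size 1, giving shape (2, 1),
# which broadcasts against (2, 3).
row_sums = A.sum(axis=1, keepdims=True)
normalized = A / row_sums

print(row_sums.shape)
print(normalized.sum(axis=1))  # each row of `normalized` sums to about 1
```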

7. **When traveling between two points in downtown Manhattan, what is the distance that you need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can you travel diagonally?**

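Describe each point by its (avenue, street) coordinates. The distance between $(a_1, s_1)$ and $(a_2, s_2)$ is the number of avenues crossed plus the number of streets crossed, $|a_1 - a_2| + |s_1 - s_2|$, which is the $\ell_1$ (Manhattan) distance. You can't travel diagonally, because you have to follow the grid.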
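As a small illustration, the grid distance can be computed as below. The function name `manhattan_distance` and the two intersections are made up for this example.

```python
def manhattan_distance(p, q):
    """Return the l1 distance between two points given as (avenue, street) pairs."""
    return sum(abs(a - b) for a, b in zip(p, q))

start = (3, 14)  # hypothetical intersection: 3rd Avenue and 14th Street
end = (7, 42)    # hypothetical intersection: 7th Avenue and 42nd Street

# 4 avenues crossed + 28 streets crossed = 32
print(manhattan_distance(start, end))
```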

8. **Consider a tensor of shape (2, 3, 4). What are the shapes of the summation outputs along axes 0, 1, and 2?**

In [10]:
X = torch.arange(24).reshape(2, 3, 4)
X.sum(axis=0).shape, X.sum(axis=1).shape, X.sum(axis=2).shape

(torch.Size([3, 4]), torch.Size([2, 4]), torch.Size([2, 3]))

Summing along an axis removes that axis from the shape and keeps the other two.

9. **Feed a tensor with three or more axes to the `linalg.norm` function and observe its output. What does this function compute for tensors of arbitrary shape?**

In [15]:
import numpy as np
X = torch.ones((1,3,3))
np.linalg.norm(X)

3.0

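The `linalg.norm` function can compute the norm of a matrix or a vector. When given no extra parameters, the result for a matrix is the Frobenius norm: $(\sum_{i,j} a_{i,j}^2)^{\frac{1}{2}}$. For a tensor with three or more axes, it treats all entries as one long vector and returns its $\ell_2$ norm, i.e., the square root of the sum of squares of all entries. Here the tensor holds nine ones, so the result is $\sqrt{9} = 3$.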
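The same behavior can be checked with `torch.linalg.norm` by comparing against a manual computation. A minimal sketch, with an arbitrary example tensor:

```python
import torch

X = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)

# With default arguments, the tensor is flattened and the l2 norm is taken.
norm_default = torch.linalg.norm(X)

# The same quantity computed by hand: square root of the sum of squares.
norm_manual = torch.sqrt((X ** 2).sum())

print(norm_default, norm_manual)
```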

10. **Consider three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{14}}$, initialized with Gaussian random variables. You want to compute the product $\mathbf{A} \mathbf{B} \mathbf{C}$. Is there any difference in memory footprint and speed, depending on whether you compute $(\mathbf{A} \mathbf{B}) \mathbf{C}$ or $\mathbf{A} (\mathbf{B} \mathbf{C})$. Why?**

For a matrix A of size (a,b) and a matrix B of size (b,c), the product AB needs $abc$ multiplications and $a(b-1)c \approx abc$ additions.

- (AB)C:

AB: $2^{10} \cdot 2^{16} \cdot 2^{5} = 2^{31}$ multiplications/additions

(AB)C: $2^{10} \cdot 2^{5} \cdot 2^{14} = 2^{29}$ multiplications/additions

calculation: $2^{31} + 2^{29} \approx 2^{31}$ multiplications/additions

memory: intermediate variable of shape ($2^{10},2^{5}$), i.e., $2^{15}$ entries

- A(BC):

BC: $2^{16} \cdot 2^{5} \cdot 2^{14} = 2^{35}$ multiplications/additions

A(BC): $2^{10} \cdot 2^{16} \cdot 2^{14} = 2^{40}$ multiplications/additions

calculation: $2^{35} + 2^{40} \approx 2^{40}$ multiplications/additions

memory: intermediate variable of shape ($2^{16},2^{14}$), i.e., $2^{30}$ entries

So computing (AB)C needs several hundred times fewer calculations and far less memory for the intermediate result.
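The counts can be reproduced with a few lines of plain Python. The helper `matmul_cost` is a made-up name for this sketch, and it only counts multiplications for the naive algorithm; real libraries may behave differently.

```python
def matmul_cost(a, b, c):
    """Multiplications needed for an (a, b) @ (b, c) product with the naive algorithm."""
    return a * b * c

# Shapes from the exercise: A is (a, b), B is (b, c), C is (c, d).
a, b, c, d = 2**10, 2**16, 2**5, 2**14

# (AB)C: first AB with shape (a, c), then multiply the result by C.
cost_ab_c = matmul_cost(a, b, c) + matmul_cost(a, c, d)
# A(BC): first BC with shape (b, d), then multiply A by the result.
cost_a_bc = matmul_cost(b, c, d) + matmul_cost(a, b, d)

# Number of entries in the intermediate matrix for each order.
mem_ab_c = a * c
mem_a_bc = b * d

print(cost_ab_c, cost_a_bc, cost_a_bc / cost_ab_c)
print(mem_ab_c, mem_a_bc)
```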

11. **Consider three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{16}}$. Is there any difference in speed depending on whether you compute $\mathbf{A} \mathbf{B}$ or $\mathbf{A} \mathbf{C}^\top$? Why? What changes if you initialize $\mathbf{C} = \mathbf{B}^\top$ without cloning memory? Why?**

In [25]:
import time
A = torch.randn((2 ** 10,2 ** 16))
B = torch.randn((2 ** 16,2 ** 5))
C = torch.randn((2 ** 5,2 ** 16))

time1 = time.time()
D = A@B
time2 = time.time()
print("1 A@B", time2-time1)

time1 = time.time()
D = A@(C.transpose(0,1))
time2 = time.time()
print("2 A@C^T",time2-time1)

C = B.T
time1 = time.time()
D = A@(C.transpose(0,1))
time2 = time.time()
print("3 A@(B^T)^T", time2-time1)

1 A@B 0.01600050926208496
2 A@C^T 0.01799941062927246
3 A@(B^T)^T 0.021985530853271484


<font color = green>(from the discussion)</font>
<font color = red>(uncertain)</font>

$\mathbf{A} \mathbf{C}^\top$ yields better performance due to the layout of data in memory. Torch usually stores data in row-major format, which favors memory accesses within the same row. Taking the transpose of $\mathbf{C}$ is not a physical transpose but a logical one: when we index the transposed matrix at $(i, j)$, the index is internally converted to $(j, i)$ of the matrix before transposition. In a matrix product, the elements of the second matrix are accessed column-wise, which is inefficient for a row-major matrix. If the second matrix is a logical transpose, the accesses become efficient again: the data is logically accessed column-wise but physically accessed row-wise, since no elements were actually swapped and only the indexing changed under the hood. Hence, the transpose technique works faster.

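I think the analysis makes sense. By this reasoning, method 2 should be faster than methods 1 and 3. In method 3, `C = B.T` shares memory with `B`, so `C.transpose(0,1)` has the same layout as `B` and should behave like method 1. But the actual result is that the three methods take about the same amount of time. A single `time.time()` measurement of an operation that takes about 20 ms is noisy, so differences of a few milliseconds are hard to interpret.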
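The "logical transpose" part can be checked directly by looking at strides and memory addresses. The sketch below uses a small stand-in matrix instead of the full $2^{16} \times 2^{5}$ one. It only inspects the layout and says nothing about speed.

```python
import torch

B = torch.randn(2**6, 2**3)  # small stand-in for the (2^16, 2^5) matrix
C = B.T                      # logical transpose, no data is copied

# Both tensors point at the same memory; only the strides are swapped.
shares_memory = C.data_ptr() == B.data_ptr()
print(B.stride(), C.stride())
print(B.is_contiguous(), C.is_contiguous())

# Transposing back gives a view with exactly the layout of B.
same_layout = C.T.stride() == B.stride() and C.T.is_contiguous()

# contiguous() materializes a physically transposed copy in new memory.
C_copy = B.T.contiguous()
copy_is_separate = C_copy.data_ptr() != B.data_ptr()

print(shares_memory, same_layout, copy_is_separate)
```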

12. **Consider three matrices, say $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{100 \times 200}$. Construct a tensor with three axes by stacking $[\mathbf{A}, \mathbf{B}, \mathbf{C}]$. What is the dimensionality? Slice out the second coordinate of the third axis to recover $\mathbf{B}$. Check that your answer is correct.**

In [5]:
A = torch.randn((100,200))
B = torch.randn((100,200))
C = torch.randn((100,200))
D = torch.stack((A, B, C), dim=2)
D.shape

torch.Size([100, 200, 3])

In [8]:
E = D[:,:,1]
E.equal(B)

True