# Linear Algebra

In [61]:
import torch

## Scalars

Scalars are the numbers of everyday mathematics. We denote scalars by ordinary lowercase letters (e.g., $x$, $y$, and $z$) and the space of all (continuous) real-valued scalars by $\Bbb{R}$. The expression $x \in \Bbb{R}$ states that $x$ is a real-valued scalar.

Scalars are implemented as tensors that contain only one element.


In [2]:
x = torch.tensor(3.0)
y = torch.tensor(2.0)

x+y, x*y, x/y, x**y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

## Vectors 

For current purposes, you can think of a vector as a fixed-length array of scalars. As with their code counterparts, we call these scalars the elements of the vector (synonyms include entries and components). When vectors represent examples from real-world datasets, their values hold some real-world significance.

Vectors are implemented as 1st-order tensors. In general, such tensors can have arbitrary lengths, subject to memory limitations.

In [3]:
x = torch.arange(3)
x

tensor([0, 1, 2])

- We refer to a vector with a bold lowercase letter, such as $\mathbf{x}$
- We refer to an element of a vector using a subscript, such as $x_2$

$$\mathbf{x} = \begin{bmatrix}x_{0} \\ \vdots \\ x_{n-1}\end{bmatrix},$$

- Here $x_0, \ldots, x_{n-1}$ are elements of the vector. Later on, we will distinguish between such column vectors and row vectors whose elements are stacked horizontally. Recall that we access a tensor’s elements via indexing

In [5]:
x[0], x[2], x[-1], x[-2]

(tensor(0), tensor(2), tensor(2), tensor(1))

To indicate that a vector contains $n$ elements, we write $\mathbf{x} \in \Bbb{R}^{n}$. Formally, we call $n$ the dimensionality of the vector.

In [6]:
len(x) , x.shape

(3, torch.Size([3]))

- We use order to refer to the number of axes and dimensionality exclusively to refer to the number of components.
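
The distinction between order and dimensionality can be checked directly on tensors. The snippet below is a small sketch, assuming `ndim` is read as the order and `len` (or `shape`) as the number of components along an axis; the variable names are only illustrative.

```python
import torch

x = torch.arange(3)                     # a vector: one axis, three components
X = torch.arange(24).reshape(2, 3, 4)   # a 3rd-order tensor

# Order = number of axes; dimensionality of a vector = number of components.
print(x.ndim, len(x))    # one axis, three components
print(X.ndim, X.shape)   # three axes, with lengths 2, 3 and 4
```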

## Matrices

- We denote matrices by bold capital letters, such as $\mathbf{A}$
- The expression $\mathbf{A} \in \Bbb{R}^{m \times n}$ indicates that a matrix $\mathbf{A}$ contains $m \times n$ real-valued scalars, arranged as $m$ rows and $n$ columns.
- When $m = n$ we say that the matrix is square.
- We can illustrate any matrix as a table.
- To refer to an individual element, we subscript both the row and column indices: $a_{ij}$ is the element in $\mathbf{A}$'s $i^{th}$ row and $j^{th}$ column

$$\begin{split}\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}.\end{split}$$

- In code, we represent a matrix $\mathbf{A} \in \Bbb{R}^{m \times n}$ by a 2nd-order tensor with shape $(m,n)$
- Matrices are useful for representing datasets. Typically, rows correspond to individual records and columns correspond to distinct attributes.

In [7]:
A = torch.arange(6).reshape(3,2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

In [9]:
A.reshape(2,3)

tensor([[0, 1, 2],
        [3, 4, 5]])

In [10]:
A.reshape(1,6)

tensor([[0, 1, 2, 3, 4, 5]])

In [11]:
A.reshape(6,1)

tensor([[0],
        [1],
        [2],
        [3],
        [4],
        [5]])

### Transpose

- Sometimes we want to flip the axes. When we exchange a matrix’s rows and columns, the result is called its transpose.
- Formally, we signify a matrix $\mathbf{A}$’s transpose by $\mathbf{A}^\top$ and if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for all $i$ and $j$.
- The transpose of an $m \times n$ matrix is an $n \times m$ matrix:

$$\begin{split}\mathbf{A}^\top =
\begin{bmatrix}
    a_{11} & a_{21} & \dots  & a_{m1} \\
    a_{12} & a_{22} & \dots  & a_{m2} \\
    \vdots & \vdots & \ddots  & \vdots \\
    a_{1n} & a_{2n} & \dots  & a_{mn}
\end{bmatrix}.\end{split}$$

In [12]:
A.T

tensor([[0, 2, 4],
        [1, 3, 5]])
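
As a quick sanity check of the index rule, the sketch below compares every entry of the transpose at position $(i, j)$ with the entry of the original matrix at position $(j, i)$. The names `B` and `entries_match` are only illustrative.

```python
import torch

A = torch.arange(6).reshape(3, 2)
B = A.T

# Every entry of B at (i, j) should equal the entry of A at (j, i).
entries_match = all(
    B[i, j] == A[j, i]
    for i in range(B.shape[0])
    for j in range(B.shape[1])
)
print(B.shape, entries_match)
```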

**Symmetric matrices**

Symmetric matrices are the subset of square matrices that are equal to their own transposes: $\mathbf{A} = \mathbf{A}^\top$. The following matrix is symmetric:

In [15]:
A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

## Tensors

- Tensors give us a generic way of describing extensions to $n^{th}$-order arrays.
- We call software objects of the tensor class “tensors” precisely because they too can have arbitrary numbers of axes.
- Although it may be confusing to use the word tensor for both the mathematical object and its realization in code, our meaning should usually be clear from context.
- We denote general tensors by letters such as $\mathsf{X}$, $\mathsf{Y}$ and $\mathsf{Z}$
- Their indexing mechanism follows naturally from that of matrices, e.g., $x_{ijk}$

- Images are one example of an application of tensors.
- Each image arrives as a 3rd-order tensor with axes corresponding to the height, width, and channel. At each spatial location, the intensities of each color (red, green, and blue) are stacked along the channel.
- A collection of images is represented in code by a 4th-order tensor, where distinct images are indexed along the first axis.
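
The sketch below illustrates this with made-up data: three random tensors stand in for 32x32 RGB images in (height, width, channel) layout, and `torch.stack` adds the new first axis. The image size and the names `images` and `batch` are assumptions for illustration only.

```python
import torch

# Three hypothetical 32x32 RGB images, laid out as (height, width, channel).
images = [torch.rand(32, 32, 3) for _ in range(3)]

# Stacking along a new first axis gives a 4th-order tensor.
batch = torch.stack(images)
print(images[0].ndim, batch.ndim, batch.shape)

# Indexing the first axis recovers an individual image.
print(torch.equal(batch[1], images[1]))
```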

In [18]:
torch.arange(24).reshape(2, 3, 4)

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

In [19]:
torch.arange(24).reshape(4, 2, 3)

tensor([[[ 0,  1,  2],
         [ 3,  4,  5]],

        [[ 6,  7,  8],
         [ 9, 10, 11]],

        [[12, 13, 14],
         [15, 16, 17]],

        [[18, 19, 20],
         [21, 22, 23]]])

In [20]:
torch.arange(24).reshape(3, 4, 2)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5],
         [ 6,  7]],

        [[ 8,  9],
         [10, 11],
         [12, 13],
         [14, 15]],

        [[16, 17],
         [18, 19],
         [20, 21],
         [22, 23]]])

## Basic Tensor Arithmetic Properties

### Elementwise operations
- Elementwise operations produce outputs that have the same shape as their operands

In [3]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B, A-B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]),
 tensor([[0., 0., 0.],
         [0., 0., 0.]]))

### Hadamard product

- The Hadamard product (denoted $\odot$) is the elementwise product of two matrices.
- The Hadamard product of two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$ is defined as:

$$\begin{split}\mathbf{A} \odot \mathbf{B} =
\begin{bmatrix}
    a_{11}  b_{11} & a_{12}  b_{12} & \dots  & a_{1n}  b_{1n} \\
    a_{21}  b_{21} & a_{22}  b_{22} & \dots  & a_{2n}  b_{2n} \\
    \vdots & \vdots & \ddots & \vdots \\
    a_{m1}  b_{m1} & a_{m2}  b_{m2} & \dots  & a_{mn}  b_{mn}
\end{bmatrix}.\end{split}$$

In [5]:
A * B, B*A

(tensor([[ 0.,  1.,  4.],
         [ 9., 16., 25.]]),
 tensor([[ 0.,  1.,  4.],
         [ 9., 16., 25.]]))

### Scalar addition and multiplication

- Adding or multiplying a scalar and a tensor produces a result with the same shape as the original tensor. Here, each element of the tensor is added to (or multiplied by) the scalar.


In [6]:
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, a * X

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],
 
         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 tensor([[[ 0,  2,  4,  6],
          [ 8, 10, 12, 14],
          [16, 18, 20, 22]],
 
         [[24, 26, 28, 30],
          [32, 34, 36, 38],
          [40, 42, 44, 46]]]))

In [7]:
(a+X).shape, (a*X).shape

(torch.Size([2, 3, 4]), torch.Size([2, 3, 4]))

## Reduction

- Sum of tensor elements: for a vector $\mathbf{x}$ of length $n$, we write $\sum_{i=0}^{n-1} x_i$

In [8]:
x = torch.arange(3, dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2.]), tensor(3.))

- To express sums over the elements of tensors of arbitrary shape, we simply sum over all its axes. 
- Considering $\mathbf{A}$ as $m \times n$ matrix, the sum is written: $\sum_{i=0}^{m-1} \sum_{j=0}^{n-1} a_{ij}$

In [13]:
A = torch.arange(6, dtype=torch.float32).reshape(2,3)
A, A.shape, A.sum()

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 torch.Size([2, 3]),
 tensor(15.))

- Invoking the sum function reduces a tensor along all of its axes, eventually producing a scalar
- To sum over all elements along the rows (axis 0), we specify axis=0 in sum. Since the input matrix reduces along axis 0 to generate the output vector, this axis is missing from the shape of the output.

In [16]:
A, A.shape, A.sum(axis=0), A.sum(axis=0).shape

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 torch.Size([2, 3]),
 tensor([3., 5., 7.]),
 torch.Size([3]))

In [17]:
A, A.shape, A.sum(axis=1), A.sum(axis=1).shape

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 torch.Size([2, 3]),
 tensor([ 3., 12.]),
 torch.Size([2]))

- Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.

In [18]:
A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()

tensor(True)

- A related quantity is the mean, also called the average. We calculate the mean by dividing the sum by the total number of elements.
- Likewise, the function for calculating the mean can also reduce a tensor along specific axes.

In [19]:
A.mean(), A.sum() / A.numel()

(tensor(2.5000), tensor(2.5000))

In [20]:
A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))

In [21]:
A.mean(axis=1), A.sum(axis=1) / A.shape[1]

(tensor([1., 4.]), tensor([1., 4.]))

## Non-Reduction Sum

- It can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.
- Passing keepdims=True maintains the number of axes, but not the exact dimensions: the reduced axis is kept with length 1.

In [3]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [5]:
A.sum(axis=1), A.sum(axis=1, keepdims=True)

(tensor([ 3., 12.]),
 tensor([[ 3.],
         [12.]]))

In [6]:
A.shape, (A.sum(axis=1)).shape, (A.sum(axis=1, keepdims=True)).shape

(torch.Size([2, 3]), torch.Size([2]), torch.Size([2, 1]))

In [7]:
sum_A = A.sum(axis=1, keepdims=True)
sum_A

tensor([[ 3.],
        [12.]])

- For instance, since sum_A keeps its two axes after summing each row, we can divide A by sum_A with broadcasting to create a matrix where each row sums up to 1.

In [12]:
A / sum_A, (A / sum_A).sum(axis=1)

(tensor([[0.0000, 0.3333, 0.6667],
         [0.2500, 0.3333, 0.4167]]),
 tensor([1., 1.]))

- If we want to calculate the cumulative sum of elements of A along some axis, say axis=0 (row by row), we can call the cumsum function. By design, this function does not reduce the input tensor along any axis.

In [19]:
A, A.cumsum(axis=0), A.cumsum(axis=1)

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[0., 1., 2.],
         [3., 5., 7.]]),
 tensor([[ 0.,  1.,  3.],
         [ 3.,  7., 12.]]))

## Dot Products

- One of the most fundamental operations is the dot product.
- Given two vectors $\mathbf{x}, \mathbf{y} \in \Bbb{R}^d$, their dot product (or inner product) $\langle \mathbf{x}, \mathbf{y} \rangle$ is a sum over the products of the elements at the same position:
$$\mathbf{x}^\top \mathbf{y} = \sum_{i=0}^{d-1} x_i y_i$$

In [22]:
x = torch.arange(3, dtype = torch.float32)
y = torch.ones(3, dtype = torch.float32) 
x, y, torch.dot(x, y)

(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))

- Equivalently, we can calculate the dot product of two vectors by performing an elementwise multiplication followed by a sum.

In [23]:
torch.sum(x*y)

tensor(3.)

Dot products are useful in a wide range of contexts:
- Given some set of values, denoted by a vector $\mathbf{x} \in \Bbb{R}^n$, and a set of weights, denoted by $\mathbf{w} \in \Bbb{R}^n$, the weighted sum of the values in $\mathbf{x}$ according to the weights $\mathbf{w}$ could be expressed as the dot product $\mathbf{x}^{\top}\mathbf{w}$. When the weights are nonnegative and sum to 1, the dot product expresses a weighted average.
- After normalizing two vectors to have unit length, the dot product expresses the cosine of the angle between them.
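
Both uses can be sketched in a few lines. The values and weights below are made up for illustration; the second part normalizes two vectors to unit length before taking the dot product, so the result should be the cosine of the 45-degree angle between them.

```python
import torch

# Weighted average: nonnegative weights that sum to 1 (values are made up).
x = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor([0.5, 0.25, 0.25])
weighted_avg = torch.dot(x, w)   # 0.5*2 + 0.25*4 + 0.25*6 = 3.5

# Cosine of the angle: normalize both vectors, then take the dot product.
u = torch.tensor([1.0, 0.0])
v = torch.tensor([1.0, 1.0])
cos_angle = torch.dot(u / torch.norm(u), v / torch.norm(v))  # approx. 0.7071

print(weighted_avg, cos_angle)
```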

## Matrix-Vector Products

- To start off, we visualize our matrix in terms of its row vectors.

$$\begin{split}\mathbf{A}=
\begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix},\end{split}$$

- Each $\mathbf{a}^\top_{i} \in \Bbb{R}^{n}$ is a row vector representing the $i^{th}$ row of the matrix $\mathbf{A}$ 
- The vector $\mathbf{x}$ is an $n$-dimensional vector
- The matrix-vector product $\mathbf{A}\mathbf{x}$ is a column vector of length $m$, whose $i^{th}$ element is the dot product $\mathbf{a}^\top_i \mathbf{x}$

$$\begin{split}\mathbf{A}\mathbf{x}
= \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix}\mathbf{x}
= \begin{bmatrix}
 \mathbf{a}^\top_{1} \mathbf{x}  \\
 \mathbf{a}^\top_{2} \mathbf{x} \\
\vdots\\
 \mathbf{a}^\top_{m} \mathbf{x}\\
\end{bmatrix}.\end{split}$$


- We can think of this multiplication as a transformation that projects vectors from $\Bbb{R}^{n}$ to $\Bbb{R}^{m}$


In [46]:
A = torch.arange(6, dtype = torch.float32).reshape(2,3)
x = torch.arange(3, dtype = torch.float32)
A, x

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([0., 1., 2.]))

In [47]:
(A[0]*x).sum(), (A[1]*x).sum(), torch.dot(A[0],x), torch.dot(A[1],x)

(tensor(5.), tensor(14.), tensor(5.), tensor(14.))

In [48]:
torch.mv(A,x)

tensor([ 5., 14.])

In [42]:
A@x

tensor([ 5., 14.])

- Multiplying by a matrix applies a transformation to the vector
- We can represent rotations as multiplications by certain square matrices.
- Matrix–vector products also describe the key calculation involved in computing the outputs of each layer in a neural network given the outputs from the previous layer.
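
As an illustration of the rotation case, the sketch below builds the standard 2D rotation matrix for an angle `theta` (90 degrees here, chosen arbitrarily) and applies it to a unit vector. The names `R`, `v` and `rotated` are only illustrative.

```python
import math
import torch

theta = math.pi / 2  # rotate by 90 degrees counterclockwise
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])

v = torch.tensor([1.0, 0.0])
rotated = R @ v  # approximately [0, 1], up to floating-point error

print(rotated)
# A rotation changes the direction of a vector but not its length.
print(torch.norm(v), torch.norm(rotated))
```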

## Matrix-Matrix Multiplication

- Say that we have two matrices $\mathbf{A} \in \Bbb{R}^{n \times k}$ and $\mathbf{B} \in \Bbb{R}^{k \times m}$

$$\begin{split}\mathbf{A}=\begin{bmatrix}
 a_{11} & a_{12} & \cdots & a_{1k} \\
 a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
 a_{n1} & a_{n2} & \cdots & a_{nk} \\
\end{bmatrix},\quad
\mathbf{B}=\begin{bmatrix}
 b_{11} & b_{12} & \cdots & b_{1m} \\
 b_{21} & b_{22} & \cdots & b_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
 b_{k1} & b_{k2} & \cdots & b_{km} \\
\end{bmatrix}.\end{split}$$

- Let $\mathbf{a}^\top_{i} \in \Bbb{R}^k$ denote the row vector representing the $i^{th}$ row of the matrix $\mathbf{A}$
- Let $\mathbf{b}_{j} \in \Bbb{R}^k$ denote the column vector representing the $j^{th}$ column of the matrix $\mathbf{B}$

$$\begin{split}\mathbf{A}=
\begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_n \\
\end{bmatrix},
\quad \mathbf{B}=\begin{bmatrix}
 \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
\end{bmatrix}.\end{split}$$

- To form the matrix product $\mathbf{C} \in \Bbb{R}^{n \times m}$ we simply compute each element $c_{ij}$ as the dot product between the $i^{th}$ row of $\mathbf{A}$ and the $j^{th}$ column of $\mathbf{B}$, i.e., $\mathbf{a}^\top_i \mathbf{b}_j$

$$\begin{split}\mathbf{C} = \mathbf{AB} = \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_n \\
\end{bmatrix}
\begin{bmatrix}
 \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
\end{bmatrix}
= \begin{bmatrix}
\mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\
 \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\
 \vdots & \vdots & \ddots &\vdots\\
\mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m
\end{bmatrix}.\end{split}$$

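- We can think of the matrix-matrix multiplication $\mathbf{AB}$ as performing $m$ matrix-vector products (one per column of $\mathbf{B}$) and stitching the results together into an $n \times m$ matrix.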

In [61]:
A = torch.arange(6, dtype=float).reshape(2,3)
B = torch.ones(3,4, dtype=float)

torch.mm(A,B), A@B

(tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]], dtype=torch.float64),
 tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]], dtype=torch.float64))

- The term matrix–matrix multiplication is often simplified to matrix multiplication, and should not be confused with the Hadamard product.
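
The following sketch rebuilds a matrix product column by column from matrix-vector products, and then contrasts the matrix product with the Hadamard product on a small square matrix. The matrices and the names `columns`, `C` and `S` are made up for illustration.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.arange(12, dtype=torch.float32).reshape(3, 4)

# Column j of AB is the matrix-vector product of A with column j of B.
columns = [torch.mv(A, B[:, j]) for j in range(B.shape[1])]
C = torch.stack(columns, dim=1)
print(torch.allclose(C, A @ B))

# The Hadamard product needs equal shapes and multiplies entry by entry.
S = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(S @ S)  # matrix product
print(S * S)  # Hadamard product
```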

## Norms

- Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big it is
- A norm is a function $\| \cdot \|$ that maps a vector to a scalar and satisfies the following three properties:
1. Given any vector $\mathbf{x}$, if we scale (all elements of) the vector by a scalar $\alpha \in \Bbb{R}$, its norm scales accordingly:
$$\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|.$$

2. For any vectors $\mathbf{x}$ and $\mathbf{y}$, norms satisfy the triangle inequality:
$$\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|.$$

3. The norm of a vector is nonnegative and only vanishes if the vector is zero:
$$\|\mathbf{x}\| > 0 \textrm{ for all } \mathbf{x} \neq 0.$$
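
These three properties can be spot-checked numerically for the Euclidean norm. The sketch below uses arbitrary example vectors and an arbitrary scalar `alpha`; it checks single instances and is not a proof.

```python
import torch

x = torch.tensor([3.0, -4.0])
y = torch.tensor([1.0, 2.0])
alpha = -2.0

# 1. Scaling: ||alpha x|| = |alpha| ||x||
scaled, expected = torch.norm(alpha * x), abs(alpha) * torch.norm(x)
print(scaled, expected)

# 2. Triangle inequality: ||x + y|| <= ||x|| + ||y||
triangle_ok = torch.norm(x + y) <= torch.norm(x) + torch.norm(y)
print(triangle_ok)

# 3. The norm is positive for a nonzero vector and zero for the zero vector.
print(torch.norm(x) > 0, torch.norm(torch.zeros(2)) == 0)
```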

- Many functions are valid norms and different norms encode different notions of size:
- For vectors:  

1. Manhattan distance ($\ell_1$):
$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$$

In [62]:
u = torch.tensor([3.0, -4.0])
torch.abs(u).sum()

tensor(7.)

2. Euclidean norm ($\ell_2$):

$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$$

In [63]:
u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)

3. These norms are special cases of the more general $\ell_p$ norms:

$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.$$
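
The general formula translates almost literally into code. In the sketch below, `lp_norm` is a hypothetical helper written for illustration; its result for $p = 3$ is compared with `torch.linalg.vector_norm`, which accepts the order through its `ord` argument.

```python
import torch

def lp_norm(x, p):
    """Direct implementation of (sum_i |x_i|^p)^(1/p); a hypothetical helper."""
    return (x.abs() ** p).sum() ** (1.0 / p)

u = torch.tensor([3.0, -4.0])

# p = 1 and p = 2 recover the Manhattan and Euclidean norms.
print(lp_norm(u, 1), lp_norm(u, 2))

# For other values of p, compare against the built-in implementation.
print(lp_norm(u, 3), torch.linalg.vector_norm(u, ord=3))
```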


- For matrices:

1. Frobenius norm:

$$\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$$

In [65]:
torch.norm(torch.ones((4, 9)))

tensor(6.)

2. Spectral norm:
How much longer the matrix–vector product $\mathbf{X}\mathbf{v}$ could be relative to $\mathbf{v}$, i.e., the largest factor by which $\mathbf{X}$ can stretch any vector.
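
A small sketch of the spectral norm, assuming a diagonal example matrix so the answer is easy to read off: the norm returned by `torch.linalg.matrix_norm` with `ord=2` should match the largest singular value, and no unit vector should be stretched by more than that factor.

```python
import torch

X = torch.tensor([[3.0, 0.0],
                  [0.0, 4.0]])

# The spectral norm equals the largest singular value of X.
spectral = torch.linalg.matrix_norm(X, ord=2)
largest_sv = torch.linalg.svdvals(X).max()
print(spectral, largest_sv)

# A unit-length vector is stretched by at most the spectral norm.
v = torch.tensor([0.6, 0.8])
print(torch.norm(X @ v) <= spectral)
```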

These concepts are useful because we are often trying to solve optimization problems:

- maximize the probability assigned to observed data
- maximize the revenue associated with a recommender model
- minimize the distance between predictions and the ground truth observations
- minimize the distance between representations of photos of the same person while maximizing the distance between representations of photos of different people

These distances, which constitute the objectives of deep learning algorithms, are often expressed as norms.
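
For example, the distance between predictions and ground-truth observations can be measured with either the $\ell_1$ or the $\ell_2$ norm of their difference. The numbers below are made up purely for illustration.

```python
import torch

# Made-up predictions and ground-truth observations.
predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

error = predictions - targets
l1_distance = torch.abs(error).sum()   # 0.5 + 0.5 + 0.0 = 1.0
l2_distance = torch.norm(error)        # sqrt(0.25 + 0.25), approx. 0.7071
print(l1_distance, l2_distance)
```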

## Exercises

### 1

In [69]:
A = torch.arange(6).reshape(2,3)
A == (A.T).T

tensor([[True, True, True],
        [True, True, True]])

### 2

In [74]:
A = torch.arange(12).reshape(3,4)
B = torch.arange(12).reshape(3,4)

A.T + B.T == (A + B).T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True],
        [True, True, True]])

### 3

Let $B = A + A^\top$ for a square matrix $A$. Then

$$B^\top = (A+A^\top)^\top = A^\top+(A^{\top})^\top = A^\top+A = A+A^\top = B.$$

Since $B = B^\top$, the matrix $A+A^\top$ is always symmetric.

In [86]:
A = torch.arange(16).reshape(4,4)
A, A.T

(tensor([[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11],
         [12, 13, 14, 15]]),
 tensor([[ 0,  4,  8, 12],
         [ 1,  5,  9, 13],
         [ 2,  6, 10, 14],
         [ 3,  7, 11, 15]]))

In [79]:
A + A.T, A.T + A, (A + A.T).T, (A.T + A).T 

(tensor([[ 0,  5, 10, 15],
         [ 5, 10, 15, 20],
         [10, 15, 20, 25],
         [15, 20, 25, 30]]),
 tensor([[ 0,  5, 10, 15],
         [ 5, 10, 15, 20],
         [10, 15, 20, 25],
         [15, 20, 25, 30]]),
 tensor([[ 0,  5, 10, 15],
         [ 5, 10, 15, 20],
         [10, 15, 20, 25],
         [15, 20, 25, 30]]),
 tensor([[ 0,  5, 10, 15],
         [ 5, 10, 15, 20],
         [10, 15, 20, 25],
         [15, 20, 25, 30]]))

### 4 & 5

len(X) returns the first value in the shape, i.e., the length of the first axis (axis 0).

In [84]:
X = torch.arange(24).reshape(3,4,2)
Y = torch.arange(24).reshape(2,3,4)
Z = torch.arange(24).reshape(4,3,2)
len(X), len(Y), len(Z)

(3, 2, 4)

Yes, it always corresponds to the length of the first axis, which is the number of elements inside the outermost brackets [].

### 6

In [13]:
A = torch.arange(16).reshape(4,4)

A / A.sum(axis=1)

tensor([[0.0000, 0.0455, 0.0526, 0.0556],
        [0.6667, 0.2273, 0.1579, 0.1296],
        [1.3333, 0.4091, 0.2632, 0.2037],
        [2.0000, 0.5909, 0.3684, 0.2778]])

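The denominator A.sum(axis=1) is a vector of shape (4,) holding the sum of each row of the 4x4 matrix, and the numerator is the 4x4 matrix itself. Broadcasting aligns the vector with the last axis, so every row of the matrix is divided elementwise by the same vector: the element in column $j$ is divided by the sum of row $j$. The rows are therefore not normalized by their own sums, and the operation only works here because the matrix is square.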

In [20]:
A[0] /A.sum(axis=1), A[1] /A.sum(axis=1), A[2] /A.sum(axis=1), A[3] /A.sum(axis=1) 

(tensor([0.0000, 0.0455, 0.0526, 0.0556]),
 tensor([0.6667, 0.2273, 0.1579, 0.1296]),
 tensor([1.3333, 0.4091, 0.2632, 0.2037]),
 tensor([2.0000, 0.5909, 0.3684, 0.2778]))
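
To make the broadcasting behavior explicit, the sketch below contrasts the division from the exercise with the keepdims=True variant, and shows that the first form fails for a non-square matrix. The names `by_column`, `by_row`, `M` and `broadcast_failed` are only illustrative.

```python
import torch

A = torch.arange(16, dtype=torch.float32).reshape(4, 4)

# Shape (4,) broadcasts along the last axis: column j is divided by the sum of row j.
by_column = A / A.sum(axis=1)

# Shape (4, 1) broadcasts across each row: row i is divided by its own sum.
by_row = A / A.sum(axis=1, keepdims=True)
print(by_row.sum(axis=1))  # each row sums to 1

# For a non-square matrix, shapes (2, 3) and (2,) do not broadcast.
M = torch.arange(6, dtype=torch.float32).reshape(2, 3)
try:
    M / M.sum(axis=1)
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print(broadcast_failed)
```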

### 7

In [27]:
avenue_dist = 30.0
street_dist = 10.0

distance_2pts = torch.tensor([avenue_dist, street_dist]) 

torch.abs(distance_2pts).sum(), torch.norm(distance_2pts)

(tensor(40.), tensor(31.6228))

### 8

In [40]:
X = torch.arange(24).reshape(2,3,4)

X

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

In [42]:
(X.sum(axis=0),X.sum(axis=0).shape), (X.sum(axis=1),X.sum(axis=1).shape), (X.sum(axis=2),X.sum(axis=2).shape)

((tensor([[12, 14, 16, 18],
          [20, 22, 24, 26],
          [28, 30, 32, 34]]),
  torch.Size([3, 4])),
 (tensor([[12, 15, 18, 21],
          [48, 51, 54, 57]]),
  torch.Size([2, 4])),
 (tensor([[ 6, 22, 38],
          [54, 70, 86]]),
  torch.Size([2, 3])))

### 9

In [44]:
X = torch.arange(24, dtype=float).reshape(2,3,4)

torch.linalg.norm(X)

tensor(65.7571, dtype=torch.float64)

In [46]:
from math import sqrt

sqrt((X**2).sum())

65.75712889109438

In [50]:
X = X.reshape(6,4)
torch.linalg.norm(X)

tensor(65.7571, dtype=torch.float64)

Calculates the square root of the sum of the squares of each element.

### 10

$A \in \Bbb{R}^{2^{10} \times 2^{16}}, B \in \Bbb{R}^{2^{16} \times 2^{5}}, C \in \Bbb{R}^{2^{5} \times 2^{14}}$

We consider $X$ as an intermediate matrix used to compute the matrix product $ABC = Y$.

- $(AB)C$
    1. $(AB) = X \in \Bbb{R}^{2^{10} \times 2^{5}}$
    2. $ XC = Y \in \Bbb{R}^{2^{10} \times 2^{14}}$

- $A(BC)$
    1. $(BC) = X \in \Bbb{R}^{2^{16} \times 2^{14}}$
    2. $ AX = Y \in \Bbb{R}^{2^{10} \times 2^{14}}$

Assuming the products are computed sequentially, the first approach has a much smaller memory footprint than the second: its intermediate matrix has $2^{10} \times 2^{5} = 2^{15}$ entries, whereas the intermediate matrix in the second approach has $2^{16} \times 2^{14} = 2^{30}$ entries. The first approach also needs far fewer scalar multiplications ($2^{31} + 2^{29}$ versus $2^{35} + 2^{40}$).
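
The comparison can be worked out with plain integer arithmetic. The sketch below assumes the usual cost model in which multiplying an $r \times s$ matrix by an $s \times t$ matrix takes $r \cdot s \cdot t$ scalar multiplications; the variable names are only illustrative.

```python
# Shapes from the exercise: A is n x k, B is k x m, C is m x p.
n, k, m, p = 2**10, 2**16, 2**5, 2**14

# An (r x s) by (s x t) product takes r*s*t scalar multiplications.
cost_AB_then_C = n * k * m + n * m * p   # (AB)C
cost_A_then_BC = k * m * p + n * k * p   # A(BC)

# Number of entries in the intermediate matrix X for each ordering.
size_AB = n * m
size_BC = k * p

print(cost_AB_then_C, cost_A_then_BC, cost_A_then_BC / cost_AB_then_C)
print(size_AB, size_BC)
```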

### 11

In [44]:
import torch
import time

a = torch.randn([2**10,2**16])
b = torch.randn([2**16,2**5])
c = torch.randn([2**5,2**16])
d = b.T

In [45]:
times = {}
for i in range(100):
    key = f"iteration_{i} :"
    times[key] = {}
    print(f"{key}\n")
    # Time A @ B
    start = time.time()
    a@b
    end = time.time()
    times[key]["time_A@B"] = end-start
    print (f"time to generate AxB: {round(end-start,3)} seconds")
    # Time A @ C.T
    start = time.time()
    a@c.T
    end = time.time()
    times[key]["time_A@CT"] = end-start
    print (f"time to generate AxC.T:{round(end-start,3)} seconds")
    # Time A @ (B.T).T, where d = b.T
    start = time.time()
    a@d.T
    end = time.time()
    times[key]["time_A@(BT)T"] = end-start
    print (f"time to generate Ax(B.T).T:{round(end-start,3)} seconds\n")


iteration_0 :

time to generate AxB: 0.041 seconds
time to generate AxC.T:0.063 seconds
time to generate Ax(B.T).T:0.044 seconds

iteration_1 :

time to generate AxB: 0.042 seconds
time to generate AxC.T:0.039 seconds
time to generate Ax(B.T).T:0.042 seconds

iteration_2 :

time to generate AxB: 0.04 seconds
time to generate AxC.T:0.044 seconds
time to generate Ax(B.T).T:0.039 seconds

iteration_3 :

time to generate AxB: 0.041 seconds
time to generate AxC.T:0.041 seconds
time to generate Ax(B.T).T:0.045 seconds

iteration_4 :

time to generate AxB: 0.039 seconds
time to generate AxC.T:0.041 seconds
time to generate Ax(B.T).T:0.043 seconds

iteration_5 :

time to generate AxB: 0.042 seconds
time to generate AxC.T:0.039 seconds
time to generate Ax(B.T).T:0.036 seconds

iteration_6 :

time to generate AxB: 0.04 seconds
time to generate AxC.T:0.042 seconds
time to generate Ax(B.T).T:0.041 seconds

iteration_7 :

time to generate AxB: 0.042 seconds
time to generate AxC.T:0.043 seconds
time to generate Ax(B.T).T:0.044 seconds

In [46]:
import pandas as pd

df = pd.DataFrame(times).T
df

| | time_A@B | time_A@CT | time_A@(BT)T |
|---|---|---|---|
| iteration_0 : | 0.041001 | 0.062999 | 0.043999 |
| iteration_1 : | 0.042002 | 0.038998 | 0.041995 |
| iteration_2 : | 0.040002 | 0.044000 | 0.038998 |
| iteration_3 : | 0.041003 | 0.040997 | 0.045000 |
| iteration_4 : | 0.039003 | 0.041000 | 0.042997 |
| ... | ... | ... | ... |
| iteration_95 : | 0.041996 | 0.041006 | 0.038995 |
| iteration_96 : | 0.041000 | 0.044003 | 0.041997 |
| iteration_97 : | 0.042001 | 0.035001 | 0.044001 |
| iteration_98 : | 0.042998 | 0.043000 | 0.042998 |


In [47]:
df["difference_A@B_A@CT"] = df.iloc[:,0] - df.iloc[:,1]
df["difference_A@B_A@(BT)T"] = df.iloc[:,0] - df.iloc[:,2]
df["difference_A@CT_A@(BT)T"] = df.iloc[:,1] - df.iloc[:,2]
df["A@B_faster_A@CT"] = df["difference_A@B_A@CT"] < 0
df["A@B_faster_A@(BT)T"] = df["difference_A@B_A@(BT)T"] < 0
df["A@CT_faster_A@(BT)T"] = df["difference_A@CT_A@(BT)T"] < 0

df

| | time_A@B | time_A@CT | time_A@(BT)T | difference_A@B_A@CT | difference_A@B_A@(BT)T | difference_A@CT_A@(BT)T | A@B_faster_A@CT | A@B_faster_A@(BT)T | A@CT_faster_A@(BT)T |
|---|---|---|---|---|---|---|---|---|---|
| iteration_0 : | 0.041001 | 0.062999 | 0.043999 | -0.021998 | -2.998114e-03 | 0.019000 | True | True | False |
| iteration_1 : | 0.042002 | 0.038998 | 0.041995 | 0.003005 | 7.629395e-06 | -0.002997 | False | False | True |
| iteration_2 : | 0.040002 | 0.044000 | 0.038998 | -0.003998 | 1.004457e-03 | 0.005002 | True | False | False |
| iteration_3 : | 0.041003 | 0.040997 | 0.045000 | 0.000006 | -3.996849e-03 | -0.004003 | False | True | True |
| iteration_4 : | 0.039003 | 0.041000 | 0.042997 | -0.001997 | -3.994465e-03 | -0.001997 | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| iteration_95 : | 0.041996 | 0.041006 | 0.038995 | 0.000991 | 3.001928e-03 | 0.002011 | False | False | False |
| iteration_96 : | 0.041000 | 0.044003 | 0.041997 | -0.003003 | -9.975433e-04 | 0.002005 | True | True | False |
| iteration_97 : | 0.042001 | 0.035001 | 0.044001 | 0.007001 | -1.999855e-03 | -0.009001 | False | True | True |
| iteration_98 : | 0.042998 | 0.043000 | 0.042998 | -0.000002 | -2.384186e-07 | 0.000001 | True | True | False |


In [48]:
df["A@B_faster_A@CT"].value_counts(), df["A@B_faster_A@(BT)T"].value_counts(), df["A@CT_faster_A@(BT)T"].value_counts()

(A@B_faster_A@CT
 True     52
 False    48
 Name: count, dtype: int64,
 A@B_faster_A@(BT)T
 True     53
 False    47
 Name: count, dtype: int64,
 A@CT_faster_A@(BT)T
 False    53
 True     47
 Name: count, dtype: int64)

- It seems to be slightly faster to compute $AB$ rather than $AC^{\top}$, although the margin is small (52 of 100 runs).
- There were runs in which $A(B^\top)^\top$ was faster than $AB$, but most of the time $AB$ was faster (53 of 100 runs).
- There was no significant difference between $A(B^\top)^\top$ and $AC^{\top}$: sometimes the first was faster and sometimes the second.
- The elements are initialized randomly, which may have some influence on the time the multiplications take, although with margins this small, run-to-run timing noise is a likely explanation as well.
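
One factor that may be worth inspecting when comparing these products is memory layout: `.T` returns a view with swapped strides rather than a copy, so the transposed operands have the same shape as `b` but not necessarily the same layout. The sketch below uses much smaller stand-in matrices (the sizes are arbitrary) and only inspects layouts; it makes no claim about how much this affects timing on a given machine.

```python
import torch

# Smaller stand-ins for the matrices in the experiment (sizes are arbitrary).
b = torch.randn(64, 8)
c = torch.randn(8, 64)
d = b.T

# .T returns a view with swapped strides; no data is copied.
print(b.is_contiguous(), b.stride())       # row-major layout
print(c.T.is_contiguous(), c.T.stride())   # same shape as b, different layout
print(d.T.is_contiguous(), d.T.stride())   # transposing twice restores b's layout
```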

### 12 

In [54]:
A = torch.ones(100, 200, dtype=torch.int32)
B = A*2
C = A*3

X = torch.stack([A,B,C])


The shape of the tensor is (3, 100, 200).

In [55]:
X.shape

torch.Size([3, 100, 200])

In [67]:
(X[1] == B).prod()

tensor(1)