'''
 * Copyright (c) 2004 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## 2.3.4 Tensors

While you can go far in your machine learning journey with only scalars, vectors, and matrices, eventually you may need to work with higher-order tensors. Tensors give us a generic way to describe extensions to $n^{th}$-order arrays. We call software objects of the tensor class “tensors” precisely because they too can have arbitrary numbers of axes. While it may be confusing to use the word tensor for both the mathematical object and its realization in code, our meaning should usually be clear from context. We denote general tensors by capital letters with a special font face (e.g., $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$) and their indexing mechanism (e.g., $x_{ijk}$ and $[\mathcal{X}]_{1, 2i-1, 3}$) follows naturally from that of matrices.

Tensors will become more important when we start working with images. Each image arrives as a $3^{\textrm{rd}}$-order tensor with axes corresponding to the height, width, and channel. At each spatial location, the intensities of each color (red, green, and blue) are stacked along the channel axis. Moreover, a collection of images is represented in code by a $4^{\textrm{th}}$-order tensor, where distinct images are indexed along the first axis. Higher-order tensors are constructed analogously to vectors and matrices, by growing the number of shape components.


```python
import torch

torch.arange(24).reshape(2, 3, 4)
```

Output:

```
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])
```
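
To make the image layout described above concrete, here is a minimal sketch that builds a made-up batch of two tiny RGB images. The sizes and the names `batch`, `image`, and `pixel` are assumptions chosen for illustration, and the axes follow the (height, width, channel) order used in the text. Note that many PyTorch vision layers expect the channel axis before height and width instead.

```python
import torch

# Hypothetical sizes: 2 images, each 4 pixels high, 5 pixels wide, 3 color channels
batch = torch.zeros(2, 4, 5, 3)

image = batch[0]       # indexing the first axis selects a single image
pixel = image[1, 2]    # the 3 channel intensities at row 1, column 2

print(batch.ndim, batch.shape)   # 4 axes: (image, height, width, channel)
print(image.ndim, image.shape)   # 3 axes: (height, width, channel)
print(pixel.shape)               # 1 axis of length 3
```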

## Basic Properties of Tensor Arithmetic

Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For example, elementwise operations produce outputs that have the same shape as their operands.

### Elementwise Addition
Given a matrix $A$:

```python
import torch
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone() # Assign a copy of A to B by allocating new memory
A, A + B
```

Output:

$$
A = \begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix}, \quad A + B = \begin{bmatrix} 0+0 & 1+1 & 2+2 \\ 3+3 & 4+4 & 5+5 \end{bmatrix} = \begin{bmatrix} 0 & 2 & 4 \\ 6 & 8 & 10 \end{bmatrix}
$$

### Hadamard Product (Elementwise Multiplication)
The Hadamard product of two matrices $A, B \in \mathbb{R}^{m \times n}$ is computed as follows:

$$
A \odot B = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \dots & a_{1n} b_{1n} \\
                          a_{21} b_{21} & a_{22} b_{22} & \dots & a_{2n} b_{2n} \\
                          \vdots & \vdots & \ddots & \vdots \\
                          a_{m1} b_{m1} & a_{m2} b_{m2} & \dots & a_{mn} b_{mn} \end{bmatrix}
$$

In Python:

```python
A * B
```

Output:

$$
\begin{bmatrix} 0 \cdot 0 & 1 \cdot 1 & 2 \cdot 2 \\
                  3 \cdot 3 & 4 \cdot 4 & 5 \cdot 5 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 4 \\
                  9 & 16 & 25 \end{bmatrix}
$$

### Scalar-Tensor Operations
Adding a scalar to a tensor, or multiplying a tensor by a scalar, produces a result with the same shape as the original tensor. The operation is applied to each element of the tensor separately.

```python
import torch
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
```

Output:

$$
\begin{bmatrix} 
\begin{bmatrix} 2 & 3 & 4 & 5 \\
6 & 7 & 8 & 9 \\
10 & 11 & 12 & 13 \end{bmatrix},
\begin{bmatrix} 14 & 15 & 16 & 17 \\
18 & 19 & 20 & 21 \\
22 & 23 & 24 & 25 \end{bmatrix}
\end{bmatrix}, \quad \text{shape} = (2,3,4)
$$



## Reduction Operations in Tensor Arithmetic

### Summation
Often, we wish to calculate the sum of a tensor’s elements. To express the sum of the elements of a vector $x$ of length $n$, we write:

$$
\sum_{i=1}^{n} x_i
$$

There’s a simple function for it:

```python
import torch
x = torch.arange(3, dtype=torch.float32)
x, x.sum()
```

Output:

$$
(\text{tensor}([0., 1., 2.]), \text{tensor}(3.))
$$

To express sums over the elements of tensors of arbitrary shape, we simply sum over all of its axes. For example, the sum of the elements of an $m \times n$ matrix $A$ could be written:

$$
\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}
$$

```python
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
A.shape, A.sum()
```

Output:

$$
(\text{torch.Size}([2, 3]), \text{tensor}(15.))
$$

By default, invoking the `sum` function reduces a tensor along all of its axes, eventually producing a scalar. Our libraries also allow us to specify the axes along which the tensor should be reduced.

Specifying `axis=0` sums over all elements along the rows, so the row axis disappears from the output shape and one sum per column remains:

```python
A.shape, A.sum(axis=0).shape
```

Output:

$$
(\text{torch.Size}([2, 3]), \text{torch.Size}([3]))
$$

Specifying `axis=1` sums along the columns instead, so the column axis disappears and one sum per row remains:

```python
A.shape, A.sum(axis=1).shape
```

Output:

$$
(\text{torch.Size}([2, 3]), \text{torch.Size}([2]))
$$

Reducing along both rows and columns:

```python
A.sum(axis=[0, 1]) == A.sum()
```

Output:

$$
\text{tensor}(\text{True})
$$
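
The same rule carries over to higher-order tensors: each axis named in the reduction is dropped from the output shape. The sketch below checks this on a $2 \times 3 \times 4$ tensor; the variable names are illustrative assumptions.

```python
import torch

X = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)

# Summing along one axis removes that axis from the shape
shape_axis0 = X.sum(axis=0).shape   # the axis of length 2 is reduced
shape_axis1 = X.sum(axis=1).shape   # the axis of length 3 is reduced
shape_axis2 = X.sum(axis=2).shape   # the axis of length 4 is reduced

# Summing along two axes at once removes both of them
shape_axis01 = X.sum(axis=[0, 1]).shape

print(shape_axis0, shape_axis1, shape_axis2, shape_axis01)
```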

### Mean Calculation
A related quantity is the mean, also called the average:

$$
\text{mean}(A) = \frac{\sum A}{\text{num elements}}
$$

```python
A.mean(), A.sum() / A.numel()
```

Output:

$$
(\text{tensor}(2.5000), \text{tensor}(2.5000))
$$

We can also reduce along specific axes:

```python
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
```

Output:

$$
(\text{tensor}([1.5000, 2.5000, 3.5000]), \text{tensor}([1.5000, 2.5000, 3.5000]))
$$

## Non-Reduction Sum
Sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean. This is important when using broadcasting.

```python
sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape
```

Output:

$$
(\text{tensor}([[ 3.], [12.]]), \text{torch.Size}([2, 1]))
$$

Since `sum_A` keeps its two axes after summing each row, we can divide `A` by `sum_A` using broadcasting to create a matrix where each row sums up to 1:

```python
A / sum_A
```

Output:

$$
\text{tensor}([[0.0000, 0.3333, 0.6667], [0.2500, 0.3333, 0.4167]])
$$
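
As a quick sanity check, the sketch below confirms that the rows of the normalized matrix sum to 1, and shows what happens if the summed axis is not kept. The names `normalized`, `row_sums`, and `broadcast_failed` are illustrative assumptions.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# With keepdims=True the row sums have shape (2, 1) and broadcast across columns
normalized = A / A.sum(axis=1, keepdims=True)
row_sums = normalized.sum(axis=1)

# Without keepdims the row sums have shape (2,), which cannot be broadcast
# against the trailing axis of length 3 in A
try:
    A / A.sum(axis=1)
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True

print(row_sums, broadcast_failed)
```

Keeping the reduced axis with length 1 is what lets the division line up row by row.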

To calculate the cumulative sum along an axis, say `axis=0` (row by row), we use `cumsum`, which does not reduce the input tensor:

```python
A.cumsum(axis=0)
```

Output:

$$
\text{tensor}([[0., 1., 2.], [3., 5., 7.]])
$$




## 8 Dot Products

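One of the most fundamental operations is the dot product. Given two vectors $x, y \in \mathbb{R}^d$, their dot product $x^T y$ is the sum over the products of the elements at the same position: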

$$
x^T y = \sum_{i=1}^{d} x_i y_i
$$

```python
y = torch.ones(3, dtype=torch.float32)
x, y, torch.dot(x, y)
```

Equivalently, we can calculate the dot product of two vectors by performing an elementwise multiplication followed by a sum:

```python
torch.sum(x * y)
```
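
To tie the summation formula to the library call, the following sketch accumulates the products term by term and compares the result with `torch.dot`. The explicit loop is for illustration only and is not how one would compute dot products in practice.

```python
import torch

x = torch.arange(3, dtype=torch.float32)   # [0., 1., 2.]
y = torch.ones(3, dtype=torch.float32)     # [1., 1., 1.]

# Accumulate x_i * y_i term by term, following the summation formula
manual = 0.0
for x_i, y_i in zip(x.tolist(), y.tolist()):
    manual += x_i * y_i

print(manual, torch.dot(x, y).item())
```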

## 9 Matrix-Vector Products

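Given an $m \times n$ matrix $A$ and an $n$-dimensional vector $x$, the matrix-vector product $Ax$ is a vector of length $m$. Writing $a_i^T$ for the $i$-th row of $A$, the $i$-th element of $Ax$ is the dot product $a_i^T x$:

$$
Ax = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}
$$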

```python
A.shape, x.shape, torch.mv(A, x), A @ x
```
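
The row-by-row view of the matrix-vector product can be sketched directly: compute one dot product per row of the matrix and stack the results. The name `by_rows` is an illustrative assumption.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
x = torch.arange(3, dtype=torch.float32)

# One dot product per row of A, stacked into a vector of length m = 2
by_rows = torch.stack([torch.dot(row, x) for row in A])

print(by_rows, torch.mv(A, x))
```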

## 10 Matrix-Matrix Multiplication

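Matrix-matrix multiplication builds on the same idea. Say that we have two matrices $A \in \mathbb{R}^{n \times k}$ and $B \in \mathbb{R}^{k \times m}$. Let $a_i^T$ denote the $i$-th row of $A$ and $b_j$ the $j$-th column of $B$. Each element $c_{ij}$ of the product $C = AB \in \mathbb{R}^{n \times m}$ is then the dot product $a_i^T b_j$:

$$
C = AB = \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_m \\ a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_m \\ \vdots & \vdots & \ddots & \vdots \\ a_n^T b_1 & a_n^T b_2 & \cdots & a_n^T b_m \end{bmatrix}
$$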

```python
B = torch.ones(3, 4)
torch.mm(A, B), A @ B
```
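
As a hedged illustration of the entrywise definition, the sketch below fills in each element of the product with an explicit double loop and compares the result with the `@` operator. The loop is far slower than the built-in routine and serves only to mirror the formula.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.ones(3, 4)

n, m = A.shape[0], B.shape[1]
C = torch.empty(n, m)
for i in range(n):
    for j in range(m):
        # Entry (i, j) is the dot product of row i of A with column j of B
        C[i, j] = (A[i] * B[:, j]).sum()

print(C)
print(torch.equal(C, A @ B))
```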



## 11 Norms

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big it is. For instance, the $\ell_2$ norm measures the (Euclidean) length of a vector. Here, we are employing a notion of size that concerns the magnitude of a vector’s components (not its dimensionality). A norm is a function $\| \cdot \|$ that maps a vector to a scalar and satisfies the following three properties:

1. Given any vector $x$, if we scale (all elements of) the vector by a scalar $\alpha \in \mathbb{R}$, its norm scales accordingly:
   $$\|\alpha x\| = |\alpha|\|x\|.$$ (2.3.10)

2. For any vectors $x$ and $y$, norms satisfy the triangle inequality:
   $$\|x + y\| \leq \|x\| + \|y\|.$$ (2.3.11)

3. The norm of a vector is nonnegative and it only vanishes if the vector is zero:
   $$\|x\| > 0 \quad \text{for all } x \neq 0.$$ (2.3.12)

Many functions are valid norms and different norms encode different notions of size. The Euclidean norm that we all learned in elementary school geometry when calculating the hypotenuse of a right triangle is the square root of the sum of squares of a vector’s elements. Formally, this is called the $\ell_2$ norm and expressed as:
   $$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.$$ (2.3.13)

The method `norm` calculates the $\ell_2$ norm:
```python
u = torch.tensor([3.0, -4.0])
torch.norm(u)
```
```
tensor(5.)
```
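
With `torch.norm` in hand, the three defining properties of a norm can be spot-checked numerically for the $\ell_2$ norm. The vectors `x` and `y` and the scalar `alpha` below are arbitrary choices made for illustration, and a check on a few values is of course not a proof.

```python
import torch

x = torch.tensor([3.0, -4.0])
y = torch.tensor([1.0, 2.0])   # second vector, chosen arbitrarily
alpha = -2.0                   # arbitrary scalar

# Property 1: scaling a vector by alpha scales its norm by |alpha|
scaling_ok = bool(torch.isclose(torch.norm(alpha * x), abs(alpha) * torch.norm(x)))

# Property 2: the triangle inequality
triangle_ok = bool(torch.norm(x + y) <= torch.norm(x) + torch.norm(y))

# Property 3: positive for a nonzero vector, zero for the zero vector
positive_ok = bool(torch.norm(x) > 0) and bool(torch.norm(torch.zeros(2)) == 0)

print(scaling_ok, triangle_ok, positive_ok)
```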

The $\ell_1$ norm is also popular and the associated metric is called the Manhattan distance. By definition, the $\ell_1$ norm sums the absolute values of a vector’s elements:
   $$\|x\|_1 = \sum_{i=1}^{n} |x_i|.$$ (2.3.14)
Compared to the $\ell_2$ norm, it is less sensitive to outliers. To compute the $\ell_1$ norm, we compose the absolute value with the sum operation:
```python
torch.abs(u).sum()
```
```
tensor(7.)
```

Both the $\ell_2$ and $\ell_1$ norms are special cases of the more general $\ell_p$ norms:
   $$\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$ (2.3.15)
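
The general formula can be sketched in a few lines. The helper `lp_norm` below is a hypothetical function written for this illustration, not part of PyTorch; it is compared against `torch.norm`, which accepts the order through its `p` argument.

```python
import torch

def lp_norm(x, p):
    """l_p norm computed directly from the definition (hypothetical helper)."""
    return (x.abs() ** p).sum() ** (1.0 / p)

u = torch.tensor([3.0, -4.0])
l1, l2, l3 = lp_norm(u, 1), lp_norm(u, 2), lp_norm(u, 3)

# p = 1 and p = 2 recover the l1 and l2 norms computed earlier
print(l1, l2, l3)
print(torch.norm(u, p=1), torch.norm(u, p=2), torch.norm(u, p=3))
```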

In the case of matrices, matters are more complicated. After all, matrices can be viewed both as collections of individual entries and as objects that operate on vectors and transform them into other vectors. For instance, we can ask by how much longer the matrix-vector product $Xv$ could be relative to $v$. This line of thought leads to a norm called the spectral norm. For now, we introduce the Frobenius norm, which is much easier to compute and defined as the square root of the sum of the squares of a matrix’s elements:
   $$\|X\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2}.$$ (2.3.16)

The Frobenius norm behaves as if it were an $\ell_2$ norm of a matrix-shaped vector. Invoking the following function will calculate the Frobenius norm of a matrix:
```python
torch.norm(torch.ones((4, 9)))
```
```
tensor(6.)
```
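
To illustrate the "matrix-shaped vector" view, the sketch below compares the Frobenius norm of a small matrix with the $\ell_2$ norm of its flattened entries. It also computes the spectral norm mentioned above using `torch.linalg.matrix_norm` with `ord=2`. The matrix `X` is an arbitrary example.

```python
import torch

X = torch.arange(6, dtype=torch.float32).reshape(2, 3)

fro = torch.norm(X)                             # Frobenius norm of the matrix
flat_l2 = torch.norm(X.flatten())               # l2 norm of the same six entries
spectral = torch.linalg.matrix_norm(X, ord=2)   # largest singular value of X

print(fro, flat_l2, spectral)
```

For this matrix the spectral norm comes out smaller than the Frobenius norm, which is consistent with the general fact that the spectral norm never exceeds the Frobenius norm.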

While we do not want to get too far ahead of ourselves, we can plant some intuition already about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; maximize the revenue associated with a recommender model; minimize the distance between predictions and the ground-truth observations; minimize the distance between representations of photos of the same person while maximizing the distance between representations of photos of different people. These distances, which constitute the objectives of deep learning algorithms, are often expressed as norms.




## Discussion

In this section, we reviewed all the linear algebra that you will need to understand a remarkable chunk of modern deep learning. There is a lot more to linear algebra and much of it is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order tensors to discover structure in datasets and solve prediction problems.

But this book focuses on deep learning. And we believe you will be more inclined to learn more mathematics once you have gotten your hands dirty applying machine learning to real datasets. So while we reserve the right to introduce more mathematics later on, we wrap up this section here. If you are eager to learn more linear algebra, there are many excellent books and online resources. For a more advanced crash course, consider checking out Kolter (2008), Petersen et al. (2008), and Strang (1993).

### Recap:

- Scalars, vectors, matrices, and tensors are the basic mathematical objects used in linear algebra and have zero, one, two, and an arbitrary number of axes, respectively.
- Tensors can be sliced or reduced along specified axes via indexing, or operations such as sum and mean, respectively.
- Elementwise products are called Hadamard products. By contrast, dot products, matrix-vector products, and matrix-matrix products are not elementwise operations and in general return objects that have different shapes than the operands.
- Compared to Hadamard products, matrix-matrix products take considerably longer to compute (cubic rather than quadratic time).
- Norms capture various notions of the magnitude of a vector and are commonly applied to the difference of two vectors to measure their distance.
- Common vector norms include the $\ell_1$ and $\ell_2$ norms, and common matrix norms include the spectral and Frobenius norms.

$$
\begin{aligned}
\|x\|_1 &= \sum_{i=1}^{n} |x_i| \\
\|x\|_2 &= \sqrt{\sum_{i=1}^{n} x_i^2} \\
\|X\|_F &= \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^2}
\end{aligned}
$$
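
The cost comparison in the recap can be made concrete by counting scalar multiplications. The sketch below uses plain nested lists and the schoolbook algorithm for square $n \times n$ matrices; the helper names `hadamard` and `matmul` are assumptions for illustration, and optimized libraries use far faster routines with the same asymptotic flavor.

```python
def hadamard(A, B):
    """Elementwise product of two n x n nested lists, counting multiplications."""
    n = len(A)
    count = 0
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            C[i][j] = A[i][j] * B[i][j]
            count += 1
    return C, count

def matmul(A, B):
    """Schoolbook matrix product of two n x n nested lists, counting multiplications."""
    n = len(A)
    count = 0
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                count += 1
    return C, count

n = 4
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[1] * n for _ in range(n)]

_, hadamard_mults = hadamard(A, B)
_, matmul_mults = matmul(A, B)
print(hadamard_mults, matmul_mults)   # n**2 versus n**3 multiplications
```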

These fundamental concepts form the foundation for understanding deep learning algorithms and models.

