
# Data manipulation in PyTorch

To get started with deep learning, we will need to develop a few basic skills. All machine learning
is concerned with extracting information from data. So we will begin by learning the practical
skills for storing and manipulating data.

To start, we introduce the
$n$-dimensional array, which is also called the *tensor*. No matter which framework we use,
its *tensor class* (`Tensor` in both PyTorch and TensorFlow) is similar to `numpy`'s `ndarray`, with a few additional features. First, tensors can be placed on a GPU to accelerate computation,
whereas `numpy` only supports CPU computation. Second, the tensor class
supports automatic differentiation.
These properties make the tensor class suitable for deep learning.

First, we import `torch`. Note that although the library is called PyTorch,
the package to import is `torch`, not `pytorch`.


In [None]:
import torch

A tensor represents a (possibly multi-dimensional) array of numerical values.
With one axis, a tensor is called a *vector*.
With two axes, a tensor is called a *matrix*.
With $k > 2$ axes, we drop the specialized names
and just refer to the object as a $k$*th-order tensor*.

PyTorch provides a variety of functions
for creating new tensors
prepopulated with values.
For example, by invoking `arange(n)`,
we can create a vector of evenly spaced values,
starting at $0$ (included)
and ending at `n` (not included).
By default, the interval size is $1$.
Unless otherwise specified,
new tensors are stored in main memory
and designated for CPU-based computation.

In [None]:
x = torch.arange(12, dtype=torch.float32)
x

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.])
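The statement about CPU storage can be checked through a tensor's `device` attribute. The following is a minimal sketch that assumes PyTorch's default device has not been changed; the names `t` and `t_gpu` are used only for illustration.

```python
import torch

# Create a tensor without specifying a device.
t = torch.arange(12, dtype=torch.float32)

# The `device` attribute reports where the tensor's data is stored.
print(t.device)

# Moving the tensor to a GPU only works when one is available.
if torch.cuda.is_available():
    t_gpu = t.to("cuda")
    print(t_gpu.device)
```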

We can access a tensor's *shape* (the length along each axis) by inspecting its `shape` property.

In [None]:
x.shape

torch.Size([12])

If we just want to know the total number of elements in a tensor,
i.e., the product of all of the shape elements,
we can call its `numel()` method.
Because we are dealing with a vector here,
the single element of its `shape` equals its number of elements.

In [None]:
x.numel()

12

To change the shape of a tensor without altering
either the number of elements or their values,
we can invoke the `reshape()` function.
For example, we can transform our tensor, `x`,
from a vector with shape $(12,)$ to a matrix with shape $(3, 4)$.
This new tensor contains the exact same values,
but views them as a matrix organized as $3$ rows and $4$ columns.
To reiterate, although the shape has changed,
the elements have not.
Note that the total number of elements is unaltered by reshaping.

In [None]:
X = x.reshape(3, 4)
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])

Reshaping by manually specifying every dimension is unnecessary.
If our target shape is a matrix with shape $(\text{height, width})$,
then, after we know the $\text{width}$, the $\text{height}$ is given implicitly.
Why should we have to perform the division ourselves?
In the example above, to get a matrix with $3$ rows,
we specified both that it should have $3$ rows and $4$ columns.
Fortunately, `reshape()` can automatically work out one dimension given the rest.
We invoke this capability by placing `-1` for the dimension
that we would like it to infer.
In our case, instead of calling `x.reshape(3, 4)`,
we could have equivalently called `x.reshape(-1, 4)` or `x.reshape(3, -1)`.
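The following sketch illustrates the `-1` placeholder. The variable names (`v`, `rows_inferred`, `cols_inferred`) are chosen here for illustration and do not appear elsewhere in this notebook.

```python
import torch

v = torch.arange(12, dtype=torch.float32)

# Let PyTorch infer the number of rows from the 12 elements and 4 columns.
rows_inferred = v.reshape(-1, 4)
# Let PyTorch infer the number of columns from the 12 elements and 3 rows.
cols_inferred = v.reshape(3, -1)

print(rows_inferred.shape, cols_inferred.shape)
# Both results match the explicitly specified shape.
print(torch.equal(rows_inferred, v.reshape(3, 4)))
print(torch.equal(cols_inferred, v.reshape(3, 4)))
```

At most one dimension can be given as `-1`, since the remaining dimensions must determine it uniquely.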

Typically, we will want our matrices initialized
either with zeros, ones, some other constants,
or numbers randomly sampled from a specific distribution.
We can create a tensor with all elements set to $0$ and a shape of $(2, 3, 4)$ as follows:

In [None]:
torch.zeros((2, 3, 4))

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])

Similarly, we can create tensors with each element set to $1$ as follows:

In [None]:
torch.ones((2, 3, 4))

tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

Often, we want to randomly sample the values
for each element in a tensor from some probability distribution.
For example, when we construct arrays to serve
as parameters in a neural network, we will
typically initialize their values randomly.
The following code creates a tensor with shape $(3, 4)$.
Each of its elements is randomly sampled
from a standard Gaussian (normal) distribution
with a mean of $0$ and a standard deviation of $1$.

In [None]:
torch.randn(3, 4)

tensor([[ 0.5189,  1.1389, -0.9604, -2.4524],
        [ 0.2156,  0.2228,  1.8153,  0.6495],
        [-1.8379, -0.6394,  0.5575, -0.8809]])

We can also specify the exact values for each element in the desired tensor
by supplying a Python list (or list of lists) containing the numerical values.
Here, the outermost list corresponds to axis $0$, and the inner list to axis $1$.

In [None]:
torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

tensor([[2, 1, 4, 3],
        [1, 2, 3, 4],
        [4, 3, 2, 1]])

Our interests are not limited to simply
reading and writing data from/to arrays.
We want to perform mathematical operations on those arrays.
Some of the simplest and most useful operations
are the *element-wise* operations.
These apply a standard scalar operation
to each element of an array.
For functions that take two arrays as inputs,
element-wise operations apply some standard binary operator
on each pair of corresponding elements from the two arrays.
We can create an element-wise function from any function
that maps from a scalar to a scalar.

In mathematical notation, we would denote such
a *unary* scalar operator (taking one input)
by the signature $f: \mathbb{R} \rightarrow \mathbb{R}$.
This just means that the function is mapping
from any real number ($\mathbb{R}$) to another.
Likewise, we denote a *binary* scalar operator
(taking two real inputs, and yielding one output)
by the signature $f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}$.
Given any two vectors $\mathbf{u}$ and $\mathbf{v}$ *of the same shape*,
and a binary operator $f$, we can produce a vector
$\mathbf{c} = F(\mathbf{u},\mathbf{v})$
by setting $c_i \gets f(u_i, v_i)$ for all $i$,
where $c_i, u_i$, and $v_i$ are the $i$th elements
of vectors $\mathbf{c}, \mathbf{u}$, and $\mathbf{v}$.
Here, we produced the vector-valued operator
$F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d$
by *lifting* the scalar function to an element-wise vector operation.
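To make the idea of lifting concrete, here is a minimal sketch for one-dimensional tensors. The helper `lift` and the scalar function `f` are hypothetical names introduced only for this illustration; in practice the built-in tensor operators should be used, since a Python loop like this one is slow.

```python
import torch

def lift(f):
    """Return an element-wise version F of the binary scalar function f."""
    def F(u, v):
        if u.shape != v.shape:
            raise ValueError("u and v must have the same shape")
        # c_i = f(u_i, v_i) for every index i of the 1-D inputs.
        return torch.tensor([f(ui.item(), vi.item()) for ui, vi in zip(u, v)])
    return F

def f(a, b):
    # An arbitrary scalar function chosen for illustration.
    return a * b + 1.0

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([4.0, 5.0, 6.0])

c = lift(f)(u, v)
print(c)
# The built-in operators are already lifted, so they give the same values.
print(torch.allclose(c, u * v + 1.0))
```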

The common standard arithmetic operators
(`+`, `-`, `*`, `/`, and `**`)
have all been *lifted* to element-wise operations
for any identically-shaped tensors of arbitrary shape.
We can call element-wise operations on any two tensors of the same shape.
In the following example, we use commas to formulate a $5$-element tuple,
where each element is the result of an element-wise operation.

In [None]:
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y  # The ** operator is exponentiation

(tensor([ 3.,  4.,  6., 10.]),
 tensor([-1.,  0.,  2.,  6.]),
 tensor([ 2.,  4.,  8., 16.]),
 tensor([0.5000, 1.0000, 2.0000, 4.0000]),
 tensor([ 1.,  4., 16., 64.]))

Many more operations can be applied element-wise, including unary operators like *exponentiation*.

In [None]:
torch.exp(x)

tensor([2.7183e+00, 7.3891e+00, 5.4598e+01, 2.9810e+03])

In addition to element-wise computations,
we can also perform linear algebra operations,
including vector dot products and matrix multiplication.

We can also *concatenate* multiple tensors together, stacking them end-to-end to form a larger tensor.
We just need to provide a list of tensors
and tell the system along which axis to concatenate.
The example below shows what happens when we concatenate
two matrices along rows (axis $0$, the first element of the shape)
vs. columns (axis $1$, the second element of the shape).
We can see that the first output tensor's axis-$0$ length ($6$)
is the sum of the two input tensors' axis-$0$ lengths ($3 + 3$);
while the second output tensor's axis-$1$ length ($8$)
is the sum of the two input tensors' axis-$1$ lengths ($4 + 4$).

In [None]:
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), axis=0), torch.cat((X, Y), axis=1)

(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [ 2.,  1.,  4.,  3.],
         [ 1.,  2.,  3.,  4.],
         [ 4.,  3.,  2.,  1.]]),
 tensor([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
         [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
         [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]]))

Sometimes, we want to construct a boolean tensor via *logical statements*.
Take `X == Y` as an example.
For each position, if `X` and `Y` are equal at that position,
the corresponding entry in the new tensor is `True`,
meaning that the logical statement `X == Y` holds at that position;
otherwise, the entry is `False`.

In [None]:
X == Y

tensor([[False,  True, False,  True],
        [False, False, False, False],
        [False, False, False, False]])

Summing all the elements in the tensor yields a tensor with only one element.

In [None]:
X.sum()

tensor(66.)

We previously saw how to perform element-wise operations
on two tensors of the same shape. Under certain conditions,
even when shapes differ, we can still perform element-wise operations
by using the *broadcasting mechanism*.
This mechanism works in the following way:
first, expand one or both arrays
by copying elements appropriately
so that, after this transformation,
the two tensors have the same shape.
Second, carry out the element-wise operations
on the resulting arrays.

In most cases, we broadcast along an axis where an array
initially only has length $1$, such as in the following example:

In [None]:
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b

(tensor([[0],
         [1],
         [2]]), tensor([[0, 1]]))

Since `a` and `b` are $3\times1$ and $1\times2$ matrices, respectively,
their shapes do not match up if we want to add them.
Broadcasting expands both matrices to a common $3\times2$ shape:
the single column of `a` is replicated across two columns,
and the single row of `b` is replicated across three rows,
before the two are added element-wise.

In [None]:
a + b

tensor([[0, 1],
        [1, 2],
        [2, 3]])
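The two steps of broadcasting can also be written out by hand. The sketch below uses `expand()`, which returns a view with the requested shape rather than physically copying the data; the names `a_expanded`, `b_expanded`, and `explicit_sum` are illustrative.

```python
import torch

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))

# Step 1: expand both operands to the common shape (3, 2).
a_expanded = a.expand(3, 2)   # repeats the single column of `a`
b_expanded = b.expand(3, 2)   # repeats the single row of `b`

# Step 2: perform the ordinary element-wise addition.
explicit_sum = a_expanded + b_expanded

print(a_expanded)
print(b_expanded)
# The explicit version agrees with the broadcast result of `a + b`.
print(torch.equal(explicit_sum, a + b))
```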

As with Python lists, elements in a tensor can be accessed by index.
The first element has index $0$,
and a range includes its first index but excludes its last.
We can also access elements
by their position relative to the end of the tensor
using negative indices.

Thus, `[-1]` selects the last element and `[1:3]`
selects the second and the third elements as follows:

In [None]:
X[-1], X[1:3]

(tensor([ 8.,  9., 10., 11.]), tensor([[ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]))

Beyond reading, we can also write elements of a matrix by specifying indices.

In [None]:
X[1, 2] = 9
X

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  9.,  7.],
        [ 8.,  9., 10., 11.]])

If we want to assign multiple elements the same value,
we simply index all of them and then assign them the value.
For instance, `[0:2, :]` accesses the first and second rows,
where `:` takes all the elements along axis $1$ (column).
While we discussed indexing for matrices,
this obviously also works for vectors
and for tensors of more than $2$ dimensions.

In [None]:
X[0:2, :] = 12
X

tensor([[12., 12., 12., 12.],
        [12., 12., 12., 12.],
        [ 8.,  9., 10., 11.]])

Running operations can cause new memory to be
allocated to host results. For example, if we write `Y = Y + X`,
we will dereference the tensor that `Y` used to point to
and instead point `Y` at the newly allocated memory location.
In the following example, we demonstrate this with Python's `id()` function,
which returns a unique identifier for the referenced object (in CPython, its memory address).
After running `Y = Y + X`, we will find that `id(Y)` has changed.
That is because Python first evaluates `Y + X`,
allocating new memory for the result, and then makes `Y`
point to this new location in memory.

In [None]:
before = id(Y)
Y = Y + X
id(Y) == before

False

This might be undesirable for two reasons.
First, we do not want to allocate memory unnecessarily all the time.
In machine learning, we might have
hundreds of megabytes of parameters
and update all of them multiple times per second.
Typically, we will want to perform these updates *in place*.
Second, we might point at the same parameters from multiple variables.
If we do not update in place, other references will still point to
the old memory location, making it possible for parts of our code
to inadvertently reference old parameters.

Fortunately, performing in-place operations is easy.
We can assign the result of an operation
to a previously allocated array with slice notation,
e.g., `Y[:] = <expression>`.
To illustrate this concept, we first create a new matrix `Z`
with the same shape as another `Y`,
using `zeros_like()` to allocate a block of $0$ entries.

In [None]:
Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))

id(Z): 140490458020848
id(Z): 140490458020848


If the value of `X` is not reused in subsequent computations,
we can also use `X[:] = X + Y` or `X += Y`
to reduce the memory overhead of the operation.

In [None]:
before = id(X)
X += Y
id(X) == before

True
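The second concern raised earlier, stale references, can be illustrated with a small sketch. The names `p`, `p_alias`, `q`, and `q_alias` are hypothetical stand-ins for model parameters that are referenced from two places.

```python
import torch

# Out-of-place update: the name `p` is rebound to a new tensor.
p = torch.ones(3)
p_alias = p                 # second reference to the same tensor
p = p + 1
print(p, p_alias)           # `p_alias` still holds the old values

# In-place update: the existing memory is modified.
q = torch.ones(3)
q_alias = q
q += 1
print(q, q_alias)           # both names see the updated values
```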

Converting to a `numpy` array (`ndarray`), or vice versa, is easy.
For tensors stored on the CPU, the PyTorch `Tensor` and the `numpy` `ndarray`
share their underlying memory, so changing one through an in-place operation
will also change the other.

In [None]:
A = X.numpy()
B = torch.from_numpy(A)
type(A), type(B)

(numpy.ndarray, torch.Tensor)
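The shared memory can be observed directly. This is a minimal sketch for a CPU tensor; the names `t` and `n` are illustrative.

```python
import torch

t = torch.zeros(3)
n = t.numpy()     # `n` shares memory with the CPU tensor `t`

t += 1            # in-place update through the tensor
print(n)          # the array reflects the change

n[0] = 5.0        # in-place update through the array
print(t)          # the tensor reflects the change
```

If an independent copy is needed, cloning the tensor first (`t.clone().numpy()`) avoids the sharing.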

To convert a size-$1$ tensor to a Python scalar,
we can invoke the `item()` method or Python's built-in conversion functions such as `float()` and `int()`.

In [None]:
a = torch.tensor([3.5])
a, a.item(), float(a), int(a)

(tensor([3.5000]), 3.5, 3.5, 3)

# Linear algebra in PyTorch

Now that we know how to store and manipulate data, we will introduce the basic mathematical objects, arithmetic,
and operations in *linear algebra*,
expressing them through mathematical notation
and the corresponding implementation in code.

Formally, we call values consisting
of just one numerical quantity *scalars*.

A scalar is represented by a tensor with just one element. Next, we instantiate two scalars
and perform some familiar arithmetic operations with them,
namely addition, multiplication, division, and exponentiation.

In [None]:
x = torch.tensor(3.0)
y = torch.tensor(2.0)
x + y, x * y, x / y, x**y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

We can think of a *vector* as simply a list of scalar values. We call these values the *elements* (*entries* or *components*) of the vector.

We work with vectors via one-dimensional tensors.
In general, tensors can have arbitrary lengths,
subject to the memory limits of our machine.

In [None]:
x = torch.arange(4)
x

tensor([0, 1, 2, 3])

We can access any element by indexing into the tensor.

In [None]:
x[3]

tensor(3)

The length of a vector is commonly called the *dimension* of the vector.

As with an ordinary Python list,
we can access the length of a tensor by calling Python's built-in `len()` function.

In [None]:
len(x)

4

When a tensor represents a vector (with precisely one axis),
we can also access its length via the `.shape` attribute.
The shape is a tuple that lists the length (dimensionality)
along each axis of the tensor. For tensors with just one axis, the shape has just one element.

In [None]:
x.shape

torch.Size([4])

Note that the word "dimension" tends to get overloaded
in these contexts and this tends to be confusing.
To clarify, we use the dimensionality of a *vector* or an *axis*
to refer to its length, i.e., the number of elements of a vector or an axis.
However, we use the dimensionality of a tensor
to refer to the number of axes that a tensor has.
In this sense, the dimensionality of some axis of a tensor
will be the length of that axis.
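The three uses of "dimension" can be told apart in code. The sketch below uses illustrative names (`vec`, `T`) and the `ndim` attribute, which reports the number of axes of a tensor.

```python
import torch

vec = torch.arange(4)
T = torch.arange(24).reshape(2, 3, 4)

# Dimensionality of a vector: its length.
print(len(vec), vec.shape[0])

# Dimensionality of a tensor: its number of axes.
print(T.ndim, len(T.shape))

# Dimensionality of one axis of a tensor: the length of that axis.
print(T.shape[1])
```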

Just as vectors generalize scalars from order zero to order one,
matrices generalize vectors from order one to order two.

We can create an $m \times n$ matrix by specifying a shape with two components, $m$ and $n$, when calling any of the functions for instantiating a tensor.

In [None]:
A = torch.arange(20).reshape(5, 4)
A

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]])

Sometimes, we want to flip the axes of a matrix.
When we exchange a matrix's rows and columns,
the result is the *transpose* of the matrix.

We can access a matrix's transpose in code by:

In [None]:
A.T

tensor([[ 0,  4,  8, 12, 16],
        [ 1,  5,  9, 13, 17],
        [ 2,  6, 10, 14, 18],
        [ 3,  7, 11, 15, 19]])

As a special type of square matrix, a *symmetric matrix* is equal to its transpose. Here, we define a symmetric matrix `B`.

In [None]:
B = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
B

tensor([[1, 2, 3],
        [2, 0, 4],
        [3, 4, 5]])

Now we compare `B` with its transpose.

In [None]:
B == B.T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

Although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset, it is more
conventional to treat each data example as a row vector in the matrix. For example, along the
outermost axis of a tensor, we can access or enumerate mini-batches of data examples, or just data
examples, if no mini-batch exists.

Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. *Tensors* give us a generic way of describing $n$-dimensional arrays with an arbitrary number of axes. Vectors, for example, are first-order tensors, and matrices are second-order tensors.

Tensors will become more important when we start working with images,
which are represented as arrays with $3$ axes corresponding to the height, width, and a *channel* axis for stacking the color channels (red, green, and blue). For now, we will skip over higher-order tensors and focus on the basics.

In [None]:
X = torch.arange(24).reshape(2, 3, 4)
X

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

Scalars, vectors, matrices, and tensors of an arbitrary number of axes have some nice properties that are often useful.
For example, we might have noticed
from the definition of an element-wise operation
that any element-wise unary operation does not change the shape of its operand.
Similarly, given any two tensors with the same shape,
the result of any binary element-wise operation
will be a tensor of that same shape.
For example, adding two matrices of the same shape
performs element-wise addition over these two matrices.

In [None]:
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
B = A.clone()  # Assign a copy of `A` to `B` by allocating new memory
A, A + B

(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [12., 13., 14., 15.],
         [16., 17., 18., 19.]]), tensor([[ 0.,  2.,  4.,  6.],
         [ 8., 10., 12., 14.],
         [16., 18., 20., 22.],
         [24., 26., 28., 30.],
         [32., 34., 36., 38.]]))

Specifically, element-wise multiplication of two matrices is called their *Hadamard product*, and is denoted by $\odot$.

In [None]:
A * B

tensor([[  0.,   1.,   4.,   9.],
        [ 16.,  25.,  36.,  49.],
        [ 64.,  81., 100., 121.],
        [144., 169., 196., 225.],
        [256., 289., 324., 361.]])

Adding a scalar to a tensor or multiplying a tensor by a scalar also does not change the tensor's shape:
the scalar is added to, or multiplied with, each element of the tensor.

In [None]:
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],
 
         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]), torch.Size([2, 3, 4]))

One useful operation that we can perform with arbitrary tensors
is to calculate the sum of their elements.
We can just call the `sum()` function for calculating the sum.

In [None]:
x = torch.arange(4, dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2., 3.]), tensor(6.))

We can express sums over the elements of tensors of arbitrary shape.

In [None]:
A.shape, A.sum()

(torch.Size([5, 4]), tensor(190.))

By default, invoking the function for calculating the sum
*reduces* a tensor along all its axes to a scalar.
We can also specify the axes along which the tensor is reduced via summation.
Take matrices as an example.
To reduce the row dimension (axis $0$) by summing up elements of all the rows,
we specify `axis=0` when invoking the function.
Since the input matrix reduces along axis $0$ to generate the output vector,
the dimension of axis $0$ of the input is lost in the output shape.

In [None]:
A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape

(tensor([40., 45., 50., 55.]), torch.Size([4]))

Specifying `axis=1` will reduce the column dimension (axis $1$) by summing up elements of all the columns. Thus, the dimension of axis $1$ of the input is lost in the output shape.

In [None]:
A_sum_axis1 = A.sum(axis=1)
A_sum_axis1, A_sum_axis1.shape

(tensor([ 6., 22., 38., 54., 70.]), torch.Size([5]))

Reducing a matrix along both rows and columns via summation
is equivalent to summing up all the elements of the matrix.

In [None]:
A.sum(axis=[0, 1])  # Same as `A.sum()`

tensor(190.)

A related quantity is the *mean*, which is also called the *average*. We calculate the mean by dividing the sum by the total number of elements.
In code, we could just call the function for calculating the mean
on tensors of arbitrary shape.

In [None]:
A.mean(), A.sum() / A.numel()

(tensor(9.5000), tensor(9.5000))

Likewise, the function for calculating the mean can also reduce a tensor along the specified axes.

In [None]:
A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(tensor([ 8.,  9., 10., 11.]), tensor([ 8.,  9., 10., 11.]))

However, sometimes it can be useful to keep the number of axes unchanged when invoking the
function for calculating the sum or mean.

In [None]:
sum_A = A.sum(axis=1, keepdims=True)
sum_A

tensor([[ 6.],
        [22.],
        [38.],
        [54.],
        [70.]])

For instance, since `sum_A` still keeps its two axes after summing each row, we can divide `A` by `sum_A` with broadcasting.

In [None]:
A / sum_A

tensor([[0.0000, 0.1667, 0.3333, 0.5000],
        [0.1818, 0.2273, 0.2727, 0.3182],
        [0.2105, 0.2368, 0.2632, 0.2895],
        [0.2222, 0.2407, 0.2593, 0.2778],
        [0.2286, 0.2429, 0.2571, 0.2714]])

If we want to calculate the cumulative sum of elements of `A` along some axis, say `axis=0` (row by row),
we can call the `cumsum()` function. This function will not reduce the input tensor along any axis.

In [None]:
A.cumsum(axis=0)

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  6.,  8., 10.],
        [12., 15., 18., 21.],
        [24., 28., 32., 36.],
        [40., 45., 50., 55.]])

So far, we have only performed element-wise operations, sums, and averages. However, one of the most fundamental operations is the dot product.
Given two vectors, their *dot product* is a sum over the products of the elements at the same position.

In [None]:
y = torch.arange(1, 5, dtype = torch.float32)
x, y, torch.dot(x, y)

(tensor([0., 1., 2., 3.]), tensor([1., 2., 3., 4.]), tensor(20.))

Note that we can express the dot product of two vectors equivalently by performing an element-wise multiplication and then a sum:

In [None]:
torch.sum(x * y)

tensor(20.)

Dot products are useful in a wide range of contexts.
For example, given some set of values,
denoted by a vector $\mathbf{x}  \in \mathbb{R}^d$,
and a set of weights denoted by $\mathbf{w} \in \mathbb{R}^d$,
the weighted sum of the values in $\mathbf{x}$
according to the weights $\mathbf{w}$
could be expressed as the dot product $\mathbf{x}^\top \mathbf{w}=\mathbf{w}^\top \mathbf{x}$.
When the weights are non-negative
and sum to one (i.e., $\sum_{i=1}^{d} {w_i} = 1$),
the dot product expresses a *weighted average*.
After normalizing two vectors to have unit length,
their dot product expresses the cosine of the angle between them.
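Both uses of the dot product can be sketched in a few lines. The values, weights, and the vectors `p` and `q` below are made up for illustration.

```python
import torch

values = torch.tensor([1.0, 2.0, 3.0, 4.0])
weights = torch.tensor([0.1, 0.2, 0.3, 0.4])   # non-negative and sum to 1

# Weighted average expressed as a dot product.
weighted_avg = torch.dot(values, weights)
print(weighted_avg)

# Cosine of the angle between two vectors:
# normalize each to unit length, then take the dot product.
p = torch.tensor([1.0, 0.0])
q = torch.tensor([1.0, 1.0])
cos_angle = torch.dot(p / torch.norm(p), q / torch.norm(q))
print(cos_angle)   # the angle between p and q is 45 degrees
```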

Now that we know how to calculate dot products,
we can begin to understand *matrix-vector products*.
Let matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$
and the vector $\mathbf{x} \in \mathbb{R}^n$. We start off by visualizing the matrix $\mathbf{A}$ in terms of its row vectors

$$\mathbf{A}=
\begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix},$$

where each $\mathbf{a}^\top_{i} \in \mathbb{R}^n$
is a row vector representing the $i$th row of the matrix $\mathbf{A}$.

The matrix-vector product $\mathbf{A}\mathbf{x}$
is simply a column vector of length $m$,
whose $i$th element is the dot product $\mathbf{a}^\top_i \mathbf{x}$:

$$
\mathbf{A}\mathbf{x}
= \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix}\mathbf{x}
= \begin{bmatrix}
 \mathbf{a}^\top_{1} \mathbf{x}  \\
 \mathbf{a}^\top_{2} \mathbf{x} \\
\vdots\\
 \mathbf{a}^\top_{m} \mathbf{x}\\
\end{bmatrix}.
$$

We can think of multiplication by a matrix $\mathbf{A}\in \mathbb{R}^{m \times n}$
as a transformation that projects vectors
from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$.
These transformations turn out to be very useful.
For example, we can represent rotations
as multiplications by a square matrix.

Expressing matrix-vector products in code with tensors, we use
the `mv()` function. When we call `torch.mv(A, x)` with a matrix
`A` and a vector `x`, the matrix-vector product is performed.
Note that the column dimension of `A` (its length along axis $1$)
must be the same as the dimension of `x` (its length).

In [None]:
A.shape, x.shape, torch.mv(A, x)

(torch.Size([5, 4]), torch.Size([4]), tensor([ 14.,  38.,  62.,  86., 110.]))
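As an illustration of the rotation example mentioned earlier, the following sketch rotates a two-dimensional vector by 90 degrees counterclockwise using `torch.mv`. The names `theta`, `R`, `v`, and `rotated` are illustrative.

```python
import math
import torch

theta = math.pi / 2   # rotation angle: 90 degrees counterclockwise

# Standard 2D rotation matrix for the angle theta.
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])

v = torch.tensor([1.0, 0.0])   # unit vector along the first axis
rotated = torch.mv(R, v)       # approximately [0, 1], up to rounding error
print(rotated)
```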

Now that we understand dot products and matrix-vector products, *matrix-matrix multiplication* should be straightforward.

Assume that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times m}$:

$$\mathbf{A}=\begin{bmatrix}
 a_{11} & a_{12} & \cdots & a_{1k} \\
 a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
 a_{n1} & a_{n2} & \cdots & a_{nk} \\
\end{bmatrix},\quad
\mathbf{B}=\begin{bmatrix}
 b_{11} & b_{12} & \cdots & b_{1m} \\
 b_{21} & b_{22} & \cdots & b_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
 b_{k1} & b_{k2} & \cdots & b_{km} \\
\end{bmatrix}.$$


Denote by $\mathbf{a}^\top_{i} \in \mathbb{R}^k$
the row vector representing the $i$th row of the matrix $\mathbf{A}$,
and let $\mathbf{b}_{j} \in \mathbb{R}^k$
be the column vector from the $j$th column of the matrix $\mathbf{B}$.
To produce the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B}$, it is easiest to think of $\mathbf{A}$ in terms of its row vectors and of $\mathbf{B}$ in terms of its column vectors:

$$\mathbf{A}=
\begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_n \\
\end{bmatrix},
\quad \mathbf{B}=\begin{bmatrix}
 \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
\end{bmatrix}.
$$


Then, the matrix product $\mathbf{C} \in \mathbb{R}^{n \times m}$ is produced by simply computing each element $c_{ij}$ as the dot product $\mathbf{a}^\top_i \mathbf{b}_j$:

$$\mathbf{C} = \mathbf{AB} = \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_n \\
\end{bmatrix}
\begin{bmatrix}
 \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
\end{bmatrix}
= \begin{bmatrix}
\mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\
 \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\
 \vdots & \vdots & \ddots &\vdots\\
\mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m
\end{bmatrix}.
$$


We can think of the matrix-matrix multiplication $\mathbf{AB}$ as simply performing $m$ matrix-vector products and stitching the results together to form an $n \times m$ matrix.
In the following, we perform matrix multiplication on `A` and `B`.
Here, `A` is a matrix with $5$ rows and $4$ columns,
and `B` is a matrix with $4$ rows and $3$ columns.
After multiplication, we obtain a matrix with $5$ rows and $3$ columns.

In [None]:
B = torch.ones(4, 3)
torch.mm(A, B)

tensor([[ 6.,  6.,  6.],
        [22., 22., 22.],
        [38., 38., 38.],
        [54., 54., 54.],
        [70., 70., 70.]])
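The view of matrix multiplication as a collection of matrix-vector products can be checked directly. This sketch recreates `A` and `B` with the same values as above and uses the illustrative names `columns` and `C`.

```python
import torch

A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
B = torch.ones(4, 3)

# One matrix-vector product per column of B ...
columns = [torch.mv(A, B[:, j]) for j in range(B.shape[1])]
# ... stitched together as the columns of the result.
C = torch.stack(columns, dim=1)

print(C.shape)
print(torch.allclose(C, torch.mm(A, B)))
```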

Matrix-matrix multiplication can be simply called *matrix multiplication*, and should not be confused
with the Hadamard product. Matrix multiplication can also be performed using the `@` operator:

In [None]:
A @ B, A @ x

(tensor([[ 6.,  6.,  6.],
         [22., 22., 22.],
         [38., 38., 38.],
         [54., 54., 54.],
         [70., 70., 70.]]), tensor([ 14.,  38.,  62.,  86., 110.]))

Some of the most useful operators in linear algebra are *norms*.
Informally, the norm of a vector tells us how *big* a vector is.
The notion of *size* under consideration here
concerns not dimensionality,
but rather the magnitude of the components.

The familiar Euclidean distance is a norm:
specifically, it is the $\ell_2$ norm.
Suppose that the elements in the $n$-dimensional vector
$\mathbf{x}$ are $x_1, \ldots, x_n$.

The $\ell_2$ *norm* of $\mathbf{x}$ is the square root of the sum of the squares of the vector elements:

$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$$

In code, we can calculate the $\ell_2$ norm of a vector as follows:

In [None]:
u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)

We will also frequently encounter the $\ell_1$ *norm*, which is expressed as the sum of the absolute values of the vector elements:

$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$$


As compared with the $\ell_2$ norm,
it is less influenced by outliers.
To calculate the $\ell_1$ norm, we compose
the absolute value function with a sum over the elements.

In [None]:
torch.abs(u).sum()

tensor(7.)

Analogous to $\ell_2$ norms of vectors, the *Frobenius norm* of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ is the square root of the sum of the squares of the matrix elements:

$$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$$

The Frobenius norm satisfies all the properties of vector norms.
It behaves as if it were an $\ell_2$ norm of a matrix-shaped vector.
Invoking the `norm()` function will calculate the Frobenius norm of a matrix.

In [None]:
torch.norm(torch.ones((4, 9)))

tensor(6.)
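The claim that the Frobenius norm behaves like the $\ell_2$ norm of a matrix-shaped vector can be verified numerically. The matrix `M` below is an arbitrary example chosen for illustration.

```python
import torch

M = torch.arange(6, dtype=torch.float32).reshape(2, 3)

frobenius = torch.norm(M)                  # norm of the matrix
l2_of_flattened = torch.norm(M.flatten())  # l2 norm of the same values as a vector
by_formula = torch.sqrt((M ** 2).sum())    # direct use of the definition

print(frobenius, l2_of_flattened, by_formula)
```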

# Automatic differentiation in PyTorch

Differentiation is a crucial step in nearly all deep learning optimization algorithms.
Although taking these derivatives requires only basic calculus,
working out the updates by hand for complex models
can be tedious and error-prone.

Deep learning frameworks speed up this work
by automatically calculating derivatives, i.e., *automatic differentiation*.
In practice,
based on our designed model,
the system builds a *computational graph*,
tracking which data combined through
which operations to produce the output.
Automatic differentiation enables the system to subsequently backpropagate gradients.
Here, *backpropagate* simply means to trace through the computational graph,
filling in the partial derivatives with respect to each parameter.
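The recorded graph can be inspected from Python. The sketch below is only illustrative and uses hypothetical names (`w_demo`, `s_demo`): a tensor created with `requires_grad=True` is a leaf of the graph, while a tensor produced by an operation carries a `grad_fn` that points back to the operation that created it.

```python
import torch

# Hypothetical leaf tensor; requires_grad=True asks PyTorch to record operations on it.
w_demo = torch.tensor([1.0, 2.0], requires_grad=True)

# Two recorded operations: an elementwise multiply, then a sum.
s_demo = (w_demo * 3).sum()

print(w_demo.is_leaf, w_demo.grad_fn)   # leaf tensors have no grad_fn
print(s_demo.grad_fn)                   # node for the last operation (the sum)
print(s_demo.grad_fn.next_functions)    # links to the preceding node(s) in the graph

# Backpropagation walks these links and fills in w_demo.grad.
s_demo.backward()
print(w_demo.grad)
```

Since `s_demo` equals $3 w_1 + 3 w_2$, each partial derivative is $3$.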

As an example, assume that we are interested
in differentiating the function
$y = 2\mathbf{x}^{\top}\mathbf{x}$
with respect to the column vector $\mathbf{x}$.
To start, we create the variable `x` and assign it an initial value.

In [None]:
x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

Before we even calculate the gradient
of $y$ with respect to $\mathbf{x}$,
we will need a place to store it.
It is important that we do not allocate new memory
every time we take a derivative with respect to a parameter,
because we will often update the same parameters
thousands or millions of times,
and could quickly run out of memory.
Note that a gradient of a scalar-valued function
with respect to a vector $\mathbf{x}$
is itself vector-valued and has the same shape as $\mathbf{x}$.

In [None]:
x.requires_grad_(True)  # Same as `x = torch.arange(4.0, requires_grad=True)`
x.grad  # The default value is None

Now, let us calculate $y$.

In [None]:
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

Since `x` is a vector of length $4$,
a dot product of `x` and `x` is performed,
yielding the scalar output that we assign to `y`.
Next, we can automatically calculate the gradient of `y`
with respect to each component of `x`
by calling the `backward()` function for backpropagation, and then printing the gradient.

In [None]:
y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

The gradient of the function $y = 2\mathbf{x}^{\top}\mathbf{x}$
with respect to $\mathbf{x}$ should be $4\mathbf{x}$.
Let us quickly verify that our desired gradient was calculated correctly.

In [None]:
x.grad == 4 * x

tensor([True, True, True, True])
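An independent way to check a gradient is to approximate it numerically. The sketch below is a rough check rather than a general-purpose tool: it uses central finite differences with an assumed step size `eps`, a hypothetical helper `g`, and a separate double-precision tensor `x_fd` so that rounding error stays small.

```python
import torch

def g(v):
    # Same function as in the text: y = 2 * x^T x.
    return 2 * torch.dot(v, v)

# Separate double-precision copy of the input (hypothetical name).
x_fd = torch.arange(4.0, dtype=torch.float64)
eps = 1e-6  # assumed step size

numeric = torch.zeros_like(x_fd)
for i in range(x_fd.numel()):
    step = torch.zeros_like(x_fd)
    step[i] = eps
    # Central difference approximation of the i-th partial derivative.
    numeric[i] = (g(x_fd + step) - g(x_fd - step)) / (2 * eps)

print(numeric)
print(torch.allclose(numeric, 4 * x_fd, atol=1e-5))
```

The numerical estimate should agree with the analytic gradient $4\mathbf{x}$ up to a small tolerance.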

Now, let us calculate another function of `x`, the sum of its elements.

In [None]:
# PyTorch accumulates gradients by default, so we need to clear the
# previous values before computing a new gradient
x.grad.zero_()
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
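To see why the gradient is cleared first, the following sketch calls `backward()` twice without zeroing in between. The tensor `p_demo` is a hypothetical stand-in, used so that the notebook's `x` is left untouched.

```python
import torch

# Hypothetical tensor, separate from the notebook's x.
p_demo = torch.arange(4.0, requires_grad=True)

p_demo.sum().backward()
first = p_demo.grad.clone()    # gradient of the sum: all ones

p_demo.sum().backward()        # no zeroing: the new gradient is added to the old one
second = p_demo.grad.clone()   # all twos

p_demo.grad.zero_()            # reset in place
p_demo.sum().backward()
third = p_demo.grad.clone()    # all ones again

print(first, second, third)
```

Accumulation is useful when a gradient is summed over several batches, but it gives wrong results when an unrelated gradient is computed without resetting.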

Sometimes, we wish to move some calculations outside of the recorded computational graph.
For example, say that `y` was calculated as a function of `x`,
and that subsequently `z` was calculated as a function of both `y` and `x`.
Now, imagine that we wanted to calculate
the gradient of `z` with respect to `x`,
but wanted for some reason to treat `y` as a constant,
and only take into account the role
that `x` played after `y` was calculated.

Here, we can *detach* `y` to return a new variable `u`
that has the same value as `y`, but discards any information
about how `y` was computed in the computational graph.
In other words, the gradient will not flow backwards through `u` to `x`.
Thus, the following call to `backward()` computes
the partial derivative of `z = u * x` with respect to `x`, while treating `u` as a constant,
instead of the partial derivative of `z = x * x * x` with respect to `x`.

In [None]:
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])
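For contrast, the sketch below computes both gradients side by side on a hypothetical tensor `q_demo`: once through the full product `q * q * q`, and once with the squared factor detached.

```python
import torch

# Hypothetical tensor, separate from the notebook's x.
q_demo = torch.arange(4.0, requires_grad=True)

# Full graph: d/dq of sum(q^3) is 3 * q^2.
(q_demo * q_demo * q_demo).sum().backward()
full_grad = q_demo.grad.clone()

# Detached factor: q^2 is treated as a constant, so the gradient is q^2.
q_demo.grad.zero_()
frozen = (q_demo * q_demo).detach()
(frozen * q_demo).sum().backward()
detached_grad = q_demo.grad.clone()

print(frozen.requires_grad)       # the detached tensor is not tracked
print(full_grad, detached_grad)
```

The two gradients differ by a factor of three, which is exactly the contribution that would have flowed back through the detached factor.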

Since the computation of `y` was recorded,
we can subsequently invoke backpropagation on `y` to get the derivative of `y = x * x` with respect to `x`, which is `2 * x`.

In [None]:
x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

tensor([True, True, True, True])

One benefit of using automatic differentiation
is that, even if building the computational graph of a function
required passing through Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable.
In the following, note that
the number of iterations of the `while` loop
and the evaluation of the `if` statement
both depend on the value of the input `a`.

In [None]:
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Let us compute the gradient.

In [None]:
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

We can now analyze the `f` function defined above.
Note that it is piecewise linear in its input `a`.
In other words, for any `a`, there exists some constant scalar `k`
such that `f(a) = k * a`, where the value of `k` depends on the input `a`.
Consequently, for nonzero `a`, the ratio `d / a` equals the slope `k`, so comparing it with `a.grad` verifies that the gradient is correct.

In [None]:
a.grad == d / a

tensor(True)
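Because `a` is random above, the slope `k` changes from run to run. The sketch below repeats the check for a few fixed, arbitrarily chosen inputs to show how `k` depends on which branch and how many loop iterations were taken. It restates `f` so that it is self-contained; `a_demo`, `d_demo`, and `slopes` are hypothetical names.

```python
import torch

def f(a):
    # Same function as above: the control flow depends on the value of a.
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

slopes = {}
for value in (1.0, 3.0, -1.0):  # arbitrary fixed inputs
    a_demo = torch.tensor(value, requires_grad=True)
    d_demo = f(a_demo)
    d_demo.backward()
    # The gradient equals the slope k of the linear piece containing this input.
    assert a_demo.grad == d_demo / a_demo
    slopes[value] = a_demo.grad.item()

print(slopes)
```

Positive inputs end in the `c = b` branch, so `k` is a power of two, while negative inputs pick up the extra factor of 100.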