# Linear Algebra

Now that you can store and manipulate data, let's briefly review the subset of basic linear algebra that you'll need to understand most of the models. We'll introduce all the basic concepts, the corresponding mathematical notation, and their realization in code all in one place. If you're already confident in basic linear algebra, feel free to skim or skip this chapter.

In [3]:
import mxnet as mx
import mxnet.ndarray as nd

## Scalars

If you've never studied linear algebra or machine learning, you're probably used to working with single numbers, like $42.0$, and know how to do basic things like add them together or multiply them. In mathematical notation, we'll represent scalars with ordinary lowercase letters ($x$, $y$, $z$). In MXNet, we can work with scalars by creating NDArrays with just one element.

In [4]:
x = nd.array([3.0]) 
y = nd.array([2.0])
print(x + y)
print(x * y)
print(x / y)
print(nd.power(x,y))


[ 5.]
<NDArray 1 @cpu(0)>

[ 6.]
<NDArray 1 @cpu(0)>

[ 1.5]
<NDArray 1 @cpu(0)>

[ 9.]
<NDArray 1 @cpu(0)>


We can convert NDArrays to Python floats by calling their ``asscalar()`` method:

In [5]:
x.asscalar()

3.0

## Vectors 
You can think of a vector as simply a list of numbers ([1.0,3.0,4.0,2.0]). A vector could represent numerical features of some real-world person or object, like the most recently recorded measurements across various vital signs for a patient in the hospital. In math notation, we'll always denote vectors as bold-faced lowercase letters ($\boldsymbol{u}$, $\boldsymbol{v}$, $\boldsymbol{w}$). In MXNet, we work with vectors via 1D NDArrays with an arbitrary number of components.

In [6]:
u = nd.zeros(shape=10)
v = nd.ones(shape=10)
print(u)
print(v)


[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 10 @cpu(0)>

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
<NDArray 10 @cpu(0)>


We can refer to any element of a vector by using a subscript. For example, we can refer to the $4$th element of $\boldsymbol{u}$ by $u_4$. Note that the element $u_4$ is a scalar, so we don't bold-face the font when referring to it.
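
One caveat when moving from notation to code: mathematical subscripts conventionally count from 1, while Python (and NDArray) indices count from 0. The sketch below uses a plain Python list as a stand-in for a vector, with illustrative values that are not taken from the cells above.

```python
# A plain Python list standing in for a vector (illustrative values).
u = [1.0, 3.0, 4.0, 2.0]

# Math notation counts from 1 and Python counts from 0,
# so the mathematical element u_4 lives at index 3.
u_4 = u[3]

print(u_4)     # 2.0
print(len(u))  # 4 components in total
```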

## Matrices

Just as vectors are an extension of scalars from zero dimensions to one, matrices generalize vectors to two dimensions. Matrices, which we'll denote with capital letters ($A$, $B$, $C$), are 2D arrays.

In [7]:
A = nd.random_normal(shape=(5,4))
B = nd.random_normal(shape=(5,4))
print(A)
print(B)


[[ 2.21220636  1.16307867  0.7740038   0.48380461]
 [ 1.04344046  0.29956347  1.18392551  0.15302546]
 [ 1.89171135 -1.16881478 -1.23474145  1.55807114]
 [-1.771029   -0.54594457 -0.45138445 -2.35562968]
 [ 0.57938355  0.54144019 -1.85608196  2.67850661]]
<NDArray 5x4 @cpu(0)>

[[-1.9768796   1.25463438 -0.20801921 -0.54877394]
 [ 0.2444218  -0.68106437 -0.03716067 -0.13531584]
 [-0.48774993  0.37723127 -0.02261727  0.41016445]
 [ 0.57461417  0.5712682   1.4661262  -2.7579627 ]
 [ 0.68629038  1.07628     0.35496104 -0.61413258]]
<NDArray 5x4 @cpu(0)>


Matrices are useful data structures: they allow us to organize data that has different modalities of variation. For example, returning to the example of medical data, rows in our matrix might correspond to different patients, while columns might correspond to different attributes.

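We can access the scalar elements $a_{ij}$ of a matrix $A$ by specifying the indices for the row ($i$) and column ($j$) respectively. Let's grab the element $a_{2,3}$ from the random matrix we initialized above. Keep in mind that NDArray indices start at zero, so ``A[2,3]`` returns the entry in the third row and fourth column.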

In [8]:
A[2,3]


[ 1.55807114]
<NDArray 1 @cpu(0)>

We can also grab the vectors corresponding to entire rows $\boldsymbol{a}_{i,:}$ or columns $\boldsymbol{a}_{:,j}$.

In [9]:
print(A[2,:])
print(A[:,3])


[ 1.89171135 -1.16881478 -1.23474145  1.55807114]
<NDArray 4 @cpu(0)>

[ 0.48380461  0.15302546  1.55807114 -2.35562968  2.67850661]
<NDArray 5 @cpu(0)>
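
To make the indexing concrete without MXNet, here is a minimal sketch that stores a matrix as a list of rows. The values are illustrative. A row is simply one of the inner lists, while a column has to be collected by taking one entry from every row.

```python
# A small 3x4 matrix stored as a list of rows (illustrative values).
A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]

element = A[2][3]            # row index 2, column index 3 (both zero-based)
row = A[2]                   # a whole row, analogous to A[2, :]
column = [r[3] for r in A]   # one entry per row, analogous to A[:, 3]

print(element)  # 12.0
print(row)      # [9.0, 10.0, 11.0, 12.0]
print(column)   # [4.0, 8.0, 12.0]
```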


## Tensors 

Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. Tensors give us a generic way of discussing arrays with an arbitrary number of axes. Vectors, for example, are first-order tensors, and matrices are second-order tensors.

Using tensors will become more important when we start working with images, which arrive as 3D data structures, with axes corresponding to the height, width, and the three (RGB) color channels. But in this chapter, we're going to skip past higher-order tensors and make sure you know the basics.
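
As a rough illustration of "number of axes", the sketch below computes the shape of nested Python lists. The helper `shape` is hypothetical (it is not an MXNet function) and assumes the nesting is regular and non-empty.

```python
def shape(t):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]  # descend one axis; assumes every sub-list has equal length
    return tuple(dims)

# A tiny "image": 2 pixels high, 3 pixels wide, 3 color channels (RGB).
image = [[[0, 0, 0] for _ in range(3)] for _ in range(2)]

print(shape(5.0))               # ()        scalar: no axes
print(shape([1.0, 2.0]))        # (2,)      vector: one axis
print(shape([[1, 2], [3, 4]]))  # (2, 2)    matrix: two axes
print(shape(image))             # (2, 3, 3) third-order tensor
```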

## Element-wise operations

Oftentimes, we want to perform element-wise operations. This means that we perform a scalar operation on the corresponding elements of two vectors. So given any two vectors $\boldsymbol{u}$ and $\boldsymbol{v}$ *of the same shape*, and a scalar function $f$, we can perform the operation to produce vector $\boldsymbol{c} = f(\boldsymbol{u},\boldsymbol{v})$ by setting $c_i \gets f(u_i, v_i)$. In MXNet, calling any of the standard arithmetic operators (+, -, /, \*, \*\*) on NDArrays invokes an element-wise operation.

In [10]:
print(u)
print(v) 
print(u + v)
print(u - v)
print(u * v)
print(u / v)


[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 10 @cpu(0)>

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
<NDArray 10 @cpu(0)>

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
<NDArray 10 @cpu(0)>

[-1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
<NDArray 10 @cpu(0)>

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 10 @cpu(0)>

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 10 @cpu(0)>
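
The rule $c_i \gets f(u_i, v_i)$ can be spelled out in plain Python. This is only a sketch of the idea, not how MXNet implements it; the helper name `elementwise` is made up for illustration.

```python
import operator

def elementwise(f, u, v):
    """Apply the scalar function f to corresponding entries of u and v."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    return [f(a, b) for a, b in zip(u, v)]

u = [0.0] * 10  # same values as the zeros vector
v = [1.0] * 10  # same values as the ones vector

total = elementwise(operator.add, u, v)          # ten entries of 1.0
difference = elementwise(operator.sub, u, v)     # ten entries of -1.0
product = elementwise(operator.mul, u, v)        # ten entries of 0.0
quotient = elementwise(operator.truediv, u, v)   # ten entries of 0.0

print(total)
print(difference)
```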


We can call element-wise operations on any two tensors of the same shape, including matrices.

In [11]:
print(A + B)
print(A[0,0] + B[0,0])


[[ 0.23532677  2.41771317  0.56598461 -0.06496933]
 [ 1.2878623  -0.3815009   1.14676487  0.01770963]
 [ 1.40396142 -0.79158354 -1.25735867  1.96823561]
 [-1.19641483  0.02532363  1.01474178 -5.11359215]
 [ 1.26567388  1.61772013 -1.50112092  2.06437397]]
<NDArray 5x4 @cpu(0)>

[ 0.23532677]
<NDArray 1 @cpu(0)>


## Sums and means 

The next, slightly more sophisticated thing we can do with arbitrary tensors is to calculate the sum of their elements. In mathematical notation, we express sums using the $\sum$ symbol. To express the sum of the elements in a vector $\boldsymbol{u}$ of length $d$, we can write $\sum_{i=1}^d u_i$. In code, we can just call ``nd.sum()``.

In [12]:
print(nd.sum(u))


[ 0.]
<NDArray 1 @cpu(0)>


We can similarly express sums over the elements of tensors of arbitrary shape. For example, the sum of the elements of an $m \times n$ matrix $A$ could be written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j}$.

In [13]:
print(nd.sum(A))


[ 5.17853642]
<NDArray 1 @cpu(0)>


A related quantity to the sum is the *mean*, also commonly called the *average*. We calculate the mean by dividing the sum by the total number of elements. With mathematical notation, we could write the average over a vector $\boldsymbol{u}$ as $\frac{1}{d} \sum_{i=1}^{d} u_i$ and the average over a matrix $A$ as $\frac{1}{n \cdot m} \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j}$. In code, we could just call ``nd.mean()`` on tensors of arbitrary shape:

In [14]:
print(nd.mean(u))
print(nd.mean(A))


[ 0.]
<NDArray 1 @cpu(0)>

[ 0.25892681]
<NDArray 1 @cpu(0)>
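
The double sum and the mean formula translate directly into plain Python. The sketch below uses a small illustrative $2 \times 3$ matrix (not the random matrix from the cells above), so the numbers are easy to check by hand.

```python
# An illustrative 2x3 matrix stored as a list of rows.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

m, n = len(A), len(A[0])               # m rows, n columns

# Double sum: the inner sum runs over a row, the outer sum over all rows.
total = sum(sum(row) for row in A)

# Mean: divide the sum by the total number of elements, n * m.
mean = total / (n * m)

print(total)  # 21.0
print(mean)   # 3.5
```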


## Dot products

<!-- So far, we've only performed element-wise operations, sums and averages. And if this was we could do, linear algebra probably wouldn't deserve it's own chapter. However, -->

One of the most fundamental operations is the dot product. Given two vectors $\boldsymbol{u}$ and $\boldsymbol{v}$, the dot product $\boldsymbol{u}^T \cdot \boldsymbol{v}$ is a sum over the products of the corresponding elements: $\boldsymbol{u}^T \cdot \boldsymbol{v} = \sum_{i=1}^{d} u_i \cdot v_i$.

In [15]:
u = nd.arange(0,5,1.)
v = nd.flip(nd.arange(0,5,1.), 0)
print(u)
print(v)
print(nd.dot(u,v))


[ 0.  1.  2.  3.  4.]
<NDArray 5 @cpu(0)>

[ 4.  3.  2.  1.  0.]
<NDArray 5 @cpu(0)>

[ 10.]
<NDArray 1 @cpu(0)>


Note that we can equivalently compute the dot product ``nd.dot(u, v)`` of two vectors by performing an element-wise multiplication and then a sum:

In [16]:
nd.sum(u * v)


[ 10.]
<NDArray 1 @cpu(0)>

Dot products are useful in a wide range of contexts. For example, given a set of weights $\boldsymbol{w}$, the weighted sum of some values $\boldsymbol{u}$ could be expressed as the dot product $\boldsymbol{u}^T \boldsymbol{w}$. When the weights are non-negative and sum to one ($\sum_{i=1}^{d} {w_i} = 1$), the dot product expresses a *weighted average*. When two vectors each have length one (we'll discuss what *length* means below in the section on norms), their dot product equals the cosine of the angle between them.
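
Both uses can be sketched in a few lines of plain Python. The values, weights, and the helper name `dot` are illustrative assumptions, not part of MXNet.

```python
import math

def dot(u, v):
    """Sum of products of corresponding entries."""
    return sum(a * b for a, b in zip(u, v))

# Weighted average: the weights are non-negative and sum to one.
values = [80.0, 90.0, 70.0]
weights = [0.5, 0.25, 0.25]
weighted_average = dot(values, weights)
print(weighted_average)  # 80.0

# Cosine of an angle: both vectors have length one.
a = [1.0, 0.0]
b = [math.sqrt(0.5), math.sqrt(0.5)]   # unit vector at 45 degrees to a
cosine = dot(a, b)
angle_degrees = math.degrees(math.acos(cosine))
print(angle_degrees)     # approximately 45
```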

## Matrix-vector products

Now that we know how to calculate dot products we can begin to understand matrix-vector products. Let's start off by visualizing a matrix $A$ and a column vector $\boldsymbol{x}$.

$$\mathbf{A}=\begin{pmatrix}
 a_{11} & a_{12} & \cdots & a_{1m} \\
 a_{21} & a_{22} & \cdots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
 a_{n1} & a_{n2} & \cdots & a_{nm} \\
\end{pmatrix},\quad\mathbf{x}=\begin{pmatrix}
 x_{1}  \\
 x_{2} \\
\vdots\\
 x_{m}\\
\end{pmatrix} $$

We can visualize the matrix in terms of its row vectors

$$\mathbf{A}=
\begin{pmatrix}
\cdots & \mathbf{a}^T_{1} & \cdots \\
\cdots & \mathbf{a}^T_{2} & \cdots \\
 & \vdots &  \\
 \cdots &\mathbf{a}^T_n & \cdots \\
\end{pmatrix},$$

where each $\mathbf{a}^T_{i} \in \mathcal{R}^{m}$
is a row vector representing the $i$-th row of the matrix $A$.

Then the matrix-vector product $\mathbf{y} = A\mathbf{x}$ is simply a column vector $\mathbf{y} \in \mathcal{R}^n$ where each entry $y_i$ is the dot product $\mathbf{a}^T_i \cdot \mathbf{x}$.

$$\mathbf{A}\mathbf{x}=
\begin{pmatrix}
\cdots & \mathbf{a}^T_{1} & \cdots \\
\cdots & \mathbf{a}^T_{2} & \cdots \\
 & \vdots &  \\
 \cdots &\mathbf{a}^T_n & \cdots \\
\end{pmatrix}
\begin{pmatrix}
 x_{1}  \\
 x_{2} \\
\vdots\\
 x_{m}\\
\end{pmatrix}
= \begin{pmatrix}
 \mathbf{a}^T_{1} \cdot \mathbf{x}  \\
 \mathbf{a}^T_{2} \cdot \mathbf{x} \\
\vdots\\
 \mathbf{a}^T_{n} \cdot \mathbf{x}\\
\end{pmatrix}
$$

So you can think of multiplication by a matrix $A\in \mathcal{R}^{n \times m}$ as a transformation that maps vectors from $\mathcal{R}^{m}$ to $\mathcal{R}^{n}$.

These transformations turn out to be quite useful. For example, we can represent rotations as multiplications by a square matrix. As we'll see in subsequent chapters, we can also use matrix-vector products to describe the calculations of each layer in a neural network. 
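
As a small illustration of the rotation example, the sketch below builds the standard $2 \times 2$ counterclockwise rotation matrix and applies it to a vector. The helper names `matvec` and `rotation` are made up for this example, and plain Python lists stand in for NDArrays.

```python
import math

def matvec(A, x):
    """Matrix-vector product: one dot product per row of A."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rotation(theta):
    """2x2 matrix rotating the plane counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s],
            [s, c]]

# Rotating the unit vector along the first axis by 90 degrees
# should give (up to floating-point error) the unit vector along the second.
R = rotation(math.pi / 2)
y = matvec(R, [1.0, 0.0])
print(y)  # approximately [0.0, 1.0]
```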

To express matrix-vector products in code with ``ndarray``, we use the same ``nd.dot()`` function as for dot products. When we call ``nd.dot(A, x)`` with a matrix ``A`` and a vector ``x``, ``mxnet`` knows to perform a matrix-vector product. Note that the number of columns of ``A`` must equal the length of ``x``.

In [18]:
A = nd.array([[1,2,3],[4,5,6], [7,8,9]])
x = nd.ones(3)
nd.dot(A,x)


[  6.  15.  24.]
<NDArray 3 @cpu(0)>
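
To see what such a call computes, here is a plain-Python sketch of the row-by-row definition, including the dimension check. The function name `matvec` is hypothetical, and this is not how MXNet implements the product internally.

```python
def matvec(A, x):
    """Return y = A x, where y[i] is the dot product of row i of A with x."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("number of columns of A must equal the length of x")
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# The same matrix and vector as in the cell above.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x = [1, 1, 1]
y = matvec(A, x)
print(y)  # [6, 15, 24]

# A non-square matrix maps a vector with 3 entries to one with 2 entries.
y_small = matvec([[1, 2, 3], [4, 5, 6]], x)
print(y_small)  # [6, 15]
```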

## Matrix-matrix multiplication

If you've gotten the hang of dot products and matrix-vector multiplication, then matrix-matrix multiplications should be pretty straightforward.

Say we have two matrices, $A \in \mathcal{R}^{n \times k}$ and $B \in \mathcal{R}^{k \times m}$:

$$\mathbf{A}=\begin{pmatrix}
 a_{11} & a_{12} & \cdots & a_{1k} \\
 a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
 a_{n1} & a_{n2} & \cdots & a_{nk} \\
\end{pmatrix},\quad
\mathbf{B}=\begin{pmatrix}
 b_{11} & b_{12} & \cdots & b_{1m} \\
 b_{21} & b_{22} & \cdots & b_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
 b_{k1} & b_{k2} & \cdots & b_{km} \\
\end{pmatrix}$$

To produce the matrix product $C = AB$, it's easiest to think of $A$ in terms of its row vectors and $B$ in terms of its column vectors:

$$\mathbf{A}=
\begin{pmatrix}
\cdots & \mathbf{a}^T_{1} & \cdots \\
\cdots & \mathbf{a}^T_{2} & \cdots \\
 & \vdots &  \\
 \cdots &\mathbf{a}^T_n & \cdots \\
\end{pmatrix},
\quad \mathbf{B}=\begin{pmatrix}
\vdots & \vdots &  & \vdots \\
 \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
 \vdots & \vdots &  &\vdots\\
\end{pmatrix}.
$$
Note here that each row vector $\mathbf{a}^T_{i}$ lies in $\mathcal{R}^k$ and that each column vector $\mathbf{b}_j$ also lies in $\mathcal{R}^k$.

Then to produce the matrix product $C \in \mathcal{R}^{n \times m}$ we simply compute each entry $c_{ij}$ as the dot product $\mathbf{a}^T_i \cdot \mathbf{b}_j$.

$$\begin{pmatrix}
\cdots & \mathbf{a}^T_{1} & \cdots \\
\cdots & \mathbf{a}^T_{2} & \cdots \\
 & \vdots &  \\
 \cdots &\mathbf{a}^T_n & \cdots \\
\end{pmatrix} \cdot
\begin{pmatrix}
\vdots & \vdots &  & \vdots \\
 \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
 \vdots & \vdots &  &\vdots\\
\end{pmatrix}
= \begin{pmatrix}
\mathbf{a}^T_{1} \cdot \mathbf{b}_1 & \mathbf{a}^T_{1} \cdot \mathbf{b}_2 & \cdots & \mathbf{a}^T_{1} \cdot \mathbf{b}_m \\
 \mathbf{a}^T_{2} \cdot \mathbf{b}_1 & \mathbf{a}^T_{2} \cdot \mathbf{b}_2 & \cdots & \mathbf{a}^T_{2} \cdot \mathbf{b}_m \\
 \vdots & \vdots &  &\vdots\\
\mathbf{a}^T_{n} \cdot \mathbf{b}_1 & \mathbf{a}^T_{n} \cdot \mathbf{b}_2& \cdots& \mathbf{a}^T_{n} \cdot \mathbf{b}_m 
\end{pmatrix}
$$

You can think of the matrix-matrix multiplication $AB$ as simply performing $m$ matrix-vector products and stitching the results together to form an $n \times m$ matrix.

Just as with ordinary dot products and matrix-vector products, we can compute matrix-matrix products in ``mxnet`` by using ``nd.dot()``.

In [5]:
A = nd.array([[1,2,3],[4,5,6],[7,8,9]])
B = nd.array([[9,8,7],[6,5,4],[3,2,1]])
C = nd.dot(A,B)
print(C)


[[  30.   24.   18.]
 [  84.   69.   54.]
 [ 138.  114.   90.]]
<NDArray 3x3 @cpu(0)>
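
The definition $c_{ij} = \mathbf{a}^T_i \cdot \mathbf{b}_j$ can be written out in plain Python. This is a sketch for understanding only; the function name `matmul` is made up, and a real library would use a far more efficient routine.

```python
def matmul(A, B):
    """Return C = A B, where C[i][j] is row i of A dotted with column j of B."""
    k = len(B)
    if any(len(row) != k for row in A):
        raise ValueError("columns of A must match rows of B")
    columns = list(zip(*B))  # transpose B so each column is easy to iterate
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in A]

# The same matrices as in the cell above.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = matmul(A, B)
print(C)  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]

# Shapes need not be square: (2 x 3) times (3 x 2) gives (2 x 2).
D = matmul([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [1, 1]])
print(D)  # [[4, 5], [10, 11]]
```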


## Norms

Before we can start implementing models, 
there's one last concept we're going to introduce. 
Some of the most useful operators in linear algebra are norms.
Informally, they tell us how big a vector or matrix is. 
We represent norms with the notation $||\cdot||$. 
The $\cdot$ in this expression is just a placeholder. 
For example, we would represent the norm of a vector $\mathbf{x}$ 
or matrix $A$ as $||\mathbf{x}||$ or $||A||$, respectively. 

All norms must satisfy a handful of properties:
1. $||\alpha A|| = |\alpha| ||A||$
2. $||A + B|| \leq ||A|| + ||B||$
3. $||A|| \geq 0$
4. If $\forall {i,j}, a_{ij} = 0$, then $||A||=0$

To put it in words, the first rule says 
that if we scale all the components of a matrix or vector 
by a constant factor $\alpha$, 
its norm also scales by the *absolute value* 
of the same constant factor. 
The second rule is the familiar triangle inequality.
The third rule simply says that the norm must be non-negative. 
That makes sense: in most contexts the smallest *size* for anything is 0.
The final rule basically says that the smallest norm is achieved by a matrix or vector consisting of all zeros.
Strictly speaking, a norm assigns zero *only* to the all-zeros matrix or vector;
a function that satisfies the other rules but also assigns zero to some nonzero matrices is called a seminorm.
That's a mouthful, but if you digest it then you have probably grasped the important concepts here.
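
A quick numerical sanity check of these properties, using the Euclidean ($\ell_2$) norm on small illustrative vectors, might look as follows. It checks specific examples only and is not a proof; the helper `l2_norm` is written by hand here rather than taken from MXNet.

```python
import math

def l2_norm(x):
    """Euclidean norm: square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for v in x))

a = [3.0, -4.0]
b = [1.0, 2.0]
alpha = -2.0

scaled = [alpha * v for v in a]
summed = [p + q for p, q in zip(a, b)]

# 1. Scaling by alpha scales the norm by |alpha|.
print(l2_norm(scaled), abs(alpha) * l2_norm(a))  # 10.0 10.0

# 2. Triangle inequality.
print(l2_norm(summed) <= l2_norm(a) + l2_norm(b))  # True

# 3. Non-negativity.
print(l2_norm(a) >= 0)  # True

# 4. The all-zeros vector has norm zero.
print(l2_norm([0.0, 0.0]))  # 0.0
```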

If you remember Euclidean distances (think Pythagoras' theorem) from grade school, 
then non-negativity and the triangle inequality might ring a bell.
You might notice that norms sound a lot like measures of distance.

In fact the Euclidean distance $\sqrt{x_1^2 + \cdots + x_n^2}$ is a norm. 
Specifically it's the $\ell_2$-norm. 
When applied over the entries of a matrix, e.g. $\sqrt{\sum_{i,j} a_{ij}^2}$, 
the $\ell_2$ norm is also called the Frobenius norm. 
More often, in machine learning we work with the squared $\ell_2$ norm (notated $\ell_2^2$).
We also commonly work with the $\ell_1$ norm.
The $\ell_1$ norm is simply the sum of the absolute values. 
Compared with the $\ell_2$ norm, it places less emphasis on outliers, because large entries are not squared.

To calculate the $\ell_2$ norm, we can just call ``nd.norm()``.  

In [15]:
nd.norm(nd.array([1,-2,3]))


[ 3.7416575]
<NDArray 1 @cpu(0)>

To calculate the $\ell_1$ norm, we can simply take the absolute value of each element and then sum the results.

In [16]:
x = nd.array([1,-2,3])
norm = nd.sum(nd.abs(x))
print(norm)


[ 6.]
<NDArray 1 @cpu(0)>
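
For comparison, here is a plain-Python sketch of the norms discussed in this section, written directly from their definitions. The function names are illustrative and are not MXNet APIs.

```python
import math

def l1_norm(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)

def squared_l2_norm(x):
    """Sum of squares (the l2 norm without the square root)."""
    return sum(v * v for v in x)

def l2_norm(x):
    """Square root of the sum of squares."""
    return math.sqrt(squared_l2_norm(x))

def frobenius_norm(A):
    """l2 norm taken over all entries of a matrix (list of rows)."""
    return math.sqrt(sum(v * v for row in A for v in row))

x = [1.0, -2.0, 3.0]
print(l1_norm(x))          # 6.0
print(squared_l2_norm(x))  # 14.0
print(l2_norm(x))          # about 3.7417, the square root of 14
print(frobenius_norm([[1.0, 2.0], [3.0, 4.0]]))  # square root of 30
```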


## Norms and Objectives

While we don't want to get too far ahead of ourselves, we do want you to anticipate why these concepts are useful.
In machine learning we're often trying to solve optimization problems: *Maximize* the probability assigned to observed data. *Minimize* the distance between predictions and the ground-truth observations. Assign vector representations to items (like words, products, or news articles) such that the distance between similar items is minimized, and the distance between dissimilar items is maximized. Oftentimes, these objectives, perhaps the most important component of a machine learning algorithm (besides the data itself), are expressed as norms.
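
As one hedged illustration of "distance between predictions and ground truth", the sketch below scores two hypothetical sets of predictions with the squared $\ell_2$ norm of their difference from the targets. The numbers and the function name are invented for this example; later chapters define actual loss functions.

```python
def squared_l2_distance(predictions, targets):
    """Squared l2 norm of the difference between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))

targets = [1.0, 2.0, 3.0]            # hypothetical ground-truth observations
close_guess = [1.0, 2.0, 2.5]        # hypothetical predictions near the targets
far_guess = [3.0, 0.0, 3.0]          # hypothetical predictions far from them

loss_close = squared_l2_distance(close_guess, targets)
loss_far = squared_l2_distance(far_guess, targets)

print(loss_close)  # 0.25
print(loss_far)    # 8.0
```

Minimizing such a quantity pushes predictions toward the observations, which is the sense in which an objective can be "expressed as a norm".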


## Conclusions

In just a few pages (or one Jupyter notebook) we've taught you all the linear algebra you'll need to understand a good chunk of neural networks. Of course there's a *lot* more to linear algebra. And a lot of that math *is* useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order tensors to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And we believe you'll be much more inclined to learn more mathematics once you've gotten your hands dirty deploying useful machine learning models on real datasets. So while we reserve the right to introduce more math much later on, we'll wrap up this chapter here.