# DSCI 6001 5.1 Lecture


## The Fundamental Theorem of Linear Algebra and Data Science

Most of what we've learned so far has been delivered to you as relatively atomized topics. To wit, we've covered:

1. Properties of Vectors
2. Properties of Matrices including the four matrix spaces
3. Basic operations on Vectors and Matrices
4. Linearity, Nonlinearity and solvability
5. Special matrices and their properties
6. The standard basis and changes of basis
7. Orthogonality
8. Operations on lines, planes and points
9. Matrix decomposition

As a whole, this is all you really need to know from linear algebra to proceed with your career. Linear algebraists recognized in the last century that the whole of linear algebra could be summed up in a few relatively simple ideas, and as a capstone on your linear algebra education, we will require that you study the **fundamental theorem of linear algebra** as posed by Gilbert Strang.


## The four elements of the Fundamental Theorem:

The fundamental theorem of linear algebra ties together all of the mathematical elements that we've talked about thus far, listed above as nine topics. It describes the structure of any real-valued matrix and how that matrix can be decomposed into its SVD components: orthogonal changes of basis in its subspaces combined with a diagonal scaling. The theorem has four parts:

1. The dimensions of the four subspaces.
2. The orthogonality of the four subspaces.
3. The basis vectors of the subspaces are orthonormal.
4. The matrix with respect to these bases is diagonal.


## Part 1:

The below figure describes how $A$ takes a vector $\vec{x}$ in the row space (a subspace of the domain $\mathbb{R}^{n}$) into the column space (the image of $A$, a subspace of the codomain $\mathbb{R}^{m}$), where it lands on $b$. In data science, this corresponds to taking an array of observations $A$ and a vector of coefficients $\vec{x}$ (one weight per variable, which can be read as variable importance) and mapping them to predictions $b$.

![LA1](./images/LA1_small.png)

The nullspace (kernel) of $A$ is the subspace orthogonal to the row space of $A$. Vectors in the nullspace get mapped by $A$ to the $0$ on the right-hand side of the diagram. The "left nullspace" is a different subspace: it is the nullspace of $A^{T}$, made up of the vectors $\vec{y}$ with $\vec{y}^{T}A = 0$ (the vector multiplies $A$ from the left, hence the name), and it is orthogonal to the column space.
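
The dimension and orthogonality claims can be checked numerically. The following is a minimal NumPy sketch, assuming a small hypothetical rank-2 matrix chosen only for illustration. It borrows `np.linalg.svd` (whose construction is the subject of Part 3) purely as a convenient way to obtain orthonormal bases for all four subspaces; the variable names are illustrative.

```python
import numpy as np

# Hypothetical 3x4 example of rank 2 (row 3 = row 1 + row 2).
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 3.0, 1.0, 2.0]])
m, n = A.shape

U, s, Vt = np.linalg.svd(A)      # full SVD: U is m x m, Vt is n x n
r = int(np.sum(s > 1e-10))       # numerical rank (tolerance is an assumption)

row_space = Vt[:r].T             # n x r basis of the row space
null_space = Vt[r:].T            # n x (n - r) basis of the nullspace
col_space = U[:, :r]             # m x r basis of the column space
left_null = U[:, r:]             # m x (m - r) basis of the left nullspace

# Dimensions: r, n - r, r, m - r
print(r, null_space.shape[1], col_space.shape[1], left_null.shape[1])

# A sends the nullspace to zero; the left nullspace annihilates A from the left.
print(np.allclose(A @ null_space, 0))
print(np.allclose(left_null.T @ A, 0))

# Orthogonality of the paired subspaces.
print(np.allclose(row_space.T @ null_space, 0))
print(np.allclose(col_space.T @ left_null, 0))
```

The row space and nullspace live in $\mathbb{R}^{n}$ and their dimensions add up to $n$; the column space and left nullspace live in $\mathbb{R}^{m}$ and their dimensions add up to $m$.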

## Part 2: Least squares regression

If $b$ is not in the column space, $A\vec{x} = b$ cannot be solved. Therefore, we have to come up with a "solution" for the map. As Strang says, it is far more common than not to have more equations than unknowns (a tall matrix).

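In this case we apply the techniques of regression. We choose a model that expresses $\vec{b}$ as a combination of known functions of the variable $t$, such as a line:

$\vec{b} = C+Dt$

or a parabola:

$\vec{b} = C+Dt+Et^{2}$

Either model remains linear in the unknown coefficients $C$, $D$ and $E$, even though the second is not linear in $t$.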
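
To make the linearity concrete, here is a short NumPy sketch of how each model becomes a matrix $A$ whose columns are functions of $t$ and whose unknown vector holds the coefficients. The sample times `t` and the coefficient values are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical sample times, for illustration only.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Line model C + D t: columns are [1, t], unknowns are (C, D).
A_line = np.column_stack([np.ones_like(t), t])

# Parabola model C + D t + E t^2: columns are [1, t, t^2], unknowns are (C, D, E).
A_parab = np.column_stack([np.ones_like(t), t, t**2])

# Multiplying by a coefficient vector evaluates the model at every t,
# which is why the problem is linear in (C, D, E).
coeffs = np.array([1.0, 2.0, 0.5])   # hypothetical C, D, E
model_values = A_parab @ coeffs

print(A_line.shape, A_parab.shape)
print(np.allclose(model_values, 1.0 + 2.0 * t + 0.5 * t**2))
```

Both matrices are tall (five equations, two or three unknowns), which is the typical regression setting.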

However, $A$ can only map $\vec{x}$ to a point $\vec{p} = A\vec{x}$ that lies in the column space, while $\vec{b}$ does not. The goal is to minimize the difference between $\vec{b}$ and $\vec{p}$, this being the error vector $\vec{e} = \vec{b}-\vec{p}$.

The best choice of coefficients is $\tilde{x}$ such that $\vec{p} = A\tilde{x}$ is the exact projection of $\vec{b}$ onto the column space $im(A)$. The vector $\vec{e}$ is therefore **orthogonal to this projection**. 

$\vec{e} = \vec{b} - A\tilde{x} = \vec{b} - proj_{im(A)}b $

Because $\vec{e}$ is orthogonal to $im(A)$, it is orthogonal to every column of $A$, that is, to every row of $A^{T}$. The product of $A^{T}$ with $\vec{e}$ is therefore the zero vector:

$A^{T}(b-A\tilde{x}) = 0$

$A^{T}b-A^{T}A\tilde{x} = 0$

$A^{T}b = A^{T}A\tilde{x}$

This latter representation is known as the "normal equation," and allows us to find $\tilde{x}$:

$\tilde{x} = (A^{T}A)^{-1}A^{T}b $

$A^{T}A$ is square, symmetric and real, and $(A^{T}A)^{-1}$ exists **given that the columns of $A$ are independent**. In using the normal equations, we make that assumption. Again, $A$ itself does not have to be invertible (or even square); it only needs independent columns. The below figure describes this operation, particularly the splitting of $\vec{b}$ into the reachable component $\vec{p}$ and the unreachable component $\vec{e}$.

![LA2](./images/LA2_small_1.png)
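
The normal equations can be tried out on a small line fit. This is a sketch under stated assumptions: the data `t` and `b` are hypothetical, and `np.linalg.solve` is used on $A^{T}A\tilde{x} = A^{T}b$ rather than forming $(A^{T}A)^{-1}$ explicitly, which is a common numerical choice and not something the derivation requires.

```python
import numpy as np

# Hypothetical measurements b taken at times t, for illustration only.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Line model C + D t: independent columns [1, t].
A = np.column_stack([np.ones_like(t), t])

# Normal equations: A^T A x = A^T b.
x_tilde = np.linalg.solve(A.T @ A, A.T @ b)

p = A @ x_tilde      # projection of b onto the column space
e = b - p            # error vector

# e is orthogonal to every column of A, so A^T e is (numerically) zero.
print(np.allclose(A.T @ e, 0))

# Cross-check against NumPy's least squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_tilde, x_lstsq))
```

In practice `np.linalg.lstsq` is usually preferred over forming $A^{T}A$, since $A^{T}A$ squares the condition number of $A$.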

## Part 3: Construction of Orthogonal Bases for $A$

To complete the theorem, we need to produce orthonormal bases $v_{1}, \cdots, v_r$ (columns of $V$) for the row space and $u_{1}, \cdots, u_r$ (columns of $U$) for the column space.
We can then require that $AV = U\ \Sigma$. $\Sigma$ is a diagonal matrix in this case, such that $Av_i = \sigma_i\ u_i$. We should also require that $\sigma_i > 0$.

By choosing the $v_i$ to be orthonormal eigenvectors of $A^{T}A$, we can write:

$A^{T}Av_i = \sigma_i^{2}\ v_i$

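Because $A^{T}A$ and $AA^{T}$ are symmetric, positive semidefinite (and square), they have nonnegative eigenvalues, and the eigenvectors $v_i$ of $A^{T}A$ can be chosen orthonormal. Since $\|Av_i\|^{2} = v_i^{T}A^{T}Av_i = \sigma_i^{2}\ v_i^{T}v_i = \sigma_i^{2}$, the magnitude of $Av_i$ is always the square root of the matching eigenvalue, which is the singular value: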

$\|Av_i\|=\sigma_i$

Taking the eigenvalue equation $A^{T}Av_i = \sigma_i^{2}\ v_i$ and multiplying by $A$ from the left, we get:

$AA^{T}Av_i = \sigma_i^{2}A\ v_i$

If you set $u_i = \frac{1}{\sigma_i}Av_i$, then you can see that $u_i$ is a unit eigenvector of $AA^{T}$ with eigenvalue $\sigma_i^{2}$.
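
The construction can be followed step by step in NumPy. This is a minimal sketch, assuming a hypothetical $3 \times 2$ matrix with independent columns (so every $\sigma_i > 0$ and the division is safe); `np.linalg.eigh` is used because $A^{T}A$ is symmetric.

```python
import numpy as np

# Hypothetical 3x2 matrix with independent columns.
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

# Orthonormal eigenvectors of the symmetric matrix A^T A.
eigvals, V = np.linalg.eigh(A.T @ A)

# eigh returns ascending eigenvalues; reorder to descending.
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(eigvals)      # singular values sigma_i
U_r = (A @ V) / sigma         # column i is u_i = A v_i / sigma_i

# ||A v_i|| equals sigma_i, not the eigenvalue sigma_i^2.
print(np.allclose(np.linalg.norm(A @ V, axis=0), sigma))

# The u_i are orthonormal ...
print(np.allclose(U_r.T @ U_r, np.eye(2)))

# ... and are eigenvectors of A A^T with eigenvalues sigma_i^2.
print(np.allclose(A @ A.T @ U_r, U_r * eigvals))
```

For this matrix $A^{T}A$ has eigenvalues 6 and 4, so the singular values are $\sqrt{6}$ and $2$.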

When you put the whole construction together, you get the SVD:

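1. $U$, an $m \times m$ orthogonal matrix. The columns of $U$, $u_{1} \cdots u_{r} \cdots u_{m}$, are the basis vectors for the column space (the first $r$) and the left nullspace (the remaining $m-r$). Because these columns are orthogonal to each other, the two subspaces together span $\mathbb{R}^{m}$ and share only the zero vector.
2. $\Sigma$, an $m \times n$ diagonal matrix holding the singular values $\sigma_{1}, \cdots, \sigma_{r}$ (the square roots of the nonzero eigenvalues of $A^{T}A$), with zeros elsewhere.
3. $V$, an $n \times n$ orthogonal matrix. Its columns $v_{1} \cdots v_{r} \cdots v_{n}$ are the basis vectors for the row space (the first $r$) and the right (the regular) nullspace (the remaining $n-r$).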


This entire construction is depicted in the below figure.

![LA3](./images/LA3_small.png)

Therefore the SVD expresses $A$ as a combination of $r$ rank-one matrices (essentially a spectral decomposition):

$A = U\ \Sigma\ V^{T} = u_1\sigma_1\ v_1^{T}+ \cdots +u_r\sigma_r\ v_r^{T}$
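
The rank-one expansion can be verified directly. A minimal sketch, assuming a hypothetical rank-2 matrix and a tolerance of `1e-10` for deciding which singular values count as nonzero:

```python
import numpy as np

# Hypothetical 3x4 matrix of rank 2 (row 3 = row 1 + row 2).
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 3.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))    # number of nonzero singular values

# Sum of r rank-one pieces sigma_i * u_i * v_i^T.
pieces = [s[i] * np.outer(U[:, i], Vt[i]) for i in range(r)]
A_sum = sum(pieces)

print(r)
print(np.allclose(A, A_sum))
print([int(np.linalg.matrix_rank(piece)) for piece in pieces])
```

Only the first $r$ terms are needed; the remaining columns of $U$ and $V$ are multiplied by zero singular values and contribute nothing.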

## Part 4: Construction of the Full Pseudoinverse

The SVD enables us to do something that would otherwise be impossible: we can "invert" singular (and even rectangular) matrices using a special construction that reverses the map between the row space and the column space and sends the left nullspace to zero.

This construction for $A$ takes the column space back to the row space and is called the pseudoinverse, or $A^{+}$. $A^{+} = A^{-1}$ when $A$ is square and $\det(A) \neq 0$.

The least squares solution of minimum length is, by definition, $x^{+} = A^{+}b$. It coincides with the normal-equation solution, $x^{+} = \tilde{x}$, when $A$ has full column rank $r=n$ (see the below figure).

![LA4](./images/LA4_small.png)

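In this case, the error vector $\vec{e}$ described in the second figure lies in the kernel of $A^{+}$ (which is the left nullspace of $A$), so that $A^{+}\vec{e} = 0$. The vector $\vec{b}$ can be thought of as the sum $\vec{b} = \vec{p} + \vec{e}$ of $\vec{p}$ (the projection of $\vec{b}$ onto the column space) and $\vec{e}$ (the error of this fit), and $A^{+}$ acts only on $\vec{p}$.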

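By definition, $A^{+} = V\ \Sigma^{+} U^{T}$, where $\Sigma^{+}$ is the $n \times m$ diagonal matrix holding $1/\sigma_{i}$ for each nonzero singular value and zeros elsewhere.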
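
A short NumPy sketch of this definition, assuming a hypothetical singular $3 \times 3$ matrix and a right-hand side `b` that is not in its column space. The tolerance `1e-10` for treating a singular value as zero is an assumption, passed to `np.linalg.pinv` and `np.linalg.lstsq` as `rcond` so that all three computations use the same cutoff.

```python
import numpy as np

# Hypothetical singular 3x3 matrix of rank 2 (row 3 = row 1 + row 2).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 3.0, 1.0]])
b = np.array([1.0, 2.0, 4.0])     # not in the column space of A
m, n = A.shape

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))        # numerical rank

# Sigma^+ is n x m with 1/sigma_i on the first r diagonal entries.
Sigma_plus = np.zeros((n, m))
Sigma_plus[:r, :r] = np.diag(1.0 / s[:r])

A_plus = Vt.T @ Sigma_plus @ U.T
x_plus = A_plus @ b               # minimum-length least squares solution

# Agrees with NumPy's pseudoinverse and minimum-norm least squares.
print(np.allclose(A_plus, np.linalg.pinv(A, rcond=1e-10)))
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=1e-10)
print(np.allclose(x_plus, x_lstsq))

# x_plus has no component in the nullspace of A.
print(np.allclose(Vt[r:] @ x_plus, 0))
```

Any other least squares solution differs from `x_plus` by a nullspace vector, which can only make it longer; that is why $x^{+}$ is the one of minimum length.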


## Part 5: Summary

Summarizing the effects of these different constructions, for $i = 1, \cdots, r$:

$Av_{i} = \sigma_{i}u_{i}$

$A^{T}u_{i} = \sigma_{i}v_{i}$

$A^{+}u_{i} = \frac{1}{\sigma_{i}}v_{i}$

![LA5](./images/LA5_small.png)
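
The three summary relations can be checked in a few lines. This is a sketch assuming a hypothetical full-column-rank matrix; `np.linalg.svd` and `np.linalg.pinv` stand in for the constructions of Parts 3 and 4.

```python
import numpy as np

# Hypothetical 3x2 matrix with independent columns (rank 2).
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A)
A_plus = np.linalg.pinv(A)
r = int(np.sum(s > 1e-10))

checks = []
for i in range(r):
    u_i, v_i, sigma_i = U[:, i], Vt[i], s[i]
    checks.append(np.allclose(A @ v_i, sigma_i * u_i))        # A v_i = sigma_i u_i
    checks.append(np.allclose(A.T @ u_i, sigma_i * v_i))      # A^T u_i = sigma_i v_i
    checks.append(np.allclose(A_plus @ u_i, v_i / sigma_i))   # A^+ u_i = v_i / sigma_i

print(all(checks))
```

Note that $A^{T}$ and $A^{+}$ both carry $u_i$ back to the direction of $v_i$; they differ only in the scaling, $\sigma_i$ versus $1/\sigma_i$.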