In [2]:
import numpy as np
import torch

## Linear Algebra Basics

Linear algebra refers to solving for unknowns within a system of linear equations, where we can have many equations (multiple data points) and many unknowns in an equation (multiple parameters). 

Current Uses: 

- Solving for unknowns in ML/DL algorithms
- Reducing dimensionality of data while preserving information (PCA)
- Eigenvector scoring of webpages
- Recommender systems (SVD) 
- NLP like topic modelling or semantic analysis (SVD, Matrix Factorization)


A system of linear equations can have a different number of solutions: 

- One solution (intersecting graphs)
- No solution (parallel graphs)
- Infinite solutions (overlapping graphs)
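
The three cases can be told apart numerically by comparing the rank of the coefficient matrix with the rank of the augmented matrix. The sketch below is a minimal illustration; the helper name `classify_system` and the example systems are assumptions made for this example, not part of the original notes.

```python
import numpy as np

def classify_system(A, b):
    """Classify A x = b by comparing rank(A) with the rank of the augmented matrix [A | b]."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.hstack([A, b]))
    n_unknowns = A.shape[1]
    if rank_A < rank_aug:
        return "no solution"          # inconsistent equations (parallel graphs)
    if rank_A == n_unknowns:
        return "one solution"         # intersecting graphs
    return "infinite solutions"       # overlapping graphs

print(classify_system([[1, 1], [1, -1]], [2, 0]))  # x + y = 2, x - y = 0
print(classify_system([[1, 1], [1, 1]], [2, 3]))   # x + y = 2, x + y = 3
print(classify_system([[1, 1], [2, 2]], [2, 4]))   # x + y = 2, 2x + 2y = 4
```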

## Common Data Structures 

The most important structure for linear algebra is the tensor, which is an array of numbers. Tensors are the ML generalization of vectors/matrices to any number of dimensions. 

- 0 dim tensor: Scalar which has a magnitude only
- 1 dim tensor: Vector which is an array or a list of numbers
- 2 dim tensor: Matrix which is a flat table of numbers
- 3 dim tensor: Tensor which is a 3D table of numbers

Libraries: PyTorch and TensorFlow are the most popular automatic differentiation libraries. PyTorch tensors are Pythonic and behave like NumPy arrays, but they are better suited for parallel computation on GPUs, which is a commonly cited reason for PyTorch's popularity.

### Scalars

A scalar is a single number with no dimensions, denoted by a lowercase symbol like $x$. Scalars are typically typed.

In PyTorch, it is easy to create scalar tensors directly with `torch.tensor`, while in TensorFlow we need to use a wrapper like `tf.Variable` or `tf.constant`.

In [23]:
# scalars in pytorch
x_pt=torch.tensor(25,dtype=torch.float16)
print(x_pt)
print(type(x_pt))
print(x_pt.shape) #no dimensionality

y_pt=torch.tensor(20,dtype=torch.float16)
print(y_pt)
print(type(y_pt))
print(y_pt.shape) #no dimensionality

# adding tensors
z_pt= x_pt+ y_pt
print(z_pt)
print(type(z_pt))
print(z_pt.shape)

tensor(25., dtype=torch.float16)
<class 'torch.Tensor'>
torch.Size([])
tensor(20., dtype=torch.float16)
<class 'torch.Tensor'>
torch.Size([])
tensor(45., dtype=torch.float16)
<class 'torch.Tensor'>
torch.Size([])


### Vectors

A vector is a one-dimensional array of numbers arranged in order, which can be considered to represent a point in n-dimensional space.

In [24]:
# Vectors in Numpy 
# one dim vector
x= np.array([25,2,5])
print(x)
print(len(x))
print(x.shape)
print(type(x))

# matrix style vector
# each inner bracket is a row and number of elements within the inner bracket is number of columns
y=np.array([[25,2,5]])
print(y)
print(len(y))
print(y.shape)
print(type(y))

# zero vector
z=np.zeros(3)
print(z)

[25  2  5]
3
(3,)
<class 'numpy.ndarray'>
[[25  2  5]]
1
(1, 3)
<class 'numpy.ndarray'>
[0. 0. 0.]


#### Vector Transpose

Transposing a vector swaps the row and column identities of each element, so a row vector becomes a column vector and vice versa. 

In [None]:
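# Vector Transpose
# a 1-D NumPy array has no separate row/column axis, so .T leaves it unchanged
x_t= x.T
print(x_t)
print(x_t.shape) # still (3,)


# the (1, 3) matrix-style row vector becomes a (3, 1) column vector
y_t= y.T
print(y_t)
print(y_t.shape) # (3, 1)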

#### Vector Normalization 

Normalization refers to dividing the elements of a vector by its norm, which represents the length of the vector from the origin. 

- Distance Calculation: The norm of the difference between two vectors expresses the distance between them.
- Unit Vectorization: Dividing a vector by its norm produces a unit vector, i.e. a vector with the same direction and a length of 1. 


The general Lp Norm Formula is

$$ ||x||_p = \left(\sum_i |x_i|^p\right)^{1/p}  $$

There are different types of norm calculations which are shown below

In [None]:
# Vector Norms
# L1 norm/ Absolute norm/ Taxicab Norm/ Manhattan Norm
# It varies linearly at all locations in space, i.e. it's useful when the difference between zero and non-zero is key
x= np.array([25,2,5])
l1= np.sum(np.abs(x)) # |25| + |2| + |5|
print(f"L1 norm is {l1}")

# L2 norm/ Root square norm/ Euclidean norm (Most Popular)
# It calculates the euclidean distance of the vector from the origin
x= np.array([25,2,5])
l2= (25**2 + 2**2 + 5**2)**0.5
print(f"L2 norm is {l2}")
l2= np.linalg.norm(x)
print(f"L2 norm is {l2}")

# Squared L2 norm
# It is equivalent to the dot product between the transpose of x and x itself, i.e. xT.x
# It is computationally cheaper since it doesn't involve a root calculation
# It is easily differentiable since the derivative with respect to an element depends on that element only and not on a root over all elements
x= np.array([25,2,5])
sl2= (25**2 + 2**2 + 5**2)
print(f"Squared L2 norm is {sl2}")
sl2= np.dot(x.T,x)
print(f"Squared L2 norm is {sl2}")

# Max norm/ L-infinity norm
# It takes the maximum of the absolute values of the individual elements
x= np.array([25,2,5])
lmax=np.max(np.abs(x))
print(f"LMax norm is {lmax}")
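
The general Lp formula and the normalization step described earlier can be sketched in a few lines. The helper name `lp_norm` is an assumption made for this illustration; NumPy's own `np.linalg.norm` is used only as a cross-check.

```python
import numpy as np

def lp_norm(x, p):
    """General Lp norm: (sum |x_i|^p)^(1/p), assuming p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([25, 2, 5])
print(lp_norm(x, 1))   # same value as the L1 norm
print(lp_norm(x, 2))   # same value as np.linalg.norm(x)

# Normalization: dividing by the L2 norm gives a unit vector
x_unit = x / np.linalg.norm(x)
print(np.linalg.norm(x_unit))  # approximately 1
```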

#### Vector Regularization

Regularization or cost function regularization refers to the process of adding a norm-based penalty term to the cost function to control the values of model parameters/features during model training. Depending on the type of norm used in the penalty, there are different types of regularization.  


**L1/ Lasso Regression**:  When $||\beta||_1$ is added as the penalty term, it is called lasso regression.  Since lasso uses the absolute values of the coefficients, it has the ability to shrink some coefficients to exactly zero, effectively performing feature selection. This makes lasso regression useful when you want a sparse model that selects only the most important features.

**L2/ Ridge Regression**: When the squared L2 norm $||\beta||_2^2$ is added as the penalty, it is called ridge regression. Since the penalty is based on the squared values of the coefficients, ridge tends to shrink the coefficients but does not drive any of them exactly to zero. As a result, all features generally remain in the model, but their effect is reduced.
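
To make the penalty terms concrete, here is a minimal sketch that adds either penalty to a mean squared error cost. The function name `penalized_loss`, the penalty weight `lam`, and the toy data are all assumptions for illustration; real libraries may scale the terms differently.

```python
import numpy as np

def penalized_loss(X, y, beta, lam, kind):
    """Mean squared error plus a norm-based penalty on the coefficients beta."""
    mse = np.mean((X @ beta - y) ** 2)
    if kind == "lasso":
        penalty = np.sum(np.abs(beta))   # L1 norm of beta
    elif kind == "ridge":
        penalty = np.sum(beta ** 2)      # squared L2 norm of beta
    else:
        raise ValueError("kind must be 'lasso' or 'ridge'")
    return mse + lam * penalty

# Hypothetical toy data
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.25])

print(penalized_loss(X, y, beta, lam=0.1, kind="lasso"))
print(penalized_loss(X, y, beta, lam=0.1, kind="ridge"))
```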

#### Orthogonal Vectors

Any two vectors can be considered orthogonal if they are at 90 degrees to each other. In terms of vector operations, their dot product is zero, i.e. $x^Ty=0$. 

- For any n-dimensional vector space, there are at most n mutually orthogonal vectors (assuming non-zero norms).
- Orthonormal vectors are orthogonal and all have a unit L2 norm.
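
A quick numerical check of these definitions, using two example vectors chosen for this illustration:

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, -1.0])

# Orthogonal: the dot product is zero
print(np.dot(x, y))

# Orthonormal: orthogonal and each rescaled to unit L2 norm
x_hat = x / np.linalg.norm(x)
y_hat = y / np.linalg.norm(y)
print(np.isclose(np.dot(x_hat, y_hat), 0.0))
print(np.isclose(np.linalg.norm(x_hat), 1.0), np.isclose(np.linalg.norm(y_hat), 1.0))
```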

#### Basis Vectors

The basis vectors of a vector space are a set of vectors such that any other vector in that space can be uniquely represented as a linear combination of them, i.e. by scaling and adding them.

Some features of basis vectors: 

- All basis vectors must be linearly independent
- The combination of basis vectors must span the whole vector space
  
Typically, the basis vectors are the n orthonormal vectors along the n axes of an n-dimensional space (the standard basis), though other bases are possible too. 
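
As a sketch of the "unique linear combination" idea, the coordinates of a vector in a non-standard basis can be found by solving a small linear system. The basis vectors and the vector `v` below are hypothetical examples.

```python
import numpy as np

# Columns of B are the assumed basis vectors b1 = (1, 1) and b2 = (1, -1)
B = np.array([[1.0, 1.0],
              [1.0, -1.0]])
v = np.array([3.0, 1.0])

# Solve B @ c = v for the coordinates c of v in this basis
c = np.linalg.solve(B, v)
print(c)

# Rebuild v by scaling and adding the basis vectors
print(np.allclose(c[0] * B[:, 0] + c[1] * B[:, 1], v))
```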

### Matrices

A matrix is a two-dimensional array of numbers, denoted by an uppercase symbol like $X$.

- The notation of matrix shape is in the form (rows, columns)
- The nth row can be accessed with $X_{n,:}$
- The nth column can be accessed with $X_{:,n}$

In [7]:
X= np.array ([[25,2],[5,26],[3,7]])
print(f"The size of X is {X.size}")
print(f"The shape of X is {X.shape}")
print(f"The 1st row of X is {X[0,:]}")
print(f"The 1st column of X is {X[:,0]}")

The size of X is 6
The shape of X is (3, 2)
The 1st row of X is [25  2]
The 1st column of X is [25  5  3]


#### Matrix Spaces

There are two important aspects of a matrix that determine certain important properties of the matrix. They are: 

1. Row Space
2. Column Space

##### Row Space

The row space of a matrix is the span of its row vectors, i.e. all possible linear combinations of the row vectors. 

- It gives an insight into the relations between equations in a system.

##### Column Space

The column space of a matrix indicates the span of its column vectors i.e all possible linear combinations of the column vectors. 

- It represents all the vectors that can be reached by the linear transformation defined by the matrix. For a matrix A, when we multiply it with a vector x, the result $Ax$ lies in the column space of A.
- Because of this, the column space is also called the range or image of the matrix, since it shows where the matrix can send any input from its domain.
- For $Ax=B$ to have a solution, B must lie in the column space of A; otherwise the system has no solutions.

Both row space and column space are helpful in understanding different aspects of a matrix. 
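
One way to test whether a vector lies in the column space is to project it onto the columns with least squares and see whether the projection reproduces it. This is a sketch; the helper name `in_column_space` and the example matrix are assumptions for illustration.

```python
import numpy as np

def in_column_space(A, b):
    """Return True if b is (numerically) a linear combination of the columns of A."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)  # best-fitting combination of columns
    return bool(np.allclose(A @ x, b))

# Both columns lie along the direction (1, 2, 3), so the column space is a line
A = np.array([[1, 2],
              [2, 4],
              [3, 6]])
print(in_column_space(A, [2, 4, 6]))  # a multiple of (1, 2, 3)
print(in_column_space(A, [1, 0, 0]))  # off that line, so Ax = b has no solution
```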

#### Special Matrices

There are some special matrices which have fixed properties: 

1. Symmetric Matrix
2. Identity Matrix
3. Inverse Matrix
4. Diagonal Matrix
5. Orthogonal Matrix

##### Symmetric Matrix

A symmetric matrix is a square matrix (same number of rows and columns) that equals its own transpose, i.e. $X^T=X$.

##### Identity Matrix

An identity matrix is a symmetric matrix where: 

1. Every element along the main diagonal is 1 and all other elements are zero.
2. The identity matrix is the same for all square matrices of the same dimension, i.e. all $4 \times 4$ matrices share the identity matrix $I_4$.
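
A short check of these properties with NumPy's `np.eye`, using an arbitrary example matrix:

```python
import numpy as np

I3 = np.eye(3)  # 3 x 3 identity matrix
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# Multiplying by the identity leaves a matrix unchanged
print(np.array_equal(A @ I3, A), np.array_equal(I3 @ A, A))

# The identity matrix is symmetric
print(np.array_equal(I3, I3.T))
```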


##### Inverse Matrix

The inverse of a matrix is the matrix whose matrix product with the original matrix gives the identity matrix. If A is a matrix and $A^{-1}$ is its inverse matrix then

$$ AA^{-1}= A^{-1}A= I $$

The conditions for a matrix inverse to exist are: 

- The matrix is non-singular, i.e. it has linearly independent rows, so the equations have intersecting graphs instead of overlapping or parallel ones.
- The matrix is square, i.e. the number of equations equals the number of unknowns, which avoids overdetermination ($n_{equations}>n_{dimensions}$) or underdetermination ($n_{equations}<n_{dimensions}$)

The inverse of a matrix is important because it helps in determining solutions of matrix equations of the form $AX=B$, where X is given by $A^{-1}B$ if the inverse exists. 
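
The sketch below computes an inverse with `np.linalg.inv` and shows what happens for a singular matrix. Both matrices are examples chosen for illustration.

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [-5.0, -3.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # A times its inverse gives the identity

# A singular matrix: the second row is twice the first
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
singular_detected = False
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError as err:
    singular_detected = True
    print("not invertible:", err)
```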

##### Diagonal Matrix

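A diagonal matrix has zeros everywhere except on the main diagonal (the diagonal entries themselves can take any value). Diagonal matrices are computationally efficient: multiplication reduces to pointwise scaling, and the inverse, which exists when every diagonal entry is non-zero, is obtained by taking the reciprocal of each diagonal entry. 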
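
A small sketch of those pointwise shortcuts, with example values chosen for illustration:

```python
import numpy as np

d = np.array([2.0, 4.0, 5.0])  # diagonal entries
D = np.diag(d)
x = np.array([1.0, 1.0, 2.0])

# Multiplying by D is the same as scaling each element of x
print(np.allclose(D @ x, d * x))

# The inverse of D is the diagonal matrix of reciprocals
print(np.allclose(np.linalg.inv(D), np.diag(1.0 / d)))
```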

##### Orthogonal Matrix

An orthogonal matrix is a square matrix whose rows and columns are orthonormal vectors. Thus, for an orthogonal matrix A, 

$$A^TA= AA^T= I$$
$$A^T= A^{-1}I= A^{-1}$$

This makes orthogonal matrices computationally efficient as calculating $A^T$ is cheap and for such matrices calculating $A^{-1}$ also becomes cheap.
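
A 2-D rotation matrix is a standard example of an orthogonal matrix. The check below uses an arbitrarily chosen angle:

```python
import numpy as np

theta = np.pi / 4  # example angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))     # Q^T Q = I
print(np.allclose(Q @ Q.T, np.eye(2)))     # Q Q^T = I
print(np.allclose(Q.T, np.linalg.inv(Q)))  # the transpose is the inverse
```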

#### Eigenconcept

The eigenvectors of a transformation A are the non-zero vectors that the transformation only scales, without moving them off the line they span, i.e. $Av=\lambda v$. When talking of eigenconcepts, there are two important parts: 

- Eigenvectors are those vectors whose direction remains unchanged (or is exactly reversed) by the transformation
- Eigenvalues are the factors $\lambda$ by which the eigenvectors change length under the transformation.
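
The relationship can be verified numerically with `np.linalg.eig`. The matrix below is an example chosen for illustration; the order in which eigenvalues are returned is not guaranteed.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)  # eigenvectors are the columns

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    # A only rescales v by the matching eigenvalue
    print(eigenvalues[i], np.allclose(A @ v, eigenvalues[i] * v))
```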

#### Features of Matrices

There are some features of matrices that give us important information about a matrix that is useful for further operations: 

1. Matrix Rank
2. Matrix Norm
3. Matrix Kernel
4. Matrix Determinant
5. Trace Operator

##### Matrix Rank

The rank of a matrix is the number of linearly independent rows (equivalently, linearly independent columns) in the matrix and indicates the amount of information that the matrix contains. 
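
For instance, with an example matrix whose second row is a multiple of the first, `np.linalg.matrix_rank` reports two independent rows:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],   # 2 x the first row, adds no new information
              [1, 0, 1]])
rank = np.linalg.matrix_rank(A)
print(rank)

# Row rank and column rank agree
print(rank == np.linalg.matrix_rank(A.T))
```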

##### Matrix Norm 

The most commonly used matrix norm is the Frobenius norm: 

$$ ||X||_F= \sqrt{\sum_{i,j}X_{i,j}^2} $$

This norm is analogous to the L2 vector norm and measures the size of matrix in terms of Euclidean distance. 

In [16]:
X= torch.tensor([[1.,2.],
                [2.,3.]])
print(torch.norm(X))

tensor(4.2426)


##### Matrix Kernel

The kernel or null space of a matrix consists of all vectors x such that $Ax=0$ i.e when a vector undergoes the transformation defined by the matrix, the null space is the set of all points that are transformed to the origin. 

The null space indicates whether the homogeneous equation $Ax=0$ has non-trivial solutions and is closely related to rank by the rank-nullity theorem: the rank plus the dimension of the null space equals the number of columns. 
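
One common way to compute a basis for the null space is through the singular value decomposition. The sketch below assumes a small tolerance `tol` for deciding which singular values count as zero; the helper name `null_space_basis` is made up for this example.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Columns of the result span the null space of A (computed via SVD)."""
    A = np.asarray(A, dtype=float)
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vt[rank:].T  # right singular vectors beyond the rank

A = np.array([[1, 2, 3],
              [2, 4, 6]])
N = null_space_basis(A)
print(N.shape)                 # (number of columns, nullity)
print(np.allclose(A @ N, 0))   # every basis vector is sent to the origin

# Rank-nullity: rank + nullity = number of columns
print(np.linalg.matrix_rank(A) + N.shape[1] == A.shape[1])
```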

##### Matrix Determinant

The determinant of a square matrix is a scalar value that helps determine whether the matrix is invertible.

- Zero Determinant: It means the matrix is singular, i.e. not invertible
- Non-zero Determinant: It means the matrix is invertible and that its rows/columns are linearly independent

Determinants are also useful in solving linear equations with Cramer's rule and in understanding geometric transformations, since the absolute value of the determinant is the factor by which the transformation scales areas or volumes. 
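
A brief check with `np.linalg.det`, using one invertible and one singular example matrix:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [-5.0, -3.0]])
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])  # rows are linearly dependent

det_A = np.linalg.det(A)  # 4*(-3) - 2*(-5) = -2, so A is invertible
det_S = np.linalg.det(S)  # 1*4 - 2*2 = 0, so S is singular
print(det_A, det_S)
```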

##### Trace Operator

The trace of a matrix is the sum of all its diagonal elements:

$$\mathrm{Tr}(A)= \sum_i A_{i,i}$$

Thus the trace of a matrix and its transpose are the same. It is also useful in calculating the Frobenius norm of a matrix:

$$||A||_F= \sqrt{\mathrm{Tr}(AA^T)}$$
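
Both properties can be checked with `np.trace`. The matrix reuses the values from the Frobenius norm example:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 3.0]])

# The trace of a matrix equals the trace of its transpose
print(np.trace(A), np.trace(A.T))

# Frobenius norm from the trace, compared with NumPy's built-in
frob_from_trace = np.sqrt(np.trace(A @ A.T))
print(frob_from_trace, np.linalg.norm(A, "fro"))
```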

### Tensors

Tensors are higher dimensional arrays which are commonly used to represent real world data. 

Example: Images in a training set for a model are 4-dimensional tensors:

- Dim 1: Number of images in a training batch eg 32
- Dim 2: Image height in pixels eg 28
- Dim 3: Image width in pixels eg 28
- Dim 4: Number of color channels eg 3

In [9]:
images= torch.zeros([32,28,28,3])
print(images.shape) # printing the shape avoids dumping all 75,264 zeros

torch.Size([32, 28, 28, 3])

## Common Tensor Operations

### Scalar Operations

Both scalar addition and scalar multiplication apply to all elements in a tensor while retaining the tensor shape.

In [6]:
X= np.array([[2,4,6],
             [8,10,12]])
print(X)
print(X+2) #torch.add
print(X*2) #torch.mul

[[ 2  4  6]
 [ 8 10 12]]
[[ 4  6  8]
 [10 12 14]]
[[ 4  8 12]
 [16 20 24]]


### Elementwise Operations

For two tensors with the same shape, we can add them together (elementwise addition) or multiply corresponding elements (elementwise product/Hadamard product).

In [7]:
X= np.array([[1,2,3],
             [4,5,6]])
Y= np.array([[-1,-2,-3],
             [-4,-5,-6]])
print(X+Y)
print(X*Y)

[[0 0 0]
 [0 0 0]]
[[ -1  -4  -9]
 [-16 -25 -36]]


### Reduction Operations

For any given tensor, we can reduce it by summing across all elements or multiplying across all elements. The sum reduction is most common and can be represented as follows for a 2-D matrix X with $m \times n$ dimensions: 

$$ \sum_{i=1}^m \sum_{j=1}^n X_{i,j} $$

In [9]:
X= np.array([[1,2,3],
             [4,5,6]])
print(X.sum())
print(X.sum(axis=0)) #sum down the rows (collapses axis 0): one total per column
print(X.sum(axis=1)) #sum across the columns (collapses axis 1): one total per row

21
[5 7 9]
[ 6 15]


### Dot Product

The dot product of two vectors with the same length is represented by $x \cdot y$ or $x^Ty$, which is calculated as follows: 

$$ x \cdot y = \sum_{i=1}^n x_i y_i$$

In [10]:
x=np.array([1,2,3])
y=np.array([1,1,1])
print(np.dot(x,y)) #also torch.dot if elements are float

6


### Matrix Multiplication

It is one of the most important types of tensor operations and involves dot products between rows of one matrix and columns of the other. It can be defined as

$$ C _{i,k}= \sum_j A_{i,j} B_{j,k} $$

If we imagine the matrix A to define a transformation, then the matrix multiplication indicates where A takes each column of the matrix B in the new transformed space. 

In [14]:
A = torch.tensor ([[1,2,3],
                  [4,5,6]])
B= torch.tensor ([[1,2],
                  [1,2],
                  [1,2]])
C= torch.matmul(A,B)  # in numpy we can still use np.dot for matrix multiplication
print(C)

tensor([[ 6, 12],
        [15, 30]])


## Solving Linear Equations

There are two main ways of solving linear systems of equations by hand

1. Substitution
2. Elimination

### Substitution

The method involves the following steps: 

1. Isolate a variable, i.e. rearrange one equation so that one variable (with a coefficient of 1) stands alone on one side.
2. Substitute this expression into the other equation and solve for the other variable.
3. Substitute this value back into the first equation to get the value of the first variable. 

<img src="images/linalg_sub.png" 
        alt="Picture" 
        width="400" 
        height="400" 
        style="display: block; margin: 0 auto" />

Image Source: [Expii blogpost](https://www.expii.com/t/solving-linear-systems-with-substitution-definition-examples-4412) 
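
As a small worked illustration of these steps (the system below is a made-up example, not necessarily the one in the figure), take $x + 2y = 5$ and $3x - y = 1$:

$$
\begin{aligned}
x &= 5 - 2y && \text{(isolate } x \text{ in the first equation)} \\
3(5 - 2y) - y &= 1 \;\Rightarrow\; 15 - 7y = 1 \;\Rightarrow\; y = 2 && \text{(substitute into the second equation)} \\
x &= 5 - 2(2) = 1 && \text{(substitute back into the first equation)}
\end{aligned}
$$

Checking against both equations: $1 + 2(2) = 5$ and $3(1) - 2 = 1$.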

### Elimination

The method involves the following steps: 

1. Add equations together (with or without scalar multiplication) to eliminate one of the variables.
2. Solve for the remaining variable
3. Solve for the other variable by substituting the previous value in one of the equations.

   
<img src="images/linalg_elim.png" 
        alt="Picture" 
        width="400" 
        height="400" 
        style="display: block; margin: 0 auto" />

Image Source: [Expii blogpost](https://www.expii.com/t/elimination-by-addition-and-subtraction-examples-practice-4416) 
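
The same idea generalizes to larger systems as Gaussian elimination followed by back substitution. The sketch below is a minimal teaching version that assumes a square, non-singular system; the function name `solve_by_elimination` is made up for this example, and in practice `np.linalg.solve` would be the usual choice.

```python
import numpy as np

def solve_by_elimination(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting and back substitution."""
    A = np.array(A, dtype=float)  # work on copies so the inputs are not modified
    b = np.array(b, dtype=float)
    n = len(b)
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry in this column
        pivot = col + int(np.argmax(np.abs(A[col:, col])))
        A[[col, pivot]] = A[[pivot, col]]
        b[[col, pivot]] = b[[pivot, col]]
        # Eliminate this variable from every row below
        for row in range(col + 1, n):
            factor = A[row, col] / A[col, col]
            A[row, col:] -= factor * A[col, col:]
            b[row] -= factor * b[col]
    # Back substitution, from the last variable to the first
    x = np.zeros(n)
    for row in range(n - 1, -1, -1):
        x[row] = (b[row] - A[row, row + 1:] @ x[row + 1:]) / A[row, row]
    return x

# Example system: 4b + 2c = 4 and -5b - 3c = -7
solution = solve_by_elimination([[4, 2], [-5, -3]], [4, -7])
print(solution)
```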

## Applying Linear Equations

The theory of linear equations is frequently applied in machine learning and deep learning, where the data is represented in the form of vectors, matrices or tensors. Some common applications are highlighted below.

### Linear Regression

In a typical linear regression problem, the data can be represented as a matrix where: 

1. Rows correspond to different data points or records in the data
2. Columns correspond to different features, each of which gets its own coefficient (parameter)

In such cases, solving the linear equation system gives us the values of the parameters in the row equations, so that we can use them for calculating output values for novel data points. 

<img src="images/linalg_matrix.png" 
        alt="Picture" 
        width="400" 
        height="400" 
        style="display: block; margin: 0 auto" />

Image Source: [Jon Krohn's Course](https://learning.oreilly.com/videos/the-essential-machine/9780137903245/9780137903245-LAM1_01_05_05/) 

In [4]:
# Closed form solution
# Let equations be 4b+2c=4 and -5b-3c=-7
y= np.array([4,-7])
X= np.array ([[4,2],[-5,-3]])
# w= inv(X).y
X_inv=np.linalg.inv(X)
w= np.dot(X_inv,y)
print(w) # this gives the weight (parameter) vector, here [b, c]

[-1.  4.]


In [5]:
# confirming the solution
y= np.dot(X,w)
print(y)

[ 4. -7.]
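
Forming the inverse explicitly works for small square systems, but a direct solver is usually preferred, and when there are more data points than parameters a least-squares fit is used instead. The sketch below shows both with NumPy; the overdetermined data (`X_tall`, `y_tall`) is hypothetical and only there for illustration.

```python
import numpy as np

# Same system as above, solved without forming the inverse
X = np.array([[4.0, 2.0], [-5.0, -3.0]])
y = np.array([4.0, -7.0])
w = np.linalg.solve(X, y)
print(w)

# Overdetermined case: 4 data points, 2 parameters (intercept and slope)
X_tall = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])  # first column of ones models the intercept
y_tall = np.array([1.0, 3.1, 4.9, 7.0])
w_ls, *_ = np.linalg.lstsq(X_tall, y_tall, rcond=None)
print(w_ls)  # least-squares estimate of [intercept, slope]
```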


### Artificial Neural Networks

In artificial neural networks, the output of each layer of neurons is calculated in terms of the following: 

1. x: The input feature vector
2. w: The weights of neurons
3. b: The bias of neurons

where the output z is: 

$$ z= wx + b$$

Here, $w$ and $x$ undergo matrix multiplication, after which the bias is added. The output $z$ (typically after passing through a non-linear activation function) serves as the input feature vector for the next layer, and so it goes on. Thus, we see that even artificial neural networks evaluate linear equations in each of their layers. 

<img src="images/linalg_ann.png" 
        alt="Picture" 
        width="500" 
        height="500" 
        style="display: block; margin: 0 auto" />

Image Source: [Jeremy Jordan's Blog](https://www.jeremyjordan.me/intro-to-neural-networks/) 
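
A minimal NumPy sketch of this layer computation follows. The weights, biases, layer sizes and the choice of ReLU as the activation are all assumptions for illustration, not values from the figure.

```python
import numpy as np

def dense_layer(x, W, b):
    """One layer: linear step z = W x + b, followed by a ReLU activation (assumed choice)."""
    z = W @ x + b
    return np.maximum(z, 0.0)

x = np.array([1.0, 2.0])        # input feature vector

W1 = np.array([[1.0, -1.0],     # hidden layer: 3 neurons, 2 inputs each
               [0.5, 0.5],
               [-1.0, 0.0]])
b1 = np.array([0.0, 1.0, 0.0])

W2 = np.array([[1.0, 2.0, 3.0]])  # output layer: 1 neuron, 3 inputs
b2 = np.array([-1.0])

h = dense_layer(x, W1, b1)  # output of the first layer feeds the next one
out = W2 @ h + b2           # final linear step, no activation
print(h, out)
```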