## Understanding the Mathematics behind PCA

* Step 1: Standardize the dataset.

* Step 2: Calculate the covariance matrix for the features in the dataset.

* Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.

* Step 4: Sort eigenvalues and their corresponding eigenvectors.

* Step 5: Pick the top *k* eigenvalues and form a matrix of the corresponding eigenvectors (the feature vector).

* Step 6: Transform the original matrix by recasting the data along the principal component axes.

In [1]:
import numpy as np
import pandas as pd

---
### Reading in a dataset



In [3]:
A = np.matrix([[1,2,3,5],
               [5,5,6,7],
               [0,4,2,3],
               [5,3,2,1],
               [8,1,2,2]])
A

matrix([[1, 2, 3, 5],
        [5, 5, 6, 7],
        [0, 4, 2, 3],
        [5, 3, 2, 1],
        [8, 1, 2, 2]])

In [4]:
df = pd.DataFrame(A,columns  = ['f1','f2','f3','f4'])
df

| | f1 | f2 | f3 | f4 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 5 |
| 1 | 5 | 5 | 6 | 7 |
| 2 | 0 | 4 | 2 | 3 |
| 3 | 5 | 3 | 2 | 1 |
| 4 | 8 | 1 | 2 | 2 |


---
### Step 1: Standardize the dataset

In [8]:
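# Subtract each column's mean and divide by its standard deviation.
# pandas' std() uses the sample formula (ddof=1) by default.
df_std = (df - df.mean()) / df.std()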
df_std

| | f1 | f2 | f3 | f4 |
|---|---|---|---|---|
| 0 | -0.855985 | -0.632456 | 0.0 | 0.581318 |
| 1 | 0.366851 | 1.264911 | 1.732051 | 1.411773 |
| 2 | -1.161694 | 0.632456 | -0.57735 | -0.249136 |
| 3 | 0.366851 | 0.0 | -0.57735 | -1.079591 |
| 4 | 1.283977 | -1.264911 | -0.57735 | -0.664364 |
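
As a quick sanity check on the standardization, the sketch below rebuilds the same matrix with NumPy only and confirms that every column ends up with mean 0 and sample standard deviation 1. It assumes `ddof=1`, which is the pandas `DataFrame.std` default. Note that `sklearn.preprocessing.StandardScaler` divides by the population standard deviation (`ddof=0`) instead, so its output would differ by a constant factor. The names `X`, `X_std`, `col_means` and `col_stds` are introduced here for illustration only.

```python
import numpy as np

# Same values as the matrix A defined above.
X = np.array([[1, 2, 3, 5],
              [5, 5, 6, 7],
              [0, 4, 2, 3],
              [5, 3, 2, 1],
              [8, 1, 2, 2]], dtype=float)

# ddof=1 matches the pandas default used by df.std().
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

col_means = X_std.mean(axis=0)         # expected to be (numerically) zero
col_stds = X_std.std(axis=0, ddof=1)   # expected to be one

print(np.allclose(col_means, 0.0))
print(np.allclose(col_stds, 1.0))
```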


---
### Step 2: Calculate the covariance matrix for the features in the dataset.
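Find the covariance matrix for the given dataset.
There are two formulas for this:
- Sample formula (divide by N-1)
- Population formula (divide by N)

Note: Either formula can be used for PCA. The two covariance matrices differ only by the constant factor (N-1)/N, so they share the same eigenvectors and their eigenvalues differ by that same factor; the principal axes and their ordering are unchanged.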

#### Covariance population formula (divide by N)

In [10]:
df_cov = np.cov(df_std.T, bias = 1)
df_cov

array([[ 0.8       , -0.27068622,  0.07060045, -0.18786931],
       [-0.27068622,  0.8       ,  0.51120772,  0.42018059],
       [ 0.07060045,  0.51120772,  0.8       ,  0.71919495],
       [-0.18786931,  0.42018059,  0.71919495,  0.8       ]])

#### Covariance sample formula (divide by N-1)

In [6]:
cov_mat = np.cov(df_std.T, bias = 0)
cov_mat

array([[ 1.        , -0.33835777,  0.08825056, -0.23483663],
       [-0.33835777,  1.        ,  0.63900965,  0.52522573],
       [ 0.08825056,  0.63900965,  1.        ,  0.89899369],
       [-0.23483663,  0.52522573,  0.89899369,  1.        ]])

In [12]:
## verify variance(f1) is as expected
print('var(f1) (population formula): ',((df_std.f1)**2).sum()/5)
print('var(f1) (sample formula): ',((df_std.f1)**2).sum()/4)

var(f1) (population formula):  0.8
var(f1) (sample formula):  1.0


In [8]:
## verify covariance(f1,f2) is as expected
print('covar(f1,f2) (population formula): ',((df_std.f1)*(df_std.f2)).sum()/5)
print('covar(f1,f2) (sample formula): ',((df_std.f1)*(df_std.f2)).sum()/4)

covar(f1,f2) (population formula):  -0.27068621693278583
covar(f1,f2) (sample formula):  -0.3383577711659823
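
The relationship between the two formulas can be checked directly. The following sketch, which uses NumPy only and rebuilds the standardized data from scratch, confirms that the population matrix is the sample matrix scaled by (N-1)/N, and that the eigenvalues scale by the same factor while the eigenvectors agree up to sign. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order. The variable names are illustrative, and the eigenvector comparison assumes the eigenvalues are distinct.

```python
import numpy as np

X = np.array([[1, 2, 3, 5],
              [5, 5, 6, 7],
              [0, 4, 2, 3],
              [5, 3, 2, 1],
              [8, 1, 2, 2]], dtype=float)
n = X.shape[0]
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_sample = np.cov(X_std.T, bias=False)   # divide by N-1
cov_pop = np.cov(X_std.T, bias=True)       # divide by N

# The two estimates differ only by the constant factor (N-1)/N.
scale = (n - 1) / n
same_up_to_scale = np.allclose(cov_pop, scale * cov_sample)

vals_sample, vecs_sample = np.linalg.eigh(cov_sample)
vals_pop, vecs_pop = np.linalg.eigh(cov_pop)

# Eigenvalues scale by the same factor.
eigenvalues_scaled = np.allclose(vals_pop, scale * vals_sample)
# Eigenvectors are only defined up to sign, so compare absolute values.
same_axes = np.allclose(np.abs(vecs_pop), np.abs(vecs_sample), atol=1e-8)

print(same_up_to_scale, eigenvalues_scaled, same_axes)
```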



---
### Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.

An eigenvector of a linear transformation is a non-zero vector that changes at most by a scalar factor when that transformation is applied to it. The corresponding eigenvalue is the factor by which the eigenvector is scaled.

Let `A` be a square matrix (in our case the covariance matrix), `ν` a non-zero vector and `λ` a scalar. If they satisfy `Aν = λν`, then `λ` is called the eigenvalue associated with the eigenvector `ν` of `A`.
Rearranging the above equation,

    Aν - λν = 0 ;  (A - λI)ν = 0

Since `ν` is a non-zero vector, this equation can only hold if the matrix `A - λI` is singular, that is, if

    det(A - λI) = 0
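
To make the two conditions concrete, here is a small sketch on a toy 2x2 symmetric matrix (called `M` here, a made-up example that is not part of the dataset). For each eigenpair returned by `np.linalg.eig` it checks that `Mν = λν` holds and that `det(M - λI)` is numerically zero.

```python
import numpy as np

# Toy symmetric matrix chosen only for illustration; its eigenvalues are 1 and 3.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)

checks = []
for i, lam in enumerate(eigenvalues):
    v = eigenvectors[:, i]                          # eigenvectors are stored as columns
    residual = M @ v - lam * v                      # should be the zero vector
    char_det = np.linalg.det(M - lam * np.eye(2))   # should be zero
    checks.append((np.allclose(residual, 0.0), np.isclose(char_det, 0.0)))

print(checks)
```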


In [9]:
eigen_val, eigen_vectors = np.linalg.eig(cov_mat)

In [10]:
print(eigen_val)

[2.50696878 1.1041229  0.36948269 0.01942564]


In [11]:
print(eigen_vectors)

[[ 0.14433795 -0.90530522 -0.33700477  0.21451526]
 [-0.52653122  0.24817676 -0.80016989  0.14457288]
 [-0.58754506 -0.32335099  0.15527642 -0.72534418]
 [-0.59726229 -0.11947756  0.47121669  0.63793237]]


---
### Step 4: Sort eigenvalues and their corresponding eigenvectors.

##### The eigenvalues are already in descending order in our case, so no reordering is needed here. In general, `np.linalg.eig` does not guarantee any particular order.
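
When the eigenvalues do not come back sorted, one common approach is to sort with `np.argsort` and apply the same ordering to the eigenvector columns so that each eigenvalue stays paired with its eigenvector. The sketch below uses a made-up diagonal matrix `S` purely for illustration.

```python
import numpy as np

# Hypothetical symmetric matrix whose eigenvalues (2, 5, 3) are not in descending order.
S = np.array([[2.0, 0.0, 0.0],
              [0.0, 5.0, 0.0],
              [0.0, 0.0, 3.0]])

vals, vecs = np.linalg.eig(S)

order = np.argsort(vals)[::-1]   # indices of the eigenvalues, largest first
vals_sorted = vals[order]
vecs_sorted = vecs[:, order]     # reorder the columns to keep the pairs aligned

print(vals_sorted)
```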

In [12]:
n_components=3

### Step 5: Pick top *k* eigenvalues and form a matrix of eigenvectors.

If we choose the top 3 eigenvectors, the matrix will look like this:

In [13]:
top_eigen_vectors = eigen_vectors[:,:n_components]

In [14]:
top_eigen_vectors

array([[ 0.14433795, -0.90530522, -0.33700477],
       [-0.52653122,  0.24817676, -0.80016989],
       [-0.58754506, -0.32335099,  0.15527642],
       [-0.59726229, -0.11947756,  0.47121669]])

In [15]:
top_eigen_vectors.shape

(4, 3)

In [16]:
np.array(df_std).shape

(5, 4)
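
One common way to choose *k* is to look at the share of total variance carried by each eigenvalue. The sketch below applies that idea to the eigenvalues printed in Step 3. The 0.95 threshold is an arbitrary example value, not something prescribed by the method, and the variable names are illustrative.

```python
import numpy as np

# Eigenvalues copied from the Step 3 output above (already in descending order).
eigen_val = np.array([2.50696878, 1.1041229, 0.36948269, 0.01942564])

explained_ratio = eigen_val / eigen_val.sum()   # share of variance per component
cumulative_ratio = np.cumsum(explained_ratio)   # running total of those shares

print(np.round(explained_ratio, 4))
print(np.round(cumulative_ratio, 4))

# Smallest k whose cumulative share reaches the chosen threshold.
threshold = 0.95
k = int(np.searchsorted(cumulative_ratio, threshold) + 1)
print(k)
```

With these eigenvalues, the first three components account for roughly 99.5% of the total variance, which is consistent with the choice `n_components=3`.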

---
### Step 6: Transform the original matrix.


```
# df_std.shape * top_eigen_vectors.shape = transformed_data.shape
# (5, 4) * (4, 3) = (5, 3)
```



In [17]:
transformed_data = np.matmul(np.array(df_std),top_eigen_vectors)

In [18]:
pd.DataFrame(data = transformed_data
             , columns = ['principal component '+ str(i+1) for i in range(n_components)])

| | principal component 1 | principal component 2 | principal component 3 |
|---|---|---|---|
| 0 | 0.053796 | 0.586828 | 0.917353 |
| 1 | -2.564686 | -0.765083 | -0.129967 |
| 2 | -0.057691 | 1.416094 | -0.286098 |
| 3 | 1.014812 | -0.020871 | -0.70452 |
| 4 | 1.553769 | -1.216969 | 0.203232 |


In [19]:
transformed_data.shape

(5, 3)
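
A useful property to check after the projection is that the new coordinates are uncorrelated and that the variance along each axis equals the corresponding eigenvalue. The sketch below verifies this with NumPy only, projecting onto all four axes so that the original standardized data can also be reconstructed exactly. It uses `np.linalg.eigh` (for symmetric matrices) with an explicit descending sort, which is a substitution for the `np.linalg.eig` call used above. The variable names are illustrative.

```python
import numpy as np

X = np.array([[1, 2, 3, 5],
              [5, 5, 6, 7],
              [0, 4, 2, 3],
              [5, 3, 2, 1],
              [8, 1, 2, 2]], dtype=float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_sample = np.cov(X_std.T)
vals, vecs = np.linalg.eigh(cov_sample)   # ascending order for symmetric input
order = np.argsort(vals)[::-1]            # switch to descending order
vals, vecs = vals[order], vecs[:, order]

scores = X_std @ vecs                     # (5, 4) @ (4, 4) = (5, 4)

# The covariance of the projected data is diagonal, with the eigenvalues on the diagonal.
cov_scores = np.cov(scores.T)
is_diagonal = np.allclose(cov_scores, np.diag(vals))

# With all components kept, the projection can be inverted exactly.
reconstructed = scores @ vecs.T
is_lossless = np.allclose(reconstructed, X_std)

print(is_diagonal, is_lossless)
```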

---
## Let's see the result using the scikit-learn library

In [20]:
from sklearn.decomposition import PCA
pca = PCA(n_components=n_components)
principalComponents = pca.fit_transform(df_std)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component '+ str(i+1) for i in range(n_components)])
    

In [21]:
principalDf

| | principal component 1 | principal component 2 | principal component 3 |
|---|---|---|---|
| 0 | -0.053796 | 0.586828 | 0.917353 |
| 1 | 2.564686 | -0.765083 | -0.129967 |
| 2 | 0.057691 | 1.416094 | -0.286098 |
| 3 | -1.014812 | -0.020871 | -0.70452 |
| 4 | -1.553769 | -1.216969 | 0.203232 |
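
The first principal component has the opposite sign in the two tables. This is expected: if `ν` is an eigenvector then so is `-ν`, so the direction of each axis is arbitrary and different routines may pick different signs. The sketch below copies the values from both tables and flips columns so they point the same way before comparing. `align_signs` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

# Values copied from the manual result above.
manual = np.array([[ 0.053796,  0.586828,  0.917353],
                   [-2.564686, -0.765083, -0.129967],
                   [-0.057691,  1.416094, -0.286098],
                   [ 1.014812, -0.020871, -0.704520],
                   [ 1.553769, -1.216969,  0.203232]])

# Values copied from the scikit-learn result above.
from_sklearn = np.array([[-0.053796,  0.586828,  0.917353],
                         [ 2.564686, -0.765083, -0.129967],
                         [ 0.057691,  1.416094, -0.286098],
                         [-1.014812, -0.020871, -0.704520],
                         [-1.553769, -1.216969,  0.203232]])

def align_signs(reference, other):
    """Flip columns of `other` so each one points the same way as in `reference`."""
    signs = np.sign(np.sum(reference * other, axis=0))
    signs[signs == 0] = 1.0   # leave orthogonal columns untouched
    return other * signs

aligned = align_signs(manual, from_sklearn)
print(np.allclose(aligned, manual))
```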
