## Principal Component Analysis

Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of data.

It can be thought of as a projection method where data with m columns (features) is projected into a subspace with m or fewer columns while retaining the essence of the original data.

The PCA method can be described and implemented using the tools of linear algebra.

PCA is an operation applied to a dataset, represented by an n x m matrix A (n rows of samples, m columns of features), that results in a projection of A, which we will call P.

<ol>
    <li>The first step is to calculate the mean values of each column.<br>
        M = mean(A)
    </li>
    <li>
        Next, we need to center the values in each column by subtracting the mean column value.
        <br>
        C = A - M </li>
    <li>
        The next step is to calculate the covariance matrix of the centered matrix C.

Correlation is a normalized measure of the amount and direction (positive or negative) that two columns change together. Covariance is the unnormalized counterpart of correlation. A covariance matrix holds the covariance of every column with every other column, including itself; the diagonal entries are therefore the variances of the individual columns.<br>
        
V = cov(C)</li>
    <li>
        Next, we calculate the eigendecomposition of the covariance matrix V. This results in a list of eigenvalues and a list of eigenvectors.<br>
        
values, vectors = eig(V)</li>
    <li>
        The eigenvectors can be sorted by the eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for A.

If all eigenvalues have a similar value, then we know that the existing representation may already be reasonably compressed or dense and that the projection may offer little. If there are eigenvalues close to zero, they represent components or axes of B that may be discarded.

A total of m or fewer components must be selected to comprise the chosen subspace. Ideally, we would select k eigenvectors, called principal components, that have the k largest eigenvalues.
        <br>
B = select(values, vectors)    </li>
    <li>Once chosen, data can be projected into the subspace via matrix multiplication.<br>
        
P = B^T . C^T
        <br>
        Where C is the centered data that we wish to project (n x m), B^T is the transpose of the chosen principal components (B is m x k, with one component per column) and P is the k x n projection. Its transpose, P^T, has one row per sample.
    </li>
    
</ol>
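
The six steps above can be sketched as one small function. This is a minimal illustration, not a reference implementation: the name `pca_project` and the parameter `k` are hypothetical, and it assumes `np.linalg.eigh` is acceptable in place of `eig` because a covariance matrix is symmetric. The projection is computed as `C @ B`, which is the transpose of B^T . C^T, so each row of the result corresponds to one sample.

```python
import numpy as np

def pca_project(A, k):
    """Hypothetical helper: project the rows of A onto the top-k principal components."""
    A = np.asarray(A, dtype=float)
    M = A.mean(axis=0)                            # step 1: column means
    C = A - M                                     # step 2: center each column
    V = np.atleast_2d(np.cov(C, rowvar=False))    # step 3: m x m covariance matrix
    values, vectors = np.linalg.eigh(V)           # step 4: eigh returns ascending eigenvalues
    order = np.argsort(values)[::-1]              # step 5: rank components, largest first
    values, vectors = values[order], vectors[:, order]
    B = vectors[:, :k]                            # keep k eigenvectors (one per column)
    P = C @ B                                     # step 6: n x k projection
    return P, values, B

P, values, B = pca_project([[1, 2], [3, 4], [5, 6]], k=1)
print(P.shape)   # (3, 1)
print(values)    # eigenvalues in descending order
```

The sign of each eigenvector is arbitrary, so the projected values may come out negated relative to another implementation.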

In [1]:
import numpy as np

In [2]:
#data matrix for pca
data=np.array([[1,2],[3,4],[5,6]])
data

array([[1, 2],
       [3, 4],
       [5, 6]])

In [6]:
#step 1
#mean of each column
M=np.mean(data,axis=0)
M

array([3., 4.])

In [7]:
#step 2
# subtracting the column means from the data (centering only; no scaling is applied)
scaled_data=data-M
scaled_data

array([[-2., -2.],
       [ 0.,  0.],
       [ 2.,  2.]])

In [9]:
#step 3
# calculate the covariance matrix from the centered data (np.cov expects one variable per row, hence the transpose)
V=np.cov(scaled_data.T)
V

array([[4., 4.],
       [4., 4.]])
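
As a cross-check on step 3, the covariance matrix can also be computed directly from the centered data as C^T C / (n - 1). The sketch below assumes the same 3 x 2 example matrix; the names `C`, `V_manual` and `V_numpy` are introduced here for illustration. It relies on `np.cov` normalizing by n - 1 by default and centering its input internally.

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
C = data - data.mean(axis=0)           # centered data, n x m
n = C.shape[0]

V_manual = C.T @ C / (n - 1)           # sample covariance from the definition
V_numpy = np.cov(C, rowvar=False)      # rowvar=False: columns are the variables

print(V_manual)
print(np.allclose(V_manual, V_numpy))                        # True
print(np.allclose(np.cov(data, rowvar=False), V_manual))     # True: np.cov centers internally
```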

In [10]:
#step 4
# calculate eigenvalues and eigenvectors (np.linalg.eig does not guarantee any particular ordering)
values, vectors=np.linalg.eig(V)

In [12]:
# the eigenvalues are the variances explained by each component
values

array([8., 0.])

In [13]:
# each column of `vectors` is an eigenvector, i.e. one principal component
vectors

array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]])
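
One way to sanity-check the decomposition is to confirm that each column of `vectors` satisfies V v = λ v and that the columns are orthonormal. The following is a small sketch using the covariance matrix from the example; the variable names `lhs` and `rhs` are illustrative only.

```python
import numpy as np

V = np.array([[4.0, 4.0], [4.0, 4.0]])
values, vectors = np.linalg.eig(V)

# Column j of `vectors` pairs with values[j]: V @ v_j should equal values[j] * v_j.
lhs = V @ vectors
rhs = vectors * values                 # broadcasting scales column j by values[j]
print(np.allclose(lhs, rhs))           # True

# For a symmetric matrix with distinct eigenvalues the eigenvectors are orthonormal.
print(np.allclose(vectors.T @ vectors, np.eye(2)))   # True
```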

In [14]:
vectors.shape  # shape of the eigenvector matrix

(2, 2)

In [17]:
scaled_data.T.shape  # shape of the transposed centered data

(2, 3)

In [21]:
P=vectors.T.dot(scaled_data.T)  # projection: transposed eigenvector matrix times transposed centered data

In [22]:
P.T

array([[-2.82842712,  0.        ],
       [ 0.        ,  0.        ],
       [ 2.82842712,  0.        ]])
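
The walkthrough above keeps both components and does not perform the sorting and selection of step 5. A hedged sketch of that step follows, assuming we keep k = 1 component; the names `order`, `B`, `P_k`, `explained_ratio` and `reconstructed` are introduced here for illustration. Because the discarded eigenvalue is zero for this dataset, mapping the one-dimensional projection back to the original space recovers the data.

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
M = data.mean(axis=0)
C = data - M
values, vectors = np.linalg.eig(np.cov(C, rowvar=False))

# Step 5: rank the eigenvectors by eigenvalue, largest first.
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

k = 1
B = vectors[:, :k]                     # m x k matrix of chosen components
P_k = (B.T @ C.T).T                    # n x k projection, one row per sample

# Share of the total variance retained by the k chosen components.
explained_ratio = values[:k].sum() / values.sum()

# Map the projection back to the original feature space.
reconstructed = P_k @ B.T + M

print(explained_ratio)
print(np.allclose(reconstructed, data))   # True for this dataset
```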

### PCA implementation with sklearn

In [23]:
from sklearn.decomposition import PCA
pca=PCA()
pca.fit(data)

PCA()

In [24]:
# principal components (one per row; signs may differ from the eigenvectors above)
pca.components_

array([[ 0.70710678,  0.70710678],
       [ 0.70710678, -0.70710678]])

In [25]:
# explained variance
pca.explained_variance_

array([8.00000000e+00, 2.25080839e-33])

In [26]:
P=pca.transform(data)
P

array([[-2.82842712e+00,  2.22044605e-16],
       [ 0.00000000e+00,  0.00000000e+00],
       [ 2.82842712e+00, -2.22044605e-16]])

In [27]:
pca.n_components_

2
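
The scikit-learn documentation describes `PCA` as based on a singular value decomposition of the centered data rather than an explicit eigendecomposition of the covariance matrix. The NumPy-only sketch below illustrates why the two routes agree, under the assumption that the same example matrix is used: the rows of `Vt` play the role of the components, and the squared singular values divided by n - 1 give the explained variances. The variable names are illustrative, and components are only expected to match up to sign.

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
C = data - data.mean(axis=0)
n = C.shape[0]

# Thin SVD of the centered data: C = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(C, full_matrices=False)

components = Vt                        # one principal direction per row
explained_variance = S**2 / (n - 1)    # matches the covariance eigenvalues
P_svd = C @ Vt.T                       # projection; equal to U * S

# Compare with the eigendecomposition route (up to the sign of each vector).
values, vectors = np.linalg.eig(np.cov(C, rowvar=False))
order = np.argsort(values)[::-1]

print(explained_variance)
print(np.allclose(explained_variance, values[order], atol=1e-9))            # True
print(np.allclose(np.abs(components), np.abs(vectors[:, order].T)))         # True
```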