# Principal Component Analysis #

Principal Component Analysis (PCA) is based on the spectral decomposition of a matrix. For a symmetric matrix $A$, we seek the decomposition $A = W \Lambda W^\dag$, where $\Lambda$ is a diagonal matrix of eigenvalues and $W$ is an orthogonal matrix whose columns are the corresponding eigenvectors. This is useful because the columns of $W$ form an orthonormal basis in which $A$ is diagonal. The decomposition is typically applied to a covariance or correlation matrix.

While PCA is the most popular and straightforward orthogonalization technique, it is by no means the only method.

Let $X$ denote the $T \times n$ matrix of data ($T$ data points, $n$ variables). We denote the columns of $X$ by $\{\vec{x}_1,\dots,\vec{x}_n\}$, where each vector $\vec{x}_i$ holds the observations of one explanatory variable. We assume each $\vec{x}_i$ has zero mean.

The sample variances and covariances of this data are summarized by the matrix

$$
V = T^{-1} X^\dag X.
$$

If we additionally normalize the data so that each $\vec{x}_i$ has variance 1 as well as zero mean, $V$ is the correlation matrix of the variables.
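The formula for $V$ can be checked numerically. The following is a minimal sketch on synthetic data (the array `raw` and the mixing matrix are made up for illustration and are not part of the dataset used later). It assumes the $1/T$ normalization written above, which corresponds to `bias=True` in `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 200, 3

# Hypothetical synthetic data standing in for the data matrix.
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
raw = rng.normal(size=(T, n)) @ mixing

# Center each column so that every x_i has zero mean.
X = raw - raw.mean(axis=0)

# Sample covariance matrix V = T^{-1} X^T X.
V = X.T @ X / T

# Scale each column to unit variance; the same formula then
# gives the correlation matrix.
Z = X / X.std(axis=0)
R = Z.T @ Z / T

# Compare against NumPy's built-in estimators.
cov_matches = np.allclose(V, np.cov(raw, rowvar=False, bias=True))
corr_matches = np.allclose(R, np.corrcoef(raw, rowvar=False))
print(cov_matches, corr_matches)
```

Note that `np.cov` divides by $T-1$ by default, so the comparison only holds with `bias=True`.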

Sometimes we do not have enough data ($T < n$). Then $V$ is singular and has at least $n - T$ zero eigenvalues, so a full set of $n$ principal components cannot be determined. This is usually not a serious problem, since we typically look only at the first few, most important principal components. When $T > n$ and the columns of $X$ are linearly independent, $V$ is positive definite.
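The effect of having too few observations can be seen in a small experiment. This is a sketch on random data (the sizes `T = 5` and `n = 8` and the tolerance `1e-10` are arbitrary choices for illustration). Because the columns are centered, one further degree of freedom is lost, so the rank of $X$ is at most $T - 1$ here.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 5, 8  # fewer observations than variables

X = rng.normal(size=(T, n))
X = X - X.mean(axis=0)  # centering removes one more degree of freedom

V = X.T @ X / T

# eigvalsh is meant for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(V)

rank = np.linalg.matrix_rank(X)
# Count eigenvalues that are nonzero up to a numerical tolerance.
num_nonzero = int(np.sum(eigenvalues > 1e-10))
print(rank, num_nonzero, n)
```

The number of nonzero eigenvalues equals the rank of $X$, so only that many principal components carry any variance.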

We have not yet defined what we mean by a principal component. A principal component is a linear combination of the columns of $X$, where the weights are chosen such that:

1) The principal components are uncorrelated with each other

2) The first principal component explains the most variation, the second explains the greatest amount of the remaining variation, etc.

We now describe how to construct components with these properties.

Denote by $\Lambda$ the diagonal matrix of the eigenvalues of $V$, and by $W$ the orthogonal matrix of the corresponding eigenvectors of $V$. The eigenvalues (and corresponding eigenvectors) are ordered from largest to smallest, $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$. We define the matrix of principal components $P$ as:

$$
P = XW.
$$

Then the $i$-th principal component of $V$ is the $i$-th column of $P$. The covariance matrix of the principal components is $T^{-1}P^\dag P = T^{-1} W^\dag X^\dag X W = W^\dag V W = \Lambda$, since $V = W \Lambda W^\dag$ and $W^\dag W = I$. Because this matrix is diagonal, the principal components are uncorrelated, and the variance of the $i$-th principal component is $\lambda_i$.
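The construction of $P$ can be sketched directly with NumPy. The example below uses synthetic data (an assumption for illustration, not the dataset analyzed later) and checks that the covariance matrix of the principal components is diagonal with the eigenvalues on its diagonal. `np.linalg.eigh` returns eigenvalues in ascending order, so they are reordered to match the convention $\lambda_1 \geq \dots \geq \lambda_n$.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 500, 4

# Hypothetical data with correlated columns, centered to zero mean.
X = rng.normal(size=(T, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)

V = X.T @ X / T

# Eigendecomposition of the symmetric matrix V.
eigenvalues, W = np.linalg.eigh(V)

# Reorder from largest to smallest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, W = eigenvalues[order], W[:, order]

# Columns of P are the principal components.
P = X @ W

# Covariance matrix of the principal components.
cov_P = P.T @ P / T
print(np.allclose(cov_P, np.diag(eigenvalues)))
```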

Since $W^\dag = W^{-1}$, $X=P W^\dag$. In other words,

$$
\vec{x}_i = w_{i1}\vec{p}_1 + \dots + w_{in}\vec{p}_n.
$$

We can now use the principal components to approximate the input data. For example, if we only wish to use the first two principal components,

$$
\vec{x}_i \approx w_{i1}\vec{p}_1 + w_{i2}\vec{p}_2.
$$

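In matrix form, where $P^*$ and $W^*$ consist of the retained leading columns of $P$ and $W$ (the first two in the example above), this is typically written as: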

$$
X \approx P^* W^{*\dag}.
$$
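The truncated reconstruction can be illustrated as follows. This is a sketch under stated assumptions: the data are synthetic, generated from two hypothetical latent factors plus small noise, and `k` is an illustrative name for the number of retained components. The squared Frobenius error of the approximation equals $T$ times the sum of the discarded eigenvalues, which the code also checks.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, k = 300, 5, 2  # k = number of retained components (illustrative)

# Hypothetical data: two latent factors plus a little noise.
factors = rng.normal(size=(T, k))
loadings = rng.normal(size=(k, n))
X = factors @ loadings + 0.05 * rng.normal(size=(T, n))
X = X - X.mean(axis=0)

V = X.T @ X / T
eigenvalues, W = np.linalg.eigh(V)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, W = eigenvalues[order], W[:, order]

P = X @ W

# Using all n components reproduces X exactly (up to rounding).
full_ok = np.allclose(X, P @ W.T)

# Keep only the first k columns of P and W.
P_star, W_star = P[:, :k], W[:, :k]
X_approx = P_star @ W_star.T

# Relative error of the rank-k approximation.
rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)

# Squared error versus T times the sum of the discarded eigenvalues.
sq_error = np.linalg.norm(X - X_approx) ** 2
discarded = T * eigenvalues[k:].sum()
print(full_ok, rel_error, np.isclose(sq_error, discarded))
```

The relative error is small here only because the synthetic data were built from two factors; real data need not compress this well.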

We will try out PCA on a small dataset: the breast cancer dataset bundled with scikit-learn, which has 569 data points and 30 explanatory variables.


In [11]:
from sklearn.datasets import load_breast_cancer
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

In [4]:
breast = load_breast_cancer()
breast_data = breast.data
breast_labels = breast.target
labels = breast_labels.reshape(-1, 1)  # column vector of class labels, one row per sample
print(breast_data.shape)
features = breast.feature_names
print(features)

(569, 30)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [8]:
x = breast_data
x = StandardScaler().fit_transform(x) # normalizing the features
print(x.shape)
print(np.mean(x),np.std(x))

(569, 30)
-6.118909323768877e-16 1.0


In [10]:
pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)

In [12]:
principal_breast_Df = pd.DataFrame(
    data=principalComponents_breast,
    columns=['principal component 1', 'principal component 2'],
)
principal_breast_Df.tail()

|     | principal component 1 | principal component 2 |
|-----|-----------------------|-----------------------|
| 564 | 6.439315              | -3.576817             |
| 565 | 3.793382              | -3.584048             |
| 566 | 1.256179              | -1.902297             |
| 567 | 10.374794             | 1.67201               |
| 568 | -5.475243             | -0.670637             |


In [13]:
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))

Explained variation per principal component: [0.44272026 0.18971182]
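The explained variation ratio relates to the eigenvalues from the derivation above: the share explained by the $i$-th component is $\lambda_i / \sum_j \lambda_j$. The sketch below illustrates this with NumPy on synthetic standardized data (an assumption for illustration; it does not reproduce the numbers printed above). The ratio does not depend on whether the covariance is normalized by $T$ or $T-1$, since the factor cancels.

```python
import numpy as np

rng = np.random.default_rng(4)
T, n = 400, 6

# Hypothetical raw data with correlated columns.
raw = rng.normal(size=(T, n)) @ rng.normal(size=(n, n))

# Standardize each column to zero mean and unit variance.
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Correlation matrix and its eigenvalues, largest first.
R = Z.T @ Z / T
eigenvalues = np.linalg.eigvalsh(R)[::-1]

# Share of total variance explained by each component.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
print(explained_ratio[:2], cumulative[1])
```

For a correlation matrix the eigenvalues sum to $n$, the number of variables, because the trace equals the sum of the unit variances.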
