# Principal Component Analysis (PCA)

In [34]:
import numpy as np
from sklearn.decomposition import PCA

## Concept

$N$ is the number of observations and $p$ is the number of features. $X$ is the given $N \times p$ data matrix. Each feature is **centered**, meaning that the mean of each column is subtracted from that column. In code,

$$
X = X - \text{np.mean($X$, axis=0)}
$$

**np** is **numpy**. **axis=0** takes the mean of each column, that is, the mean over all rows for each feature.
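A minimal sketch of the centering step, assuming a small random matrix as illustrative data (the shape and the names `rng` and `col_means` are not from the original). After centering, every column mean should be zero up to floating-point error.

```python
import numpy as np

# Illustrative data: 6 observations, 3 features, with a non-zero mean
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(6, 3))

# axis=0 gives one mean per column; broadcasting subtracts it from every row
X_centered = X - np.mean(X, axis=0)

# Each column of the centered data now has mean (numerically) zero
col_means = X_centered.mean(axis=0)
print(np.allclose(col_means, 0.0))
```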

**Principal component analysis (PCA)** can be computed from the **singular value decomposition (SVD)** of this centered data.

$$
X = U D V^T
$$
$$
(N \times p) = (N \times p) (p \times p) (p \times p)
$$

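The columns of $U$ are the **left singular vectors**. $D$ is a diagonal matrix with the **singular values** on its diagonal, in decreasing order. The columns of $V$ are the **right singular vectors**. The shapes above are those of the thin SVD and assume $N \ge p$.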

The columns of $U D$ are called the **principal components** of $X$.
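The decomposition above can be sketched with **numpy.linalg.svd**. This is a small illustration on random data (the sizes and the names `Vt` and `Z` are assumptions, not from the original). Note that **numpy.linalg.svd** returns $V^T$, not $V$, and returns the singular values as a 1-D array.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 8, 3
X = rng.normal(size=(N, p))
Xc = X - X.mean(axis=0)  # centered data

# Thin SVD: U is (N, p), s has length p, Vt is V transposed with shape (p, p)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
D = np.diag(s)

# Principal components are the columns of U D
Z = U @ D

# U D V^T reproduces the centered data
reconstructed = U @ D @ Vt
print(np.allclose(reconstructed, Xc))

# Because V has orthonormal columns, U D is the same as Xc V
print(np.allclose(Z, Xc @ Vt.T))
```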

Dimension reduction of $X$ from $p$ to $q$ ($q \le p$) keeps the first $q$ principal components:

$$
X_{\text{dimension reduced}} = U_q D_q
$$
$$
(N \times q) = (N \times q) (q \times q)
$$

$U_q$ consists of the first $q$ columns of $U$ (all rows). $D_q$ is the top-left $q \times q$ block of $D$.
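A small sketch of this truncation, again on illustrative random data (sizes and variable names are assumptions). It also checks that $U_q D_q$ equals the centered data projected onto the first $q$ right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 8, 4, 2
X = rng.normal(size=(N, p))
Xc = X - X.mean(axis=0)  # centered data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

U_q = U[:, :q]         # first q columns of U, all rows
D_q = np.diag(s[:q])   # top-left q x q block of D
reduced = U_q @ D_q    # (N, q) reduced data

# Same result by projecting onto the first q right singular vectors
projected = Xc @ Vt[:q].T

print(reduced.shape)
print(np.allclose(reduced, projected))
```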

## Scikit-learn

In **sklearn.decomposition.PCA**, parameter **n_components** is $q$.

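**fit(X)** and **fit_transform(X)** center $X$ internally (the feature means are stored in attribute **mean_**), so centering beforehand is not required. Passing already centered data gives the same result.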

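Attribute **singular_values_** holds the first $q$ **singular values of the SVD**.

Attribute **components_** holds the first $q$ **right singular vectors of the SVD** as rows (the first $q$ rows of $V^T$), up to the sign of each row.

Dimension reduction by **fit_transform(X)** returns $U_q D_q$ of the SVD, up to the sign of each column.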
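As a quick check of how **fit** treats centering, the sketch below fits **PCA** once on raw data and once on centered data and compares the attributes. The data, sizes, and variable names are illustrative assumptions. Components are compared by absolute value so that a possible sign difference does not matter.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=(50, 4))  # features with non-zero means
Xc = X - X.mean(axis=0)

q = 2
pca_raw = PCA(n_components=q).fit(X)        # uncentered input
pca_centered = PCA(n_components=q).fit(Xc)  # centered input

# The feature means removed by sklearn are stored in mean_
print(np.allclose(pca_raw.mean_, X.mean(axis=0)))

# Singular values agree whether or not the input was centered beforehand
same_singular_values = np.allclose(
    pca_raw.singular_values_, pca_centered.singular_values_
)
# Components agree as well (compared up to sign)
same_components = np.allclose(
    np.abs(pca_raw.components_), np.abs(pca_centered.components_)
)
print(same_singular_values, same_components)
print(pca_raw.components_.shape)  # (q, p): one component per row
```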

## Reference

- The Elements of Statistical Learning, 14.5.1 Principal Components
- [numpy.linalg.svd](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html)
- [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

## Singular value and components

In [112]:
np.random.seed(0)

# Make a given matrix
n = 100
p = 10
X = np.random.randn(n, p)

# Manual PCA
X_centered = X - np.mean(X, axis=0)
# Note: np.linalg.svd returns V transposed, so the rows of V below are the right singular vectors
U, s, V = np.linalg.svd(X_centered, full_matrices=False)
S = np.diag(s)

print(f'U: {U.shape}')
print(f'S: {S.shape}')
print(f'V: {V.shape}')
print()

print('Singular values by manual')
print(np.round(s, 2))
print()
# https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#
# Equivalently, the right singular vectors of the centered input data, parallel to its eigenvectors.
print('Components by manual')
print(np.round(V, 2))
print()

# PCA by sklearn
pca = PCA(n_components=p)
pca.fit(X_centered)
print('Singular values by sklearn')
print(np.round(pca.singular_values_, 2))
print()
print('Components by sklearn')
print(np.round(pca.components_, 2))
print()

# Take the difference and the sum: a row that is zero in the sum differs from sklearn only by sign
print('Singular values of manual - sklearn ')
print(np.round(s - pca.singular_values_, 2))
print()
print('Components of manual - sklearn')
print(np.round(V - pca.components_, 2))
print()
print('Components of manual + sklearn')
print(np.round(V + pca.components_, 2))
print()

U: (100, 10)
S: (10, 10)
V: (10, 10)

Singular values by manual
[12.06 11.37 11.1  10.26  9.98  9.71  8.86  8.1   7.94  7.73]

Components by manual
[[-0.31  0.24  0.03  0.37  0.19  0.26  0.13 -0.47  0.6  -0.1 ]
 [ 0.14  0.21 -0.2   0.11  0.53 -0.65  0.33 -0.11 -0.15 -0.18]
 [-0.02  0.47 -0.21 -0.58  0.01 -0.08  0.05  0.29  0.47  0.28]
 [ 0.71  0.12  0.01  0.43 -0.27  0.08  0.29  0.25  0.25  0.03]
 [-0.27 -0.44 -0.36  0.07 -0.38 -0.39  0.05  0.2   0.38 -0.36]
 [ 0.34 -0.11 -0.22 -0.32 -0.34 -0.12  0.04 -0.76  0.01  0.13]
 [-0.11 -0.26  0.65  0.02 -0.04 -0.42  0.18 -0.03  0.21  0.49]
 [-0.31  0.07 -0.49  0.37 -0.16  0.01  0.17  0.02 -0.25  0.64]
 [ 0.27 -0.51 -0.28  0.03  0.55  0.11 -0.34  0.04  0.28  0.29]
 [-0.06 -0.35 -0.04 -0.29  0.16  0.38  0.78  0.06 -0.06 -0.05]]

Singular values by sklearn
[12.06 11.37 11.1  10.26  9.98  9.71  8.86  8.1   7.94  7.73]

Components by sklearn
[[ 0.31 -0.24 -0.03 -0.37 -0.19 -0.26 -0.13  0.47 -0.6   0.1 ]
 [ 0.14  0.21 -0.2   0.11  0.53 -0.65  0.33 -

## Dimension reduction

In [117]:
m = 5

print('Size of original data')
print(X_centered.shape)
print()

# Manual dimension reduction: first m columns of U times the top-left m x m block of S
U, s, V = np.linalg.svd(X_centered, full_matrices=False)
S = np.diag(s)
reduced_manual = U[:, :m] @ S[:m, :m]

print('Dimension reduction by manual')
print(reduced_manual.shape)
print()

# Dimension reduction by sklearn
pca = PCA(n_components=m)
reduced_sklearn = pca.fit_transform(X_centered)

print('Dimension reduction by sklearn')
print(reduced_sklearn.shape)
print()

print('Manual - sklearn')
print(np.round(reduced_manual[:10] - reduced_sklearn[:10], 2))
print()

print('Manual + sklearn')
print(np.round(reduced_manual[:10] + reduced_sklearn[:10], 2))
print()

Size of original data
(100, 10)

Dimension reduction by manual
(100, 5)

Dimension reduction by sklearn
(100, 5)

Manual - sklearn
[[ 1.17  0.   -0.   -0.   -0.  ]
 [ 2.18  0.   -0.   -0.   -0.  ]
 [ 3.21 -0.    0.    0.   -0.  ]
 [-2.74 -0.   -0.    0.    0.  ]
 [-2.06 -0.    0.   -0.   -0.  ]
 [-0.93 -0.    0.    0.    0.  ]
 [-3.2  -0.    0.    0.    0.  ]
 [-1.53  0.    0.   -0.    0.  ]
 [ 0.53 -0.    0.   -0.   -0.  ]
 [ 0.44  0.   -0.   -0.   -0.  ]]

Manual + sklearn
[[-0.    4.92 -2.35  4.3  -2.75]
 [-0.    2.02  0.72  1.8  -1.6 ]
 [ 0.    2.78  3.62 -4.33 -0.51]
 [ 0.    0.67  3.27  0.37  0.58]
 [-0.    0.16 -4.15 -0.67  3.07]
 [ 0.   -0.09  1.27 -1.82 -0.1 ]
 [ 0.   -0.29  1.33 -3.41  0.46]
 [ 0.    0.79  0.45 -0.79 -1.9 ]
 [ 0.   -0.23  1.79 -2.69 -4.85]
 [ 0.   -0.21  1.22  1.93 -1.22]]
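In the output above, the first column is non-zero in the difference and zero in the sum, so manual and sklearn agree on that column except for the sign. Singular vectors are only defined up to sign, so this is expected. A minimal sketch of comparing the two results after aligning the signs, assuming the same data as above (the names `signs` and `aligned` are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
X = np.random.randn(100, 10)
Xc = X - X.mean(axis=0)
m = 5

# Manual reduction: U_m D_m (scaling each column of U by its singular value)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced_manual = U[:, :m] * s[:m]

reduced_sklearn = PCA(n_components=m).fit_transform(Xc)

# For each column, +1 if the two results point the same way, -1 if flipped
signs = np.sign(np.sum(reduced_manual * reduced_sklearn, axis=0))
aligned = reduced_sklearn * signs

print(np.allclose(reduced_manual, aligned))
```

With the signs aligned, a single difference is enough and the separate sum is no longer needed.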

