# Principal Component Analysis

Based on https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

## Overview

In this notebook, we will discuss the concept of principal component analysis (PCA), an unsupervised machine learning method for dimensionality reduction, and how to implement it in Python using the scikit-learn library.

## Libraries

In [None]:
import numpy   as np
import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

from mendeleev.fetch       import fetch_table

plt.rc('xtick', labelsize=18) 
plt.rc('ytick', labelsize=18)

blue   = '#0021A5'
orange = '#FA4616'

## 1. Introduction

Definition: PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated numerical variables into a set of values of linearly uncorrelated variables called principal components.

- Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data. It is a linear transformation technique that is widely used across different fields, most prominently for feature extraction and dimensionality reduction in machine learning.

- PCA is an **unsupervised linear dimensionality reduction technique** that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space.

- PCA aims to preserve the directions of greatest variance in the data, reducing the influence of low-variance directions that may represent noise or redundant information.

- Dimensions are simply the features that represent the data. For example, the atomic descriptors we used previously include the atomic number, atomic and ionic radii, electronegativity, etc. Each of these descriptor components is a dimension. Note: features, dimensions, and variables all refer to the same idea, and you will find the terms used interchangeably.

- You can use PCA to help identify clusters of similar data points based on the correlations between their features.

### 1.1 Five steps of PCA

Principal component analysis can be broken down into five steps. We will go through each step, explaining what PCA does and discussing the underlying mathematical concepts, such as standardization, covariance, eigenvectors, and eigenvalues, without focusing on how to compute them by hand.

1. [**Standardize**](####-1.2.1-Step-1:-Standardization) the range of continuous initial variables, i.e., zero mean, unit variance.

2. Compute the [**covariance matrix**](####-1.2.2-Step-2:-Covariance-matrix) to identify correlations.

3. Compute the [**eigenvectors** and **eigenvalues**](####-1.2.3-Step-3:-Eigenvectors-and-eigenvalues) of the covariance matrix to identify the principal components.

4. Create a [**feature vector**](####-1.2.4-Step-4:-Create-the-feature-vector) to decide which principal components to keep.

5. [**Recast**](####-1.2.5-Step-5:-Recast-data-along-principal-components) the data along the principal components axes.
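
The five steps above can be condensed into a few lines of NumPy. The following is only a minimal sketch under stated assumptions: the function name `pca_sketch` and the toy data are hypothetical and not part of the original notebook, and the standardization uses the $N-1$ normalization.

```python
import numpy as np

def pca_sketch(data, n_components):
    """Project `data` (n_samples x n_features) onto its leading principal components."""
    # Step 1: standardize each column to zero mean and unit variance
    standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

    # Step 2: covariance matrix of the standardized features
    covariance = np.cov(standardized, rowvar=False)

    # Step 3: eigen-decomposition (eigh is intended for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)

    # Step 4: sort by descending eigenvalue and keep the leading eigenvectors
    order = np.argsort(eigenvalues)[::-1]
    feature_vectors = eigenvectors[:, order[:n_components]].T

    # Step 5: recast the standardized data along the selected axes
    return standardized @ feature_vectors.T, eigenvalues[order]


rng = np.random.default_rng(0)
toy_data = rng.normal(size=(100, 3))
toy_data[:, 2] = toy_data[:, 0] + 0.1 * toy_data[:, 2]  # make two columns correlated

projected, sorted_eigenvalues = pca_sketch(toy_data, n_components=2)
print(projected.shape)
print(sorted_eigenvalues)
```

Each step is revisited in detail, with intermediate results, in the illustrative example that follows.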

### 1.2 Illustrative example of PCA

In [None]:
points       = 200
random_state = np.random.RandomState(1)

X = np.dot(random_state.rand(2, 2), random_state.randn(2, points)).T

plt.figure( figsize=(8, 8) )

plt.scatter(X[:, 0], X[:, 1], color=blue, s=64)
plt.xlabel('x', fontsize=18)
plt.ylabel('y', fontsize=18)

plt.axis('equal')

plt.show()

It is apparent that there is a nearly linear relationship between the $x$ and $y$ variables. This is reminiscent of the linear regression data we explored previously, but the problem setting here is slightly different. Rather than attempting to predict the $y$ values from the $x$ values, the unsupervised learning problem attempts to learn about the relationship between the $x$ and $y$ values.

In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.

#### 1.2.1 Step 1: Standardization

The goal is to standardize the range of the variables so that each one of them contributes equally to the analysis.

PCA is sensitive to the variances of the initial variables. If there are large differences between the ranges of the different features, those with larger ranges will dominate over those with small ranges.

For example, the $x$ values have a much larger range than the $y$ values. Hence the $x$ values would dominate over the $y$ values, which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean $\mu$ and dividing by the standard deviation $\sigma$ for each value of each variable. This is applied column-wise to each feature:
$$
\overline{\mathbf{X}} = \frac{\mathbf{X} - \mu}{\sigma}
$$

In [None]:
mean   = np.mean(X, axis=0)
stddev = np.std(X,  axis=0, ddof=1)

X_transformed = (X - mean)/stddev

We can alternatively use scikit-learn's `StandardScaler` to standardize the data:

In [None]:
standard_scaling = StandardScaler(with_mean=True, with_std=True)

standard_scaling.fit(X)

X_transformed_using_sklearn = standard_scaling.transform(X)

Check whether the two methods produce the same result:

In [None]:
np.allclose(X_transformed, X_transformed_using_sklearn)

Note that scikit-learn normalizes by $N$ instead of $N-1$ when calculating the standard deviation, so the comparison above returns `False` when `np.std()` is called with `ddof=1`. To reproduce scikit-learn's result, set the parameter `ddof` to 0 in `np.std()`.
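
The effect of the normalization can be checked on a small example. This is a sketch on hypothetical random data (the array `sample` is not part of the notebook); the first comparison should print `True` and the second `False`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=(50, 2))   # hypothetical data

centered = sample - sample.mean(axis=0)
scaled_population = centered / sample.std(axis=0, ddof=0)   # divide by N
scaled_sample     = centered / sample.std(axis=0, ddof=1)   # divide by N - 1

scaled_sklearn = StandardScaler().fit_transform(sample)

print(np.allclose(scaled_population, scaled_sklearn))
print(np.allclose(scaled_sample, scaled_sklearn))
```

For large data sets the two conventions differ only by the factor $\sqrt{N/(N-1)}$, which is close to 1, but the difference is still larger than the default tolerance of `np.allclose`.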



#### 1.2.2 Step 2: Covariance matrix

The **covariance matrix** is a square matrix that captures the pairwise covariance between the components of a vector. For a dataset with $ n $ observations and $ d $ dimensions for the vector, represented as a matrix $ \mathbf{X} \in \mathbb{R}^{n \times d} $, the covariance matrix $ \boldsymbol{\Sigma} \in \mathbb{R}^{d \times d} $ is defined as:

$$
\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{X}) = \frac{1}{n - 1} \mathbf{X}^\top \mathbf{X}
$$

where $ \mathbf{X} $ is assumed to be **mean-centered**, i.e., each column (feature) has zero mean.

Each entry $ \Sigma_{ij} $ of the covariance matrix represents the covariance between feature $ i $ and feature $ j $:

$$
 \Sigma_{ij} = \mathrm{cov}(x_i, x_j) = \frac{1}{n - 1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)
$$

For example, for a 3-dimensional data set with 3 variables $x$, $y$, and $z$, $\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}$ with elements
$$
\boldsymbol{\Sigma} = 
\begin{bmatrix}
   \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\
   \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\
   \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \\
\end{bmatrix}
$$

In [None]:
# Calculate the covariance matrix
mean_vector = np.mean(X_transformed, axis=0)

covariance  = (X_transformed - mean_vector).T.dot((X_transformed - mean_vector)) / (points - 1)

print(f'Covariance matrix\n{covariance}')

We can alternatively use the numpy function `cov`

In [None]:
covariance = np.cov(X_transformed.T)

print(f'\nCovariance matrix\n{covariance}')

The covariance of a variable with itself is its variance, namely $\mathrm{cov}(a,a) = \mathrm{var}(a)$, so the diagonal elements of $\mathrm{cov}(\mathbf{X})$ are the variances of the initial variables. Since the covariance is commutative, i.e., $\mathrm{cov}(a,b) = \mathrm{cov}(b,a)$, the covariance matrix is symmetric about its main diagonal, which means that the upper and lower triangular portions are equal:
$$
\mathrm{cov}(\mathbf{X}) = 
\begin{bmatrix}
   \mathrm{var}(x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\
   \mathrm{cov}(x,y) & \mathrm{var}(y) & \mathrm{cov}(y,z) \\
   \mathrm{cov}(x,z) & \mathrm{cov}(y,z) & \mathrm{var}(z) \\
\end{bmatrix}
$$

Key properties:
- $ \boldsymbol{\Sigma} $ is symmetric: $ \Sigma_{ij} = \Sigma_{ji} $
- The diagonal elements $ \Sigma_{ii} $ represent the variances of individual features
- The matrix is positive semi-definite
- The sign of the covariance $\mathrm{cov}(a,b)$ between two variables $a$ and $b$ indicates how they vary together:
  - $\mathrm{cov}(a,b) > 0$: Both increase or decrease together (**correlated**)
  - $\mathrm{cov}(a,b) < 0$: One increases when the other decreases (**inversely correlated**)
  - $\mathrm{cov}(a,b) = 0$: The two variables are **uncorrelated**
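
The key properties listed above can be checked numerically. The sketch below uses hypothetical random data (the array `features` is an assumption made for illustration) in which one column is built to rise with the first column and another to fall with it.

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.normal(size=(200, 3))
features[:, 1] += 0.8 * features[:, 0]    # column 1 rises with column 0
features[:, 2] -= 0.8 * features[:, 0]    # column 2 falls as column 0 rises

sigma = np.cov(features, rowvar=False)

# Symmetry, variances on the diagonal, and non-negative eigenvalues
is_symmetric = np.allclose(sigma, sigma.T)
diagonal_is_variance = np.allclose(np.diag(sigma), features.var(axis=0, ddof=1))
smallest_eigenvalue = np.linalg.eigvalsh(sigma).min()

print(is_symmetric, diagonal_is_variance)
print(smallest_eigenvalue >= -1e-12)

# Signs of the off-diagonal entries follow the way the columns were built
print(np.sign(sigma[0, 1]), np.sign(sigma[0, 2]))
```

A positive semi-definite matrix has no negative eigenvalues; the small tolerance only allows for floating-point round-off.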


#### 1.2.3 Step 3: Eigenvectors and eigenvalues

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data.

What do we mean by principal components?

- Principal components are new variables constructed as linear combinations or mixtures of the initial variables.
- These combinations are such that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.
- So, the idea is that $p$-dimensional data give $p$ principal components, but PCA tries to put the maximum possible information in the first component, then the maximum of the remaining information in the second, and so on.
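
To illustrate that principal components are uncorrelated, the sketch below projects hypothetical correlated data onto the eigenvectors of its covariance matrix and compares the covariance matrices before and after the projection. The mixing matrix and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
mixing = np.array([[2.0, 0.0], [1.5, 0.5]])            # hypothetical mixing matrix
correlated = rng.normal(size=(500, 2)) @ mixing.T
correlated -= correlated.mean(axis=0)

# Eigenvectors of the covariance matrix define the new axes
eigenvalues_demo, eigenvectors_demo = np.linalg.eigh(np.cov(correlated, rowvar=False))
components = correlated @ eigenvectors_demo

cov_before = np.cov(correlated, rowvar=False)
cov_after = np.cov(components, rowvar=False)
print(np.round(cov_before, 3))
print(np.round(cov_after, 3))
```

After the projection, the off-diagonal entries vanish up to round-off and the diagonal holds the eigenvalues, so each new variable carries its own share of the total variance.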

In [None]:
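eigenvalues, eigenvectors = np.linalg.eig(covariance)

# eig does not guarantee any ordering, so sort by descending eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]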

print(f'eigenvalues  = {eigenvalues}')
print(f'eigenvectors =\n{eigenvectors}')

In [None]:
# Calculate the cumulative summation of the eigenvalues
cumulative = np.cumsum(eigenvalues)/np.sum(eigenvalues)

print(f'cumulative summation = {cumulative}\n')

plt.figure( figsize=(8, 8) )

plt.bar( np.arange(eigenvalues.size), 100*eigenvalues/np.sum(eigenvalues) , color=blue, label='Individual')
plt.plot(100*cumulative, color=orange, label='Cumulative', lw=4)

plt.xlabel('Principal components', fontsize=18)
plt.ylabel('Percentage of explained variance', fontsize=18)

plt.xticks(np.arange(eigenvalues.size), fontsize=18)
plt.yticks(np.arange(0, 110, 10), fontsize=18)

plt.legend(fontsize=18)

plt.show()

Keep in mind that there are as many principal components as variables in the data. Principal components are constructed in such a manner that the first principal component accounts for the **largest possible variance** in the data set.

For our simple 2D dataset, we can guess the first principal component: it is the line running from the lower left to the upper right of the scatter plot. It follows the trend of the blue points because it passes through the origin and is the line along which the projections of the points are most spread out. Mathematically speaking, it is the line that maximizes the variance (the average of the squared distances from the projected points to the origin).
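
The variance-maximizing property can be checked by brute force. The sketch below regenerates the same kind of 2D data (the variable names are hypothetical), projects it onto unit vectors at many angles, and compares the largest projected variance with the largest eigenvalue of the covariance matrix.

```python
import numpy as np

random_state_demo = np.random.RandomState(1)
cloud = np.dot(random_state_demo.rand(2, 2), random_state_demo.randn(2, 200)).T
cloud = (cloud - cloud.mean(axis=0)) / cloud.std(axis=0, ddof=1)

# Variance of the projection onto unit vectors at angles between 0 and 180 degrees
angles = np.linspace(0.0, np.pi, 721)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
projected_variance = (cloud @ directions.T).var(axis=0, ddof=1)

best_angle = angles[np.argmax(projected_variance)]
largest_eigenvalue = np.linalg.eigvalsh(np.cov(cloud, rowvar=False)).max()

print(np.degrees(best_angle))
print(projected_variance.max(), largest_eigenvalue)
```

No direction yields a projected variance above the largest eigenvalue, and the direction that attains it is the first principal axis. For two standardized variables this axis lies along one of the diagonals.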

#### 1.2.4 Step 4: Create the feature vector

We need to decide which principal components to keep.

Computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, we choose whether to keep all of these components or to discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors that we call the **feature vector**.

So, the feature vector is simply a matrix whose columns are the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only $p$ eigenvectors (components) out of $d$, the final data set will only have $p$ dimensions.
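
One common way to decide how many components to keep is to set a target for the cumulative explained variance. The sketch below assumes hypothetical eigenvalues and a 90% target; both are illustrative choices rather than values from this notebook.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
example_eigenvalues = np.array([4.2, 2.1, 1.0, 0.4, 0.3])

explained_ratio = example_eigenvalues / example_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)

threshold = 0.90   # assumed target: keep at least 90% of the total variance

# Index of the first cumulative value that reaches the threshold, plus one
n_keep = int(np.searchsorted(cumulative_ratio, threshold) + 1)

print(np.round(cumulative_ratio, 3))
print(n_keep)
```

Here the first three components are enough to reach the target, so the feature vector would contain three eigenvectors.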

In [None]:
# Each row of eigenvectors.T is one eigenvector; keep the first two
feature_vectors = eigenvectors.T[:2]

print(f'feature vectors =\n{feature_vectors}')

#### 1.2.5 Step 5: Recast data along principal components

In the previous steps, apart from standardization, we did not make any changes to the data. We just selected the principal components and built the feature vector. The input data set always remained in terms of the original axes (i.e., in terms of the initial variables).

In this last step, we use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Component Analysis). This can be done by multiplying the standardized data set by the transpose of the feature vector:
$$
\mathrm{PCA}_\mathrm{Dataset} = \mathrm{Standardized\, Original\, Dataset} \times \mathrm{Feature\, vectors}^\mathrm{T}
$$

In [None]:
X_pca = np.dot(X_transformed, feature_vectors.T)

To visualize the result, we can draw the principal axes as vectors, using the eigenvectors as the directions and the square roots of the eigenvalues (the standard deviations along each axis, scaled here by a factor of 3) as the lengths.

In [None]:
def draw_vector(v0, v1, ax=None):
    # Fall back to the current axes when none is given
    if ax is None:
        ax = plt.gca()

    arrowprops=dict(arrowstyle='->',
                    lw=4,
                    color=orange)
    
    ax.annotate('', v1, v0, arrowprops=arrowprops)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 8), layout='tight')

ax[0].scatter(X_transformed[:, 0], X_transformed[:, 1], color=blue, s=64, alpha=0.7)
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], color=blue, s=64, alpha=0.7)

for length, vector, identity in zip(eigenvalues, eigenvectors.T, np.identity(2)):
    draw_vector([0,0], 3.0*vector*np.sqrt(length),   ax=ax[0])
    draw_vector([0,0], 3.0*identity*np.sqrt(length), ax=ax[1])

ax[0].set_xlabel('x', fontsize=18)
ax[0].set_ylabel('y', fontsize=18)
ax[1].set_xlabel('PCA$_1$', fontsize=18)
ax[1].set_ylabel('PCA$_2$', fontsize=18)

# Use equal axis scaling in both panels so the arrows are not distorted
for axis in ax:
    axis.axis('equal')

plt.show()

These vectors represent the principal axes of the data, and the length of each vector indicates how important that axis is in describing the distribution of the data; more precisely, it measures the variance of the data when projected onto that axis. The projections of each data point onto the principal axes are the *principal components* of the data.
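
As a cross-check, the manual procedure can be compared with scikit-learn's `PCA` class. This is a sketch under stated assumptions: the data are regenerated with hypothetical variable names, and the projections are compared by absolute value because the sign of each principal axis is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

random_state_check = np.random.RandomState(1)
data = np.dot(random_state_check.rand(2, 2), random_state_check.randn(2, 200)).T
data_standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Manual route: eigen-decomposition of the covariance matrix, sorted descending
values, vectors = np.linalg.eigh(np.cov(data_standardized, rowvar=False))
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]
manual_projection = data_standardized @ vectors

# scikit-learn route: PCA centers the data itself but does not rescale it
pca = PCA(n_components=2)
sklearn_projection = pca.fit_transform(data_standardized)

print(np.allclose(values, pca.explained_variance_))
print(np.allclose(np.abs(manual_projection), np.abs(sklearn_projection)))
```

Because `PCA` only centers its input, the standardization step still has to be applied beforehand if the features are to contribute equally.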

## 2. Principal Component Analysis using Chemical Data

Let's start by creating a pandas DataFrame containing the properties of the chemical elements.

In [None]:
periodic_table = fetch_table('elements').select_dtypes([np.number])

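# iloc is 0-based, so rows 1 to 102 are selected. If the rows are ordered by
# atomic number, this spans helium (Z = 2) to lawrencium (Z = 103), the last
# of the actinides; use iloc[:103] instead if hydrogen should be included
periodic_table = periodic_table.iloc[1:103, :]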

# Drop the columns that include incomplete data
periodic_table = periodic_table.dropna(axis=1)

periodic_table.describe()

### 2.1 Standardize the data

Recall that the purpose of this procedure is to make all variables contribute equally to the analysis. For example, the atomic number ranges from 1 to 118, while the atomic radius ranges from 0.25 Å for hydrogen to 2.65 Å for cesium. Hence the atomic number would dominate over the atomic radius, which would lead to biased results. Transforming the data to comparable scales prevents this problem.

> ### Assignment
>
> Standardize the data and save it to the variable `periodic_table_transformed`

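We can alternatively standardize the data with the function `zscore` from the module `scipy.stats` (part of SciPy, not scikit-learn), which by default normalizes the standard deviation by $N$ (`ddof=0`), in the form

~~~
from scipy.stats import zscore
periodic_table.apply(zscore).describe()
~~~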
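
The equivalence between `zscore` and an explicit pandas expression can be sketched on a small table. The DataFrame `toy_table` and its values are hypothetical stand-ins, not data from the periodic table.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical table standing in for a numeric DataFrame of element properties
toy_table = pd.DataFrame({
    'property_a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'property_b': [0.31, 1.28, 0.96, 0.85, 0.82],
})

# Explicit standardization with the population standard deviation (ddof=0)
standardized_pandas = (toy_table - toy_table.mean()) / toy_table.std(ddof=0)

# Column-wise z-scores; zscore also uses ddof=0 by default
standardized_scipy = toy_table.apply(zscore)

print(np.allclose(standardized_pandas, standardized_scipy))
```

Note that pandas uses `ddof=1` by default in `std()` and `describe()`, so the standard deviation reported by `describe()` after applying `zscore` is slightly larger than 1.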

### 2.2 Calculate the covariance matrix

We can use the built-in function in the pandas library to calculate the covariance matrix

In [None]:
covariance = periodic_table_transformed.cov()

By means of a heatmap we can visualize and inspect the covariance matrix of the periodic table

In [None]:
fig, ax = plt.subplots( figsize=(10, 8) )

sns.heatmap(covariance, ax=ax, cmap='coolwarm', cbar=True,
        xticklabels=covariance.columns,
        yticklabels=covariance.columns)

plt.show()
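
A side note on interpreting such a heatmap: when the data are standardized with the $N-1$ normalization, the covariance matrix coincides with the Pearson correlation matrix of the raw data, so all entries lie between -1 and 1. The sketch below checks this on a hypothetical random table; the DataFrame `raw` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
raw = pd.DataFrame(rng.normal(size=(60, 3)), columns=['a', 'b', 'c'])
raw['c'] = 10.0 * raw['a'] + raw['c']        # give one column a much larger scale

# Standardize with ddof=1, the same convention pandas uses in cov()
standardized = (raw - raw.mean()) / raw.std(ddof=1)

print(np.allclose(standardized.cov(), raw.corr()))
print(np.allclose(np.diag(standardized.cov()), 1.0))
```

If the data were instead standardized with `ddof=0`, the diagonal entries would equal $N/(N-1)$ rather than exactly 1.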

### 2.3 Compute the eigenvectors and eigenvalues of the covariance matrix

> ### Assignment
>
> Calculate the eigenvalues and eigenvectors. Save your results to the variables `eigenvalues` and `eigenvectors`

In [None]:
# Calculate the cumulative summation of the eigenvalues
cumulative = np.cumsum(eigenvalues)/np.sum(eigenvalues)

plt.figure( figsize=(8, 8) )

plt.bar( np.arange(eigenvalues.size), 100*eigenvalues/np.sum(eigenvalues) , color=blue, label='Individual')
plt.plot(100*cumulative, color=orange, label='Cumulative', lw=4)

plt.xlabel('Principal components', fontsize=18)
plt.ylabel('Percentage of explained variance', fontsize=18)

plt.legend(loc='best', fontsize=18)

plt.show()

### 2.4 Create the feature vectors

> ### Assignment
>
> Generate the feature vectors using the first two principal components. Save your result to the variable `feature_vectors`

### 2.5 Recast data along the principal component axes

> ### Assignment
>
> Project the data to its principal component axes and save your results to the variable `periodic_table_pca`

In [None]:
def draw_vector(v0, v1, ax=None):

    arrowprops=dict(arrowstyle='->',
                    linewidth=4,
                    color=orange)
    
    ax.annotate('', v1, v0, arrowprops=arrowprops)


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 8), layout='tight')

ax[0].scatter(periodic_table.vdw_radius,
              periodic_table.dipole_polarizability,
              s=periodic_table.index, color=blue, alpha=0.7)

ax[1].scatter(periodic_table_pca.loc[:, 0],
              periodic_table_pca.loc[:, 1],
              s=periodic_table_pca.index, color=blue, alpha=0.7)

for length, vector in zip(eigenvalues, np.identity(2)):
    v = vector*np.sqrt(length)
    draw_vector([0,0], v, ax=ax[1])

ax[0].set_xlabel('vdW radius', fontsize=18)
ax[0].set_ylabel('Dipole polarizability', fontsize=18)

ax[1].set_xlabel('PCA1', fontsize=18)
ax[1].set_ylabel('PCA2', fontsize=18)

plt.axis('equal')

plt.show()