Principal Component Analysis (PCA) is one of the most commonly used statistical methods for pattern recognition and dimensionality reduction on data defined in high dimensions. It was introduced by Karl Pearson in 1901[^1]. Later, as computers came into practical use, it also became one of the most fundamental methods of machine learning.
One of the most promising ways to represent a subject more precisely is to use more variables, in other words, to represent the subject in a higher-dimensional space. For this reason, these variables or dimensions are also referred to as features.
On the other hand, data expressed in such high dimensions may not carry essential features in every dimension. When the data is redundant in its defined dimensions and can actually be represented in fewer dimensions, dimensionality reduction may capture the subject more essentially.
A dataset is a set of partially or fully sampled data about a subject.
Let the dataset consist of $N$ samples in $D$ dimensions, and denote the value of the $i$-th dimension of the $n$-th sample by $x_{ni}$. As a preparation for later, denote the $i$-th dimension (variable) of the dataset by $X_i$ and the whole dataset by the $N \times D$ matrix $X$ whose $(n, i)$ component is $x_{ni}$.
The vector and matrix representations defined here are used in the following calculations.
In preparation for PCA, we introduce the statistics that evaluate the features contained in the dataset, namely, variance and covariance.
The variance of a dimension $X_i$ is the squared distance from the mean, summed over the samples and averaged:

$$\operatorname{Var}(X_i) = \frac{1}{N-1} \sum_{n=1}^{N} (x_{ni} - \bar{x}_i)^2,$$

where $\bar{x}_i = \frac{1}{N} \sum_{n=1}^{N} x_{ni}$ is the mean of dimension $X_i$ (the divisor $N-1$ gives the unbiased sample variance).
Since the variance is a mean of squared distances, its square root, the standard deviation $\sigma_i = \sqrt{\operatorname{Var}(X_i)}$, has the same scale (units) as the original variable.
When the variance is small, the variable represents poor features. In particular, if the variance is zero, the variable can be ignored. However, since random noise also has a certain variance, a large variance does not always mean that the features are well represented by the variable. In high-dimensional datasets, not all dimensions are expressed on the same scale, so the values are often scaled based on the standard deviation or on prior knowledge of the variable.
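As a minimal illustration (with a small, hypothetical one-dimensional sample), the variance and standard deviation can be computed with numpy as follows; note that `ddof=1` selects the $N-1$ divisor used throughout this article.

```python
import numpy as np

# A small, hypothetical sample of one variable.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance with the N - 1 divisor (ddof=1), matching the covariance
# matrix computed later as X.T @ X / (N - 1).
var = np.var(x, ddof=1)

# The standard deviation has the same units as the variable itself.
std = np.sqrt(var)
print(var, std)
```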
If the dataset is prepared by sampling a subset of the original data, a variable that originally represents the features of the subject well may be evaluated as having a small variance due to sampling bias, or vice versa. This is an essential issue when analyzing a subset of the dataset of interest. We should consider eliminating the bias by resampling or, if possible, by treating all the data without sampling; however, this is often difficult in practice.
While variance evaluates the behavior of each dimension of the dataset, covariance evaluates the association between the dimensions of the dataset.
The covariance of two dimensions $X_i$ and $X_j$ is defined as

$$\operatorname{Cov}(X_i, X_j) = \frac{1}{N-1} \sum_{n=1}^{N} (x_{ni} - \bar{x}_i)(x_{nj} - \bar{x}_j).$$

We can say that the covariance of a dimension with itself is its variance, $\operatorname{Cov}(X_i, X_i) = \operatorname{Var}(X_i)$.
Unlike variance, covariance can also take negative values. The covariance is

- zero if $X_i$ and $X_j$ are independent,
- positive if $X_i$ and $X_j$ tend to increase or decrease together,
- negative if, when one increases, the other decreases.
Therefore, if the absolute value of the covariance of two dimensions is large, we can say that the two dimensions share certain features, possibly with the sign flipped.
The matrix whose $(i, j)$ component is $\operatorname{Cov}(X_i, X_j)$ is called the covariance matrix, denoted here by $C$. The covariance matrix for a dataset defined in $D$ dimensions is a $D \times D$ square matrix. Since we assumed real-valued data and $\operatorname{Cov}(X_i, X_j) = \operatorname{Cov}(X_j, X_i)$ by definition, the covariance matrix is a real symmetric matrix.
By definition, the values of variances and covariances do not depend on the mean of each dimension.
Therefore, for each dimension we may subtract the mean $\bar{x}_i$ from the values in advance, so that every dimension has zero mean (mean centering). In this case, the covariance is

$$\operatorname{Cov}(X_i, X_j) = \frac{1}{N-1} \sum_{n=1}^{N} x_{ni} x_{nj},$$

and the covariance matrix can be written with the centered data matrix $X$ as

$$C = \frac{1}{N-1} X^{\top} X.$$
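The centered form is easy to check numerically. The following sketch, using a small, made-up two-dimensional dataset, compares the explicit formula $C = \frac{1}{N-1} X^{\top} X$ with numpy's `np.cov`, which uses the same $N-1$ divisor by default.

```python
import numpy as np

# A small, made-up dataset: N = 5 samples, D = 2 dimensions.
X = np.array([[2.0, 1.0],
              [3.0, 4.0],
              [5.0, 4.0],
              [6.0, 7.0],
              [9.0, 9.0]])
N = X.shape[0]

# Mean centering
X = X - X.mean(axis=0)

# Covariance matrix from the centered data matrix.
C = X.T @ X / (N - 1)

# np.cov expects variables in rows by default, hence rowvar=False.
assert np.allclose(C, np.cov(X, rowvar=False))
```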
While we could perform a simple dimensionality reduction by removing poorly featured dimensions based on statistics such as variance and covariance, or by removing dimensions whose features overlap, we can take this idea further and project all variables of the dataset onto new variables such that the variances are maximized.
The new features obtained by the projection will also generally be of high dimension, and the new features can be selected in the order of largest variances. This is the basic idea of PCA.
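The connection between "maximize the variance" and the eigendecomposition used below can be summarized as follows (a standard derivation, stated here for reference): for a centered data matrix $X$, the first projection direction is

$$\mathbf{w}_1 = \underset{\|\mathbf{w}\| = 1}{\arg\max}\; \operatorname{Var}(X\mathbf{w}) = \underset{\|\mathbf{w}\| = 1}{\arg\max}\; \mathbf{w}^{\top} C\, \mathbf{w},$$

and maximizing this quantity gives the eigenvector of $C$ with the largest eigenvalue; subsequent directions maximize the variance subject to being orthogonal to the previous ones.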
First, we take Pearson's two-dimensional dataset as an example[^1].
For this dataset, we perform mean centering. That is, we subtract the mean of each dimension from the values of that dimension.
Figure 1. Pearson's dataset centered by the mean.
Calculating the covariance matrix of the centered dataset yields a $2 \times 2$ matrix $C$. Since its off-diagonal components, the covariances, are not zero, the two dimensions are correlated and share features.
The essential computation of PCA is the eigendecomposition of the covariance matrix $C$, that is, finding the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{w}_i$ satisfying

$$C \mathbf{w}_i = \lambda_i \mathbf{w}_i,$$

where $\lambda_i$ is an eigenvalue and $\mathbf{w}_i$ is the corresponding eigenvector; we take the eigenvalues in descending order, $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_D \geq 0$.
The covariance matrix $C$ is a real symmetric matrix, that is, a Hermitian matrix. It is known that the eigenvectors of a Hermitian matrix are orthogonal to each other, and since eigenvectors are determined only up to a scalar multiple, we can take orthonormal vectors as the eigenvectors.
In fact, for Pearson's dataset, we obtain two eigenvalues and a pair of orthonormal eigenvectors.
Let $W$ be the $D \times D$ matrix whose columns are the eigenvectors $\mathbf{w}_1, \dots, \mathbf{w}_D$, arranged in descending order of their eigenvalues. Since we take orthonormal vectors as eigenvectors, $W$ is an orthogonal matrix:

$$W^{\top} W = W W^{\top} = I.$$

Let $\Lambda$ be the diagonal matrix whose diagonal components are the eigenvalues $\lambda_1, \dots, \lambda_D$ in the same order. Using $W$ and $\Lambda$, the eigenvalue equations can be written together as $C W = W \Lambda$, and hence

$$C = W \Lambda W^{\top}.$$

This is the eigendecomposition of $C$.
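The orthonormality of $W$ and the decomposition $C = W \Lambda W^{\top}$ can be verified numerically; the sketch below assumes a covariance matrix `C` is available as a numpy array, for instance the one computed in the snippet above.

```python
import numpy as np

# Eigendecomposition of the (symmetric) covariance matrix C.
L, W = np.linalg.eig(C)

# The eigenvectors returned by numpy are normalized, and for a symmetric
# matrix with distinct eigenvalues they are orthogonal, so W is orthogonal.
assert np.allclose(W.T @ W, np.eye(C.shape[0]))

# C = W diag(L) W.T reconstructs the covariance matrix.
assert np.allclose(W @ np.diag(L) @ W.T, C)
```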
The components of the covariance matrix are thus decomposed into the eigenvalues in $\Lambda$ and the orthonormal eigenvectors in $W$.
In the case of a two-dimensional dataset, the projection to the same two dimensions by PCA gives a coordinate transformation on the same plane.
Figure 2 shows the centered coordinates of Pearson's dataset together with the directions of the eigenvectors $\mathbf{w}_1$ and $\mathbf{w}_2$.
Figure 2. Variance is maximized by projection.
It can be seen that the direction of $\mathbf{w}_1$ is the direction along which the variance of the data is maximized.
The projection of the dataset is obtained by using the matrix $W$ as $T = X W$.
Figure 3. The projected dataset.
The ratio of the variance of a projected dimension to the sum of the variances of all dimensions is equal to the ratio of the corresponding eigenvalue to the sum of all eigenvalues, $\lambda_i / \sum_j \lambda_j$.
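One way to see this, as a brief derivation from the definitions above: the covariance matrix of the projected dataset $T = XW$ is

$$\frac{1}{N-1} T^{\top} T = \frac{1}{N-1} W^{\top} X^{\top} X W = W^{\top} C W = \Lambda,$$

so the variance of the $i$-th projected dimension is exactly $\lambda_i$, and the total variance $\sum_i \lambda_i = \operatorname{tr}(\Lambda) = \operatorname{tr}(C)$ is the same as that of the original dimensions.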
The inverse transformation of the projection is given by $X = T W^{\top}$, which follows from the orthogonality $W W^{\top} = I$.
The calculation of PCA is based on eigendecomposition of a matrix.
Here is an example using the Python numerical package `numpy`.
```python
import numpy as np

# X is an (N, D) array holding the dataset; N is the number of samples.
N = X.shape[0]

# Mean centering
mu = np.mean(X, axis=0)
X = X - mu

# Covariance matrix (D x D)
C = X.T @ X / (N - 1)

# Eigendecomposition
L, W = np.linalg.eig(C)

# Sort eigenvalues and eigenvectors in descending order of the eigenvalues.
order = np.flip(np.argsort(L))
L, W = L[order], W[:, order]

# Projection
T = X @ W
```
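The snippet above assumes that the dataset is already available as an `(N, D)` array `X`. A minimal, hypothetical usage example that prepares such an array, repeats the same steps, and checks the variance ratios and the inverse transformation might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up dataset: N = 100 samples of two correlated variables.
N = 100
x1 = rng.normal(size=N)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=N)
X = np.column_stack([x1, x2])

# Same steps as above.
X = X - np.mean(X, axis=0)
C = X.T @ X / (N - 1)
L, W = np.linalg.eig(C)
order = np.flip(np.argsort(L))
L, W = L[order], W[:, order]
T = X @ W

# The first projected dimension should carry most of the variance,
# and the inverse transformation recovers the centered data.
print(L / L.sum())
assert np.allclose(T @ W.T, X)
```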
The computational complexity of the eigendecomposition of a square matrix of order $D$ is $O(D^3)$, so for high-dimensional datasets the singular value decomposition (SVD) of the data matrix is often used instead of forming and decomposing the covariance matrix explicitly. The SVD of the centered data matrix $X$ is

$$X = U S V^{\top},$$

where the columns of $U$ are orthonormal, $S$ is the diagonal matrix whose diagonal components are the singular values $s_1 \geq s_2 \geq \dots \geq 0$, and the columns of $V$ are orthonormal and coincide with eigenvectors of the covariance matrix $C$.
```python
# Singular value decomposition of the centered data matrix X.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values are already sorted in descending order.
assert np.allclose(S, np.flip(np.sort(S)))

# Projected coordinates T = U S (scale each column of U by its singular value).
T = U * S

# Equivalently, U S = X V, so the projection can also be written as:
T = X @ Vt.T
```
The eigenvalue $\lambda_i$ of the covariance matrix is related to the corresponding singular value by $\lambda_i = s_i^2 / (N - 1)$. Therefore, we can also calculate the ratio of the variance of each projected dimension from the singular values.
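For example, continuing from the SVD snippet above (so `S` and `N` are assumed to be defined there):

```python
# Eigenvalues of the covariance matrix from the singular values.
eigvals = S ** 2 / (N - 1)

# The (N - 1) factors cancel in the ratio, so the variance ratios
# can be computed from the singular values alone.
explained_variance_ratio = eigvals / eigvals.sum()
```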
In the first example, we visually confirmed the projection on a two-dimensional dataset. Next, we show an example in which the result of the projection can be used to discriminate labels that are not used in the calculation, i.e., unsupervised learning or, in more recent terminology, self-supervised learning.
We take Fisher's iris dataset as an example[^2].
This famous dataset measures four items (sepal length, sepal width, petal length, and petal width) for three iris species: setosa, versicolor, and virginica.
Again, we denote the $i$-th dimension of the dataset by $X_i$, so that $X_1, \dots, X_4$ correspond to sepal length, sepal width, petal length, and petal width, respectively. Although we cannot visually check the whole dataset since it is four-dimensional, if we represent it in two dimensions with $X_1$ (sepal length) and $X_2$ (sepal width), we obtain Figure 4.
Figure 4. Iris dataset plotted by $X_1$ (sepal length) and $X_2$ (sepal width).
The covariance matrix $C$ of the mean-centered iris dataset is a $4 \times 4$ matrix. Looking at its components, we can see that some pairs of dimensions have covariances with large absolute values, suggesting that the features overlap and the dataset may be represented in fewer dimensions.
As a result of the eigendecomposition of $C$, we obtain four eigenvalues and the projection matrix $W$ whose columns are the corresponding eigenvectors. The ratios of the eigenvalues to their sum are 0.925, 0.053, 0.017, and 0.005. These ratios are equal to the ratios of the variances of the projected dimensions, which indicates that the dimensions corresponding to the first two eigenvalues contribute over 95% of the total variance. The plot of these two dimensions is shown in Figure 5.
Figure 5. Iris dataset projected by PCA and plotted in two dimensions with large variance.
Although we did not use labels indicating the three species of iris when performing the projection, we can see that compared to Figure 4, Figure 5 better reflects the species of iris.
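For reference, the variance ratios quoted above can be reproduced with a few lines of code; the sketch below assumes scikit-learn is available and uses its bundled copy of the iris dataset together with its `PCA` implementation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the 150 x 4 iris measurements (labels are not used for the projection).
X = load_iris().data

pca = PCA(n_components=4)
T = pca.fit_transform(X)

# Approximately [0.92, 0.05, 0.02, 0.01], matching the ratios above.
print(pca.explained_variance_ratio_)
```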
The projection matrix $W$ shows how strongly each original dimension contributes to each projected dimension. Since the projection onto the first column of $W$, the direction with the largest variance, gives large weights to the petal measurements, we can expect these dimensions alone to separate the species fairly well. In fact, looking at Figure 6, which plots petal length ($X_3$) against petal width ($X_4$), we can see that the species are separated almost as well as in Figure 5.
Figure 6. Iris dataset plotted by $X_3$ (petal length) and $X_4$ (petal width).
Applications of PCA include the quantitative structure-activity relationship (QSAR) of small molecules and the classification of diseases based on gene expression levels.
In 1999, Golub et al. classified two leukemia phenotypes, acute lymphocytic leukemia (ALL) and acute myeloid leukemia (AML), based on gene expression levels using clustering[^3]. The training dataset consisted of about 7,000 gene expression levels in 38 patients. Here, we applied PCA to this dataset without using the labels, i.e., as unsupervised learning. The result of the projection is shown in Figure 7.
Figure 7. Projection by PCA of gene expression dataset for ALL and AML patients.
It shows that the projection separates ALL and AML almost linearly in the training dataset.
While a gene expression dataset is high-dimensional in the number of genes, it is not uncommon for the number of samples to be much smaller than the number of genes due to the constraints of clinical studies. In such cases, PCA is worth considering, since it can extract effective features by dimensionality reduction.
QSAR models properties based on the substructures of small molecules, so it has had to handle high-dimensional datasets from early on. Before deep learning, neural networks and kernel methods were known methods for QSAR analysis.
Later, this feature extraction would be performed by deep learning. In particular, we note that PCA is equivalent to a linear autoencoder.
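To make the equivalence a little more concrete (stated here informally): a linear autoencoder with a $K$-dimensional bottleneck minimizes the reconstruction error

$$\min_{A \in \mathbb{R}^{D \times K},\; B \in \mathbb{R}^{K \times D}} \; \| X - X A B \|_F^2,$$

and the optimum is attained when $A B$ is the orthogonal projection onto the subspace spanned by the first $K$ eigenvectors of the covariance matrix, i.e. $A B = W_K W_K^{\top}$. In this sense, training a linear autoencoder with squared error recovers the same subspace as PCA, although the individual columns of $A$ need not be the principal components themselves.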
Footnotes

[^1]: K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2, 559-572, 1901.
[^2]: R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7 (2), 179-188, 1936.
[^3]: T.R. Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 286, 531-537, 1999.