In [1]:
import numpy as np
import pandas as pd
from scipy import linalg as la

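# sort the entries within each column, then within each row; used below to
# compare two eigenvector matrices regardless of the ordering of the vectors
def sortMatrix(matrix):
    sorted_matrix = np.sort(np.sort(matrix, axis=0), axis=1)
    return sorted_matrix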

1\. **PCA on 3D dataset**

* Generate a dataset simulating 3 features, each with N entries (N being ${\cal O}(1000)$). Each feature is made of random numbers generated according to the normal distribution $N(\mu,\sigma)$ with mean $\mu_i$ and standard deviation $\sigma_i$, with $i=1, 2, 3$. Generate the 3 variables $x_{i}$ such that:
    * $x_1$ is distributed as $N(0,1)$
    * $x_2$ is distributed as $x_1+N(0,3)$
    * $x_3$ is given by $2x_1+x_2$
* Find the eigenvectors and eigenvalues using the eigendecomposition of the covariance matrix
* Find the eigenvectors and eigenvalues using the SVD. Check that the two procedures yield the same result
* What percentage of the dataset's total variability is explained by each principal component? Given how the dataset was constructed, do these values make sense? Reduce the dimensionality of the system so that at least 99% of the total variability is retained
* Redefine the data according to the new basis from the PCA
* Plot the data, in both the original and the new basis. The figure should have 2 rows (the original and the new basis) and 3 columns (the $[x_0, x_1]$, $[x_0, x_2]$ and $[x_1, x_2]$ projections) of scatter plots.
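
The link between the two procedures in the list above can be written out explicitly. A short derivation, assuming the data are arranged as a $3\times N$ matrix $X$ with one feature per row and that the mean of each row has been subtracted (the symbols $X$, $U$, $S$, $W$ and $C$ are notation introduced here, not part of the exercise text):

$$
C = \frac{1}{N-1}\, X X^T, \qquad X = U S W^T \quad\Longrightarrow\quad C = U\, \frac{S S^T}{N-1}\, U^T
$$

The columns of $U$ are therefore eigenvectors of $C$, with eigenvalues $\lambda_i = s_i^2/(N-1)$, where $s_i$ are the singular values. Eigenvectors are defined only up to a sign, and the two routines may return them in a different order, so a comparison has to allow for both.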

In [20]:
# initial parameters
N, mu, std1, std2 = 1000, 0, 1, 3

# gaussian datasets (all x have zero mean)
x1 = np.random.normal(mu, std1, N)
x2 = x1 + np.random.normal(mu, std2, N)
x3 = 2*x1 + x2  # exact linear combination of x1 and x2

A = np.array([x1, x2, x3]) # shape 3 x N
print(A.shape)

# data set
df = pd.DataFrame()
df['features1'] = x1
df['features2'] = x2
df['features3'] = x3
df

(3, 1000)


| | features1 | features2 | features3 |
|---|---|---|---|
| 0 | -0.451976 | -0.392583 | -1.296534 |
| 1 | -0.546393 | -0.031387 | -1.124174 |
| 2 | 3.604273 | -0.363413 | 6.845134 |
| 3 | 0.124893 | 0.429398 | 0.679183 |
| 4 | -0.302145 | -2.756121 | -3.360412 |
| ... | ... | ... | ... |
| 995 | 0.644128 | 1.900794 | 3.189049 |
| 996 | 0.818661 | -0.608208 | 1.029113 |
| 997 | 0.010022 | -1.346789 | -1.326745 |
| 998 | 0.353894 | -1.661504 | -0.953717 |


In [24]:
# eigendecomposition of the covariance matrix
C = np.cov(A)  # np.cov subtracts the sample mean of each row
l1, V1 = np.linalg.eig(C)  # eigenvalues and eigenvectors (columns of V1), returned in no particular order

# singular value decomposition
# note: A is not centred here; the features have zero mean only in expectation,
# so the SVD results agree with the covariance ones only approximately
U, s, Vt = np.linalg.svd(A)

# eigenvalues and eigenvectors (columns of U), sorted by decreasing eigenvalue
l2 = s**2 / (N-1)
V2 = U

# check the correspondence, ignoring the sign and the ordering of the eigenvectors
if np.allclose(sortMatrix(np.abs(V1)), sortMatrix(np.abs(V2)), rtol=1e-03):
    print('The results match:\n')
else:
    print('The results do not match:\n')

# results
print('Eigenvalues with Eigendecomposition:\n', l1, '\n')
print('Eigenvalues with SVD:\n', l2, '\n')
print('Eigenvectors with Eigendecomposition:\n', V1, '\n')
print('Eigenvectors with SVD:\n', V2, '\n\n\n')

Lambda = np.diag(l2)
print('By selecting the component 0, we retain %.2f%% of the total variability' % (Lambda[0, 0]/Lambda.trace()*100))
print('By selecting the component 1, we retain %.2f%% of the total variability' % (Lambda[1, 1]/Lambda.trace()*100))
print('By selecting the component 2, we retain %.2f%% of the total variability' % (Lambda[2, 2]/Lambda.trace()*100))


# reduce the dimension of the system by keeping only the first two components
Lambda_reduced = Lambda[:2, :2]
print('\nBy reducing the system, we retain %.2f%% of the total variability\n\n\n' % (Lambda_reduced.trace()/Lambda.trace()*100))


# redefine the data in the basis of the first two principal components
# (eigenvectors are the columns of V2, already sorted by decreasing eigenvalue)
V2_r = V2[:, :2]
PCA_A = np.dot(V2_r.T, A)  # shape 2 x N

The results match:

Eigenvalues with Eigendecomposition:
 [ 2.68964159e+01 -1.43483648e-15  2.03880981e+00]

Eigenvalues with SVD:
 [2.68995414e+01 2.03995399e+00 2.90321487e-30]

Eigenvectors with Eigendecomposition:
 [[-0.12012795 -0.81649658  0.56471463]
 [-0.57150339 -0.40824829 -0.71184072]
 [-0.81175929  0.40824829  0.41758853]]

Eigenvectors with SVD:
 [[-0.12017091  0.56470549 -0.81649658]
 [-0.57144924 -0.71188419 -0.40824829]
 [-0.81179106  0.41752678  0.40824829]]



By selecting the component 0, we retain 92.95% of the total variability
By selecting the component 1, we retain 7.05% of the total variability
By selecting the component 2, we retain 0.00% of the total variability

By reducing the system, we retain 100.00% of the total variability
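
One way to judge whether these fractions make sense is to compare them with the eigenvalues of the exact (population) covariance matrix implied by the construction. The sketch below assumes the generating model stated in the exercise, i.e. $x_2 = x_1 + n$ with $n \sim N(0,3)$ independent of $x_1$, so that $x_3 = 3x_1 + n$; the names `C_exact`, `evals` and `fractions` are introduced here for illustration.

```python
import numpy as np

# Population covariance implied by x1 ~ N(0,1), x2 = x1 + n, x3 = 2*x1 + x2 = 3*x1 + n,
# with n ~ N(0,3) independent of x1 (assumption taken from the exercise statement):
#   Var(x1) = 1,   Var(x2) = 1 + 9 = 10,   Var(x3) = 9 + 9 = 18
#   Cov(x1,x2) = 1, Cov(x1,x3) = 3,        Cov(x2,x3) = 3 + 9 = 12
C_exact = np.array([[1.0,  1.0,  3.0],
                    [1.0, 10.0, 12.0],
                    [3.0, 12.0, 18.0]])

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
evals, evecs = np.linalg.eigh(C_exact)
evals = evals[::-1]                # largest first
fractions = evals / evals.sum()    # share of the total variance (the trace, 29)

print(evals)       # approximately [27, 2, 0]
print(fractions)   # approximately [27/29, 2/29, 0]
```

Under these assumptions the exact eigenvalues are 27, 2 and 0, i.e. about 93.1% and 6.9% of the total variance, close to the sample values printed above. The zero eigenvalue reflects that $x_3$ is an exact linear combination of $x_1$ and $x_2$, so two components already retain all the variability.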




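The last two items of the exercise (rotating the data into the PCA basis and plotting both bases) are not carried out in the cell above. A minimal sketch of one way to do it, assuming a $3\times N$ data matrix laid out like `A`; the helper name `plot_projections`, the fixed seed and the regenerated data are choices made here so that the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_projections(data, rotated):
    """Scatter the three pairwise projections of two 3 x N arrays on a 2 x 3 grid."""
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    pairs = [(0, 1), (0, 2), (1, 2)]
    for row, (arr, label) in enumerate([(data, 'x'), (rotated, 'pc')]):
        for col, (i, j) in enumerate(pairs):
            ax = axes[row, col]
            ax.scatter(arr[i], arr[j], s=5, alpha=0.5)
            ax.set_xlabel(f'{label}{i}')
            ax.set_ylabel(f'{label}{j}')
    fig.tight_layout()
    return fig, axes

# regenerate a dataset with the same recipe as above (seeded only for reproducibility)
rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 1000)
x2 = x1 + rng.normal(0, 3, 1000)
x3 = 2*x1 + x2
data = np.array([x1, x2, x3])

# centre each feature, then rotate into the basis of the left singular vectors
centred = data - data.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
rotated = U.T @ centred   # rows are the principal components, largest variance first

# top row: original basis, bottom row: PCA basis
fig, axes = plot_projections(data, rotated)
```

In the rotated data the third component should be numerically zero, so the two bottom-row panels that involve `pc2` are expected to collapse onto a line.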

2\. **PCA on a nD dataset**

* Start from the dataset you have generated in the previous exercise and add uncorrelated random noise. The noise should be represented by 10 additional uncorrelated variables, normally distributed, with a standard deviation much smaller (e.g. by a factor of 20) than those used to generate $x_1$ and $x_2$. Repeat the PCA procedure and compare the results with what you obtained before.
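
A possible starting point for this exercise, under stated assumptions: the noise standard deviation is taken as 1/20 (a factor of 20 below $\sigma_1 = 1$; the exercise leaves the exact value open), and the seed and the names `noise`, `X` and `explained` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)   # seeded only for reproducibility
N = 1000

# the three features from the previous exercise
x1 = rng.normal(0, 1, N)
x2 = x1 + rng.normal(0, 3, N)
x3 = 2*x1 + x2

# 10 extra uncorrelated noise features; sigma = 1/20 is an assumed choice
noise = rng.normal(0, 1/20, size=(10, N))
X = np.vstack([x1, x2, x3, noise])   # shape 13 x N

# PCA via eigendecomposition of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(X))      # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]    # largest eigenvalue first

explained = evals / evals.sum()
print(np.round(100 * explained, 3))
print('first two components: %.2f%%' % (100 * explained[:2].sum()))
```

Because the noise variances (about 0.0025 each) are tiny compared with the two leading eigenvalues, the first two components are expected to still carry well over 99% of the variability, with the noise showing up as a group of small, similar eigenvalues.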

3\. **Optional**: **PCA on the MAGIC dataset**

Perform a PCA on the magic04.data dataset.

In [None]:
# get the dataset and its description on the proper data directory
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data -P data/
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names -P data/ 
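
A sketch of how the PCA might be set up once the file has been downloaded. It assumes the file sits at `data/magic04.data` as in the commented `wget` lines, that it is a header-less CSV whose first 10 columns are numeric features and whose last column is the class label (to be checked against `magic04.names`), and it standardises the features first because they are measured in different units. The helper `explained_variance` is hypothetical, written here for illustration.

```python
import os
import numpy as np
import pandas as pd

def explained_variance(features):
    """Eigenvalues (descending) and explained-variance fractions for an
    n_samples x n_features array, after standardising each column."""
    z = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
    evals = np.linalg.eigvalsh(np.cov(z, rowvar=False))[::-1]
    return evals, evals / evals.sum()

path = 'data/magic04.data'   # assumed location, matching the wget -P data/ lines above
if os.path.exists(path):
    df = pd.read_csv(path, header=None)
    # assumption: the first 10 columns are the features, the last one is the class label
    features = df.iloc[:, :10].to_numpy(dtype=float)
    evals, frac = explained_variance(features)
    print(np.cumsum(frac))   # cumulative explained variance per number of components
```

Standardising is a design choice: without it, the features with the largest numerical range would dominate the leading components.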