In [45]:
import numpy as np
from scipy import linalg as la
import matplotlib.pyplot as plt

1\. PCA on a 3D dataset

* Generate a dataset simulating 3 features, each with N entries (N being ${\cal O}(1000)$). Each feature consists of random numbers drawn from the normal distribution $N(\mu_i,\sigma_i)$ with mean $\mu_i$ and standard deviation $\sigma_i$, with $i=1, 2, 3$. Generate the 3 variables $x_{i}$ such that:
    * $x_1$ is distributed as $N(0,1)$
    * $x_2$ is distributed as $x_1+N(0,3)$
    * $x_3$ is given by $2x_1+x_2$

In [46]:
N = 1000

x1 = np.random.normal(loc = 0, scale = 1, size = N)
x2 = x1 + np.random.normal(loc = 0, scale = 3, size = N)
x3 = 2 * x1 + x2

M = np.array([x1, x2, x3])

print("Dataset:\n", M, "\n")


Dataset:
 [[ 0.09817224 -0.17442946 -0.49086382 ... -0.0711662   1.76637867
  -0.93113195]
 [ 1.69868646 -3.4567745   0.08893523 ...  0.91670948  2.83457295
  -4.26535549]
 [ 1.89503093 -3.80563342 -0.89279241 ...  0.77437709  6.3673303
  -6.1276194 ]] 



* Find the eigenvectors and eigenvalues using the eigendecomposition of the covariance matrix

In [58]:
### Using the eigendecomposition of the covariance matrix ###

# compute the mean of each sequence (row) and set the right shape
M_mean = M.mean(axis = 1)[:, np.newaxis]

# re-center each sequence (row) around its mean
w = M - M_mean

# compute the covariance matrix
cov = w.dot(w.T) / (N - 1)
print("\nCovariance matrix:\n")
print(cov, "\n")

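# la.eig is a general-purpose solver: it returns complex eigenvalues even for a
# symmetric matrix (the imaginary parts are zero here) and does not sort them
l, V = la.eig(cov)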
print("\n### Using the eigendecomposition of the covariance matrix ###\n")
print("Eigenvalues:\n", l, '\n')
print("Real eigenvalues:\n", np.real_if_close(l), '\n')

print("Eigenvectors:\n", V, '\n')

D1 = np.dot(V, np.dot(np.diag(np.real_if_close(l)), la.inv(V)))
if (np.allclose(cov, D1)):
    print("The decomposition was successful.")



Covariance matrix:

[[ 0.94803896  0.96083859  2.85691651]
 [ 0.96083859 10.45382044 12.37549761]
 [ 2.85691651 12.37549761 18.08933062]] 


### Using the eigendecomposition of the covariance matrix ###

Eigenvalues:
 [ 2.75326221e+01+0.j -6.67272072e-16+0.j  1.95856789e+00+0.j] 

Real eigenvalues:
 [ 2.75326221e+01 -6.67272072e-16  1.95856789e+00] 

Eigenvectors:
 [[-0.10743502 -0.81649658  0.56726629]
 [-0.58732146 -0.40824829 -0.69884679]
 [-0.80219151  0.40824829  0.4356858 ]] 

The decomposition was successful.
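
Because the covariance matrix is symmetric, the complex dtype and the unsorted eigenvalues above can be avoided with `scipy.linalg.eigh`. The following is a minimal sketch rather than the notebook's own solution: it regenerates the data with a fixed seed (an assumption, added only for reproducibility) and uses `np.cov` in place of the manual computation.

```python
import numpy as np
from scipy import linalg as la

# Regenerate the dataset; the fixed seed is an assumption for reproducibility
rng = np.random.default_rng(0)
N = 1000
x1 = rng.normal(0, 1, N)
x2 = x1 + rng.normal(0, 3, N)
x3 = 2 * x1 + x2
M = np.array([x1, x2, x3])

# np.cov treats each row as a variable and normalizes by N - 1 by default
cov = np.cov(M)

# eigh assumes a symmetric input and returns real eigenvalues in ascending order
l, V = la.eigh(cov)

# Reorder so that the component with the largest variance comes first
order = np.argsort(l)[::-1]
l, V = l[order], V[:, order]

print("Eigenvalues:\n", l, "\n")
print("Eigenvectors (columns):\n", V, "\n")

# V is orthogonal, so its inverse is simply its transpose
print("Reconstruction matches cov:", np.allclose(cov, V @ np.diag(l) @ V.T))
```

The smallest eigenvalue is zero up to rounding because $x_3$ is an exact linear combination of $x_1$ and $x_2$, so the data span only two dimensions.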


* Find the eigenvectors and eigenvalues using the SVD. Check that the two procedures yield the same result

In [68]:
### Using the Singular Value Decomposition of the covariance matrix ###

# perform the SVD
U, S, Vt = la.svd(cov)

print("shapes: U =", U.shape, "D:", S.shape, "V^T:", Vt.shape, '\n')
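# cov is symmetric and positive semidefinite, so its singular values
# coincide with its eigenvalues (returned in descending order)
print("Eigenvalues:\n", S, '\n')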
print("U:\n", U, '\n')
print("V^T:\n", Vt, '\n')

# Verify the definition of the SVD by hand: cov = U D V^T
D2 = np.diag(S)
print("D:\n", D2, '\n')

SVD = np.dot(U, np.dot(D2, Vt))
print("SVD:\n", SVD, '\n')

if (np.allclose(cov, SVD)):
    print("The decomposition was successful.\n")

if (np.allclose(np.sort(S), np.sort(np.real_if_close(l)))):
    print("The eigenvalues obtained by SVD and eigendecomposition are the same.")



shapes: U = (3, 3) D: (3,) V^T: (3, 3) 

Eigenvalues:
 [2.75326221e+01 1.95856789e+00 7.88923421e-16] 

U:
 [[-0.10743502  0.56726629 -0.81649658]
 [-0.58732146 -0.69884679 -0.40824829]
 [-0.80219151  0.4356858   0.40824829]] 

V^T:
 [[-0.10743502 -0.58732146 -0.80219151]
 [ 0.56726629 -0.69884679  0.4356858 ]
 [ 0.81649658  0.40824829 -0.40824829]] 

D:
 [[2.75326221e+01 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 1.95856789e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 7.88923421e-16]] 

SVD:
 [[ 0.94803896  0.96083859  2.85691651]
 [ 0.96083859 10.45382044 12.37549761]
 [ 2.85691651 12.37549761 18.08933062]] 

The decomposition was successful.

The eigenvalues obtained by SVD and eigendecomposition are the same.
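
The cell above decomposes the covariance matrix. A common alternative is to apply the SVD to the centered data matrix itself, in which case the eigenvalues of the covariance matrix are recovered as $S^2/(N-1)$ and the columns of $U$ are the principal directions. The sketch below illustrates this under the same assumptions as the exercise; the seed and the variable names `l_svd`, `l_eig` and `V_eig` are introduced here for illustration only.

```python
import numpy as np
from scipy import linalg as la

# Regenerate the dataset; the fixed seed is an assumption for reproducibility
rng = np.random.default_rng(0)
N = 1000
x1 = rng.normal(0, 1, N)
x2 = x1 + rng.normal(0, 3, N)
x3 = 2 * x1 + x2
M = np.array([x1, x2, x3])

# Center each variable (row) around its mean
w = M - M.mean(axis=1, keepdims=True)

# Thin SVD of the 3 x N centered data: w = U @ diag(S) @ Vt
U, S, Vt = la.svd(w, full_matrices=False)

# Since cov = w w^T / (N - 1), its eigenvalues are S**2 / (N - 1)
l_svd = S**2 / (N - 1)

# Reference: eigendecomposition of the covariance matrix, sorted descending
l_eig, V_eig = la.eigh(np.cov(M))
l_eig, V_eig = l_eig[::-1], V_eig[:, ::-1]

print("Eigenvalues from SVD:  ", l_svd)
print("Eigenvalues from eigh: ", l_eig)
print("Eigenvalues agree:", np.allclose(l_svd, l_eig))

# Eigenvectors are defined only up to a sign, so compare |u . v| with 1
for k in range(2):
    print(f"component {k}: |u.v| =", abs(U[:, k] @ V_eig[:, k]))
```

Working on the data matrix avoids forming the covariance matrix explicitly, which tends to be numerically more stable when some directions carry very little variance.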


* What percentage of the dataset's total variability does each principal component explain? Given how the dataset was constructed, do these values make sense? Reduce the dimensionality of the system so that at least 99% of the total variability is retained

* Redefine the data according to the new basis from the PCA

* Plot the data, in both the original and the new basis. The figure should have 2 rows (the original and the new basis) and 3 columns (the $[x_0, x_1]$, $[x_0, x_2]$ and $[x_1, x_2]$ projections) of scatter plots.
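
One possible way to approach the three tasks above is sketched below; it is not a reference solution. It assumes a fixed seed, sorts the eigenvalues in descending order, and uses illustrative names (`explained`, `k`, `Mp`) that do not appear elsewhere in the notebook. By construction the population covariance matrix is $[[1,1,3],[1,10,12],[3,12,18]]$, whose eigenvalues are 27, 2 and 0, so roughly 93% and 7% of the variability are expected in the first two components and none in the third.

```python
import numpy as np
from scipy import linalg as la
import matplotlib.pyplot as plt

# Regenerate the dataset; the fixed seed is an assumption for reproducibility
rng = np.random.default_rng(0)
N = 1000
x1 = rng.normal(0, 1, N)
x2 = x1 + rng.normal(0, 3, N)
x3 = 2 * x1 + x2
M = np.array([x1, x2, x3])
w = M - M.mean(axis=1, keepdims=True)

# Eigendecomposition of the covariance matrix, largest eigenvalue first
l, V = la.eigh(np.cov(M))
l, V = l[::-1], V[:, ::-1]

# Fraction of the total variability carried by each principal component
explained = l / l.sum()
cumulative = np.cumsum(explained)
print("Explained variability:", explained)

# Smallest number of components retaining at least 99% of the variability
k = int(np.argmax(cumulative >= 0.99)) + 1
print("Components kept:", k)

# Data expressed in the PCA basis; the first k rows are the reduced dataset
Mp = V.T @ w
M_reduced = Mp[:k]

# 2 rows (original basis, PCA basis) x 3 columns (pairwise projections)
pairs = [(0, 1), (0, 2), (1, 2)]
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for col, (i, j) in enumerate(pairs):
    axes[0, col].scatter(w[i], w[j], s=4, alpha=0.5)
    axes[0, col].set_xlabel(f"$x_{i}$")
    axes[0, col].set_ylabel(f"$x_{j}$")
    axes[0, col].set_title("original basis")
    axes[1, col].scatter(Mp[i], Mp[j], s=4, alpha=0.5)
    axes[1, col].set_xlabel(f"$x'_{i}$")
    axes[1, col].set_ylabel(f"$x'_{j}$")
    axes[1, col].set_title("PCA basis")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. In the PCA basis the third coordinate is zero up to rounding, so any panel involving it collapses onto a line.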

2\. PCA on an nD dataset

* Start from the dataset you generated in the previous exercise and add uncorrelated random noise. Represent the noise with 10 additional uncorrelated, normally distributed variables whose standard deviation is much smaller (e.g. by a factor of 20) than those used to generate $x_1$ and $x_2$. Repeat the PCA procedure and compare the results with those obtained before.
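
A minimal sketch of this exercise follows, under stated assumptions: the noise standard deviation is taken as $1/20$ (a factor 20 below $\sigma_1 = 1$; the exercise leaves the exact value open), the seed is fixed, and the names `noise` and `M_noisy` are illustrative.

```python
import numpy as np
from scipy import linalg as la

# Regenerate the 3D dataset; the fixed seed is an assumption for reproducibility
rng = np.random.default_rng(0)
N = 1000
x1 = rng.normal(0, 1, N)
x2 = x1 + rng.normal(0, 3, N)
x3 = 2 * x1 + x2
M = np.array([x1, x2, x3])

# Ten uncorrelated noise variables; sigma = 1/20 is an assumed choice
noise = rng.normal(0, 1 / 20, size=(10, N))
M_noisy = np.vstack([M, noise])  # shape (13, N)

# PCA via the eigendecomposition of the 13 x 13 covariance matrix
l, V = la.eigh(np.cov(M_noisy))
l, V = l[::-1], V[:, ::-1]

explained = l / l.sum()
cumulative = np.cumsum(explained)
k = int(np.argmax(cumulative >= 0.99)) + 1

print("Eigenvalues:\n", l, "\n")
print("Explained variability:\n", explained, "\n")
print("Components needed for 99%:", k)

# Data in the new basis, truncated to the first k components
M_reduced = V[:, :k].T @ (M_noisy - M_noisy.mean(axis=1, keepdims=True))
```

With noise this small, the two leading eigenvalues should stay close to those of the 3D case, while the remaining ones are of the order of the noise variance, so two components still retain more than 99% of the variability.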

3\. **Optional**: PCA on the MAGIC dataset

Perform a PCA on the magic04.data dataset.

In [None]:
# download the dataset and its description into the data/ directory
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data -P data/
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names -P data/ 