1\. **PCA on a 3D dataset**

* Generate a dataset simulating 3 features, each with N entries (N being ${\cal O}(1000)$). Each feature is made of random numbers generated according to the normal distribution $N(\mu,\sigma)$ with mean $\mu_i$ and standard deviation $\sigma_i$, with $i=1, 2, 3$. Generate the 3 variables $x_{i}$ such that:
    * $x_1$ is distributed as $N(0,1)$
    * $x_2$ is distributed as $x_1+N(0,3)$
    * $x_3$ is given by $2x_1+x_2$
* Find the eigenvectors and eigenvalues using the eigendecomposition of the covariance matrix
* Find the eigenvectors and eigenvalues using the SVD. Check that the two procedures yield the same result
* What percentage of the dataset's total variability is explained by each principal component? Given how the dataset was constructed, do these values make sense? Reduce the dimensionality of the system so that at least 99% of the total variability is retained
* Redefine the data according to the new basis from the PCA
* Plot the data, in both the original and the new basis. The figure should have 2 rows (the original and the new basis) and 3 columns (the $[x_0, x_1]$, $[x_0, x_2]$ and $[x_1, x_2]$ projections) of scatter plots.
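
The first steps of the list above can be sketched as follows. This is a minimal illustration, not the reference solution: the seed, the names `X`, `Xc` and `k`, and the use of `np.linalg.eigh` (suited to symmetric matrices) are assumptions made here. The SVD is applied to the centered data matrix, whose singular values $s_i$ relate to the covariance eigenvalues through $\lambda_i = s_i^2/(N-1)$.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, an arbitrary choice for reproducibility
N = 1000
x1 = rng.normal(0, 1, N)
x2 = x1 + rng.normal(0, 3, N)
x3 = 2 * x1 + x2
X = np.column_stack([x1, x2, x3])   # shape (N, 3), one column per feature

# Route 1: eigendecomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]     # reorder to descending
evals, evecs = evals[order], evecs[:, order]

# Route 2: SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
evals_svd = s**2 / (N - 1)          # singular values -> covariance eigenvalues

# Eigenvalues should agree; eigenvectors agree only up to a sign per column
same_values = np.allclose(evals, evals_svd)
same_vectors = np.allclose(np.abs(evecs), np.abs(Vt.T), atol=1e-6)

# Smallest number of components whose cumulative variance fraction reaches 99%
explained = evals / evals.sum()
k = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)

print("explained variance fractions:", explained)
print("components needed for 99%:", k)
```

Since $x_3$ is an exact linear combination of $x_1$ and $x_2$, the third eigenvalue should be zero up to rounding, so two components are expected to be enough.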

In [8]:
import pandas as pd
import numpy as np
import numpy.random as npr
from scipy import linalg as la
from IPython.display import display
import matplotlib.pyplot as plt

In [22]:
size = 500
x1 = npr.normal(loc=0, scale=1, size=size)
x2 = x1 + npr.normal(loc=0, scale=3, size=size)
x3 = 2*x1 + x2


# one column per feature (transposing a 1D array is a no-op, so no .T is needed)
df = pd.DataFrame({
    'x1' : x1,
    'x2' : x2,
    'x3' : x3,
})

# display(df)


# calculate the covariance matrix
cov = np.cov(df, rowvar = False)

# find the eigenvalues and eigenvectors using the eigendecomposition of the covariance matrix
# (la.eig returns complex-typed eigenvalues in no guaranteed order)
eValues1, eVectors1 = la.eig(cov)
print("results using the eigendecomposition of the covariance matrix\n")
print("real(eigenvalues):\n", np.real_if_close(eValues1), '\n')
# printing the eigenvectors
print("eigenvectors:\n", eVectors1, '\n')

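# find the eigenvalues and eigenvectors using the SVD of the covariance matrix
# (for a symmetric positive semi-definite matrix the singular values equal the eigenvalues)
eVectors2, eValues2, _ = la.svd(cov)
print("results using the singular value decomposition of the covariance matrix\n")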
print("eigenvalues:\n", eValues2, '\n')
print("eigenvectors:\n", eVectors2, '\n')
# The SVD returns its values in descending order, while la.eig guarantees no order;
# for this particular draw the last two columns come out exchanged.
print("note that the second and third eigenvalue (and consequently their eigenvectors)",
      "are swapped between the two methods. This is not a problem, but if we want to",
      "compare the two methods, we need to swap their values.")
# Reorder the columns on a copy, so that eVectors2 itself stays sorted by
# decreasing eigenvalue for the PCA step below.
eVectors2_swapped = eVectors2[:, [0, 2, 1]]
print("the swapped eigenvector matrix reads\n")
print(eVectors2_swapped)
# Eigenvectors are defined only up to a sign, so other draws may need np.abs() on both sides.
print("\nand now the method 'allclose' returns", np.allclose(eVectors1, eVectors2_swapped))


# part 2: principal component analysis
variability = sum(eValues2)
fractionVariability = [(i / variability) for i in sorted(eValues2, reverse=True)]

# Print the results
print('total variability explained by each component:', fractionVariability)
print("since the third column is a linear combination of the first two,",
      "it makes sense that the last component carries a negligible share of the variability.\n")

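# select the two principal components with the largest eigenvalues
# (taken from the eigendecomposition, sorted by decreasing eigenvalue)
order = np.argsort(np.real(eValues1))[::-1]
pc = np.real(eVectors1[:, order[:2]])

# Center the data, then project it onto the first two principal components
projected_data = np.dot(df - df.mean(), pc)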


results using the eigendecomposition of the covariance matrix

real(eigenvalues):
 [2.39768746e+01 3.61431738e-16 2.09776265e+00] 

eigenvectors:
 [[-0.11924782 -0.81649658  0.56490113]
 [-0.57261194 -0.40824829 -0.71094929]
 [-0.81110759  0.40824829  0.41885297]] 

results using the singular value decomposition of the covariance matrix

eigenvalues:
 [2.39768746e+01 2.09776265e+00 5.96705218e-16] 

eigenvectors:
 [[-0.11924782  0.56490113 -0.81649658]
 [-0.57261194 -0.71094929 -0.40824829]
 [-0.81110759  0.41885297  0.40824829]] 

note that the second and third eigenvalue (and consequently their eigenvectors) are swapped between the two methods. This is not a problem, but if we want to compare the two methods, we need to swap their values.
the swapped eigenvector matrix reads

[[-0.11924782 -0.81649658  0.56490113]
 [-0.57261194 -0.40824829 -0.71094929]
 [-0.81110759  0.40824829  0.41885297]]

and now the method 'allclose' returns True
total variability explained by each component: [0.9
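
The remaining steps of the exercise, the change of basis and the 2x3 grid of scatter plots, could be sketched as below. This is a self-contained illustration under stated assumptions: the data are regenerated with an arbitrary seed, and the names `Xp`, `pairs` and the axis labels are chosen here for convenience (zero-based, following the $[x_0, x_1]$ notation of the exercise text).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)      # arbitrary seed
N = 1000
x1 = rng.normal(0, 1, N)
x2 = x1 + rng.normal(0, 3, N)
X = np.column_stack([x1, x2, 2 * x1 + x2])

# Principal axes from the covariance matrix, sorted by decreasing eigenvalue
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
evecs = evecs[:, np.argsort(evals)[::-1]]

# Coordinates of the centered data in the new basis; the third one is ~0
Xp = Xc @ evecs

# Row 0: original basis, row 1: PCA basis; columns: the three 2D projections
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
pairs = [(0, 1), (0, 2), (1, 2)]
for row, (data, name) in enumerate([(X, "x"), (Xp, "pc")]):
    for col, (i, j) in enumerate(pairs):
        ax = axes[row, col]
        ax.scatter(data[:, i], data[:, j], s=4, alpha=0.5)
        ax.set_xlabel(f"{name}{i}")
        ax.set_ylabel(f"{name}{j}")
fig.tight_layout()
```

In the bottom row the points should look uncorrelated, and every panel involving `pc2` should collapse onto a line, since that direction carries essentially no variance.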

2\. **PCA on an nD dataset**

* Start from the dataset you generated in the previous exercise and add uncorrelated random noise. The noise should be represented by 10 additional uncorrelated variables, normally distributed, with a standard deviation much smaller (e.g. by a factor of 20) than those used to generate $x_1$ and $x_2$. Repeat the PCA procedure and compare the results with those obtained before.
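
One possible way to set this up is sketched below. The seed and the variable names are illustrative, and the noise standard deviation of $1/20$ is an assumption: it takes the smaller of the two generating sigmas ($\sigma_1 = 1$) as the reference for the "factor 20".

```python
import numpy as np

rng = np.random.default_rng(1)      # arbitrary seed
N = 1000
x1 = rng.normal(0, 1, N)
x2 = x1 + rng.normal(0, 3, N)
x3 = 2 * x1 + x2

# 10 extra uncorrelated features, std 20 times smaller than sigma_1 = 1 (assumed reference)
noise = rng.normal(0, 1 / 20, size=(N, 10))
X = np.column_stack([x1, x2, x3, noise])    # shape (N, 13)

# Eigenvalues of the 13x13 covariance matrix, in descending order
evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
explained = evals / evals.sum()

# Number of components needed to retain at least 99% of the variability
k = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)

print("explained variance fractions:", np.round(explained, 5))
print("components needed for 99%:", k)
```

With noise this small, the two leading components should still dominate, and the remaining eigenvalues should sit near the noise variance instead of being exactly zero as in the 3D case.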

3\. **Optional**: **PCA on the MAGIC dataset**

Perform a PCA on the magic04.data dataset.

In [None]:
# get the dataset and its description on the proper data directory
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data -P data/
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names -P data/ 
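
A possible starting point for this exercise is sketched below. The helper `pca_from_csv` is hypothetical, and it assumes that the file is comma-separated, has no header row, and stores the class label in a non-numeric column, which is dropped before the PCA. Standardizing the features is a design choice made here because the columns may have very different scales; it is not required by the exercise.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def pca_from_csv(path, standardize=True):
    """Hypothetical helper: PCA on the numeric columns of a header-less CSV file."""
    df = pd.read_csv(path, header=None)
    # Keep only numeric columns (assumption: the class label is non-numeric)
    X = df.select_dtypes(include="number").to_numpy(dtype=float)
    X = X - X.mean(axis=0)
    if standardize:
        X = X / X.std(axis=0, ddof=1)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(evals)[::-1]          # descending eigenvalues
    evals, evecs = evals[order], evecs[:, order]
    # Variance fractions, principal axes (columns), data in the new basis
    return evals / evals.sum(), evecs, X @ evecs


data_file = Path("data/magic04.data")        # location used by the wget commands above
if data_file.exists():
    explained, components, projected = pca_from_csv(data_file)
    print("explained variance fractions:", np.round(explained, 3))
```

The cumulative sum of the returned fractions can then be used, as in the first exercise, to decide how many components to keep.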