1\. **PCA on 3D dataset**

* Generate a dataset simulating 3 features, each with N entries (N being ${\cal O}(1000)$). Each feature consists of random numbers generated according to the normal distribution $N(\mu,\sigma)$ with mean $\mu_i$ and standard deviation $\sigma_i$, with $i=1, 2, 3$. Generate the 3 variables $x_{i}$ such that:
    * $x_1$ is distributed as $N(0,1)$
    * $x_2$ is distributed as $x_1+N(0,3)$
    * $x_3$ is given by $2x_1+x_2$
* Find the eigenvectors and eigenvalues using the eigendecomposition of the covariance matrix
* Find the eigenvectors and eigenvalues using the SVD. Check that the two procedures yield the same result
* What percent of the total dataset's variability is explained by the principal components? Given how the dataset was constructed, do these make sense? Reduce the dimensionality of the system so that at least 99% of the total variability is retained
* Redefine the data according to the new basis from the PCA
* Plot the data, in both the original and the new basis. The figure should have 2 rows (the original and the new basis) and 3 columns (the $[x_0, x_1]$, $[x_0, x_2]$ and $[x_1, x_2]$ projections) of scatter plots.
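
As a cross-check for the first three bullets, here is a minimal sketch of the two routes to the principal components. It assumes the data matrix has one row per feature, and the fixed seed and variable names (`rng`, `Xc`, `svd_vals`) are illustrative choices, not part of the exercise. In expectation the covariance matrix of this construction is $\begin{pmatrix}1&1&3\\1&10&12\\3&12&18\end{pmatrix}$, whose eigenvalues are 27, 2 and 0, so the sample eigenvalues should land close to those values.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 1000
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 3, n)
x3 = 2 * x1 + x2
X = np.array([x1, x2, x3])  # shape (3, n): one row per feature

# Route 1: eigendecomposition of the covariance matrix.
# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
cov = np.cov(X)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest first

# Route 2: SVD of the centred data matrix.
# cov = Xc Xc^T / (n - 1), so the eigenvalues are s**2 / (n - 1)
# and the eigenvectors are the columns of U.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (n - 1)

# Eigenvalues agree; the two leading eigenvectors agree up to an overall sign.
same_vals = np.allclose(eigvals, svd_vals)
same_vecs = np.allclose(np.abs(eigvecs[:, :2]), np.abs(U[:, :2]))
print(same_vals, same_vecs)
```

The comparison takes absolute values because an eigenvector is only defined up to sign, so the two routines may return vectors pointing in opposite directions.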

In [3]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg as la

N = 1000  # number of samples
mu1, sigma1 = 0, 1  # mean and standard deviation
x1 = np.random.normal(mu1, sigma1, N)
mu2, sigma2 = 0, 3  # mean and standard deviation
x2 = x1 + np.random.normal(mu2, sigma2, N)
x3 = 2*x1 + x2
# create a matrix with the three vectors
X = np.array([x1, x2, x3])
print("X:", X.shape)
# Find the covariance matrix from the samples
cov = np.cov(X)
#print(cov)
# Compute using eigenvalues
l, V = la.eig(cov)
l = np.real(l)  # cov is symmetric, so the eigenvalues are real
print(l)
print(V)
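# Compute using SVD
# cov is symmetric and positive semi-definite, so its singular values are its
# eigenvalues and the columns of U are its eigenvectors (up to sign and ordering)
U, s, Vt = la.svd(cov)
l_est = s
print("l_est:", l_est)
print(U)
# Check that the two procedures agree on the eigenvalues
print(np.allclose(np.sort(np.real(l)), np.sort(l_est)))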

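# Fraction of the total variability explained by the two largest components
l = abs(l)
l_ord = np.sort(l)
print(l_ord)
variability = (l_ord[2] + l_ord[1]) / l_ord.sum()
print(variability)
# This makes sense: x3 is a linear combination of x1 and x2, so one eigenvalue is
# essentially zero and two components carry practically all the variability.
# We just need to drop one eigenvector, the one associated with the smallest eigenvalue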

toDel = np.argmin(l)
V_mod = np.copy(V)
V_mod[:, toDel] = 0  # drop the eigenvector with the smallest eigenvalue
# Data in the new (PCA) basis, with the discarded component set to zero.
# X itself is left untouched, so x1, x2, x3 remain the original features.
out = np.dot(X.T, V_mod)

o1 = out[:, 0]
o2 = out[:, 1]
o3 = out[:, 2]

# Plot: first row in the original basis, second row in the new basis
pairs = [(0, 1), (0, 2), (1, 2)]
rows = [("original basis", [x1, x2, x3]), ("new basis", [o1, o2, o3])]
fig, axs = plt.subplots(2, 3, figsize=(15, 10))
for r, (name, feats) in enumerate(rows):
    for c, (i, j) in enumerate(pairs):
        axs[r, c].scatter(feats[i], feats[j])
        axs[r, c].set_title(f"x_{i} vs x_{j} {name}")
        axs[r, c].set_xlabel(f"x_{i}")
        axs[r, c].set_ylabel(f"x_{j}")
plt.show()


X: (3, 1000)




2\. **PCA on a nD dataset**

* Start from the dataset you have generated in the previous exercise and add uncorrelated random noise. Such noise should be represented by 10 additional uncorrelated variables, normally distributed, with a standard deviation much smaller (e.g. by a factor of 20) than those used to generate $x_1$ and $x_2$. Repeat the PCA procedure and compare the results with what you obtained before.
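
One way to compare the two exercises is to count how many leading components are needed to reach a given fraction of the total variability. The sketch below is only illustrative: the helper `n_components_for`, the seed and the choice of noise standard deviation `1 / 20` are assumptions, not part of the exercise text.

```python
import numpy as np

def n_components_for(eigvals, threshold=0.99):
    """Smallest number of leading components whose cumulative
    explained-variance fraction reaches `threshold` (hypothetical helper)."""
    vals = np.sort(np.abs(eigvals))[::-1]        # largest eigenvalue first
    cumulative = np.cumsum(vals) / vals.sum()    # cumulative explained fraction
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(vals))

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility
n = 1000
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 3, n)
x3 = 2 * x1 + x2
noise = rng.normal(0, 1 / 20, (10, n))  # 10 extra uncorrelated variables
X = np.vstack([x1, x2, x3, noise])      # shape (13, n): one row per variable

eigvals = np.linalg.eigvalsh(np.cov(X))  # eigenvalues of a symmetric matrix
k = n_components_for(eigvals)
print("components needed for 99%:", k)
```

Because the noise variances are tiny compared with the variances of the first two principal components, the number of components needed should not change with respect to the 3D case.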

In [None]:
N = 1000  # number of samples
n_noise = 10  # number of additional noise variables
# Noise standard deviation: a factor 20 smaller than the sigmas used for x1 and x2
sigma_noise = min(sigma1, sigma2) / 20
noise = np.random.normal(0, sigma_noise, (n_noise, N))

# Stack the 3 original features and the 10 noise variables: shape (13, N)
X = np.vstack([x1, x2, x3, noise])
print("X:", X.shape)

# Now repeat the PCA procedure presented before
cov = np.cov(X)
l, V = la.eig(cov)
l = abs(l)

# Fraction of the total variability explained by the two largest components
l_ord = np.sort(l)
variability = (l_ord[-1] + l_ord[-2]) / l_ord.sum()
print("The variability is: ", variability)

# Sort the components by decreasing eigenvalue and express the data in the PCA basis.
# X is left untouched, so x1, x2, x3 remain the original features.
order = np.argsort(l)[::-1]
out = np.dot(X.T, V[:, order])
out[:, 2:] = 0  # keep only the two leading components

o1 = out[:, 0]
o2 = out[:, 1]
o3 = out[:, 2]

# Plot: first row in the original basis, second row in the new basis
pairs = [(0, 1), (0, 2), (1, 2)]
rows = [("original basis", [x1, x2, x3]), ("new basis", [o1, o2, o3])]
fig, axs = plt.subplots(2, 3, figsize=(15, 10))
for r, (name, feats) in enumerate(rows):
    for c, (i, j) in enumerate(pairs):
        axs[r, c].scatter(feats[i], feats[j])
        axs[r, c].set_title(f"x{i} vs x{j} {name}")
        axs[r, c].set_xlabel(f"x{i}")
        axs[r, c].set_ylabel(f"x{j}")
plt.show()

3\. **Optional**: **PCA on the MAGIC dataset**

Perform a PCA on the magic04.data dataset.

In [None]:
# get the dataset and its description on the proper data directory
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data -P data/
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names -P data/ 

In [None]:
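import numpy as np
import pandas as pd

data = pd.read_csv("data/magic04.data", header=None)
data = data.drop(columns=[10])  # drop the class label, keep the 10 numeric features
display(data)

# SVD of the centred data matrix (rows are samples, columns are features)
Xc = data.values - data.values.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues and eigenvectors of the covariance matrix
l = S**2 / (Xc.shape[0] - 1)
V = Vt.T  # the eigenvectors are the columns of V
print("\nThe eigenvalues are:\n", l)
print("\nThe eigenvectors are:\n", V)

# Fraction of the total variability retained by each component
l_sum = l.sum()

for i in range(data.shape[1]):
    print("By selecting the component %d, we retain %.2f%% of the total variability" % (i, (l[i]/l_sum)*100))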