# Section 4 - PCA
You do not need any accompanying datasets for this notebook.

Goals:
- practice pandas
- interpret plots
- solidify linear regression
- understand SVD better

# 1 Loading data from internet/preprocessing
**Task:**
- Read the description of the [diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes) dataset. 
- Using [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), load the dataset into a DataFrame `df` from the URL https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt. 
    - This dataset is tab-separated, so pass `sep='\t'` to `read_csv`.
- Display the dataframe.
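
As a minimal sketch of the `sep` argument, the snippet below parses a small made-up tab-separated string with `pandas.read_csv`. The values and the name `df_demo` are placeholders, not the real dataset; passing the URL above in place of the `StringIO` object should work the same way.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for the real file (illustrative values only).
sample = "AGE\tSEX\tBMI\tY\n25\t1\t22.5\t100\n48\t2\t30.1\t150\n"

# sep="\t" tells read_csv that columns are separated by tabs, not commas.
df_demo = pd.read_csv(io.StringIO(sample), sep="\t")
print(df_demo)
```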

In [None]:
# TODO import data


We will analyze the data for individuals with
1. `20 <= AGE < 30`
2. `SEX == 1`

**Task:** 
- Store this subset of data (with above two criteria) as `subdata`. 
    - After filtering on the above criteria, drop the `AGE` and `SEX` columns from the data. Hint: use `subdata.drop(columns= ...)`.
    - Display `subdata` to check you did the right thing.
- Define the numpy array `y` from the response column `Y` of `subdata`.
- Define numpy array `X` by choosing the remaining columns of `subdata`. 
    - Hence, your predictors/features are `BMI`, `BP`, `S1`, `S2`, `S3`, `S4`, `S5`, and `S6`. (Remember: we already dropped `SEX` and `AGE`)
    - Make sure the dimensions are number of features by number of data points. You may need to take a transpose.
- Check that you did the correct thing by printing any relevant quantities, such as shapes or array entries.
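
The filtering pattern can be sketched on a toy DataFrame. All values below are made up, the names `toy`, `toy_sub`, `X_demo`, and `y_demo` are hypothetical, and the response column is assumed to be named `Y`, as in the dataset description.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; column names mirror the real dataset.
toy = pd.DataFrame({
    "AGE": [25, 48, 29, 22],
    "SEX": [1, 2, 1, 2],
    "BMI": [22.5, 30.1, 24.0, 21.3],
    "BP":  [80.0, 95.0, 85.0, 78.0],
    "Y":   [100, 150, 120, 90],
})

# Combine conditions with & and wrap each one in parentheses.
mask = (toy["AGE"] >= 20) & (toy["AGE"] < 30) & (toy["SEX"] == 1)
toy_sub = toy[mask].drop(columns=["AGE", "SEX"])

y_demo = toy_sub["Y"].to_numpy()                     # response, shape (n,)
X_demo = toy_sub.drop(columns=["Y"]).to_numpy().T    # features x data points

print(X_demo.shape, y_demo.shape)
```

Printing the shapes right away is a cheap way to catch a missing transpose.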

In [None]:
# TODO create subdata, y, and X


# 2 Covariance matrix 

**Task:**
- Discuss: What is one important step you must apply to the data before calculating the covariance matrix for PCA?

    **Ans:** 

- Apply that step to the data `X` above and store the centered data as `X_ctd`. 
- Then, compute the covariance matrix `C` of the data `X_ctd` using the formula
    $$
    C = \frac{1}{n\!-\!1}\ X_{\text{ctd}} X_{\text{ctd}}^T,
    $$
    where $n$ is the number of data points.
    - Check that the dimensions of the covariance matrix are correct.
- Perform an eigendecomposition of `C` using the singular value decomposition.
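
A minimal sketch of these steps on synthetic data (not the diabetes data) is shown here, assuming the same features-by-data-points layout. The names `A`, `A_ctd`, `C_demo`, and `U_demo` are hypothetical, and `np.cov` is used only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 50))                 # 3 features x 50 data points (synthetic)

n_pts = A.shape[1]
A_ctd = A - A.mean(axis=1, keepdims=True)    # subtract each feature's mean
C_demo = A_ctd @ A_ctd.T / (n_pts - 1)       # covariance, shape (3, 3)

# For a symmetric positive semidefinite matrix, the SVD gives the eigenvectors
# (columns of U_demo) and the eigenvalues (s_demo), sorted in descending order.
U_demo, s_demo, Vt_demo = np.linalg.svd(C_demo)

print(C_demo.shape)
print(np.allclose(C_demo, np.cov(A)))        # np.cov treats rows as variables by default
```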

In [3]:
# TODO compute covariance matrix

# TODO eigendecomposition


# 3 Compare eigendecompositions
**Task:**
- Compute the eigendecomposition of the covariance matrix THREE different ways: `svd()`, `eig()`, `eigh()`. 
    - For each way, print the eigenvalues/singular values.
- Discuss:
    - What's the difference between `svd` and `eig`? How many different outputs are there?
    - What's the difference between `eig` and `eigh`? What property of covariance matrices allows us to use `eigh`?
    - What is the difference between the eigenvalue/singular value outputs? How are they sorted? Print them and check.
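
One way to check the ordering question empirically is sketched below on a synthetic covariance matrix. The names `B`, `C_cmp`, `s_svd`, `w_eig`, and `w_eigh` are hypothetical; the comparison sorts the outputs before testing equality, so it does not rely on any particular order from `eig`.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 30))                 # 4 features x 30 data points (synthetic)
B = B - B.mean(axis=1, keepdims=True)
C_cmp = B @ B.T / (B.shape[1] - 1)

_, s_svd, _ = np.linalg.svd(C_cmp)           # three outputs; singular values descending
w_eig, _ = np.linalg.eig(C_cmp)              # two outputs; no guaranteed order
w_eigh, _ = np.linalg.eigh(C_cmp)            # two outputs; eigenvalues ascending

print(s_svd)
print(w_eig)
print(w_eigh)

# After sorting, all three agree up to floating-point error.
print(np.allclose(s_svd, np.sort(w_eig.real)[::-1]))
print(np.allclose(s_svd, w_eigh[::-1]))
```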

In [None]:
# TODO SVD


In [None]:
# TODO EIG


In [None]:
# TODO EIGH


# 4 sklearn PCA
**Task:**
Compare your manual implementation against sklearn.
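
The comparison could look roughly like the following sketch on synthetic data, assuming scikit-learn is installed. The names `D`, `U_man`, `s_man`, and `pca` are hypothetical. Note that `PCA` expects data points in rows, and that eigenvectors are only determined up to sign, so the check compares absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
D = rng.normal(size=(3, 40))                 # 3 features x 40 data points (synthetic)
D_ctd = D - D.mean(axis=1, keepdims=True)
U_man, s_man, _ = np.linalg.svd(D_ctd @ D_ctd.T / (D.shape[1] - 1))

# sklearn expects data points in rows, hence the transpose; it centers internally.
pca = PCA(n_components=3).fit(D.T)

# explained_variance_ holds the eigenvalues of the covariance matrix.
print(np.allclose(pca.explained_variance_, s_man))
# Rows of components_ are the principal directions; signs may differ.
print(np.allclose(np.abs(pca.components_.T), np.abs(U_man), atol=1e-6))
```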

In [None]:
# TODO sklearn PCA

# TODO compare eigenvalues

In [None]:
# TODO compare eigenvectors


# 5 Projections
- For simplicity, consider a 2D subset of the data that uses only the `BMI` and `BP` features, centered; call it $Z$.
- Compute the principal components of $Z$; call the resulting matrix $U$.

The code has been written for you.

In [None]:
n = X.shape[1]                              # number of data points

Z = X[:2,:].copy()                          # BMI and BP rows only; copy so X itself is not modified
Z -= Z.mean(axis=1, keepdims=True)          # center
Z = Z[:,np.argsort(Z[1,:])]                 # sort points by their second coordinate (BP), for simplicity later

print(Z.shape)
Z[:,:5]                                     # first 5 points

In [None]:
U, S, Vt = np.linalg.svd(Z @ Z.T / (n-1))   # U: principal directions (columns), S: singular values
U

We learned that the projection of data $X$ onto the first principal component is $U_1U_1^TX$; here the data is $Z$, and the result is visualized below.

In [None]:
P1 = U[:,:1] @ U[:,:1].T @ Z        # U_1 U_1^T Z; U[:,:1] keeps a (2,1) column, whereas U[:,0] is 1-D and the product would fail

import matplotlib.pyplot as plt
plt.scatter(Z[0,:], Z[1,:], c='b')
plt.scatter(P1[0,:], P1[1,:], c='orange', label='projection')
plt.quiver([0,0], [0,0], U[0,:], U[1,:], color='k', angles='xy', scale_units='xy', scale=0.1, label='PCs')
plt.plot(np.vstack((Z[0,:], P1[0,:])), np.vstack((Z[1,:], P1[1,:])), '--', c='orange')
plt.xlabel('x'); plt.ylabel('y'); plt.title('Projecting onto PC1');plt.legend(loc='lower right')
plt.axis('equal'); plt.show()

What happens if we compute only $U_1^TX$ (here, $U_1^TZ$)? Run the cell below.
- The result is only 1-dimensional: one coordinate along PC1 per data point.

In [None]:
magnitudes = U[:,0].T @ Z
print(magnitudes.shape)
magnitudes
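
To connect the two computations, the sketch below uses synthetic 2D data with hypothetical names `Z_demo`, `u1`, `coords`, and `proj`. It illustrates that multiplying the 1-dimensional coordinates by $U_1$ again recovers the projected points in the plane.

```python
import numpy as np

rng = np.random.default_rng(3)
Z_demo = rng.normal(size=(2, 20))            # 2 features x 20 data points (synthetic)
Z_demo = Z_demo - Z_demo.mean(axis=1, keepdims=True)
U_d, _, _ = np.linalg.svd(Z_demo @ Z_demo.T / (Z_demo.shape[1] - 1))

u1 = U_d[:, :1]                 # first principal direction as a (2, 1) column
coords = u1.T @ Z_demo          # (1, 20): coordinate of each point along PC1
proj = u1 @ coords              # (2, 20): the same points placed back in the plane

print(coords.shape, proj.shape)
print(np.allclose(proj, u1 @ u1.T @ Z_demo))
```

The coordinates are enough to store the reduced data; the direction $U_1$ is needed only to draw the points in the original axes.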