# Skill Check 9

The block below imports the necessary packages.

In [1]:
import numpy as np
import pandas as pd

## 1. MNIST Dataset (60 pts)

### 1a: Reading in the dataset (20 pts)

Load the MNIST-style handwritten digits data from `sklearn.datasets` (the `load_digits` dataset of 8 x 8 images). Save the pixel data to `X_mnist` (`numpy.ndarray`) and the corresponding digit labels to `y_mnist` (`numpy.ndarray`).

In [2]:
########################################
# Start your code here
from sklearn.datasets import load_digits

digits = load_digits()
X_mnist = np.array(digits.data)
y_mnist = np.array(digits.target)
########################################

In [3]:
assert X_mnist.shape == (1797, 64), "Wrong data"
assert y_mnist.shape == (1797,), "Wrong data"
assert X_mnist[0].sum() == 294.0, "Wrong data"
assert np.isclose(X_mnist[y_mnist == 9].mean(), 4.895138888888889), "Wrong data"

### 1b: Correlation matrix (20 pts)

Create the correlation matrix `corr` for the 9th to 12th features of `X_mnist` (columns 8 through 11 with 0-based indexing). The result should be a 4 x 4 matrix.
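`np.corrcoef` treats each row as a variable by default, so an array laid out as samples by features has to be transposed or passed with `rowvar=False`. The following is a minimal sketch on synthetic data; `X_demo`, `block`, `corr_a`, and `corr_b` are illustrative names, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 16))   # synthetic stand-in: 100 samples, 16 features

# The 9th to 12th features are columns 8, 9, 10, 11 with 0-based indexing.
block = X_demo[:, 8:12]

# Two equivalent ways to correlate columns rather than rows.
corr_a = np.corrcoef(block.T)
corr_b = np.corrcoef(block, rowvar=False)

print(corr_a.shape)
```

Both calls return a symmetric 4 x 4 matrix with ones on the diagonal.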

In [4]:
########################################
# Start your code here
corr = np.corrcoef(X_mnist[:, 8:12].T)
########################################

In [5]:
assert corr.shape == (4, 4), "Wrong dimensions"
assert np.isclose(np.linalg.norm(corr), 2.181790276103041), "Wrong correlation matrix"

### 1c: Standardize the features (10 pts)

Standardize the 9th to 12th features in the `X_mnist` data and save them to `X_mnist_scaled`.
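Standardizing means subtracting each column's mean and dividing by its standard deviation. As a rough sketch of what that does (synthetic data, illustrative names such as `X_demo` and `X_demo_scaled`), the same result can be written in plain NumPy. It assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows, and that no column has zero variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))   # synthetic stand-in

# Column-wise mean and population standard deviation (ddof=0 is NumPy's default).
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)

# Assumes every column has nonzero variance, otherwise this divides by zero.
X_demo_scaled = (X_demo - col_mean) / col_std

print(X_demo_scaled.mean(axis=0), X_demo_scaled.std(axis=0))
```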

In [6]:
########################################
# Start your code here
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_mnist_scaled = ss.fit_transform(X_mnist[:, 8:12])
########################################

In [7]:
assert np.isclose(X_mnist_scaled.mean(), 0), "not standardized"
assert np.isclose(X_mnist_scaled.std(), 1), "not standardized"
assert np.isclose(np.linalg.norm(X_mnist_scaled), 84.78207357690678), "wrong features"

### 1d: Correlation and regression I (5 pts)

The correlation between two given features is numerically equivalent to the regression slope between the standardized features. Let's show this with a simple coding exercise:

- Get the correlation between the 9th and 11th features from `corr` and save it to `corr_9_11`.
- Declare `LinearRegression(fit_intercept = False)` and train the model with the standardized 9th and 11th features as x data and y data, respectively.
- Get the regression slope `slope_9_11` from the `LinearRegression` model.
- The autograder will test whether `corr_9_11` and `slope_9_11` are equal (up to floating-point tolerance).
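One way to see why this holds, as a short sketch: write $z_x$ and $z_y$ for the two standardized features (mean 0 and population standard deviation 1) and $n$ for the number of samples. These symbols are introduced here only for the derivation. The least-squares slope without an intercept is

$$
\hat{\beta} = \frac{\sum_{i=1}^{n} z_{x,i}\, z_{y,i}}{\sum_{i=1}^{n} z_{x,i}^{2}} = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\, z_{y,i} = r_{xy},
$$

because $\sum_{i} z_{x,i}^{2} = n$ for a standardized feature, and the average product of two standardized features is the Pearson correlation $r_{xy}$.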

In [8]:
########################################
# Start your code here
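# corr covers the 9th-12th features, so index (0, 2) pairs the 9th with the 11th.
corr_9_11 = corr[0][2]

from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept = False)

# x = standardized 9th feature, y = standardized 11th feature (as column vectors).
lr.fit(X_mnist_scaled[:, 0].reshape(-1, 1), X_mnist_scaled[:, 2].reshape(-1, 1))
# With a 2-D y, coef_ has shape (1, 1); take the single slope value.
slope_9_11 = lr.coef_[0][0]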
########################################

In [9]:
assert np.isclose(corr_9_11, 0.04161298355604604), "wrong correlation"
assert np.isclose(corr_9_11, slope_9_11), "wrong regression slope"

### 1e: Correlation and regression II (5 pts)

The squared correlation is numerically equal to the $\mathrm{r^2}$ score of a regression model between the two standardized features. You will also show this with a simple coding exercise:

- Get the squared correlation between the 10th and 12th features from `corr` and save it to `corr_10_12`.
- Declare `LinearRegression(fit_intercept = False)` and train the model with the standardized 10th and 12th features as x data and y data, respectively.
- Get the $\mathrm{r^2}$ score `r2_10_12` from the `LinearRegression` model.
- The autograder will test whether `corr_10_12` and `r2_10_12` are equal (up to floating-point tolerance).
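The claim above can be checked by hand on synthetic data. This is a minimal NumPy sketch, not the graded solution: `x_demo`, `y_demo`, `zx`, `zy`, and the other names are made up for illustration, and the $\mathrm{r^2}$ score is computed from its usual definition, one minus the residual sum of squares over the total sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.normal(size=500)
y_demo = 0.3 * x_demo + rng.normal(size=500)   # weakly correlated synthetic pair

# Standardize both variables with the population standard deviation (ddof=0).
zx = (x_demo - x_demo.mean()) / x_demo.std()
zy = (y_demo - y_demo.mean()) / y_demo.std()

# Pearson correlation between the two variables.
r = np.corrcoef(zx, zy)[0, 1]

# Least-squares slope of zy on zx with no intercept term.
slope = (zx @ zy) / (zx @ zx)

# r^2 score: 1 - SS_res / SS_tot.
ss_res = ((zy - slope * zx) ** 2).sum()
ss_tot = ((zy - zy.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot

print(r, slope, r ** 2, r2)
```

The slope matches the correlation and the $\mathrm{r^2}$ score matches its square, which is the relationship the two exercises test.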

In [10]:
########################################
# Start your code here
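# Index (1, 3) in corr pairs the 10th with the 12th feature; square it to compare with r^2.
corr_10_12 = corr[1][3]**2

# Reuse the no-intercept model from 1d: x = standardized 10th feature, y = standardized 12th feature.
x_10 = X_mnist_scaled[:, 1].reshape(-1, 1)
y_12 = X_mnist_scaled[:, 3].reshape(-1, 1)
lr.fit(x_10, y_12)
r2_10_12 = lr.score(x_10, y_12)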
########################################

In [11]:
assert np.isclose(corr_10_12, 0.06455652711795277), "wrong correlation"
assert np.isclose(corr_10_12, r2_10_12), "wrong r2 score"

## 2. Kernel PCA (40 pts)

Kernel PCA uses the eigenvalues and corresponding eigenvectors of a (centered) kernel matrix. Let's make use of the built-in `KernelPCA` class in the `scikit-learn` package.
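To make the eigen-decomposition step concrete, here is a rough NumPy sketch of kernel PCA with an RBF kernel. It is an illustration under stated assumptions, not how `scikit-learn` is implemented internally: the helper names `rbf_kernel_matrix` and `kernel_pca_sketch` are hypothetical, the data is synthetic, and `gamma` is set to one over the number of features.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Pairwise RBF kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)   # clip tiny negatives from round-off
    return np.exp(-gamma * sq_dists)

def kernel_pca_sketch(X, n_components, gamma):
    """Return the top eigenvalues and the projection of the training data."""
    n = X.shape[0]
    K = rbf_kernel_matrix(X, gamma)

    # Center the kernel matrix in feature space.
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # eigh returns ascending eigenvalues; keep the largest n_components.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Training-set projection: unit eigenvectors scaled by sqrt(eigenvalue).
    projected = eigvecs * np.sqrt(eigvals)
    return eigvals, projected

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))   # synthetic stand-in data
eigvals_demo, Z_demo = kernel_pca_sketch(
    X_demo, n_components=2, gamma=1.0 / X_demo.shape[1]
)
print(Z_demo.shape)
```

Each projected column has squared norm equal to its eigenvalue, so the leading components carry the most variance in the kernel feature space.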

### 2a: Import `KernelPCA` (20 pts)

Import the `KernelPCA` class from `sklearn.decomposition`.

In [12]:
########################################
# Start your code here
from sklearn.decomposition import KernelPCA
########################################

In [13]:
assert KernelPCA.__init__, "KernelPCA not imported"

### 2b: Set the kernel matrix (10 pts)

A `KernelPCA` object takes a `kernel` parameter that defines the kernel used for the PCA. The default is a `linear` kernel; in this problem, you will use the `rbf` kernel, which you are already familiar with. Declare a `KernelPCA` object `kpca` with `kernel = 'rbf'`. You do not need to specify any other arguments for this problem.

In [14]:
########################################
# Start your code here
kpca = KernelPCA(kernel = 'rbf')
########################################

In [15]:
assert kpca.kernel == 'rbf', "wrong kernel"

### 2c: Eigenvalues (5 pts)

Project the `X_mnist` data using `kpca` and save the resulting matrix to `X_mnist_kpca` (you do not need to do any dimensionality reduction). Then get the eigenvalues of the kernel matrix from the fitted object; the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html) lists the relevant attribute. Report the number of eigenvalues obtained from `kpca` as `num_eigvals`.

In [16]:
########################################
# Start your code here
X_mnist_kpca = kpca.fit_transform(X_mnist)
num_eigvals = kpca.eigenvalues_.shape[0]
########################################

In [17]:
assert X_mnist_kpca.shape == (1797, 1796), "wrong kernel pca"
assert np.isclose(kpca.eigenvalues_[0], 2.3481557275070606), "wrong kernel pca"
assert np.isclose(num_eigvals * kpca.eigenvalues_[1], 3532.6854240142975), "wrong num_eigvals"
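The projection in the check above has 1796 columns for 1797 samples. A likely explanation, sketched below under that assumption, is that centering the kernel matrix always leaves at least one zero eigenvalue, and `KernelPCA` drops zero-eigenvalue components when `n_components` is not set. The example uses a small synthetic array and illustrative names (`X_demo`, `K_centered`, `n_nonzero`); the `1e-10` threshold is an arbitrary choice for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 3))   # 6 synthetic samples, 3 features
n = X_demo.shape[0]

# RBF kernel with gamma = 1 / n_features (assumed here to mirror the default).
sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / X_demo.shape[1])

# Centering matrix H = I - (1/n) * ones; the centered kernel is H K H.
H = np.eye(n) - np.full((n, n), 1.0 / n)
K_centered = H @ K @ H

# The all-ones vector lies in the null space of H K H, so one eigenvalue is zero.
eigvals = np.linalg.eigvalsh(K_centered)
n_nonzero = int((eigvals > 1e-10).sum())

print(n, n_nonzero)
```

With six samples, five eigenvalues are numerically nonzero, mirroring the 1797 versus 1796 pattern.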

### 2d: kPCA Dimensionality Reduction (5 pts)

Use kernel PCA to project the dataset onto the first two kernel principal components. Save the resulting dataset as `X_kpca_2d`.

**Hint:** You may need to re-initialize your KernelPCA class.
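As a sketch of what the hint describes, a new `KernelPCA` object can be created with `n_components=2` and fit again. On the small synthetic array below (illustrative names `X_demo`, `Z_full`, `Z_2d`, `Z_sliced`), the re-fit projection agrees with the first two columns of the full projection; the comparison uses absolute values on the assumption that the two may differ in sign or by solver tolerance.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))   # synthetic stand-in data

# Full projection: every component with a nonzero eigenvalue is kept.
Z_full = KernelPCA(kernel="rbf").fit_transform(X_demo)

# Re-initialized object that keeps only the two leading components.
Z_2d = KernelPCA(n_components=2, kernel="rbf").fit_transform(X_demo)

# Components are ordered by decreasing eigenvalue, so slicing gives the same two.
Z_sliced = Z_full[:, :2]

print(Z_2d.shape, np.allclose(np.abs(Z_2d), np.abs(Z_sliced), atol=1e-6))
```

Re-fitting avoids computing and storing all components when only two are needed, which matters more as the number of samples grows.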

In [18]:
########################################
# Start your code here
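# Components are ordered by decreasing eigenvalue, so the first two columns of the
# full projection are the 2-D projection; re-fitting with n_components=2 is the alternative.
X_kpca_2d = X_mnist_kpca[:, :2]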
########################################

In [19]:
assert X_kpca_2d.shape == (1797, 2), "dimensionality of projection is not correct"
assert np.isclose(np.linalg.norm(X_kpca_2d), 2.0772890492972254), "projection is not correct"