# Skill Check 11

The block below imports the necessary packages.

In [1]:
import numpy as np
import pandas as pd

In this assignment, you will work with the Dow dataset and the MNIST dataset. Both are loaded below. The MNIST data used here is scikit-learn's 8×8 handwritten digits dataset from `load_digits`.

In [2]:
# Dow Dataset
df = pd.read_excel('impurity_dataset-training.xlsx')

def is_real_and_finite(x):
    if not np.isreal(x):
        return False
    elif not np.isfinite(x):
        return False
    else:
        return True
    
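# Drop the first column and keep only rows whose entries are all real and finite.
# Note: `applymap` is deprecated in pandas >= 2.1 in favor of `DataFrame.map`.
all_data = df[df.columns[1:]].values
numeric_map = df[df.columns[1:]].applymap(is_real_and_finite)
real_rows = numeric_map.all(axis = 1).copy().values
# Features: all but the last five columns. Target: the third-from-last column.
X_dow = np.array(all_data[real_rows, :-5], dtype = 'float')
y_dow = np.array(all_data[real_rows, -3], dtype = 'float')
y = y_dow.reshape(-1, 1)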

# MNIST Dataset
from sklearn.datasets import load_digits

digits = load_digits()

X_mnist = np.array(digits.data)
y_mnist = np.array(digits.target)

## 1. Partial Least Squares (50 pts)

Partial least squares regression (PLS regression) is a supervised algorithm that finds **linear** combinations of features that maximize the covariance between the transformed features and the target.

### 1a: Import `PLSRegression` (10 pts)

Import the `sklearn.cross_decomposition.PLSRegression` object.

In [3]:
########################################
# Start your code here
from sklearn.cross_decomposition import PLSRegression
########################################

In [4]:
assert PLSRegression.__init__

### 1b: Standardization (10 pts)

Standardize `X_dow` and save the resulting matrix to `X_scaled`.

In [5]:
########################################
# Start your code here
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X_dow)
########################################

In [6]:
assert np.isclose(X_scaled.mean(axis = 0), 0).all()
assert np.isclose(X_scaled.std(axis = 0), 1).all()
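For reference, standardization subtracts each column's mean and divides by its standard deviation. The sketch below, on made-up data, checks that doing this by hand with NumPy (population standard deviation, the `np.std` default) gives the same result as `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three columns on very different scales (illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(50, 3))

# Manual z-score: (x - mean) / std, computed column by column
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Same operation with scikit-learn
X_sklearn = StandardScaler().fit_transform(X_demo)

match = np.allclose(X_manual, X_sklearn)
print(match)
```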

### 1c: PLS Regression (10 pts)

Declare a `PLSRegression` object `pls` with `n_components = 5`. Train `pls` with `X_scaled` and `y`. Report the $\mathrm{r^2}$ as `pls_r2`.

In [7]:
########################################
# Start your code here
pls = PLSRegression(n_components = 5)
pls.fit(X_scaled, y)
pls_r2 = pls.score(X_scaled, y)
########################################

In [8]:
assert pls.n_components == 5
assert np.isclose(pls_r2, 0.6533602447390862)

### 1d: Achieving $\mathrm{r^2}$ of 0.68 (15 pts)

Find the minimum `n_components` with which $\mathrm{r^2}$ is above 0.68 when `PLSRegression` is trained and tested on `X_scaled` and `y`. Report the number as `min_n_comp` and the corresponding $\mathrm{r^2}$ as `r2_n_comp`.

In [9]:
########################################
# Start your code here
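n = 0  # start at 0 so that n_components = 1 is the first value tried
r2 = 0

# Increase n_components until the training r^2 exceeds 0.68
while r2 <= 0.68:
    n += 1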
    
    pls = PLSRegression(n_components = n)
    pls.fit(X_scaled, y)
    r2 = pls.score(X_scaled, y)
    
min_n_comp = n
r2_n_comp = r2
########################################

In [10]:
assert np.isclose(min_n_comp * r2_n_comp, 5.455616652135625)

### 1e: Feature transformation (5 pts)

Transform the features and reduce the dimensionality to `min_n_comp`. Name the matrix of transformed features `X_pls`.

In [11]:
########################################
# Start your code here
pls = PLSRegression(n_components = min_n_comp)
pls.fit(X_scaled, y)

X_pls = pls.transform(X_scaled)
########################################

In [12]:
assert X_pls.shape[0] == 10297
assert X_pls.shape[1] == min_n_comp
assert np.isclose(np.linalg.norm(X_pls), 537.5375896752992)

## 2. Linear Discriminant Analysis (50 pts)

Linear discriminant analysis (LDA) is a supervised algorithm that finds **linear** combinations of features that best separate two or more classes or labels.
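With default settings, scikit-learn's LDA produces at most `n_classes - 1` discriminant directions. The sketch below illustrates this on the iris dataset (3 classes, 4 features), which is used here only as a small stand-in example.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 samples, 4 features, 3 classes
X_iris, y_iris = load_iris(return_X_y=True)

lda_demo = LinearDiscriminantAnalysis().fit(X_iris, y_iris)
X_iris_lda = lda_demo.transform(X_iris)

# 3 classes give at most 2 discriminant directions
print(X_iris_lda.shape)  # (150, 2)
```

The same rule explains why ten digit classes yield nine transformed features.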

### 2a: Import `LinearDiscriminantAnalysis` (15 pts)

Import the `sklearn.discriminant_analysis.LinearDiscriminantAnalysis` object.

In [13]:
########################################
# Start your code here
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
########################################

In [14]:
assert LinearDiscriminantAnalysis.__init__

### 2b: LDA transformation (15 pts)

Compute `X_lda`, the matrix of features obtained by transforming `X_mnist` with LDA. Use the default parameter settings of the `LinearDiscriminantAnalysis` object.

In [15]:
########################################
# Start your code here
lda = LinearDiscriminantAnalysis()
lda.fit(X_mnist, y_mnist)
X_lda = lda.transform(X_mnist)
########################################

In [16]:
assert X_lda.shape[1] == 9
assert np.isclose(X_lda[0].sum(), -8.929053278244288)

### 2c: Apply k-Means (10 pts)

Apply k-means clustering with `KMeans(n_clusters=10, random_state=42)` to `X_lda`. Report the clustering result as `y_kmeans`.

In [17]:
########################################
# Start your code here
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 10, random_state = 42)
y_kmeans = kmeans.fit_predict(X_lda)
########################################

In [18]:
assert np.bincount(y_kmeans)[0] == 189
assert np.bincount(y_kmeans)[-1] == 174
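As a small aside, `fit_predict` returns the cluster index of each training point, which is the same array stored in `labels_` after fitting. The sketch below checks this on synthetic blobs. `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions. The cluster indices are arbitrary ids and carry no class meaning on their own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic clusters (illustrative only)
X_blobs, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km_demo.fit_predict(X_blobs)

# fit_predict gives the same assignment as the fitted labels_ attribute
same_as_labels_ = np.array_equal(labels, km_demo.labels_)

# Number of points assigned to each cluster id
cluster_sizes = np.bincount(labels)
print(same_as_labels_, cluster_sizes)
```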

### 2d: Assigning labels (10 pts)

Define a function `assign_labels` that assigns a label to each cluster by using the most common true label within that cluster. This step converts the clustering result into a classification prediction. For example, if label **9** is the most common label in cluster 1, all points belonging to cluster 1 should be labeled 9. `assign_labels` takes the following arguments:

- `y`: the original labels (numpy array, default: `y_mnist`)
- `y_hat`: the clustering result (numpy array, default: `y_kmeans`)

This function should return a numpy array `y_predict` in which a label is assigned to each data point.

In [19]:
def assign_labels(y = y_mnist, y_hat = y_kmeans):
########################################
# Start your code here
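    # Predicted label for every data point
    y_predict = np.zeros(y.shape, dtype = 'int')
    
    # Loop over the cluster ids that actually occur in y_hat
    for i in np.unique(y_hat):
        collect_label = y[y_hat == i]             # true labels in cluster i
        count_label = np.bincount(collect_label)  # frequency of each label
        common_label = np.argmax(count_label)     # most common label
        
        y_predict[y_hat == i] = common_label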
########################################
    return y_predict

In [20]:
y_predict = assign_labels()

assert np.bincount(y_predict)[1] == 189
assert np.bincount(y_predict)[4] == 175
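The majority-vote mapping can be traced by hand on a tiny example. The sketch below uses a standalone helper, `majority_vote_labels` (a hypothetical name, separate from `assign_labels`), and six made-up points. It also computes the resulting accuracy, which is one way to judge how well the clusters line up with the true labels.

```python
import numpy as np

def majority_vote_labels(y_true, cluster_ids):
    """Label every point with the most common true label of its cluster."""
    y_pred = np.empty_like(y_true)
    for cluster in np.unique(cluster_ids):
        members = cluster_ids == cluster
        # np.bincount counts label occurrences; argmax picks the mode
        y_pred[members] = np.bincount(y_true[members]).argmax()
    return y_pred

# Toy data: cluster 0 holds labels {1, 1}, cluster 1 holds {0, 0, 1},
# and cluster 2 holds {2}
y_true_demo = np.array([0, 0, 1, 1, 1, 2])
cluster_demo = np.array([1, 1, 0, 0, 1, 2])

y_pred_demo = majority_vote_labels(y_true_demo, cluster_demo)
accuracy = (y_pred_demo == y_true_demo).mean()

print(y_pred_demo)  # [0 0 1 1 0 2]
print(accuracy)
```

Only the point with true label 1 that landed in cluster 1 is mislabeled, so five of the six points are correct.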