# Skill Check 5

The block below imports the necessary packages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

## 0. The Dow Dataset (5 pts)

This problem uses this week's new dataset, the Dow dataset. Read `impurity_dataset-training.xlsx` into a `pandas.DataFrame` named `df`. (5 pts)

Note: `pd.read_excel` needs an Excel reader package (such as `openpyxl`) to be installed, so this may raise an error on some computers. However, it should work reliably in the Vocareum environment.

In [2]:
########################################
# Start your code here
df = pd.read_excel('impurity_dataset-training.xlsx')
########################################

In [3]:
assert type(df) == pd.core.frame.DataFrame
assert df.shape == (10703, 46)
assert np.isclose(np.linalg.norm(df[df.columns[1:]].loc[1]), 3381.2181210675867)

The cell below cleans `df` by removing rows that contain invalid or missing values. It creates two variables, `X` and `y`, which hold the input feature matrix and the corresponding impurity concentrations, respectively. You don't need to understand how this works yet; we will cover it in future lessons.

In [4]:
def is_real_and_finite(x):
    if not np.isreal(x):
        return False
    elif not np.isfinite(x):
        return False
    else:
        return True

all_data = df[df.columns[1:]].values #drop the first column (date)
numeric_map = df[df.columns[1:]].applymap(is_real_and_finite)
real_rows = numeric_map.all(axis=1).copy().values #True if all values in a row are real numbers
X = np.array(all_data[real_rows,:-5], dtype='float') #drop the last 5 cols that are not inputs
y = np.array(all_data[real_rows,-3], dtype='float')
y = y.reshape(-1,1)
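
To make the cleaning step less opaque, here is a minimal sketch of the same idea on a hypothetical toy table (the column names and values are made up and are not part of the Dow data). It assumes every column after the date is already numeric, so `np.isfinite` can stand in for the element-wise check used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (not the Dow data): a date column, two inputs, one output
toy = pd.DataFrame({
    "date": ["d1", "d2", "d3", "d4"],
    "x1": [1.0, 2.0, np.nan, 4.0],
    "x2": [0.5, np.inf, 1.5, 2.5],
    "y": [10.0, 20.0, 30.0, 40.0],
})

values = toy[toy.columns[1:]].to_numpy(dtype=float)  # drop the date column
keep = np.isfinite(values).all(axis=1)               # True where every entry in the row is finite
X_toy = values[keep, :-1]                            # all but the last column are inputs
y_toy = values[keep, -1].reshape(-1, 1)              # last column is the target, as a column vector
```

Rows containing `NaN` or `inf` are dropped from both `X_toy` and `y_toy`, so the inputs and targets stay aligned row by row.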

## 1. Feature Scaling (75 pts)

In this problem, you will see how feature scaling affects model performance. First, import `StandardScaler` and `MinMaxScaler` from `scikit-learn`. Declare a `StandardScaler` object `ss` and a `MinMaxScaler` object `mms`. Do not change the default parameters of either scaler. (15 pts)

In [5]:
########################################
# Start your code here
from sklearn.preprocessing import StandardScaler, MinMaxScaler

ss = StandardScaler()
mms = MinMaxScaler()
########################################

In [6]:
assert type(ss) == sklearn.preprocessing._data.StandardScaler
assert type(mms) == sklearn.preprocessing._data.MinMaxScaler

assert ss.with_mean and ss.with_std, "default setting for StandardScaler changed"
assert mms.feature_range == (0, 1), "default setting for MinMaxScaler changed"
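
As a quick illustration of what the two scalers do with their default settings, the sketch below applies them to a small made-up array (`toy` and the other names are illustrative only). With defaults, `StandardScaler` centers each column to zero mean and unit variance, and `MinMaxScaler` maps each column onto the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up array with two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

toy_ss = StandardScaler().fit_transform(toy)   # (x - mean) / std, column by column
toy_mms = MinMaxScaler().fit_transform(toy)    # (x - min) / (max - min), column by column

# The same transforms written out by hand for comparison
manual_ss = (toy - toy.mean(axis=0)) / toy.std(axis=0)
manual_mms = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
```

After either transform the two columns become directly comparable, which matters for LASSO because its penalty acts on the size of each coefficient.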

Train a LASSO model on the Dow dataset and find the best scaling method (among no scaling, standard scaling, and min-max scaling). Follow the steps below.

- Split `X` and `y` into training and test sets using `train_test_split` with `test_size=0.3` and `random_state=42`. Name the results `*_train` and `*_test`, where `*` is either `X` or `y`. (20 pts)
- Declare a LASSO model with `alpha=1e-4` and `tol=0.15`. Assign the LASSO model to the variable `lasso`. (10 pts)
- For each scaling method, train the LASSO model on the training set and compute the $\mathrm{r^2}$ on the test set.
- Report the best $\mathrm{r^2}$ as `r2_opt`. (20 pts)

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

########################################
# Start your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
lasso = Lasso(alpha = 1e-4, tol = 0.15)

X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

X_train_mms = mms.fit_transform(X_train)
X_test_mms = mms.transform(X_test)

train = [X_train, X_train_ss, X_train_mms]
test = [X_test, X_test_ss, X_test_mms]

r2_opt = -np.inf  # r^2 can be negative, so do not start from 0
for tr, te in zip(train, test):
    lasso.fit(tr, y_train)
    r2 = lasso.score(te, y_test)  # r^2 on the test set for this scaling method
    r2_opt = max(r2_opt, r2)
########################################

In [8]:
assert X_train.shape == (7207, 40), "test_size not correct"

In [9]:
assert np.isclose(np.linalg.norm(X_test), 229401.15359462335), "random_state not correct"

In [10]:
assert type(lasso) == sklearn.linear_model._coordinate_descent.Lasso, "LASSO model not stored to correct variable"
assert lasso.alpha == 1e-4, "Alpha parameter of LASSO is not correct"
assert lasso.tol == 0.15, "Tolerance not correct"

In [11]:
assert np.isclose(r2_opt, 0.6805049793263493), "r2 not correct"
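
The comparison above can also be sketched on synthetic data, which makes the pattern easier to see in isolation. Everything in this block (`X_demo`, `y_demo`, `scores`, the feature scales, and the random seed) is made up for illustration and is unrelated to the Dow results. The key point is that each scaler is fitted on the training set only and then reused on the test set.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data: three features on very different scales plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.05, 0.001]) + rng.normal(scale=0.1, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

scores = {}
for name, scaler in [("none", None), ("standard", StandardScaler()), ("min-max", MinMaxScaler())]:
    if scaler is None:
        tr, te = Xtr, Xte
    else:
        tr = scaler.fit_transform(Xtr)  # scaling statistics come from the training set only
        te = scaler.transform(Xte)      # the test set reuses those statistics
    model = Lasso(alpha=1e-4)
    model.fit(tr, ytr)
    scores[name] = model.score(te, yte)  # r^2 on the held-out test set

best = max(scores, key=scores.get)  # name of the scaling method with the highest r^2
```

Keeping the scores in a dictionary records which method produced the best value, rather than only the value itself.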

For the **min-max scaling** case, report the parameter vector that remains after dropping every feature whose coefficient is zero. Name the reduced parameter vector `dropped_coefs`. (10 pts)

In [12]:
########################################
# Start your code here
lasso.fit(X_train_mms, y_train)
coefs = lasso.coef_

dropped_coefs = coefs[coefs != 0]
########################################

In [13]:
assert np.isclose(np.linalg.norm(dropped_coefs) * len(dropped_coefs), 495.69726745190184), "parameter vector not correct"
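
Dropping zero coefficients is a boolean-mask operation. The sketch below uses a made-up coefficient vector (`coefs_demo` is hypothetical, not output from the model above) and also shows how to recover which feature indices survive, which is often what you want after a LASSO fit.

```python
import numpy as np

# Made-up coefficient vector standing in for a fitted model's coef_
coefs_demo = np.array([0.0, 1.5, 0.0, -2.0, 0.3])

mask = coefs_demo != 0            # True for features LASSO kept
kept = coefs_demo[mask]           # reduced parameter vector
kept_idx = np.flatnonzero(mask)   # column indices of the surviving features
```

The same `mask` can be used to select the matching columns of a feature matrix, for example `X[:, mask]`.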

## 2. Principal Component Analysis (20 pts)

Principal component analysis is closely related to the eigenvalue decomposition of the correlation matrix, as described in the lectures. This problem ensures that you know how to obtain the principal components in this way.

First, create a correlation matrix `corr` from `X`. (10 pts)

In [14]:
########################################
# Start your code here
corr = np.corrcoef(X.T)
########################################

In [15]:
assert np.isclose(np.linalg.norm(corr), 24.09288033850843)
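
To connect the correlation matrix back to feature scaling, this sketch checks on random made-up data (`A` is illustrative only) that the correlation matrix equals $Z^T Z / n$, where $Z$ is the column-standardized data and $n$ is the number of rows. It also shows that `np.corrcoef` expects variables in rows by default, which is why the data is transposed above.

```python
import numpy as np

# Made-up data: 50 samples, 4 features, with features 0 and 1 correlated
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 4))
A[:, 1] += 0.8 * A[:, 0]

Z = (A - A.mean(axis=0)) / A.std(axis=0)  # standardize each column
corr_manual = Z.T @ Z / A.shape[0]        # correlation matrix from standardized data

corr_np = np.corrcoef(A, rowvar=False)    # columns are variables
corr_np_T = np.corrcoef(A.T)              # equivalent: transpose so variables are rows
```

The result is a symmetric matrix with ones on the diagonal and one row and column per feature.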

Next, get the eigenvectors and corresponding eigenvalues for the correlation matrix. Report the third highest eigenvalue as `eig_3` and the eigenvector corresponding to the sixth highest eigenvalue as `eigvec_6`. (10 pts)

Hint: Eigenvectors are stored as the columns of the returned matrix, and `eig` does not guarantee that the eigenvalues come back sorted.

In [16]:
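from scipy.linalg import eig

########################################
# Start your code here
eigvals, eigvecs = eig(corr)

# eig does not guarantee any ordering, so sort by descending eigenvalue
order = np.argsort(np.real(eigvals))[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

eig_3 = eigvals[2]        # third-highest eigenvalue
eigvec_6 = eigvecs[:, 5]  # column paired with the sixth-highest eigenvalue
########################################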

In [17]:
assert np.isclose(np.real(eig_3), 2.33189496669), "Eigenvalue is not correct"
assert np.isclose(eigvec_6[0], 0.13732149628), "Eigenvector is not correct"
assert np.isclose(np.real(eig_3) * np.linalg.norm(eigvec_6[:10]), 0.8256564069966813), "Incorrect eigenvalue or eigenvector selected"
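
As a cross-check on the indexing, here is a small sketch on a made-up 3x3 correlation matrix (`C` is hypothetical). It uses `np.linalg.eigh`, an alternative to `eig` for symmetric matrices that returns real eigenvalues in ascending order, and then reorders them so that position 0 holds the largest eigenvalue.

```python
import numpy as np

# Made-up symmetric correlation matrix
C = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.4],
              [0.2, 0.4, 1.0]])

vals, vecs = np.linalg.eigh(C)   # real eigenvalues in ascending order, eigenvectors as columns
order = np.argsort(vals)[::-1]   # indices that sort the eigenvalues in descending order
vals = vals[order]
vecs = vecs[:, order]            # reorder columns so each stays paired with its eigenvalue

largest_val = vals[0]
largest_vec = vecs[:, 0]         # a column, not a row
```

Each eigenpair satisfies `C @ v == lambda * v` up to rounding, and the eigenvalues of a correlation matrix sum to the number of features.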