# Feature Learning

In this exercise we will apply several feature learning methods to a dataset and then train a supervised learning model on the resulting features to compare them.

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn as sk

plt.style.use("ggplot")

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, mutual_info_regression, RFE
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.decomposition import PCA

The data we'll be working with is the [California housing dataset](http://scikit-learn.org/stable/datasets/index.html#california-housing-dataset).

In [3]:
house_data = fetch_california_housing()
print(house_data["DESCR"])

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /Users/rrj/scikit_learn_data


.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [4]:
house_features = pd.DataFrame(house_data["data"], columns=house_data["feature_names"])
house_prices = pd.Series(house_data["target"])

In [5]:
house_features.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [6]:
house_features.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 |


In [7]:
house_prices.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
dtype: float64

In [8]:
house_prices.describe()

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
dtype: float64

First, we'll split the data into training and test sets so that we can measure how well each model generalizes to data it has not seen.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(house_features, house_prices, test_size=0.2, random_state=0)
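
As a quick sanity check on the split sizes, here is a minimal sketch that uses placeholder arrays with the same number of rows as the housing data. The names `X_demo` and `y_demo` are made up for illustration, and the values are all zeros because only the shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays (assumption: same row count as the housing data, 8 columns).
n_rows, n_cols = 20640, 8
X_demo = np.zeros((n_rows, n_cols))
y_demo = np.zeros(n_rows)

# Same split settings as above: 20% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
```

These row counts are the ones the shape assertions later in the exercise rely on.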

## Filter Feature Selection

Select the best features based on their [mutual information score](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression), computed on the training data only, then transform `X_train` and `X_test` so that they contain just the selected features. See [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest).

In [0]:
# We will select the best k features from all feature selection methods
k = 4

In [0]:
# Save the transformed values into mi_X_train and mi_X_test
# Save the transformer you use into mi_transformer
# YOUR CODE HERE

mi_transformer = SelectKBest(mutual_info_regression, k=k).fit(X_train, y_train)

mi_X_train = mi_transformer.transform(X_train)
mi_X_test = mi_transformer.transform(X_test)


In [29]:
for feature, importance in zip(house_features.columns, mi_transformer.scores_):
    print(f"The MI score for {feature} is {importance}")

The MI score for MedInc is 0.3994739622163763
The MI score for HouseAge is 0.03294380003690822
The MI score for AveRooms is 0.10812235174123686
The MI score for AveBedrms is 0.029776008397862874
The MI score for Population is 0.024695598165312305
The MI score for AveOccup is 0.06934045888602736
The MI score for Latitude is 0.3639427018881296
The MI score for Longitude is 0.40238581379139227
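
The scores alone do not say which columns were kept. One way to recover the selected feature names is the transformer's `get_support()` mask. The sketch below uses a small synthetic dataset (the column names `a` to `d`, the helper `mi_score`, and the target are assumptions made up for illustration) rather than the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data (assumption): the target depends on columns "a" and "c" only.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
demo_y = 3.0 * demo_X["a"] - 2.0 * demo_X["c"] + 0.1 * rng.normal(size=500)


def mi_score(X, y):
    # Fix the random state so the mutual information estimate is repeatable.
    return mutual_info_regression(X, y, random_state=0)


selector = SelectKBest(mi_score, k=2).fit(demo_X, demo_y)

# get_support() returns a boolean mask over the input columns.
selected = list(demo_X.columns[selector.get_support()])
print(selected)
```

The same pattern, `house_features.columns[mi_transformer.get_support()]`, should list the columns kept by `mi_transformer`.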


In [0]:
assert mi_transformer.k == k
assert isinstance(mi_transformer, SelectKBest)
assert len(mi_transformer.scores_) == 8
assert mi_X_train.shape == (16512, k)
assert mi_X_test.shape == (4128, k)

Since the focus in this exercise is on the feature learning and not on the supervised learning portion, we will use a simple estimator (linear regression) for the model training portions.

In [31]:
miEst = LinearRegression()
miEst.fit(mi_X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [32]:
print(f"The R squared value when training on the MI selected features is {miEst.score(mi_X_train, y_train)}.")
print(f"When testing on the test data, the R squared value is {miEst.score(mi_X_test, y_test)}")

The R squared value when training on the MI selected features is 0.5894298141364945.
When testing on the test data, the R squared value is 0.5672465391760814
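
Note that `score` on a scikit-learn regressor returns the coefficient of determination (R squared), where higher is better, and not the mean squared error. The sketch below, on made-up data (`toy_X` and `toy_y` are assumptions for illustration), shows one way to compute both metrics side by side.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data (assumption): linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = toy_X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(toy_X, toy_y)
pred = model.predict(toy_X)

r2_from_score = model.score(toy_X, toy_y)  # R squared
r2 = r2_score(toy_y, pred)                 # same quantity, computed explicitly
mse = mean_squared_error(toy_y, pred)      # mean squared error, lower is better

print(f"R squared: {r2:.4f}, MSE: {mse:.4f}")
```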


## Wrapper Feature Selection

Now try using [recursive feature elimination](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) to select the `k` features we will use instead, eliminating 2 features per step.

In [0]:
# Use an RFE object to determine the k features to select from X_train using a step of 2
# Save the rfe object as rfe_transformer
# Create rfe_X_train and rfe_X_test which are the updated features based on the RFE output

rfeEst = LinearRegression()

# YOUR CODE HERE
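# Pass n_features_to_select explicitly; the default keeps half of the features,
# which only happens to equal k here.
rfe_transformer = RFE(rfeEst, n_features_to_select=k, step=2).fit(X_train, y_train)

rfe_X_train = rfe_transformer.transform(X_train)
rfe_X_test = rfe_transformer.transform(X_test)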
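
When `n_features_to_select` is left unset, `RFE` keeps half of the input features. The sketch below, on synthetic data (`toy_X`, `toy_y`, and the choice of informative columns are assumptions for illustration), contrasts the default with an explicit setting and shows how `support_` and `ranking_` expose the result.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): only columns 0 and 3 influence the target.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 6))
toy_y = 4.0 * toy_X[:, 0] - 3.0 * toy_X[:, 3] + 0.1 * rng.normal(size=300)

# Default: half of the 6 features are kept.
default_rfe = RFE(LinearRegression(), step=2).fit(toy_X, toy_y)

# Explicit: keep exactly 2 features, dropping up to 2 per iteration.
explicit_rfe = RFE(LinearRegression(), n_features_to_select=2, step=2).fit(toy_X, toy_y)

print(default_rfe.n_features_, explicit_rfe.n_features_)
print(np.flatnonzero(explicit_rfe.support_))  # indices of the kept columns
print(explicit_rfe.ranking_)                  # rank 1 marks a selected feature
```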



In [0]:
assert isinstance(rfe_transformer, RFE)
assert rfe_transformer.step == 2
assert len(rfe_transformer.support_) == 8
assert rfe_X_train.shape == (16512, k)
assert rfe_X_test.shape == (4128, k)

In [47]:
rfeEst = LinearRegression()
rfeEst.fit(rfe_X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [48]:
print(f"The R squared value when training on the RFE selected features is {rfeEst.score(rfe_X_train, y_train)}.")
print(f"When testing on the test data, the R squared value is {rfeEst.score(rfe_X_test, y_test)}")

The R squared value when training on the RFE selected features is 0.5926637486301407.
When testing on the test data, the R squared value is 0.5720072080584462


## Embedded Methods

For the embedded methods feature selection example, we will use Lasso. For this task you should use [LassoCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV) and **not** [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html), so that cross-validation picks the regularization strength alpha from a range of candidate values.

Since this is an embedded method, the feature selection will occur directly in the model.

In [0]:
# Create a LassoCV model and save it to lassoClf
# Note: cv=10 sets the number of cross-validation folds; the candidate alphas
# come from LassoCV's default grid
# YOUR CODE HERE

lassoClf = LassoCV(cv=10).fit(X_train, y_train)


In [50]:
lassoClf.coef_

array([ 3.83899283e-01,  1.13979529e-02,  1.56884471e-03,  0.00000000e+00,
       -4.07469366e-07, -4.61025303e-03, -3.24168177e-01, -3.23939467e-01])
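
With Lasso, a feature is effectively dropped when its coefficient is exactly zero. A minimal sketch of reading the kept and dropped features off `coef_` follows. It uses synthetic data, and the feature names `f0` to `f5` are placeholders invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data (assumption): only the first two columns influence the target.
rng = np.random.default_rng(0)
names = np.array(["f0", "f1", "f2", "f3", "f4", "f5"])
toy_X = rng.normal(size=(200, 6))
toy_y = 3.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1] + 0.1 * rng.normal(size=200)

lasso = LassoCV(cv=5, random_state=0).fit(toy_X, toy_y)

# Features with a nonzero coefficient are the ones the model still uses.
kept = names[lasso.coef_ != 0]
dropped = names[lasso.coef_ == 0]

print("alpha chosen by cross-validation:", lasso.alpha_)
print("kept:", kept)
print("dropped:", dropped)
```

Applied to the output above, this reading suggests that `AveBedrms` (the fourth coefficient, exactly zero) was dropped, while `Population` was kept with a very small weight.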

In [51]:
lassoClf.alpha_

0.03577400513173515

In [0]:
assert lassoClf
assert isinstance(lassoClf, LassoCV)
assert len(lassoClf.coef_) == 8

In [53]:
print(f"The R squared value when training using lasso is {lassoClf.score(X_train, y_train)}.")
print(f"When testing on the test data, the R squared value is {lassoClf.score(X_test, y_test)}")

The R squared value when training using lasso is 0.5929687839875066.
When testing on the test data, the R squared value is 0.5709853376060581


## Feature Extraction

Here we'll use [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to extract features from the data.

In [0]:
# Extract k principal components using PCA
# Save the PCA transformer object as pca_transformer
# Transform X_train and X_test into the new shape and save to pca_X_train and pca_X_test

# YOUR CODE HERE
pca_transformer = PCA(n_components=k).fit(X_train)

pca_X_train = pca_transformer.transform(X_train)
pca_X_test = pca_transformer.transform(X_test)

In [0]:
assert pca_transformer 
assert isinstance(pca_transformer, PCA)
assert pca_transformer.n_components == k
assert pca_X_train.shape == (16512, k)
assert pca_X_test.shape == (4128, k)

In [58]:
pcaEst = LinearRegression()
pcaEst.fit(pca_X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [61]:
print(f"The R squared value when training using PCA is {pcaEst.score(pca_X_train, y_train)}.")
print(f"When testing on the test data, the R squared value is {pcaEst.score(pca_X_test, y_test)}")

The R squared value when training using PCA is 0.011832093756049544.
When testing on the test data, the R squared value is 0.003451711009928249


In [62]:
pca_transformer.get_covariance()

array([[ 2.59881878e+00, -2.85404338e+00,  1.69458762e-01,
         2.02649494e-02,  9.88022268e+00,  2.12524850e-02,
         1.77252627e-01, -1.13858541e-01],
       [-2.85404338e+00,  1.58777213e+02, -4.81380579e+00,
        -4.69596878e-01, -4.30539684e+03,  1.01359303e+00,
         4.93684706e-01, -2.87568750e+00],
       [ 1.69458762e-01, -4.81380579e+00,  3.15500331e+00,
         8.25054593e-02, -1.92964965e+02,  7.47416667e-02,
         1.11027923e+00, -9.23849732e-01],
       [ 2.02649494e-02, -4.69596878e-01,  8.25054593e-02,
         2.54365791e+00, -3.40875068e+01, -6.12764836e-03,
         1.60142020e-01, -1.36806830e-01],
       [ 9.88022268e+00, -4.30539684e+03, -1.92964965e+02,
        -3.40875068e+01,  1.30659246e+06,  5.35350485e+02,
        -2.80412604e+02,  2.40965884e+02],
       [ 2.12524850e-02,  1.01359303e+00,  7.47416667e-02,
        -6.12764836e-03,  5.35350485e+02,  4.14906644e+01,
        -7.41569408e-02,  1.75039261e-01],
       [ 1.77252627e-01,  4.936847

The explained variance ratio tells us how much of the variance is explained by each component.

In [63]:
pca_transformer.explained_variance_ratio_

array([9.99843464e-01, 1.10928016e-04, 3.15269403e-05, 6.32948656e-06])

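PCA can perform poorly as a preprocessing step for regression. It is unsupervised: it keeps the directions of largest variance in the features, and these need not be the directions that predict the target. Here the features were not standardized, so the first component, which accounts for more than 99.98% of the variance, is dominated by the large-scale `Population` column. See [this answer](https://stats.stackexchange.com/a/52798) on StackExchange for more info. In the next lecture, we will take a deeper dive into PCA.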
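
To illustrate how feature scale affects PCA, here is a small sketch on synthetic data (the array `toy_X` and its scales are assumptions chosen for illustration, not the housing data). One column is given a much larger scale than the others, and the explained variance ratio is compared with and without standardizing first via `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): three independent columns, one on a far larger scale.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(500, 3)) * np.array([1.0, 1.0, 1000.0])

# PCA on the raw data: the large-scale column dominates the first component.
raw_ratio = PCA(n_components=2).fit(toy_X).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
scaled_X = StandardScaler().fit_transform(toy_X)
scaled_ratio = PCA(n_components=2).fit(scaled_X).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

Standardizing is not guaranteed to improve the downstream regression, but it does stop a single large-scale column from determining the components on its own.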

## Feedback

In [0]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    return "This was great. Thank you so much."