# How to select features for machine learning

Which features do you use? All of them? Some of them?
    
__Remember__: Our goal is to find the smallest set of the available features such that the fitted model will reach its maximal predictive value. 
    
### Why?     
 - Less complexity = reduced variance (less risk of overfitting)
 - Lower dimensional space = less computation time    
 - Fewer variables = better interpretability    
 
### How to pick?    
 - Domain expertise    
 - Regularization techniques    
 - Transformer methods    
 - Dimensionality reduction    

#### Imports

In [70]:
import os
import zipfile
import requests
import pandas as pd


from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

#### Fetch the data

In [71]:
OCCUPANCY = ('http://bit.ly/ddl-occupancy-dataset', 'occupancy.zip')
CREDIT    = ('http://bit.ly/ddl-credit-dataset', 'credit.xls')
CONCRETE  = ('http://bit.ly/ddl-concrete-data', 'concrete.xls')

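def download_data(url, name, path='data'):
    # Create the target directory (and any parents) if it is missing
    os.makedirs(path, exist_ok=True)

    response = requests.get(url)
    # Fail loudly on HTTP errors instead of saving an error page to disk
    response.raise_for_status()
    with open(os.path.join(path, name), 'wb') as f:
        f.write(response.content)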


def download_all(path='data'):
    for href, name in (OCCUPANCY, CREDIT, CONCRETE):
        download_data(href, name, path)

    # Extract the occupancy zip data, closing the archive when done
    with zipfile.ZipFile(os.path.join(path, 'occupancy.zip')) as z:
        z.extractall(os.path.join(path, 'occupancy'))

path = 'data'
download_all(path)

#### Load the first dataset into a dataframe

In [72]:
# Load the room occupancy dataset into a dataframe
occupancy = os.path.join('data','occupancy','datatraining.txt')
occupancy = pd.read_csv(occupancy, sep=',')
occupancy.columns = [
    'date', 'temp', 'humid', 'light', 'co2', 'hratio', 'occupied'
]

#### Separate dataframe into features and targets

In [73]:
features = occupancy[['temp', 'humid', 'light', 'co2', 'hratio']]
labels   = occupancy['occupied']

In [74]:
list(features)

['temp', 'humid', 'light', 'co2', 'hratio']

## Regularization techniques

### LASSO  (L1 Regularization) 
LASSO shrinks the coefficients of weak features to exactly zero, effectively dropping the least predictive features.

In [75]:
# LASSO (L1) drives the coefficients of weak features to exactly 0, dropping them
model = Lasso()
model.fit(features, labels)
print(list(zip(features, model.coef_.tolist())))

[('temp', -0.0), ('humid', 0.0), ('light', 0.001604039030371534), ('co2', 0.0002566541934433866), ('hratio', 0.0)]
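
How many coefficients LASSO zeroes out depends on the penalty strength `alpha` (`Lasso()` defaults to `alpha=1.0`). The sketch below is illustrative only: it uses synthetic data rather than the occupancy dataset, and the `alpha` values are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration, not the occupancy dataset)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
# Only the first two columns actually influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

nonzero_counts = {}
for alpha in (0.01, 0.1, 1.0):
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    # Count how many features survive at this penalty strength
    nonzero_counts[alpha] = int(np.sum(coef != 0))
    print(alpha, np.round(coef, 3))

print(nonzero_counts)
```

A stronger penalty tends to leave fewer nonzero coefficients, so `alpha` is usually worth tuning (for example with cross-validation) rather than leaving at its default.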


### Ridge Regression (L2 Regularization) 
Ridge assigns every feature a nonzero weight but shrinks the coefficients toward zero, so less predictive features are down-weighted rather than dropped.

In [76]:
# Ridge (L2) assigns a weight to every feature, shrinking less predictive features in concert with all the others
model = Ridge()
model.fit(features, labels)
print(list(zip(features, model.coef_.tolist())))

[('temp', -0.06273198200152029), ('humid', -0.0024965287829986226), ('light', 0.0017625248355878475), ('co2', 0.00033448323910357223), ('hratio', 0.004924465342955606)]
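
To see the shrinking behavior more directly, the sketch below fits Ridge at increasing penalty strengths. It is a minimal illustration on synthetic data (an assumption, not the occupancy dataset), and the `alpha` values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

norms = []
all_nonzero = []
for alpha in (1.0, 100.0, 10000.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    # Track the overall size of the coefficient vector
    norms.append(float(np.linalg.norm(coef)))
    # Ridge shrinks coefficients but does not set them exactly to zero
    all_nonzero.append(bool(np.all(coef != 0)))
    print(alpha, np.round(coef, 4))

print(norms, all_nonzero)
```

The coefficient vector gets smaller as `alpha` grows, yet every feature keeps a nonzero weight, which is why Ridge on its own does not select features.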


### ElasticNet
ElasticNet penalizes a linear combination of the L1 and L2 norms, so it combines LASSO and Ridge. The `l1_ratio` parameter sets the mix; its default of 0.5 essentially splits the difference.

In [77]:
# ElasticNet penalizes a linear combination of L1 and L2 (combines LASSO and Ridge)
model = ElasticNet()
model.fit(features, labels)
print(list(zip(features, model.coef_.tolist())))

[('temp', -0.0), ('humid', 0.0), ('light', 0.0016178824800305253), ('co2', 0.0002560187107121839), ('hratio', -0.0)]
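
The mix between the two penalties is controlled by `l1_ratio`. As a rough sketch on synthetic data (an assumption, not the occupancy dataset), setting `l1_ratio=1.0` should reproduce LASSO, while the default of 0.5 blends in the L2 penalty.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso_coef = Lasso(alpha=0.1).fit(X, y).coef_
# l1_ratio=1.0 makes the penalty pure L1, so this should match Lasso
enet_l1_coef = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y).coef_
# The default l1_ratio=0.5 mixes the L1 and L2 penalties
enet_mix_coef = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_

print(np.round(lasso_coef, 3))
print(np.round(enet_l1_coef, 3))
print(np.round(enet_mix_coef, 3))
```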


## Transformer methods    

### `SelectFromModel()` 
Scikit-Learn provides `SelectFromModel`, a meta-transformer that selects features based on importance weights (for the linear models here, the magnitude of each fitted coefficient).

In [78]:
# LASSO demo: SelectFromModel keeps the features whose fitted weights count as important
model = Lasso()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
print(list(features.iloc[:, sfm.get_support(indices=True)]))

['light', 'co2']


In [79]:
# Ridge demo: SelectFromModel keeps the features whose fitted weights count as important
model = Ridge()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
print(list(features.iloc[:, sfm.get_support(indices=True)]))

['temp']


In [80]:
# ElasticNet demo: SelectFromModel keeps the features whose fitted weights count as important
model = ElasticNet()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
print(list(features.iloc[:, sfm.get_support(indices=True)]))

['light']
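
The three models select different features partly because `SelectFromModel` applies a different default cutoff to each. As far as the scikit-learn documentation describes it, an L1-penalized estimator such as `Lasso` gets a threshold of `1e-5`, while other estimators use the mean absolute coefficient. That would be consistent with Ridge keeping only `temp` above, since its coefficient is far larger than the rest. The sketch below inspects the fitted `threshold_` on synthetic data (an assumption for illustration); the explicit `threshold=2.5` is an arbitrary value.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Ridge has no L1 penalty, so the default cutoff is the mean |coefficient|
ridge_sfm = SelectFromModel(Ridge()).fit(X, y)
ridge_mean = np.mean(np.abs(ridge_sfm.estimator_.coef_))
print(ridge_sfm.threshold_, ridge_mean, ridge_sfm.get_support())

# Lasso gets a near-zero default cutoff, keeping any nonzero coefficient
lasso_sfm = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(lasso_sfm.threshold_, lasso_sfm.get_support())

# An explicit threshold overrides either default
strict_sfm = SelectFromModel(Ridge(), threshold=2.5).fit(X, y)
print(strict_sfm.get_support())
```

Because the cutoff is applied to raw coefficient magnitudes, features measured on very different scales are not directly comparable; standardizing the features first is one common way to make the comparison fairer.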


## Dimensionality reduction

### Principal component analysis (PCA)
PCA performs linear dimensionality reduction: it computes the Singular Value Decomposition (SVD) of the data and keeps only the most significant singular vectors to project the data into a lower-dimensional space.

 - Unsupervised method
 - Uses a signal representation criterion
 - Identifies the combination of attributes that account for the most variance in the data.

In [81]:
pca = PCA(n_components=2)
new_features = pca.fit(features).transform(features)
print(new_features)

[[239.75415116 222.70021675]
 [234.83734556 229.07352309]
 [232.82363564 226.1680712 ]
 ...
 [312.01464185 194.23928716]
 [331.53968882 184.46791921]
 [338.40133606 196.6888825 ]]
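
Because PCA looks for directions of maximum variance, a feature on a large scale can dominate the components. The sketch below is a minimal illustration on synthetic data (an assumption, not the occupancy dataset) that compares `explained_variance_ratio_` with and without standardization; using `StandardScaler` here is a choice made for the example, not something the original analysis does.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Three independent columns; the first is on a much larger scale
X = rng.normal(size=(300, 3)) * np.array([100.0, 1.0, 1.0])

# Fraction of total variance captured by each of the two components
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# Same data after rescaling each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(np.round(raw_ratio, 4))
print(np.round(scaled_ratio, 4))
```

On the raw data the first component is driven almost entirely by the large-scale column; after scaling, the variance is spread much more evenly.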


### Linear discriminant analysis (LDA)
LDA is a classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. It can also be used to reduce the dimensionality of the input by projecting it onto the most discriminative directions, of which there are at most `n_classes - 1`.

- Supervised method
- Uses a classification criterion
- Tries to identify attributes that account for the most variance between classes.

In [49]:
# occupied is binary, so LDA can produce at most n_classes - 1 = 1 component
lda = LDA(n_components=1)
new_features = lda.fit(features, labels).transform(features)
print(new_features)

[[3.44457859]
 [3.4827798 ]
 [3.43530287]
 ...
 [4.25852644]
 [4.29565825]
 [4.4748546 ]]
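
The single output column reflects a limit of LDA: the projection has at most `min(n_features, n_classes - 1)` dimensions. The following sketch, on synthetic data with made-up class means (assumptions for illustration), contrasts a three-class problem with a binary one.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
# Three classes of 50 points each in 4 dimensions, with non-collinear means
means = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
])
X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in means])
y3 = np.repeat([0, 1, 2], 50)

# 3 classes -> up to 2 discriminant directions
proj3 = LinearDiscriminantAnalysis(n_components=2).fit(X, y3).transform(X)

# Merge classes 1 and 2 into one -> binary problem, only 1 direction available
y2 = (y3 > 0).astype(int)
proj2 = LinearDiscriminantAnalysis().fit(X, y2).transform(X)

print(proj3.shape, proj2.shape)
```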


To learn more about feature selection tools within Scikit-Learn, check out http://scikit-learn.org/stable/modules/feature_selection.html.

## Exercises
Try out the above techniques yourself with the Credit Card Default and Concrete Strength datasets.

In [82]:
# Reading .xls files with pandas requires xlrd (pip install xlrd)
# Load the credit card default dataset into a dataframe
credit = os.path.join('data','credit.xls')
credit = pd.read_excel(credit, header=1)
credit.columns = [
    'id', 'limit', 'sex', 'edu', 'married', 'age', 'apr_delay', 'may_delay',
    'jun_delay', 'jul_delay', 'aug_delay', 'sep_delay', 'apr_bill', 'may_bill',
    'jun_bill', 'jul_bill', 'aug_bill', 'sep_bill', 'apr_pay', 'may_pay', 'jun_pay',
    'jul_pay', 'aug_pay', 'sep_pay', 'default'
]

# Separate dataframe into features and targets
cred_features = credit[[
    'limit', 'sex', 'edu', 'married', 'age', 'apr_delay', 'may_delay',
    'jun_delay', 'jul_delay', 'aug_delay', 'sep_delay', 'apr_bill', 'may_bill',
    'jun_bill', 'jul_bill', 'aug_bill', 'sep_bill', 'apr_pay', 'may_pay',
    'jun_pay', 'jul_pay', 'aug_pay', 'sep_pay'
]]
cred_labels = credit['default']


# Load the concrete compression dataset into a dataframe
concrete = pd.read_excel(os.path.join('data','concrete.xls'))
concrete.columns = [
    'cement', 'slag', 'ash', 'water', 'splast',
    'coarse', 'fine', 'age', 'strength'
]

# Separate dataframe into features and targets
conc_features = concrete[[
    'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]]
conc_labels = concrete['strength']

In [84]:
#List Features
list(conc_features)

['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']

In [85]:
#LASSO
model = Lasso()
model.fit(conc_features, conc_labels)
print(list(zip(conc_features, model.coef_.tolist())))

[('cement', 0.11830385490649789), ('slag', 0.10192312708923051), ('ash', 0.08692065036243565), ('water', -0.1679459525417401), ('splast', 0.22324101440120747), ('coarse', 0.014248476813369436), ('fine', 0.017337277629753935), ('age', 0.11375440243198076)]


In [86]:
#Ridge
model = Ridge()
model.fit(conc_features, conc_labels)
print(list(zip(conc_features, model.coef_.tolist())))

[('cement', 0.11978531126731268), ('slag', 0.10384729638533656), ('ash', 0.08794351927289551), ('water', -0.1503018798373258), ('splast', 0.290666261913905), ('coarse', 0.018029539615950222), ('fine', 0.020154207514373395), ('age', 0.11422560057934898)]


In [87]:
#ENET
model = ElasticNet()
model.fit(conc_features, conc_labels)
print(list(zip(conc_features, model.coef_.tolist())))

[('cement', 0.11908714234987437), ('slag', 0.10292563560088508), ('ash', 0.08764411558867577), ('water', -0.16068677429011297), ('splast', 0.24817944283575386), ('coarse', 0.01588627715245208), ('fine', 0.01866789704444519), ('age', 0.11397596593916044)]


In [88]:
#Feat Select--LASSO
model = Lasso()
sfm = SelectFromModel(model)
sfm.fit(conc_features, conc_labels)
print(list(conc_features.iloc[:, sfm.get_support(indices=True)]))

['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']


In [89]:
#Feat Select--RIDGE
model = Ridge()
sfm = SelectFromModel(model)
sfm.fit(conc_features, conc_labels)
print(list(conc_features.iloc[:, sfm.get_support(indices=True)]))

['cement', 'water', 'splast', 'age']


In [90]:
#Feat Select--ElasticNet
model = ElasticNet()
sfm = SelectFromModel(model)
sfm.fit(conc_features, conc_labels)
print(list(conc_features.iloc[:, sfm.get_support(indices=True)]))

['cement', 'water', 'splast', 'age']


In [91]:
#PCA_it
pca = PCA(n_components=2)
new_features = pca.fit(conc_features).transform(conc_features)
print(new_features)

[[ 284.79397884  -10.36611049]
 [ 284.6574078   -14.48472503]
 [ 101.85095094  183.43561281]
 ...
 [-152.62039873   49.50233773]
 [-132.38216524   87.90567118]
 [ -29.20115611   48.36298599]]


In [69]:
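# LDA is a classifier and expects discrete class labels; concrete strength is
# continuous, so fit() raises the ValueError shown below
lda = LDA(n_components=2)
new_features = lda.fit(conc_features, conc_labels).transform(conc_features)
print(new_features)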

ValueError: Unknown label type: (array([79.98611076, 61.88736576, 40.26953526, ..., 23.69660064,
       32.76803638, 32.40123514]),)
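
The error arises because LDA needs class labels, and a continuous target such as concrete strength is not one. If an LDA projection is still wanted, one possible workaround is to bin the target into a few classes first. The sketch below shows the idea on synthetic data; the data, the tertile bin edges, and the choice of three bins are all assumptions for illustration, and binning discards information, so for a regression target the regularization or PCA approaches above may be a better fit.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.utils.multiclass import type_of_target

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
# Continuous target, loosely analogous to a strength measurement
y_cont = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300)
# How scikit-learn classifies the raw target
print(type_of_target(y_cont))

# Cut the target into three bins at its tertiles (an illustrative choice)
edges = np.quantile(y_cont, [1 / 3, 2 / 3])
y_binned = np.digitize(y_cont, edges)
print(type_of_target(y_binned))

# With discrete labels, LDA can fit and project onto one direction
lda = LinearDiscriminantAnalysis(n_components=1)
proj = lda.fit(X, y_binned).transform(X)
print(proj.shape)
```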