# How to select features for machine learning

Which features do you use? All of them? Some of them?
    
__Remember__: Our goal is to find the smallest set of the available features such that the fitted model will reach its maximal predictive value. 
    
### Why?     
 - Less complexity = reduced variance (less overfitting)
 - Lower dimensional space = less computation time    
 - Fewer variables = better interpretability    
 
### How to pick?    
 - Domain expertise    
 - Regularization techniques    
 - Automated methods    
 - Dimensionality reduction    

#### Imports

In [None]:
import os
import zipfile
import requests
import pandas as pd

#### Fetch the data

In [None]:
OCCUPANCY = ('http://bit.ly/ddl-occupancy-dataset', 'occupancy.zip')
CREDIT    = ('http://bit.ly/ddl-credit-dataset', 'credit.xls')
CONCRETE  = ('http://bit.ly/ddl-concrete-data', 'concrete.xls')

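def download_data(url, name, path='data'):
    """Download the file at url and save it as name inside path."""
    # Create the target directory if it does not already exist
    os.makedirs(path, exist_ok=True)

    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors

    # response.content is bytes, so the file must be opened in binary mode
    with open(os.path.join(path, name), 'wb') as f:
        f.write(response.content)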


def download_all(path='data'):
    for href, name in (OCCUPANCY, CREDIT, CONCRETE):
        download_data(href, name, path)

    # Extract the occupancy zip data; the context manager closes the archive
    with zipfile.ZipFile(os.path.join(path, 'occupancy.zip')) as z:
        z.extractall(os.path.join(path, 'occupancy'))

path='data'
download_all(path)

#### Load into dataframes

In [None]:
# Load the room occupancy dataset
# The archive is extracted to data/occupancy by download_all
occupancy = os.path.join('data','occupancy','datatraining.txt')
occupancy = pd.read_csv(occupancy, sep=',')
occupancy.columns = [
    'date', 'temp', 'humid', 'light', 'co2', 'hratio', 'occupied'
]

# Load the credit card default dataset
credit = os.path.join('data','credit.xls')
credit = pd.read_excel(credit, header=1)
credit.columns = [
    'id', 'limit', 'sex', 'edu', 'married', 'age', 'apr_delay', 'may_delay',
    'jun_delay', 'jul_delay', 'aug_delay', 'sep_delay', 'apr_bill', 'may_bill',
    'jun_bill', 'jul_bill', 'aug_bill', 'sep_bill', 'apr_pay', 'may_pay', 'jun_pay',
    'jul_pay', 'aug_pay', 'sep_pay', 'default'
]

# Load the concrete compression data set
concrete = pd.read_excel(os.path.join('data','concrete.xls'))
concrete.columns = [
    'cement', 'slag', 'ash', 'water', 'splast',
    'coarse', 'fine', 'age', 'strength'
]

## Regularization techniques

### LASSO  (L1 Regularization) 
LASSO penalizes the sum of the absolute values of the coefficients. This drives the coefficients of weak features to exactly zero, effectively dropping the least predictive features from the model.
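
To make the zeroing effect concrete, here is a minimal sketch using scikit-learn's `Lasso`. It runs on synthetic data from `make_regression` rather than the datasets loaded above, and the `alpha=1.0` penalty strength is an arbitrary illustrative choice, not a recommended setting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features,
# only 3 of which actually drive the target
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=42
)

# Put the features on a common scale so the penalty treats them evenly
X = StandardScaler().fit_transform(X)

# alpha controls the strength of the L1 penalty (illustrative value)
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Features with a nonzero coefficient are kept, the rest are dropped
kept = np.flatnonzero(lasso.coef_)
dropped = np.flatnonzero(lasso.coef_ == 0)
print("kept features:   ", kept)
print("dropped features:", dropped)
```

Increasing `alpha` should drop more features; decreasing it toward zero moves the fit back toward ordinary least squares.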

### Ridge Regression (L2 Regularization) 
Ridge penalizes the sum of the squared coefficients. Every feature keeps a nonzero weight, but the weights are shrunk toward zero and spread more evenly, so less predictive features are down-weighted rather than removed.
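
The shrinkage can be checked by comparing Ridge against an unpenalized fit. The sketch below again assumes synthetic data and an arbitrary `alpha=10.0`; it compares the overall size of the coefficient vectors and counts how many Ridge coefficients are exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=42
)
X = StandardScaler().fit_transform(X)

# Unpenalized baseline and an L2-penalized fit (alpha is illustrative)
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge shrinks the coefficient vector as a whole...
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)

# ...but it does not set individual coefficients to exactly zero
n_zero = int(np.sum(ridge.coef_ == 0))
print("Ridge coefficients equal to zero:", n_zero)
```

Because no coefficient reaches zero, Ridge on its own does not select features; it is more useful for stabilizing a model than for trimming its inputs.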

### ElasticNet
ElasticNet penalizes a weighted sum of the L1 and L2 terms, so it combines LASSO and Ridge and essentially splits the difference: it can still zero out weak features, while shrinking the remaining coefficients the way Ridge does.
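
In scikit-learn the mix is controlled by the `l1_ratio` parameter of `ElasticNet`. The following sketch, on assumed synthetic data with an illustrative `alpha=1.0`, fits a few mixes and counts the surviving features, then checks that `l1_ratio=1.0` reproduces the `Lasso` fit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=42
)
X = StandardScaler().fit_transform(X)

# l1_ratio near 0 behaves more like Ridge, near 1 more like LASSO
nonzero_counts = {}
for l1_ratio in (0.1, 0.5, 0.9):
    enet = ElasticNet(alpha=1.0, l1_ratio=l1_ratio).fit(X, y)
    nonzero_counts[l1_ratio] = int(np.count_nonzero(enet.coef_))
    print(f"l1_ratio={l1_ratio}: {nonzero_counts[l1_ratio]} nonzero coefficients")

# With l1_ratio=1.0 the penalty is pure L1, which matches Lasso
pure_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print("matches Lasso:", np.allclose(pure_l1.coef_, lasso.coef_))
```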

## Automated methods
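
One example of an automated method is recursive feature elimination, available in scikit-learn as `RFE`. This is a sketch of one possible approach, not the only automated method; the synthetic data, the `LinearRegression` estimator, and the target of 3 features are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=42
)

# Repeatedly fit the model and discard the weakest feature
# until only n_features_to_select remain
selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

selected = selector.support_   # boolean mask of the kept features
ranking = selector.ranking_    # 1 = kept, larger = eliminated earlier
print("selected mask:", selected)
print("ranking:      ", ranking)
```

Calling `selector.transform(X)` returns only the selected columns, which can then be passed to a downstream model.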

## Dimensionality reduction
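
A common dimensionality reduction technique is principal component analysis (PCA). Unlike the methods above, it does not pick a subset of the original features; it builds a smaller set of new features as linear combinations of them. The sketch below assumes synthetic data and an arbitrary choice of 3 components.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=42
)

# PCA is sensitive to scale, so standardize the features first
X = StandardScaler().fit_transform(X)

# Project the 10 original features onto 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)
print("variance explained per component:", pca.explained_variance_ratio_)
```

The trade-off is interpretability: each component mixes several original features, so the reduced columns no longer map to a single measured variable.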