<center> <img src="res/ds3000.png"> </center>

<center> <h1> Week 12 - Day 1</h1> </center>

<center> <h2> Part 3: Feature Selection</h2></center>

## Outline
1. <a href='#1'>Univariate Selection</a>
2. <a href='#2'>Model-Based Feature Selection</a>
3. <a href='#3'>Iterative Feature Selection</a>


<a id="1"></a>

## 1.  Univariate Selection
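* With this strategy, we test whether there is a statistically significant relationship between each feature and the target.
* The features with the strongest relationship to the target (highest F-score, i.e. lowest p-value) are selected.
* Each feature is considered individually, so interactions between features are ignored.
* The score comes from an F-test: an ANOVA F-test for classification targets and a univariate linear regression F-test for regression targets.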

### 1.1. Tests
* Use f_classif for classification (this is the default score_func)
* Use f_regression for regression
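
As a rough illustration of what these score functions return, the sketch below calls f_regression directly. It uses a small synthetic dataset from make_regression instead of the housing data (an assumption, so that the example runs without a download), and the variable names are only illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Synthetic stand-in data (assumption): 5 features, 2 of which drive the target.
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=5.0, random_state=0)

# f_regression tests each feature on its own and returns two arrays,
# one F-score and one p-value per feature.
f_scores, p_values = f_regression(X, y)

for i, (f, p) in enumerate(zip(f_scores, p_values)):
    print(f"feature {i}: F = {f:.2f}, p = {p:.4f}")
```

A larger F-score (smaller p-value) suggests a stronger individual linear relationship with the target; SelectKBest and SelectPercentile rank features by these scores.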

### 1.2. Criteria
* How many features to select?
* Two methods:
    * SelectKBest(k = 5): keeps the k highest-scoring features (here, 5)
    * SelectPercentile(percentile = 50): keeps the given percentage of the highest-scoring features (here, the top 50%)
    
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
 
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html
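
The two criteria can be compared side by side. The following is a minimal sketch on synthetic data (an assumption; the dataset and variable names are not part of the lecture). With 8 features, k = 4 and percentile = 50 should keep the same four columns as long as no scores are tied.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic stand-in data (assumption): 8 features, like the housing data.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Keep a fixed number of features ...
k_best = SelectKBest(score_func=f_regression, k=4).fit(X, y)
# ... or a fixed percentage of them.
top_half = SelectPercentile(score_func=f_regression, percentile=50).fit(X, y)

X_k = k_best.transform(X)
X_pct = top_half.transform(X)

print("SelectKBest shape:     ", X_k.shape)
print("SelectPercentile shape:", X_pct.shape)
print("Same features chosen:  ",
      np.array_equal(k_best.get_support(), top_half.get_support()))
```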

In [1]:
import pandas as pd

from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()  # Bunch object

df = pd.DataFrame(california.data, columns=california.feature_names)
df["Value"] = california.target

features = df.drop("Value", axis=1)
target = df["Value"]



In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

#split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)

#define a selection method and specify the score function to be f_regression
select = SelectKBest(score_func=f_regression, k = 3)
select.fit(X_train, y_train)

#transform training and testing sets so only the selected features are retained
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)

In [3]:
X_train_selected

array([[ 2.7483    ,  4.10650888, 34.09      ],
       [ 4.58      ,  6.00986193, 34.27      ],
       [ 1.3844    ,  3.45646438, 37.78      ],
       ...,
       [ 5.299     ,  7.21493213, 34.91      ],
       [ 7.0309    ,  5.43678161, 37.41      ],
       [ 2.8167    ,  6.08007812, 37.19      ]])

In [4]:
model = LinearRegression().fit(X=X_train, y=y_train)

print("Original results:")
print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train)))
print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test)))


model = LinearRegression().fit(X=X_train_selected, y=y_train)

print("With selected features:")
print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train_selected)))
print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test_selected)))

Original results:
	R-squared value for training set:  0.6095160399631114
	R-squared value for testing set:  0.5954462325232106
With selected features:
	R-squared value for training set:  0.4922565671674721
	R-squared value for testing set:  0.4571291548708881


In [5]:
print(features.columns)

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude'],
      dtype='object')


In [6]:
#returns a Boolean mask of selected features
select.get_support()

array([ True, False,  True, False, False, False,  True, False])
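
The mask lines up position by position with features.columns, so it can be used to look up the names of the selected features. The sketch below copies the column names and the mask printed above so that it runs on its own; in the notebook, features.columns[select.get_support()] should give the same result.

```python
import numpy as np
import pandas as pd

# Column names and mask copied from the outputs above.
columns = pd.Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                    'Population', 'AveOccup', 'Latitude', 'Longitude'])
mask = np.array([True, False, True, False, False, False, True, False])

# Boolean indexing keeps only the positions where the mask is True.
selected = list(columns[mask])
print(selected)  # ['MedInc', 'AveRooms', 'Latitude']
```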

<a id="2"></a>

## 2. Model-Based Feature Selection
* Uses a supervised ML algorithm to judge the importance of each feature and keeps only the most important ones
* The model used for selection doesn't need to be the same model that is later trained on the selected features
* The feature selection model needs to provide some measure of importance for each feature
    * so that they can be ranked by this measure
    
    
* DecisionTreeRegressor() provides a feature_importances_ attribute
    * Use a regression algorithm when the target variable is continuous
    * Use a classification algorithm when the target variable is categorical
    
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

In [7]:
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

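#SelectFromModel fits the tree and keeps the features whose importance is at or above the threshold;
#threshold = "median" sets that threshold to the median of all feature importances
select = SelectFromModel(DecisionTreeRegressor(random_state = 3000), threshold = "median")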

In [8]:
select.fit(X_train, y_train)

#transform training and testing sets so only the selected features are retained
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)

model = LinearRegression().fit(X=X_train, y=y_train)

print("Original results:")
print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train)))
print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test)))


model = LinearRegression().fit(X=X_train_selected, y=y_train)

print("With selected features:")
print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train_selected)))
print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test_selected)))

Original results:
	R-squared value for training set:  0.6095160399631114
	R-squared value for testing set:  0.5954462325232106
With selected features:
	R-squared value for training set:  0.5874746879530321
	R-squared value for testing set:  0.5780931872855849


In [9]:
print(features.columns)

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude'],
      dtype='object')


In [10]:
select.get_support()

array([ True, False, False, False, False,  True,  True,  True])

In [11]:
select.threshold_

0.06946007914825762
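
The threshold_ value is the median of the fitted tree's feature importances, and a feature is kept when its importance is at or above that value. The sketch below checks this by hand on synthetic data (an assumption, used so the example is self-contained; the names selector and manual_mask are illustrative).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)

selector = SelectFromModel(DecisionTreeRegressor(random_state=0),
                           threshold="median").fit(X, y)

# The fitted tree is stored in estimator_; its importances sum to 1.
importances = selector.estimator_.feature_importances_

# Rebuild the selection mask by comparing importances to the threshold.
manual_mask = importances >= selector.threshold_

print("importances:", np.round(importances, 3))
print("threshold_: ", selector.threshold_)
print("mask matches get_support():",
      np.array_equal(manual_mask, selector.get_support()))
```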

<a id="3"></a>

## 3. Iterative Feature Selection
* A series of models is built, with varying numbers of features
* **Recursive Feature Elimination** (RFE)
    * Starts with all features, builds a model, and discards the least important feature according to the model
    * Then a new model is built using all but the discarded feature, and so on
    * This is repeated until only a prespecified number of features is left
    
* Use RFE's **n_features_to_select** parameter to set the number of features to select
 
* The feature selection model needs to provide some measure of importance for each feature
    * so that they can be ranked by this measure

* https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
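
The elimination loop described above can be sketched by hand. This is only an illustration of the idea, not scikit-learn's implementation of RFE; it assumes synthetic data and drops one feature per round, and the names remaining and weakest are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)

n_features_to_select = 3
remaining = list(range(X.shape[1]))  # column indices still in play

while len(remaining) > n_features_to_select:
    # Build a model on the features that are still left.
    tree = DecisionTreeRegressor(random_state=0).fit(X[:, remaining], y)
    # Find the least important of them and discard it.
    weakest = int(np.argmin(tree.feature_importances_))
    dropped = remaining.pop(weakest)
    print(f"dropped feature {dropped}, {len(remaining)} left")

print("selected feature indices:", remaining)
```

Because a new model is fitted in every round, this is more expensive than univariate or model-based selection, which score the features only once.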

<center> <img src="res/RFE_example.png"> </center>

In [12]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

select = RFE(DecisionTreeRegressor(random_state = 3000), n_features_to_select = 3)

In [13]:
#fit the RFE selector to the training data
select.fit(X_train, y_train)

#transform training and testing sets so only the selected features are retained
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)

model = LinearRegression().fit(X=X_train, y=y_train)

print("Original results:")
print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train)))
print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test)))


model = LinearRegression().fit(X=X_train_selected, y=y_train)

print("With selected features:")
print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train_selected)))
print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test_selected)))

Original results:
	R-squared value for training set:  0.6095160399631114
	R-squared value for testing set:  0.5954462325232106
With selected features:
	R-squared value for training set:  0.5863410269918616
	R-squared value for testing set:  0.5777114692521801


In [14]:
print(features.columns)

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude'],
      dtype='object')


In [15]:
select.get_support()

array([ True, False, False, False, False, False,  True,  True])
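
Besides the Boolean mask, a fitted RFE object records the order of elimination in its ranking_ attribute: selected features have rank 1, and with the default of one feature removed per step, a larger rank means the feature was discarded earlier. A minimal sketch, again on synthetic data (an assumption):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)

rfe = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=3).fit(X, y)

print("support_:", rfe.support_)   # True for the 3 selected features
print("ranking_:", rfe.ranking_)   # 1 = selected, 2 = last one eliminated, ...
```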