## Stepwise Feature Selection for Linear Regression

**Forward Selection**: The procedure starts with an empty set of features (the reduced set). The best of the original features is determined and added to the reduced set. At each subsequent iteration, the best of the remaining original features is added to the set.

**Backward Elimination**: The procedure starts with the full set of features. At each step, it removes the worst feature remaining in the set.

**Ridge Regression**: Adds an L2 regularization term that penalizes the squared magnitude of the regression coefficients.

**LASSO Regression**: Adds an L1 regularization term that penalizes the absolute value of the coefficients.
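
For reference, the two penalties can be written out as objectives. This is the usual textbook form, with $\lambda \ge 0$ as the regularization strength (an assumed symbol; scikit-learn calls it `alpha`, and its `Lasso` additionally scales the squared-error term by $1/(2n)$):

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The only difference is the penalty term: squared coefficients for ridge, absolute values for lasso.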

In [87]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

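For this tutorial, you will use the Pima Indians Diabetes dataset, a popular machine learning dataset that contains real data about the onset of diabetes in the Pima Indian population. For more information on diabetes datasets, please visit [this webpage](https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset); note that it describes scikit-learn's built-in diabetes dataset, which is a different dataset from the Pima Indians CSV loaded below.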

In [16]:
# Load the Pima Indians Diabetes dataset from a hosted CSV file (it has no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)

In [17]:
dataframe.head()

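|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |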


## Forward Feature Selection
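Here we will start with all features, use a chi-squared test to select the most influential features, and then build our model. Strictly speaking, this is univariate (filter-based) selection rather than stepwise forward selection: each feature is scored on its own instead of being added one at a time to a growing model.

For this, we'll use scikit-learn's SelectKBest class to retain only the top k features.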

### Step 1: Separating features and labels
First we'll separate the features from the labels, which tell us whether or not an individual has diabetes.

In [20]:
import numpy as np
X = dataframe.values[:,0:8]
Y = dataframe.values[:,8]

In [21]:
X[:5]

array([[6.000e+00, 1.480e+02, 7.200e+01, 3.500e+01, 0.000e+00, 3.360e+01,
        6.270e-01, 5.000e+01],
       [1.000e+00, 8.500e+01, 6.600e+01, 2.900e+01, 0.000e+00, 2.660e+01,
        3.510e-01, 3.100e+01],
       [8.000e+00, 1.830e+02, 6.400e+01, 0.000e+00, 0.000e+00, 2.330e+01,
        6.720e-01, 3.200e+01],
       [1.000e+00, 8.900e+01, 6.600e+01, 2.300e+01, 9.400e+01, 2.810e+01,
        1.670e-01, 2.100e+01],
       [0.000e+00, 1.370e+02, 4.000e+01, 3.500e+01, 1.680e+02, 4.310e+01,
        2.288e+00, 3.300e+01]])

In [23]:
Y[:5]

array([1., 0., 1., 0., 1.])

You can see from Y above that each label is binary: 1 if an individual has diabetes and 0 otherwise. The remaining columns are the features we can use to help predict whether a person has diabetes.

### Step 2: Using Chi-squared to compare each feature
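Next, we'll use a chi-squared test to measure the relationship between each feature and the label, together with sklearn's SelectKBest to pick the 5 most influential features in our dataset. Note that the chi-squared score requires non-negative feature values, which holds for this dataset.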

In [24]:
# Import the necessary libraries first
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

Now we'll score every feature with the chi-squared test and keep the 5 highest-scoring features; the rest are dropped before fitting a model.

In [25]:
# Feature extraction
test = SelectKBest(score_func=chi2, k=5)
fit = test.fit(X, Y)

Now we can view the score for each of the features: 

In [43]:
# Summarize scores
print(fit.scores_)

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
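
To make the scores less of a black box, here is a minimal NumPy sketch of the statistic. It assumes the usual observed-versus-expected form that scikit-learn's `chi2` is documented to use (per-class feature sums compared with what class frequencies alone would predict). The function name `chi2_scores` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Indicator matrix: one row per sample, one column per class
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                    # per-class sum of each feature
    feature_totals = X.sum(axis=0)        # overall sum of each feature
    class_prob = Y.mean(axis=0)           # fraction of samples in each class
    expected = np.outer(class_prob, feature_totals)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data (hypothetical): feature 0 only occurs in class 0, feature 1 only in class 1
X_toy = np.array([[1, 0], [2, 0], [0, 3], [0, 1]])
y_toy = np.array([0, 0, 1, 1])
toy_scores = chi2_scores(X_toy, y_toy)
print(toy_scores)  # prints [3. 4.]
```

A larger score means the feature's total is distributed across the classes less like the class frequencies would suggest, which is why features measured on larger scales (such as test and plas) can receive very large scores.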


View which columns were/were not selected using fit.get_support(). 

In [54]:
# Slice all 8 feature columns so that 'age' is included
print(list(zip(dataframe.columns[0:8], fit.get_support())))

[('preg', True), ('plas', True), ('pres', False), ('skin', False), ('test', True), ('mass', True), ('pedi', False), ('age', True)]
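
The selection itself is just "keep the k largest scores". A small sketch of that step, reusing the eight scores printed earlier; the variable names here are illustrative and the snippet does not depend on the fitted selector:

```python
import numpy as np

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
scores = np.array([111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304])

k = 5
top_idx = np.argsort(scores)[-k:]          # indices of the k largest scores
mask = np.zeros(len(scores), dtype=bool)
mask[top_idx] = True                       # boolean mask in original column order

selected = [name for name, keep in zip(names, mask) if keep]
print(selected)  # prints ['preg', 'plas', 'test', 'mass', 'age']
```

The mask plays the same role as the boolean array returned by get_support().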


After fitting the selector, we use its transform function to reduce our dataset to the k best features:

In [29]:
features = fit.transform(X)
# Summarize selected features
print(features[0:5,:])

[[  6.  148.    0.   33.6  50. ]
 [  1.   85.    0.   26.6  31. ]
 [  8.  183.    0.   23.3  32. ]
 [  1.   89.   94.   28.1  21. ]
 [  0.  137.  168.   43.1  33. ]]


Now you can fit a regression model on the transformed dataset, which contains only the selected features.
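
SelectKBest scores each feature independently. For comparison, the greedy forward selection described at the top of this notebook could be sketched as follows. This is a minimal NumPy illustration on synthetic data, assuming residual sum of squares from an ordinary least-squares fit as the "best feature" criterion; `rss`, `forward_select`, and the demo arrays are hypothetical names introduced here.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_select(X, y, k):
    """Greedily add the feature that lowers the RSS the most until k are chosen."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        best = min(remaining, key=lambda c: rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only columns 1 and 4 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 4] + rng.normal(scale=0.1, size=200)

chosen = forward_select(X_demo, y_demo, k=2)
print(chosen)
```

Unlike univariate scoring, each step here re-evaluates the candidates in the context of the features already chosen, so a feature that is redundant with an earlier pick is less likely to be added.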

## Backwards Feature Selection
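Here we implement this with Recursive Feature Elimination (RFE). We first fit a model on all features, then repeatedly remove the feature that contributes least (the one with the smallest coefficient magnitude) and refit, until the desired number of features remains.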

In [56]:
# Import your necessary dependencies
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [64]:
# Feature extraction
model = LogisticRegression() # logistic regression model, fit on all features
rfe = RFE(model, n_features_to_select=3) # use RFE to select the top 3 features
fit = rfe.fit(X, Y) # fit our model
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % (fit.ranking_))

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
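
The elimination loop can be sketched by hand to show where the ranking comes from. The version below is an illustration only: it uses an ordinary least-squares fit on standardized synthetic features instead of logistic regression, and `rfe_linear` and the demo arrays are hypothetical names.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Recursive elimination driven by least-squares coefficient magnitudes."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize so coefficients are comparable
    yc = y - y.mean()
    active = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)     # kept features end with rank 1
    while len(active) > n_keep:
        coef, *_ = np.linalg.lstsq(Xs[:, active], yc, rcond=None)
        weakest = active[int(np.argmin(np.abs(coef)))]
        # Features dropped earlier receive a larger (worse) rank
        ranking[weakest] = len(active) - n_keep + 1
        active.remove(weakest)
    return active, ranking

# Synthetic data: columns 0, 1 and 4 carry signal, the rest are noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (1.5 * X_demo[:, 0] + 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 4]
          + rng.normal(scale=0.1, size=200))

kept, ranks = rfe_linear(X_demo, y_demo, n_keep=3)
print(kept)
print(ranks)
```

As in the output above, selected features share rank 1 and the remaining ranks record the order of elimination, with the last feature removed ranked 2.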


## Ridge Regression
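Ridge regression regularizes the model by shrinking coefficients toward zero. Unlike lasso, it does not set coefficients exactly to zero, so it does not remove features outright. We can use the same dataset and sklearn's Ridge model.

We'll fit a ridge regression with alpha=1.0. The alpha parameter controls the regularization strength: larger values shrink the coefficients more.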

In [98]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X,Y)
y_pred = ridge.predict(X) # you can pull these out for graphing

Now we can examine how the ridge regression changed each of our coefficients: 

In [85]:
# Note: the [:7] slice omits the eighth feature, 'age'; use [:8] to list all eight coefficients
print(list(zip(dataframe.columns[:7], ridge.coef_)))

[('preg', 0.020577473865630566), ('plas', 0.005921805452308099), ('pres', -0.0023327865789012615), ('skin', 0.00015926990803786386), ('test', -0.00018004345777043377), ('mass', 0.013248600981050254), ('pedi', 0.14539385236103794)]


Here you can see the weight of each coefficient and, from its sign, the direction of the relationship between each independent variable and the dependent variable.

To make this even more useful, you would want to compare it with your original non-regularized regression. 
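
One way to sketch that comparison without any fitted scikit-learn model is the closed-form ridge solution, where alpha=0 reduces to ordinary least squares. The example below runs on synthetic data and assumes centered inputs so the intercept is left unpenalized; `ridge_coefficients` and the demo arrays are illustrative names.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=50)

norms = {}
for alpha in [0.0, 1.0, 10.0, 100.0]:      # alpha = 0.0 is the non-regularized fit
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    norms[alpha] = float(np.linalg.norm(coef))
    print(alpha, np.round(coef, 3), round(norms[alpha], 3))
```

The overall size of the coefficient vector shrinks as alpha grows, but individual coefficients are pulled toward zero rather than set to exactly zero.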

## Lasso Regression
And finally, we'll try lasso regression, which sets the coefficients of irrelevant features to exactly 0.


In [97]:
from sklearn.linear_model import Lasso

lasso = Lasso()
lasso.fit(X,Y)


Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [99]:
# Note: the [:7] slice omits the eighth feature, 'age'; use [:8] to list all eight coefficients
print(list(zip(dataframe.columns[:7], lasso.coef_)))

[('preg', 0.0), ('plas', 0.005980714446303619), ('pres', 0.0), ('skin', 0.0), ('test', -0.0), ('mass', 0.0), ('pedi', 0.0)]


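Here you can see the effect of L1 regularization: the coefficients of less relevant features are set to 0. With the default alpha=1.0 and unscaled features, only plas keeps a nonzero coefficient among those shown.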
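
The exact zeros come from the soft-thresholding step inside the optimizer. Below is a minimal coordinate-descent sketch on synthetic data, assuming the objective (1/(2n))·||y − Xw||² + alpha·||w||₁ that scikit-learn documents for Lasso. It is an illustration rather than the library's implementation, and `soft_threshold`, `lasso_cd`, and the demo arrays are hypothetical names.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become exactly 0."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for lasso on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = Xc.shape
    w = np.zeros(p)
    z = (Xc ** 2).sum(axis=0) / n                 # per-feature scaling term
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            partial_resid = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, alpha) / z[j]
    return w

# Synthetic data: only columns 0 and 1 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

w_moderate = lasso_cd(X_demo, y_demo, alpha=0.3)
w_strong = lasso_cd(X_demo, y_demo, alpha=5.0)
print(np.round(w_moderate, 3))
print(np.round(w_strong, 3))
```

With a moderate alpha the two noise features are zeroed while the informative ones survive in shrunken form; with a large enough alpha every coefficient is zero. On unscaled data such as this dataset, standardizing the features first or lowering alpha would be worth trying before reading too much into which coefficients vanish.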