# Feature Selection

+ VarianceThreshold: X only
+ chi2: dependency between X and y.
+ SelectKBest, SelectPercentile
+ SelectFromModel
    * Linear: SVC
    * Nonlinear: RandomForest


So far, we have fit SVC and tree models to the data.  After fitting, we have some idea of which features are important.

SVC may decide that one feature is most important, while RF may decide that another is.

Here's a different scenario.

Before we do any modeling, we might be interested in selecting only the important features to model.

### Using models to select features

This seems like putting the cart before the horse.  But it has some value.  For example, we can use model A to select the important features, then use model B to model the relationship between those features and y.
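
The "model A selects, model B predicts" idea can be sketched as a single scikit-learn pipeline. This is only an illustration under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins generated with `make_classification`, not the heart data, and the split count is arbitrary. One caveat worth keeping in mind: if features are chosen using all rows and the reduced data is then cross-validated, the test splits have already influenced the selection. Wrapping both steps in a pipeline avoids that.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the heart data (assumption): 13 features, 5 informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0)

# Model A (LinearSVC) selects features; model B (random forest) predicts.
pipe = make_pipeline(
    SelectFromModel(LinearSVC(dual=False), max_features=7),
    RandomForestClassifier(max_depth=13, random_state=0),
)

# The selector is refit on each training split, so the test split never
# influences which features are kept.
result = cross_validate(
    pipe, X_demo, y_demo,
    cv=ShuffleSplit(n_splits=10, random_state=2021),
    scoring=['precision', 'recall'])
print(result['test_precision'].mean().round(2),
      result['test_recall'].mean().round(2))
```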

In [23]:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_validate
import pandas

heart = pandas.read_csv('../Datasets/heart.csv')
X, y = heart.drop(columns='target'), heart['target']

rf = RandomForestClassifier(max_depth=13)
svm = LinearSVC(dual=False)

In [24]:
heart.sample()

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
235   51    1   0       140   299    0        1      173      1      1.6      2   0     3       0


In [25]:
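def evaluate(model, X, y, scoring=('accuracy',), random_state=2021):
    # Score the model on 50 random train/test splits (default test size: 10%).
    result = cross_validate(
        model,
        X,
        y,
        cv=ShuffleSplit(n_splits=50, random_state=random_state),
        scoring=scoring)
    print(model)
    # Report the mean test score for each requested metric.
    for s in scoring:
        print('\t', s, result['test_' + s].mean().round(2))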

In [26]:
evaluate(rf, X, y, scoring=['precision','recall'])

RandomForestClassifier(max_depth=13)
	 precision 0.85
	 recall 0.85


In [27]:
evaluate(svm, X, y, scoring=['precision','recall'])

LinearSVC(dual=False)
	 precision 0.83
	 recall 0.9


#### Use a model to select the best features

##### Random forest

In [28]:
rf.fit(X,y)

RandomForestClassifier(max_depth=13)

In [29]:
sorted(list(zip(rf.feature_importances_, X.columns)))

[(0.00916059488261722, 'fbs'),
 (0.018786321110094487, 'restecg'),
 (0.036717859331490695, 'sex'),
 (0.04572415122476997, 'slope'),
 (0.06215571863827391, 'exang'),
 (0.06793606794082686, 'trestbps'),
 (0.07738950730925795, 'chol'),
 (0.08239387086479656, 'age'),
 (0.10729933839957514, 'ca'),
 (0.11259657808474526, 'thalach'),
 (0.12299441195696631, 'thal'),
 (0.12371563119211523, 'oldpeak'),
 (0.13312994906447023, 'cp')]

In [30]:
from sklearn.feature_selection import SelectKBest, SelectFromModel, RFE

In [31]:
selector = SelectFromModel(rf, max_features=7)
selector.fit(X,y)

SelectFromModel(estimator=RandomForestClassifier(max_depth=13), max_features=7)

In [32]:
X.columns[selector.get_support()]

Index(['age', 'cp', 'chol', 'thalach', 'oldpeak', 'ca', 'thal'], dtype='object')

##### SVM

In [33]:
svm.fit(X,y)

LinearSVC(dual=False)

In [34]:
import numpy

sorted(list(zip(numpy.abs(svm.coef_[0]), X.columns)))

[(0.0010934104117581937, 'age'),
 (0.0012397707218213215, 'chol'),
 (0.005929552053704565, 'trestbps'),
 (0.008478519872121868, 'thalach'),
 (0.024982318480280725, 'fbs'),
 (0.1385515300590859, 'restecg'),
 (0.17456640443215402, 'slope'),
 (0.18410588936592262, 'oldpeak'),
 (0.26368840093704143, 'ca'),
 (0.2890859571879028, 'cp'),
 (0.29737642911403755, 'thal'),
 (0.320594904609055, 'exang'),
 (0.5187292826436363, 'sex')]
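
A linear model's coefficient size depends on the units of each feature, so a small `|coef_|` for a feature measured in large units does not necessarily mean the feature is unimportant. The sketch below illustrates this on synthetic data (an assumption, not the heart data): one column is multiplied by 1000, and the coefficients are compared with and without `StandardScaler`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data (assumption); shuffle=False keeps the informative
# columns first, so column 0 carries signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0)

# Blow up the scale of column 0, as if it were recorded in smaller units.
X_scaled_up = X_demo.copy()
X_scaled_up[:, 0] *= 1000

# Coefficients fit on raw units vs. on standardized features.
raw_coef = np.abs(LinearSVC(dual=False).fit(X_scaled_up, y_demo).coef_[0])
std_coef = np.abs(
    LinearSVC(dual=False)
    .fit(StandardScaler().fit_transform(X_scaled_up), y_demo)
    .coef_[0])

print('raw units:   ', raw_coef.round(4))
print('standardized:', std_coef.round(4))
```

If the ranking is meant to compare features against each other, standardizing first is one reasonable option; whether it changes the selection on the heart data is not shown here.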

In [35]:
selector = SelectFromModel(svm, max_features=7)
selector.fit(X,y)
features = X.columns[selector.get_support()]
print(selector.get_support())
print(features)

[False  True  True False False False False False  True  True  True  True
  True]
Index(['sex', 'cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'], dtype='object')
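
Note that `max_features` in `SelectFromModel` is an upper bound, not an exact count: the selector also applies its `threshold`, which for these estimators defaults to the mean importance. According to the scikit-learn documentation, passing `threshold=-np.inf` selects purely by rank. A minimal sketch on synthetic data (an assumption, not the heart data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): only a few of the 13 features carry signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=3, random_state=0)
forest = RandomForestClassifier(max_depth=13, random_state=0)

# Default threshold ("mean" for this estimator): at most 7 features are kept.
capped = SelectFromModel(forest, max_features=7).fit(X_demo, y_demo)

# threshold=-np.inf disables the cut-off, so exactly the top 7 are kept.
top7 = SelectFromModel(
    forest, max_features=7, threshold=-np.inf).fit(X_demo, y_demo)

print(capped.get_support().sum(), top7.get_support().sum())
```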


In [36]:
X2=X[features]

In [37]:
rf.fit(X2,y)

RandomForestClassifier(max_depth=13)

In [38]:
evaluate(rf, X2, y, scoring=['precision', 'recall'])

RandomForestClassifier(max_depth=13)
	 precision 0.82
	 recall 0.83


#### Practice

In [39]:
from sklearn.svm import LinearSVC

svm = LinearSVC(dual = False)
evaluate(svm, X, y, ['precision', 'recall'])

LinearSVC(dual=False)
	 precision 0.83
	 recall 0.9


In [47]:
# Observe and Select the K best features
import numpy as np

svm.fit(X, y)
sorted(list(zip(np.abs(svm.coef_[0]), X.columns)))

[(0.0010934104117581937, 'age'),
 (0.0012397707218213215, 'chol'),
 (0.005929552053704565, 'trestbps'),
 (0.008478519872121868, 'thalach'),
 (0.024982318480280725, 'fbs'),
 (0.1385515300590859, 'restecg'),
 (0.17456640443215402, 'slope'),
 (0.18410588936592262, 'oldpeak'),
 (0.26368840093704143, 'ca'),
 (0.2890859571879028, 'cp'),
 (0.29737642911403755, 'thal'),
 (0.320594904609055, 'exang'),
 (0.5187292826436363, 'sex')]

In [48]:
# Use sklearn to select the K best features from the model
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(svm, max_features = 7)
selector.fit(X, y)
X.columns[selector.get_support()]

Index(['sex', 'cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'], dtype='object')

In [49]:
desired_features = X.columns[selector.get_support()]
X2 = X[desired_features]
evaluate(svm, X2, y, ['precision', 'recall'])

LinearSVC(dual=False)
	 precision 0.83
	 recall 0.88


### Using variance to select features

In [43]:
X.var()

age           82.484558
sex            0.217166
cp             1.065132
trestbps     307.586453
chol        2686.426748
fbs            0.126877
restecg        0.276528
thalach      524.646406
exang          0.220707
oldpeak        1.348095
slope          0.379735
ca             1.045724
thal           0.374883
dtype: float64

In [2]:
diabetes = pandas.read_csv('../Datasets/diabetes.csv')
diabetes.var()

Pregnancies                    11.354056
Glucose                      1022.248314
BloodPressure                 374.647271
SkinThickness                 254.473245
Insulin                     13281.180078
BMI                            62.159984
DiabetesPedigreeFunction        0.109779
Age                           138.303046
Outcome                         0.227483
dtype: float64

In [3]:
iris = pandas.read_csv('../Datasets/iris50.csv')
iris.var()

SepalLength    0.369016
SepalWidth     0.101065
PetalLength    0.481914
PetalWidth     0.182596
Species        0.253469
dtype: float64

In [4]:
from sklearn.feature_selection import VarianceThreshold

In [5]:
selector = VarianceThreshold(threshold=0.5)
X2 = selector.fit_transform(X)

In [53]:
X2[0:3]

array([[ 63. ,   3. , 145. , 233. , 150. ,   2.3,   0. ],
       [ 37. ,   2. , 130. , 250. , 187. ,   3.5,   0. ],
       [ 41. ,   1. , 130. , 204. , 172. ,   1.4,   0. ]])

In [52]:
X[0:3]

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2


In [54]:
X.var()

age           82.484558
sex            0.217166
cp             1.065132
trestbps     307.586453
chol        2686.426748
fbs            0.126877
restecg        0.276528
thalach      524.646406
exang          0.220707
oldpeak        1.348095
slope          0.379735
ca             1.045724
thal           0.374883
dtype: float64
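
Two properties of variance-based selection are worth keeping in mind. Variance depends on the units of a column, and a 0/1 column has variance p(1-p), which can never exceed 0.25, so a threshold of 0.5 removes every binary feature regardless of its relevance. The sketch below uses made-up columns (`height_m`, `height_cm`, `flag` are hypothetical) to show both effects. It also shows that `VarianceThreshold` stores the population variance (`ddof=0`) in `variances_`, whereas pandas `var()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=200)

# Hypothetical columns: the same quantity in two units, plus a binary flag.
demo = pd.DataFrame({
    'height_m': height_m,
    'height_cm': height_m * 100,
    'flag': rng.integers(0, 2, size=200),
})

selector = VarianceThreshold(threshold=0.5).fit(demo)

# Per-column variances as computed by the selector (ddof=0).
print(pd.Series(selector.variances_, index=demo.columns))

# Only the large-unit column survives, although height_m holds the same
# information and the flag might still be predictive.
print(demo.columns[selector.get_support()])
```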

#### Practice

In [50]:
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold = .5)
selector.fit(X)
X.columns[selector.get_support()]

Index(['age', 'cp', 'trestbps', 'chol', 'thalach', 'oldpeak', 'ca'], dtype='object')

In [21]:
X.var()

age           82.484558
sex            0.217166
cp             1.065132
trestbps     307.586453
chol        2686.426748
fbs            0.126877
restecg        0.276528
thalach      524.646406
exang          0.220707
oldpeak        1.348095
slope          0.379735
ca             1.045724
thal           0.374883
dtype: float64

### Selecting features using dependency

In [6]:
from sklearn.feature_selection import SelectKBest, chi2

In [7]:
selector = SelectKBest(chi2, k=7)

In [8]:
X3=selector.fit_transform(X,y)

In [9]:
X3[0:3]

array([[ 63. ,   3. , 233. , 150. ,   0. ,   2.3,   0. ],
       [ 37. ,   2. , 250. , 187. ,   0. ,   3.5,   0. ],
       [ 41. ,   1. , 204. , 172. ,   0. ,   1.4,   0. ]])

In [10]:
X[0:3]

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2


#### Practice

In [19]:
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k = 7)
selector.fit(X,y)
X.columns[selector.get_support()]

Index(['age', 'cp', 'chol', 'thalach', 'exang', 'oldpeak', 'ca'], dtype='object')
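
After fitting, `SelectKBest` exposes the per-feature test statistics in `scores_` and `pvalues_`, which can help explain why a feature was kept. The sketch below uses made-up count and flag columns (all names are hypothetical) rather than the heart data. Note that `chi2` requires non-negative feature values and is intended for counts, frequencies, or booleans.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)

# Hypothetical non-negative features: one count depends on y, two do not.
demo = pd.DataFrame({
    'related_count': rng.poisson(2 + 3 * y_demo),
    'unrelated_count': rng.poisson(3, size=400),
    'unrelated_flag': rng.integers(0, 2, size=400),
})

selector = SelectKBest(chi2, k=1).fit(demo, y_demo)

# Larger chi2 (smaller p-value) means stronger dependence on y.
summary = pd.DataFrame(
    {'chi2': selector.scores_, 'p_value': selector.pvalues_},
    index=demo.columns).sort_values('chi2', ascending=False)
print(summary)
print(demo.columns[selector.get_support()])
```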

### Recursive Feature Elimination (RFE)

In [11]:
from sklearn.feature_selection import RFE

In [12]:
rf, svm

(RandomForestClassifier(max_depth=13), LinearSVC(dual=False))

In [13]:
selector = RFE(svm, n_features_to_select=7)

In [14]:
selector.fit(X,y)

RFE(estimator=LinearSVC(dual=False), n_features_to_select=7)

In [15]:
selector.get_support()

array([False,  True,  True, False, False, False, False, False,  True,
        True,  True,  True,  True])

In [16]:
X.columns[selector.get_support()]

Index(['sex', 'cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'], dtype='object')
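
RFE refits the estimator repeatedly, dropping the weakest feature(s) each round until `n_features_to_select` remain. Besides `get_support()`, the fitted selector exposes `ranking_`, where 1 marks a kept feature and larger numbers mark features eliminated earlier. A minimal sketch on synthetic data (an assumption, not the heart data), with `step=1` written out explicitly:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in (assumption): 13 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0)

# Drop one feature per round (step=1) until 7 remain.
selector = RFE(LinearSVC(dual=False), n_features_to_select=7, step=1)
selector.fit(X_demo, y_demo)

print(selector.ranking_)   # 1 = selected; higher = eliminated earlier
print(selector.support_)   # boolean mask, same as get_support()
```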

#### Practice

In [18]:
# Heart Dataset
from sklearn.feature_selection import RFE

selector = RFE(svm, n_features_to_select = 7)
selector.fit(X, y)
X.columns[selector.get_support()]

Index(['sex', 'cp', 'exang', 'oldpeak', 'slope', 'ca', 'thal'], dtype='object')