# Feature Selection

    Feature selection is an important step in the machine learning workflow. It involves selecting the features most relevant to predicting the target variable.
    
    
    The following feature selection operations are performed on the iris data set.

In [12]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
# load the iris dataset
dataset = datasets.load_iris()

# Recursive Feature Elimination


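    This is a feature elimination technique in which a model is built, the least important attributes are removed, and the model is rebuilt on the remaining ones. This is repeated until the requested number of features remains. RFE ranks attributes by the model's coefficients or feature importances rather than by accuracy directly.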

In [13]:
from sklearn.feature_selection import RFE
model = LogisticRegression()
# create the RFE model and select 3 attributes
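# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select=3)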
rfe = rfe.fit(dataset.data, dataset.target)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

[False  True  True  True]
[2 1 1 1]
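
The elimination loop behind RFE can be sketched by hand. This is only an illustration of the idea, not scikit-learn's implementation: the squared-coefficient score, the `max_iter=1000` setting, and the names `remaining` and `n_to_keep` are assumptions made for this sketch, and the feature it drops may differ from the output above depending on the scikit-learn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

remaining = list(range(X.shape[1]))  # column indices still in play
n_to_keep = 3                        # same target as the RFE cell

while len(remaining) > n_to_keep:
    # Refit on the surviving columns only.
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # One score per feature: squared coefficients summed over the classes.
    scores = (model.coef_ ** 2).sum(axis=0)
    # Drop the feature with the smallest score.
    weakest = int(np.argmin(scores))
    dropped = remaining.pop(weakest)
    print("dropped:", dataset.feature_names[dropped])

print("kept:", [dataset.feature_names[i] for i in remaining])
```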


In [22]:
dataset.feature_names[0]

'sepal length (cm)'

    Here we keep 3 of the 4 features. RFE ranks the last three features above the first one (sepal length), which is eliminated. Let's now build one model with these three features and one with all features.
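
To read the boolean mask more easily, it can be paired with the feature names. A minimal sketch, assuming the same RFE setup as above; `max_iter=1000` is added only to avoid convergence warnings, and `selected` and `dropped` are names introduced here for illustration. The exact selection may vary with the scikit-learn version and solver settings.

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

dataset = datasets.load_iris()

# max_iter is an assumption, not a setting from the original cells.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(dataset.data, dataset.target)

# support_ is a boolean mask aligned with feature_names.
selected = [name for name, keep in zip(dataset.feature_names, rfe.support_) if keep]
dropped = [name for name, keep in zip(dataset.feature_names, rfe.support_) if not keep]

print("selected:", selected)
print("dropped: ", dropped)
```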

In [48]:
#Logistic regression with rfe data 
# Choosing three features from the dataset
rfe_data = dataset.data[:,rfe.support_]

lr2=LogisticRegression()
X_train,X_test,y_train,y_test=train_test_split(rfe_data,dataset.target,random_state=0)
lr2.fit(X_train,y_train)
y_pred=lr2.predict(X_test)
accuracy_score(y_test,y_pred)

0.9210526315789473

In [49]:
#Logistic Regression with whole data
lr1=LogisticRegression()
X_train,X_test,y_train,y_test=train_test_split(dataset.data,dataset.target,random_state=0)
lr1.fit(X_train,y_train)
y_pred=lr1.predict(X_test)
accuracy_score(y_test,y_pred)

0.868421052631579

    Here we see that accuracy improves with feature selection. However, this need not always be true: the result comes from a single train/test split of a small data set. Our aim was to reduce the dimension, and that was achieved.
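
Because a single split can be misleading, the comparison can be repeated with cross-validation. The sketch below is one way to do that, not part of the original analysis: `cv=5`, `max_iter=1000`, and the names `full_model` and `rfe_model` are assumptions. RFE is placed inside a pipeline so that the selection is refit on each training fold.

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# Baseline: logistic regression on all four features.
full_model = LogisticRegression(max_iter=1000)

# RFE (keep 3 features) followed by logistic regression on the kept features.
rfe_model = make_pipeline(
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=3),
    LogisticRegression(max_iter=1000),
)

full_scores = cross_val_score(full_model, X, y, cv=5)
rfe_scores = cross_val_score(rfe_model, X, y, cv=5)

print("all features:  mean accuracy = %.3f" % full_scores.mean())
print("RFE, 3 features: mean accuracy = %.3f" % rfe_scores.mean())
```

If the two means are close, the reduced model is doing about as well with fewer inputs, which is the practical goal of the selection step.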

# Embedded Methods

    We can use regularization such as L1 and L2, and tree-based models such as random forest and extra trees, to find the feature importance.
    
    In this section we will first perform feature selection with L1 (lasso-style) regularization, and then with a random forest.

## L1 regularization

In [112]:
# L1-penalized (lasso-style) linear support vector classifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
lsvc = LinearSVC(C=0.01,penalty='l1',dual=False)
lsvc.fit(X_train,y_train)
feat = SelectFromModel(lsvc,prefit=True)

In [107]:
feat.get_support()

array([ True,  True,  True, False])

    Here the first three features are kept and the fourth is dropped, so the fourth feature (petal width) is the least important according to L1 regularization with a linear SVC. A natural next step is to check the accuracy with only these features.
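
One possible way to check that accuracy is sketched below. It repeats the split and the L1 selection so that it runs on its own; the downstream `LogisticRegression(max_iter=1000)` and the names `mask` and `lr3` are assumptions made for this example, and the resulting number depends on the scikit-learn version.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=0
)

# Same L1-penalized selector as in the cell above.
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)
lsvc.fit(X_train, y_train)
feat = SelectFromModel(lsvc, prefit=True)

# Boolean mask of the features kept by the L1 penalty.
mask = feat.get_support()
print("kept:", [n for n, keep in zip(dataset.feature_names, mask) if keep])

# Train and evaluate a classifier on the kept columns only.
lr3 = LogisticRegression(max_iter=1000)
lr3.fit(X_train[:, mask], y_train)
acc = accuracy_score(y_test, lr3.predict(X_test[:, mask]))
print("accuracy with L1-selected features:", acc)
```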

## Random Forest

In [117]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
feat = SelectFromModel(rf,prefit=True)

In [118]:
feat.get_support()

array([False, False,  True,  True])

    Here the last two features are chosen. We can also obtain the level of importance of each feature.

In [119]:
rf.feature_importances_

array([0.18577196, 0.03136508, 0.46515608, 0.31770689])

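    Here we see that the last two features (petal length and petal width) have the highest importance. Both are above the mean importance of 0.25, which is the default threshold SelectFromModel uses for a model like this one.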
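
The importances are easier to read when paired with feature names and sorted. The sketch below also rebuilds the selection mask by hand from the mean importance. `random_state=0` is an assumption added for reproducibility, so the values will not match the output above exactly, and `order`, `threshold` and `manual_mask` are names introduced for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, random_state=0
)

# random_state is an assumption; the original cell leaves it unset.
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Print features from most to least important.
order = np.argsort(rf.feature_importances_)[::-1]
for idx in order:
    print(f"{dataset.feature_names[idx]:<20} {rf.feature_importances_[idx]:.3f}")

# Keep features whose importance is at or above the mean importance.
threshold = rf.feature_importances_.mean()
manual_mask = rf.feature_importances_ >= threshold

# Compare with the mask SelectFromModel produces with its default threshold.
selector_mask = SelectFromModel(rf, prefit=True).get_support()
print("manual mask:  ", manual_mask)
print("selector mask:", selector_mask)
```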

# Finally

    Now we see that feature selection depends strongly on the model: RFE with logistic regression, the L1-regularized SVC, and the random forest each chose a different subset. Features therefore have to be selected with the chosen model in mind. Domain knowledge is also very important when it comes to feature selection.
    
    
    Feature selection can remove irrelevant features, which reduces the effort of collecting data. It also reduces the complexity of the problem and the time and computation needed.

# References

   1. https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
   2. https://scikit-learn.org/stable/modules/feature_selection.html