# **Data Feature Selection**

**Importance of Data Feature Selection**<br>
The performance of a machine learning model depends heavily on the data features used to train it. Irrelevant features will degrade the model's performance, whereas relevant features can increase its accuracy, especially for linear and logistic regression.

What, then, is automatic feature selection? It is the process of selecting those features in our data that are most relevant to the output or prediction variable we are interested in. It is also called attribute selection.

The following are some of the benefits of performing automatic feature selection before modeling the data:

1. Performing feature selection before data modeling can reduce overfitting.
2. Performing feature selection before data modeling can increase the accuracy of the ML model.
3. Performing feature selection before data modeling can reduce the training time.

**Feature Selection Techniques**<br>
The following are automatic feature selection techniques that we can use to model ML data in Python:

**Univariate Selection**<br>
This feature selection technique uses statistical tests to select the features that have the strongest relationship with the prediction variable. We can implement univariate feature selection with the help of the SelectKBest class of the scikit-learn Python library.
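
To make the statistical test concrete, the following is a minimal sketch of how a chi-square score can be computed for each feature by comparing the per-class feature totals with the totals expected if the feature were independent of the class. The helper name `chi2_scores` and the toy arrays are made up for illustration; the result is compared with scikit-learn's `chi2` function.

```python
import numpy as np
from sklearn.feature_selection import chi2


def chi2_scores(X, y):
    """Chi-square score of each non-negative feature against the class labels y."""
    X = np.asarray(X, dtype=float)
    classes = np.unique(y)
    # One-hot encode the labels: one column per class
    Y = (np.asarray(y)[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                     # per-class total of each feature
    feature_total = X.sum(axis=0)          # overall total of each feature
    class_prob = Y.mean(axis=0)            # share of samples in each class
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)


# Toy data (hypothetical): feature 0 follows the class, feature 1 does not
X_toy = np.array([[9, 2], [8, 3], [7, 2], [1, 3], [2, 2], [1, 3]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

manual = chi2_scores(X_toy, y_toy)
reference, _ = chi2(X_toy, y_toy)          # scikit-learn returns (scores, p-values)
print("Manual scores:      ", manual)
print("scikit-learn scores:", reference)
```

The feature that follows the class receives a much higher score than the one that does not, which is the basis on which SelectKBest keeps the top k features.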

**Example**<br>
In this example, we will use the Pima Indians Diabetes dataset and select the 4 best attributes with the help of the chi-square statistical test.

In [5]:
# Code to read csv file into colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [6]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [8]:
downloaded = drive.CreateFile({'id':'1-lM7lDTtTWgnMlbH5Tv4M2qn7FRTsGSy'}) # replace the id with id of file you want to access
downloaded.GetContentFile('pima-indians-diabetes.csv')

In [9]:
#Driver Code
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('pima-indians-diabetes.csv')
array = dataframe.values

#Next, we will separate array into input and output components −
X = array[:,0:8]
Y = array[:,8]

#The following lines of code will select the best features from dataset −
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X,Y)

#We can also summarize the data for output as per our choice. 
#Here, we are setting the precision to 2 and showing the 4 data 
#attributes with best features along with best score of each attribute −
set_printoptions(precision=2)
print(fit.scores_)
featured_data = fit.transform(X)
print("\nFeatured data:\n", featured_data[0:4])

[ 111.52 1411.89   17.61   53.11 2175.57  127.67    5.39  181.3 ]

Featured data:
 [[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]]
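
The four highest scores above belong to the 2nd, 5th, 6th and 8th attributes, that is plas, test, mass and age. Rather than reading this off the scores by hand, the selector's `get_support()` mask can be used to recover the names. The following is a minimal sketch on a synthetic stand-in for the dataset (the random values are hypothetical, not the real Pima data), in which the same four columns are made to depend on the label.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.integers(0, 50, size=(200, 8)).astype(float)
# Make four of the columns depend strongly on the label
for col in (1, 4, 5, 7):
    X_demo[:, col] += 40 * y_demo

selector = SelectKBest(score_func=chi2, k=4).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
mask = selector.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]
print("Selected attributes:", selected)
```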


**Recursive Feature Elimination**<br>
As the name suggests, RFE (Recursive feature elimination) feature selection technique removes the attributes recursively and builds the model with remaining attributes. We can implement RFE feature selection technique with the help of RFE class of scikit-learn Python library.

**Example**<br>
In this example, we will use RFE with the logistic regression algorithm to select the 3 best attributes from the Pima Indians Diabetes dataset.

In [13]:
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('pima-indians-diabetes.csv')
array = dataframe.values

#Next, we will separate the array into its input and output components −
X = array[:,0:8]
Y = array[:,8]

#The following lines of code will select the best features from a dataset −
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)

print("Number of Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Number of Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 5 6 1 1 3]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


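In the above output, we can see that RFE chose preg, mass and pedi as the 3 best features. They are marked as True in the selected-features mask and as 1 in the ranking. The warning printed after the results indicates that the logistic regression solver reached its iteration limit before converging; as the message says, increasing max_iter or scaling the data addresses this.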
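
The warning in the output suggests raising `max_iter` or scaling the data. The following is a minimal sketch of both remedies, assuming a synthetic stand-in dataset from `make_classification` with artificially mismatched column scales (the data and scale factors are hypothetical). Because RFE ranks features by the size of the model coefficients, putting the features on a common scale may also change which attributes are selected.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Synthetic stand-in data (hypothetical, not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
# Put the columns on very different scales, as in real tabular data
X_demo = X_demo * np.array([1, 100, 50, 20, 200, 10, 0.5, 30])

# Remedy 1: standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)

# Remedy 2: give the solver a larger iteration budget
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_scaled, y_demo)

selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
ranking = {name: int(rank) for name, rank in zip(feature_names, rfe.ranking_)}
print("Selected Features:", selected)
print("Feature Ranking:", ranking)
```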

**Principal Component Analysis (PCA)**<br>
PCA, generally called a data reduction technique, is a very useful technique because it uses linear algebra to transform the dataset into a compressed form. Strictly speaking, it builds new features as linear combinations of the original ones rather than selecting a subset of them. We can implement PCA with the help of the PCA class of the scikit-learn Python library, and we can choose the number of principal components in the output.

**Example**<br>
In this example, we will use PCA to select the best 3 principal components from the Pima Indians Diabetes dataset.

In [15]:
from pandas import read_csv
from sklearn.decomposition import PCA
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('pima-indians-diabetes.csv')
array = dataframe.values

#Next, we will separate array into input and output components −
X = array[:,0:8]
Y = array[:,8]

#The following lines of code will extract features from dataset −
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: {}".format(fit.explained_variance_ratio_))
print(fit.components_)

Explained Variance: [0.89 0.06 0.03]
[[-2.02e-03  9.78e-02  1.61e-02  6.08e-02  9.93e-01  1.40e-02  5.37e-04
  -3.56e-03]
 [-2.26e-02 -9.72e-01 -1.42e-01  5.79e-02  9.46e-02 -4.70e-02 -8.17e-04
  -1.40e-01]
 [-2.25e-02  1.43e-01 -9.22e-01 -3.07e-01  2.10e-02 -1.32e-01 -6.40e-04
  -1.25e-01]]


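We can observe from the above output that the 3 principal components bear little resemblance to the source data: each component is a weighted combination of all 8 original attributes, and the first component alone explains about 89% of the variance.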
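
In the output above, the first component has a weight of about 0.99 on the fifth attribute, which suggests that it is dominated by the attribute with the largest spread. The following is a minimal sketch, on synthetic stand-in data with hypothetical column scales, of how standardizing the columns changes the explained variance, and of how `transform` produces the reduced dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 independent columns with very different spreads
# (the scale values are hypothetical, chosen only for illustration)
rng = np.random.default_rng(0)
scales = np.array([3, 30, 20, 15, 115, 8, 0.3, 12])
X_demo = rng.normal(size=(200, 8)) * scales

# PCA on the raw values: the widest column dominates the first component
raw_pca = PCA(n_components=3).fit(X_demo)

# PCA on standardized values: every column contributes on equal footing
scaled_pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X_demo))

print("Raw explained variance:         ", raw_pca.explained_variance_ratio_)
print("Standardized explained variance:", scaled_pca.explained_variance_ratio_)

# transform() projects the data onto the components, giving the compressed form
X_reduced = raw_pca.transform(X_demo)
print(X_reduced.shape)  # (200, 3)
```

Whether to standardize depends on whether the differences in scale between the attributes are meaningful for the problem at hand.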

**Feature Importance**<br>
As the name suggests, the feature importance technique is used to identify the important features. It uses a trained supervised classifier to score the features. We can implement this feature selection technique with the help of the ExtraTreesClassifier class of the scikit-learn Python library.

**Example**<br>
In this example, we will use ExtraTreesClassifier to select features from the Pima Indians Diabetes dataset.

In [16]:
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('pima-indians-diabetes.csv')
array = dataframe.values

#Next, we will separate array into input and output components −
X = array[:,0:8]
Y = array[:,8]

#The following lines of code will extract features from dataset −
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[0.11 0.23 0.1  0.08 0.08 0.14 0.12 0.14]


From the output, we can observe that there is a score for each attribute. The higher the score, the higher the importance of that attribute.
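
The scores are easier to read when paired with the attribute names and sorted. The following is a minimal sketch on a synthetic stand-in dataset from `make_classification` (hypothetical data, not the real dataset). Because the trees are built with randomness, the scores vary from run to run unless `random_state` is fixed, as is done here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Synthetic stand-in data (hypothetical, not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Fix random_state so that the importances are reproducible
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

importances = model.feature_importances_
# Indices of the attributes, from most to least important
order = np.argsort(importances)[::-1]
for idx in order:
    print(f"{feature_names[idx]:>5}: {importances[idx]:.3f}")
```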

