# Feature Selection

There are two broad families of algorithms for feature selection:
1. Filter-based
2. Wrapper-based

## A. Filter-based
In this approach, we use a statistical measure to assign each feature a score. Generally, this is a univariate approach, i.e., each feature is scored independently of the others, and we keep or drop it based on that score.

Examples of popular statistical measures:
1. Chi-Squared
2. Information Gain
3. Pearson Coefficient
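
The score-then-select idea can be sketched with scikit-learn's `SelectKBest`. This is only an illustration: the data is synthetic (not the pulsar dataset), and the ANOVA F-test (`f_classif`) stands in as the scoring function because it is deterministic; any of the measures listed above could be plugged in instead.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 4 features, only column 0
# depends on the binary target.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 4))
X[:, 0] += 2 * y

# Score each feature independently against the target and keep the best one.
selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X, y)
selected_columns = selector.get_support(indices=True)

print("Scores per feature:", np.round(selector.scores_, 2))
print("Selected columns:", selected_columns)
```

Because each feature is scored on its own, a filter like this is cheap, but it cannot notice features that are only useful in combination.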

### 1. Chi-Squared
Used for categorical features. scikit-learn's `chi2` also requires all feature values to be non-negative.

In [11]:
import pandas as pd
from sklearn.feature_selection import chi2

X = pd.read_csv("pulsar_features.csv")
y = pd.read_csv("pulsar_target.csv")

print("Actual Values\n",X.iloc[0,:])
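# Shift the values by 4 so they are non-negative (chi2 rejects negative values),
# then truncate to integers so the features can be treated as categorical
X = X + 4
X = X.astype(int)
# X.to_csv("discretized_pulsar_features.csv", index=False)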
print("Converted Values\n",X.iloc[0,:])

chi2_scores, p_values = chi2(X,y)
print("Chi-2 Scores",chi2_scores)
print("p values",p_values)

Actual Values
  Mean of the integrated profile                  140.562500
 Standard deviation of the integrated profile     55.683782
 Excess kurtosis of the integrated profile        -0.234571
 Skewness of the integrated profile               -0.699648
 Mean of the DM-SNR curve                          3.199833
 Standard deviation of the DM-SNR curve           19.110426
 Excess kurtosis of the DM-SNR curve               7.975532
 Skewness of the DM-SNR curve                     74.242225
Name: 0, dtype: float64
Converted Values
  Mean of the integrated profile                  144
 Standard deviation of the integrated profile     59
 Excess kurtosis of the integrated profile         3
 Skewness of the integrated profile                3
 Mean of the DM-SNR curve                          7
 Standard deviation of the DM-SNR curve           23
 Excess kurtosis of the DM-SNR curve              11
 Skewness of the DM-SNR curve                     78
Name: 0, dtype: int32
Chi-2 Scores [ 46

**Explanation**<br>
> The *chi2* function scores each feature by how strongly it depends on the target class. Features with low scores are the most likely to be independent of the class, and therefore irrelevant for classification, so they are the ones to remove.
* Higher chi2 value --> stronger dependence between the feature and the target
* Simply put, the test determines whether an association between two categorical variables observed in the sample is likely to reflect a real association in the population.
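
To make the statistic concrete, here is a minimal sketch that recomputes the chi-squared score by hand and compares it with `sklearn.feature_selection.chi2`. The toy data is made up for illustration. The sketch assumes that `chi2` treats each feature's values as counts and compares the per-class sums with the sums expected if the feature were independent of the class.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data (made up): 6 samples, 2 non-negative count features, binary target.
X = np.array([[3, 0],
              [4, 1],
              [5, 0],
              [0, 1],
              [1, 0],
              [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

# One-hot encode the classes: column 0 is class 0, column 1 is class 1.
Y = np.column_stack([y == 0, y == 1]).astype(float)

observed = Y.T @ X                       # per-class sum of each feature
feature_total = X.sum(axis=0)            # overall sum of each feature
class_prob = Y.mean(axis=0)              # fraction of samples in each class
expected = np.outer(class_prob, feature_total)  # sums expected under independence

# Chi-squared statistic per feature: sum over classes of (O - E)^2 / E.
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, p_values = chi2(X, y)

print("Manual scores :", manual_scores)
print("sklearn scores:", sklearn_scores)
```

The first feature is concentrated in class 1, so its observed sums are far from the expected ones and it receives the higher score.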

**Further Reference**
1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#examples-using-sklearn-feature-selection-chi2
2. https://en.wikipedia.org/wiki/Chi-squared_test

### 2. Mutual Information Gain
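Used for continuous (real-valued) features. `mutual_info_classif` can also handle discrete features through its `discrete_features` parameter.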

In [5]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

X = pd.read_csv("pulsar_features.csv")
y = pd.read_csv("pulsar_target.csv")

print("Actual Values\n",X.iloc[0,:])
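# No discretization is needed here: mutual_info_classif works on continuous
# features directly, so the conversion below stays disabled and X is unchanged
# X = X*100
# X = X.astype(int)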
print("Converted Values\n",X.iloc[0,:])

mutual_info_gain_values = mutual_info_classif(X,y,discrete_features=False,random_state=42)

print("Information Gain Values",mutual_info_gain_values)

Actual Values
  Mean of the integrated profile                  140.562500
 Standard deviation of the integrated profile     55.683782
 Excess kurtosis of the integrated profile        -0.234571
 Skewness of the integrated profile               -0.699648
 Mean of the DM-SNR curve                          3.199833
 Standard deviation of the DM-SNR curve           19.110426
 Excess kurtosis of the DM-SNR curve               7.975532
 Skewness of the DM-SNR curve                     74.242225
Name: 0, dtype: float64
Converted Values
  Mean of the integrated profile                  140.562500
 Standard deviation of the integrated profile     55.683782
 Excess kurtosis of the integrated profile        -0.234571
 Skewness of the integrated profile               -0.699648
 Mean of the DM-SNR curve                          3.199833
 Standard deviation of the DM-SNR curve           19.110426
 Excess kurtosis of the DM-SNR curve               7.975532
 Skewness of the DM-SNR curve              

In [10]:
# If mlxtend is not available, install it before executing the wrapper method
! pip install mlxtend



**Explanation**
> We can see that both methods agree with each other, i.e., the last two features score higher than the first two.
* Higher information gain --> the feature tells us more about the target, so it is a better candidate
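
As a small sanity check of the "higher is better" reading, the sketch below runs `mutual_info_classif` on synthetic data (an assumption for illustration, not the pulsar dataset) with one feature that tracks the target and one that is pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: the first feature follows the class label, the second is noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
informative = y + rng.normal(scale=0.3, size=500)
noise = rng.normal(size=500)
X = np.column_stack([informative, noise])

# discrete_features=False tells the estimator that both columns are continuous.
mi = mutual_info_classif(X, y, discrete_features=False, random_state=0)
print("Mutual information per feature:", mi)
```

The informative column should receive a clearly larger value than the noise column, whose estimate should be close to zero.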

**Further Reference**
1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif
2. http://www.scholarpedia.org/article/Mutual_information

## B. Wrapper-based
In this approach, we select a subset of the available features and train a predictive model on it. Each subset is assigned a score based on the accuracy of that model.
There are several ways of searching for a subset:

1. Methodical (best-first search)
2. Stochastic (random hill-climbing)
3. Heuristic (forward/backward feature selection)
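
The core of any wrapper is "train a model on a candidate subset and score it". A minimal sketch of that step follows. It scores every pair of features by cross-validated accuracy, on synthetic data where columns 1 and 3 carry the signal. The data, the helper name `subset_score`, and the exhaustive search over pairs are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (illustrative only): columns 1 and 3 depend on the target.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 4))
X[:, 1] += 2 * y
X[:, 3] += 2 * y

knn = KNeighborsClassifier(n_neighbors=5)

def subset_score(columns):
    """Mean 5-fold cross-validated accuracy of the model on the given columns."""
    return cross_val_score(knn, X[:, list(columns)], y,
                           cv=5, scoring="accuracy").mean()

# Exhaustively score all 2-feature subsets and keep the best one.
scores = {cols: subset_score(cols) for cols in combinations(range(X.shape[1]), 2)}
best_subset = max(scores, key=scores.get)

for cols, score in scores.items():
    print(cols, round(score, 3))
print("Best subset:", best_subset)
```

Exhaustive search is only feasible for a handful of features. The search strategies listed above exist to avoid evaluating every subset.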

### Sequential Backward Selection (SBS)

In [9]:
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

X = pd.read_csv("pulsar_features.csv")
y = pd.read_csv("pulsar_target.csv")

print("Actual Values\n",X.iloc[0,:])

# Create the sbs object and select best 4 features
knn = KNeighborsClassifier(n_neighbors=4)
# the param forward when set to False will do sequential backward selection
sbs = SFS(knn,
           k_features=4,
           forward=False,
           scoring='accuracy')

sbs = sbs.fit(X, y)
print("Best 4 features: ",sbs.k_feature_idx_)

Actual Values
  Mean of the integrated profile                  140.562500
 Standard deviation of the integrated profile     55.683782
 Excess kurtosis of the integrated profile        -0.234571
 Skewness of the integrated profile               -0.699648
 Mean of the DM-SNR curve                          3.199833
 Standard deviation of the DM-SNR curve           19.110426
 Excess kurtosis of the DM-SNR curve               7.975532
 Skewness of the DM-SNR curve                     74.242225
Name: 0, dtype: float64
Best 4 features:  (1, 2, 3, 6)


**Explanation**<br>
At each step, we find the single worst feature and remove it, so the first feature to be removed has the lowest rank and the last one removed has the highest rank among the discarded features.

The ranking at each step is based on the accuracy score of the predictive model. Here, we used a **KNeighborsClassifier**, but any classification model can be used for this purpose.

We can also observe that the four selected features (indices 1, 2, 3 and 6) are broadly in line with the information gain values. The methods differ in how they search and score, but they all measure the same underlying dependence between the features and the target. Exact agreement is not guaranteed, since filter methods score each feature on its own while a wrapper evaluates subsets jointly.
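
The elimination procedure described above can be sketched without mlxtend. The loop below is a simplified, hypothetical re-implementation (the helper names `cv_accuracy` and `backward_elimination` are ours), run on synthetic data rather than the pulsar dataset. At each step it drops the feature whose removal leaves the highest cross-validated accuracy.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (illustrative only): columns 0 and 4 depend on the target.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 0] += 2 * y
X[:, 4] += 2 * y

def cv_accuracy(columns):
    """Mean 5-fold cross-validated accuracy of KNN on the given columns."""
    model = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(model, X[:, columns], y,
                           cv=5, scoring="accuracy").mean()

def backward_elimination(n_features, k_features):
    """Remove one feature per step until only k_features remain."""
    remaining = list(range(n_features))
    removal_order = []  # first entry = lowest-ranked feature
    while len(remaining) > k_features:
        # Try dropping each remaining feature; the "worst" feature is the one
        # whose removal leaves the best-scoring subset.
        candidates = [(cv_accuracy([c for c in remaining if c != f]), f)
                      for f in remaining]
        _, worst_feature = max(candidates)
        remaining.remove(worst_feature)
        removal_order.append(worst_feature)
    return remaining, removal_order

selected, removal_order = backward_elimination(X.shape[1], k_features=2)
print("Selected features:", selected)
print("Removal order (worst first):", removal_order)
```

Each step retrains the model once per remaining feature, which is why wrapper methods cost far more than the filter methods from section A.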