### Table of contents
1. Introduction
    * 1.1 What is Feature Selection?
    * 1.2 Benefit of Feature Selection
<br>
<br>
2. Filter Method
    * 2.1 Univariate
        * 2.1.1 Constant Removal
        * 2.1.2 Quasi-Constant Removal
        * 2.1.3 Duplicate Feature Removal
        * 2.1.4 Chi-square Test
        * 2.1.5 ANOVA Test
    * 2.2 Multivariate
        * 2.2.1 Correlation Coefficient
        * 2.2.2 Information Gain
            * 2.2.2.1 Classification
            * 2.2.2.2 Regression
<br>
<br>
3. Wrapper Method
    * 3.1. Forward Feature Selection
    * 3.2. Backward Feature Selection
    * 3.3. Recursive Feature Selection
<br>
<br>
4. Embedding Method
    * 4.1. Lasso Regression
    * 4.2. Ridge Regression
    * 4.3 Feature Importance
<br>
<br>

### <i>Dataset Used</i>
<img src="https://i.ibb.co/yXbLFmh/data-used.png" alt="data-used" border="0">


## 1. Introduction
<img src="https://i.ibb.co/mv8Vxch/feature-selection.png" alt="feature-selection" border="0">
<br>
<br>

### 1.1 What is Feature Selection?

In feature selection, we choose the features that contribute most to the variable we want to predict.<br>
Irrelevant features can decrease the accuracy of the model.<br>
Training a model on a large number of features can also take a lot of time.<br>


When the curse of dimensionality kicks in, model performance drops.
<img src="https://i.ibb.co/s90zgTM/curse-of-dim.png" alt="curse-of-dim" border="0">

<img src="https://i.ibb.co/kQVNMZ6/what-is-feature-importance.png" alt="what-is-feature-importance" border="0">

### 1.2 Benefit of Feature Selection

* Models with fewer features are easier to explain
* It is easier to implement machine learning models with fewer features
* It can reduce overfitting
* Training time of models with fewer features is significantly lower
* Models with fewer features are less prone to errors

In [None]:
# Import the libraries used throughout the notebook
import pandas as pd
from sklearn.model_selection import train_test_split

# Read a CSV file into a DataFrame
def read_dataset(file_name):
    csv_ = pd.read_csv(file_name)
    return csv_

# Split a DataFrame into the features X and the target column y
def drop_coloumn(dataframe, label):
    X = dataframe.drop(label, axis=1)
    y = dataframe[label]
    return X, y

## 2. Filter Method
Filter methods use statistical tests to score features and choose a subset of them. They do not depend on any machine learning model.<br>

#### Advantages of Filter Methods
* Less computationally expensive
* Use statistical tests such as chi-square
* Rank features by their individual predictive power

### 2.1 Univariate
Univariate filter methods evaluate each feature individually and rank the features according to a specific criterion; the best subset is then taken from the top of the ranking.
<br>
Univariate methods:
* Constant Removal
* Quasi Constant Removal
* Duplicate Feature Removal
* Chi-square Test
* Anova Test 

#### 2.1.1 Constant Removal
In constant removal, we drop the features that have zero variance. Constant features take the same value in every row, so they provide no information that can help in classification.
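
To make the idea concrete before turning to scikit-learn, here is a minimal sketch in plain Python. The toy columns and their names are hypothetical and only serve to show that a constant column has zero variance.

```python
from statistics import pvariance

# Hypothetical toy data: each key is a feature (column) name.
columns = {
    "age": [23, 35, 41, 52],
    "country_code": [7, 7, 7, 7],  # constant: same value in every row
    "balance": [100.0, 250.5, 80.0, 310.0],
}

# Population variance of each column; a constant column has variance 0.
variances = {name: pvariance(values) for name, values in columns.items()}

# Keep only the features whose variance is strictly greater than 0.
kept = [name for name, var in variances.items() if var > 0]

print(variances["country_code"])
print(kept)
```

`VarianceThreshold(threshold=0)` applies the same rule to every column of the training data at once.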

In [None]:
#reading dataset
Santander=read_dataset("../input/santander-customer-satisfaction/train.csv")
X_train,y_train=drop_coloumn(Santander,'TARGET')

print(X_train.shape, y_train.shape)

In [None]:
from sklearn.feature_selection import VarianceThreshold
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(X_train)

In [None]:
constant_list = [not temp_feat for temp_feat in constant_filter.get_support()]

In [None]:
X_train_filter = constant_filter.transform(X_train)

In [None]:
X_train_filter.shape,X_train.shape

<b>The number of features has been reduced from 370 to 336.</b>

#### 2.1.2 Quasi Constant Removal
Quasi-constant removal extends constant removal: besides the features with zero variance, we also remove the features whose variance does not exceed a small threshold (0.01 in the code below).
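
To get a feel for what a variance threshold of 0.01 means, the sketch below looks at a 0/1 feature. For such a feature the variance is p(1 - p), where p is the fraction of rows equal to 1. The helper name `binary_variance` and the chosen values of p are illustrative assumptions.

```python
# Variance of a 0/1 feature in which a fraction p of the rows equals 1.
def binary_variance(p):
    return p * (1 - p)

threshold = 0.01  # the threshold used with VarianceThreshold in this section

for p in (0.5, 0.05, 0.005):
    var = binary_variance(p)
    verdict = "removed" if var <= threshold else "kept"
    print(f"p={p}: variance={var:.6f} -> {verdict}")
```

Under this assumption, a binary feature is treated as quasi-constant only when roughly 99% or more of its rows share the same value.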

In [None]:
quasi_constant_filter = VarianceThreshold(threshold=0.01)
quasi_constant_filter.fit(X_train)

In [None]:
constant_list = [not temp_feat for temp_feat in quasi_constant_filter.get_support()]

In [None]:
X_train_quasi_filter  = quasi_constant_filter.transform(X_train)

In [None]:
X_train_quasi_filter.shape

<b>After quasi-constant removal, the number of features has been reduced from 370 to 274.</b>

#### 2.1.3 Duplicate Feature Removal
Sometimes a dataset contains duplicate columns. We keep the first occurrence of each column and remove the later duplicates.
<img src="https://i.ibb.co/WcJR2Vb/Duplicate-Feature-Removal.png" alt="Duplicate-Feature-Removal" border="0">
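
The keep-the-first rule can be sketched in plain Python as follows. The toy columns are hypothetical; the notebook itself uses a transpose plus `duplicated()` on a DataFrame, which follows the same rule.

```python
# Hypothetical toy data: "b" is an exact copy of "a".
columns = {
    "a": [1, 0, 1, 1],
    "b": [1, 0, 1, 1],
    "c": [0, 0, 1, 0],
}

seen = {}        # maps a column's values to the first column name that had them
duplicates = []  # names of columns that repeat an earlier column

for name, values in columns.items():
    key = tuple(values)          # tuples are hashable, lists are not
    if key in seen:
        duplicates.append(name)  # keep the first occurrence, drop this one
    else:
        seen[key] = name

unique_columns = {n: v for n, v in columns.items() if n not in duplicates}
print(duplicates)
print(list(unique_columns))
```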

In [None]:
X_train_T = X_train_quasi_filter.T

In [None]:
type(X_train_T)

In [None]:
X_train_T = pd.DataFrame(X_train_T)

In [None]:
X_train_T.shape

In [None]:
X_train_T.duplicated().sum()

In [None]:
duplicated_features = X_train_T.duplicated()

In [None]:
features_to_keep = [not index for index in duplicated_features]

In [None]:
X_train_unique = X_train_T[features_to_keep].T

In [None]:
X_train_unique.shape, X_train.shape

<b>In the Santander dataset we have 3 duplicate columns.</b>
The number of columns has been reduced from 370 to 256.

#### 2.1.4 Chi-square (χ2) Test
It is a statistical test applied to groups of categorical features to evaluate the likelihood of an association between them, using their frequency distributions.

<b>Note-</b>This score should be used to evaluate categorical variables in a classification task.
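
As an illustration of the statistic itself, the sketch below computes the classic chi-square value for a small contingency table by hand. The observed counts are hypothetical. Note that scikit-learn's `chi2` builds its statistic directly from the feature values, so its numbers will not necessarily match a hand-built contingency table.

```python
# Hypothetical 2x2 contingency table of observed counts:
# rows = categories of a feature, columns = classes of the target.
observed = [
    [10, 20],
    [30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected in this cell if feature and target were independent.
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (obs - expected) ** 2 / expected

print(round(chi_square, 4))
```

The larger the statistic, the further the observed counts are from what independence would predict.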

In [None]:
import seaborn as sns
from sklearn.feature_selection import chi2
titanic=sns.load_dataset('titanic')

In [None]:
titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)

In [None]:
titanic = titanic.dropna()

In [None]:
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()

In [None]:
#encoding
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)

ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)

who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)

alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)

In [None]:
X_train = data.copy()
y_train = titanic['survived']

In [None]:
# chi2 returns two arrays: the chi-square statistics and the corresponding p-values
f_score = chi2(X_train, y_train)

In [None]:
import pandas as pd
p_values=pd.Series(f_score[1])
p_values.index=X_train.columns
p_values

In [None]:
p_values.sort_values(ascending=False).plot.bar(figsize=(20, 8))

#### 2.1.5 Anova Test 

ANOVA stands for analysis of variance. It is similar to LDA, except that it operates on one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal.

<b>Note-This score should be used to evaluate continuous variables in a classification task.</b>
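
The F-statistic behind the test can be sketched by hand for a single feature. The group values below are hypothetical; the statistic compares the variation between the class means with the variation inside each class.

```python
# Hypothetical values of one continuous feature, grouped by class label.
groups = {
    "class_0": [1.0, 2.0, 3.0],
    "class_1": [2.0, 3.0, 4.0],
    "class_2": [5.0, 6.0, 7.0],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

ss_between = 0.0  # variation of the group means around the grand mean
ss_within = 0.0   # variation of the values around their own group mean
for values in groups.values():
    mean = sum(values) / len(values)
    ss_between += len(values) * (mean - grand_mean) ** 2
    ss_within += sum((v - mean) ** 2 for v in values)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print(f_statistic)
```

A large F value means the class means differ a lot relative to the spread inside each class, which is what makes a feature useful for separating the classes.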

In [None]:
from sklearn.datasets import load_wine
import pandas as pd
data=load_wine()

X_train = pd.DataFrame(data.data)
y_train = data.target

X_train.columns = data.feature_names
X_train.head()

print(X_train.shape,y_train.shape)

In [None]:
from sklearn.feature_selection import f_classif
# f_classif returns two arrays: the ANOVA F-statistics and the corresponding p-values
f_score = f_classif(X_train, y_train)

In [None]:
import pandas as pd
p_values=pd.Series(f_score[1])
p_values.index=X_train.columns
p_values

p_values.sort_index(ascending=False)

In [None]:
p_values.sort_values(ascending=False).plot.bar(figsize=(20, 8))

### 2.2 Multivariate

#### 2.2.1 Correlation Coefficient

Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation is given as:

<img src="https://i.ibb.co/yBRFr3F/person-correlation.png" alt="person-correlation" border="0">
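
For readers who prefer code to formulas, here is a minimal plain-Python sketch of the coefficient: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The function name `pearson` and the sample values are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))  # perfectly increasing linear relation
print(pearson(x, [10, 8, 6, 4, 2]))  # perfectly decreasing linear relation
```

The two calls hit the extremes of the range: a perfect increasing line gives +1 and a perfect decreasing line gives -1.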

In [None]:
#reading dataset
Santander=read_dataset("../input/santander-customer-satisfaction/train.csv")
X_train,y_train=drop_coloumn(Santander,'TARGET')

print(X_train.shape, y_train.shape)

In [None]:
def correlation(dataset, threshold):
    col_corr = set()  # names of columns that are highly correlated with an earlier column
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:  # compare the absolute coefficient
                colname = corr_matrix.columns[i]  # name of the later column in the pair
                col_corr.add(colname)
    return col_corr

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
#Using Pearson Correlation
plt.figure(figsize=(12,10))
cor = X_train.corr()
sns.heatmap(cor, annot=True)
plt.show()

In [None]:
corr_features = correlation(X_train, 0.9)
len(set(corr_features))

In [None]:
X_train.drop(corr_features,axis=1)

#### 2.2.2 Information Gain
Mutual information (MI) between two random variables is a non-negative value that measures the dependency between the variables. It is equal to zero if and only if the two random variables are independent, and higher values mean higher dependency.

1. If X and Y are independent, then no information about Y can be obtained by knowing X or vice versa. Hence their mutual information is 0.
2. If X is a deterministic function of Y, then knowing Y removes all uncertainty about X, and the mutual information reaches its maximum for X, namely the entropy of X. This maximum is not capped at 1 in general.
3. When Y also depends on other variables, for example Y = f(X, Z, M, N), the mutual information between X and Y typically falls between these two extremes.

The mutual information between two random variables X and Y can be stated formally as follows:

<img src="https://i.ibb.co/bNkrQzR/information-gain.png" alt="information-gain" border="0">
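
The definition can be checked on small discrete examples. The sketch below sums p(x, y) * log(p(x, y) / (p(x) p(y))) over a joint probability table; the function name and the two toy distributions are illustrative assumptions, and the result is in nats because the natural logarithm is used.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    joint[i][j] is the probability P(X = i, Y = j).
    """
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # terms with zero probability contribute nothing
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells us nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is always equal to X

print(mutual_information(independent))
print(mutual_information(identical))
```

The independent table gives zero, while the fully dependent one gives ln 2, the entropy of a fair binary variable.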

##### 2.2.2.1 Classification

In [None]:
from sklearn.datasets import load_wine
import pandas as pd
data=load_wine()

X_train = pd.DataFrame(data.data)
y_train = data.target

X_train.columns = data.feature_names
X_train.head()

print(X_train.shape,y_train.shape)

In [None]:
from sklearn.feature_selection import mutual_info_classif
# determine the mutual information
mutual_info = mutual_info_classif(X_train,y_train)
mutual_info

In [None]:
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)

In [None]:
mutual_info.sort_values(ascending=False).plot.bar(figsize=(20, 8))

In [None]:
from sklearn.feature_selection import SelectKBest
# Now we select the top 5 features by mutual information
sel_five_cols = SelectKBest(mutual_info_classif, k=5)
sel_five_cols.fit(X_train, y_train)
X_train.columns[sel_five_cols.get_support()]

##### 2.2.2.2 Regression

In [None]:
# load_boston was removed in scikit-learn 1.2, so load_diabetes is used here
# as a stand-in regression dataset
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import mutual_info_regression

In [None]:
diabetes = load_diabetes()

In [None]:
X_train = pd.DataFrame(data = diabetes.data, columns=diabetes.feature_names)
X_train.head()

In [None]:
y_train = diabetes.target

In [None]:
mi = mutual_info_regression(X_train, y_train)
mi = pd.Series(mi)
mi.index = X_train.columns
mi.sort_values(ascending=False, inplace = True)

In [None]:
mi.plot.bar(figsize=(20, 8))

In [None]:
sel = SelectKBest(mutual_info_regression, k = 9).fit(X_train, y_train)
X_train.columns[sel.get_support()]

<img src="https://i.ibb.co/cTX0DDp/feature-type-data.png" alt="feature-type-data" border="0">

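## 3. Wrapper Method

Wrapper methods search for a good feature subset by training a model on each candidate subset and scoring it.<br>
We perform the wrapper methods using the mlxtend library:<br>
http://rasbt.github.io/mlxtend/<br>
<br>
<br>
To install it:<br>
<i><b>pip install mlxtend</b></i><br>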

<img src="https://i.ibb.co/qjw6jL5/wrapper-method.png" alt="wrapper-method" border="0">

### 3.1. Forward Feature Selection

In forward feature selection, we start with a null model (no features), fit the model with each candidate feature one at a time, and add the feature that maximizes our criterion function. We repeat this process until we have k features.<br>
k = number of desired features

<img src="https://i.ibb.co/Cv6RDhZ/forward-feature-selection.png" alt="forward-feature-selection" border="0">
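
The greedy loop can be sketched independently of any library. In the sketch below, `forward_selection`, `toy_score`, and the `GAIN` table are hypothetical stand-ins; in practice the criterion would be a cross-validated model score, as in the mlxtend cell that follows.

```python
def forward_selection(features, score, k):
    """Greedily add the feature that most improves score() until k are chosen."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in features if f not in selected]
        # Try each remaining feature and keep the one giving the best score.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Hypothetical criterion: each feature has an individual gain, and "b" is
# largely redundant once "a" is present.
GAIN = {"a": 0.30, "b": 0.28, "c": 0.10, "d": 0.02}

def toy_score(subset):
    total = sum(GAIN[f] for f in subset)
    if "a" in subset and "b" in subset:
        total -= 0.25  # penalty for the redundant pair
    return total

print(forward_selection(list(GAIN), toy_score, k=2))
```

Although "b" has the second-highest individual gain, the procedure skips it in the second round because it adds little on top of "a". A univariate ranking would not capture this.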

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

In [None]:
from sklearn.datasets import load_wine

In [None]:
data = load_wine()

In [None]:
data.keys()

In [None]:
X_train = pd.DataFrame(data.data)
y_train = data.target

In [None]:
X_train.columns = data.feature_names
X_train.head()

In [None]:
X_train.isnull().sum()

In [None]:
sfs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs = -1),
         k_features = 7,
          forward= True,
          floating = False,
          verbose= 2,
          scoring= 'accuracy',
          cv = 4,
          n_jobs= -1
         ).fit(X_train, y_train)

<b>Features selected by forward feature selection</b>

In [None]:
sfs.k_feature_names_

In [None]:
sfs.k_score_ 

### 3.2. Backward Feature Selection
In backward elimination, we start with all the independent variables and, at each step, remove the least significant feature: the one whose removal leaves the criterion function highest. We repeat this process until we have k features.<br>
k = number of desired features

<img src="https://i.ibb.co/hCyDgpg/backward-feature-selection.png" alt="backward-feature-selection" border="0">
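
A library-free sketch of the elimination loop is shown below. As before, `backward_elimination`, `toy_score`, and the `GAIN` table are hypothetical; a real criterion would be a cross-validated model score.

```python
def backward_elimination(features, score, k):
    """Greedily drop the feature whose removal keeps score() highest."""
    selected = list(features)
    while len(selected) > k:
        # For each candidate, score the subset that remains without it.
        drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
        selected.remove(drop)
    return selected

# Hypothetical criterion: individual gains, with "a" and "b" mostly redundant.
GAIN = {"a": 0.30, "b": 0.28, "c": 0.10, "d": 0.02}

def toy_score(subset):
    total = sum(GAIN[f] for f in subset)
    if "a" in subset and "b" in subset:
        total -= 0.25  # penalty for the redundant pair
    return total

print(backward_elimination(list(GAIN), toy_score, k=2))
```

The first round drops the weakest feature "d"; the second round drops "b", because removing it costs less than removing "a" or "c" once the redundancy penalty is taken into account.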

In [None]:
sfs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs = -1),
         k_features = (1, 8),  # a (min, max) tuple: keep the best-scoring subset size in this range
          forward= False,
          floating = False,
          verbose= 2,
          scoring= 'accuracy',
          cv = 4,
          n_jobs= -1
         ).fit(X_train, y_train)

<b>Features selected by backward feature selection</b>

In [None]:
sfs.k_feature_names_

In [None]:
sfs.k_score_ 

### 3.3. Recursive Feature Selection
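The selector used here, mlxtend's `ExhaustiveFeatureSelector`, performs a brute-force evaluation of feature subsets: it tries every combination of features whose size lies between `min_features` and `max_features` and returns the best-performing subset. This makes it the most thorough wrapper method covered so far, and also the most expensive one.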
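
To see why the brute-force search is expensive, the short calculation below counts the candidate subsets, assuming the 13 features of the wine dataset and subset sizes from 4 to 8.

```python
from math import comb

n_features = 13                    # number of features in the wine dataset
min_features, max_features = 4, 8  # the bounds passed to the selector in this section

# Number of distinct subsets for each allowed size, then the total.
subsets_per_size = {k: comb(n_features, k) for k in range(min_features, max_features + 1)}
total_subsets = sum(subsets_per_size.values())

print(subsets_per_size)
print(total_subsets)
```

Each of these subsets requires at least one model fit, so the cost grows quickly with the number of features and the width of the size range.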

In [None]:
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

In [None]:
efs = EFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
         min_features= 4,
          max_features= 8,
          scoring='accuracy',
          cv = None,
          n_jobs=-1
         ).fit(X_train, y_train)

In [None]:
efs.best_score_

In [None]:
efs.best_feature_names_

## 4. Embedding Method
<img src="https://i.ibb.co/7Y627DX/Emedding-Method.png" alt="Emedding-Method" border="0">


Embedding methods perform feature selection as part of model training. Regularization is a common approach:<br>
* Lasso: L1 regularization <br>
* Ridge: L2 regularization <br>
* Elastic Net: L1 and L2 regularization <br>
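
The practical difference between the L1 and L2 penalties can be illustrated with a small sketch. It assumes an orthonormal design matrix, for which both penalized least-squares problems have simple closed-form solutions; the function names, coefficients, and penalty strength `lam` are hypothetical.

```python
def lasso_shrink(coef, lam):
    """Soft-thresholding: the L1 solution for an orthonormal design."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(coef, lam):
    """The L2 solution for an orthonormal design: proportional shrinkage."""
    return coef / (1 + lam)

# Hypothetical unpenalized least-squares coefficients.
ols_coefs = {"f1": 2.0, "f2": -0.3, "f3": 0.1}
lam = 0.5

lasso = {name: lasso_shrink(c, lam) for name, c in ols_coefs.items()}
ridge = {name: ridge_shrink(c, lam) for name, c in ols_coefs.items()}

print(lasso)
print(ridge)
```

Under this assumption the L1 penalty removes the two weak features entirely, while the L2 penalty only scales every coefficient down. This is why L1 regularization acts as a feature selector on its own, whereas L2 needs an extra threshold on the coefficient size.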

### 4.1. Lasso Regression

Because the Titanic target is a class label, the L1 penalty is applied here through `LogisticRegression(penalty = 'l1')` rather than a `Lasso` regressor. `SelectFromModel` then keeps the features whose coefficients were not shrunk to zero.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel

In [None]:
titanic = sns.load_dataset('titanic')

In [None]:
titanic.isnull().sum()

In [None]:
titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)

In [None]:
titanic = titanic.dropna()
titanic.isnull().sum()

In [None]:
titanic.head()

In [None]:
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()

In [None]:
data.head()

In [None]:
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)

In [None]:
ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)

In [None]:
who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)

In [None]:
alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)

In [None]:
X_train = data.copy()
y_train = titanic['survived']

In [None]:
sel = SelectFromModel(LogisticRegression(C = 0.05, penalty = 'l1', solver = 'liblinear'))
sel.fit(X_train, y_train)

In [None]:
sel.get_support()

In [None]:
features = X_train.columns[sel.get_support()]
features

In [None]:
X_train_l1 = sel.transform(X_train)

In [None]:
X_train_l1.shape, X_train.shape

### 4.2. Ridge Regression

With an L2 penalty the coefficients shrink but rarely become exactly zero, so `SelectFromModel` falls back to its default threshold and keeps the features whose absolute coefficient is at least the mean absolute coefficient.

In [None]:
sel = SelectFromModel(LogisticRegression(C = 0.05, penalty = 'l2', solver = 'liblinear'))
sel.fit(X_train, y_train)

In [None]:
sel.get_support()

In [None]:
features = X_train.columns[sel.get_support()]
features

In [None]:
X_train_l2 = sel.transform(X_train)

In [None]:
X_train_l2.shape

### 4.3 Feature Importance

In [None]:
from sklearn.datasets import load_wine
import pandas as pd
data = load_wine()

data.keys()

X = pd.DataFrame(data.data)
y = data.target

X.columns = data.feature_names
X.head()

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

In [None]:
importance = pd.concat([pd.Series(X.columns), pd.Series(rf.feature_importances_)], axis = 1)

In [None]:
importance.columns = ['features', 'importance']

In [None]:
importance.sort_values(by = 'importance', ascending = False, inplace = True)

In [None]:
# Select the 8 most important features (positional slice of the sorted table)
X[importance['features'].iloc[:8]]

In [None]:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
model = RandomForestClassifier(n_estimators=100, random_state=0)

model.fit(X,y)

In [None]:
print(model.feature_importances_)

In [None]:
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

## End Note
<img src="https://i.ibb.co/K7Jwh58/difference-table.png" alt="difference-table" border="0">