When building machine learning models, performance does not always improve as we create new features. Most of us have faced the problem of identifying the good features among those we already have or those we created. Feature selection techniques come to the rescue here. Feature selection is one of the core concepts in machine learning and has a large impact on the performance of your model.

The features that you use to train your machine learning models have a huge influence on the performance you can achieve. Irrelevant or partially relevant features can hurt model performance. Feature selection and data cleaning should be the first and most important steps in designing your model.


**Advantages of Feature selection:**

<b>Reduces Overfitting:</b> Less redundant data means less opportunity to make decisions based on noise.<br>

<b>Improves Accuracy:</b> Less misleading data means modeling accuracy improves.<br>

<b>Reduces Training Time:</b> Fewer features reduce algorithm complexity, so algorithms train faster.<br>

In this notebook, we will get familiar with some of the commonly used feature selection techniques.

**1. Filter methods**
```
    - chi2 test
    - Anova F test
    - Using Pearson's correlation matrix
```
**2. Wrapper methods**
``` 
    - Forward feature selection
    - Backward selection
    - Recursive feature elimination
```
**3. Embedded methods**
```   
    - Lasso
    - Ridge
    - Elastic net
```   

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import auc,roc_auc_score,roc_curve
from sklearn.model_selection import GridSearchCV

In [None]:
## train test file path
data = '../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv'

df = pd.read_csv(data)
print(df.shape)
df.head(3)

In [None]:
df.info()

In [None]:
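# basic preprocessing


df = df.drop(columns=['customerID'])
# TotalCharges contains blank strings; convert them to NaN, then impute with the
# mean of the valid values (a -1 sentinel would bias the mean downwards)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())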


num_cols = ['TotalCharges','MonthlyCharges','tenure']
for col in num_cols:
    df[col] = df[col].astype(np.float32)

# [1] Filter Methods

Many people prefer wrapper methods such as forward feature selection or backward elimination for feature selection. During EDA, however, the easiest way to select features before moving on to the next step is to use univariate methods such as the chi2 test, the ANOVA test, or a correlation matrix.

## [1.1] Chi square test (For categorical data)

In classification problems where the input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent on or independent of each input variable. If the two are independent, the input variable is a candidate for a feature that may be irrelevant to the problem and can be removed from the dataset. Pearson's chi-squared test is an example of a test for independence between categorical variables.

The chi2 test can be used to gauge the importance of categorical variables in classification problems. It checks whether a relationship (dependency) exists between two features. Let us take some categorical variables and see how it performs.

In [None]:
df['SeniorCitizen'].value_counts()

**1. Design the hypothesis**

**Null hypothesis (H0): the two features are independent**

**Alternate hypothesis (H1): the two features are dependent**

If we fail to reject the null hypothesis, we have no evidence of a relationship between the input variable and the target, and we can drop that feature going forward.

**2. Choose an alpha value**

We need to choose an alpha value (significance level), which sets how much evidence we require before declaring two features dependent. Here we choose alpha = 0.05, which is the probability of rejecting the null hypothesis when it is actually true.

If the p-value obtained is greater than alpha, we fail to reject the null hypothesis (H0). If it is less than or equal to alpha, we reject H0.

In [None]:
# contingency table
alpha = 0.05
cont_table = pd.crosstab(index=df['SeniorCitizen'],columns=df['Churn'])
cont_table

In [None]:
from scipy.stats import chi2_contingency,chi2

# chi2 value, p value, degree of freedom , expected_table
chi2_value, p, dof, expected_table = chi2_contingency(cont_table)

print(f'chi2 value: {chi2_value}')
print(f'p value: {p}')
print(f'degree of freedom: {dof}')
print(f'expected table/array : \n {expected_table}')

In [None]:
## calculated chi2 value >= critical value from the table (found using dof and alpha) --> H0 is rejected
## i.e., chi2_value >= chi2.ppf(1 - alpha, dof) --> H0 rejected (equivalent to p <= alpha)

if p <= alpha:
    print('Reject the null hypothesis. There is some relationship between the features.')
else:
    print('Fail to reject the null hypothesis. No evidence that the two features are related.')

Here we rejected the null hypothesis, which means the features are not independent. There is some relationship between SeniorCitizen and Churn.

Note: The chi-squared test is one-tailed (right-tailed). A statistic below the critical value means we fail to reject H0; a statistic at or above it means we reject H0.
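The critical-value rule and the p-value rule lead to the same decision. The sketch below illustrates this on a small made-up contingency table (the counts in `observed` are hypothetical and are not taken from the churn data); it only uses `chi2_contingency` and `chi2.ppf` from SciPy.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x2 contingency table (rows: feature levels, columns: target levels)
observed = np.array([[90, 60],
                     [30, 70]])
alpha = 0.05

stat, p, dof, expected = chi2_contingency(observed)

# Critical value: the (1 - alpha) quantile of the chi-squared distribution
critical = chi2.ppf(1 - alpha, dof)

# Both rules should agree on whether H0 is rejected
reject_by_critical = bool(stat >= critical)
reject_by_p = bool(p <= alpha)

print(f'statistic = {stat:.3f}, critical value = {critical:.3f}, p value = {p:.4g}')
print(f'reject H0 (critical value rule): {reject_by_critical}')
print(f'reject H0 (p value rule): {reject_by_p}')
```

In practice the p-value rule is more convenient because it does not require looking up a critical value for each degree of freedom.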

In [None]:
def chi2_test(X,target,alpha=0.05):
    """
    X = input dataframe of categorical features
    target = target series
    alpha = significance level
    Returns a dict {column: p value} for the columns where H0 is rejected.
    """
    useful_cols = {}
    for col in X.columns: 
        cont_table = pd.crosstab(index=X[col],columns=target)
        chi2_value, p, dof, expected_table = chi2_contingency(cont_table)
        if p <= alpha:
            # reject null hypothesis # so, important feature
            useful_cols[col] = p
    print(f'Total {len(useful_cols)} features selected')
    return useful_cols


In [None]:
chi2_test(df[['gender','SeniorCitizen', 'Partner','PhoneService','PaperlessBilling']],df['Churn'])

### Using Sklearn library

First we have to label encode categorical features

In [None]:
from sklearn.feature_selection import SelectKBest,chi2

cat_cols = ['gender','SeniorCitizen', 'Partner','PhoneService','PaperlessBilling']
# .copy() avoids pandas' SettingWithCopyWarning when assigning to the subset
dff = df[cat_cols + ['Churn']].copy()

# label encode categorical features (most frequent level -> 0, next -> 1, ...)
for col in cat_cols:
    dff[col] = dff[col].map({v:i for i,v in enumerate(dff[col].value_counts().index)})



We will set k = 5 to show the scores of all features. If we want only the top 3 features, we can set k = 3 directly.

In [None]:
best = SelectKBest(chi2,k=5)
best.fit(dff[['gender','SeniorCitizen', 'Partner','PhoneService','PaperlessBilling']],dff['Churn'])

In [None]:
df_score = pd.DataFrame(best.pvalues_,columns=['p_values'])
df_score['chi2_values'] = best.scores_
df_score['columns'] = ['gender','SeniorCitizen', 'Partner','PhoneService','PaperlessBilling']
df_score.sort_values(by='p_values')

Our significance level is alpha = 0.05, so we keep the features with p-value <= alpha.

In [None]:
df_score[df_score['p_values'] <= 0.05]['columns']

## [1.2] Using Pearson's correlation matrix

In [None]:
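# Churn is still a 'Yes'/'No' string column here, so encode it as 0/1 before computing correlations
df_cor = df[['TotalCharges','MonthlyCharges']].assign(Churn=(df['Churn'] == 'Yes').astype(int)).corr()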
plt.figure(figsize=(10,10))
sns.heatmap(df_cor,annot=True)
plt.show()

Here we do not find any strong correlation between any features, so the matrix is not helpful in this case. If two input features are strongly correlated with each other (positively or negatively), we can remove one of them.

That is, if a feature is important:

* It will have weak correlation with the other independent features
* It will have strong correlation with the target (dependent feature)
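The first rule above can be automated. The following is a minimal sketch, assuming a threshold of 0.9 on the absolute Pearson correlation; the helper `drop_correlated` and the toy columns `a`, `a_copy` and `b` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.9):
    """Drop one column from every pair whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'a_copy': a * 2 + rng.normal(scale=0.01, size=200),  # nearly duplicates 'a'
    'b': rng.normal(size=200),                            # generated independently of 'a'
})

reduced, dropped = drop_correlated(toy, threshold=0.9)
print(f'dropped columns: {dropped}')
print(f'remaining columns: {list(reduced.columns)}')
```

Which column of a correlated pair gets dropped depends only on column order here; a more careful version could keep the one that correlates more strongly with the target.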

## [1.3] ANOVA F-test


One-way ANOVA can be used to test the relationship between a numeric variable and a categorical variable.


Here, <br> 
**Null hypothesis H0: all groups have the same mean.** <br>
**Alternate hypothesis H1: at least one group has a different mean.**



That is, if the groups defined by the target classes have the same mean for a feature, the feature does not help separate the classes and we can drop it during feature selection. Otherwise we keep the feature.

The basic idea is to compute
```
Fscore = (variance_between_groups / variance_within_groups) 
```

and compare it with the critical value from an F-distribution table to decide whether to reject the null hypothesis.
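To make the ratio concrete, the sketch below computes the F score by hand on synthetic data and checks it against `scipy.stats.f_oneway` and scikit-learn's `f_classif`. The two groups are randomly generated stand-ins for one numeric feature split by a binary target, not values from the churn dataset.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(42)
# Hypothetical numeric feature observed in two target classes
group0 = rng.normal(loc=50, scale=10, size=100)
group1 = rng.normal(loc=60, scale=10, size=80)
groups = [group0, group1]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), len(all_values)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# The same statistic from SciPy and from scikit-learn
f_scipy, p_scipy = f_oneway(group0, group1)
X_toy = all_values.reshape(-1, 1)
y_toy = np.array([0] * len(group0) + [1] * len(group1))
f_sklearn, p_sklearn = f_classif(X_toy, y_toy)

print(f'manual F: {f_manual:.4f}')
print(f'scipy F: {f_scipy:.4f}, p value: {p_scipy:.4g}')
print(f'sklearn F: {f_sklearn[0]:.4f}, p value: {p_sklearn[0]:.4g}')
```

A large F means the class means are far apart relative to the spread inside each class, which is what makes the feature useful for classification.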


Sklearn provides a function called f_classif to run the ANOVA F-test, and we can combine it with SelectKBest to score all input features at once instead of testing each one separately. If the value 'variance_between / variance_within' is less than the critical value (looked up in an F-distribution table), we fail to reject the null hypothesis. The library returns a score and a p-value for each feature; p < 0.05 means we reject, at the 5% significance level, the hypothesis that the classes share the same mean for that feature, so the feature is related to the target. We then select the top k features according to the F score.


In [None]:
dff = df[['TotalCharges','MonthlyCharges','Churn']]

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# k=2 keeps both features so that we can inspect the scores of each
selector = SelectKBest(f_classif,k=2)
selector.fit(dff[['TotalCharges','MonthlyCharges']],dff['Churn'])


df_score = pd.DataFrame(selector.pvalues_,columns=['p_values'])
df_score['score'] = selector.scores_
df_score['columns'] = ['TotalCharges','MonthlyCharges']

df_score

In [None]:
df_score[df_score['p_values'] <= 0.05]['columns']

# [2] Wrapper Methods

Even though we have filter methods, they are often less accurate because they score features without reference to the model, so we also have wrapper methods such as forward feature selection and backward elimination.

## [2.1] Forward Selection

It is an iterative method in which we start with zero features at the beginning and in each iteration, we keep adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.
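The loop described above can be sketched by hand as follows. This is an illustrative sketch on synthetic data from `make_classification`, not the implementation used by any library; the helper name `forward_select` and its parameters are hypothetical. It stops either when `max_features` is reached or when the best candidate no longer improves the cross-validated score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_select(estimator, X, y, max_features, cv=3, scoring='roc_auc'):
    """Greedy forward selection: add the feature that improves the CV score the most."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Score every candidate feature when added to the current subset
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y,
                               cv=cv, scoring=scoring).mean()
            for j in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # no candidate improves the score, so stop early
        best_score = scores[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
    return selected, best_score

# Synthetic data: 8 features, 3 of them informative
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
selected, score = forward_select(tree, X_toy, y_toy, max_features=4)
print(f'selected feature indices: {selected}')
print(f'cross-validated AUC: {score:.4f}')
```

Each round refits the model once per remaining feature, which is why wrapper methods are much more expensive than filter methods.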

Before proceeding further, let us create a baseline model first. We will build a decision tree classifier.

In [None]:
num_cols = ['TotalCharges','MonthlyCharges','tenure']

for col in df.columns:
    if col not in num_cols:
        df[col] = df[col].map({v:i for i,v in enumerate(df[col].value_counts().index)})

df.head()

In [None]:

y = df['Churn']
X = df.drop(columns=['Churn'])
print(X.shape,y.shape)
print('-'*50)

# 80-20 train-test split (test_size=0.2)
x_train,x_test,y_train,y_test = train_test_split(X,y,random_state=100,stratify=y,test_size=0.2)

print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)


Before going further let us build and tune a model

In [None]:

clf = DecisionTreeClassifier()
samplesplits = [5, 10, 100, 500]
maximumdepth = [1, 5, 10, 50, 100, 500, 1000]
parameters = {'min_samples_split':samplesplits ,'max_depth':maximumdepth}

model = GridSearchCV(estimator=clf, param_grid=parameters, cv=3, n_jobs=-1, scoring='roc_auc',return_train_score=True)
model.fit(x_train,y_train)
print("Model with best parameters :\n",model.best_params_)

### model
best_est = DecisionTreeClassifier(**model.best_params_)
best_est = best_est.fit(x_train,y_train)
train_fpr, train_tpr, thresholds = roc_curve(y_train, best_est.predict_proba(x_train)[:,1])
test_fpr, test_tpr, thresholds = roc_curve(y_test, best_est.predict_proba(x_test)[:,1])

print('Area under train roc {}'.format(auc(train_fpr, train_tpr)))
print('Area under test roc {}'.format(auc(test_fpr, test_tpr)))

In [None]:
df.shape

**Forward feature selection**

Currently we have 20 features. Let us pick the top 17 features and see how the model performs.

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector


model = DecisionTreeClassifier(max_depth=5,min_samples_split=100)
sfs = SequentialFeatureSelector(model,n_features_to_select=17,scoring='roc_auc',direction='forward')
sfs.fit(x_train,y_train)


idxes = sfs.get_support(indices=True)
top_feats = x_train.columns[idxes]
print(f'Selected features are {top_feats}')


### model
best_est = DecisionTreeClassifier(max_depth=5,min_samples_split=100)
best_est = best_est.fit(x_train[top_feats],y_train)
train_fpr, train_tpr, thresholds = roc_curve(y_train, best_est.predict_proba(x_train[top_feats])[:,1])
test_fpr, test_tpr, thresholds = roc_curve(y_test, best_est.predict_proba(x_test[top_feats])[:,1])


print(f'Results after reducing features from 20 to 17')
print('Area under train roc {}'.format(auc(train_fpr, train_tpr)))
print('Area under test roc {}'.format(auc(test_fpr, test_tpr)))

We can see that selecting the top 17 features increased our model performance from 0.814637422821566 to 0.8163347025239608.

## [2.2] Backward selection

In this method, we start with all the features and remove them one at a time whenever removing a feature improves the score of the model. We repeat this until removing any further feature brings no improvement.

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

model = DecisionTreeClassifier(max_depth=5,min_samples_split=100)
sfs = SequentialFeatureSelector(model,n_features_to_select=17,scoring='roc_auc',direction='backward')
sfs.fit(x_train,y_train)


idxes = sfs.get_support(indices=True)
top_feats = x_train.columns[idxes]
print(f'Selected features are {top_feats}')


### model
best_est = DecisionTreeClassifier(max_depth=5,min_samples_split=100)
best_est = best_est.fit(x_train[top_feats],y_train)
train_fpr, train_tpr, thresholds = roc_curve(y_train, best_est.predict_proba(x_train[top_feats])[:,1])
test_fpr, test_tpr, thresholds = roc_curve(y_test, best_est.predict_proba(x_test[top_feats])[:,1])


print(f'Results with reduced features')
print('Area under train roc {}'.format(auc(train_fpr, train_tpr)))
print('Area under test roc {}'.format(auc(test_fpr, test_tpr)))

In this case too, removing features improved the performance of the model.

## [2.3] Recursive feature elimination


RFE is a recursive feature selection process. Initially a model is built with all the features. The least important feature is then removed and the model is refitted on the remaining features. To rank the features, RFE relies on the importance measure that the estimator itself exposes, such as the feature importances of a decision tree or XGBoost model, or the coefficients of a linear model. This process is repeated until the required number of features remains.
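A hand-written version of this loop might look like the sketch below. It assumes the estimator exposes `feature_importances_` after fitting (true for decision trees) and runs on synthetic data; the helper `manual_rfe` is a hypothetical name used for illustration, and its tie-breaking may differ from scikit-learn's `RFE`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def manual_rfe(estimator, X, y, n_features_to_select):
    """Repeatedly fit the model and drop the least important remaining feature."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        estimator.fit(X[:, remaining], y)
        # Position (within `remaining`) of the feature with the lowest importance
        weakest = int(np.argmin(estimator.feature_importances_))
        del remaining[weakest]
    return remaining

# Synthetic data: 8 features, 3 of them informative
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
kept = manual_rfe(tree, X_toy, y_toy, n_features_to_select=3)
print(f'kept feature indices: {kept}')
```

Unlike forward or backward selection, this loop needs only one model fit per removed feature, because the ranking comes from the fitted model rather than from re-scoring every candidate subset.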


Here we will use RFE with cross-validation (RFECV), which makes more sense because each candidate feature set is then evaluated with cross-validation.

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold




model = DecisionTreeClassifier(max_depth=5,min_samples_split=100)
rfecv = RFECV(estimator=model, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(x_train,y_train)


print('Optimal number of features: {}'.format(rfecv.n_features_))

In [None]:

# rfecv.support_ is a boolean mask over the columns of x_train
print(np.where(rfecv.support_)[0])
top_feats = x_train.columns[rfecv.support_]



best_est = DecisionTreeClassifier(max_depth=5,min_samples_split=100)
best_est = best_est.fit(x_train[top_feats],y_train)
train_fpr, train_tpr, thresholds = roc_curve(y_train, best_est.predict_proba(x_train[top_feats])[:,1])
test_fpr, test_tpr, thresholds = roc_curve(y_test, best_est.predict_proba(x_test[top_feats])[:,1])


print(f'Results with reduced features')
print('Area under train roc {}'.format(auc(train_fpr, train_tpr)))
print('Area under test roc {}'.format(auc(test_fpr, test_tpr)))

We can see that with just 3 features our score improved to 0.8283.

Thus we can see that feature selection helps improve our model performance.