# In this notebook, we are going to see 5 different techniques for feature selection.
1. Dropping Constant Features - Variance Threshold - Unsupervised learning
2. Feature selection with correlation
3. Feature selection Using Information Gain For Classification
4. Feature selection Using Information Gain For Regression
5. Feature Selection Using Chi2 Statistical Analysis

# 1- Dropping Constant Features- Variance Threshold

# Variance Threshold
VarianceThreshold is a class in sklearn.feature_selection.
It is a feature selector that removes all low-variance features.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

In [None]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
df=pd.read_csv('../input/santander-customer-satisfaction/train.csv',nrows=10000)

In [None]:
df.shape

In [None]:
## first 10 rows
df.head(10)

In [None]:
### Split the dataset into independent (X) and dependent (y) features
X=df.drop(labels=['TARGET'], axis=1)
y=df['TARGET']

In [None]:
from sklearn.model_selection import train_test_split
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(labels=['TARGET'], axis=1),
    df['TARGET'],
    test_size=0.3,
    random_state=0)


### we have 370 features as our independent features
X_train.shape, X_test.shape

# Let's apply the variance threshold

With threshold=0 it removes all features that have zero variance, i.e. the same value in every row. It is applied only to the independent features.

In [None]:
var_thres=VarianceThreshold(threshold=0)
var_thres.fit(X_train)

In the output below, True means that the feature has a variance above the threshold and is kept, and False means that the feature is constant and will be dropped. The mask is computed from X alone, so it says nothing about how a feature relates to the target.

In [None]:
var_thres.get_support()
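
As a minimal sketch of what `get_support()` returns, the toy DataFrame below has one constant column. The column names `a`, `b` and `c` are hypothetical and not part of the Santander data. The mask is `True` for the columns that are kept and `False` for the constant one.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (assumption, for illustration only): column "b" never changes.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [5, 5, 5, 5],
    "c": [0, 1, 0, 1],
})

selector = VarianceThreshold(threshold=0)
selector.fit(toy)

mask = selector.get_support()           # boolean array, one entry per column
kept = toy.columns[mask].tolist()       # columns with non-zero variance
dropped = toy.columns[~mask].tolist()   # constant columns

print(mask)
print("kept:", kept)
print("dropped:", dropped)
```

Inverting the mask with `~` is a compact way to list the constant columns directly.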

# There are a total of 284 non-constant features out of 370 features.

In [None]:
### let's count the non-constant features
len(X_train.columns[var_thres.get_support()])

# The remaining 86 features are constant.

In [None]:
# columns where the support mask is False are the constant ones
constant_columns = X_train.columns[~var_thres.get_support()].tolist()

print(len(constant_columns))


# Printing our constant columns

In [None]:
for column in constant_columns:
    print(column)

# Here, we are dropping the constant columns.

In [None]:
X_train = X_train.drop(constant_columns, axis=1)
X_test = X_test.drop(constant_columns, axis=1)

After this step, X_train and X_test contain only the non-constant features.

# 2- Feature selection with correlation

# In this technique, we compare features pairwise, and if two features are highly correlated with each other we drop one of them.

In [None]:
#importing libraries
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Loading the dataset

In [None]:
#Loading the dataset
data = load_boston()
df = pd.DataFrame(data.data, columns = data.feature_names)
df["MEDV"] = data.target

In [None]:
data.feature_names

In [None]:
df.head()

# Separating our independent features (X) and dependent feature (y)

In [None]:
X = df.drop("MEDV",axis=1)   #Feature Matrix
y = df["MEDV"]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

# We perform all the operations on the X_train dataset and afterwards apply the same to X_test.

In [None]:
X_train.corr()

# You can see that the TAX and RAD features have a correlation of about 0.91 with each other, so with a threshold of 0.90 we would drop one of them.

In [None]:
import seaborn as sns
#Using Pearson Correlation
plt.figure(figsize=(12,10))
cor = X_train.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()

With the following function we can select highly correlated features.
For every pair of features whose absolute correlation exceeds the threshold, it flags the later column of the pair (in column order), so the earlier one is kept.


In [None]:
# select highly correlated features:
# for each pair above the threshold, flag the later column so the earlier one is kept

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr
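
To make the behaviour of this kind of function concrete, here is a small sketch on toy data. `correlated_columns` is a hypothetical vectorised variant written for this illustration, not the function above, and the columns `a`, `b`, `c` are made up. It looks only at the upper triangle of the absolute correlation matrix, so each pair is checked once and the later column of a correlated pair is flagged.

```python
import numpy as np
import pandas as pd

def correlated_columns(dataset, threshold):
    """Return the later column of every pair whose |correlation| > threshold."""
    corr = dataset.corr().abs()
    # Keep only entries strictly above the diagonal: row index < column index.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return {col for col in upper.columns if (upper[col] > threshold).any()}

# Toy data (assumption): "b" is almost exactly 2 * "a", "c" just alternates.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [1, -1, 1, -1, 1, -1],
})

flagged = correlated_columns(toy, threshold=0.9)
reduced = toy.drop(columns=list(flagged))

print(flagged)
print(reduced.columns.tolist())
```

Because `a` comes before `b` in the column order, `b` is the one that is flagged and `a` survives.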

# Here we call our function, passing the dataset and the threshold value

In [None]:
corr_features = correlation(X_train, 0.7)
len(set(corr_features))

# Highly correlated features

In [None]:
corr_features

# Drop highly correlated features

In [None]:
X_train = X_train.drop(corr_features,axis=1)
X_test = X_test.drop(corr_features,axis=1)

# We dropped the highly correlated features after comparing every column with every other column.

# 3- Feature selection Using Information Gain For Classification

Before going ahead, you need some knowledge of statistical tests such as the ANOVA test, t-test and chi-square test, and of p-values.

# Mutual Information
mutual_info_classif estimates mutual information for a discrete target variable.

Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances.

In short

A quantity called mutual information measures the amount of information one can obtain from one random variable given another.

The mutual information between two random variables X and Y can be stated formally as follows:

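I(X ; Y) = H(X) - H(X | Y)

where I(X ; Y) is the mutual information for X and Y, H(X) is the entropy for X and H(X | Y) is the conditional entropy for X given Y. The result is in bits when the logarithm is taken to base 2; scikit-learn's estimators use the natural logarithm, so their scores are in nats.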
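
The formula can be checked by hand on small joint probability tables. The sketch below is an illustration only: the helper names `entropy_bits` and `mutual_information_bits` and the two tables are made up for this example. It uses base-2 logarithms and the identity H(X | Y) = H(X, Y) - H(Y).

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information_bits(joint):
    """I(X;Y) = H(X) - H(X|Y) for a joint table with X on rows and Y on columns."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    h_x = entropy_bits(p_x)
    h_x_given_y = entropy_bits(joint.ravel()) - entropy_bits(p_y)
    return h_x - h_x_given_y

# Two fair binary variables that are independent of each other.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
# Two fair binary variables that always take the same value.
identical = [[0.5, 0.0],
             [0.0, 0.5]]

mi_independent = mutual_information_bits(independent)
mi_identical = mutual_information_bits(identical)

print(mi_independent)
print(mi_identical)
```

Knowing Y tells us nothing about X in the first table (0 bits) and everything about X in the second table (1 bit, the full entropy of a fair coin).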

In [None]:
import pandas as pd

In [None]:
df=pd.read_csv('https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv')
df.head()

# Let's check which unique values the target has

In [None]:
df['Wine'].unique()

# Check whether all the columns are numeric

In [None]:
df.info()

In [None]:
### Train test split to avoid overfitting
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df.drop(labels=['Wine'], axis=1),
    df['Wine'],
    test_size=0.3,
    random_state=0)

In [None]:
X_train.head()

Note - Remove all the null values from the train and test datasets before applying mutual_info_classif

# A higher value means the feature shares more information with the target

In [None]:
from sklearn.feature_selection import mutual_info_classif
# determine the mutual information
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info

# Converting the information of features into series

In [None]:
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)

In [None]:
#let's plot the ordered mutual_info values per feature
mutual_info.sort_values(ascending=False).plot.bar(figsize=(20, 8))


# Import the SelectKBest class to pick the top features

In [None]:
from sklearn.feature_selection import SelectKBest

# We will take only the top 5 features as our independent features.

In [None]:
# Now we will select the top 5 important features
sel_five_cols = SelectKBest(mutual_info_classif, k=5)
sel_five_cols.fit(X_train, y_train)
X_train.columns[sel_five_cols.get_support()]
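
Once the selector is fitted, the same selection can be applied to both splits with `transform`. The sketch below is self-contained and makes two assumptions: it uses scikit-learn's bundled `load_wine` data as a stand-in for the CSV above, and it fixes `random_state` through `functools.partial` so that the nearest-neighbour estimate gives the same scores on every run.

```python
from functools import partial

from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Assumption: the bundled wine data stands in for the CSV used in this notebook.
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fix random_state so the mutual information scores are repeatable.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=5).fit(X_train, y_train)

top_five = X_train.columns[selector.get_support()].tolist()

# Apply the selection learned on the training set to both splits.
X_train_top = selector.transform(X_train)
X_test_top = selector.transform(X_test)

print(top_five)
print(X_train_top.shape, X_test_top.shape)
```

Fitting on the training split only and then transforming the test split keeps the test data out of the selection step.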

# 4- Feature selection Using Information Gain For Regression

# Mutual Information
Estimate mutual information for a continuous target variable.

Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

In short

A quantity called mutual information measures the amount of information one can obtain from one random variable given another.

The mutual information between two random variables X and Y can be stated formally as follows:

I(X ; Y) = H(X) - H(X | Y)

where I(X ; Y) is the mutual information for X and Y, H(X) is the entropy for X and H(X | Y) is the conditional entropy for X given Y. The result has the units of bits.
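
One practical consequence of this definition is that mutual information can pick up dependencies that a correlation coefficient misses. The sketch below uses made-up synthetic data where `y` is the square of `x`: the relationship is fully deterministic, yet the Pearson correlation is close to zero. The exact numbers depend on the sample and on the estimator settings, so treat them as indicative only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (assumption, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = x ** 2  # y depends on x, but not linearly

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Estimated mutual information: {mi:.3f}")
```

A correlation filter would treat `x` as unrelated to `y` here, while the mutual information score is clearly above zero.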

Here we are trying to find the best features for predicting the sale price, which is a continuous target variable.

In [None]:
import pandas as pd
housing_df=pd.read_csv('../input/housepricesadvancedregressiontechniquestrain/train.csv')

In [None]:
housing_df.head()

In [None]:
housing_df.info()

# Checking for null values

In [None]:
housing_df.isnull().sum()

# Taking only numerical variable to apply mutual information.

In [None]:

numeric_lst=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_cols = list(housing_df.select_dtypes(include=numeric_lst).columns)

# We have to find the mutual information between each feature and the sale price.

# Numerical columns

In [None]:
numerical_cols

# Creating a dataframe with only the numerical_cols

In [None]:
housing_df=housing_df[numerical_cols]

In [None]:
housing_df.head()

In [None]:
housing_df=housing_df.drop("Id",axis=1)

In [None]:
### It is always good practice to split into train and test data first and to fit
### the feature selection on the training set only, to avoid overfitting
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(housing_df.drop(labels=['SalePrice'], axis=1),
    housing_df['SalePrice'],
    test_size=0.3,
    random_state=0)

In [None]:
X_train

In [None]:
X_train.isnull().sum()

# Filling null values with zero.
The higher the value for a feature, the more the target feature depends on it.

In [None]:
from sklearn.feature_selection import mutual_info_regression
# determine the mutual information
mutual_info = mutual_info_regression(X_train.fillna(0), y_train)
mutual_info

As before, mutual information is non-negative and equals zero only when the two variables are independent. The higher the value, the more the target depends on that particular feature. The scores are not capped at 1, so read them relative to each other.

# Converting all the values into series

In [None]:
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)

In [None]:
mutual_info.sort_values(ascending=False).plot.bar(figsize=(15,5))

In [None]:
from sklearn.feature_selection import SelectPercentile

We will choose only the top-percentile features. SelectPercentile keeps the given percentage of features with the highest scores.

In [None]:
## Selecting the top 20 percentile
selected_top_columns = SelectPercentile(mutual_info_regression, percentile=20)
selected_top_columns.fit(X_train.fillna(0), y_train)

In the code below, False means that the feature does not belong to the top 20 percent of features.

In [None]:
selected_top_columns.get_support()

# Getting the most important features

In [None]:
X_train.columns[selected_top_columns.get_support()]
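
To see how `percentile` translates into a number of columns, here is a self-contained sketch on synthetic data. The data from `make_regression` and the column names `f0` to `f9` are assumptions for illustration and have nothing to do with the housing data. With 10 features and `percentile=20`, two features are kept.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, mutual_info_regression

# Synthetic regression data (assumption) with hypothetical column names f0..f9.
X_arr, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                           noise=5.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Fix random_state so the mutual information scores are repeatable.
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectPercentile(score_func, percentile=20).fit(X, y)

mask = selector.get_support()
selected = X.columns[mask].tolist()
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

print(selected)
print(scores)
```

The fitted selector also exposes the raw scores through `scores_`, which is handy for checking how close the cut-off is to the next feature.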

# 5- Feature Selection Using Chi2 Statistical Analysis

# Chi-square Test For Feature Selection
Compute chi-squared stats between each non-negative feature and class.

This score should be used to evaluate categorical variables in a classification task.
This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification. The Chi Square statistic is commonly used for testing relationships between categorical variables.

It compares the observed distribution of the different classes of target Y among the different categories of the feature, against the expected distribution of the target classes, regardless of the feature categories.
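
The comparison of observed and expected counts can be written out for a small contingency table. The counts below are hypothetical, with feature categories on the rows and target classes on the columns. This is the classical contingency-table form of the test; scikit-learn's `chi2` builds its statistic from per-class sums of the feature values, so its numbers will generally differ from this calculation.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

# Hypothetical counts: rows are feature categories, columns are target classes.
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
total = observed.sum()

# Counts we would expect if the feature and the target were independent.
expected = row_totals @ col_totals / total

statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2_dist.sf(statistic, dof)

print(expected)
print(statistic, p_value)
```

The further the observed counts are from the expected ones, the larger the statistic and the smaller the p-value.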

In [None]:
import seaborn as sns
df=sns.load_dataset('titanic')
import numpy as np

In [None]:
df.head()

In [None]:
df.info()

So, I am considering only the categorical features and will try to find the most important ones.

Creating a data frame for the categorical features. We need to compare each of them with the output category (survived).

In [None]:
##['sex','embarked','alone','pclass','survived']
# .copy() so the later column assignments do not write into a slice of the original frame
df=df[['sex','embarked','alone','pclass','survived']].copy()
df.head()

In [None]:
### let's perform label encoding on sex (male = 1, female = 0)
df['sex']=np.where(df['sex']=="male",1,0)
df.head()

In [None]:
### let's perform label encoding on embarked
ordinal_label = {k: i for i, k in enumerate(df['embarked'].unique(), 0)}
df['embarked'] = df['embarked'].map(ordinal_label)

In [None]:
df.head()

# Performing label encoding on each and every column.

In [None]:
### let's encode alone as 0/1
df['alone']=df['alone'].astype(int)

In [None]:
df.head()

In [None]:
### train test split is usually done to avoid overfitting
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df[['sex','embarked','alone','pclass']],
                                              df['survived'],test_size=0.3,random_state=100)

In [None]:
X_train.head()

In [None]:
X_train['sex'].unique()

In [None]:
X_train.isnull().sum()

In [None]:
## Perform chi2 test
### chi2 returns 2 arrays:
### the chi-squared statistics and the p-values
from sklearn.feature_selection import chi2
f_p_values=chi2(X_train,y_train)

chi2 gives us two arrays:

Chi-squared statistic - the higher the statistic, the stronger the evidence that the feature and the target are dependent

p-value - the lower the p-value, the more important the feature is

The 1st array holds the chi-squared statistics

The 2nd array holds the p-values

In [None]:
f_p_values
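
As a small sketch of how to read the two arrays, the toy example below uses made-up features: `f1` is identical to the target and `f2` is unrelated to it. The names `toy_X`, `toy_y`, `f1` and `f2` are hypothetical. Putting both arrays into one DataFrame and sorting by p-value gives the ranking directly.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Toy data (assumption, for illustration only).
toy_X = pd.DataFrame({
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],   # identical to the target
    "f2": [0, 1, 0, 1, 0, 1, 0, 1],   # unrelated to the target
})
toy_y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# First array: chi-squared statistics, second array: p-values.
statistics, toy_p_values = chi2(toy_X, toy_y)

result = pd.DataFrame(
    {"chi2": statistics, "p_value": toy_p_values},
    index=toy_X.columns,
).sort_values("p_value")

print(result)
```

The feature with the largest statistic is also the one with the smallest p-value, so it ends up in the first row.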

# Make a series of these p_values

In [None]:
import pandas as pd
p_values=pd.Series(f_p_values[1])
p_values.index=X_train.columns
p_values

Sort the series in ascending order

In [None]:
p_values.sort_values(ascending=True)

# Observation
The sex column is the most important column (it has the lowest p-value) with respect to the output feature survived.