Feature Selection: the process of selecting the most relevant features of a dataset. Its main benefits:

 - It lets the machine learning algorithm train faster.
 - It reduces the complexity of a model and makes it easier to interpret.
 - It can improve the accuracy of a model if the right subset is chosen.
 - It can reduce overfitting.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('D://DataScienceCollection//DataSets//IRIS//iris.csv')

In [3]:
data.head(5)

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


# WAY#1: Thresholding on Numerical Features

#### This approach removes columns with low variance (columns that are less likely to be useful for analysis)

In [4]:
from sklearn.feature_selection import VarianceThreshold

In [5]:
columns=data[['sepallength','sepalwidth','petallength','petalwidth']]
columns.head(5)

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [6]:
# Create thresholder
thresholder = VarianceThreshold(threshold=.5)

# Create high variance feature matrix
features_high_variance = thresholder.fit_transform(columns)

# View high variance feature matrix
features_high_variance[0:5]

array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2]])

This shows that **'sepalwidth'** is a column with low variance [below the defined threshold of 0.5]. Therefore, it has been removed from the analysis dataset. Variance thresholding rests on the idea that features with low variance are likely less interesting (and less useful) than features with high variance.
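Because `fit_transform` returns a bare NumPy array, the column names are lost. The sketch below shows one way to recover which columns were kept and what their variances were. It assumes scikit-learn's bundled copy of the iris data as a stand-in for the local CSV file, and the column names are set by hand to match the ones used above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

# Assumption: the bundled iris data stands in for the local iris.csv
iris = load_iris()
columns = pd.DataFrame(
    iris.data,
    columns=['sepallength', 'sepalwidth', 'petallength', 'petalwidth'],
)

thresholder = VarianceThreshold(threshold=0.5)
thresholder.fit(columns)

# Per-column variances computed during fit
variances = pd.Series(thresholder.variances_, index=columns.columns)

# Boolean mask of the columns whose variance is above the threshold
mask = thresholder.get_support()
kept = columns.columns[mask].tolist()
dropped = columns.columns[~mask].tolist()

print(variances.round(3))
print('kept:', kept)
print('dropped:', dropped)
```

Printing the variances next to the mask makes it easier to judge how close each column is to the chosen threshold.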

#### Points To Note:
[1] The threshold needs to be set manually. <br/>
[2] This approach will not work well when the features are measured in different units (e.g., one feature is in years while another feature is in dollars), because variance depends on the unit of measurement.
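The second point can be illustrated with a small sketch. The values below are made up for illustration: the same hypothetical measurement is stored once in metres and once in centimetres, and only the unit decides whether the column survives the threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical measurement, expressed in two different units
metres = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
centimetres = metres * 100

X = np.column_stack([metres, centimetres])

# Variance scales with the square of the unit conversion factor
print(X.var(axis=0))

thresholder = VarianceThreshold(threshold=0.5)
thresholder.fit(X)

# First entry: metres column, second entry: centimetres column
support = thresholder.get_support()
print(support)
```

Both columns carry identical information, yet only the centimetres column passes, which is why a single threshold is hard to apply across features with different units.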

# WAY#2: Thresholding on Missing Values

In [7]:
#half_count = len(df) / 2
#df = df.dropna(thresh=half_count,axis=1) # Drop any column with more than 50% missing values
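The commented-out lines above refer to a DataFrame `df` that is not defined in this notebook. A minimal runnable sketch of the same idea follows, using a small made-up DataFrame (the column names `a`, `b`, `c` are hypothetical). `thresh` is passed as an integer count of required non-missing values.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical data with different amounts of missing values per column
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],             # 0% missing
    'b': [1.0, np.nan, np.nan, np.nan],    # 75% missing
    'c': [1.0, 2.0, np.nan, np.nan],       # 50% missing
})

# Fraction of missing values in each column
missing_fraction = df.isna().mean()
print(missing_fraction)

# Keep columns that have at least half of their values present,
# i.e. drop any column with more than 50% missing values
min_non_missing = math.ceil(len(df) / 2)
reduced = df.dropna(thresh=min_non_missing, axis=1)

print(reduced.columns.tolist())
```

A column with exactly 50% missing values is kept under this rule; only columns above 50% are dropped.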

# WAY#3: Thresholding on Binary Feature Variance


In [8]:
# Create feature DataFrame with:
# col1: 80% class 0, 20% class 1
# col2: 80% class 1, 20% class 0
# col3: 60% class 0, 40% class 1
dummyData = {'col1': [0,0,0,0,1], 'col2': [1,1,1,1,0], 'col3':[0,1,0,1,0]}
dummyDataFrame = pd.DataFrame(data=dummyData)

# A binary (Bernoulli) feature has variance p * (1 - p).
# Keep only features whose variance exceeds that of a 75% / 25% split.
thresholder = VarianceThreshold(threshold=(.75 * (1 - .75)))
thresholder.fit_transform(dummyDataFrame)

array([[0],
       [1],
       [0],
       [1],
       [0]], dtype=int64)
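The result above can be checked by hand. For a binary feature where a fraction p of the values is 1, the variance is p * (1 - p). The sketch below recomputes this for the three dummy columns and compares it with the threshold; it is meant only as a verification aid, not as a replacement for `VarianceThreshold`.

```python
import pandas as pd

dummyData = {'col1': [0,0,0,0,1], 'col2': [1,1,1,1,0], 'col3': [0,1,0,1,0]}
dummyDataFrame = pd.DataFrame(data=dummyData)

# For 0/1 data the column mean is the fraction of ones, p
p = dummyDataFrame.mean()

# Bernoulli variance of each column
bernoulli_variance = p * (1 - p)

# Same threshold as above: the variance of a 75% / 25% split
threshold = 0.75 * (1 - 0.75)

# Columns whose variance is strictly above the threshold are kept
kept = bernoulli_variance[bernoulli_variance > threshold].index.tolist()

print(bernoulli_variance)
print('threshold:', threshold)
print('kept:', kept)
```

`col1` and `col2` are both 80% / 20% splits, so they share the same variance and fall below the threshold, while the 60% / 40% split in `col3` stays above it.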

# WAY#4: Feature Selection from Highly Correlated Features

#### If two features are highly correlated, then the information they contain is very similar, and it is likely redundant to include both features. Therefore, we remove one of them from the feature set.

In [9]:
# Create feature Dataframe with two highly correlated features
dummyData = {'col1': [1,2,3,4,5,6,7,8,9], 'col2': [1,2,3,4,5,6,7,7,7], 'col3':[1,0,1,0,1,0,1,0,1]}

# Convert feature matrix into DataFrame
dataframe = pd.DataFrame(dummyData)

# Create correlation matrix
corr_matrix = dataframe.corr().abs()

In [10]:
# Notice the 1s on the diagonal
corr_matrix

Unnamed: 0,col1,col2,col3
col1,1.0,0.976103,0.0
col2,0.976103,1.0,0.034503
col3,0.0,0.034503,1.0


In [11]:
# Select upper triangle of correlation matrix
# (np.bool was removed from NumPy, so the built-in bool is used)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape),
                          k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

In [12]:
to_drop

['col2']

In [13]:
# Drop features
dataframe.drop(to_drop, axis=1).head(3)

Unnamed: 0,col1,col3
0,1,1
1,2,0
2,3,1
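The steps above can be collected into one reusable function. The helper name `drop_highly_correlated` and its `threshold` parameter are hypothetical, introduced here only for illustration; the logic is the same upper-triangle approach shown in the cells above.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Hypothetical helper: drop the later column of each highly correlated pair."""
    corr_matrix = df.corr().abs()
    # Keep only the entries above the diagonal so each pair is seen once
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)
    # A column is dropped if it correlates strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

dataframe = pd.DataFrame({
    'col1': [1,2,3,4,5,6,7,8,9],
    'col2': [1,2,3,4,5,6,7,7,7],
    'col3': [1,0,1,0,1,0,1,0,1],
})

reduced, dropped = drop_highly_correlated(dataframe)
print(dropped)
print(reduced.columns.tolist())
```

Because only the upper triangle is inspected, the first column of a correlated pair is always the one that survives, so the result depends on column order.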


In [14]:
# Aside: recall idxmax() from Recommendation Systems, where it was used to find the row holding the maximum value of a column.
# Here it is used the other way round: with axis=1 it returns, for each row, the column label holding that row's maximum.

dummyData = {'col1': [1,2,3,4,5,6,7,8,9], 'col2': [1,2,3,4,5,6,7,77,7], 'col3':[1,0,1,0,1,0,1,0,1]}
dummyDataFrame = pd.DataFrame(data=dummyData)
dummyDataFrame.idxmax(axis=1)

0    col1
1    col1
2    col1
3    col1
4    col1
5    col1
6    col1
7    col2
8    col1
dtype: object

# WAY#5: Feature Selection on Irrelevant Features for Classification

#### CHI2 WAY

In [15]:
# Import the necessary libraries first
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [22]:
from sklearn import datasets
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
Y = iris.target


# Feature selection: score each feature with chi2
# (k=4 keeps all four iris features, so nothing is dropped here)
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

# Summarize scores
np.set_printoptions(precision=2)
print(fit.scores_)


features = fit.transform(X)
# Summarize selected features
print(features[0:5,:])

[ 10.82   3.59 116.17  67.24]
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
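With `k=4` every feature is kept, so the scores are the informative part of the output. The sketch below attaches feature names to the chi2 scores and selects only the two highest-scoring features. It assumes scikit-learn's bundled iris data, whose feature names differ slightly from the CSV column names used earlier. Note that `chi2` expects non-negative feature values, which holds for iris.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()

# Keep only the two features with the highest chi2 scores
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(iris.data, iris.target)

# Pair each score with its feature name and sort from best to worst
scores = pd.Series(selector.scores_, index=iris.feature_names)
print(scores.sort_values(ascending=False).round(2))

# Names of the features that survive the selection
selected = [name for name, keep
            in zip(iris.feature_names, selector.get_support()) if keep]
print(selected)
```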


#### ANOVA WAY

In [25]:
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
features = iris.data
target = iris.target

# Select two features with highest F-values
fvalue_selector = SelectKBest(f_classif, k=2)
features_kbest = fvalue_selector.fit_transform(features, target)
print(features_kbest[0:5,:])

[[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]]
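The two columns selected above are petal length and petal width. To see why, `f_classif` can be called directly: it returns one F-value and one p-value per feature. The sketch below is a minimal illustration on scikit-learn's bundled iris data, not a full statistical analysis.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()

# ANOVA F-test of each feature against the class labels
f_values, p_values = f_classif(iris.data, iris.target)

summary = pd.DataFrame(
    {'f_value': f_values, 'p_value': p_values},
    index=iris.feature_names,
).sort_values('f_value', ascending=False)

print(summary)

# The two features with the highest F-values
top_two = summary.index[:2].tolist()
print(top_two)
```

A higher F-value means the feature's mean differs more between classes relative to its spread within each class.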
