# Feature selection
- Feature Selection is the process of selecting a subset of relevant features for use in model construction.

- Statistical tests can be used to select those features that have the strongest relationship with the output variable.

- The ***scikit-learn*** library provides the ***SelectKBest*** class that can be used with different statistical tests to select a specific number of features.

# 1. ANOVA F-value
- This is univariate feature selection: each feature is scored on its own with a univariate statistical test, and the best-scoring features are kept.
- The ***f_regression*** scoring function uses a simple linear model to test the individual effect of each regressor, one at a time.

The score is computed in 2 steps:

- The Pearson correlation between each regressor and the target is computed.
- The correlation is converted to an F-score and then to a p-value.
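
The two steps can be reproduced by hand as a sanity check. The sketch below uses synthetic data (the array shapes, coefficients, and variable names are illustrative assumptions, not the energy-efficiency dataset) and compares the manual result with `f_regression`, assuming its default `center=True` behaviour.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data for illustration only (not the energy-efficiency dataset)
rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n_samples)

# Step 1: Pearson correlation between each regressor and the target
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])

# Step 2: convert the correlation to an F-score, then to a p-value
dof = n_samples - 2                       # degrees of freedom of the residual
f_manual = r**2 / (1.0 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)   # upper tail of the F(1, dof) distribution

# Compare with scikit-learn's implementation
f_sklearn, p_sklearn = f_regression(X_demo, y_demo)
print(np.allclose(f_manual, f_sklearn), np.allclose(p_manual, p_sklearn))
```

Because the F-score grows monotonically with the squared correlation, ranking features by F-score is the same as ranking them by the absolute value of their correlation with the target.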

In [2]:
import numpy as np
import pandas as pd

# setting the seed
np.random.seed(0)

# importing the f_regression scoring function and the SelectKBest selector
from sklearn.feature_selection import f_regression, SelectKBest

# load the dataset
df = pd.read_excel('../dataset/energy-efficiency-dataset.xlsx')

X = df.iloc[:, 0:8]  # the first 8 columns are the features
y1 = df.iloc[:, 8]   # heating load
y2 = df.iloc[:, 9]   # cooling load

# Number of features before applying f_regression
print("Features: {}".format(X.columns))
print('# features before f_regression: %d\n' % X.shape[1])

Features: Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8'], dtype='object')
# features before f_regression: 8



***SelectKBest*** keeps the k highest-scoring features; the value of k is chosen manually.
- Here k = 6, so ***SelectKBest*** keeps the 6 features with the highest F-scores.
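
As a minimal sketch of how ***SelectKBest*** picks features, the example below runs it on synthetic data. The column names `a` to `d`, the coefficients, and `k=2` are hypothetical choices made for illustration; only `a` and `b` actually drive the target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data for illustration only; the column names are hypothetical
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 3.0 * X_demo["a"] - 2.0 * X_demo["b"] + 0.1 * rng.normal(size=200)

# Score every column with f_regression and keep the 2 best
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
mask = selector.get_support()
selected = list(X_demo.columns[mask])

# Keep the original column names on the reduced data
X_reduced = pd.DataFrame(selector.transform(X_demo), columns=selected)
print(selected, X_reduced.shape)
```

Indexing the column index with the boolean mask keeps the real column names, which avoids rebuilding names from positions.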

In [3]:
# 1. Feature selection for heating load (Y1)


# SelectKBest selects the k best features according to their scores
test = SelectKBest(score_func = f_regression, k = 6).fit(X, y1)
X_new = test.transform(X)
X_new = pd.DataFrame(X_new)

# F_scores for each feature
for i in range(len(X.columns)):
    print('X%d F_value: %.2f'%(i+1, test.scores_[i]))

# Number of features after applying f_regression
print('\n# features After f_regression: %d  | As we have set k=6\n' % X_new.shape[1])

# Selected features: get_support() is a boolean mask over the columns of X
features = list(X.columns[test.get_support()])
print("Selected Features are: " , features)

X1 F_value: 484.05
X2 F_value: 585.26
X3 F_value: 200.73
X4 F_value: 2211.62
X5 F_value: 2900.59
X6 F_value: 0.01
X7 F_value: 60.16
X8 F_value: 5.89

# features After f_regression: 6  | As we have set k=6

Selected Features are:  ['X1', 'X2', 'X3', 'X4', 'X5', 'X7']


In [4]:
# 2. Feature selection for cooling load (Y2)


# SelectKBest selects the k best features according to their scores
test = SelectKBest(score_func = f_regression, k = 6).fit(X, y2)
X_new = test.transform(X)
X_new = pd.DataFrame(X_new)

# F_scores for each feature
for i in range(len(X.columns)):
    print('X%d F_value: %.2f'%(i+1, test.scores_[i]))

# Number of features after applying f_regression
print('\n# features After f_regression: %d  | As we have set k=6\n' % X_new.shape[1])

# Selected features: get_support() is a boolean mask over the columns of X
features = list(X.columns[test.get_support()])
print("Selected Features are: " , features)

X1 F_value: 515.76
X2 F_value: 634.18
X3 F_value: 170.92
X4 F_value: 2226.03
X5 F_value: 3111.13
X6 F_value: 0.16
X7 F_value: 34.47
X8 F_value: 1.96

# features After f_regression: 6  | As we have set k=6

Selected Features are:  ['X1', 'X2', 'X3', 'X4', 'X5', 'X7']


# 2. Mutual information (MI)
- Mutual information (MI) between two random variables is a non-negative value that measures the dependency between them. It is zero if and only if the two variables are independent, and higher values mean stronger dependency.
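
Unlike a correlation-based score, MI can also pick up non-linear dependency. The sketch below illustrates this on synthetic data; the variable names, sample size, and the quadratic relationship are assumptions chosen for the example. One feature determines the target through a square, the other is independent of it.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n_samples = 500
x_quad = rng.uniform(-1.0, 1.0, size=n_samples)   # y depends on this non-linearly
x_noise = rng.uniform(-1.0, 1.0, size=n_samples)  # independent of y
X_demo = np.column_stack([x_quad, x_noise])
y_demo = x_quad**2 + 0.01 * rng.normal(size=n_samples)

# Linear F-test: x_quad is (nearly) uncorrelated with y, so its score tends to be low
f_scores, _ = f_regression(X_demo, y_demo)

# MI estimate: random_state fixes the small noise the estimator adds internally
mi_scores = mutual_info_regression(X_demo, y_demo, random_state=0)

print("F-scores:", f_scores)
print("MI estimates:", mi_scores)
```

The MI estimate for `x_quad` should come out well above the one for `x_noise`, while the F-score does not separate the two reliably. To fix the seed inside ***SelectKBest***, one option is to pass `functools.partial(mutual_info_regression, random_state=0)` as `score_func`.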

In [5]:
# importing the mutual_info_regression scoring function
from sklearn.feature_selection import mutual_info_regression, SelectKBest

# Number of features before applying mutual_info_regression
print("Features: {}".format(X.columns))
print('\n# features before mutual_info_regression: %d\n' % X.shape[1])

Features: Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8'], dtype='object')

# features before mutual_info_regression: 8



In [6]:
# 1. Feature selection for heating load (Y1)


# SelectKBest selects the k best features according to their scores
test = SelectKBest(score_func = mutual_info_regression, k = 6).fit(X, y1)
X_new = test.transform(X)
X_new = pd.DataFrame(X_new)

# Estimated mutual information for each feature
for i in range(len(X.columns)):
    print('X%d estimated mutual info: %.2f'%(i+1, test.scores_[i]))

# Number of features after applying mutual_info_regression
print('\n# features After mutual_info_regression: %d  | As we have set k=6\n' % X_new.shape[1])

# Selected features: get_support() is a boolean mask over the columns of X
features = list(X.columns[test.get_support()])
print("Selected Features are: " , features)

X1 estimated mutual info: 1.73
X2 estimated mutual info: 1.73
X3 estimated mutual info: 1.12
X4 estimated mutual info: 0.93
X5 estimated mutual info: 0.66
X6 estimated mutual info: 0.00
X7 estimated mutual info: 0.71
X8 estimated mutual info: 0.22

# features After mutual_info_regression: 6  | As we have set k=6

Selected Features are:  ['X1', 'X2', 'X3', 'X4', 'X5', 'X7']


In [7]:
# 2. Feature selection for cooling load (Y2)


# SelectKBest selects the k best features according to their scores
test = SelectKBest(score_func = mutual_info_regression, k = 6).fit(X, y2)
X_new = test.transform(X)
X_new = pd.DataFrame(X_new)

# Estimated mutual information for each feature
for i in range(len(X.columns)):
    print('X%d estimated mutual info: %.2f'%(i+1, test.scores_[i]))

# Number of features after applying mutual_info_regression
print('\n# features After mutual_info_regression: %d  | As we have set k=6\n' % X_new.shape[1])

# Selected features: get_support() is a boolean mask over the columns of X
features = list(X.columns[test.get_support()])
print("Selected Features are: " , features)

X1 estimated mutual info: 1.42
X2 estimated mutual info: 1.42
X3 estimated mutual info: 0.87
X4 estimated mutual info: 0.88
X5 estimated mutual info: 0.68
X6 estimated mutual info: 0.00
X7 estimated mutual info: 0.73
X8 estimated mutual info: 0.15

# features After mutual_info_regression: 6  | As we have set k=6

Selected Features are:  ['X1', 'X2', 'X3', 'X4', 'X5', 'X7']
