Removing features with low variance

The VarianceThreshold function (import see below) alows you to remove features that have a variance in their values below a defined threshold

Start with the data frame df defined below, and print the values of df before and after removing features with a variance below 0.2

To select the features you need to fit your model first and then use a mask on the columns of the data frame

Hint: Method get_support might be useful to apply the mask

In [1]:
#Removing features with low variance
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

x = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
df=pd.DataFrame(x,columns=['A','B','C'])
print("\nOriginal data frame:\n",df)

sel = VarianceThreshold(threshold=(0.2))
sel.fit(df)
reduced_df=df[df.columns[sel.get_support(indices=True)]]

print("Data frame without low variance features:\n",reduced_df)


Original data frame:
    A  B  C
0  0  0  1
1  0  1  0
2  1  0  0
3  0  1  1
4  0  1  0
5  0  1  1
Data frame without low variance features:
    B  C
0  0  1
1  1  0
2  0  0
3  1  1
4  1  0
5  1  1


Filter approaches for feature selection

Select features using Mutual Information = Information Gain within the SelectKBest function

Start by constructing a data frame from the load_digits dataset, columns should be index-named 'a1','a2',....'a63' + 'Label' 

Select the 10 best features and print the data frame header before and after selection

In [2]:
#Filter approaches
#Univariate feature selection
import pandas as pd
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

#Load data into array
x = load_digits()

#Construct data frame
column_names=['a'+str(i) for i in range(1, x.data.shape[1]+1)]
column_names.append('Label')
df=pd.DataFrame(np.column_stack([x.data,x.target]),columns=column_names)
print('Original data frame:',df.head)

#Select 10 features using Mutual Information = Information Gain
sel = SelectKBest(mutual_info_classif, k=10)
sel.fit(df.loc[:,'a1':'a63'], df['Label'])
remaining_columns= df.columns[sel.get_support(indices=True)] 
remaining_columns= remaining_columns.insert(len(remaining_columns),'Label')
reduced_df=df[remaining_columns]

print('Reduced data frame (first rows): \n',reduced_df)

Original data frame: <bound method NDFrame.head of        a1   a2    a3    a4    a5    a6   a7   a8   a9  a10  ...  a56  a57  \
0     0.0  0.0   5.0  13.0   9.0   1.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
1     0.0  0.0   0.0  12.0  13.0   5.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
2     0.0  0.0   0.0   4.0  15.0  12.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
3     0.0  0.0   7.0  15.0  13.0   1.0  0.0  0.0  0.0  8.0  ...  0.0  0.0   
4     0.0  0.0   0.0   1.0  11.0   0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
...   ...  ...   ...   ...   ...   ...  ...  ...  ...  ...  ...  ...  ...   
1792  0.0  0.0   4.0  10.0  13.0   6.0  0.0  0.0  0.0  1.0  ...  0.0  0.0   
1793  0.0  0.0   6.0  16.0  13.0  11.0  1.0  0.0  0.0  0.0  ...  0.0  0.0   
1794  0.0  0.0   1.0  11.0  15.0   1.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
1795  0.0  0.0   2.0  10.0   7.0   0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
1796  0.0  0.0  10.0  14.0   8.0   1.0  0.0  0.0  0.0  2.0  ...  0.0  0.0   

      a58  a59   a60   a

Recursive Feature Elimination (RFE)

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features.That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

Based on the given data frame, first create your RFE object. Use "SVC" as an estimator and restrict to 2 features. Fit the RFE object.

Now we fill select two features based on the fitted RFE object.

First, create a column mask based on the attribute rfe.support_. The column mask should be an extension of this attribute to mask the features AND the label (label should always be selected)

Second, use your column mask to mask the df_columns list and store the result in variable reduced_iris_features

Third, use reduced_iris_features as an index on the data frame to do the final feature selection and store the result in reduced_df

Finally, print reduced_df

In [3]:
#Wrapper approaches
#Backward elimination using Recursive feature eliminationÂ¶
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn import feature_selection
import pandas as pd
import numpy as np

iris = load_iris()
x = iris.data
y = iris.target
x_y= np.concatenate((x.reshape(150,4),y.reshape(150,1)),1)
#print(x_y)
#print(iris.target_names)

#Create column list
df_columns=iris.feature_names
df_columns.append("Label")
#Create PANDAS data frame
df = pd.DataFrame(x_y,columns=df_columns)
#Map label index to label name
df['Label']=df['Label'].map(lambda x: iris.target_names[int(x)])
print("\nOriginal Iris Data Set:")
print(df)

#Create the RFE object and rank features
num_features=2
svc = SVC(kernel="linear", C=1)
rfe = feature_selection.RFE(estimator=svc, n_features_to_select=num_features, step=1)
rfe.fit(x, y)
#print("Selected features will have a ranking=1 and support=TRUE")
#print(iris.feature_names," ",rfe.ranking_," ",rfe.support_)
#print(x.shape)

#extend column-mask by one column for the label (always true)
column_mask=np.append(rfe.support_,True)
#use list column_mask to mask df-columns list
reduced_iris_features = [df_columns[i] for i in range(len(df_columns)) if column_mask[i]]
print("Reduced_iris_features: ",reduced_iris_features)
# use reduced_iris_features to reduce the data frame
reduced_df=df[reduced_iris_features]

print("\nIris Data Set reduced to ",num_features," features: \n",reduced_df)


Original Iris Data Set:
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

  