## Quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing these types of features.

Identifying and removing quasi-constant features is an easy first step towards feature selection and more interpretable machine learning models.

Here, I will demonstrate how to identify quasi-constant features using a dataset that I created for this course. 

To identify quasi-constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. If we use the VarianceThreshold, all our variables need to be numerical. If we code it manually however, we can apply the code to both numerical and categorical variables.

I will show 2 snippets of code, 1 where I use the VarianceThreshold and 1 manually coded alternative.

In [28]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

In [128]:
# load dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv('C:/Users/RAJENDRA REDDY/Downloads/Genre0.csv')
data.shape

(204, 36)

**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps to avoid overfitting.
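As a minimal sketch of that practice, assuming a small synthetic dataset (the array, the seed and the threshold are made up for illustration), the selector is fitted on the training rows only, and the column mask it learned is then applied to the test rows:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# hypothetical toy data: 100 rows, 3 features, the last one constant
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),
    rng.normal(size=100),
    np.ones(100),
])
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)

selector = VarianceThreshold(threshold=0.01)
selector.fit(X_tr)  # the columns to keep are learned from the training set only

X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)  # the test set is transformed, never fitted

print(X_tr_sel.shape, X_te_sel.shape)
```

The test set never influences which columns are kept, so it remains a fair estimate of how the selection behaves on unseen data.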

In [129]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['class','song'], axis=1), # drop the target
    data['class'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((142, 34), (62, 34))

## Remove constant features

First, I will remove constant features like I did in the previous lecture. This will allow a better visualisation of the quasi-constant ones.

In [130]:
# using the code from the previous lecture
# I remove 4 constant features

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((142, 30), (62, 30))

## Remove quasi-constant features

### Using the VarianceThreshold from sklearn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet a certain threshold. By default, it removes all zero-variance features, as we did in the previous notebook.

Here, we will change the default threshold to remove quasi-constant features or, more precisely, features with low variance:

Check the Scikit-learn docs for more details:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
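
To get a feel for what a threshold of 0.01 means, consider the special case of a binary 0/1 feature, whose variance is p(1 - p) when a fraction p of the rows equals 1. The sketch below is only an illustration under that assumption (the helper `binary_variance` is a name made up here). For continuous features the variance also depends on the scale of the variable, so the same threshold behaves differently.

```python
import numpy as np


def binary_variance(p):
    # variance of a 0/1 feature in which a fraction p of the rows equals 1
    return p * (1 - p)


# variance for a few proportions of the minority value
for p in (0.001, 0.01, 0.05, 0.5):
    print(p, round(binary_variance(p), 4))

# cross-check against NumPy on a concrete column: 99 zeros and a single 1
col = np.array([0] * 99 + [1])
print(np.isclose(col.var(), binary_variance(0.01)))
```

Under this assumption, a binary feature in which about 99% of the rows share one value falls just below a threshold of 0.01, whereas one with a 95% / 5% split stays above it.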

In [131]:
sel = VarianceThreshold(threshold=0.01)  

sel.fit(X_train)  # fit finds the features with low variance

VarianceThreshold(threshold=0.01)

In [132]:
# get_support is a boolean vector that indicates which features 
# are retained, that is, which features have a higher variance than
# the threshold we indicated.

# If we sum over get_support, we get the number
# of features that are not quasi-constant

print(sum(sel.get_support()))
not_quasi_constant = X_train.columns[sel.get_support()]
not_quasi_constant

18


Index(['melspectogram_max', 'mfcc_min', 'mfcc_max', 'rms_max',
       'spectral_centroid_min', 'spectral_centroid_max',
       'spectral_bandwidth_min', 'spectral_bandwidth_max',
       'spectral_contrast_min', 'spectral_contrast_max',
       'spectral_rolloff_min', 'spectral_rolloff_max', 'poly_features_min',
       'poly_features_max', 'zero_crossing_rate_max', 'delta_mfcc_min',
       'delta_mfcc_max', 'mel_to_stft_max'],
      dtype='object')

In [121]:
# let's print the number of quasi-constant features

quasi_constant = X_train.columns[~sel.get_support()]

len(quasi_constant)

11

We can see that 11 columns / variables are almost constant. This means that 11 variables show predominantly one value for the majority of observations of the training set. Let's explore a few of these variables below.

In [122]:
# let's print the variable names
quasi_constant

Index(['chroma_stft_min', 'chroma_cqt_min', 'chroma_cens_min',
       'chroma_cens_max', 'melspectogram_min', 'rms_min', 'rms_max',
       'spectral_flatness_min', 'spectral_flatness_max',
       'zero_crossing_rate_min', 'tempogram_min'],
      dtype='object')

In [9]:
# percentage of observations showing each of the different values
# of the variable

# 'var_1' is not a column of this dataset, hence the KeyError below;
# use one of the names in quasi_constant instead

X_train['var_1'].value_counts() / len(X_train)

KeyError: 'var_1'

If more than 99% of the observations show one value, for example 0, the feature is fairly constant.

In [10]:
# let's explore another one

X_train['var_2'].value_counts() / len(X_train)

0    0.999971
1    0.000029
Name: var_2, dtype: float64

Go ahead and explore the rest of the quasi-constant variables.

We can then remove the quasi-constant features utilizing the transform() method from the VarianceThreshold. Remember that this returns a NumPy array without feature names, so if we want a dataframe we need to reconstitute it.
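
A minimal sketch of that reconstitution on a made-up frame (`toy`, `toy_sel` and the column names are hypothetical): the boolean mask from `get_support()` picks the surviving column names, and passing the original index keeps the rows aligned with the target.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# hypothetical toy frame; 'b' takes a single value and is dropped
toy = pd.DataFrame({
    'a': [0.0, 1.0, 2.0, 3.0],
    'b': [5.0, 5.0, 5.0, 5.0],
    'c': [1.0, 0.0, 1.0, 0.0],
})

toy_sel = VarianceThreshold(threshold=0.01).fit(toy)

# names of the columns that are retained
kept = toy.columns[toy_sel.get_support()]

# transform returns a NumPy array, so we rebuild the dataframe
toy_reduced = pd.DataFrame(
    toy_sel.transform(toy), columns=kept, index=toy.index)

print(toy_reduced.columns.tolist())
```

Keeping the index matters after `train_test_split`, because the rows of the training set no longer run from 0 to n - 1.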

In [123]:
# capture feature names

feat_names = X_train.columns[sel.get_support()]
feat_names

Index(['melspectogram_max', 'mfcc_min', 'mfcc_max', 'spectral_centroid_min',
       'spectral_centroid_max', 'spectral_bandwidth_min',
       'spectral_bandwidth_max', 'spectral_contrast_min',
       'spectral_contrast_max', 'spectral_rolloff_min', 'spectral_rolloff_max',
       'poly_features_min', 'poly_features_max', 'tonnetz_min', 'tonnetz_max',
       'zero_crossing_rate_max', 'delta_mfcc_min', 'delta_mfcc_max',
       'mel_to_stft_max'],
      dtype='object')

In [133]:
#remove the quasi-constant features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((142, 18), (62, 18))

In [134]:
from numpy import vstack
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

trainy, testy = y_train, y_test
trainX = X_train
testX = X_test


def print_scores(name, y_true, yhat):
    # the detectors mark inliers as 1 and outliers as -1, so these scores
    # assume that the target uses the same encoding
    print(name)
    print('Accuracy Score: %.3f' % accuracy_score(y_true, yhat))
    print('F1 Score: %.3f' % f1_score(y_true, yhat, pos_label=1))
    print('Precision Score: %.3f' % precision_score(y_true, yhat, average='micro'))
    print('Recall Score: %.3f' % recall_score(y_true, yhat, average='micro'))


# make a prediction with a model that is refitted on all rows, such as LOF
def lof_predict(model, trainX, testX):
    # create one large dataset
    composite = vstack((trainX, testX))
    # make prediction on composite dataset
    yhat = model.fit_predict(composite)
    # return just the predictions for the rows passed as testX
    return yhat[len(trainX):]


# one-class SVM, fitted and evaluated on the training set
model = OneClassSVM(gamma='scale', nu=0.02)
model.fit(trainX)
print_scores('One class SVM', y_train, model.predict(trainX))

# elliptic envelope, fitted and evaluated on the training set
model = EllipticEnvelope(contamination=0.1)
model.fit(trainX)
print_scores('EllipticEnvelope', y_train, model.predict(trainX))

# isolation forest; lof_predict refits the model on train and test together
# and, with trainX passed last, returns the predictions for the training rows
model = IsolationForest(contamination=0.1)
yhat = lof_predict(model, testX, trainX)
print_scores('Isolation forest', y_train, yhat)

# local outlier factor, same procedure
model = LocalOutlierFactor(contamination=0.1)
yhat = lof_predict(model, testX, trainX)
print_scores('LocalOutlierFactor', y_train, yhat)



One class SVM
Accuracy Score: 0.979
F1 Score: 0.989
Precision Score: 0.979
Recall Score: 0.979
EllipticEnvelope
Accuracy Score: 0.894
F1 Score: 0.944
Precision Score: 0.894
Recall Score: 0.894
Isolation forest
Accuracy Score: 0.915
F1 Score: 0.956
Precision Score: 0.915
Recall Score: 0.915
LocalOutlierFactor
Accuracy Score: 0.915
F1 Score: 0.956
Precision Score: 0.915
Recall Score: 0.915


In [27]:
from sklearn.metrics import f1_score
from sklearn.covariance import EllipticEnvelope

# define outlier detection model
model = EllipticEnvelope(contamination=0.01)

# fit on the training set
model.fit(trainX)

# detect outliers in the training set
# (the model marks inliers as 1 and outliers as -1)
yhat = model.predict(trainX)

# calculate score
score = f1_score(y_train, yhat, pos_label=1)
print('F1 Score: %.3f' % score)

F1 Score: 0.993


In [28]:
# isolation forest for imbalanced classification
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest

# define outlier detection model
model = IsolationForest(contamination=0.01)

# fit on the training set
model.fit(trainX)

# detect outliers in the training set
# (the model marks inliers as 1 and outliers as -1)
yhat = model.predict(trainX)

# calculate score
score = f1_score(trainy, yhat, pos_label=1)
print('F1 Score: %.3f' % score)

F1 Score: 0.993


In [79]:
from numpy import vstack
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.neighbors import LocalOutlierFactor

# make a prediction with a lof model
def lof_predict(model, trainX, testX):
    # create one large dataset
    composite = vstack((trainX, testX))
    # make prediction on composite dataset
    yhat = model.fit_predict(composite)
    # return just the predictions for the rows passed as testX
    return yhat[len(trainX):]

# define outlier detection model
model = LocalOutlierFactor(contamination=0.01)

# note the argument order: trainX is passed last, so the predictions
# returned are those for the training set (inliers 1, outliers -1)
yhat = lof_predict(model, testX, trainX)

# calculate scores
score = f1_score(y_train, yhat, pos_label=1)
print('LocalOutlierFactor')
print('Accuracy Score: %.3f' % accuracy_score(y_train, yhat))
print('F1 Score: %.3f' % score)
print('Precision Score: %.3f' % precision_score(y_train, yhat, average='micro'))
print('Recall Score: %.3f' % recall_score(y_train, yhat, average='micro'))

LocalOutlierFactor
Accuracy Score: 0.986
F1 Score: 0.993
Precision Score: 0.986
Recall Score: 0.986


By removing constant and almost constant features, we reduced the feature space from 34 to 18. This means that 16 features were removed from this dataset. Almost half!!

In [25]:
# transform the array into a dataframe

X_train = pd.DataFrame(X_train, columns=feat_names)
X_test = pd.DataFrame(X_test, columns=feat_names)

X_test.head()

| | melspectogram_max | mfcc_min | mfcc_max | rms_max | spectral_centroid_min | spectral_centroid_max | spectral_bandwidth_min | spectral_bandwidth_max | spectral_contrast_min | spectral_contrast_max | spectral_rolloff_min | spectral_rolloff_max | poly_features_min | poly_features_max | tonnetz_min | tonnetz_max | zero_crossing_rate_max | delta_mfcc_min | delta_mfcc_max | mel_to_stft_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 552.49475 | -289.0811 | 169.05515 | 0.193857 | 696.721192 | 6754.124528 | 1473.225231 | 3952.254479 | 4.725219 | 48.074677 | 796.728516 | 9905.273438 | 0.253625 | 2.660251 | -0.596752 | 0.647851 | 0.536133 | -20.846766 | 26.345085 | 9.931176 |
| 1 | 685.15735 | -397.39444 | 169.58191 | 0.178042 | 995.816453 | 4185.68104 | 1321.707466 | 2943.124273 | 3.065951 | 53.888713 | 2153.320313 | 6976.757813 | 0.042618 | 1.694155 | -0.393235 | 0.49696 | 0.313965 | -22.865694 | 30.942684 | 9.972648 |
| 2 | 341.73334 | -493.7221 | 197.31262 | 0.141993 | 506.885685 | 1884.365113 | 811.29237 | 2018.019387 | 5.09944 | 47.304562 | 839.794922 | 3542.211914 | 0.015032 | 0.543955 | -0.570008 | 0.465677 | 0.166016 | -12.827309 | 22.043661 | 8.632686 |
| 3 | 201.66867 | -374.01636 | 112.71471 | 0.11237 | 1244.511529 | 3793.308356 | 1498.854478 | 3007.64228 | 5.281376 | 44.970377 | 3025.415039 | 7030.59082 | 0.068378 | 0.92968 | -0.454395 | 0.529148 | 0.256836 | -24.823927 | 28.989141 | 8.438001 |
| 4 | 1572.452 | -248.96866 | 160.06061 | 0.311368 | 1160.999467 | 5808.814918 | 1090.000846 | 3483.642221 | 1.667117 | 50.155734 | 1765.722656 | 8817.84668 | 0.226985 | 2.929869 | -0.51744 | 0.53414 | 0.501953 | -21.953081 | 26.134052 | 12.696788 |


### Coding it ourselves

First, I will separate the dataset into train and test and remove the constant features again. Then, I will provide an alternative method to find quasi-constant features.

This method, as opposed to the VarianceThreshold, can be used for both **numerical and categorical** variables.
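
To illustrate why this works for categorical variables too, here is a small sketch on a made-up frame (the column names, the helper `predominant_share` and the 0.98 cut-off are assumptions for this example). It only counts how often each value appears, so no variance is needed:

```python
import pandas as pd

# hypothetical frame mixing a categorical and a numerical column
toy = pd.DataFrame({
    'colour': ['red'] * 99 + ['blue'],
    'size': list(range(100)),
})


def predominant_share(series):
    # fraction of rows taken by the most frequent value
    # (value_counts sorts from most to least frequent)
    return series.value_counts(normalize=True).iloc[0]


shares = {col: predominant_share(toy[col]) for col in toy.columns}

# flag the columns in which one value covers more than 98% of the rows
flagged = [col for col, share in shares.items() if share > 0.98]

print(shares)
print(flagged)
```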

In [11]:
# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['class','song'], axis=1),
    data['class'],
    test_size=0.3,
    random_state=0)

# remove constant features
# using the code from the previous lecture

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((142, 30), (62, 30))

In [12]:
# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

    # find the predominant value, that is the value that is shared
    # by most observations
    predominant = (X_train[feature].value_counts() / len(
        X_train)).sort_values(ascending=False).values[0]

    # evaluate the predominant value: do more than 99.8% of the observations
    # show 1 value?
    if predominant > 0.998:
        
        # if yes, add the variable to the list
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

0

With the 0.998 cut-off used above, our method flagged fewer features than the VarianceThreshold from sklearn with the threshold that we selected earlier. It found no features in which a single value covers more than 99.8% of the observations of the training set.

Let's see what some of the quasi-constant features look like.

In [13]:
# print the feature names

quasi_constant_feat

[]

In [17]:
# select one feature from the list

quasi_constant_feat[2]

'var_3'

In [18]:
X_train['var_3'].value_counts() / len(X_train)

0.0000         0.999629
35685.9459     0.000029
3583.3941      0.000029
15028.0560     0.000029
52105.7901     0.000029
10281.6000     0.000029
86718.0000     0.000029
207901.3365    0.000029
25905.4866     0.000029
5209.9500      0.000029
2641.0164      0.000029
12542.3100     0.000029
861.0900       0.000029
27.3000        0.000029
Name: var_3, dtype: float64

The feature shows 0 for more than 99.9% of the observations. But it also shows a few different values for a very tiny proportion of the observations. Those few values increase the feature's variance, which is why this feature is not captured by the VarianceThreshold in our previous cell. Yet, we can see that it is quasi-constant.

Keep in mind that the thresholds are arbitrary and decided by the user.
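
The difference between the two criteria can be reproduced on a made-up column (the values and the name `mostly_zero` are hypothetical): a handful of large values is enough to push the variance far above 0.01, even though a single value still covers almost every row.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# hypothetical column: 998 zeros and two large values
values = np.zeros(1000)
values[:2] = [35000.0, 86000.0]
toy = pd.DataFrame({'mostly_zero': values})

# population variance, and share of the most frequent value
variance = toy['mostly_zero'].var(ddof=0)
share = toy['mostly_zero'].value_counts(normalize=True).iloc[0]

# does the VarianceThreshold retain the column?
kept_by_variance = VarianceThreshold(threshold=0.01).fit(toy).get_support()[0]

print(variance > 0.01, kept_by_variance, share)
```

In this sketch the variance criterion retains the column, while a check on the share of the predominant value would flag it at any cut-off below 0.998.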

In [14]:
# finally, let's drop the quasi-constant features:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((142, 30), (62, 30))

Because no feature exceeded the 0.998 cut-off, nothing was dropped in this step: the training set still has the 30 variables that remained after removing the constant features.

That is all for this lecture, I hope you enjoyed it and see you in the next one!