## Quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations in the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing this type of feature.

Identifying and removing quasi-constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

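Here I will demonstrate how to identify quasi-constant features. The cells below load a small file, `Advertising.csv`, with `sales` as the target; the final example shows a feature from the Santander Customer Satisfaction dataset from Kaggle.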

To identify quasi-constant features, we can use the VarianceThreshold class from sklearn, or we can code it ourselves. I will show both procedures.

In [24]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

In [54]:
# load the dataset used in this demonstration (Advertising.csv)
# it is small, so I load all the rows
data = pd.read_csv('Advertising.csv')
data.shape

(200, 7)

In [55]:
# check the presence of null data.
# The snippets below will be able to compare nan values between 2 columns,
# so in principle missing data are not a problem.
# in any case, we see that there are no missing data in this dataset

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

### Important

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps to avoid overfitting.

In [56]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['sales'], axis=1),
    data['sales'],
    test_size=0.2,
    random_state=0)

X_train.shape, X_test.shape

((160, 6), (40, 6))

In [57]:
constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

In [58]:
constant_features

['Constant']

### Remove constant features

First, I will remove constant features like I did in the previous lecture. This will allow a better visualisation of the quasi-constant ones.

In [59]:
# using the code from the previous lecture
# I remove the constant feature found above ('Constant')

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((160, 5), (40, 5))
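The `std() == 0` test only applies to numeric columns. For data that also contains categorical or text columns, counting distinct values is one alternative. The sketch below uses made-up data (the column names and values are assumptions) and is not part of the lecture's code.

```python
import pandas as pd

# made-up data with a constant text column and a constant numeric column
toy = pd.DataFrame({
    'country': ['ES'] * 5,
    'const_num': [3] * 5,
    'amount': [10, 20, 30, 40, 50],
})

# nunique() counts the distinct non-missing values in a column,
# so it works for both numeric and non-numeric data
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]

print(constant_cols)
```

Note that `nunique()` ignores missing values by default, so a column holding one value plus some NaN would also be flagged.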

## Removing quasi-constant features
### Using variance threshold from sklearn

Variance threshold from sklearn is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.

Here, I will change the default threshold to remove almost / quasi-constant features.

In [65]:
# for a feature with two values one unit apart, a variance of 0.0126
# corresponds to roughly 98.7% of observations sharing one value
sel = VarianceThreshold(threshold=0.0126)

sel.fit(X_train)  # fit finds the features with low variance

VarianceThreshold(threshold=0.0126)
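To choose a threshold, it helps to relate variance to the share of the predominant value. For a feature that takes only two values, with a fraction p of the observations showing the predominant one, the population variance is p(1 - p) times the squared distance between the two values. The sketch below evaluates this for a few values of p, assuming the two values are one unit apart (as with a 0/1 flag). `binary_variance` is a hypothetical helper, not part of sklearn.

```python
def binary_variance(p, spacing=1.0):
    """Population variance of a two-valued feature.

    p is the fraction of observations showing the predominant value;
    spacing is the distance between the two values (assumed 1 here).
    """
    return p * (1 - p) * spacing ** 2


# variance for increasingly dominant values
for p in (0.95, 0.99, 0.999):
    print(f"p = {p}: variance = {binary_variance(p):.6f}")
```

Under these assumptions, a threshold of about 0.01 targets two-valued features where roughly 99% of the observations share one value. For features with more than two values, or values further apart, the relation no longer holds and the threshold needs to be chosen with the feature's scale in mind.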

In [66]:
# get_support returns a boolean vector that indicates which features
# are retained. If we sum over get_support, we get the number
# of features that are not quasi-constant
sum(sel.get_support())

4

In [67]:
# another way of doing the above operation:
len(X_train.columns[sel.get_support()])

4

In [68]:
# finally we can print the quasi-constant features
print(
    len([
        x for x in X_train.columns
        if x not in X_train.columns[sel.get_support()]
    ]))

[x for x in X_train.columns if x not in X_train.columns[sel.get_support()]]

1


['Quase']

We can see that 1 column / variable is almost constant. This means that it shows predominantly one value for ~99% of the observations of the training set. Let's see below.

In [64]:
# percentage of observations showing each of the different values
X_train['Quase'].value_counts() / len(X_train)

1    0.9875
2    0.0125
Name: Quase, dtype: float64

We can see that 98.75% of the observations show one value, 1. Therefore, this feature is almost constant.
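As a quick check of why this feature falls below the 0.0126 threshold, the sketch below rebuilds a column with the same proportions and computes its population variance, which is the quantity `VarianceThreshold` compares against the threshold. The counts (158 ones and 2 twos in 160 rows) are an assumption inferred from the output above, not the actual data.

```python
import numpy as np
import pandas as pd

# toy stand-in for the 'Quase' column: 158 ones and 2 twos (assumed counts)
quase_like = pd.Series([1] * 158 + [2] * 2)

# share of observations per value, most frequent first
shares = quase_like.value_counts(normalize=True)

# population variance (ddof=0)
pop_var = np.var(quase_like.to_numpy(), ddof=0)

print(shares.iloc[0], pop_var)
```

The variance comes out just under 0.0126, which is why the feature is dropped with that threshold.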

In [69]:
# we can then remove the features like this
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((160, 4), (40, 4))

By removing constant and almost constant features, we reduced the feature space from 6 to 4: one constant and one quasi-constant feature were removed from the present dataset.
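Note that `transform` returns a NumPy array, so the column names are lost. One way to keep them is to index the columns with the mask from `get_support`. This is sketched here on a small made-up DataFrame, where the column names and values are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# made-up data: 'flag' is quasi-constant, the other two columns vary
toy = pd.DataFrame({
    'tv': range(100),
    'radio': [i % 7 for i in range(100)],
    'flag': [0] * 99 + [1],
})

sel_toy = VarianceThreshold(threshold=0.0126)
sel_toy.fit(toy)

# names of the retained columns, taken from the boolean mask
kept = toy.columns[sel_toy.get_support()]

# rebuild a DataFrame so that names and index survive the transform
toy_reduced = pd.DataFrame(
    sel_toy.transform(toy), columns=kept, index=toy.index)

print(list(kept))
```

The same pattern applies to a train and test pair: fit on the training set, then wrap each transformed array with the retained column names and the original index.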

### Coding it ourselves

First, I will reload the dataset and remove the constant features.

In [71]:
# load the dataset
data = pd.read_csv('Advertising.csv', nrows=50000)

# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['sales'], axis=1),
    data['sales'],
    test_size=0.2,
    random_state=0)

# remove constant features
# using the code from the previous lecture
constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((160, 5), (40, 5))

In [74]:
quasi_constant_feat = []
for feature in X_train.columns:

    # find the predominant value
    predominant = (X_train[feature].value_counts() / len(
        X_train)).sort_values(ascending=False).values[0]

    # flag the feature if its predominant value exceeds the cutoff
    if predominant > 0.999:
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

0

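With a cutoff of 0.999, this check is stricter than the VarianceThreshold setting used above, and it flags no features in this dataset: `Quase` shows its predominant value in only 98.75% of the training observations. On the Santander Customer Satisfaction data, the same loop found 119 features that show predominantly one value for the great majority of the observations. Let's see what one of those quasi-constant features looks like.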

In [15]:
# select the first one from the list
quasi_constant_feat[0]

'imp_op_var40_efect_ult1'

In [16]:
X_train['imp_op_var40_efect_ult1'].value_counts() / len(X_train)

0.00       0.999400
900.00     0.000086
60.00      0.000057
1800.00    0.000057
600.00     0.000057
930.00     0.000029
420.00     0.000029
74.28      0.000029
270.00     0.000029
1200.00    0.000029
6600.00    0.000029
870.00     0.000029
750.00     0.000029
300.00     0.000029
120.00     0.000029
210.00     0.000029
150.00     0.000029
Name: imp_op_var40_efect_ult1, dtype: float64

The feature shows 0 for more than 99.9% of the observations.
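The manual check can also be wrapped in a small function. This is a sketch, not the lecture's code: `find_quasi_constant` and the toy DataFrame are hypothetical, and `value_counts(normalize=True)` replaces the explicit division by the number of rows (so shares are computed over non-missing values only).

```python
import pandas as pd


def find_quasi_constant(df, cutoff=0.999):
    """Return the columns whose most frequent value covers more than
    `cutoff` of the non-missing rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the
        # share of the predominant value
        shares = df[col].value_counts(normalize=True)
        # an all-missing column yields an empty result and is skipped
        if not shares.empty and shares.iloc[0] > cutoff:
            flagged.append(col)
    return flagged


# made-up data: 'mostly_zero' shows 0 in 1999 of 2000 rows
toy = pd.DataFrame({
    'mostly_zero': [0] * 1999 + [1],
    'varied': range(2000),
})

print(find_quasi_constant(toy))
```

Unlike a variance threshold, this cutoff does not depend on the scale of the feature, which makes it easier to apply to columns with very different ranges.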

That is all for this lecture. I hope you enjoyed it, and see you in the next one!