## Quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations in the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing this type of feature.
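
To illustrate the kind of exception mentioned above, here is a minimal sketch on made-up data. The `flag` column and the toy target are assumptions for illustration only and are not part of the dataset used in this notebook. The feature shows one value for 99.5% of the rows, yet it separates the rare positive class perfectly:

```python
import pandas as pd

# Hypothetical toy data: 10,000 rows, of which only 50 are positive
y_toy = pd.Series([1] * 50 + [0] * 9950, name="target")

# Contrived feature: 0 almost everywhere, 1 exactly where the target is 1
flag = y_toy.rename("flag")

# Share of rows that show the most frequent value of the feature
top_share = flag.value_counts(normalize=True).iloc[0]

# Target rate within each value of the feature
rate_when_1 = y_toy[flag == 1].mean()
rate_when_0 = y_toy[flag == 0].mean()

print(f"share of predominant value: {top_share:.4f}")
print(f"target rate when flag == 1: {rate_when_1:.2f}")
print(f"target rate when flag == 0: {rate_when_0:.2f}")
```

Real features are rarely this clean, but the sketch shows why it can be worth checking a quasi-constant feature against the target before dropping it.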

Identifying and removing quasi-constant features is an easy first step towards feature selection and more interpretable machine learning models.
To identify quasi-constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. If we use the VarianceThreshold, all our variables need to be numerical. If we code it manually, however, we can apply the code to both numerical and categorical variables.

Below are 2 snippets of code: one where I use the VarianceThreshold and one manually coded alternative.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

In [2]:
path = "https://frenzy86.s3.eu-west-2.amazonaws.com/python/data/dataset_1.csv"
# path = '../dataset_1.csv'

In [3]:
data = pd.read_csv(path)
data.shape

(50000, 301)

In [5]:
data

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_292,var_293,var_294,var_295,var_296,var_297,var_298,var_299,var_300,target
0,0,0,0.0,0.00,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
1,0,0,0.0,3.00,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
2,0,0,0.0,5.88,0.0,0,0,0,0,0,...,0.00,0,0,3,0,0,0,0.0,67772.7216,0
3,0,0,0.0,14.10,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
4,0,0,0.0,5.76,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0,0,0.0,2.85,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
49996,0,0,0.0,2.91,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
49997,0,0,0.0,8.46,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
49998,0,0,0.0,2.76,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0


**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps to avoid overfitting.
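
One way to enforce this when cross-validating, sketched here under assumptions, is to place the selector inside a Scikit-learn `Pipeline`, so that it is re-fitted on the training portion of each fold only. The synthetic data (`X_toy`, `y_toy`), the step names and the choice of `LogisticRegression` are illustrative and do not come from this notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical synthetic data with one constant column appended
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_toy = np.hstack([X_toy, np.zeros((500, 1))])

# The selector is a pipeline step, so it only ever sees the training
# portion of each cross-validation fold when it is fitted
pipe = Pipeline([
    ("drop_low_variance", VarianceThreshold(threshold=0.01)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print(scores.mean())
```

In the rest of this notebook I keep the simpler approach of fitting on `X_train` and applying the result to `X_test`.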

In [4]:
TARGET = 'target'

X = data.drop(labels=[TARGET], axis=1)
y = data[TARGET]

X_train, X_test, y_train, y_test = train_test_split(X,  # drop the target
                                                    y,  # just the target
                                                    test_size=0.3,
                                                    random_state=667,
                                                    )

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant features
First, I remove the constant features, as I did in the previous lecture. This will allow a better visualisation of the quasi-constant ones.

In [6]:
# using the code from the previous lecture
# remove 46 constant features
constant_features = [feat for feat in X_train.columns if X_train[feat].std() == 0]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 254), (15000, 254))
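
Note that `std()` is only defined for numerical columns. If a dataset also contained categorical variables, one option would be to count the distinct values instead. The following is a minimal sketch on a hypothetical toy dataframe (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy dataframe mixing numerical and categorical columns
toy = pd.DataFrame({
    "num_const": [1, 1, 1, 1],
    "cat_const": ["a", "a", "a", "a"],
    "num_var": [1, 2, 3, 4],
    "cat_var": ["a", "b", "a", "b"],
})

# A column is constant if it shows exactly one distinct value
constant_toy = [col for col in toy.columns if toy[col].nunique() == 1]

print(constant_toy)
```

By default `nunique()` ignores missing values, so a column that is constant apart from a few NaN would also be flagged by this check.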

## Remove quasi-constant features

### Using the VarianceThreshold from sklearn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet a certain threshold. By default, it removes all zero-variance features, as we did in the previous notebook.
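
To get a feeling for what a variance threshold means, consider a binary 0/1 feature in which a share p of the observations equal 1. Its variance is p(1 - p), so a threshold of 0.01 flags binary features where one value covers roughly 99% or more of the rows. A minimal sketch on made-up columns (the helper `binary_column` and the toy data are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

n = 10_000

def binary_column(n_ones):
    """Hypothetical helper: a 0/1 column of length n with n_ones ones."""
    col = np.zeros(n)
    col[:n_ones] = 1
    return col

X_toy = np.column_stack([
    binary_column(50),   # p = 0.005, variance = 0.005 * 0.995 = 0.004975
    binary_column(200),  # p = 0.02,  variance = 0.02 * 0.98   = 0.0196
])

sel_toy = VarianceThreshold(threshold=0.01).fit(X_toy)

print(sel_toy.variances_)     # variance of each column
print(sel_toy.get_support())  # True for the columns that are kept
```

This reading only holds for 0/1 columns. For other numerical features the variance depends on the scale of the values, so the same threshold can behave very differently.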

Here, we will change the default threshold to remove quasi-constant features or, more precisely, features with low variance.

Check the Scikit-learn docs for more details:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

In [7]:
sel = VarianceThreshold(threshold=0.01)
sel.fit(X_train)  # fit finds the features with low variance

In [8]:
# get_support is a boolean vector that indicates which features
# are retained, that is, which features have a higher variance than the threshold we indicated.

# If we sum over get_support, we get the number of features that are not quasi-constant
sum(sel.get_support())

213

In [10]:
# let's print the number of quasi-constant features
quasi_constant = X_train.columns[~sel.get_support()]
len(quasi_constant)

41

We can see that 41 variables are almost constant. This means that 41 variables show predominantly one value for the majority of the observations in the training set. Let's explore a few of these variables below.

In [11]:
# let's print the variable names
quasi_constant

Index(['var_2', 'var_7', 'var_9', 'var_10', 'var_28', 'var_43', 'var_45',
       'var_53', 'var_56', 'var_59', 'var_66', 'var_67', 'var_69', 'var_71',
       'var_106', 'var_116', 'var_137', 'var_141', 'var_170', 'var_177',
       'var_189', 'var_194', 'var_197', 'var_198', 'var_218', 'var_219',
       'var_233', 'var_234', 'var_235', 'var_245', 'var_249', 'var_250',
       'var_251', 'var_256', 'var_260', 'var_267', 'var_274', 'var_282',
       'var_287', 'var_289', 'var_298'],
      dtype='object')

In [13]:
# proportion of observations showing each of the different values of the variable
X_train['var_1'].value_counts(normalize=True)

var_1,proportion
0,0.999457
3,0.000343
6,0.0002


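We can see that more than 99.9% of the observations show one value, 0, so this feature is fairly constant. Note, however, that `var_1` does not appear in the list above: the few observations with values 3 and 6 push its variance just above the 0.01 threshold, so the VarianceThreshold retains it.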

In [14]:
# let's explore another one
X_train['var_2'].value_counts(normalize=True)

var_2,proportion
0,0.999971
1,2.9e-05


Go ahead and explore the rest of the quasi-constant variables.

We can then remove the quasi-constant features utilizing the transform() method from the VarianceThreshold. Remember that this returns a NumPy array without feature names, so if we want a dataframe we need to reconstitute it.

In [15]:
# capture feature names
feat_names = X_train.columns[sel.get_support()]

In [16]:
# remove the quasi-constant features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 213), (15000, 213))

By removing constant and almost constant features, we reduced the feature space from 300 to 213. This means that 87 features were removed from this dataset (46 constant and 41 quasi-constant). Almost a third!

In [17]:
X_train = pd.DataFrame(X_train, columns=feat_names)
X_test = pd.DataFrame(X_test, columns=feat_names)
X_test

Unnamed: 0,var_1,var_3,var_4,var_5,var_6,var_8,var_11,var_12,var_13,var_14,...,var_286,var_288,var_290,var_291,var_292,var_293,var_295,var_296,var_299,var_300
0,0.0,0.0,2.82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
1,0.0,0.0,2.97,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
2,0.0,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
3,0.0,0.0,2.88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
4,0.0,0.0,3.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14995,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
14996,0.0,0.0,5.52,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
14997,0.0,0.0,2.82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
14998,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000
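
Rebuilding the dataframe from the NumPy array, as above, gives it a fresh 0..n-1 index (which no longer matches the index of `y_train`) and converts every column to float. A possible alternative, sketched here on a hypothetical toy dataframe, is to use the boolean mask from `get_support()` to select the columns directly, which keeps the original index and dtypes:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with a non-default index, as left by a random split
toy_train = pd.DataFrame(
    {"a": [0, 0, 0, 0], "b": [1.0, 2.0, 3.0, 4.0], "c": [5, 6, 7, 9]},
    index=[10, 11, 12, 13],
)
toy_test = pd.DataFrame(
    {"a": [0, 1], "b": [2.5, 3.5], "c": [8, 5]},
    index=[20, 21],
)

# Fit on the training data only, then reuse the mask on both sets
sel_toy = VarianceThreshold(threshold=0.01).fit(toy_train)
mask = sel_toy.get_support()

toy_train_sel = toy_train.loc[:, mask]
toy_test_sel = toy_test.loc[:, mask]

print(toy_train_sel)
```

Here column `a` is dropped from both sets because it is constant in the training data, while `b` and `c` are kept with their original index.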


### Coding it ourselves

First, I will separate the dataset into train and test and remove the constant features again. Then, I will provide an alternative method to find quasi-constant features.

This method, as opposed to the VarianceThreshold, can be used for both **numerical and categorical** variables.
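
To make the point about categorical variables concrete, here is a minimal sketch of the idea on a hypothetical toy dataframe. The function name `find_quasi_constant`, its `tol` parameter and the toy columns are assumptions for illustration:

```python
import pandas as pd

def find_quasi_constant(df, tol=0.998):
    """Return the columns whose most frequent value covers more than `tol` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the predominant value
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share > tol:
            flagged.append(col)
    return flagged

# Hypothetical toy data mixing categorical and numerical columns
toy = pd.DataFrame({
    "colour": ["red"] * 999 + ["blue"],     # 99.9% "red"
    "city": ["Rome", "Milan"] * 500,        # 50% / 50%
    "count": [0] * 995 + [1, 2, 3, 4, 5],   # 99.5% zeros
})

strict = find_quasi_constant(toy)            # default tolerance of 99.8%
loose = find_quasi_constant(toy, tol=0.99)   # looser tolerance of 99%

print(strict)
print(loose)
```

With the default tolerance only `colour` is flagged, whereas with the looser tolerance `count` is flagged as well. The check never computes a variance, so it does not matter whether the values are numbers or strings.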

In [18]:
TARGET = 'target'

X = data.drop(labels=[TARGET], axis=1)
y = data[TARGET]

X_train, X_test, y_train, y_test = train_test_split(X,  # drop the target
                                                    y,  # just the target
                                                    test_size=0.3,
                                                    random_state=667,
                                                    )

X_train.shape, X_test.shape

# remove constant features
# using the code from the previous lecture

constant_features = [feat for feat in X_train.columns if X_train[feat].std() == 0]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 254), (15000, 254))

In [19]:
quasi_constant_feat = []
for feature in X_train.columns:
    # find the share of the predominant value, that is, the value shared by most observations
    predominant = X_train[feature].value_counts(normalize=True).sort_values(ascending=False).values[0]
    # evaluate the predominant value: do more than 99.8% of the observations show 1 value?
    if predominant > 0.998:
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

96

Our method was a bit more aggressive than the VarianceThreshold from sklearn with the threshold that we selected above. It found 96 features that show predominantly 1 value for the majority of the observations.

Let's see what some of the quasi-constant features look like.

In [20]:
# print the feature names
quasi_constant_feat

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_14',
 'var_16',
 'var_20',
 'var_24',
 'var_28',
 'var_32',
 'var_34',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_73',
 'var_77',
 'var_78',
 'var_90',
 'var_95',
 'var_98',
 'var_102',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_125',
 'var_126',
 'var_129',
 'var_130',
 'var_136',
 'var_138',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_159',
 'var_170',
 'var_183',
 'var_184',
 'var_189',
 'var_197',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_219',
 'var_221',
 'var_224',
 'var_228',
 'var_233',
 'var_234',
 'var_235',
 'var_236',
 'var_237',
 'var_239',
 'var_243',
 'var_245',
 'var_246',
 'var_249',
 'var_251',
 'var_254',
 'var_257',
 'var_260',
 'var_263',
 'var_264',
 'var_265',
 'var_267',

In [21]:
# select one feature from the list
quasi_constant_feat[2]

'var_3'

In [22]:
X_train['var_3'].value_counts(normalize=True)

var_3,proportion
0.0,0.999457
13297.032,2.9e-05
7134.8904,2.9e-05
52105.7901,2.9e-05
2928.915,2.9e-05
25905.4866,2.9e-05
3583.3941,2.9e-05
12542.31,2.9e-05
207901.3365,2.9e-05
6211.5165,2.9e-05


The feature shows 0 for more than 99.9% of the observations, but it also takes a few different, very large values in a tiny proportion of them. These large values inflate the variance of the feature, which is why the VarianceThreshold did not flag it earlier. Yet, we can see that it is quasi-constant.
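
The effect can be reproduced with a small sketch on a made-up column. The number of non-zero rows and the value 10,000 are assumptions chosen to resemble the pattern above, not the actual values of `var_3`:

```python
import numpy as np

n = 35_000
col = np.zeros(n)
col[:19] = 10_000.0  # hypothetical: 19 rows carry a large non-zero value

# Share of rows showing the predominant value, 0
top_share = (col == 0).mean()

# Population variance of the column
variance = col.var()

print(f"share of predominant value: {top_share:.6f}")
print(f"variance: {variance:.1f}")
```

The column is quasi-constant by the share criterion, yet its variance is far above 0.01, because the variance grows with the square of the magnitude of the rare values.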

Keep in mind that the thresholds are arbitrary and decided by the user.

In [23]:
# finally, let's drop the quasi-constant features:
X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))