## Quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target.

Two main ways to identify and remove quasi-constant features:
- **VarianceThreshold** from sklearn.feature_selection
- pandas **value_counts()**

Below is one code walkthrough for each of the two approaches mentioned above.

(In this demo, I will demonstrate how to identify quasi-constant features using a toy dataset.)

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

In [2]:
# load dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

**Note**: in all feature selection procedures, it is good practice to select the features by examining only the training set. This avoids overfitting.

In [3]:
# separate train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant features

First, we remove the constant features, so that the quasi-constant ones are easier to see.

In [4]:
# removing constants

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))
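The `std() == 0` check above only applies to numerical columns. As a minimal sketch of an alternative that also covers categorical columns, one could count distinct values with `nunique()`. The toy DataFrame and its column names below are invented for illustration and are not part of `dataset_1.csv`.

```python
import pandas as pd

# Hypothetical toy training set with numerical and categorical columns
toy_train = pd.DataFrame({
    "num_const": [1.0, 1.0, 1.0, 1.0],
    "cat_const": ["a", "a", "a", "a"],
    "num_var": [1.0, 2.0, 3.0, 4.0],
    "cat_var": ["a", "b", "a", "b"],
})

# nunique() counts the distinct non-missing values in each column
n_unique = toy_train.nunique()

# a column with at most one distinct value is constant
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```

Because `nunique()` ignores missing values by default, a column that holds one value plus NaNs would also be flagged here; whether that is desirable depends on the dataset.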

## Remove quasi-constant features

### A) Using the VarianceThreshold from sklearn


In [5]:
sel = VarianceThreshold(threshold=0.01)

sel.fit(X_train)

VarianceThreshold(threshold=0.01)

In [6]:
# after the above fitting, 
# sum(sel.get_support()) returns the number of features that are not quasi-constant

sum(sel.get_support())

215

In [7]:
# print the number of quasi-constant features

quasi_constant = X_train.columns[~sel.get_support()]

len(quasi_constant)

51

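These 51 columns have a variance below 0.01 in the training set, so VarianceThreshold flags them as quasi-constant. For a binary 0/1 feature, a variance of 0.01 corresponds to roughly 99% of the observations sharing the same value.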
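The link between a variance threshold of 0.01 and "about 99% constant" holds for binary 0/1 features, whose variance is p(1 - p) when a fraction p of the rows equals 1. The sketch below works through that arithmetic; `binary_variance` is a helper name made up for this illustration. For features on other scales, the same threshold corresponds to a different share of the dominant value.

```python
import numpy as np

def binary_variance(p: float) -> float:
    """Variance of a 0/1 feature where a fraction p of the rows equals 1."""
    return p * (1 - p)

# variance shrinks as the feature gets closer to constant
for p in (0.5, 0.1, 0.01, 0.001):
    print(f"share of ones = {p:>5}: variance = {binary_variance(p):.6f}")

# check the formula against an empirical variance: 1% ones, 99% zeros
x = np.array([1] * 10 + [0] * 990)
empirical = x.var()  # population variance (ddof=0)
print(empirical)
```

With 1% ones the variance is 0.0099, just under the 0.01 threshold, so such a feature would be dropped.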

In [8]:
# print the  quasi-constant variable names
quasi_constant

Index(['var_1', 'var_2', 'var_7', 'var_9', 'var_10', 'var_19', 'var_28',
       'var_36', 'var_43', 'var_45', 'var_53', 'var_56', 'var_59', 'var_66',
       'var_67', 'var_69', 'var_71', 'var_104', 'var_106', 'var_116',
       'var_133', 'var_137', 'var_141', 'var_146', 'var_177', 'var_187',
       'var_189', 'var_194', 'var_197', 'var_198', 'var_202', 'var_218',
       'var_219', 'var_223', 'var_233', 'var_234', 'var_235', 'var_245',
       'var_247', 'var_249', 'var_250', 'var_251', 'var_256', 'var_260',
       'var_267', 'var_274', 'var_282', 'var_285', 'var_287', 'var_289',
       'var_298'],
      dtype='object')

In [9]:
# for instance, the count and the fraction of observations showing
# each of the different values of the variable
# (np.float was removed from NumPy; plain division already returns floats)

print(X_train['var_1'].value_counts(), "\n", "="*60)
print(X_train['var_1'].value_counts() / len(X_train))

0    34987
3        7
6        5
9        1
Name: var_1, dtype: int64 
0    0.999629
3    0.000200
6    0.000143
9    0.000029
Name: var_1, dtype: float64


In [10]:
# capture the names of the features that are kept (not quasi-constant)

feat_names = X_train.columns[sel.get_support()]

In [11]:
# remove the quasi-constant features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 215), (15000, 215))

By removing constant and quasi-constant features, we reduced the feature space from 300 to 215. That is 85 features removed from this dataset, more than a quarter!

In [12]:
# transform the array into a dataframe

X_train = pd.DataFrame(X_train, columns=feat_names)
X_test = pd.DataFrame(X_test, columns=feat_names)

X_test.head()

Unnamed: 0,var_3,var_4,var_5,var_6,var_8,var_11,var_12,var_13,var_14,var_15,...,var_286,var_288,var_290,var_291,var_292,var_293,var_295,var_296,var_299,var_300
0,0.0,2.79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
3,0.0,2.76,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
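Note that rebuilding the dataframe with only `columns=` resets the row index to 0, 1, 2, ..., so it no longer matches the shuffled index of `y_train` and `y_test`. If that alignment matters, one option is to pass the original index explicitly. The sketch below uses a small invented dataframe rather than the demo dataset.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data with a non-sequential index, as left by a shuffled split
X = pd.DataFrame(
    {"keep": [1.0, 2.0, 3.0, 4.0], "drop": [0.0, 0.0, 0.0, 0.0]},
    index=[10, 42, 7, 99],
)

selector = VarianceThreshold(threshold=0.01).fit(X)
kept_names = X.columns[selector.get_support()]

# transform() returns a NumPy array; restore both column names and index
X_sel = pd.DataFrame(selector.transform(X), columns=kept_names, index=X.index)
print(X_sel)
```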


### B) Using **value_counts()** from pandas

This method, as opposed to the VarianceThreshold, can be used for both **numerical and categorical** variables.
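As a minimal sketch of that idea on mixed data types, the helper below flags every column whose most frequent value covers more than a given share of the rows. The function name `find_quasi_constant`, the toy dataframe, and its column names are assumptions made for this illustration.

```python
import pandas as pd

def find_quasi_constant(df: pd.DataFrame, threshold: float = 0.998) -> list:
    """Return the columns whose most frequent value covers more than
    `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # relative frequencies, sorted from most to least frequent
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged

toy = pd.DataFrame({
    "colour": ["red"] * 999 + ["blue"],   # categorical, 99.9% "red"
    "flag": [0] * 999 + [1],              # numerical, 99.9% zeros
    "size": ["S", "M"] * 500,             # categorical, balanced
})

print(find_quasi_constant(toy))
```

The same call works on string and numeric columns alike, because it only looks at how often each value occurs and never at its magnitude.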

In [13]:
# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

# remove constant features

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

In [14]:
# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

    # find the predominant value, that is the value that is shared
    # by most observations
    predominant = (X_train[feature].value_counts() / len(
        X_train)).sort_values(ascending=False).values[0]

    # check whether the predominant value covers more than 99.8% of the rows
    if predominant > 0.998:
        
        # add the variable to the list
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

108

With the thresholds selected here, this method is more aggressive than VarianceThreshold from sklearn. It found 108 features in which a single value is shared by more than 99.8% of the observations.

In [15]:
# print the feature names

quasi_constant_feat

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_14',
 'var_16',
 'var_20',
 'var_24',
 'var_28',
 'var_32',
 'var_34',
 'var_36',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_72',
 'var_73',
 'var_77',
 'var_78',
 'var_90',
 'var_95',
 'var_98',
 'var_102',
 'var_104',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_124',
 'var_125',
 'var_126',
 'var_129',
 'var_130',
 'var_133',
 'var_136',
 'var_138',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_153',
 'var_159',
 'var_183',
 'var_184',
 'var_187',
 'var_189',
 'var_197',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_217',
 'var_219',
 'var_221',
 'var_223',
 'var_224',
 'var_228',
 'var_233',
 'var_234',
 'var_235',
 'var_236',
 'var_237',
 'var_239',
 'var_243',
 'var_245',
 'var_246',
 'var_247',
 ...]

In [16]:
X_train['var_3'].value_counts() / len(X_train)

0.0000         0.999629
35685.9459     0.000029
3583.3941      0.000029
15028.0560     0.000029
52105.7901     0.000029
10281.6000     0.000029
86718.0000     0.000029
207901.3365    0.000029
25905.4866     0.000029
5209.9500      0.000029
2641.0164      0.000029
12542.3100     0.000029
861.0900       0.000029
27.3000        0.000029
Name: var_3, dtype: float64

The feature shows 0 for more than 99.9% of the observations, but it also takes a handful of other values, some of them very large, in a tiny proportion of the observations. Those rare large values inflate the feature's variance, which is why VarianceThreshold did not flag it in the previous section. Yet we can see that it is quasi-constant.
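The effect can be reproduced on invented data. In the sketch below, two hypothetical columns are both 99.9% zeros, but only the one whose rare values are small falls under the variance threshold; the frequency check treats them identically.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

n = 10_000
toy = pd.DataFrame({
    # 99.9% zeros, rare values are small: variance is about 0.001
    "rare_small": [0.0] * (n - 10) + [1.0] * 10,
    # 99.9% zeros, rare values are large: variance is in the millions
    "rare_large": [0.0] * (n - 10) + [50_000.0] * 10,
})

# variance-based selection keeps only columns with variance above 0.01
selector = VarianceThreshold(threshold=0.01).fit(toy)
kept = toy.columns[selector.get_support()].tolist()
print("kept by VarianceThreshold:", kept)

# frequency-based view: share of the most common value in each column
top_share = {c: toy[c].value_counts(normalize=True).iloc[0] for c in toy.columns}
print(top_share)
```

This is why the variance-based approach is sensitive to the scale of a feature, while the `value_counts()` approach is not.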

In [17]:
# finally, drop the quasi-constant features:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))

This takes the dataset from 300 variables to 158: 34 constant and 108 quasi-constant features were removed.