## Constant features

Constant features are those that show a single value for all the observations of the dataset, that is, the same value in every row. These features provide no information that allows a machine learning model to discriminate between observations or predict a target.


Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.
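
As a quick illustration of the definition, here is a minimal sketch on a made-up toy DataFrame (the column names and values are hypothetical and not part of any dataset used in this notebook). A column counts as constant when it holds exactly one distinct value.

```python
import pandas as pd

# Hypothetical toy data: "country" and "flag" hold a single value in every row.
toy = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "country": ["ES", "ES", "ES", "ES"],
    "flag": [0, 0, 0, 0],
    "income": [1200.0, 3400.0, 2100.0, 1800.0],
})

# Count the distinct values per column and keep the columns with exactly one.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

print(constant_cols)  # ['country', 'flag']
```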

Here I will demonstrate how to identify constant features, first on a small Advertising dataset and then on the Santander Customer Satisfaction dataset from Kaggle.

To identify constant features, we can use the VarianceThreshold transformer from sklearn, or we can code it ourselves. I will show both procedures.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

## Removing constant features

In [22]:
# load the Advertising dataset, a small dataset used for
# the first part of the demonstration

data = pd.read_csv('Advertising.csv')
data.shape

(200, 5)

In [25]:
data.head()

| | Unnamed: 0 | TV | radio | newspaper | sales |
|---|---|---|---|---|---|
| 0 | 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [23]:
# check for the presence of null data:
# list the columns that contain at least one missing value.
# The empty list returned here shows that there are
# no missing data in this dataset

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

In [24]:
# the same check written as a plain for loop,
# as a reference for the list comprehension above
for col in data.columns:
    if data[col].isnull().sum() > 0:
        print(col)
    

### Important

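In all feature selection procedures, it is good practice to select the features by examining only the training set. This keeps information from the test set out of the selection step, which would otherwise make the evaluation on the test set overly optimistic.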
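
A minimal sketch of that practice, using a made-up toy DataFrame and a manual split (both are assumptions for illustration only): the decision about which columns to drop is taken from the training rows, and the same decision is then applied to the test rows.

```python
import pandas as pd

# Hypothetical toy data; the first 4 rows act as train, the last 2 as test.
toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [7, 7, 7, 7, 7, 7],
})
train, test = toy.iloc[:4], toy.iloc[4:]

# Decide which columns are constant by looking at the training rows only.
constant_in_train = [c for c in train.columns if train[c].nunique() == 1]

# Apply that same decision to both sets, so they keep identical columns.
train_sel = train.drop(columns=constant_in_train)
test_sel = test.drop(columns=constant_in_train)

print(constant_in_train)
print(train_sel.shape, test_sel.shape)
```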

In [49]:
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['sales'], axis=1),
    data['sales'],
    test_size=0.2,
    random_state=0)

X_train.shape, X_test.shape

((160, 4), (40, 4))

In [50]:
data.drop(labels=['sales'], axis=1)

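| | Unnamed: 0 | TV | radio | newspaper |
|---|---|---|---|---|
| 0 | 1 | 230.1 | 37.8 | 69.2 |
| 1 | 2 | 44.5 | 39.3 | 45.1 |
| 2 | 3 | 17.2 | 45.9 | 69.3 |
| 3 | 4 | 151.5 | 41.3 | 58.5 |
| 4 | 5 | 180.8 | 10.8 | 58.4 |
| ... | ... | ... | ... | ... |
| 195 | 196 | 38.2 | 3.7 | 13.8 |
| 196 | 197 | 94.2 | 4.9 | 8.1 |
| 197 | 198 | 177.0 | 9.3 | 6.4 |
| 198 | 199 | 283.6 | 42.0 | 66.2 |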


In [51]:
X_train

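| | Unnamed: 0 | TV | radio | newspaper |
|---|---|---|---|---|
| 134 | 135 | 36.9 | 38.6 | 65.6 |
| 66 | 67 | 31.5 | 24.6 | 2.2 |
| 26 | 27 | 142.9 | 29.3 | 12.6 |
| 113 | 114 | 209.6 | 20.6 | 10.7 |
| 168 | 169 | 215.4 | 23.6 | 57.6 |
| ... | ... | ... | ... | ... |
| 67 | 68 | 139.3 | 14.5 | 10.2 |
| 192 | 193 | 17.2 | 4.1 | 31.6 |
| 117 | 118 | 76.4 | 0.8 | 14.8 |
| 47 | 48 | 239.9 | 41.5 | 18.5 |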


### Using variance threshold from sklearn

VarianceThreshold from sklearn is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
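
To make the effect of the threshold concrete, here is a small sketch on a made-up array (the data and the value 0.2 are arbitrary choices for illustration). The first column is constant, the second varies only slightly (its variance is 0.1875), and the third varies a lot.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy matrix: column 0 is constant, column 1 is almost constant,
# column 2 clearly varies.
X = np.array([
    [0, 1, 10.0],
    [0, 1, 20.0],
    [0, 0, 30.0],
    [0, 1, 40.0],
])

# Default behaviour (threshold=0): only the zero-variance column is dropped.
default_mask = VarianceThreshold().fit(X).get_support()

# A higher, arbitrarily chosen threshold also drops the low-variance column.
strict_mask = VarianceThreshold(threshold=0.2).fit(X).get_support()

print(default_mask.tolist())  # [False, True, True]
print(strict_mask.tolist())   # [False, False, True]
```

The boolean mask marks the retained columns, so `False` entries are the features that would be removed.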

In [53]:
sel = VarianceThreshold(threshold=0)
sel.fit(X_train)  # fit finds the features with zero variance

VarianceThreshold(threshold=0)

In [54]:
# get_support() returns a boolean mask that indicates which features are retained
# if we sum over the mask, we get the number of features that are not constant
sum(sel.get_support())

4

In [55]:
X_train.columns[sel.get_support()]

Index(['Unnamed: 0', 'TV', 'radio', 'newspaper'], dtype='object')

In [56]:
# another way of counting the non-constant features is like this:
len(X_train.columns[sel.get_support()])

4

In [57]:
# finally we can print the constant features
print(
    len([
        x for x in X_train.columns
        if x not in X_train.columns[sel.get_support()]
    ]))

[x for x in X_train.columns if x not in X_train.columns[sel.get_support()]]

0


[]

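Here the count is 0 and the list is empty: none of the columns in the Advertising training set is constant. When the same procedure is run on the Santander dataset, 58 columns / variables are flagged as constant, in line with the hand-coded check later in this notebook. This means that 58 variables show a single value for all the observations of the training set.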

In [9]:
# let's visualise the values of one of the constant variables as an example
# ('ind_var2_0' is a Santander column, so this cell needs the Santander data loaded)

X_train['ind_var2_0'].unique()

array([0], dtype=int64)

We then use the transform method to reduce the training and testing sets. Note that transform returns NumPy arrays, so the column names are lost.

In [58]:
# remove constant features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((160, 4), (40, 4))
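
Because `transform` returns a NumPy array by default, the column names and the index are not carried over. One possible way to keep working with a DataFrame is sketched below on a made-up toy DataFrame (the data and variable names are hypothetical); it rebuilds the frame from the retained column names given by `get_support()`.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with a non-default index; "b" is constant.
toy = pd.DataFrame(
    {"a": [1, 2, 3], "b": [5, 5, 5], "c": [0.1, 0.2, 0.3]},
    index=[10, 11, 12],
)

selector = VarianceThreshold(threshold=0).fit(toy)

# Names of the columns that survive the selection.
kept_columns = toy.columns[selector.get_support()]

# Wrap the transformed array back into a DataFrame,
# restoring the column names and the original index.
reduced = pd.DataFrame(
    selector.transform(toy),
    columns=kept_columns,
    index=toy.index,
)

print(reduced.columns.tolist())  # ['a', 'c']
```

Note that the values pass through a NumPy array on the way, so the dtypes of the rebuilt DataFrame may differ from the original ones (for example, integers may come back as floats).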

### Coding it ourselves

In the following cells, I will show an alternative to sklearn's VarianceThreshold, this time using the Santander dataset.

In [59]:
# load the Santander customer satisfaction dataset from Kaggle
# I load just the first 50,000 rows for the demonstration
data = pd.read_csv('santander.csv', nrows=50000)

# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

In [12]:
# short and easy: find constant features
# in this case, all features are numeric, so this will suffice

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

len(constant_features)

58

In [13]:
# we can then drop these columns from the train and test sets
X_train = X_train.drop(columns=constant_features)
X_test = X_test.drop(columns=constant_features)

X_train.shape, X_test.shape

((35000, 312), (15000, 312))

We see that by removing constant features, we managed to reduce the feature space quite a bit, from 370 to 312 features.
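
As a sanity check, the two procedures can be compared on a small made-up DataFrame (the data below are hypothetical, chosen only for illustration). In this sketch, the standard-deviation check and VarianceThreshold flag the same columns.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: "a" and "c" are constant, "b" and "d" are not.
toy = pd.DataFrame({
    "a": [3, 3, 3, 3],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [0, 0, 0, 0],
    "d": [0, 1, 0, 1],
})

# Hand-coded check: a numeric column with zero standard deviation is constant.
by_std = [col for col in toy.columns if toy[col].std() == 0]

# sklearn check: columns NOT retained by the selector are the constant ones.
selector = VarianceThreshold(threshold=0).fit(toy)
by_sklearn = toy.columns[~selector.get_support()].tolist()

print(by_std)
print(by_sklearn)
```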

Both VarianceThreshold and the snippet of code I provided work with numerical variables. What can we do to find constant categorical variables?

One alternative is to encode the categories as numbers and then use the code above. But then you will put effort into pre-processing variables that are not informative.

Alternatively, you can use the code below.

### Removing constant features for categorical variables

In [14]:
# load the dataset again
data = pd.read_csv('santander.csv', nrows=50000)

# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

In [15]:
# for the demonstration, I will cast all these numeric features
# to object dtype to simulate categorical variables

X_train = X_train.astype('O')
X_train.dtypes

ID                         object
var3                       object
var15                      object
imp_ent_var16_ult1         object
imp_op_var39_comer_ult1    object
                            ...  
saldo_medio_var44_hace2    object
saldo_medio_var44_hace3    object
saldo_medio_var44_ult1     object
saldo_medio_var44_ult3     object
var38                      object
Length: 370, dtype: object

In [16]:
# and now find those columns that contain only 1 label:
constant_features = [
    feat for feat in X_train.columns if len(X_train[feat].unique()) == 1
]

len(constant_features)

58

As before, we observe 58 variables that show only 1 value across all the observations of the training set. This shows how useful it is to look out for constant variables at the beginning of any modeling exercise.
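
One detail worth keeping in mind with categorical variables is missing data. The sketch below uses a made-up toy DataFrame (column names and values are hypothetical) to show that `unique()` counts NaN as a value, whereas `nunique()` ignores NaN unless `dropna=False` is passed. Which behaviour is preferable depends on whether missingness is considered informative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "colour" is constant, "partly_missing" has a single
# label plus one missing value.
toy = pd.DataFrame({
    "colour": ["red", "red", "red", "red"],
    "size": ["S", "M", "L", "M"],
    "partly_missing": ["yes", np.nan, "yes", "yes"],
})

# nunique() ignores NaN by default, so "partly_missing" looks constant.
constant_ignoring_nan = [c for c in toy.columns if toy[c].nunique() == 1]

# With dropna=False, NaN counts as its own value, matching len(unique()).
constant_strict = [
    c for c in toy.columns if toy[c].nunique(dropna=False) == 1
]

print(constant_ignoring_nan)  # ['colour', 'partly_missing']
print(constant_strict)        # ['colour']
```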

That is all for this lecture. I hope you enjoyed it, and see you in the next one!