## Constant features

Constant features are those that show a single value for all the observations (rows) of the dataset. These features provide no information that allows a machine learning model to discriminate or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

Here, I will demonstrate how to identify constant features using a dataset that I created for this course. 

To identify constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. The VarianceThreshold requires all our variables to be numerical. If we code it manually, however, we can apply the code to both numerical and categorical variables.

I will show three snippets of code: one that uses the VarianceThreshold and two manually coded alternatives.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

## Removing constant features

In [2]:
# load our first dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting, because the test set stays unseen until we evaluate the model.
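To make this concrete, here is a minimal sketch on a made-up toy dataframe (the names `toy_train`, `toy_test` and the columns `a` and `b` are invented for illustration and are not part of the course dataset). The selector learns which columns are constant from the training rows only, and that same decision is then applied to the test rows, even though column `b` happens to vary in the test rows.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# toy data, invented for illustration only
toy_train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [0, 0, 0, 0]})
toy_test = pd.DataFrame({"a": [5, 6], "b": [0, 1]})

toy_sel = VarianceThreshold(threshold=0)
toy_sel.fit(toy_train)  # the constant features are learned from train only

# names of the features that the selector keeps
kept = toy_train.columns[toy_sel.get_support()]

# the decision made on the train set is applied to both sets
toy_train_sel = toy_sel.transform(toy_train)
toy_test_sel = toy_sel.transform(toy_test)

print(list(kept))
print(toy_train_sel.shape, toy_test_sel.shape)
```

Column `b` is dropped from both sets because it is constant in the training rows, which is the only information the selector is allowed to see.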

In [3]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # drop the target
    data['target'],  # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

### Using VarianceThreshold from Scikit-learn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance doesn't meet a certain threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
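As a small illustration of what the selector measures, the sketch below fits a VarianceThreshold with its default threshold on a toy dataframe (the column names `constant`, `binary` and `spread` are made up for this example) and reads the per-feature variances from the fitted `variances_` attribute.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# toy data, invented for illustration only
toy = pd.DataFrame({
    "constant": [7, 7, 7, 7],
    "binary": [0, 1, 0, 1],
    "spread": [1.0, 2.0, 3.0, 4.0],
})

toy_sel = VarianceThreshold()  # the default threshold is 0
toy_sel.fit(toy)

# variances_ holds one value per input feature, in column order
variances = dict(zip(toy.columns, toy_sel.variances_))

# features whose variance is above the threshold are retained
retained = toy.columns[toy_sel.get_support()].tolist()

print(variances)
print(retained)
```

Only the `constant` column has zero variance here, so it is the only one that the default threshold removes.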

In [4]:
sel = VarianceThreshold(threshold=0)

sel.fit(X_train)  # fit finds the features with zero variance

VarianceThreshold(threshold=0)

In [5]:
# get_support is a boolean vector that indicates which features are retained
# if we sum over get_support, we get the number of features that are not constant

# (go ahead and print the result of sel.get_support() to understand its output)

sum(sel.get_support())

266

In [6]:
# now let's print the number of constant features
# (see how we use ~ to invert the mask and select the constant features)

constant = X_train.columns[~sel.get_support()]

len(constant)

34

We can see that 34 columns / variables are constant. This means that 34 variables show the same value, just one value, for all the observations of the training set.
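If the boolean mask is new to you, the following sketch on a tiny NumPy array (the array `X_toy` is invented for this example) shows what `get_support()` returns, how `~` inverts it, and how `get_support(indices=True)` returns the positions of the retained features instead of a mask.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy array, invented for illustration: the middle column is constant
X_toy = np.array([
    [1, 0, 3],
    [2, 0, 3],
    [3, 0, 4],
])

toy_sel = VarianceThreshold(threshold=0).fit(X_toy)

mask = toy_sel.get_support()          # True where the feature is retained
constant_mask = ~mask                 # True where the feature is constant
kept_idx = toy_sel.get_support(indices=True)  # integer positions instead

print(mask)
print(constant_mask)
print(kept_idx)
```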

In [7]:
# let's print the constant variable names

constant

Index(['var_23', 'var_33', 'var_44', 'var_61', 'var_80', 'var_81', 'var_87',
       'var_89', 'var_92', 'var_97', 'var_99', 'var_112', 'var_113', 'var_120',
       'var_122', 'var_127', 'var_135', 'var_158', 'var_167', 'var_170',
       'var_171', 'var_178', 'var_180', 'var_182', 'var_195', 'var_196',
       'var_201', 'var_212', 'var_215', 'var_225', 'var_227', 'var_248',
       'var_294', 'var_297'],
      dtype='object')

In [8]:
# let's visualise the values of one of the constant variables
# as an example

X_train['var_23'].unique()

array([0], dtype=int64)

In [9]:
# we can do the same for every feature:

for col in constant:
    print(col, X_train[col].unique())

var_23 [0]
var_33 [0]
var_44 [0]
var_61 [0]
var_80 [0]
var_81 [0]
var_87 [0]
var_89 [0.]
var_92 [0]
var_97 [0]
var_99 [0]
var_112 [0]
var_113 [0]
var_120 [0]
var_122 [0]
var_127 [0]
var_135 [0]
var_158 [0]
var_167 [0]
var_170 [0]
var_171 [0]
var_178 [0.]
var_180 [0.]
var_182 [0]
var_195 [0]
var_196 [0]
var_201 [0]
var_212 [0]
var_215 [0]
var_225 [0]
var_227 [0.]
var_248 [0]
var_294 [0]
var_297 [0]


We then use the transform() method of the VarianceThreshold to reduce the training and testing sets to their non-constant features.

Note that VarianceThreshold returns a NumPy array without feature names, so we need to capture the names first, and reconstitute the dataframe in a later step.

In [10]:
# capture non-constant feature names

feat_names = X_train.columns[sel.get_support()]

In [11]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

We went from our original 300 variables to 266.

In [12]:
# X_train is now a NumPy array
X_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [13]:
# reconstitute the dataframe

X_train = pd.DataFrame(X_train, columns=feat_names)
X_train.head()

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_289,var_290,var_291,var_292,var_293,var_295,var_296,var_298,var_299,var_300
0,0.0,0.0,0.0,2.79,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,2.97,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,2.79,85435.2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,5.7,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
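One detail worth knowing: building the dataframe from the array alone gives it a fresh 0..n-1 index, which no longer matches the shuffled index that `y_train` keeps after the split. The sketch below, on a made-up toy dataframe (names and values are illustrative), passes the original index along when reconstituting. I believe recent Scikit-learn releases (1.2 onwards) also offer `set_output(transform="pandas")` on transformers to get a dataframe back directly, but treat that as something to check against your installed version.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# toy data with a non-default index, invented for illustration only
toy = pd.DataFrame(
    {"a": [1, 2, 3], "b": [0, 0, 0], "c": [4.0, 5.5, 6.0]},
    index=[10, 20, 30],
)

toy_sel = VarianceThreshold(threshold=0).fit(toy)

# rebuild the dataframe, keeping both the feature names and the row index
toy_reduced = pd.DataFrame(
    toy_sel.transform(toy),
    columns=toy.columns[toy_sel.get_support()],
    index=toy.index,
)

print(toy_reduced)
```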


### Manual code 1: only works with numerical variables

In the following cells, I will show an alternative to the VarianceThreshold transformer of sklearn, where we write the code to find constant variables ourselves, using the standard deviation from pandas.

In [14]:
# separate train and test (again, as we transformed the previous ones)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [15]:
# short and easy: find constant features

# in this dataset, all features are numeric,
# so this bit of code will suffice:

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

len(constant_features)

34
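A possible variation, shown here as a sketch on a made-up toy dataframe rather than on the course data: comparing a floating-point standard deviation to exactly 0 can, in principle, be affected by rounding, so one alternative is to check whether the minimum and maximum of each numeric column are equal. This is my own suggestion, not part of the original demo.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "a": [0.1, 0.1, 0.1],
    "b": [1, 2, 3],
    "c": [5, 5, 5],
})

# a numeric column is constant when its smallest and largest values coincide
constant_toy = toy.columns[toy.min() == toy.max()].tolist()

print(constant_toy)
```

Note that, like std(), the min() and max() methods skip missing values by default.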

In [16]:
# drop these columns from the train and test sets:

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

We see how, by removing constant features, we managed to reduce the feature space quite a bit.

Both the VarianceThreshold and the snippet of code I provided work with numerical variables. What can we do to find constant categorical variables?

One alternative is to encode the categories as numbers and then use the code above. But then you will spend effort pre-processing variables that are not informative.

The code below offers a better solution:

### Manual code 2: also works with categorical variables

In [17]:
# separate train and test (again, as we transformed the previous ones)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [18]:
# I will cast all the numeric features as object,
# to simulate that they are categorical

X_train = X_train.astype('O')
X_train.dtypes

var_1      object
var_2      object
var_3      object
var_4      object
var_5      object
            ...  
var_296    object
var_297    object
var_298    object
var_299    object
var_300    object
Length: 300, dtype: object

In [19]:
# to find variables that contain only 1 label/value
# we use the nunique() method from pandas, which returns the number
# of different values in a variable.

constant_features = [
    feat for feat in X_train.columns if X_train[feat].nunique() == 1
]

len(constant_features)

34

Same as before, we observe 34 variables that show only 1 value in all the observations of the training set. This shows how useful it is to look out for constant variables at the beginning of any modeling exercise.

**Note:** by default, nunique() ignores missing values, so a variable that contains a single value plus missing values is counted as constant. If you want missing values to count as a separate value, pass dropna=False to nunique(). More details here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html
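The effect of dropna is easy to see on a small example. The toy dataframe below is invented for illustration (the column names are mine): the middle column holds a single label plus missing values, so it counts as constant with the default settings but not with dropna=False.

```python
import numpy as np
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "only_a": ["a", "a", "a", "a"],
    "a_or_missing": ["a", np.nan, "a", np.nan],
    "a_or_b": ["a", "b", "a", "b"],
})

default_counts = toy.nunique()              # missing values are ignored
with_nan_counts = toy.nunique(dropna=False)  # missing values count as a value

print(default_counts)
print(with_nan_counts)
```

Which behaviour is right depends on whether you consider "value or missing" to be informative for your model.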

In [20]:
X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))
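To round this off, here is a minimal sketch of the same idea applied to a dataframe that mixes numerical and categorical columns. The helper `find_constant_features` and the toy dataframes are hypothetical, written only for this illustration; they are not part of pandas or of the course code.

```python
import pandas as pd


def find_constant_features(df, dropna=True):
    """Hypothetical helper: names of the columns that show a single value."""
    counts = df.nunique(dropna=dropna)
    return counts[counts == 1].index.tolist()


# toy data mixing numerical and categorical columns (illustrative only)
toy_train = pd.DataFrame({
    "num_const": [0, 0, 0],
    "cat_const": ["x", "x", "x"],
    "num_var": [1, 2, 3],
    "cat_var": ["x", "y", "x"],
})
toy_test = pd.DataFrame({
    "num_const": [0, 0],
    "cat_const": ["x", "x"],
    "num_var": [4, 5],
    "cat_var": ["y", "y"],
})

# learn the constant features on the train set, drop them from both sets
constant_toy = find_constant_features(toy_train)
toy_train_sel = toy_train.drop(columns=constant_toy)
toy_test_sel = toy_test.drop(columns=constant_toy)

print(constant_toy)
print(toy_train_sel.shape, toy_test_sel.shape)
```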

That is all for this lecture. I hope you enjoyed it, and see you in the next one!