## Constant and Quasi-constant features with Feature-engine

In this notebook we remove constant and quasi-constant features using the DropConstantFeatures transformer from Feature-engine.

![alt text](https://frenzy86.s3.eu-west-2.amazonaws.com/python/feaut.png)



In [2]:
!pip install feature-engine -q

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from feature_engine.selection import DropConstantFeatures

In [4]:
path = "https://frenzy86.s3.eu-west-2.amazonaws.com/python/data/dataset_1.csv"
# path = '../dataset_1.csv'

In [5]:
data = pd.read_csv(path)
data.shape

(50000, 301)

In [6]:
data

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_292,var_293,var_294,var_295,var_296,var_297,var_298,var_299,var_300,target
0,0,0,0.0,0.00,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
1,0,0,0.0,3.00,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
2,0,0,0.0,5.88,0.0,0,0,0,0,0,...,0.00,0,0,3,0,0,0,0.0,67772.7216,0
3,0,0,0.0,14.10,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
4,0,0,0.0,5.76,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0,0,0.0,2.85,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
49996,0,0,0.0,2.91,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
49997,0,0,0.0,8.46,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0
49998,0,0,0.0,2.76,0.0,0,0,0,0,0,...,0.00,0,0,0,0,0,0,0.0,0.0000,0


**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This avoids overfitting, because information from the test set never influences which features are kept.
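As a minimal sketch of this principle, the snippet below learns which columns are constant from a training set only and then applies that same decision to the test set. The dataframes `train` and `test` and their values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data: column "a" is constant in train but not in test
train = pd.DataFrame({"a": [0, 0, 0, 0], "b": [1, 2, 3, 4]})
test = pd.DataFrame({"a": [0, 5], "b": [7, 8]})

# Learn the columns to drop by looking at the training set only
constant_cols = [col for col in train.columns if train[col].nunique() == 1]

# Apply the decision learned on train to both sets, without refitting on test
train_sel = train.drop(columns=constant_cols)
test_sel = test.drop(columns=constant_cols)

print(constant_cols)
print(train_sel.shape, test_sel.shape)
```

Both sets end up with the same columns, and the test set played no part in choosing them.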

In [7]:
TARGET = 'target'

X = data.drop(labels=[TARGET], axis=1)
y = data[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X,  # predictors, with the target dropped
    y,  # just the target
    test_size=0.3,
    random_state=667,
)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant features

The DropConstantFeatures class from Feature-engine finds and removes constant and quasi-constant features from a dataset. Setting the parameter tol to 1 removes only constant features. Smaller values of tol also remove quasi-constant features, that is, features in which a single value is shared by at least that fraction of the observations.
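To make the role of tol concrete, here is a rough pandas-only sketch of the rule, assuming a feature is flagged when the share of its most frequent value is at least tol. The helper `quasi_constant_columns` and the toy dataframe are hypothetical. This illustrates the idea and is not Feature-engine's actual implementation.

```python
import pandas as pd

def quasi_constant_columns(df, tol):
    """Return the columns whose most frequent value covers at least `tol` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the predominant value
        top_share = df[col].value_counts(normalize=True).iloc[0]
        if top_share >= tol:
            flagged.append(col)
    return flagged

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "constant": [0] * 10,        # one value in 100% of the rows
    "quasi": [0] * 9 + [1],      # one value in 90% of the rows
    "varied": list(range(10)),   # every value appears once
})

print(quasi_constant_columns(toy, tol=1))
print(quasi_constant_columns(toy, tol=0.8))
```

With tol=1 only the fully constant column is flagged. Lowering tol to 0.8 also flags the column in which 90% of the rows share one value.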

In [8]:
sel = DropConstantFeatures(tol=1,
                           variables=None,
                           missing_values='raise',
                           )
sel.fit(X_train)

In [9]:
# list of constant features
sel.features_to_drop_

['var_23',
 'var_33',
 'var_36',
 'var_44',
 'var_61',
 'var_72',
 'var_80',
 'var_81',
 'var_87',
 'var_89',
 'var_92',
 'var_97',
 'var_99',
 'var_104',
 'var_112',
 'var_113',
 'var_120',
 'var_122',
 'var_124',
 'var_127',
 'var_133',
 'var_135',
 'var_153',
 'var_158',
 'var_167',
 'var_171',
 'var_178',
 'var_180',
 'var_182',
 'var_187',
 'var_195',
 'var_196',
 'var_201',
 'var_212',
 'var_215',
 'var_217',
 'var_223',
 'var_225',
 'var_227',
 'var_247',
 'var_248',
 'var_280',
 'var_283',
 'var_285',
 'var_294',
 'var_297']

In [10]:
# number of constant features
len(sel.features_to_drop_)

46

In [11]:
# let's explore the values of one of the constant features
X_train[sel.features_to_drop_[0]].unique()

array([0])

In [12]:
# remove constant features from the data

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 254), (15000, 254))

The datasets now contain 46 fewer features.

## Remove quasi-constant features

In [13]:
sel = DropConstantFeatures(tol=0.998, variables=None, missing_values='raise')
sel.fit(X_train)

In [14]:
# number of quasi-constant features
len(sel.features_to_drop_)

96

In [15]:
# list of quasi-constant features
sel.features_to_drop_

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_14',
 'var_16',
 'var_20',
 'var_24',
 'var_28',
 'var_32',
 'var_34',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_73',
 'var_77',
 'var_78',
 'var_90',
 'var_95',
 'var_98',
 'var_102',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_125',
 'var_126',
 'var_129',
 'var_130',
 'var_136',
 'var_138',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_159',
 'var_170',
 'var_183',
 'var_184',
 'var_189',
 'var_197',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_219',
 'var_221',
 'var_224',
 'var_228',
 'var_233',
 'var_234',
 'var_235',
 'var_236',
 'var_237',
 'var_239',
 'var_243',
 'var_245',
 'var_246',
 'var_249',
 'var_251',
 'var_254',
 'var_257',
 'var_260',
 'var_263',
 'var_264',
 'var_265',
 'var_267',

In [16]:
# percentage of observations showing each of the different values
# of the variable

var = sel.features_to_drop_[0]
X_train[var].value_counts(normalize=True)

var_1,proportion
0,0.999457
3,0.000343
6,0.0002


We can see that more than 99.9% of the observations show a single value, 0. Therefore, this feature is almost constant.

In [17]:
# let's explore another one

var = sel.features_to_drop_[2]
X_train[var].value_counts(normalize=True)

var_3,proportion
0.0,0.999457
13297.032,2.9e-05
7134.8904,2.9e-05
52105.7901,2.9e-05
2928.915,2.9e-05
25905.4866,2.9e-05
3583.3941,2.9e-05
12542.31,2.9e-05
207901.3365,2.9e-05
6211.5165,2.9e-05


Go ahead and explore the rest of the quasi-constant variables.
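One way to explore many features at once is to compute, for each one, the share of rows taken by its most frequent value. The helper `predominant_share` below is a hypothetical convenience function, shown here on toy data that stands in for X_train.

```python
import pandas as pd

def predominant_share(df, columns):
    """Return, for each column, the share of rows taken by its most frequent value."""
    shares = {
        col: df[col].value_counts(normalize=True).iloc[0]
        for col in columns
    }
    # Sort ascending so the least constant features appear first
    return pd.Series(shares, name="predominant_share").sort_values()

# Hypothetical toy data standing in for X_train
toy = pd.DataFrame({
    "f1": [0] * 999 + [3],
    "f2": [0.0] * 998 + [1.5, 2.5],
})

shares = predominant_share(toy, ["f1", "f2"])
print(shares)
```

In this notebook, the equivalent call would be `predominant_share(X_train, sel.features_to_drop_)`, run before the features are removed with transform().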

We can then remove the quasi-constant features using the transform() method. Feature-engine returns dataframes by default.

In [18]:
# remove the quasi-constant features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))

By removing 46 constant and 96 quasi-constant features, we reduced the feature space from 300 to 158.