## Dropping Constant / Quasi-Constant Features

Constant features show a single value across all observations in the dataset. Such features carry no information that an ML model can use to predict the target.
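As a quick illustration of the definition above, a constant column can be spotted by counting distinct values. This is a minimal sketch on a hypothetical toy frame (the column names and values are made up for the example):

```python
import pandas as pd

# Hypothetical toy data: "country" holds the same value in every row.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "country": ["IN", "IN", "IN", "IN"],
    "income": [30, 45, 45, 80],
})

# A column is constant when it has exactly one distinct value.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
print(constant_cols)
```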

We can drop constant features using scikit-learn's VarianceThreshold.
Reference documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html


### Variance Threshold:

Variance Threshold is a feature selector that removes every feature whose variance does not exceed a chosen threshold. Such low-variance features are of little use in modelling.

It looks only at the features (x), not the desired outputs (y), and can thus be used for unsupervised learning.

The default value of the threshold is 0.

- If Variance Threshold = 0: only constant features are removed
- If Variance Threshold > 0: quasi-constant features are removed as well
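The two cases above can be sketched on a small array. The data here is a hypothetical toy matrix, not the credit card dataset: column 0 is constant, column 1 is almost constant, and column 2 varies freely.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy matrix with ten rows.
X = np.array([
    [1, 0, 3],
    [1, 0, 7],
    [1, 0, 1],
    [1, 0, 9],
    [1, 0, 4],
    [1, 0, 6],
    [1, 0, 2],
    [1, 0, 8],
    [1, 0, 5],
    [1, 1, 0],
], dtype=float)

# threshold=0 (the default) removes only the constant column.
mask_constant = VarianceThreshold(threshold=0).fit(X).get_support()

# A positive threshold also removes the quasi-constant column,
# whose variance is 0.1 * 0.9 = 0.09.
mask_quasi = VarianceThreshold(threshold=0.1).fit(X).get_support()

# In both masks, True marks a column that is kept.
print(mask_constant)
print(mask_quasi)
```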

### Python Implementation:

In [2]:
import pandas as pd
import numpy as np

In [4]:
# Loading the training data from the CSV file
train_df = pd.read_csv("train data credit card.csv")
train_df.head(5)

Unnamed: 0,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead
0,NNVBBKZB,Female,73,RG268,Other,X3,43,No,1045696,No,0
1,IDD62UNG,Female,30,RG277,Salaried,X1,32,No,581988,No,0
2,HD3DSEMC,Female,56,RG268,Self_Employed,X3,26,No,1484315,Yes,0
3,BF3NC7KV,Male,34,RG270,Salaried,X1,19,No,470454,No,0
4,TEASRWXV,Female,30,RG282,Salaried,X1,33,No,886787,No,0


In [5]:
train_df.shape

(245725, 11)

#### Shortening the huge dataset

In [12]:
# .loc slices by label and includes both ends, so this keeps rows 1 to 40000 (row 0 is skipped)
train = train_df.loc[1:40000,:]
train.shape

(40000, 11)

#### Filling Null values if any

In [16]:
train = train.fillna("None")
test = test.fillna("None")  # assumes a test DataFrame was loaded earlier (not shown here)

#### Dropping ID column, defining target

In [21]:
train1 = train.drop(["ID","Is_Lead"],axis=1)
y = train["Is_Lead"]
test1 = test.drop("ID",axis=1)  # assumes a test DataFrame was loaded earlier (not shown here)

Variance Threshold works only on numerical data, so we first need to convert the columns that are not integer or float. For this we will use Ordinal Encoder.
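One way to find the columns that need encoding is to filter by dtype. This is a minimal sketch on a hypothetical toy frame with the same mix of text and numeric columns as train1 (the values are made up):

```python
import pandas as pd

# Hypothetical toy frame with text and numeric columns.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Age": [41, 29, 63],
    "Is_Active": ["No", "No", "Yes"],
    "Avg_Account_Balance": [120000, 58000, 990000],
})

# Every column that is not numeric has to be encoded before thresholding.
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```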

#### Number of unique values in each column:

In [38]:
train1.nunique(axis=0)

Gender                     2
Age                       62
Region_Code               35
Occupation                 4
Channel_Code               4
Vintage                   66
Credit_Product             3
Avg_Account_Balance    35278
Is_Active                  2
dtype: int64

### Using Ordinal Encoder: Required Before Thresholding

In ordinal encoding, each unique category value is assigned an integer value. For example, “red” is 1, “green” is 2, and “blue” is 3. This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.
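A minimal sketch of the idea with scikit-learn's OrdinalEncoder, using a hypothetical one-column frame. With default settings the encoder sorts the categories and numbers them from zero, so the integers differ from the 1, 2, 3 used in the example sentence above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column of colour names.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(colors)

# Categories are stored in sorted order: blue -> 0, green -> 1, red -> 2.
print(encoder.categories_)
print(encoded.ravel())
```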

In [39]:
train1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 1 to 40000
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Gender               40000 non-null  object
 1   Age                  40000 non-null  int64 
 2   Region_Code          40000 non-null  object
 3   Occupation           40000 non-null  object
 4   Channel_Code         40000 non-null  object
 5   Vintage              40000 non-null  int64 
 6   Credit_Product       40000 non-null  object
 7   Avg_Account_Balance  40000 non-null  int64 
 8   Is_Active            40000 non-null  object
dtypes: int64(3), object(6)
memory usage: 2.7+ MB


In [45]:
# Import the ordinal encoder from sklearn
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()

# Columns that hold non-numeric categories
cat_cols = ["Gender","Region_Code","Occupation","Channel_Code","Credit_Product","Is_Active"]

# Transform the data
train1[cat_cols] = ord_enc.fit_transform(train1[cat_cols])

In [46]:
train1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 1 to 40000
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Gender               40000 non-null  float64
 1   Age                  40000 non-null  int64  
 2   Region_Code          40000 non-null  float64
 3   Occupation           40000 non-null  float64
 4   Channel_Code         40000 non-null  float64
 5   Vintage              40000 non-null  int64  
 6   Credit_Product       40000 non-null  float64
 7   Avg_Account_Balance  40000 non-null  int64  
 8   Is_Active            40000 non-null  float64
dtypes: float64(6), int64(3)
memory usage: 2.7 MB


## MAIN CODE:

#### Defining and Fitting Threshold

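Quasi-constant features have the same value in a very large share of the rows. For a binary (0/1) column the variance is p(1 - p), where p is the share of the most common value, so a threshold of about 0.01 drops columns in which roughly 99% of the values are identical.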
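The link between a "share of identical values" and a variance threshold can be checked numerically. This sketch assumes a binary 0/1 column; for columns with more levels or a different scale, the variance depends on the encoded values and the relation does not carry over. The helper name `quasi_threshold` is hypothetical.

```python
import numpy as np

def quasi_threshold(dominant_share: float) -> float:
    """Variance of a 0/1 column whose most common value covers `dominant_share` of the rows."""
    return dominant_share * (1 - dominant_share)

# Simulated binary column: 99 zeros and a single one.
col = np.array([0] * 99 + [1])

print(quasi_threshold(0.99))  # threshold implied by a 99% share
print(col.var())              # population variance of the simulated column
```

Both printed values agree up to floating-point rounding, which is why a threshold near 0.01 corresponds to a 99% share for binary columns.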

In [52]:
from sklearn.feature_selection import VarianceThreshold

var_thr = VarianceThreshold(threshold=0.25)  # removes constant and quasi-constant features (and every binary column)
var_thr.fit(train1)

var_thr.get_support()

array([False,  True,  True,  True,  True,  True,  True,  True, False])

In [59]:
sum(var_thr.get_support())   # Number of columns kept (variance above the threshold)

7

OUTPUT:
- True : High variance (column is kept)
- False: Low variance (column is dropped)
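To make the meaning of the mask concrete, it can be mapped back to column names. This is a minimal sketch on a hypothetical toy frame, not the credit card data:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: "flag" is a balanced binary column (variance 0.25).
X = pd.DataFrame({
    "flag": [0, 1, 0, 1, 0, 1],
    "age": [22, 35, 47, 51, 29, 63],
    "balance": [100, 2500, 40, 980, 15000, 7],
})

selector = VarianceThreshold(threshold=0.25).fit(X)
mask = selector.get_support()

kept = X.columns[mask].tolist()      # True  -> variance above the threshold
dropped = X.columns[~mask].tolist()  # False -> variance at or below the threshold

print(selector.variances_)
print("kept:", kept)
print("dropped:", dropped)
```

Even a perfectly balanced binary column is dropped here, because its variance equals the threshold and only columns strictly above it are kept.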

#### Picking Up the Low Variance Columns:

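In my code above I use a threshold of 0.25. A binary (0/1) column has variance p(1 - p), which never exceeds 0.25, and only columns with variance strictly above the threshold are kept, so this setting drops every binary column. To drop only binary columns in which one value covers 75% or more of the rows, the threshold would be 0.75 × (1 - 0.75) = 0.1875. You can use any value you prefer.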

In [61]:
concol = [column for column in train1.columns 
          if column not in train1.columns[var_thr.get_support()]]

for features in concol:
    print(features)

Gender
Is_Active


#### Dropping Low Variance Columns:

In [63]:
train1.drop(concol,axis=1)  # returns a new DataFrame; train1 itself is unchanged

Unnamed: 0,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance
1,30,27.0,2.0,0.0,32,0.0,581988
2,56,18.0,3.0,2.0,26,0.0,1484315
3,34,20.0,2.0,0.0,19,0.0,470454
4,30,32.0,2.0,0.0,33,0.0,886787
5,56,11.0,3.0,0.0,32,0.0,544163
...,...,...,...,...,...,...,...
39996,61,1.0,3.0,1.0,26,0.0,822920
39997,54,0.0,3.0,2.0,127,0.0,827797
39998,26,18.0,3.0,0.0,13,0.0,1254855
39999,40,33.0,0.0,2.0,86,1.0,1805249


In [64]:
train1.columns

Index(['Gender', 'Age', 'Region_Code', 'Occupation', 'Channel_Code', 'Vintage',
       'Credit_Product', 'Avg_Account_Balance', 'Is_Active'],
      dtype='object')
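The column listing above still contains Gender and Is_Active. A minimal sketch, on a hypothetical toy frame, of the difference between calling drop and assigning its result:

```python
import pandas as pd

# Hypothetical toy frame standing in for train1.
df = pd.DataFrame({
    "Gender": [0.0, 1.0, 0.0],
    "Age": [30, 56, 34],
    "Is_Active": [0.0, 1.0, 0.0],
})
concol = ["Gender", "Is_Active"]

# Calling drop without assignment leaves the original frame untouched.
df.drop(concol, axis=1)
unchanged = list(df.columns)

# Assigning the result keeps the reduced set of columns.
reduced = df.drop(columns=concol)

print(unchanged)
print(list(reduced.columns))
```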

This is how we can see which columns have high variance and are therefore more likely to contribute to better models. Don't forget to convert the columns' dtype to integer or float before applying the threshold.

Once you identify your low variance columns, you can always reverse the encoding and continue your journey :)
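Reversing the encoding can be sketched with the encoder's inverse_transform method. The frame, column lists, and the flagged column below are hypothetical. Note that the encoder expects the same columns, in the same order, that it was fitted on, so this sketch decodes before dropping anything.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame standing in for train1.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Occupation": ["Other", "Salaried", "Salaried"],
    "Age": [41, 29, 63],
})
cat_cols = ["Gender", "Occupation"]

# Encode the categorical columns into a copy of the frame.
ord_enc = OrdinalEncoder()
df_enc = df.copy()
df_enc[cat_cols] = ord_enc.fit_transform(df[cat_cols])

# ... thresholding happens here; suppose "Gender" was flagged ...
low_variance_cols = ["Gender"]

# Decode all fitted columns first, then drop the flagged ones.
restored = df_enc.copy()
restored[cat_cols] = ord_enc.inverse_transform(df_enc[cat_cols])
restored = restored.drop(columns=low_variance_cols)

print(restored)
```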