## Duplicated features

Datasets often contain two or more features that show the same values across all the observations. Such features are, in essence, identical. In addition, it is not unusual to introduce duplicated features when performing **one hot encoding** of categorical variables, particularly when encoding several highly cardinal variables.
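To make the one hot encoding case concrete, here is a minimal sketch on a made-up dataframe. The `city` and `country` variables and their values are hypothetical and chosen only so that two categories always co-occur.

```python
import pandas as pd

# Hypothetical toy data: every customer in Paris is also in France,
# so two of the dummy variables will carry the same information.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "London", "Leeds"],
    "country": ["France", "France", "UK", "UK"],
})

# one hot encode both categorical variables
encoded = pd.get_dummies(df, dtype=int)
print(encoded.columns.tolist())

# city_Paris and country_France hold the same value in every row
same = encoded["city_Paris"].equals(encoded["country_France"])
print(same)
```

With real data the overlap is rarely this obvious, which is why it helps to check for duplicates systematically after encoding.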

Identifying and removing duplicated, and therefore redundant, features is an easy first step towards feature selection and more easily interpretable machine learning models.

Here I will demonstrate how to identify duplicated features using the Santander Customer Satisfaction dataset from Kaggle. 

Pandas has no dedicated function to find duplicated columns. I will show 2 snippets of code: one that you can apply to small datasets, and a second one that you can use on larger datasets. The first snippet is computationally costly, so your computer might run out of memory if the dataset is big.

**Note**
Finding duplicated features can be a computationally costly operation. Depending on the size of your dataset, you might not always be able to perform it.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

## Removing duplicate features

In [None]:
# load the Santander customer satisfaction dataset from Kaggle
# I load just a few rows for the demonstration
data = pd.read_csv('santander.csv', nrows=15000)
data.shape

(15000, 371)

In [None]:
# check the presence of null data.
# The snippets below will be able to compare nan values between 2 columns,
# so in principle missing data are not a problem.
# in any case, we see that there are no missing data in this dataset

[col for col in data.columns if data[col].isnull().sum() > 0]

[]
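As a quick sanity check of the comment above, the following sketch uses a made-up dataframe (the column names `a`, `b` and `c` are hypothetical) to show how missing values are treated by the two comparisons used in this notebook, `Series.equals` and `duplicated` on the transposed dataframe.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, np.nan, 3.0],  # same values and same NaN position as "a"
    "c": [1.0, 2.0, 3.0],     # a value where "a" has a NaN
})

# NaN in the same position counts as equal
nan_equal = toy["a"].equals(toy["b"])

# NaN against an actual value does not
nan_vs_value = toy["a"].equals(toy["c"])

# the transposed approach flags "b" as a duplicate of "a"
dup_mask = toy.T.duplicated()

print(nan_equal, nan_vs_value)
print(dup_mask)
```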

### Important

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting.

In [None]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((10500, 370), (4500, 370))

Pandas has the method 'duplicated', which evaluates whether the dataframe contains duplicated rows. We can use this method to check for duplicated columns if we transpose the dataframe first. By transposing the dataframe, we obtain a new dataframe where the columns are now rows, and with the 'duplicated' method we can go ahead and identify those that are duplicated.

Once we identify them, we can remove the duplicated rows. See below.
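Before running this on the Santander data, the idea can be illustrated on a small made-up dataframe. The column names `x1` to `x4` are hypothetical; this is only a sketch of the transpose-then-`duplicated` approach described above.

```python
import pandas as pd

toy = pd.DataFrame({
    "x1": [1, 2, 3, 4],
    "x2": [0, 0, 1, 1],
    "x3": [1, 2, 3, 4],  # copy of x1
    "x4": [0, 0, 1, 1],  # copy of x2
})

# columns become rows
toy_t = toy.T

# True for every row that repeats an earlier row
mask = toy_t.duplicated(keep="first")
dup_cols = toy_t.index[mask].tolist()

# drop the repeated rows and transpose back
toy_unique = toy_t.drop_duplicates(keep="first").T

print(dup_cols)
print(toy_unique.columns.tolist())
```

With `keep="first"`, the first column of each group of identical columns is retained and the later copies are flagged.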

### Code Snippet for small datasets

Transposing a dataframe with pandas is computationally expensive, so the computer may run out of memory. That is why we can only use this code block on small datasets. How small will depend on your computer specifications.

In [None]:
# transpose the dataframe, so that the columns are the rows of the new dataframe
data_t = X_train.T
data_t.head()

Unnamed: 0,10439,9236,818,11504,11722,5276,6863,13463,10228,11462,10869,7086,1690,8430,6859,9607,7710,11780,13838,3955,13830,10134,3130,14237,14970,3243,4287,11486,10426,13607,12086,4132,1704,10080,8640,1533,1653,12200,5856,8879,...,797,755,10200,8291,2496,7599,1871,2046,7877,4851,5072,2163,6036,6921,6216,11085,537,9893,2897,7768,2222,10327,2599,705,14650,3468,6744,14935,14116,5874,4373,7891,9225,14019,4859,13123,3264,9845,10799,2732
ID,20941.0,18583.0,1623.0,23060.0,23512.0,10564.0,13779.0,26969.0,20502.0,22981.0,21788.0,14212.0,3370.0,16995.0,13769.0,19299.0,15492.0,23621.0,27748.0,7927.0,27737.0,20333.0,6307.0,28615.0,30109.0,6520.0,8621.0,23022.0,20920.0,27280.0,24217.0,8293.0,3400.0,20224.0,17406.0,3024.0,3281.0,24475.0,11748.0,17874.0,...,1592.0,1515.0,20451.0,16719.0,5003.0,15262.0,3745.0,4084.0,15866.0,9708.0,10130.0,4328.0,12130.0,13876.0,12494.0,22215.0,1085.0,19885.0,5789.0,15640.0,4464.0,20720.0,5204.0,1420.0,29458.0,6952.0,13545.0,30044.0,28354.0,11799.0,8783.0,15901.0,18564.0,28142.0,9723.0,26306.0,6557.0,19796.0,21653.0,5441.0
var3,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
var15,23.0,39.0,22.0,23.0,37.0,23.0,27.0,43.0,23.0,27.0,23.0,23.0,28.0,25.0,32.0,42.0,24.0,62.0,31.0,48.0,45.0,26.0,38.0,29.0,24.0,23.0,23.0,23.0,50.0,45.0,48.0,43.0,23.0,32.0,23.0,48.0,23.0,25.0,42.0,30.0,...,36.0,28.0,31.0,35.0,25.0,23.0,23.0,28.0,23.0,23.0,58.0,31.0,37.0,38.0,39.0,37.0,24.0,23.0,23.0,30.0,23.0,43.0,26.0,23.0,45.0,43.0,58.0,24.0,25.0,25.0,23.0,24.0,33.0,45.0,24.0,37.0,24.0,38.0,28.0,23.0
imp_ent_var16_ult1,0.0,0.0,150.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,150.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1321.74,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
imp_op_var39_comer_ult1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,160.35,0.0,0.0,0.0,30.0,30.0,0.0,0.0,0.0,2197.41,1038.09,0.0,6.81,0.0,0.0,0.0,57.45,0.0,0.0,42.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,380.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1009.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3616.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# check if there are duplicated rows (the columns of the original dataframe)
# this is a computationally expensive operation, so it might take a while
# sum indicates how many rows are duplicated

data_t.duplicated().sum()

105

We can see that 105 columns / variables are duplicated. This means that 105 variables are identical to at least one other variable that appears earlier in the training set.

In [None]:
# visualise the duplicated rows (the columns of the original dataframe)
data_t[data_t.duplicated()]

Unnamed: 0,10439,9236,818,11504,11722,5276,6863,13463,10228,11462,10869,7086,1690,8430,6859,9607,7710,11780,13838,3955,13830,10134,3130,14237,14970,3243,4287,11486,10426,13607,12086,4132,1704,10080,8640,1533,1653,12200,5856,8879,...,797,755,10200,8291,2496,7599,1871,2046,7877,4851,5072,2163,6036,6921,6216,11085,537,9893,2897,7768,2222,10327,2599,705,14650,3468,6744,14935,14116,5874,4373,7891,9225,14019,4859,13123,3264,9845,10799,2732
ind_var2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var13_medio_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var13_medio,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var18_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ind_var18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
saldo_medio_var13_medio_hace2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
saldo_medio_var13_medio_hace3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
saldo_medio_var13_medio_ult1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
saldo_medio_var13_medio_ult3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# we can capture the duplicated features, by capturing the
# index values of the transposed dataframe like this:
duplicated_features = data_t[data_t.duplicated()].index.values
duplicated_features

array(['ind_var2', 'ind_var13_medio_0', 'ind_var13_medio', 'ind_var18_0',
       'ind_var18', 'ind_var26', 'ind_var25', 'ind_var27_0',
       'ind_var28_0', 'ind_var28', 'ind_var27', 'ind_var29_0',
       'ind_var29', 'ind_var32', 'ind_var34_0', 'ind_var34', 'ind_var37',
       'ind_var40_0', 'ind_var40', 'ind_var41', 'ind_var39', 'ind_var44',
       'ind_var46_0', 'ind_var46', 'num_var13_medio_0', 'num_var13_medio',
       'num_var18_0', 'num_var18', 'num_var26', 'num_var25',
       'num_var27_0', 'num_var28_0', 'num_var28', 'num_var27',
       'num_var29_0', 'num_var29', 'num_var32', 'num_var34_0',
       'num_var34', 'num_var37', 'num_var40_0', 'num_var40', 'num_var41',
       'num_var39', 'num_var46_0', 'num_var46', 'saldo_var13_medio',
       'saldo_var18', 'saldo_var28', 'saldo_var27', 'saldo_var29',
       'saldo_var34', 'saldo_var40', 'saldo_var41', 'saldo_var46',
       'delta_imp_amort_var18_1y3', 'delta_imp_amort_var34_1y3',
       'delta_imp_reemb_var17_1y3', 'delta_imp_ree

In [None]:
# alternatively, we can remove the duplicated rows,
# transpose the dataframe back to the variables as columns
# keep first indicates that we keep the first of a set of
# duplicated variables

data_unique = data_t.drop_duplicates(keep='first').T
data_unique.shape

(10500, 265)

We can see immediately how removing duplicated features helps reduce the feature space. We went from 370 to 265 non-duplicated features.

In [None]:
# to find those columns in the training set that were removed
# (we compare against X_train rather than data, so that TARGET is not listed):

duplicated_features = [col for col in X_train.columns if col not in data_unique.columns]
duplicated_features

['ind_var2',
 'ind_var13_medio_0',
 'ind_var13_medio',
 'ind_var18_0',
 'ind_var18',
 'ind_var26',
 'ind_var25',
 'ind_var27_0',
 'ind_var28_0',
 'ind_var28',
 'ind_var27',
 'ind_var29_0',
 'ind_var29',
 'ind_var32',
 'ind_var34_0',
 'ind_var34',
 'ind_var37',
 'ind_var40_0',
 'ind_var40',
 'ind_var41',
 'ind_var39',
 'ind_var44',
 'ind_var46_0',
 'ind_var46',
 'num_var13_medio_0',
 'num_var13_medio',
 'num_var18_0',
 'num_var18',
 'num_var26',
 'num_var25',
 'num_var27_0',
 'num_var28_0',
 'num_var28',
 'num_var27',
 'num_var29_0',
 'num_var29',
 'num_var32',
 'num_var34_0',
 'num_var34',
 'num_var37',
 'num_var40_0',
 'num_var40',
 'num_var41',
 'num_var39',
 'num_var46_0',
 'num_var46',
 'saldo_var13_medio',
 'saldo_var18',
 'saldo_var28',
 'saldo_var27',
 'saldo_var29',
 'saldo_var34',
 'saldo_var40',
 'saldo_var41',
 'saldo_var46',
 'delta_imp_amort_var18_1y3',
 'delta_imp_amort_var34_1y3',
 'delta_imp_reemb_var17_1y3',
 'delta_imp_reemb_var33_1y3',
 'delta_imp_trasp_var17_out_1y3

### Code Snippet for big datasets

Transposing a dataframe is costly in terms of memory if the dataframe is big. Therefore, we can use an alternative approach, a loop that compares the columns pair by pair, to find duplicated columns in bigger datasets.

In this case, I will use the same dataset, Santander from Kaggle, but I will load more rows. I expect to see fewer duplicated features, because the more customers there are in the dataset, the less likely it is that 2 features show the same value for every single customer. But this might not be the case. Let's have a look.

In [None]:
# load the dataset
data = pd.read_csv('santander.csv', nrows=50000)

# separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

In [None]:
# check for duplicated features in the training set
duplicated_feat = []
for i in range(0, len(X_train.columns)):
    if i % 10 == 0:  # this helps me understand how the loop is going
        print(i)

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            duplicated_feat.append(col_2)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360


In [None]:
# check how many features are duplicated
print(len(set(duplicated_feat)))

80


There are fewer duplicated features than when I loaded a smaller sample of the dataset. This behaviour is expected. Ideally, you should work with the entire dataset.
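One caveat when comparing the counts from the two snippets: they do not treat data types in the same way. Transposing a dataframe with mixed integer and float columns upcasts the values to a common type, whereas `Series.equals` also requires the two columns to share the same dtype. The sketch below uses a made-up dataframe (the column names are hypothetical) to show the difference. Whether this plays any role in the Santander counts depends on the column dtypes, which are not inspected in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "as_int": [0, 1, 0],          # stored as int64
    "as_float": [0.0, 1.0, 0.0],  # same numbers, stored as float64
})

# the loop approach: equals() is strict about the dtype
strict = toy["as_int"].equals(toy["as_float"])

# the transpose approach: both rows become float64, so they match
via_transpose = toy.T.duplicated().tolist()

# casting to a common dtype first makes equals() agree
relaxed = toy["as_int"].astype("float64").equals(toy["as_float"])

print(strict, via_transpose, relaxed)
```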

In [None]:
# let's print the list of duplicated features
set(duplicated_feat)

{'delta_imp_amort_var34_1y3',
 'delta_imp_reemb_var33_1y3',
 'delta_num_reemb_var13_1y3',
 'delta_num_reemb_var17_1y3',
 'delta_num_reemb_var33_1y3',
 'delta_num_trasp_var17_in_1y3',
 'delta_num_trasp_var17_out_1y3',
 'delta_num_trasp_var33_in_1y3',
 'delta_num_trasp_var33_out_1y3',
 'delta_num_venta_var44_1y3',
 'imp_amort_var18_hace3',
 'imp_amort_var34_hace3',
 'imp_reemb_var13_hace3',
 'imp_reemb_var17_hace3',
 'imp_reemb_var33_hace3',
 'imp_reemb_var33_ult1',
 'imp_trasp_var17_out_hace3',
 'imp_trasp_var33_out_hace3',
 'imp_venta_var44_hace3',
 'ind_var13_medio',
 'ind_var13_medio_0',
 'ind_var18',
 'ind_var2',
 'ind_var25',
 'ind_var26',
 'ind_var27',
 'ind_var27_0',
 'ind_var28',
 'ind_var28_0',
 'ind_var29',
 'ind_var29_0',
 'ind_var32',
 'ind_var34',
 'ind_var34_0',
 'ind_var37',
 'ind_var39',
 'ind_var41',
 'ind_var46',
 'ind_var46_0',
 'num_meses_var13_medio_ult3',
 'num_reemb_var13_hace3',
 'num_reemb_var17_hace3',
 'num_reemb_var33_hace3',
 'num_reemb_var33_ult1',
 'num_tr

In [None]:
# we can go ahead and try to identify which sets of features
# are identical to each other

duplicated_feat = []
for i in range(0, len(X_train.columns)):

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:

        # if the features are duplicated
        if X_train[col_1].equals(X_train[col_2]):

            #print them
            print(col_1)
            print(col_2)
            print()

            # and then append the duplicated one to a
            # list
            duplicated_feat.append(col_2)

ind_var2_0
ind_var2

ind_var2_0
ind_var13_medio_0

ind_var2_0
ind_var13_medio

ind_var2_0
ind_var27_0

ind_var2_0
ind_var28_0

ind_var2_0
ind_var28

ind_var2_0
ind_var27

ind_var2_0
ind_var34_0

ind_var2_0
ind_var34

ind_var2_0
ind_var41

ind_var2_0
ind_var46_0

ind_var2_0
ind_var46

ind_var2_0
num_var13_medio_0

ind_var2_0
num_var13_medio

ind_var2_0
num_var27_0

ind_var2_0
num_var28_0

ind_var2_0
num_var28

ind_var2_0
num_var27

ind_var2_0
num_var34_0

ind_var2_0
num_var34

ind_var2_0
num_var41

ind_var2_0
num_var46_0

ind_var2_0
num_var46

ind_var2_0
saldo_var13_medio

ind_var2_0
saldo_var28

ind_var2_0
saldo_var27

ind_var2_0
saldo_var34

ind_var2_0
saldo_var41

ind_var2_0
saldo_var46

ind_var2_0
delta_imp_amort_var34_1y3

ind_var2_0
delta_imp_reemb_var33_1y3

ind_var2_0
delta_num_reemb_var33_1y3

ind_var2_0
imp_amort_var18_hace3

ind_var2_0
imp_amort_var34_hace3

ind_var2_0
imp_reemb_var13_hace3

ind_var2_0
imp_reemb_var17_hace3

ind_var2_0
imp_reemb_var33_hace3

ind_var2_0
imp_re
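The printed pairs get long quickly when many features are identical to the same one. As an alternative, the sketch below collects the duplicates into a dictionary that maps each retained feature to the list of features identical to it. The helper `group_duplicated_columns` and the toy dataframe are hypothetical and not part of the original notebook; the comparison itself is the same `equals` check used in the loop above.

```python
import pandas as pd


def group_duplicated_columns(df):
    """Map each kept column to the later columns that are identical to it."""
    groups = {}
    already_flagged = set()
    cols = list(df.columns)

    for i, col_1 in enumerate(cols):
        # a column already flagged as a duplicate needs no further checks
        if col_1 in already_flagged:
            continue

        for col_2 in cols[i + 1:]:
            if col_2 in already_flagged:
                continue
            if df[col_1].equals(df[col_2]):
                groups.setdefault(col_1, []).append(col_2)
                already_flagged.add(col_2)

    return groups


# made-up data: x3 and x5 copy x1, x4 copies x2
toy = pd.DataFrame({
    "x1": [0, 0, 0],
    "x2": [1, 2, 3],
    "x3": [0, 0, 0],
    "x4": [1, 2, 3],
    "x5": [0, 0, 0],
})

groups = group_duplicated_columns(toy)
print(groups)
```

Skipping columns that were already flagged also avoids appending the same feature more than once, so there is no need to wrap the result in `set()` afterwards.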

In [None]:
# let's check that indeed those features are duplicated
# I select a random pair from above

X_train[['ind_var2_0', 'num_var34_0']].head(10)

Unnamed: 0,ind_var2_0,num_var34_0
17967,0,0
32391,0,0
9341,0,0
7929,0,0
46544,0,0
4149,0,0
33426,0,0
3002,0,0
6974,0,0
16864,0,0


In [None]:
# let's check that indeed those features are duplicated
# I select another random pair from above

X_train[['ind_var2_0', 'ind_var2']].head(10)

Unnamed: 0,ind_var2_0,ind_var2
17967,0,0
32391,0,0
9341,0,0
7929,0,0
46544,0,0
4149,0,0
33426,0,0
3002,0,0
6974,0,0
16864,0,0


We can see that the features are identical.
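The natural last step is to drop the duplicated features, identified on the training set only, from both the training and the test sets. Here is a minimal sketch on made-up data. The `_toy` dataframes and the list `duplicated_feat_toy` are hypothetical stand-ins for `X_train`, `X_test` and `duplicated_feat`.

```python
import pandas as pd

X_train_toy = pd.DataFrame({"x1": [0, 1, 0], "x2": [5, 6, 7], "x3": [0, 1, 0]})
X_test_toy = pd.DataFrame({"x1": [1, 1], "x2": [8, 9], "x3": [1, 0]})

# duplicated features as found on the training set (x3 copies x1 there)
duplicated_feat_toy = ["x3"]

# remove the same columns from both sets
X_train_reduced = X_train_toy.drop(columns=duplicated_feat_toy)
X_test_reduced = X_test_toy.drop(columns=duplicated_feat_toy)

print(X_train_reduced.shape, X_test_reduced.shape)
```

If the names are held in a set, as with `set(duplicated_feat)` above, converting it to a list before passing it to `drop` keeps the call straightforward.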
