## Duplicated features

Datasets often contain duplicated features, that is, features that are identical despite having different names.

They add no new information, and they can also be introduced by **one hot encoding** of categorical variables.

Pandas has no dedicated function to find duplicated columns (`DataFrame.duplicated` works on rows), so we need to write some code to do so.

The method below works for both **numerical and categorical** variables.

(In this demo, I will show how to deal with duplicated features using a toy dataset.)
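
To make the definition concrete, here is a minimal sketch on made-up data. The column names (`age`, `age_copy`, `city`, `town`, `income`) are hypothetical and not part of the dataset used below. It relies on `Series.equals`, which compares the values of two columns and does not look at their names.

```python
import pandas as pd

# hypothetical toy data: two pairs of identical columns under different names
toy = pd.DataFrame({
    'age': [25, 32, 47],
    'age_copy': [25, 32, 47],
    'city': ['Rome', 'Oslo', 'Lima'],
    'town': ['Rome', 'Oslo', 'Lima'],
    'income': [10, 20, 30],
})

# Series.equals compares the values, not the column names
numeric_dup = toy['age'].equals(toy['age_copy'])    # numerical pair
categorical_dup = toy['city'].equals(toy['town'])   # categorical pair
not_dup = toy['age'].equals(toy['income'])          # same dtype, different values

print(numeric_dup, categorical_dup, not_dup)
```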

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

In [2]:
# load dataset

data = pd.read_csv('../dataset_1.csv')
data.shape

(50000, 301)

**Note**: in all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting.

In [3]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

### Remove constant and quasi-constant features

In [4]:
# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

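    # find the share of the most frequent value
    # (np.float was removed from recent NumPy versions, so use the built-in float)
    predominant = (X_train[feature].value_counts() / float(
        len(X_train))).sort_values(ascending=False).values[0]

    # flag the feature if that share exceeds 99.8% of the observations
    if predominant > 0.998: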
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

142
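
The same kind of check can be written more compactly with `value_counts(normalize=True)`, which returns the share of each value directly. The sketch below is an illustration only: the helper name `find_quasi_constant` and the toy data are assumptions, while the 0.998 threshold mirrors the cell above. Note that it counts missing values as a value of their own (`dropna=False`), which is a small difference from the cell above.

```python
import pandas as pd

def find_quasi_constant(df, threshold=0.998):
    """Return columns whose most frequent value covers more than `threshold` of the rows."""
    quasi_constant = []
    for feature in df.columns:
        # share of each value, sorted in descending order by default;
        # dropna=False makes NaN count as a value, so the shares sum to 1
        shares = df[feature].value_counts(normalize=True, dropna=False)
        if shares.iloc[0] > threshold:
            quasi_constant.append(feature)
    return quasi_constant

# hypothetical toy data with 1000 rows
toy = pd.DataFrame({
    'constant': [1] * 1000,               # one value in 100% of rows
    'almost_constant': [0] * 999 + [1],   # one value in 99.9% of rows
    'varied': list(range(1000)),          # every value is unique
})

quasi_constant_cols = find_quasi_constant(toy)
print(quasi_constant_cols)
```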

In [5]:
# drop these columns from the train and test sets:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))

### Remove duplicated features

In [6]:
# create an empty dictionary to store the groups of duplicates
duplicated_feat_pairs = {}

# create an empty list to collect duplicated features
_duplicated_feat = []

for i in range(0, len(X_train.columns)):

    feat_1 = X_train.columns[i]

    if feat_1 not in _duplicated_feat:
    
        # create an empty list as an entry for this feature in the dictionary
        duplicated_feat_pairs[feat_1] = []

        # iterate over the remaining features of the dataset
        for feat_2 in X_train.columns[i + 1:]:

            # check if this second feature is identical to the first one
            if X_train[feat_1].equals(X_train[feat_2]):

                # if yes, append it to the list in the dictionary
                duplicated_feat_pairs[feat_1].append(feat_2)
                
                # and append it to the list with duplicated variable names only
                _duplicated_feat.append(feat_2)

In [7]:
print(_duplicated_feat)
print(len(_duplicated_feat))

['var_148', 'var_199', 'var_296', 'var_250', 'var_232', 'var_269']
6


We found 6 features that were duplicates of others.
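
As a cross-check, a shorter pandas-only route is to transpose the frame, so that columns become rows, and then call `DataFrame.duplicated`. This is a sketch on hypothetical toy data, not the method used in this notebook. Transposing copies the data, so it may be slow or memory-hungry on large frames, and columns of different dtypes are converted to a common dtype during the transpose, so results may differ from the `equals`-based loop in edge cases.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [1, 2, 3],   # duplicate of 'a'
    'c': [3, 2, 1],
    'd': [1, 2, 3],   # duplicate of 'a'
})

# after transposing, each original column is a row;
# duplicated() marks every repeat of an earlier row as True
is_duplicate = toy.T.duplicated()

# names of the columns that repeat an earlier column
duplicated_cols = is_duplicate[is_duplicate].index.tolist()
print(duplicated_cols)

# keep only the first occurrence of each group of identical columns
toy_unique = toy.loc[:, ~is_duplicate.values]
print(toy_unique.columns.tolist())
```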

In [8]:
duplicated_feat_pairs

{'var_4': [],
 'var_5': [],
 'var_8': [],
 'var_13': [],
 'var_15': [],
 'var_17': [],
 'var_18': [],
 'var_19': [],
 'var_21': [],
 'var_22': [],
 'var_25': [],
 'var_26': [],
 'var_27': [],
 'var_29': [],
 'var_30': [],
 'var_31': [],
 'var_35': [],
 'var_37': ['var_148'],
 'var_38': [],
 'var_41': [],
 'var_46': [],
 'var_47': [],
 'var_49': [],
 'var_50': [],
 'var_51': [],
 'var_52': [],
 'var_54': [],
 'var_55': [],
 'var_57': [],
 'var_58': [],
 'var_62': [],
 'var_63': [],
 'var_64': [],
 'var_68': [],
 'var_70': [],
 'var_74': [],
 'var_75': [],
 'var_76': [],
 'var_79': [],
 'var_82': [],
 'var_83': [],
 'var_84': ['var_199'],
 'var_85': [],
 'var_86': [],
 'var_88': [],
 'var_91': [],
 'var_93': [],
 'var_94': [],
 'var_96': [],
 'var_100': [],
 'var_101': [],
 'var_103': [],
 'var_105': [],
 'var_107': [],
 'var_108': [],
 'var_109': [],
 'var_110': [],
 'var_114': [],
 'var_117': [],
 'var_118': [],
 'var_119': [],
 'var_121': [],
 'var_123': [],
 'var_128': [],
 'var_131'

Every retained feature is a key in the dictionary, and its value is the list of its duplicates, which is empty when the feature has none. Let's now look at the features that do have duplicates:

In [9]:
# let's explore the number of keys in our dictionary

# we see it is 152, because 6 of the 158 were duplicates,
# so they were not included as keys

print(len(duplicated_feat_pairs.keys()))

152


In [10]:
# print each feature together with its duplicates

for feat in duplicated_feat_pairs.keys():
    
    if len(duplicated_feat_pairs[feat]) > 0:

        # print the feature and its duplicates:
        print(feat, duplicated_feat_pairs[feat])
        print()

var_37 ['var_148']

var_84 ['var_199']

var_143 ['var_296']

var_177 ['var_250']

var_226 ['var_232']

var_229 ['var_269']



In [11]:
# finally, remove the duplicates by retaining
# the keys of the dictionary

X_train = X_train[list(duplicated_feat_pairs.keys())]
X_test = X_test[list(duplicated_feat_pairs.keys())]

X_train.shape, X_test.shape

((35000, 152), (15000, 152))
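
The steps above can be wrapped in a small reusable helper. The following is a minimal sketch under stated assumptions: the function name `find_duplicated_features` and the toy `train` and `test` frames are hypothetical, and the logic simply mirrors the `equals`-based loop used in this notebook. Duplicates are identified on the training set only, and the same columns are then kept in the test set.

```python
import pandas as pd

def find_duplicated_features(df):
    """Map each retained column to the list of later columns identical to it."""
    groups = {}
    duplicated = set()
    columns = list(df.columns)
    for i, feat_1 in enumerate(columns):
        # skip columns already identified as a duplicate of an earlier one
        if feat_1 in duplicated:
            continue
        groups[feat_1] = []
        for feat_2 in columns[i + 1:]:
            if df[feat_1].equals(df[feat_2]):
                groups[feat_1].append(feat_2)
                duplicated.add(feat_2)
    return groups

# hypothetical toy data: x2 duplicates x1 and x4 duplicates x3 in the train set
train = pd.DataFrame({
    'x1': [1, 2, 3], 'x2': [1, 2, 3],
    'x3': ['a', 'b', 'a'], 'x4': ['a', 'b', 'a'],
})
test = pd.DataFrame({
    'x1': [4, 5], 'x2': [4, 6],
    'x3': ['b', 'b'], 'x4': ['b', 'a'],
})

# decide on the train set, then apply the same selection to the test set
groups = find_duplicated_features(train)
keep = list(groups.keys())
train_reduced = train[keep]
test_reduced = test[keep]

print(groups)
print(train_reduced.shape, test_reduced.shape)
```

Passing a list of column names, rather than the dictionary's key view, is a deliberate choice here, since a plain list is the safest kind of column indexer across pandas versions.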