## Duplicated features

Datasets often contain two or more features that show the same values across all observations, which means those features are in essence identical. In addition, it is not unusual to introduce duplicated features when performing **one hot encoding** of categorical variables, particularly when encoding several high-cardinality variables.

Identifying and removing duplicated, and therefore redundant, features is an easy first step towards feature selection and more easily interpretable machine learning models.
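To make the one hot encoding case concrete, here is a minimal sketch on a made-up toy dataframe. The `city` and `country` columns are hypothetical and not part of the dataset used in this lecture. Because each city maps to exactly one country, the dummy variables of the two columns come out identical.

```python
import pandas as pd

# hypothetical toy data: every city belongs to exactly one country
toy = pd.DataFrame({
    "city": ["London", "Paris", "London", "Rome"],
    "country": ["UK", "France", "UK", "Italy"],
})

# one hot encode both categorical variables
dummies = pd.get_dummies(toy, dtype=int)

# transpose so that columns become rows, then flag repeated rows
dup_mask = dummies.T.duplicated().to_numpy()
duplicated_dummies = dummies.columns[dup_mask].tolist()

print(duplicated_dummies)
```

In this toy case, every `country_*` dummy repeats one of the `city_*` dummies, so half of the encoded columns are redundant.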

Here I will demonstrate how to identify duplicated features using a small advertising dataset (Advertising.csv) that contains one duplicated column.

Pandas does not offer a dedicated function to find duplicated columns. I will show 2 snippets of code: one that you can apply to small datasets, and a second that you can use on larger datasets. The first snippet is computationally costly, so your computer might run out of memory on big dataframes.

**Note**
Finding duplicated features is a computationally costly operation in Python. Depending on the size of your dataset, you might not always be able to perform it.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

## Removing duplicate features

In [15]:
# load the advertising dataset
# nrows caps the number of rows read; this file holds only 200 rows,
# so all of them are loaded
data = pd.read_csv('Advertising.csv', nrows=15000)
data.shape

(200, 8)

In [16]:
# check for the presence of null data.
# The snippets below can compare nan values between 2 columns,
# so in principle missing data are not a problem.
# In any case, we see that there are no missing data in this dataset

[col for col in data.columns if data[col].isnull().sum() > 0]

[]
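As a quick illustration of why missing data are not a problem for the comparisons used in this lecture, here is a minimal sketch on a hypothetical toy dataframe (the columns `a`, `b` and `c` are made up). It contrasts an elementwise `==` comparison, where nan never equals nan, with `Series.equals` and with `duplicated` on the transposed dataframe, both of which treat nan values in the same position as equal.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: a and b are identical, including the nan position
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": [1.0, 2.0, 3.0],
})

# elementwise comparison: nan == nan is False, so this reports a mismatch
elementwise_equal = bool((toy["a"] == toy["b"]).all())

# Series.equals treats nan values in the same location as equal
equals_result = toy["a"].equals(toy["b"])

# duplicated on the transposed dataframe also matches the nan values
transposed_dups = toy.T.duplicated().tolist()

print(elementwise_equal, equals_result, transposed_dups)
```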

### Important

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting.

In [17]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['sales'], axis=1),
    data['sales'],
    test_size=0.2,
    random_state=0)

X_train.shape, X_test.shape

((160, 7), (40, 7))

Pandas has the method 'duplicated', which evaluates whether the dataframe contains duplicated rows. We can use this method to check for duplicated columns if we transpose the dataframe first. By transposing the dataframe, we obtain a new dataframe where the columns are now rows, and with the 'duplicated' method we can go ahead and identify those that are duplicated.

Once we identify them, we can remove the duplicated rows. See below.

### Code Snippet for small datasets

Transposing a dataframe with pandas is computationally expensive, so the computer may run out of memory. That is why we can only use this code block on small datasets. How small depends on your computer's specifications.

In [18]:
# transpose the dataframe, so that the columns are the rows of the new dataframe
data_t = X_train.T
data_t.head()

Unnamed: 0,134,66,26,113,168,63,8,75,118,143,...,87,36,21,9,103,67,192,117,47,172
Unnamed: 0,135.0,67.0,27.0,114.0,169.0,64.0,9.0,76.0,119.0,144.0,...,88.0,37.0,22.0,10.0,104.0,68.0,193.0,118.0,48.0,173.0
TV,36.9,31.5,142.9,209.6,215.4,102.7,8.6,16.9,125.7,104.6,...,110.7,266.9,237.4,199.8,187.9,139.3,17.2,76.4,239.9,19.6
radio,38.6,24.6,29.3,20.6,23.6,29.6,2.1,43.7,36.9,5.7,...,40.6,43.8,5.1,2.6,17.2,14.5,4.1,0.8,41.5,20.1
radio1,38.6,24.6,29.3,20.6,23.6,29.6,2.1,43.7,36.9,5.7,...,40.6,43.8,5.1,2.6,17.2,14.5,4.1,0.8,41.5,20.1
Constant,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [19]:
# check if there are duplicated rows (the columns of the original dataframe)
# this is a computationally expensive operation, so it might take a while
# sum indicates how many rows are duplicated

data_t.duplicated().sum()

1

We can see that 1 column / variable is duplicated. This means that 1 variable is identical to at least one other variable in the dataset.

In [20]:
# visualise the duplicated rows (the columns of the original dataframe)
data_t[data_t.duplicated()]

Unnamed: 0,134,66,26,113,168,63,8,75,118,143,...,87,36,21,9,103,67,192,117,47,172
radio1,38.6,24.6,29.3,20.6,23.6,29.6,2.1,43.7,36.9,5.7,...,40.6,43.8,5.1,2.6,17.2,14.5,4.1,0.8,41.5,20.1


In [21]:
# we can capture the duplicated features, by capturing the
# index values of the transposed dataframe like this:
duplicated_features = data_t[data_t.duplicated()].index.values
duplicated_features

array(['radio1'], dtype=object)
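By default, `duplicated` flags only the second and later occurrences, so the list above does not tell us which column `radio1` is a copy of. As a small sketch on a hypothetical three-row toy dataframe (values chosen for illustration), passing `keep=False` flags every member of a duplicated group instead.

```python
import pandas as pd

# hypothetical toy data mirroring the column names used in this lecture
toy = pd.DataFrame({
    "TV": [36.9, 31.5, 142.9],
    "radio": [38.6, 24.6, 29.3],
    "radio1": [38.6, 24.6, 29.3],
})
toy_t = toy.T

# default keep='first': only the later copies are flagged
later_copies = toy_t[toy_t.duplicated()].index.tolist()

# keep=False: every column that has at least one identical twin is flagged
all_members = toy_t[toy_t.duplicated(keep=False)].index.tolist()

print(later_copies)
print(all_members)
```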

In [22]:
# alternatively, we can remove the duplicated rows,
# transpose the dataframe back to the variables as columns
# keep first indicates that we keep the first of a set of
# duplicated variables

data_unique = data_t.drop_duplicates(keep='first').T
data_unique.shape

(160, 6)

We can see how removing duplicated features reduces the feature space: we went from 7 to 6 features.

In [23]:
# to find those columns in the training set that were removed
# (we compare against X_train, because data still contains the target):

duplicated_features = [col for col in X_train.columns if col not in data_unique.columns]
duplicated_features 

['radio1']

### Big datasets

Transposing a dataframe is memory costly if the dataframe is big. Therefore, we can instead loop over pairs of columns to find duplicated columns in bigger datasets.

In this case, I will use the same dataset, but I will ask for more rows. In general, I expect to see fewer duplicated features when loading more rows, because the more observations there are, the less likely it is that 2 features share the same value in every one of them. But this might not be the case. Let's have a look.

In [26]:
# load the dataset
data = pd.read_csv('Advertising.csv', nrows=50000)

# separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['sales'], axis=1),
    data['sales'],
    test_size=0.2,
    random_state=0)

X_train.shape, X_test.shape

((160, 7), (40, 7))

In [27]:
# check for duplicated features in the training set
duplicated_feat = []
for i in range(0, len(X_train.columns)):
    if i % 10 == 0:  # this helps me understand how the loop is going
        print(i)

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            duplicated_feat.append(col_2)

0


In [28]:
# check how many features are duplicated
print(len(set(duplicated_feat)))

1


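We find the same duplicated feature as before. This is expected here: the file contains only 200 rows, so both loads return the full dataset. On bigger datasets, a larger sample can reveal fewer duplicated features than a smaller one, so ideally you should work over the entire dataset.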
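On wide datasets, the nested loop runs a full comparison for every pair of columns. One possible shortcut, which is not part of this lecture's original code, is to compute a hash per column first and run `equals` only on columns whose hashes collide. The sketch below assumes that `pd.util.hash_pandas_object` can hash every column in the dataframe. The function name `find_duplicated_columns` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def find_duplicated_columns(df):
    """Map each first-seen column to the later columns identical to it."""
    seen = {}    # column hash -> columns kept as representatives
    groups = {}  # representative column -> list of its duplicates

    for col in df.columns:
        # hash every value of the column (ignoring the index), then
        # reduce the per-row hashes to a single key for the column
        row_hashes = pd.util.hash_pandas_object(df[col], index=False)
        key = hash(row_hashes.to_numpy().tobytes())

        # only columns with the same key can be identical;
        # confirm with an exact comparison to rule out collisions
        for kept in seen.get(key, []):
            if df[kept].equals(df[col]):
                groups.setdefault(kept, []).append(col)
                break
        else:
            seen.setdefault(key, []).append(col)

    return groups


# hypothetical toy data: b copies a, e copies d (including the nan)
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [3, 2, 1],
    "d": [1.0, np.nan, 3.0],
    "e": [1.0, np.nan, 3.0],
})

groups = find_duplicated_columns(toy)
print(groups)
```

The hash acts only as a pre-filter. The final decision still rests on `equals`, so the result should match what the nested loop finds, with each duplicated column reported once.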

In [29]:
# let's print the list of duplicated features
set(duplicated_feat)

{'radio1'}

In [30]:
# we can go ahead and try to identify which pairs of features
# are identical

duplicated_feat = []
for i in range(0, len(X_train.columns)):

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:

        # if the features are duplicated
        if X_train[col_1].equals(X_train[col_2]):

            # print them
            print(col_1)
            print(col_2)
            print()

            # and then append the duplicated one to a
            # list
            duplicated_feat.append(col_2)

radio
radio1



In [33]:
# let's check that those features are indeed duplicated,
# using the pair printed above

X_train[['radio', 'radio1']].head(10)

Unnamed: 0,radio,radio1
134,38.6,38.6
66,24.6,24.6
26,29.3,29.3
113,20.6,20.6
168,23.6,23.6
63,29.6,29.6
8,2.1,2.1
75,43.7,43.7
118,36.9,36.9
143,5.7,5.7



We can see that the features are identical.
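Once the duplicated features are known, the last step is to drop them. In line with the good practice mentioned earlier, the list is derived from the training set only and then applied to both sets. Here is a minimal sketch with hypothetical stand-ins for `X_train`, `X_test` and `duplicated_feat`. The values are made up for illustration.

```python
import pandas as pd

# hypothetical stand-ins for the train and test sets of this lecture
X_train = pd.DataFrame({
    "TV": [36.9, 31.5, 142.9],
    "radio": [38.6, 24.6, 29.3],
    "radio1": [38.6, 24.6, 29.3],
})
X_test = pd.DataFrame({
    "TV": [209.6, 215.4],
    "radio": [20.6, 23.6],
    "radio1": [20.6, 23.6],
})

# list of duplicated features, identified on the training set only
duplicated_feat = ["radio1"]

# remove the same columns from both sets so that they stay aligned
X_train_reduced = X_train.drop(columns=duplicated_feat)
X_test_reduced = X_test.drop(columns=duplicated_feat)

print(X_train_reduced.shape, X_test_reduced.shape)
```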

That is all for this lecture. I hope you enjoyed it, and see you in the next one!