# Deep Dive into Duplicated Data

This notebook builds upon [this](https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense) excellent notebook by AmbrosM, the [observation](https://www.kaggle.com/c/tabular-playground-series-feb-2022/discussion/305364) from Teck Meng Wong that many rows are duplicated, and the [observation](https://www.kaggle.com/thexyzt/intersection-between-training-and-test-sets) by thexyzt that some rows appear in both train and test.

As the contributors above discovered, this dataset contains duplicated data. Why? Most discussions so far have assumed that this is a data quality issue that should simply be fixed by removing the duplicates. However, I will show in this notebook that the duplication reveals key insights about how the data was built, and that the duplicates should probably be kept.

## Introduction

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from math import factorial

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.cluster import KMeans

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-feb-2022/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-feb-2022/test.csv')

elements = [e for e in train_df.columns if e != 'row_id' and e != 'target']

# Convert the 10 bacteria names to the integers 0 .. 9
le = LabelEncoder()
train_df['target_num'] = le.fit_transform(train_df.target)

train_df.shape, test_df.shape

In [None]:
print("Number of duplicates in train data:", train_df[elements].duplicated().sum())
print("Number of duplicates in test data:", test_df[elements].duplicated().sum())

As we can see, there are many duplicates. But why? First, we must understand the resolution and error rate of the data.

The data that we get is a transformed version of the raw data, where the raw version is the number of times each histogram bin was sampled. Each row has a different resolution: it contains either 1,000,000, 100,000, 1,000 or 100 samples, which we can know from the greatest common divisor (gcd) of the inversely transformed data. Rows with fewer samples carry less information that we can use to classify them successfully.
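
To make the link between the gcd and the resolution concrete, here is a minimal toy sketch. It assumes a row is a histogram of counts normalized to frequencies and scaled to integers out of 1,000,000, and it skips the bias correction that the notebook applies. The helper `resolution_from_row` and the toy counts are hypothetical, introduced only for illustration.

```python
import numpy as np

def resolution_from_row(int_row, scale=1_000_000):
    """Estimate the sample count behind a row of integer-scaled frequencies.

    The gcd of the row divides every entry, so scale // gcd is the smallest
    sample count that could have produced it (the true count may be a
    multiple if all raw counts happen to share a common factor).
    """
    g = np.gcd.reduce(np.asarray(int_row, dtype=np.int64))
    return int(scale // g)

# Hypothetical toy histogram: 100 reads spread over 4 bins
counts = np.array([37, 21, 29, 13])
freqs = counts / counts.sum()
int_row = np.round(freqs * 1_000_000).astype(np.int64)

print(int_row, resolution_from_row(int_row))
```

In this toy row every entry is a multiple of 10000, so the recovered resolution is 100 samples.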

In [None]:
def bias(w, x, y, z):
    # Multinomial probability of a decamer with w, x, y, z copies of A, T, G, C
    # when all four bases are equally likely
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)

def bias_of(s):
    # Parse a column name of the form 'A<w>T<x>G<y>C<z>' and return its bias
    w = int(s[1:s.index('T')])
    x = int(s[s.index('T')+1:s.index('G')])
    y = int(s[s.index('G')+1:s.index('C')])
    z = int(s[s.index('C')+1:])
    return bias(w, x, y, z)

train_i = pd.DataFrame({col: ((train_df[col] + bias_of(col)) * 1000000).round().astype(int) for col in elements})
test_i = pd.DataFrame({col: ((test_df[col] + bias_of(col)) * 1000000).round().astype(int) for col in elements})
train_i
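
As a sanity check on the bias formula, the sketch below enumerates every way to split 10 bases among A, T, G and C and confirms that the biases sum to 1, as a probability distribution should. It assumes, as the parsing in `bias_of` suggests, that each feature column corresponds to one such split; the name `compositions` is introduced here for illustration only.

```python
from itertools import product
from math import factorial

def bias(w, x, y, z):
    # Multinomial probability of a decamer with the given base counts
    # when all four bases are equally likely
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)

# Every (w, x, y, z) with non-negative entries that sum to 10
compositions = [c for c in product(range(11), repeat=4) if sum(c) == 10]
total = sum(bias(*c) for c in compositions)

print(len(compositions), total)
```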

In [None]:
def gcd_of_all(df_i):
    gcd = df_i[elements[0]]
    for col in elements[1:]:
        gcd = np.gcd(gcd, df_i[col])
    return gcd

train_df['gcd'] = gcd_of_all(train_i)
test_df['gcd'] = gcd_of_all(test_i)
np.unique(train_df['gcd'], return_counts=True), np.unique(test_df['gcd'], return_counts=True)

Notice that each resolution is represented in roughly equal numbers. Now let's run a PCA on the data.

In [None]:
for scale in np.sort(train_df['gcd'].unique()):
    # Compute the PCA
    pca = PCA(whiten=True, random_state=1)
    pca.fit(train_i[elements][train_df['gcd'] == scale])

    # Transform the data so that the components can be analyzed
    Xt_tr = pca.transform(train_i[elements][train_df['gcd'] == scale])
    Xt_te = pca.transform(test_i[elements][test_df['gcd'] == scale])

    # Plot a scattergram, projected to two PCA components, colored by classification target
    plt.figure(figsize=(6,6))
    plt.scatter(Xt_tr[:,0], Xt_tr[:,1], c=train_df.target_num[train_df['gcd'] == scale], s=1)
    plt.title(f"{1000000 // scale} decamers ({(train_df['gcd'] == scale).sum()} samples with gcd = {scale})")
    plt.show()


In the plots for the higher resolution data (low gcd), we see 8 clusters for each combination of class and gcd. These clusters represent different error rates: the clusters farther from the center, where the points converge, have smaller errors, because classes become less distinct at higher error rates. In the plots for the lower resolution data, everything blends together.

# The reveal
We will now recreate the same plots, but using only the points that have duplicates.

In [None]:
# Flag every row that has at least one exact duplicate (computed once, outside the loop)
train_dup = train_df[elements].duplicated(keep=False)
test_dup = test_df[elements].duplicated(keep=False)

for scale in np.sort(train_df['gcd'].unique()):
    # Fit the PCA on all training rows of this resolution, exactly as before
    pca = PCA(whiten=True, random_state=1)
    pca.fit(train_i[elements][train_df['gcd'] == scale])

    # Project only the duplicated rows; one combined mask avoids chained boolean indexing
    tr_mask = (train_df['gcd'] == scale) & train_dup
    te_mask = (test_df['gcd'] == scale) & test_dup
    Xt_tr = pca.transform(train_i.loc[tr_mask, elements])
    Xt_te = pca.transform(test_i.loc[te_mask, elements])

    # Plot a scattergram, projected to two PCA components, colored by classification target
    plt.figure(figsize=(6,6))
    plt.scatter(Xt_tr[:,0], Xt_tr[:,1], c=train_df.target_num[tr_mask], s=1)
    plt.title(f"{1000000 // scale} decamers ({tr_mask.sum()} samples with gcd = {scale})")
    plt.show()

Suddenly, the reason for the duplicated data reveals itself! In the first two plots, which show the high resolution data, we see only one cluster per class instead of 8. This is the cluster with the lowest error rate! If we sample the same bacterium repeatedly at a low error rate, we are bound to get some duplicates. Moreover, in the earlier plots every cluster has roughly the same size, which means that the lowest error rate clusters at each high resolution level should hold roughly $50000/8 = 6250$ data points in total, close to the number of duplicates we see. Low error rate data probably also gets duplicated at the lower resolutions, but we can't see it clearly in the PCA plots.

However, the lower resolution data has many more duplicates than low error rates can explain, and they seem to be all over the place. This leads us to the second reason the data is duplicated: when the resolution is very low, rows can coincide by chance alone, especially when many samples come from the same distribution, which may have low entropy.
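
The chance-duplication effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the four-bin distribution `p`, the row count and the helper `count_duplicated_rows` are hypothetical and far simpler than the real data, but they show how the same low-entropy distribution produces many identical rows at 100 samples and hardly any at 1,000,000.

```python
import numpy as np

def count_duplicated_rows(rows):
    """Count rows that share their exact values with at least one other row."""
    _, counts = np.unique(rows, axis=0, return_counts=True)
    return int(counts[counts > 1].sum())

rng = np.random.default_rng(0)

# Hypothetical low-entropy distribution over four histogram bins
p = [0.90, 0.05, 0.03, 0.02]
n_rows = 500

dups = {}
for n_samples in (100, 1_000_000):
    # Each row is one histogram built from n_samples independent draws
    rows = rng.multinomial(n_samples, p, size=n_rows)
    dups[n_samples] = count_duplicated_rows(rows)

print(dups)
```

The counting rule matches `duplicated(keep=False)`: every member of a group of identical rows is counted.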

## Conclusions

We uncovered two reasons for data duplication: low error rates and low resolution. Neither of these is a data quality issue, so we should think twice before removing the duplicates. However, one question remains open. If these were the only two reasons for duplication, we would expect significantly more duplicates at the 10000 gcd level than at the 1000 level, because the 1000 level has a much higher resolution. However, the numbers are similar. Please write a comment if you have an idea why this happens.

## To remove or not to remove?

Based on these insights, removing duplicates means removing some of our highest quality training data points. We would also be removing legitimately sampled data points that just happen to coincide with others. This will cause our machine learning algorithm to learn a skewed distribution of the data. Therefore, I recommend that you do not remove duplicates. An alternative approach might be to remove only the duplicates that stem from low resolution data, which skews learning towards high quality samples.
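
The alternative approach could be sketched roughly as follows. This is an untested idea, not a recommendation: the helper `drop_low_resolution_duplicates`, the toy frame and the choice of which gcd values count as "low resolution" are all assumptions made for illustration.

```python
import pandas as pd

def drop_low_resolution_duplicates(df, feature_cols, gcd_col="gcd",
                                   low_res_gcds=(1000, 10000)):
    """Drop repeated rows only where the resolution is low.

    Duplicates at high resolution (likely low error rate data) are kept,
    and the first occurrence of every low resolution row is kept too.
    """
    is_low_res = df[gcd_col].isin(list(low_res_gcds))
    is_repeat = df.duplicated(subset=list(feature_cols), keep="first")
    return df[~(is_low_res & is_repeat)]

# Hypothetical toy data: one duplicated pair at gcd 1, one at gcd 10000
toy = pd.DataFrame({
    "f1": [0.1, 0.1, 0.2, 0.2, 0.3],
    "f2": [0.5, 0.5, 0.6, 0.6, 0.7],
    "gcd": [1, 1, 10000, 10000, 10000],
})
kept = drop_low_resolution_duplicates(toy, ["f1", "f2"])
print(kept)
```

Here the high resolution pair survives intact, while the second copy of the low resolution pair is dropped.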

In my own experience, removing duplicates significantly lowered my score on the public leaderboard and created a stronger skew towards predicting certain classes over others.

Many people also said that our CV can no longer be trusted now that duplicates have been discovered. However, I'm not sure that this conclusion is justified, or that removing duplicates solves the problem. First, if the duplicates are legitimate, then "data leakage" is no longer a bug in the CV strategy but a feature of our ML problem, one that will also occur between train and test. Second, we saw some paradoxical results after removal, where test performance was better than CV performance. Third, the lower CV performance conceals the most important and hardest problem in this competition: the mutations that happened between train and test, which are the real reason the public LB scores are lower than the CV scores.

# Intersecting points
We now turn to the second, related mystery. We seem to have 1502 data points shared between train and test. Why?

In [None]:
intersection = pd.merge(train_df, test_df, on=list(test_df.columns[1:]), how="inner")
intersection.shape[0]

Let's filter to only the lowest resolution points.

In [None]:
intersection[intersection['gcd'] == 10000].shape[0]
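
The same check can be extended to count the shared rows at every resolution at once. The snippet below is a minimal sketch on hypothetical toy frames (`train_toy`, `test_toy` and the single feature column `a` are made up for illustration); on the real data the merge keys would be the feature columns plus `gcd`, as in the cell above.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df and test_df
train_toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "a": [0.1, 0.2, 0.3, 0.4],
    "gcd": [1, 10000, 10000, 1000],
    "target": ["x", "y", "y", "z"],
})
test_toy = pd.DataFrame({
    "row_id": [10, 11, 12],
    "a": [0.2, 0.3, 0.9],
    "gcd": [10000, 10000, 1],
})

# Inner merge on the features keeps only rows present in both frames
shared = pd.merge(train_toy, test_toy, on=["a", "gcd"], how="inner",
                  suffixes=("_train", "_test"))

# Number of shared rows per gcd level
per_gcd = shared["gcd"].value_counts().to_dict()
print(per_gcd)
```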

## Conclusions

All of the intersecting points are of the lowest resolution, which means they are yet more examples of low-resolution duplication. Since we know there were mutation shifts between train and test, we no longer get low-error duplicates between train and test. The implication is that we can't even be sure that the labels of these test points match the labels of the identical train points: a bacterium may, after mutating, have produced the same low-resolution row as a different bacterium in the train data. My recommendation is therefore to ignore the intersection. It's a red herring, and I don't see how analyzing it can lead to improvements.