# Summary

It has been [argued](https://www.kaggle.com/motloch/nov21-mislabeled-25) that about 25% of the target values in the test set have been randomly flipped.

We use the [known target values before flip](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/287047) to calculate flip probability in all [ten chunks discovered earlier](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/286731).

If the random flips really are independent of the feature values and chunks, we [would expect](https://en.wikipedia.org/wiki/Binomial_distribution) the number of flips in each chunk to be
$$\approx 0.25 \times 60000 = 15000,$$
with standard deviation
$$\approx \sqrt{0.25 \times (1-0.25) \times 60000} \approx 106.$$

We observe a chunk-to-chunk standard deviation of about 400. Assuming our calculation is correct, this suggests that the flip probability might depend weakly on the features (as the chunks have different feature distributions). It could also be driven by the chunking process itself: for example, each chunk might have a slightly different flip probability that is otherwise independent of the features.
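
To illustrate the second explanation, here is a minimal simulation sketch (not part of the original analysis). It draws flip counts for ten chunks of 60000 rows, once with a single flip probability of 0.25 and once with a probability that varies slightly from chunk to chunk. The spread `p_sigma = 0.006`, the number of simulations and the seed are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice
n_chunks, chunk_size, p = 10, 60000, 0.25
n_sims = 2000                    # number of simulated data sets
p_sigma = 0.006                  # ASSUMED chunk-to-chunk spread of the flip probability

# Scenario A: the same flip probability in every chunk
counts_const = rng.binomial(chunk_size, p, size=(n_sims, n_chunks))

# Scenario B: each chunk gets its own, slightly different flip probability
p_chunk = rng.normal(p, p_sigma, size=(n_sims, n_chunks)).clip(0, 1)
counts_var = rng.binomial(chunk_size, p_chunk)

# Chunk-to-chunk standard deviation, averaged over the simulated data sets
std_const = counts_const.std(axis=1).mean()
std_var = counts_var.std(axis=1).mean()
print(f'Constant p: typical chunk-to-chunk std {std_const:.0f}')
print(f'Varying p:  typical chunk-to-chunk std {std_var:.0f}')
```

In this sketch a spread of well under one percentage point in the flip probability is enough to make the chunk-to-chunk standard deviation several times larger than the binomial value.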

# Import libraries, load data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load the competition training data, both after and before the reshuffling

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
train_old = pd.read_csv('../input/november21/train.csv')

For convenience:

In [None]:
y = train['target']
y_old = train_old['target']

As discussed [here](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/286731) by [@grayjay](https://www.kaggle.com/grayjay), the training data seems to be a combination of ten chunks of size 60k, each with slightly different distributions.

In [None]:
CS = 60000

# Study random flips in the train set

This is an array telling us which of the targets were flipped

In [None]:
is_flipped = (y != y_old)

Calculate how many targets were flipped in each of the chunks

In [None]:
num_flipped = np.zeros(10, dtype = int)
for i in range(10):
    num_flipped[i] = np.sum(is_flipped[i*CS:(i+1)*CS])
    print(num_flipped[i])

Basic statistics

In [None]:
mu = np.mean(num_flipped)
std = np.std(num_flipped)
print(f'Mean is {mu:.0f} and standard deviation {std:.0f}')

Parameters of the [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) and standard deviation we would expect if the flips were randomly drawn from this distribution

In [None]:
N = CS        # number of draws
p = mu/N      # success probability
q = 1 - p     # failure probability

expected_std = np.sqrt(p*q*N)  # see e.g. Wikipedia link above
print(f'Expected standard deviation is {expected_std:.0f}')
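
One way to condense this comparison into a single number is a binomial dispersion statistic, sketched below. The helper name `dispersion_statistic` and the example counts are hypothetical and not taken from the data. Under the binomial hypothesis the statistic is approximately chi-squared distributed with (number of chunks minus 1) degrees of freedom, so for ten chunks its expected value is about 9.

```python
import numpy as np

def dispersion_statistic(counts, n_draws):
    """Sum of squared deviations from the mean, in units of the binomial variance."""
    counts = np.asarray(counts, dtype=float)
    p_hat = counts.mean() / n_draws            # pooled flip probability
    binom_var = n_draws * p_hat * (1 - p_hat)  # variance expected for a single chunk
    return np.sum((counts - counts.mean())**2) / binom_var

# Made-up counts for illustration only; with the notebook's variables
# one would call dispersion_statistic(num_flipped, CS).
example_counts = [15000, 15100, 14900, 15050, 14950]
print(dispersion_statistic(example_counts, 60000))
```

Values much larger than the number of degrees of freedom indicate more scatter than a binomial distribution allows. If SciPy is available, `scipy.stats.chi2.sf` could convert the statistic into a p-value.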

Comparison of the observed numbers of flips in the ten training chunks with the expected mean and standard deviation (assuming a binomial distribution). The number of flips per chunk seems to fluctuate far more than it should.

In [None]:
plt.scatter(range(10), num_flipped, label = 'Training data')
plt.axhline(mu, c = 'gray', label = 'Expected (68% c.l.)')
plt.axhline(mu + expected_std, c = 'gray')
plt.axhline(mu - expected_std, c = 'gray')
plt.ylabel('Number of flips in the chunk')
plt.xlabel('Chunk number');
plt.legend(loc = 4);
plt.ylim([14000, 16000]);

There is almost a factor of four difference between the observed and expected standard deviations.

In [None]:
std/expected_std
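
If the excess scatter is attributed entirely to chunk-to-chunk differences in the flip probability (an assumption, and only one of the two explanations suggested in the summary), the variances approximately add: observed variance ≈ binomial variance + (N × sigma_p)², where sigma_p is the standard deviation of the per-chunk flip probability. The sketch below inverts this relation. The function name `implied_p_spread` is hypothetical, and the inputs are the rounded values quoted in the summary rather than the exact notebook outputs.

```python
import math

def implied_p_spread(observed_std, binomial_std, n_draws):
    """Std of the per-chunk flip probability that would explain the excess scatter."""
    excess_var = observed_std**2 - binomial_std**2
    if excess_var <= 0:
        return 0.0   # no excess scatter to explain
    return math.sqrt(excess_var) / n_draws

# Rounded values from the summary: observed ~400, expected ~106, chunk size 60000
sigma_p = implied_p_spread(400, 106, 60000)
print(f'Implied spread of the flip probability: {sigma_p:.4f}')
```

With these rounded inputs the implied spread is about 0.006, i.e. flip probabilities that differ between chunks by roughly 0.6 percentage points.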