Let's assume two things.

1. **The data is generated with around 2.5% of the labels flipped (https://www.kaggle.com/c/instant-gratification/discussion/94671#latest-547805).**

2. **We've created a perfect or almost perfect classifier, which correctly classifies all or almost all of the samples except the flipped ones.**
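
To make the two assumptions concrete, here is a minimal sketch on a small toy dataset. The size `n`, the seed, and the names `gen`, `true_y`, and `observed_y` are illustrative choices for this sketch and are not part of the competition data. It scores the same perfect ranking once against the unflipped labels and once against the labels after random flipping.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

gen = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
n = 100_000                     # illustrative size, not the competition's
flip_rate = 0.025               # assumption 1: about 2.5% of labels are flipped

# True labels: first half is class 0, second half is class 1.
true_y = (np.arange(n) >= n // 2).astype(int)

# Assumption 2: the classifier ranks every true 1 above every true 0.
# Scores inside a class are random, as a stand-in for a real model.
pred = true_y + gen.uniform(0, 0.5, size=n)

# Observed labels: each label is flipped independently with probability flip_rate.
flipped = gen.uniform(size=n) < flip_rate
observed_y = np.where(flipped, 1 - true_y, true_y)

clean_auc = roc_auc_score(true_y, pred)
noisy_auc = roc_auc_score(observed_y, pred)
print(f"AUC against true labels:     {clean_auc:.4f}")
print(f"AUC against observed labels: {noisy_auc:.4f}")
```

Under these assumptions the classifier itself makes no mistakes, so any gap between the two numbers comes from the flipped labels alone.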

Now let's do some tests.

**Be careful! All of these test results are useless if either assumption is wrong.**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm
np.random.seed(0)

# Synthetic Datasets

Since **about** 2.5% of the labels are flipped, rather than exactly 2.5%, we will create many different datasets, then look at the range of scores a perfect (or almost perfect) classifier can get and at the possible differences between public and private scores.

Dataset size is set to 262144, similar to our competition's test set size.

*The data generation code is revised from code originally written by raddar (https://www.kaggle.com/c/instant-gratification/discussion/94671#latest-547805).*

In [None]:
size = 262144
score_archive = []
diff_archive = []
for _ in tqdm(range(10000)):
    # generate synthetic dataset: one uniform draw per sample decides whether its label is flipped
    rng = np.random.uniform(size=size)
    # the sample index is the prediction, so the ranking is perfect with respect to the unflipped labels
    pred = np.arange(size)
    # first half is class 0, second half is class 1; draws >= 0.975 (about 2.5%) flip the label
    y = (pred >= size // 2).astype(int)
    y = np.where(rng < 0.975, y, 1 - y)
    # make synthetic public set and private set
    pub_pred, pri_pred, pub_y, pri_y = train_test_split(pred, y, shuffle=True, test_size=0.5)
    # calculate public and private set scores once each
    pub_score = roc_auc_score(pub_y, pub_pred)
    pri_score = roc_auc_score(pri_y, pri_pred)
    score_archive.append(pub_score)
    # calculate score difference between public set and private set
    diff_archive.append(pub_score - pri_score)


# Score of a Perfect Classifier

In [None]:
plt.figure(figsize=(14,6))
plt.hist(score_archive, bins=100)
plt.title('distribution of roc auc of a perfect classifier')
plt.show()

In [None]:
np.mean(score_archive)

In [None]:
np.percentile(score_archive, np.arange(40, 60, 1))

The mean score of about 0.975 is similar to what raddar observed. The scores also roughly match the current top scores on the public leaderboard.
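
A back-of-the-envelope derivation suggests why the mean lands near 0.975. It assumes each label is flipped independently with probability $p$ (a symbol introduced here, not in the original code), that the two classes are the same size, and that the classifier ranks every truly positive sample above every truly negative one, with no particular order inside a class. AUC is the probability that a randomly chosen observed positive is ranked above a randomly chosen observed negative. For such a pair, either neither label was flipped (the pair is ordered correctly), exactly one was flipped (both samples truly belong to the same class, so the order is a coin toss), or both were flipped (the pair is ordered incorrectly):

$$
\mathbb{E}[\mathrm{AUC}] \approx (1-p)^2 \cdot 1 + 2p(1-p) \cdot \tfrac{1}{2} + p^2 \cdot 0 = (1-p)^2 + p(1-p) = 1 - p
$$

With $p = 0.025$ this gives about $0.975$, under the stated assumptions.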

# Score Difference between Public set and Private set

In [None]:
plt.figure(figsize=(14,6))
plt.hist(diff_archive, bins=100)
plt.title('distribution of roc auc diff between public and private')
plt.show()

In [None]:
np.mean(np.abs(diff_archive))

In [None]:
np.percentile(diff_archive, np.arange(40, 60, 1))

It looks like a shake-up or shake-down of about 0.0005 to 0.0006 in score will happen on the private set if we've made a perfect (or almost perfect) classifier.
Looking at the current top public leaderboard scores, that could mean moves of about 10 positions up or down.

**But note once more that these conclusions are based on the assumptions I described at the beginning.**
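
Since everything depends on the assumed flip rate, one way to probe that dependence is to rerun a reduced version of the experiment for several rates. The sketch below is only an illustration: `simulate_shake` is a hypothetical helper, and its defaults (`size=20_000`, `n_trials=30`) are kept small so that it runs in seconds. They are not the settings used above (262144 and 10000), so the absolute numbers it prints should not be compared directly with the full-size results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def simulate_shake(flip_rate, size=20_000, n_trials=30, seed=0):
    """Return public-minus-private AUC differences for a perfect ranker.

    Hypothetical helper for this sketch; the defaults are deliberately small.
    """
    gen = np.random.default_rng(seed)
    pred = np.arange(size)                    # perfect ranking by index
    true_y = (pred >= size // 2).astype(int)  # first half 0, second half 1
    diffs = np.empty(n_trials)
    for t in range(n_trials):
        # flip each label independently with probability flip_rate
        flipped = gen.uniform(size=size) < flip_rate
        y = np.where(flipped, 1 - true_y, true_y)
        # random 50/50 public/private split
        perm = gen.permutation(size)
        pub, pri = perm[: size // 2], perm[size // 2:]
        diffs[t] = roc_auc_score(y[pub], pred[pub]) - roc_auc_score(y[pri], pred[pri])
    return diffs

for rate in (0.01, 0.025, 0.05):
    diffs = simulate_shake(rate)
    print(f"flip_rate={rate:.3f}  mean |public - private| = {np.mean(np.abs(diffs)):.5f}")
```

Passing `size=262144` and a larger `n_trials` would bring the sketch closer to the experiment above, at the cost of a much longer run time.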