# Summary

If 25% of targets in the public test set were mislabelled, as suggested in [this notebook](https://www.kaggle.com/motloch/nov21-mislabeled-25), then a perfect classifier achieves an AUC of 0.75 with a standard deviation of 0.00079 or 0.00155, depending on how the mislabelling is simulated.

The former number is obtained when exactly a quarter of the samples are mislabelled.

The latter is obtained when we decide for each sample individually whether to mislabel it. The number of mislabelled samples can then be higher or lower than in the previous case, where it is held constant.
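
The mean of 0.75 can also be motivated with a short calculation. This is only a sketch, assuming balanced classes, a large sample, and labels flipped independently of the score with probability $p$ (the symbol $p$ is introduced here for illustration). AUC is the probability that a randomly chosen sample labelled 1 is scored above a randomly chosen sample labelled 0. Splitting by whether the two labels in such a pair are correct gives:

$$
\begin{aligned}
\mathrm{AUC} &\approx \underbrace{(1-p)^2 \cdot 1}_{\text{both labels correct}} + \underbrace{2p(1-p) \cdot \tfrac{1}{2}}_{\text{exactly one label flipped}} + \underbrace{p^2 \cdot 0}_{\text{both labels flipped}} \\
&= (1-p)^2 + p(1-p) = 1 - p .
\end{aligned}
$$

When exactly one label is flipped, both samples come from the same true class, so their order is a coin toss. For $p = 0.25$ this gives $\mathrm{AUC} \approx 0.75$, in line with the simulated mean.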

# Import libraries and data

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

In [None]:
dat = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
N = int(len(dat) * 0.19)      # 0.19 for the public test set portion
print(f'Number of samples in the public test set: {N}')

# Create mock results for a perfect classifier

Let's order the samples by how confident the classifier is that target == 1. Because we assume the classifier is perfect and the two classes are equally likely, the first half of the samples has target == 0 and the second half has target == 1. These are the "true" targets, before the mislabelling.

In [None]:
target_original = np.zeros(N)
target_original[N//2:] = 1

Results of a "perfect" mock classifier, corresponding to the predicted probability of target being 1 for each sample. Only the order matters for AUC, so we are free to choose any monotonically increasing values.

In [None]:
pred = np.zeros(N)
pred[:N//2] = 0.10*np.arange(N//2)/N
pred[N//2:] = 1 - 0.10*np.arange(N - N//2, 0, -1)/N   # N - N//2 values, so an odd N also works
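
The statement that only the order matters can be checked on a toy example. The sketch below is not part of the original analysis: the helper `pairwise_auc` and the eight toy samples are hypothetical, and the helper computes AUC directly from its definition as the fraction of correctly ordered (positive, negative) pairs. It uses only the standard library so that it runs outside the notebook.

```python
import math

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: scores sorted in increasing order,
# with one "mislabelled" sample in each half.
labels = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.01, 0.02, 0.03, 0.04, 0.96, 0.97, 0.98, 0.99]

# Strictly increasing transforms change the values but not the ordering.
rescaled = [10 * s - 3 for s in scores]
squashed = [math.tanh(s) for s in scores]

auc_raw = pairwise_auc(labels, scores)
auc_rescaled = pairwise_auc(labels, rescaled)
auc_squashed = pairwise_auc(labels, squashed)
print(auc_raw, auc_rescaled, auc_squashed)
```

All three values agree, which is why the particular spacing chosen for `pred` has no effect on the results.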

Check the mock predictions against the original targets:

In [None]:
plt.plot(pred, label = 'prediction from "perfect" classifier')
plt.plot(target_original, label = 'target before mislabelling')
plt.xlabel('sample number')
plt.legend();

# Simulate random mislabelling in a Monte Carlo fashion - flip a quarter of the samples

We randomly flip the target value for a quarter of the samples NRANDOM times and calculate the resulting ROC AUC each time.

In [None]:
NRANDOM = 1000

auc = np.zeros(NRANDOM)

for i in range(NRANDOM):
    
    if i % 100 == 0:
        print(i)
        
    # Pick the quarter of the samples that we mislabel (indices drawn without replacement)
    to_flip = np.random.choice(N, size = N//4, replace = False)
    
    # Create the mislabeled test set
    target_mislabelled = target_original.copy()
    target_mislabelled[to_flip] = 1 - target_mislabelled[to_flip]
    
    # Get the AUC
    auc[i] = roc_auc_score(target_mislabelled, pred)

Histogram

In [None]:
plt.hist(auc, 20);
plt.xlabel('AUC');

Statistics:

In [None]:
print(f'Mean is {np.mean(auc):.5f}, standard deviation {np.std(auc):.5f}')

# Simulate random mislabelling in a Monte Carlo fashion - flip each sample with 25% probability

In this method, we decide for each sample independently whether to flip it, with 25% probability. This means the number of flips can differ from N/4.

In [None]:
NRANDOM = 1000

auc = np.zeros(NRANDOM)

for i in range(NRANDOM):
    
    if i % 100 == 0:
        print(i)
        
    # For each sample, flip it with 25% probability
    flip = np.random.random(size = N) >= 0.75
    stay = 1 - flip
    
    # Create the mislabeled test set
    target_mislabelled = target_original.copy()
    target_mislabelled = target_mislabelled * stay + (1 - target_mislabelled) * flip
    
    # Get the AUC
    auc[i] = roc_auc_score(target_mislabelled, pred)
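
The difference between the two flipping schemes can be seen by counting flips. The sketch below is an illustration rather than part of the original analysis: `n`, `trials`, and the seed are hypothetical values, and it uses the standard library `random` module instead of NumPy so that it runs on its own. Under independent flipping the number of flips follows a binomial distribution, so its standard deviation is sqrt(n * p * (1 - p)), whereas the fixed-count scheme always flips exactly n // 4 samples.

```python
import math
import random
import statistics

n = 10_000      # hypothetical sample count, not the real public test set size
p = 0.25        # flip probability, as in the text
trials = 200    # hypothetical number of repetitions

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Scheme 1: choose exactly n // 4 distinct indices to flip.
fixed_counts = [len(rng.sample(range(n), n // 4)) for _ in range(trials)]

# Scheme 2: flip each sample independently with probability p
# (same threshold test as in the cell above).
indep_counts = [
    sum(rng.random() >= 1 - p for _ in range(n)) for _ in range(trials)
]

# Standard deviation of a Binomial(n, p) count.
binomial_std = math.sqrt(n * p * (1 - p))

print(f'Fixed scheme: counts seen = {sorted(set(fixed_counts))}')
print(f'Independent scheme: mean = {statistics.mean(indep_counts):.1f}, '
      f'std = {statistics.pstdev(indep_counts):.1f}')
print(f'Binomial prediction for the std: {binomial_std:.1f}')
```

This extra variation in the number of flips is a plausible contributor to the larger spread in AUC reported for this method, since the AUC of a perfect classifier falls as the mislabelled fraction grows.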

Histogram

In [None]:
plt.hist(auc, 20);
plt.xlabel('AUC');

Statistics:

In [None]:
print(f'Mean is {np.mean(auc):.5f}, standard deviation {np.std(auc):.5f}')