This is a fork of osciiart's notebook: https://www.kaggle.com/osciiart/public-lb-simulation

I extended the analysis to simulate the private score after **two** random-value submissions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os, glob, pickle, time, gc, copy, sys
import warnings
from tqdm import tqdm
from sklearn import metrics

tqdm.pandas()
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)

# Data loading

In [None]:
df_train = pd.read_csv("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv")
print('len(df_train): {}'.format(len(df_train)))
print("df_train['MGMT_value'].mean(): {:.6f}".format(df_train['MGMT_value'].mean()))
df_train.head()

In [None]:
df_test = pd.read_csv("../input/rsna-miccai-brain-tumor-radiogenomic-classification/sample_submission.csv")
print('len(df_test): {}'.format(len(df_test)))
df_test.head()

The public test data has only 87 samples, and the private test set is approximately four times larger.

> The private leaderboard is calculated with approximately 78% of the test data.

To simulate the private set, we can repeat the public test set four times.

In [None]:
df_test = pd.concat([df_test, df_test, df_test, df_test], ignore_index = True)
print('len(df_test): {}'.format(len(df_test)))
df_test.head()
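
As a quick check of the "four times" factor, the quoted 78% private share implies a private-to-public ratio of 0.78 / 0.22. The sketch below works that out, assuming the 87 public samples make up the remaining 22% of the test data (an assumption, not something stated in the quote above); the variable names are illustrative.

```python
# Assumption: the 87 public samples are the remaining ~22% of the test data.
n_public = 87
private_share = 0.78
public_share = 1.0 - private_share

ratio = private_share / public_share       # private set size relative to public
n_private_est = round(n_public * ratio)    # rough estimate of the private sample count

print(f"private/public ratio: {ratio:.2f}")
print(f"estimated private samples: {n_private_est}")
```

Under that assumption the factor is closer to 3.5 than 4, so repeating the public set four times slightly overstates the private set size.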

# Private LB simulation

In [None]:
# assume the positive rate of the test data is the same as in the train data.
num_positive = int(len(df_test)*df_train['MGMT_value'].mean())
print("num_positive: {}".format(num_positive))

In [None]:
# make true labels
y_true = np.zeros(len(df_test))
y_true[:num_positive] = 1
print("y_true.mean(): {:.6f}".format(y_true.mean()))

In [None]:
# make random prediction
y_pred = np.random.rand(len(df_test))
y_pred

In [None]:
# calculate the score
from sklearn import metrics
score = metrics.roc_auc_score(y_true, y_pred)
print("score: {:.6f}".format(score))

In [None]:
from tqdm.auto import tqdm

In [None]:
# try random prediction 100000 times.
scores = []
for i in tqdm(range(100000)):
    np.random.seed(i)
    y_pred = np.random.rand(len(df_test))
    score = metrics.roc_auc_score(y_true, y_pred)
    scores.append(score)
df_score = pd.DataFrame(scores, columns=['score'])
df_score = df_score.sort_values('score').reset_index(drop=True)
plt.hist(df_score['score'], bins=10)
plt.show()
df_score.tail()

The simulation above shows that a score of about 0.65 can be reached almost purely by chance. Don't trust the public LB score too much.
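
As a cross-check on the histogram, the spread of AUC for purely random, tie-free predictions has a closed form. AUC equals the Mann-Whitney U statistic divided by `n_pos * n_neg`, and the null variance of U gives a standard deviation of `sqrt((n_pos + n_neg + 1) / (12 * n_pos * n_neg))`. A minimal sketch follows; `auc_null_std` is a hypothetical helper name, and the class counts are illustrative placeholders, not the competition's values.

```python
import math

def auc_null_std(n_pos: int, n_neg: int) -> float:
    """Std of AUC under random, tie-free scores (Mann-Whitney U null variance)."""
    return math.sqrt((n_pos + n_neg + 1) / (12 * n_pos * n_neg))

# Illustrative, roughly balanced class counts (assumed, not taken from the data)
small = auc_null_std(44, 43)      # a test set of 87 samples
large = auc_null_std(174, 174)    # a test set four times larger

print(f"std with  87 samples: {small:.4f}")
print(f"std with 348 samples: {large:.4f}")
```

Quadrupling the sample count roughly halves the spread, which is why a small public set gives a much noisier score than the larger private set.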

## One submission

With one all-random submission, the median LB score is 0.5 AUC.

In [None]:
print(f'(1 random sub) percentile-50 : {np.percentile(scores, 50)}')
print(f'(1 random sub) percentile-90 : {np.percentile(scores, 90)}')
print(f'(1 random sub) percentile-95 : {np.percentile(scores, 95)}')
print(f'(1 random sub) percentile-99 : {np.percentile(scores, 99)}')
print(f'(1 random sub) percentile-99.9 : {np.percentile(scores, 99.9)}')
print(f'(1 random sub) percentile-99.99 : {np.percentile(scores, 99.99)}')

## Simulate Private score after two allowed all-random submits

The private LB shows the maximum score of the two selected submissions.

In [None]:
import random

scores_2 = []

# pick k=2 scores at random (with replacement) from the 100k pool
for i in tqdm(range(10000)):
    random.seed(i)
    
    scores_2.append(np.max(random.choices(scores, k=2)))

In [None]:
print(f'(2 random subs) percentile-50 : {np.percentile(scores_2, 50)}')
print(f'(2 random subs) percentile-90 : {np.percentile(scores_2, 90)}')
print(f'(2 random subs) percentile-95 : {np.percentile(scores_2, 95)}')
print(f'(2 random subs) percentile-99 : {np.percentile(scores_2, 99)}')
print(f'(2 random subs) percentile-99.9 : {np.percentile(scores_2, 99.9)}')
print(f'(2 random subs) percentile-99.99 : {np.percentile(scores_2, 99.99)}')
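
The best-of-two percentiles can also be read from the single-submission pool without resampling. If the scores are independent draws with CDF F, then P(max <= x) = F(x)^k, so the q-quantile of the best of k equals the q^(1/k) quantile of a single draw. Below is a sketch under that independence assumption; `best_of_k_percentile` is a hypothetical helper, and the demo pool is synthetic rather than the simulated scores.

```python
import numpy as np

def best_of_k_percentile(pool, q, k=2):
    """q-th percentile (0-100) of the max of k independent draws from `pool`."""
    single_q = 100.0 * (q / 100.0) ** (1.0 / k)
    return np.percentile(pool, single_q)

# Synthetic stand-in for the simulated score pool (uniform on [0, 1])
demo_pool = np.linspace(0.0, 1.0, 10001)
median_best_of_2 = best_of_k_percentile(demo_pool, 50, k=2)
print(f"median of best-of-2: {median_best_of_2:.4f}")
```

With the notebook's pool this would be called as `best_of_k_percentile(scores, 50)`. The best-of-two median sits at roughly the 70.7th percentile of a single random submission.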

# Submit all random

For fun :)

In [None]:
sub = pd.read_csv('/kaggle/input/rsna-miccai-brain-tumor-radiogenomic-classification/sample_submission.csv')

# Fill with random values
sub['MGMT_value'] = np.random.rand(len(sub))

# save
sub.to_csv('submission.csv',index=False)
sub.head()