# Motivation

As several people have already discussed, the data seems to come in chunks, and the ratio of positive labels differs from one chunk to another. @AmbrosM has described a way to probe the per-chunk positive ratios of the test set in his notebook. He takes a well-performing submission and replaces the outputs for different chunks with predefined values; submitting the modified outputs then reveals some information about the actual ratio of positive samples in the public part of the test set.

In this notebook I'm going to show a mathematical way to estimate the ratio. I use a different kind of probing submission and then use the AUC score to approximate the positive ratios under some assumptions.

Long story short, here are the estimated ratios (probabilities) for the 9 chunks in the test set:

**(0.51242,
 0.59540,
 0.42619,
 0.45032,
 0.54770,
 0.54590,
 0.40770,
 0.45896,
 0.55530)**

I don't think these ratios help much in making better predictions, given that 25% of the targets are flipped and we have already reached 0.75 AUC, but let's have some fun with the mathematics of AUC.

Refer to this notebook and this discussion for more information about the chunks:

https://www.kaggle.com/ambrosm/tpsnov21-012-leaderboard-probing/notebook

https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/286731

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Probing

For the leaderboard probing part I'm using submissions that contain only 0s and 1s. I make 9 submissions; in the i-th submission I set all outputs for the i-th chunk to 1 and all the rest to 0. This way we get an idea of how many of the actual 1s are in the i-th chunk.

In [None]:
data_ts = pd.read_csv('/kaggle/input/tabular-playground-series-nov-2021/test.csv', index_col='id')
data_ts.shape

In [None]:
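ch_size = 60_000
for i in range(10, 19):
    data_ts['target'] = 0
    # Chunk i covers ids i*ch_size .. (i+1)*ch_size - 1.
    # .loc slices by id label and includes both endpoints, so stop at the chunk's last id.
    data_ts.loc[i*ch_size:(i+1)*ch_size - 1, 'target'] = 1
    data_ts[['target']].to_csv(f'prob_{i-10}.csv', index_label='id')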

## Probing Results

Here are the AUCs for the 9 submissions:

\[0.50276, 0.52120, 0.48360, 0.48896, 0.51060, 0.51020, 0.47949, 0.49088, 0.51229\]

We can already see that there should be more positive samples in the second chunk than in the third chunk. Note that these scores come from the public leaderboard, and the following discussion assumes that the public/private leaderboard split is random, i.e. that the same ratios hold in the private leaderboard.

# Mathematics of AUC

Because our submission contains only 0s and 1s, the ROC curve would look something like this:

In [None]:
import pylab as pl

fpr = [0, 0.4, 1]
tpr = [0, 0.6, 1]

pl.plot(fpr, tpr, 'bo-')
pl.plot([0,1], [0,1], 'k:')
pl.plot([0,fpr[1]], [tpr[1]]*2, 'r:')
pl.plot([fpr[1]]*2, [0,tpr[1]], 'r:')
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('ROC Curve')
pl.text(fpr[0]+0.05, tpr[0], 'threshold>1', ha='left')
pl.text(fpr[1], tpr[1]+0.1, 'threshold=1', va='bottom', ha='center')
pl.text(fpr[2]-0.05, tpr[2], 'threshold=0', ha='right');
pl.xticks(fpr, ['0',r'$FPR_1$','1'])
pl.yticks(tpr, ['0',r'$TPR_1$','1']);

As a reminder, here is how the ROC curve is calculated. For the given input vectors (y_true, y_pred), we know y_true is binary, and without loss of generality we can assume the y_pred values are in the [0,1] range. We then pick different decision thresholds and assign 0/1 labels to y_pred. For example, with 100 prediction values in [0,1] and a threshold of 0.2, every prediction at or above 0.2 becomes 1 and the rest become 0, which (with the predictions sorted) gives a vector like [0,0,0, ..., 1,1,1]. From this new vector we can calculate the True Positive Rate and the False Positive Rate relative to the original target (y_true).

The False Positive Rate is the number of false positives divided by the total number of actual negative samples, and the True Positive Rate is the number of true positives divided by the number of actual positives.

$FPR = \frac{FP}{N} = \frac{FP}{FP+TN}$

$TPR = \frac{TP}{P} = \frac{TP}{TP+FN}$

By changing the threshold we get different values for FPR and TPR, and plotting them gives us the ROC (receiver operating characteristic) curve. The area under this curve is typically called AUC (area under curve). It is worth mentioning that the thresholds are typically the unique values in the y_pred vector, because for values between those the TPR and FPR do not change.
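The thresholding procedure described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `roc_point` and the toy data are made up here, and predictions at or above the threshold are treated as positive.

```python
import numpy as np

def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when scores >= threshold are labelled positive."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

# Toy data, made up for illustration
y_true = np.array([0, 0, 1, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

# One threshold above the maximum score, then the unique scores in decreasing order
thresholds = np.concatenate(([np.inf], np.unique(y_pred)[::-1]))
roc_points = [roc_point(y_true, y_pred, t) for t in thresholds]

for t, (fpr_t, tpr_t) in zip(thresholds, roc_points):
    print(f"threshold={t:.2f}  FPR={fpr_t:.3f}  TPR={tpr_t:.3f}")
```

The first point is (0, 0) because nothing is labelled positive, and the last is (1, 1) because everything is.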

## AUC for the probing submission files

In our example, since y_pred contains only 0s and 1s, the thresholds would be [2, 1, 0]. The first and the last thresholds result in 0 and 1 respectively for both FPR and TPR. The main point of interest is threshold = 1.

Let's call FPR and TPR at threshold = 1 $FPR_1, TPR_1$ respectively. Given these values we can calculate the AUC as the sum of the areas of a triangle and a trapezoid:

$AUC = \frac{FPR_1 \times TPR_1}{2} + \frac{TPR_1 + 1}{2} \times (1-FPR_1)$

Simplifying:

$AUC = \frac{TPR_1 - FPR_1 + 1}{2}$
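As a quick numerical sanity check of this closed form, the sketch below compares it with a direct pairwise AUC computation (the probability that a random positive is scored above a random negative, with ties counted as one half). The helper `auc_pairwise` and the random data are assumptions made for illustration only.

```python
import numpy as np

def auc_pairwise(y_true, y_pred):
    """AUC as P(positive scored above negative), counting ties as 1/2."""
    pos = y_pred[y_true == 1]
    neg = y_pred[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive/negative pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
y_pred = rng.integers(0, 2, size=2000)  # hard 0/1 "predictions", unrelated to y_true

tpr_1 = np.mean(y_pred[y_true == 1] == 1)  # share of positives predicted as 1
fpr_1 = np.mean(y_pred[y_true == 0] == 1)  # share of negatives predicted as 1

auc_closed_form = (tpr_1 - fpr_1 + 1) / 2
auc_direct = auc_pairwise(y_true, y_pred)
print(auc_closed_form, auc_direct)
```

The two values should agree up to floating-point error for any 0/1 prediction vector.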


For the i-th submission, we have 1s on the i-th chunk and 0s on all other chunks. Assuming the i-th chunk has a positive ratio of $r_i$, the whole test set has a positive ratio of $r$, and all 9 chunks have the same size, we can calculate $FPR_1, TPR_1$ for the i-th submission:

$TPR_1(i) = \dfrac{r_i \times ChunkSize}{r \times TestSetSize} = \dfrac{r_i}{9 \times r}$

Similarly for $FPR_1$:

$FPR_1(i) = \dfrac{1-r_i}{9 \times (1-r)}$

**Key assumption:** assuming the whole test set has a positive ratio of 0.5, i.e. $r=0.5$, we can simplify the formulas above and derive a formula relating AUC to $r_i$:

$TPR_1(i) = \dfrac{r_i}{4.5}, FPR_1(i) = \dfrac{1-r_i}{4.5}$

$AUC(i) = \dfrac{1}{2}(\dfrac{r_i}{4.5} - \dfrac{1-r_i}{4.5} + 1)$

Or:

$AUC(i) = \dfrac{2r_i+3.5}{9}$

In other words, the positive ratio in i-th chunk is:

$r_i = \dfrac{9 AUC(i) - 3.5}{2}$
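The inversion can be checked on synthetic data where the answer is known. The sketch below builds 9 equal chunks with hypothetical positive ratios (chosen so that they average exactly 0.5), forms each probing submission, computes its AUC with the closed form for 0/1 predictions, and recovers $r_i$. All values in `true_ratios` are made up for illustration; they are not the competition's ratios.

```python
import numpy as np

n_chunks, chunk_size = 9, 60_000

# Hypothetical per-chunk positive ratios, chosen to average exactly 0.5
true_ratios = np.array([0.50, 0.60, 0.42, 0.45, 0.55, 0.54, 0.40, 0.47, 0.57])
n_pos = np.rint(true_ratios * chunk_size).astype(int)

# Synthetic labels: the first n_pos[i] rows of chunk i are positive
labels = np.concatenate([(np.arange(chunk_size) < k).astype(int) for k in n_pos])
chunk_id = np.repeat(np.arange(n_chunks), chunk_size)

recovered = []
for i in range(n_chunks):
    submission = (chunk_id == i).astype(int)   # 1 on chunk i, 0 elsewhere
    tpr_1 = submission[labels == 1].mean()     # share of positives inside chunk i
    fpr_1 = submission[labels == 0].mean()     # share of negatives inside chunk i
    auc = (tpr_1 - fpr_1 + 1) / 2              # closed form for 0/1 predictions
    recovered.append((9 * auc - 3.5) / 2)      # invert AUC -> r_i, assuming r = 0.5
recovered = np.array(recovered)

print(np.round(recovered, 5))
```

Because the synthetic overall ratio is exactly 0.5, the recovered values match `true_ratios`; on the real leaderboard the estimates are only as good as the assumptions and the rounding of the reported AUC.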


Given this formula we can estimate the positive ratios for different chunks given the AUC of the submission files:

In [None]:
AUC_chunks = [0.50276, 0.52120, 0.48360, 0.48896, 0.51060, 0.51020, 0.47949, 0.49088, 0.51229]

def ratio_finder(AUC):
    # Invert AUC(i) = (2*r_i + 3.5) / 9, which assumes 9 equal chunks and r = 0.5
    return (9*AUC - 3.5) / 2

[ratio_finder(x) for x in AUC_chunks]

It should be noted again that the estimates were made based on two main assumptions:

* The ratios are the same in public and private test-sets
* The positive ratio over all samples in test set is 0.5
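To get a feeling for how much the second assumption matters, the derivation can be repeated without fixing $r$. Substituting $TPR_1(i)$ and $FPR_1(i)$ into the AUC formula and solving for $r_i$ gives $r_i = r + 9r(1-r)(2AUC(i) - 1)$, which reduces to the formula above at $r = 0.5$. The sketch below evaluates it for a few overall ratios; the function name `ratio_given_overall` and the values 0.45 and 0.55 are hypothetical and used only for illustration.

```python
import numpy as np

AUC_chunks = np.array([0.50276, 0.52120, 0.48360, 0.48896, 0.51060,
                       0.51020, 0.47949, 0.49088, 0.51229])

def ratio_given_overall(auc, r, n_chunks=9):
    """Chunk positive ratio implied by a probing AUC when the overall ratio is r."""
    return r + n_chunks * r * (1 - r) * (2 * auc - 1)

for r in (0.45, 0.50, 0.55):  # hypothetical overall ratios, for illustration only
    print(r, np.round(ratio_given_overall(AUC_chunks, r), 4))
```

Note that across the nine probes the $TPR_1$ values sum to 1 and so do the $FPR_1$ values, so the nine AUCs average to 0.5 whatever $r$ is, and the estimated chunk ratios average to the assumed $r$. These probes alone therefore cannot confirm that $r = 0.5$.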
