**DISCLAIMER:** This notebook will not help you better understand the competition. Kaggle Admin William Cukierski has already mentioned that they do not like it when people use this approach to gain a high leaderboard score. It creates confusion for outsiders and discourages new ML'ers from participating. Please don't submit a perfect submission to the leaderboard. This notebook is for educational purposes only.

Extracting knowledge using leaderboard submissions
=========================================
In most Kaggle competitions, the test set is split into a public and a private part. Scores reported on the leaderboard are based solely on the public part, so it is not possible to obtain information about your performance on the data in the private part. As a result, a perfect final score can only come from a model that is actually perfect. In the Data Science Bowl 2017, however, the test set is small and the leaderboard score is computed on the whole test set. Final performance is based on a data set that will become available later in the competition. The combination of a small test set and a reported score based on all of it means that it is possible to find the labels of the cases in the test set.

This is where [Oleg Trott](https://www.kaggle.com/olegtrott) comes in. He obtained the number 1 spot on the leaderboard with a perfect log-loss of 0.00000 ([his post about this](https://www.kaggle.com/c/data-science-bowl-2017/forums/t/27800/my-perfect-score)). His post sparked my interest, and together with [Mark](https://markv.nl/blag/efficient-overfitting-of-training-data-kaggle-bowl) I thought of a way to accomplish the same. Let's get started!

These imports speak for themselves.

In [None]:
import numpy as np
from sklearn.metrics import log_loss

Of course, we need to test our program on a list of "true values". Since we don't have access to the true test set labels (yet), we'll generate our own truth.

In [None]:
N = 198
y_true = np.random.randint(2, size=N)

These functions calculate how our 'predictions' perform on this competition's metric: the logarithmic loss.

In [None]:
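def score_submission(pred, true):
    # Clip predictions away from 0 and 1 (same 1e-15 bound as before).
    # np.clip is used because the second positional argument of np.min/np.max is `axis`.
    pred = np.clip(np.asarray(pred, dtype=float), 1e-15, 1 - 1e-15)
    return log_loss(true, pred, labels=[0, 1])

def score_val(val, true):
    # Log-loss of a single clipped prediction against a single label.
    val = np.clip(float(val), 1e-15, 1 - 1e-15)
    return log_loss([true], [val], labels=[0, 1])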
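
For reference, the quantity these helpers compute can be written out as follows, assuming the standard binary log-loss with each prediction clipped to $[10^{-15}, 1-10^{-15}]$ as in the code above:

$$
L = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log p_i + (1-y_i)\log(1-p_i)\Big]
$$

Here $y_i$ is the true label and $p_i$ the clipped predicted probability of label 1. The symbols are chosen for this note and do not appear in the code.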

The function `make_predict_at` is what we will use to generate the values in our submission. We will extract the labels for several samples at a time; how many is set with `n_predicts`. How many we can do at the same time depends on the amount of information we get from the leaderboard response and on the maximum contribution of a single value.

Let's have a look at the edge cases, in which we are extremely certain about our predictions (0.0 or 1.0) and either right or wrong:

In [None]:
print("Correct:", score_val(1,1)/N)
print("Wrong:", score_val(1,0)/N)

Kaggle returns the score rounded to 5 decimal places, and the contribution of a single value to the log-loss lies in the range [0.00000, 0.17444]. We want to be able to separate the score into contributions from each value, for which we have 17445 distinguishable values available (since the score can change in increments of 0.00001). This is equivalent to ~14 bits of information, each of which we will use to keep a single value separate from the others.
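
The numbers in this paragraph can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the standard library; the variable names are chosen for illustration, and the clipping bound of 1e-15 is taken from the scoring helpers above.

```python
import math

N = 198            # number of cases in the test set
EPS = 1e-15        # clipping bound used by the scoring helpers
DECIMALS = 5       # the leaderboard score is rounded to 5 decimal places

# Largest possible contribution of a single value to the mean log-loss.
max_contribution = -math.log(EPS) / N

# Number of distinct rounded scores between 0 and that maximum.
levels = round(max_contribution * 10**DECIMALS) + 1

# Equivalent amount of information in bits.
bits = math.log2(levels)

print(f"{max_contribution:.5f}", levels, f"{bits:.2f}")
```

The three printed values correspond to the upper end of the range, the count of distinguishable scores, and the roughly 14 bits mentioned above.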

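The predicted values are exponentials of negative powers of two, so for a true label of 1 the exponent cancels out the logarithm in the log-loss and the contribution is simply that power of two. This makes the score easier to work with in the next part.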

In [None]:
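n_predicts = 15
def make_predict_at(n, n_predicts=10, N=N):
    # Start from an uninformative 0.5 for every sample.
    pred = np.ones(N)*0.5
    # Probe n_predicts samples starting at index n. The exponent depends on the
    # position within the batch (i - n), so the probe values are the same for any n.
    for i in range(n, min(n+n_predicts, N)):
        pred[i] = np.exp(-0.5**(i-n+2))
    return pred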
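
To see the cancellation concretely, the sketch below lists what a single probed value contributes to the summed log-loss when the true label is 1 and when it is 0. It uses only the standard library, looks at the first five probe values only, and the list names are illustrative.

```python
import math

correct_terms = []   # contribution when the true label is 1
wrong_terms = []     # contribution when the true label is 0
for k in range(5):
    x = 0.5 ** (k + 2)           # 0.25, 0.125, 0.0625, ...
    p = math.exp(-x)             # probe value, as in make_predict_at
    correct_terms.append(-math.log(p))       # equals x: exp and log cancel
    wrong_terms.append(-math.log(1 - p))     # grows as p gets closer to 1

for k, (c, w) in enumerate(zip(correct_terms, wrong_terms)):
    print(k, round(c, 6), round(w, 6))
```

For a label of 1 the contribution is exactly the power of two, while for a label of 0 it is a much larger value that increases with `k`.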

I multiplied the score by the number of values in the test set and removed the contribution of all the 0.5's, which leaves just the sum of logarithms for the probed values.

In [None]:
pred = make_predict_at(0, n_predicts=n_predicts)
score = score_submission(pred, y_true)
score = score*N - score_val(0.5, 0)*(N-n_predicts)
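
The rescaling in the cell above can be written out as a short derivation. It assumes the mean binary log-loss $L$ over $N$ values, of which $n$ (`n_predicts`) are probed and the rest are set to 0.5; the symbols $\ell_i$ and $P$ (the set of probed indices) are introduced here for illustration only.

$$
N \cdot L = \sum_{i \in P} \ell_i + (N-n)\log 2
\quad\Longrightarrow\quad
\sum_{i \in P} \ell_i = N \cdot L - (N-n)\log 2
$$

Here $\ell_i = -\log p_i$ if $y_i = 1$ and $\ell_i = -\log(1-p_i)$ if $y_i = 0$. The term `score_val(0.5, 0)` equals $\log 2 \approx 0.693$, the loss of a 0.5 prediction regardless of the label.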

The nested for-loops below calculate every possible score that our predicted values can produce, and keep track of whether each term in a sum comes from a label of 1 or a label of 0.

In [None]:
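# Start with a single empty combination whose sum is 0.
sums = [0.0]
track_sum = [[]]
for i in range(n_predicts):
    old_sums = sums
    old_track = track_sum
    sums = []
    track_sum = []
    # Extend every combination so far with both possible labels for value i.
    for s,t in zip(old_sums, old_track):
        sums.append(s-np.log(pred[i]))      # true label is 1
        sums.append(s-np.log(1-pred[i]))    # true label is 0

        track_sum.append(t+[1])
        track_sum.append(t+[0])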

We calculate the absolute difference between each possible sum and the score we retrieved, and then sort. The first entry of `sorted_sum` is the combination closest to the score, which means that is where we will find our labels!

In [None]:
diff_sums = np.abs(np.array(sums)-score)
sorted_sum = sorted(zip(diff_sums, track_sum))

print(sorted_sum[0][1])
print(list(y_true[:n_predicts]))
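
The closest candidate identifies the labels only if no two label combinations produce (nearly) the same sum. The sketch below is one way to inspect this: it lists all candidate sums for a small, illustrative batch of 4 probed values and reports the smallest gap between neighbouring sums. It uses only the standard library, and the names `candidates` and `min_gap` are hypothetical.

```python
import math
from itertools import product

n = 4  # small number of probed values, so every candidate can be listed
pred = [math.exp(-0.5 ** (k + 2)) for k in range(n)]

candidates = []
for labels in product((1, 0), repeat=n):
    # Sum of per-value log-loss terms for this label combination.
    total = sum(-math.log(p) if y == 1 else -math.log(1 - p)
                for p, y in zip(pred, labels))
    candidates.append((total, labels))

# Sort by sum and measure the distance between neighbouring candidates.
candidates.sort()
gaps = [b[0] - a[0] for a, b in zip(candidates, candidates[1:])]
min_gap = min(gaps)

print(len(candidates), min_gap)
```

When the score is only known after rounding, a reasonable sanity check is to compare the smallest gap with the rounding resolution of the rescaled score (about `N * 0.00001`), since the gaps change with the number of values probed at once.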

In case you are not convinced that this wasn't a lucky shot, the code below repeats the experiment 10 times ;)

In [None]:
for test_n in range(10):
    # Draw a fresh random "truth" for every repetition.
    y_true = np.random.randint(2, size=N)

    pred = make_predict_at(0, n_predicts=n_predicts)
    score = score_submission(pred, y_true)
    score = score*N - score_val(0.5, 0)*(N-n_predicts)

    # Enumerate every candidate sum together with the labels that produce it.
    sums = [0.0]
    track_sum = [[]]
    for i in range(n_predicts):
        old_sums = sums
        old_track = track_sum
        sums = []
        track_sum = []
        for s,t in zip(old_sums, old_track):
            sums.append(s-np.log(pred[i]))
            sums.append(s-np.log(1-pred[i]))

            track_sum.append(t+[1])
            track_sum.append(t+[0])

    diff_sums = np.abs(np.array(sums)-score)
    sorted_sum = sorted(zip(diff_sums, track_sum))

    print(test_n)
    print(sorted_sum[0][1])
    print(list(y_true[:n_predicts]))
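
The experiments above only probe the first `n_predicts` values. As a purely simulated illustration of how the start index `n` could be moved along the whole list, here is a minimal standard-library sketch. The helper names (`probe`, `decode`, `recover_all`), the batch size of 5, and the list length of 23 are assumptions made for this example and are not part of the notebook; the exponent is taken relative to the start of each batch, and nothing here talks to a leaderboard.

```python
import math
import random
from itertools import product

EPS = 1e-15  # clipping bound, as in the scoring helpers

def mean_log_loss(pred, true):
    """Mean binary log-loss with clipping; stands in for one scored submission."""
    total = 0.0
    for p, y in zip(pred, true):
        p = min(max(p, EPS), 1 - EPS)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(true)

def probe(start, batch, n_total):
    """0.5 everywhere, except for one batch of probe values starting at `start`."""
    pred = [0.5] * n_total
    for k in range(min(batch, n_total - start)):
        pred[start + k] = math.exp(-0.5 ** (k + 2))
    return pred

def decode(score, pred, start, batch):
    """Return the label combination whose summed loss is closest to the score."""
    n_total = len(pred)
    size = min(batch, n_total - start)
    # Undo the mean and remove the contribution of all the 0.5's.
    target = score * n_total - math.log(2) * (n_total - size)
    best = None
    for labels in product((0, 1), repeat=size):
        total = sum(-math.log(pred[start + k]) if y == 1
                    else -math.log(1 - pred[start + k])
                    for k, y in enumerate(labels))
        diff = abs(total - target)
        if best is None or diff < best[0]:
            best = (diff, labels)
    return list(best[1])

def recover_all(true, batch=5):
    """Simulate one probe per batch until every label has been decoded."""
    recovered = []
    for start in range(0, len(true), batch):
        pred = probe(start, batch, len(true))
        score = mean_log_loss(pred, true)  # full-precision score, no rounding
        recovered += decode(score, pred, start, batch)
    return recovered

rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(23)]
recovered = recover_all(labels, batch=5)
print(recovered == labels)
```

The sketch works on an unrounded score, like the experiments above; a score rounded to 5 decimal places carries less information, so the usable batch size would have to be checked separately.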