# Leaderboard Probing

From [this discussion](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/286731#1578128) we know that the competition's data consists of 19 chunks: 10 in the training set and 9 in the test set. Each chunk looks homogeneous internally, but the chunks differ from one another.

We can think of these chunks as a categorical variable with 19 categories. Unfortunately, only 10 of these categories appear in the training set. For the other 9 categories, there is no labeled data.

In such a situation, we can use leaderboard probing: we submit predictions with the goal of getting information about the test labels. In this notebook, I create 18 submission files for leaderboard probing (9 test chunks times 2 probes each). Every probe is derived from a baseline submission that predicts -1 or +1. In a probe, either the -1 predictions of one test chunk are lowered to -10 or its +1 predictions are raised to +10, so that they rank below or above all other predictions. Each probe returns a leaderboard (LB) AUC score, which we can transform into a target probability for that chunk by simple arithmetic.

After the probing phase, we can submit the target probability for every chunk.
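
The arithmetic can be illustrated on synthetic data. The sketch below is one possible derivation, not necessarily the one behind the numbers in this notebook. With a two-valued baseline (-1 on side 0, +1 on side 1), moving the side-0 samples of one chunk to -10 changes the AUC by `n * S0 * (q0 - p) / (2 * P * N)`, where `p` is the positive rate of the moved samples, `n` their number, `S0` and `q0` the size and positive rate of side 0 overall, and `P`, `N` the total numbers of positives and negatives. All data, sizes and variable names here are made up for the illustration. On the real leaderboard `P`, `N` and `q0` are unknown and would have to be estimated, and the score is reported with limited precision.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy data (assumed sizes, not the competition's)
n_chunks, chunk_size = 9, 2000
chunk = np.repeat(np.arange(n_chunks), chunk_size)
side = rng.integers(0, 2, size=chunk.size)        # stand-in for the hyperplane side
# Hypothetical labels: ~25% positives on side 0, ~75% on side 1, small per-chunk offset
rate = np.where(side == 0, 0.25, 0.75) + 0.01 * (chunk - 4) / 4
y = (rng.random(chunk.size) < rate).astype(int)

baseline = np.where(side == 0, -1.0, 1.0)
auc_base = roc_auc_score(y, baseline)

# Quantities that are known here only because the labels are synthetic
P, N = float(y.sum()), float((1 - y).sum())
S0 = float((side == 0).sum())
q0 = y[side == 0].mean()

def probe_rate(c):
    """Recover the side-0 positive rate of chunk c from the AUC change of its H0 probe."""
    probe = baseline.copy()
    group = (chunk == c) & (side == 0)
    probe[group] = -10.0                          # rank this group below everything else
    delta = roc_auc_score(y, probe) - auc_base
    return q0 - 2.0 * P * N * delta / (group.sum() * S0)

estimated = np.array([probe_rate(c) for c in range(n_chunks)])
actual = np.array([y[(chunk == c) & (side == 0)].mean() for c in range(n_chunks)])
print(np.abs(estimated - actual).max())           # difference is floating-point noise only
```

With exact AUC values the recovered rates match the true per-chunk rates, so the accuracy of a real probe is limited by the precision of the reported score and by the quality of the estimates for `P`, `N` and `q0`.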


In [None]:
import pandas as pd
import numpy as np
import pickle
import io
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

In [None]:
# Read the data
train_df = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
pure_df = pd.read_csv('../input/november21/train.csv')
test_df['chunk'] = test_df.id // 60000
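
As a quick sanity check of the chunk index: integer division by 60000 maps ids to chunk numbers. The ids below are illustrative and assume that the test ids continue directly after the 600000 training rows, which is consistent with the chunk range 10 to 18 used later in this notebook.

```python
import pandas as pd

CHUNK_SIZE = 60000  # rows per chunk, as in the cell above

# Illustrative ids (assumed): first and last row of chunk 10,
# first row of chunk 11, last row of chunk 18
ids = pd.Series([600000, 659999, 660000, 1139999])
chunks = ids // CHUNK_SIZE
print(chunks.tolist())
```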

In [None]:
def postprocess_separate(submission_df, test_df=None, pure_df=None):
    """Update submission_df so that the predictions for the two sides of the hyperplane don't overlap.
    
    Parameters
    ----------
    submission_df : pandas DataFrame with columns 'id' and 'target'
    test_df : the competition's test data
    pure_df : the competition's original training data
    
    From https://www.kaggle.com/ambrosm/tpsnov21-007-postprocessing
    """
    if pure_df is None: pure_df = pd.read_csv('../input/november21/train.csv')
    if pure_df.shape != (600000, 102): raise ValueError("pure_df has the wrong shape")
    if test_df is None: test_df = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
    if test_df.shape[0] != submission_df.shape[0] or test_df.shape[1] != 101: raise ValueError("test_df has the wrong shape")

    # Find the separating hyperplane for pure_df, step 1
    # Use an SVM with almost no regularization
    model1 = make_pipeline(StandardScaler(), LinearSVC(C=1e5, tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=1))
    model1.fit(pure_df.drop(columns=['id', 'target']), pure_df.target)
    pure_pred = model1.predict(pure_df.drop(columns=['id', 'target']))
    print((pure_pred != pure_df.target).sum(), (pure_pred == pure_df.target).sum()) # 1 599999
    # model1 is not perfect: it predicts the wrong class for 1 of 600000 samples

    # Find the separating hyperplane for pure_df, step 2
    # Fit a second SVM to a subset of the points which contains the support vectors
    pure_pred = model1.decision_function(pure_df.drop(columns=['id', 'target']))
    subset_df = pure_df[(pure_pred > -5) & (pure_pred < 0.9)]
    model2 = make_pipeline(StandardScaler(), LinearSVC(C=1e5, tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=1))
    model2.fit(subset_df.drop(columns=['id', 'target']), subset_df.target)
    pure_pred = model2.predict(pure_df.drop(columns=['id', 'target']))
    print((pure_pred != pure_df.target).sum(), (pure_pred == pure_df.target).sum()) # 0 600000
    # model2 is perfect: it predicts the correct class for all 600000 training samples
    
    pure_test_pred = model2.predict(test_df.drop(columns=['id', 'target'], errors='ignore'))
    lmax, rmin = submission_df[pure_test_pred == 0].target.max(), submission_df[pure_test_pred == 1].target.min()
    if lmax < rmin:
        print("There is no overlap. No postprocessing needed.")
        return
    # There is overlap. Remove this overlap
    submission_df.loc[pure_test_pred == 0, 'target'] -= lmax + 1
    submission_df.loc[pure_test_pred == 1, 'target'] -= rmin - 1
    print(submission_df[pure_test_pred == 0].target.min(), submission_df[pure_test_pred == 0].target.max(),
          submission_df[pure_test_pred == 1].target.min(), submission_df[pure_test_pred == 1].target.max())


In [None]:
# Create a baseline submission which always predicts -1 or +1 and has an lb score of 0.74723
baseline = pd.DataFrame({'id': test_df.id, 'target': 0})
postprocess_separate(baseline, test_df=test_df.drop(columns='chunk'), pure_df=pure_df)
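
To see why an all-zero submission turns into -1/+1, here is the shift step of `postprocess_separate` in isolation, applied to toy arrays. `separate` is a hypothetical helper written only for this illustration; the `side` labels stand in for the predictions of `model2`.

```python
import numpy as np

def separate(scores, side):
    """Shift the scores so that side 0 ends at -1 and side 1 starts at +1 (if they overlap)."""
    scores = np.asarray(scores, dtype=float).copy()
    side = np.asarray(side)
    lmax = scores[side == 0].max()   # largest score on side 0
    rmin = scores[side == 1].min()   # smallest score on side 1
    if lmax < rmin:
        return scores                # no overlap, nothing to do
    scores[side == 0] -= lmax + 1    # side 0 now has maximum -1
    scores[side == 1] -= rmin - 1    # side 1 now has minimum +1
    return scores

side = [0, 0, 1, 1]
shifted = separate([0.2, 0.6, 0.4, 0.9], side)   # overlapping scores
zeros = separate(np.zeros(4), side)              # the all-zero baseline case
print(shifted, zeros)
```

For the all-zero input, `lmax` and `rmin` are both 0, so side 0 is shifted by -1 and side 1 by +1, which gives the two-valued baseline.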


In [None]:
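# Create 18 probe submissions (9 test chunks x 2 sides of the hyperplane) to gather information
for chunk in range(10, 19):
    # H0 probe: push this chunk's side-0 predictions (-1) below all other predictions
    sub = baseline.copy()
    sub.loc[(test_df.chunk == chunk) & (sub.target < 0), 'target'] = -10
    sub.to_csv(f'submission_probe_{chunk}_H0.csv', index=False)
    # H1 probe: push this chunk's side-1 predictions (+1) above all other predictions
    sub = baseline.copy()
    sub.loc[(test_df.chunk == chunk) & (sub.target > 0), 'target'] = +10
    sub.to_csv(f'submission_probe_{chunk}_H1.csv', index=False)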

I submitted 9 of these probe submissions (the H0 probes, one per test chunk) and derived the following target probabilities for the samples with a baseline prediction of -1:

In [None]:
p_dict = {10: 0.26245756846719176,
          17: 0.25772808586762075,
          16: 0.25038670867946144,
          13: 0.2498515790341643,
          18: 0.24863555967320816,
          11: 0.2476293839324911,
          14: 0.2448713889988128,
          12: 0.24464126228044064,
          15: 0.2418890814558059}

Now we create a new submission from these probabilities. This submission has an LB score of 0.74896.

In [None]:
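sub = baseline.copy()
# Side 1 has not been probed yet: use a constant 0.75 for all of its samples
sub.loc[baseline.target == 1, 'target'] = 0.75
# Side 0: replace -1 by the probed target probability of the sample's chunk
for chunk in range(10, 19):
    sub.loc[(test_df.chunk == chunk) & (sub.target < 0), 'target'] = p_dict[chunk]
sub.to_csv('submission_probed.csv', index=False)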
sub.head(20)
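
Why can per-chunk constants beat the -1/+1 baseline? AUC depends only on the ranking, and in the baseline all side-0 samples are tied. Ranking the chunks by their positive rate resolves these ties in the right direction. The following toy check uses synthetic labels; all numbers and names are made up for the illustration, and the empirical per-chunk rate stands in for a probed probability.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy data (assumed sizes, not the competition's)
n_chunks, chunk_size = 9, 2000
chunk = np.repeat(np.arange(n_chunks), chunk_size)
side = rng.integers(0, 2, size=chunk.size)        # stand-in for the hyperplane side
rate = np.where(side == 0, 0.25, 0.75) + 0.01 * (chunk - 4) / 4
y = (rng.random(chunk.size) < rate).astype(int)

# Baseline: every side-0 sample is tied at -1
baseline = np.where(side == 0, -1.0, 1.0)

# "Probed" submission: one constant per chunk on side 0, 0.75 on side 1
probed = np.where(side == 0, 0.0, 0.75)
for c in range(n_chunks):
    group = (chunk == c) & (side == 0)
    probed[group] = y[group].mean()

auc_baseline = roc_auc_score(y, baseline)
auc_probed = roc_auc_score(y, probed)
print(auc_baseline, auc_probed)
```

The gain is small because the chunk rates are close to each other, which matches the modest LB improvement reported above.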

We can do better than that. f93 is the single feature which best predicts the ranking of samples within a chunk. We therefore subtract a tiny multiple of f93 (1e-5 times its value) from the per-chunk predictions, so that the samples within a chunk are no longer tied, and get an LB score of 0.74902. Isn't this a nice example of feature selection?

In [None]:
sub = baseline.copy()
sub.loc[baseline.target == 1, 'target'] = 0.75
for chunk in range(10, 19):
    mask = (test_df.chunk == chunk) & (sub.target < 0)
    # Per-chunk probability minus a tiny multiple of f93, which orders the samples within the chunk
    sub.loc[mask, 'target'] = p_dict[chunk] - test_df.loc[mask, 'f93'] * 1e-5
sub.to_csv('submission_probed_f93.csv', index=False)
sub.head(20)
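
The tie-breaking idea can be sketched with a synthetic feature in place of f93 and made-up chunk levels. A tiny multiple of the feature orders the samples within a chunk; as long as the largest shift stays below the smallest gap between chunk levels, the order between chunks is untouched. Whether 1e-5 satisfies this for the real f93 depends on its range, which this sketch does not know.

```python
import numpy as np

rng = np.random.default_rng(1)

levels = np.array([0.2446, 0.2449, 0.2476])   # three hypothetical per-chunk probabilities
chunk = np.repeat(np.arange(3), 5)            # five samples per chunk
feature = rng.normal(size=chunk.size)         # synthetic stand-in for f93
eps = 1e-5

scores = levels[chunk] - eps * feature

# The order between chunks is preserved if the largest shift is below the smallest level gap
max_shift = eps * (feature.max() - feature.min())
min_gap = np.diff(np.sort(levels)).min()
order_preserved = max_shift < min_gap
print(order_preserved)
```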

Now I'm going to blend this result with the result of the notebook I published a few days ago to get the final predictions:

In [None]:
other_submission = pd.read_csv('../input/tpsnov21-007-postprocessing/submission_postprocessed.csv')
# Blend by adding the two predictions; this assumes both files list the test rows in the same order
sub['target'] += other_submission.target
sub.to_csv('submission_probed_f93_blended.csv', index=False)
sub.head(20)
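
The in-place addition relies on both files listing the test rows in the same order. If that is not guaranteed, a merge on `id` is a safer variant. The sketch below uses tiny made-up frames rather than the real submission files.

```python
import pandas as pd

# Made-up stand-ins for the two submissions; note the different row order
sub = pd.DataFrame({'id': [600000, 600001, 600002], 'target': [0.25, 0.75, 0.26]})
other = pd.DataFrame({'id': [600002, 600000, 600001], 'target': [-1.2, -1.5, 1.3]})

# Align on id; validate raises if an id occurs more than once on either side
blended = sub.merge(other, on='id', suffixes=('', '_other'), validate='one_to_one')
blended['target'] = blended['target'] + blended['target_other']
blended = blended[['id', 'target']]
print(blended)
```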

What remains to be done? Probe the other nine probabilities (the H1 probes), and you'll get a much higher score!