# Postprocessing

A few days ago, the [original train.csv of this competition](https://www.kaggle.com/criskiev/november21) was published. It shows the labels before 25 % of them were flipped. We can use these data to postprocess the output of neural networks and improve the score.

The original training file has the nice property that a single hyperplane separates the two classes perfectly (i.e. with 100 % accuracy). The same hyperplane is meaningful in the updated train.csv as well: on one side of it, about 75 % of the samples are class 0; on the other side, about 75 % are class 1.
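
To make the 75 % figure concrete, here is a minimal sketch on synthetic data (not the competition data). The feature matrix `X`, the hyperplane normal `w`, and the sample sizes are made up for illustration; the only thing taken from the text above is the idea of perfectly separable labels of which 25 % are then flipped at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 20,000 samples with 5 features
n_samples, n_features = 20_000, 5
X = rng.normal(size=(n_samples, n_features))

# Hypothetical hyperplane through the origin with normal vector w
w = rng.normal(size=n_features)
side = (X @ w > 0).astype(int)  # "pure" labels: perfectly separable by construction

# Flip 25 % of the labels at random to imitate the updated training file
flip = rng.random(n_samples) < 0.25
noisy = np.where(flip, 1 - side, side)

# Share of the majority class on each side of the hyperplane
frac_class0_left = (noisy[side == 0] == 0).mean()
frac_class1_right = (noisy[side == 1] == 1).mean()
print(f"class 0 share on side 0: {frac_class0_left:.3f}")
print(f"class 1 share on side 1: {frac_class1_right:.3f}")
```

Both shares should come out close to 0.75, because the flips are independent of the side a sample lies on.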

If we want to optimize a neural network's predictions for the highest possible AUC score, we have to ensure that the network outputs low predictions on one side of the hyperplane and high predictions on the other side, with no overlap between the two sides. Because AUC depends only on the rank order of the predictions, it suffices to shift the predictions of each side as a block; the order within a side stays unchanged.
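
The effect of removing the overlap can be illustrated with a small synthetic sketch. Everything here is an assumption made for illustration: the simulated predictions `pred`, the noise level, and the helper name `remove_overlap` are hypothetical, and the side of the hyperplane is simply given rather than estimated with an SVM. The helper uses the same kind of shift as `postprocess_separate()`, applied to a NumPy array.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n_samples = 10_000

# Hypothetical setup: side of the hyperplane (0 or 1) and labels with 25 % flipped
side = rng.integers(0, 2, size=n_samples)
flip = rng.random(n_samples) < 0.25
y = np.where(flip, 1 - side, side)

# Simulated network output: follows the side, but noisy enough that the sides overlap
pred = 0.25 + 0.5 * side + rng.normal(scale=0.3, size=n_samples)

def remove_overlap(pred, side):
    """Shift each side as a block so that side 0 ends at -1 and side 1 starts at +1."""
    out = pred.copy()
    lmax = out[side == 0].max()
    rmin = out[side == 1].min()
    if lmax < rmin:
        return out  # no overlap, nothing to do
    out[side == 0] -= lmax + 1  # maximum of side 0 becomes -1
    out[side == 1] -= rmin - 1  # minimum of side 1 becomes +1
    return out

separated = remove_overlap(pred, side)
auc_before = roc_auc_score(y, pred)
auc_after = roc_auc_score(y, separated)
print(f"AUC before: {auc_before:.4f}, AUC after: {auc_after:.4f}")
```

In this sketch the order of the predictions within each side is unchanged; only the pairs that straddle the hyperplane are reordered, and most of those pairs have the positive sample on side 1.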

The `postprocess_separate()` function in this notebook reads a submission file, checks if the predictions overlap and removes the overlap. It can be applied to the output of any network; for demonstration purposes we apply it to @adityasharma01's [simple NN](https://www.kaggle.com/adityasharma01/simple-nn-tps-nov-21).


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score, accuracy_score


In [None]:
def postprocess_separate(submission_df, test_df=None, pure_df=None):
    """Update submission_df in place so that the predictions for the two sides of the hyperplane don't overlap.
    
    Parameters
    ----------
    submission_df : pandas DataFrame with columns 'id' and 'target'
    test_df : the competition's test data
    pure_df : the competition's original training data
    
    From https://www.kaggle.com/ambrosm/tpsnov21-007-postprocessing
    """
    if pure_df is None: pure_df = pd.read_csv('../input/november21/train.csv')
    if pure_df.shape != (600000, 102): raise ValueError("pure_df has the wrong shape")
    if test_df is None: test_df = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
    if test_df.shape[0] != submission_df.shape[0] or test_df.shape[1] != 101: raise ValueError("test_df has the wrong shape")

    # Find the separating hyperplane for pure_df, step 1
    # Use an SVM with almost no regularization
    model1 = make_pipeline(StandardScaler(), LinearSVC(C=1e5, tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=1))
    model1.fit(pure_df.drop(columns=['id', 'target']), pure_df.target)
    pure_pred = model1.predict(pure_df.drop(columns=['id', 'target']))
    print((pure_pred != pure_df.target).sum(), (pure_pred == pure_df.target).sum()) # 1 599999
    # model1 is not perfect: it predicts the wrong class for 1 of 600000 samples

    # Find the separating hyperplane for pure_df, step 2
    # Fit a second SVM to a subset of the points which contains the support vectors
    pure_pred = model1.decision_function(pure_df.drop(columns=['id', 'target']))
    subset_df = pure_df[(pure_pred > -5) & (pure_pred < 0.9)]
    model2 = make_pipeline(StandardScaler(), LinearSVC(C=1e5, tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=1))
    model2.fit(subset_df.drop(columns=['id', 'target']), subset_df.target)
    pure_pred = model2.predict(pure_df.drop(columns=['id', 'target']))
    print((pure_pred != pure_df.target).sum(), (pure_pred == pure_df.target).sum()) # 0 600000
    # model2 is perfect: it predicts the correct class for all 600000 training samples
    
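    # Determine on which side of the hyperplane every test sample lies
    pure_test_pred = model2.predict(test_df.drop(columns=['id', 'target'], errors='ignore'))
    # lmax: highest prediction on side 0; rmin: lowest prediction on side 1
    lmax = submission_df[pure_test_pred == 0].target.max()
    rmin = submission_df[pure_test_pred == 1].target.min()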
    if lmax < rmin:
        print("There is no overlap. No postprocessing needed.")
        return
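    # There is overlap. Remove it by shifting each side as a block:
    # afterwards the maximum of side 0 is -1 and the minimum of side 1 is +1.
    # The shifted values are no longer probabilities, but AUC depends only on their rank order.
    submission_df.loc[pure_test_pred == 0, 'target'] -= lmax + 1
    submission_df.loc[pure_test_pred == 1, 'target'] -= rmin - 1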
    print(submission_df[pure_test_pred == 0].target.min(), submission_df[pure_test_pred == 0].target.max(),
          submission_df[pure_test_pred == 1].target.min(), submission_df[pure_test_pred == 1].target.max())


In [None]:
sub = pd.read_csv('../input/simple-nn-tps-nov-21/submission.csv')
postprocess_separate(sub)
sub.to_csv("submission_postprocessed.csv", index=False)
sub.head()