Mean target distribution for test chunks
==

This notebook makes use of two observations:

1. [The data is chunked](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/286731)
2. [The original train labels are known](https://www.kaggle.com/criskiev/november21)

We make two assumptions:

1. The initial test labels were "too easy"
2. The original labels were flipped at random, with the same random chance across both train and test sets

Imports & setup
--

Nothing interesting here.

In [None]:
import random
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score

np.random.seed(64)
random.seed(64)

folds = StratifiedKFold(5, random_state=64, shuffle=True)
sns.set(
    style='darkgrid', context='notebook', rc={
        'figure.frameon': False, 'figure.figsize': (12, 8), 'legend.frameon': False,
    }
)

Reading both old and new labels
--

We standardize the train and test features right away, using a single `StandardScaler` fitted on both sets, and read both the new and old train labels:

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv', dtype=np.float32)
df_test = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv', dtype=np.float32)
old_y = pd.read_csv('../input/november21/train.csv', usecols=['target'], dtype=np.float32).target

features = df_train.columns[df_train.columns.str.startswith('f')]
scaler = StandardScaler().fit(pd.concat([df_train[features], df_test[features]], axis=0))
X = scaler.transform(df_train[features])
X_test = scaler.transform(df_test[features])
new_y = df_train.target

Establish baselines for LogisticRegression
==

We'll establish baselines for the old and new labels using a stratified 5-fold split:

In [None]:
# Out-of-fold decision scores for a plain logistic regression on the old labels
old_y_score = cross_val_predict(LogisticRegression(), X, old_y, method='decision_function', n_jobs=-1, cv=folds)
old_auc, old_acc = roc_auc_score(old_y, old_y_score), accuracy_score(old_y, old_y_score > 0)
# Same model, fitted on the new (flipped) labels
new_y_score = cross_val_predict(LogisticRegression(), X, new_y, method='decision_function', n_jobs=-1, cv=folds)
new_auc, new_acc = roc_auc_score(new_y, new_y_score), accuracy_score(new_y, new_y_score > 0)
# Fraction of train labels that differ between the old and new files
flip_chance = np.mean(new_y != old_y)

print(f'Old labels: auc={old_auc:.5f} acc={old_acc:.5f} => new labels: auc={new_auc:.5f} acc={new_acc:.5f}, flip chance = {flip_chance:.5f}')

If test isn't too dissimilar from train, it seems fair enough to assume that a simple model should score better than 99% accuracy on the original (pre-flip) test labels as well.

Flip chance = 25.12%
--

With the new labels, the aggregate flip chance seems to be about 25.12%. Let's check whether that holds true across the train chunks:

In [None]:
pd.options.plotting.backend = "plotly"

pd.DataFrame({
    'chunk': np.arange(len(new_y)) // 60000,
    'flips': new_y != old_y
}).groupby('chunk').flips.mean().plot.bar(y='flips', title='flip chance by chunk')

This is different from what I had expected: I was assuming the flip chance would be the same across all chunks, but clearly it's not. Maybe a flip chance was drawn for each chunk independently, from a distribution centered around .25?
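To make that guess concrete, here is a minimal sketch of what such a process could look like. It is only an illustration of the hypothesis, not the organizers' actual procedure: the normal distribution and the `spread` value are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(64)

n_chunks, chunk_size = 10, 60_000   # train layout used in this notebook
spread = 0.005                      # hypothetical std of the per-chunk flip chance

# Draw one flip chance per chunk, centered on .25 (normal shape is an assumption)
chunk_flip_chance = rng.normal(loc=0.25, scale=spread, size=n_chunks)

# Flip every sample independently, using the chance drawn for its chunk
flips = rng.uniform(size=(n_chunks, chunk_size)) < chunk_flip_chance[:, None]
observed = flips.mean(axis=1)

for chunk, (p, obs) in enumerate(zip(chunk_flip_chance, observed)):
    print(f'chunk {chunk}: drawn chance={p:.4f} observed flips={obs:.4f}')
```

With 60000 samples per chunk, the observed flip rate stays very close to the chance drawn for that chunk, so any visible spread between chunks would mostly come from the per-chunk draw itself.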

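Anyway, if the flip chance is .25, we expect the mean of the new labels to equal `.25 + .5 * y`, where `y` is the mean of the old labels, i.e. before the flips. A fraction `y` of ones survives with chance .75 and a fraction `1 - y` of zeros flips to one with chance .25, giving `.75 * y + .25 * (1 - y)`. Let's check how that holds up: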

In [None]:
distribution = pd.DataFrame({
    'chunk': np.arange(len(new_y)) // 60000,
    'old_y': old_y,
    'new_y': new_y,
}).groupby('chunk').mean().assign(
    expected_dist=lambda df: .25 + .5 * df.old_y 
)

distribution.plot.bar(barmode='group', title='mean(target) by new, old vs formula 25% flips')

Given pre-flip labels, we can make good estimates of `mean(target)` by chunk.
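As a sanity check on the formula itself, here is a small self-contained sketch on synthetic labels. The helper `expected_new_mean` and the synthetic chunk are made up for illustration. The helper generalizes the formula to any flip chance `p`, as `p + (1 - 2p) * y`, which reduces to `.25 + .5 * y` for `p = .25`.

```python
import numpy as np

def expected_new_mean(old_mean, flip_chance=0.25):
    """Expected mean(target) after flipping each label independently."""
    # A 1 stays 1 with chance (1 - p); a 0 becomes 1 with chance p:
    # E[new] = old * (1 - p) + (1 - old) * p = p + (1 - 2p) * old
    return flip_chance + (1 - 2 * flip_chance) * old_mean

rng = np.random.default_rng(64)

# Synthetic chunk of 60000 binary labels with roughly 40% ones
old = (rng.uniform(size=60_000) < 0.4).astype(np.float32)

# Flip each label independently with chance .25
flip = rng.uniform(size=old.size) < 0.25
new = np.where(flip, 1 - old, old)

print(f'simulated={new.mean():.4f} formula={expected_new_mean(old.mean()):.4f}')
```

If the flip chance differs per chunk, as the plot above suggests it might, passing that chunk's own flip rate as `flip_chance` should give a tighter estimate than the fixed .25.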

Estimating test chunks label distribution
--

This is really just putting together all the steps above:

1. Fit a model on the old train labels
2. Predict on the test set to obtain a good approximation of the old test labels
3. Assuming a 25% flip chance, apply `.25 + .5 * y` to the mean predicted label of each chunk

In [None]:
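# Fit on the old (pre-flip) train labels
clf = LogisticRegression().fit(X, old_y)
# Hard 0/1 predictions stand in for the unknown old test labels
old_y_test = clf.predict(X_test)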

pd.DataFrame({
    'chunk': np.arange(len(old_y_test)) // 60000 + 10,
    'old_y': old_y_test
}).groupby('chunk').old_y.mean().plot.bar(title='Test label distribution by chunk before flip')

But we can just put all of this together in the same DataFrame: we'll populate one with all the old labels, then flip 25% of them at random, and that should end up fairly close to the actual label distribution!

In [None]:
old_labels = pd.DataFrame({
    'target': np.concatenate([old_y, old_y_test])
}).assign(
    chunk=lambda df: np.arange(len(df)) // 60000
)

flip = np.random.uniform(size=len(old_labels)) < .25
df = old_labels.assign(
    new_target=lambda df: df.target.where(~flip, 1 - df.target)
)

df.groupby('chunk').mean().plot.bar(title='Label distributions by chunk', barmode='group')

Of course, there's a good chance that we haven't got the right idea about how the samples to flip were selected, because the per-chunk flip rates we get this way look wrong compared to the real ones we plotted above:

In [None]:
df.assign(flips=df.new_target != df.target).groupby('chunk').flips.mean().plot.bar(title='flips by chunk')

Notably, chunk 2 has only about 24% of its labels flipped in the real train set, which is incredibly unlikely to happen when every label is flipped independently with chance .25, the way I've been doing these flips.
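To put a rough number on "incredibly unlikely", here is a back-of-the-envelope sketch using the normal approximation to the binomial. The value `observed = 0.24` is just the approximate figure quoted above, not an exact reading, so treat the result as an order of magnitude only.

```python
import math

n = 60_000          # samples per chunk
p = 0.25            # assumed flip chance
observed = 0.24     # approximate chunk 2 flip rate quoted above

# Standard deviation of one chunk's flip rate under independent flips
std = math.sqrt(p * (1 - p) / n)
z = (observed - p) / std

# Normal approximation to the lower tail, P(rate <= observed)
tail = 0.5 * math.erfc(-z / math.sqrt(2))

print(f'std={std:.5f} z={z:.2f} tail probability ~ {tail:.1e}')
```

Under these assumptions the observed rate sits more than five standard deviations below .25, which supports the suspicion that the flips were not drawn with one fixed chance for every sample.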

Expected label distributions in the test set
==

We'll write this back out in case somebody wants to use it for anything. Here are the numbers for the train set; these are known:

In [None]:
distribution

Here are the numbers for the test set; these are estimates that rest on the assumed 25% flip chance:

In [None]:
projected = df.loc[df.chunk > 9].rename(columns={'target': 'old_y', 'new_target': 'new_y'})
projected = projected.groupby('chunk').mean().assign(
    expected_dist=lambda df: .25 + .5 * df.old_y
)
projected

We'll write out both of these to a single CSV in case anybody wants to have a crack at using them for something:

In [None]:
pd.concat([distribution, projected], axis=0).reset_index().to_csv('label_distributions.csv', index=False)