## How did I get this data?

By chance, I downloaded `train.csv.zip` within the first hour after the competition started. For several days I stayed away from the notebooks and the forum so as not to spoil the pleasure of diving into the data.

The dataset was bizarre from the start. I was totally baffled, and finally went to the forum and checked other people's notebooks to understand what was wrong: my code behaved differently locally and on Kaggle.

After downloading the data again and comparing it with my local copy, I realized that my `train.csv` was the original version, without the flipped labels.

Most likely the organizers made a mistake and uploaded a `train.csv.zip` different from the one in the batch data file, and didn't notice the error at first (it was fixed later).

Without further ado, let's dive in.

## Loading Data

Let's load some basic libraries and both train sets. I uploaded the original (pure) version of `train.csv` as a dataset on Kaggle:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

In [None]:
DATA_PATH = '../input/tabular-playground-series-nov-2021'
PURE_DATA_PATH = '../input/november21'

In [None]:
test_dtype = {f'f{i}': 'float32' for i in range(100)}
train_dtype = {**test_dtype, 'target': 'int8'}

In [None]:
train_csv = pd.read_csv(f'{DATA_PATH}/train.csv', index_col='id', dtype=train_dtype)

In [None]:
pure_csv = pd.read_csv(f'{PURE_DATA_PATH}/train.csv', index_col='id', dtype=train_dtype)

## Comparing Datasets

At first glance both files look pretty similar:

In [None]:
train_csv.head()

In [None]:
pure_csv.head()

Let's confirm that they differ:

In [None]:
train_csv.equals(pure_csv)

Let's verify that the features are the same:

In [None]:
train_csv.drop('target', axis=1).equals(pure_csv.drop('target', axis=1))

and that the labels differ:

In [None]:
y_train = train_csv['target']
y_pure = pure_csv['target']

y_train.equals(y_pure)

And the difference is about `25%`:

In [None]:
(y_train != y_pure).sum() / y_train.shape[0]

So, all in all, it looks like Kagglers already knew the truth.
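To see where the disagreements fall, a cross-tabulation of the two label columns can help. The sketch below runs on small synthetic labels, not on the competition data: `y_clean`, `y_noisy` and the 25% flip rate are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(83)

# Hypothetical stand-ins for y_pure and y_train
y_clean = pd.Series(rng.integers(0, 2, size=10_000), name='clean')
flip_mask = rng.random(len(y_clean)) < 0.25  # assumed flip rate
y_noisy = y_clean.where(~flip_mask, 1 - y_clean).rename('noisy')

# Overall share of labels that differ
diff_rate = (y_clean != y_noisy).mean()

# Rows: clean label, columns: noisy label, normalised within each clean class
flip_table = pd.crosstab(y_clean, y_noisy, normalize='index')

print(f'differing labels: {diff_rate:.3f}')
print(flip_table)
```

On the real data the same idea would be `pd.crosstab(y_pure, y_train, rownames=['pure'], colnames=['train'], normalize='index')`, which shows whether both classes were flipped at a similar rate.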

## Baseline Models

Let's rescale the data so we can fit models on the pure and train labels. We don't need a separate `X_pure`, because the features are the same:

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(train_csv.drop('target', axis=1))

The first surprise for me was this:

In [None]:
pure_model = LogisticRegression(random_state=83).fit(X_train, y_pure)

y_pure_pred = pure_model.predict_proba(X_train)[:, 1]
roc_auc_score(y_pure, y_pure_pred)

Although we didn't use a train/test split (we will get to it), there are `600_000` points with `100` features, and a plain `LogisticRegression` with `101` parameters was able to separate this huge blob of data. Not a bad result at all!

OK, let's see what we get with the official version:

In [None]:
train_model = LogisticRegression(random_state=83).fit(X_train, y_train)

y_train_pred = train_model.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_train_pred)

That is pretty close to a typical score on the leaderboard.
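One possible way to read a score like this: if labels are flipped independently at random with rate `p` and the classes are balanced, then even a model that ranks the clean classes perfectly can only reach an AUC of about `1 - p` against the flipped labels. The simulation below is a sketch under exactly those assumptions; the data, the flip mechanism and the names `flip_rate`, `y_clean`, `y_noisy` are all made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(83)
n = 200_000
flip_rate = 0.25  # assumed, independent of features and class

y_clean = rng.integers(0, 2, size=n)

# "Perfect" ranker: every clean positive scores above every clean negative
scores = y_clean + rng.random(n)

# Flip a random subset of the labels
flip_mask = rng.random(n) < flip_rate
y_noisy = np.where(flip_mask, 1 - y_clean, y_clean)

auc_clean = roc_auc_score(y_clean, scores)
auc_noisy = roc_auc_score(y_noisy, scores)

print(f'AUC vs clean labels: {auc_clean:.4f}')
print(f'AUC vs noisy labels: {auc_noisy:.4f} (1 - flip_rate = {1 - flip_rate})')
```

The reasoning behind it: a randomly drawn (noisy positive, noisy negative) pair is ordered correctly with probability `(1 - p)^2`, is a coin toss with probability `2p(1 - p)`, and is ordered wrongly otherwise, which adds up to `1 - p`.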

Let's check how the pure-label model scores on the mixed labels:

In [None]:
roc_auc_score(y_train, pure_model.predict_proba(X_train)[:, 1])

Not a big difference. Most likely, due to the mixed labels, `train_model` is less confident in its predictions than `pure_model`. Let's verify.

In [None]:
plt.hist(y_pure_pred, bins=100);

In [None]:
plt.hist(y_train_pred, bins=100);

As we can see, the pure-data model is very confident, while the mixed-data model is not at all.
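A rough way to see why the mixed-label model cannot be confident: assuming each label is flipped independently with probability `p`, the probability of observing a `1` is a squeezed version of the clean probability `q`, namely `(1 - p) * q + p * (1 - q)`. The helper `noisy_posterior` below is a hypothetical name introduced only for this sketch.

```python
import numpy as np

def noisy_posterior(q, flip_rate):
    """P(observed label = 1 | x) when the clean P(y = 1 | x) is q
    and each label is flipped independently with probability flip_rate."""
    return (1 - flip_rate) * q + flip_rate * (1 - q)

# Clean probabilities ranging from "surely 0" to "surely 1"
q = np.array([0.0, 0.01, 0.5, 0.99, 1.0])
squeezed = noisy_posterior(q, flip_rate=0.25)  # assumed flip rate
print(squeezed)
```

With a flip rate of `0.25`, even perfectly certain clean predictions map into the range `[0.25, 0.75]`, so a well-calibrated model trained on such labels has no reason to output values near `0` or `1`.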

## Train-Test Split

To be thorough, let's do a simple train-test split to rule out any mistake. We will start with an `80/20` train/validation split. To keep things simple, a helper function does all the heavy lifting, so we can focus on the fun part.

In [None]:
def split_train_and_validate(name, X, y, test_size):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=test_size, random_state=83)
    
    model = LogisticRegression(random_state=83).fit(X_train, y_train)
    
    train_score = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    valid_score = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

    print(f'{name} Train size: {len(y_train)} - {((len(y_train) / len(y) * 100)):.2f}%')
    print(f'{name} Valid size: {len(y_valid)} - {((len(y_valid) / len(y) * 100)):.2f}%\n')
    
    print(f'{name} Train score: {train_score}')
    print(f'{name} Valid score: {valid_score}')
    
    return train_score, valid_score

OK, now let's run it with the pure labels:

In [None]:
split_train_and_validate('Pure', X_train, y_pure, test_size=0.2);

That's a bit baffling. We held out `20%` of the data, but the score is still perfect.

Let's hold out half of the data and verify again:

In [None]:
split_train_and_validate('Pure', X_train, y_pure, test_size=0.5);

Ok, still too good to be true. Maybe `20/80` will make it fail?

In [None]:
split_train_and_validate('Pure', X_train, y_pure, test_size=0.8);

Hmm, what about `1/99`?

In [None]:
split_train_and_validate('Pure', X_train, y_pure, test_size=0.99);

Finally, with `600` training points, we see a significant drop on the validation set:

In [None]:
split_train_and_validate('Pure', X_train, y_pure, test_size=0.999);

But even with `300` points the score is still very good:

In [None]:
split_train_and_validate('Pure', X_train, y_pure, test_size=0.9995);
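The repeated calls above could also be written as one sweep that collects the scores for later comparison. The sketch below is self-contained and uses a synthetic dataset from `make_classification` as a stand-in, so its numbers say nothing about the competition data; `X_demo`, `y_demo` and `results` are names made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=20, n_informative=10, random_state=83
)

results = []
for test_size in (0.2, 0.5, 0.8, 0.99):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=83
    )
    model = LogisticRegression(max_iter=1000, random_state=83).fit(X_tr, y_tr)
    valid_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    # Keep the training size next to the score it produced
    results.append((len(y_tr), valid_auc))

for n_train, valid_auc in results:
    print(f'{n_train:>6} training rows -> validation AUC {valid_auc:.4f}')
```

Collecting `(training size, validation AUC)` pairs makes it easy to plot a learning curve instead of reading the numbers cell by cell.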

### Mixed labels validation

To wrap up our experiment, let's briefly check the mixed labels.

In [None]:
split_train_and_validate('Mixed', X_train, y_train, test_size=0.2); print("\n")
split_train_and_validate('Mixed', X_train, y_train, test_size=0.99); print("\n")
split_train_and_validate('Mixed', X_train, y_train, test_size=0.999); print("\n")
split_train_and_validate('Mixed', X_train, y_train, test_size=0.9995);

Mixed-label models are more sensitive to the amount of data because of the random label flips, and fall short once the training set shrinks to around `600` points.

## ROC Curves

To wrap up the EDA part, let's check the ROC curves for both label sets.

First, the pure-label model output:

In [None]:
fpr, tpr, thresholds = roc_curve(y_pure, y_pure_pred)
plt.plot(fpr, tpr);

and mixed-label model:

In [None]:
fpr, tpr, thresholds = roc_curve(y_train, y_train_pred)
plt.plot(fpr, tpr);

Actually, the curve for the mixed-label model looks a bit odd; usually it does not have such a linear shape. Next time, I'll check the ROC curve shape before trying to improve my model score.
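A near-linear ROC curve is what one would expect from a model that ranks the clean classes almost perfectly but is scored against randomly flipped labels. Under the assumption of balanced classes and independent flips with rate `p`, the curve consists of two straight segments that meet at roughly `(p, 1 - p)`. The sketch below checks this on synthetic data; nothing in it comes from the competition files.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(83)
n = 200_000
flip_rate = 0.25  # assumed

y_clean = rng.integers(0, 2, size=n)
scores = y_clean + rng.random(n)  # perfect ranking of the clean classes
y_noisy = np.where(rng.random(n) < flip_rate, 1 - y_clean, y_clean)

fpr, tpr, _ = roc_curve(y_noisy, scores)

# Read the curve at the middle of each segment and at the expected kink
probe_fpr = np.array([flip_rate / 2, flip_rate, (1 + flip_rate) / 2])
probe_tpr = np.interp(probe_fpr, fpr, tpr)

for f, t in zip(probe_fpr, probe_tpr):
    print(f'FPR {f:.3f} -> TPR {t:.3f}')
```

If the probed points sit on the straight lines from `(0, 0)` to `(p, 1 - p)` and from there to `(1, 1)`, the shape is explained by label noise alone rather than by the model.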

## Conclusions or why Game Over?

Apparently, this dataset was generated in a way that is too clean to be anywhere close to a real dataset. As a consequence, attempts to improve the score with neural networks, trees, ensembles and other more complex approaches do not make any sense.

The distribution is too simple and more complex models would not give any significant boost over basic approaches. It looks like a race to overfit better by chance.

The pure training set is quite odd, because it leaves no room for improvement on cross-validation. How can anything be improved when you already have a `99.99%` score with 1000 points?

The mixed training set is the same; it just makes you believe you can do better.

## What if Remastered over Synthetic?

The Kaggle TPS series is a really cool way to tinker with tabular data. But using GANs and other data generation techniques leaves a somewhat unrealistic taste.

But Kaggle already has dozens of cool tabular datasets accumulated over its long history. What if we took the data from these old competitions and started a new challenge? Yes, you can go and play with it by yourself, but it feels a bit lonely without live comments and new kernels.

Also, a lot of old competitions were run when we didn't have NNs, XGBoost and LightGBM. What if we can do better than the old Grandmasters?

There are some small details to address, like prohibiting submissions to the old competitions while the remastered one is running, but this is doable.

## PS - Let's try pure data model?

As a final step, let's submit the dumbest possible model and cross our fingers. What if this is the lucky winner?

In [None]:
test_csv = pd.read_csv(f'{DATA_PATH}/test.csv', index_col='id')
sample_submission_csv = pd.read_csv(f'{DATA_PATH}/sample_submission.csv', index_col='id')

In [None]:
X_test = scaler.transform(test_csv)

y_test_pred = pure_model.predict_proba(X_test)[:, 1]

In [None]:
sample_submission_csv['target'] = y_test_pred
sample_submission_csv.head()

In [None]:
sample_submission_csv.to_csv('./submission.csv')