Chunk study
==

In [this discussion](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/286731) thread,
it's shown that the data is chunked: there are 19 chunks, i.e. sequences of samples that have pretty sharp
boundaries to the other samples.

I've spent a few hours looking at it, and I can't find a way to exploit this information. But I
may have discovered something that somebody else could find a way to exploit, so why not make it public?

For this notebook, I'll be using a parquet dataset that I've made of the data, just to make it
faster to iterate. It contains both the train and test data in the same file, and both have already
been scaled with `sklearn.preprocessing.StandardScaler`. Let's load the data and start looking at
it.

In [None]:
import numpy as np
import plotly.express as px
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
from sklearn import cluster, metrics, linear_model, model_selection, preprocessing
import xgboost

sns.set(style='darkgrid', context='notebook', rc={
    'figure.frameon': False, 'figure.figsize': (16, 12), 'legend.frameon': False
})

pd.options.plotting.backend = 'plotly'

df = pd.read_parquet('../input/tps-nov-2021-parquet/data.pq').assign(
    chunk=lambda df: df.id // 60000
).drop(columns=['id']).sample(frac=1) # shuffle

train = df.loc[df.target.notna()]
test = df.loc[df.target.isna()]

df.info()

Significantly different target distribution
==

This is mentioned in the discussion post already, but let's take a quick look:

In [None]:
train.groupby('chunk').target.mean().plot.bar(title='mean(y) by chunk')

I think that if we gain anything from this finding at all, it's going to come from this
fact. I still believe, based on [this post](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/285503), that
around 25% of the data must have had its label flipped. If you start out with `x = mean(target)`, then flip 25% of the
labels, 75% of the positives stay positive and 25% of the negatives turn positive, so you'd expect to end up with
around `y = .75 * x + .25 * (1 - x) = .25 + .5 * x`.
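
The `y = .25 + .5 * x` relation is easy to sanity-check with a quick simulation. This is only a sketch on synthetic labels: the starting rate `x = 0.35`, the sample size and the `flip_labels` helper are made up for illustration and have nothing to do with the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)


def flip_labels(labels, flip_rate, rng):
    """Flip each binary label independently with probability flip_rate."""
    flip = rng.random(labels.shape[0]) < flip_rate
    return np.where(flip, 1 - labels, labels)


x = 0.35  # hypothetical mean(target) before any flips
clean = (rng.random(200_000) < x).astype(int)
noisy = flip_labels(clean, 0.25, rng)

observed = noisy.mean()               # mean(target) after the flips
expected = 0.25 + 0.5 * clean.mean()  # y = .25 + .5 * x
recovered = 2 * observed - 0.5        # inverse relation: x = 2 * y - .5

print(f'clean={clean.mean():.4f} observed={observed:.4f} '
      f'expected={expected:.4f} recovered={recovered:.4f}')
```

With this many samples, `observed` should sit very close to `expected`, and `recovered` close to the clean mean. The inverse `2 * y - .5` is the same expression used in the plots of this section.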

If the labels were flipped, but the features weren't, we should see some pretty massive differences between the feature
distribution and the label distribution in these chunks. If we're right about that, inverting the relation gives
`x = 2 * y - .5`, so the original target distribution would've been more like this:

In [None]:
y = train.groupby('chunk').target.mean()
(2 * y - .5).plot.bar(title='mean(target) by chunk before label-flips')

Let's compare:

In [None]:
pd.DataFrame({
    'mean(target)_after': y,
    'mean(target)_before': (2 * y - .5),
}).plot.bar(barmode='group')

I *believe* that there could be a way to use this to identify flipped labels in the training
data set -- and also possibly to say something about the target distribution in each test chunk.

I've tried a number of approaches for this, but haven't really had any success.

Significant differences in features
==

The features differ significantly between chunks. This was already known from the post that started the discussion thread linked above, but here's one more way to easily see it.

Let's grab two chunks, and make a classifier that can predict which chunk a sample is from:

In [None]:
feats = df.columns[df.columns.str.startswith('f')]

def train_booster(left, right):
    X, y = (
        df.loc[df.chunk.isin({left, right}), feats],
        df.loc[df.chunk.isin({left, right}), 'chunk'] == left
    )

    X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y, test_size=.25, shuffle=True)

    booster = xgboost.train(
        params=dict(eta=.1, max_depth=4, objective='binary:logistic', eval_metric=['auc']),
        dtrain=xgboost.DMatrix(X_train, label=y_train),
        evals=[(xgboost.DMatrix(X_val, label=y_val), 'val')],
        num_boost_round=200,
        verbose_eval=100,
    )

    y_proba = booster.predict(xgboost.DMatrix(X_val))
    auc = metrics.roc_auc_score(y_val, y_proba)
    acc = metrics.accuracy_score(y_val, y_proba > .5)
    print(f'Separate chunk {left} from chunk {right} with auc = {auc:.5f}, acc = {acc:.5f}')
    return booster, auc, acc

booster, auc, acc = train_booster(0, 1)

The booster is pretty accurate here, although not perfect. This seems to work quite well
on all chunk boundaries, and also between chunks that are far apart. Let's repeat:

In [None]:
booster_2, auc, acc = train_booster(3, 9)


Also pretty solid. I've actually evaluated all chunk combinations and found that it's really
easy to get around 95% accuracy for this kind of test, and we can get up to 98% or 99% in many
cases, by tuning the classifier more.
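
Evaluating every chunk combination is just a loop over pairs. Here's a minimal sketch of that loop on synthetic data. Everything in it is an assumption made for illustration: the toy chunks differ only in feature spread, `pair_auc` is a hypothetical helper, and sklearn's `GradientBoostingClassifier` stands in for the XGBoost booster so the block runs on its own.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_chunks, n_per_chunk, n_feats = 4, 500, 5

# Synthetic stand-in for the real data: each toy chunk has its own feature scale.
toy = pd.concat([
    pd.DataFrame(
        rng.normal(0, 1 + 0.5 * c, (n_per_chunk, n_feats)),
        columns=[f'f{i}' for i in range(n_feats)],
    ).assign(chunk=c)
    for c in range(n_chunks)
], ignore_index=True)


def pair_auc(data, left, right):
    """Validation AUC of a classifier that separates chunk `left` from chunk `right`."""
    pair = data[data.chunk.isin([left, right])]
    X = pair.drop(columns='chunk')
    y = (pair.chunk == left).astype(int)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=.25, stratify=y, random_state=0)
    model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])


# Symmetric chunk-by-chunk matrix of AUCs; the diagonal stays NaN.
aucs = pd.DataFrame(np.nan, index=range(n_chunks), columns=range(n_chunks))
for left, right in combinations(range(n_chunks), 2):
    aucs.loc[left, right] = aucs.loc[right, left] = pair_auc(toy, left, right)

print(aucs.round(3))
```

On the real data, the same loop would call `train_booster(left, right)` for each pair from `combinations(range(19), 2)`.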

These two boosters don't have the same idea of what the important features are:

In [None]:
def fscore_frame(booster, name):
    # get_fscore() returns {feature: number of times the feature is used in a split}
    fscore = booster.get_fscore()
    return pd.DataFrame({
        'feature': list(fscore.keys()),
        'value': list(fscore.values()),
        'booster': name,
    })

fscore = pd.concat([fscore_frame(booster, 'first'), fscore_frame(booster_2, 'second')])
fscore.plot.bar(x='feature', y='value', facet_row='booster')

Linear models can't easily separate chunks
==

We just saw XGBoost easily cruise through separating chunks with a binary label.

How about LogisticRegression?

In [None]:
left, right = 0, 1

clf = linear_model.LogisticRegression()
X, y = (
    df.loc[df.chunk.isin({left, right}), feats],
    df.loc[df.chunk.isin({left, right}), 'chunk'] == left
)

model_selection.cross_val_score(
    clf, X, y, cv=model_selection.StratifiedKFold(5, shuffle=True)
)

It's not terrible, but clearly this is much easier for XGBoost. That's the opposite of what
I've seen on the "real problem", where I've had a lot of success with linear models, and really
struggled to make the boosters work.
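
One situation where a tree ensemble separates two groups easily while a linear model can't is when the groups share feature means but differ in spread. I haven't shown that this is what happens in the competition data, so treat the sketch below purely as a toy illustration of the effect. The data is synthetic, and `GradientBoostingClassifier` is used in place of XGBoost to keep the block self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 1000

# Two toy "chunks" with identical feature means but different feature scales.
X_toy = np.vstack([
    rng.normal(0, 1.0, (n, 5)),  # stand-in for the left chunk
    rng.normal(0, 2.0, (n, 5)),  # stand-in for the right chunk
])
is_left = np.repeat([1, 0], n)

cv = StratifiedKFold(5, shuffle=True, random_state=0)

# A linear decision boundary can't exploit a pure difference in variance.
linear_acc = cross_val_score(LogisticRegression(), X_toy, is_left, cv=cv).mean()

# Trees can split on "far from zero" versus "close to zero".
tree_acc = cross_val_score(
    GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0),
    X_toy, is_left, cv=cv,
).mean()

print(f'linear accuracy = {linear_acc:.3f}, tree accuracy = {tree_acc:.3f}')
```

In this toy setup the linear model should hover around chance while the trees do much better, which is the same direction of gap as between XGBoost and `LogisticRegression` on the chunk-separation task.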

Fit on one chunk and predict another
==

This one was pretty interesting to me. `LogisticRegression` does not care much which chunk it's
fitted on:

In [None]:
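left, right = 0, 1

# Small helpers to pull the features and/or labels of a single training chunk
def X(chunk):
    return train.loc[train.chunk == chunk, feats]

def y(chunk):
    return train.loc[train.chunk == chunk, 'target']

def Xy(chunk):
    return X(chunk), y(chunk)

clf = linear_model.LogisticRegression()
# Fit on one chunk and score on the other, in both directions
clf.fit(*Xy(left)).score(*Xy(right)), clf.fit(*Xy(right)).score(*Xy(left))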

This seems to reach around 71% accuracy regardless. It does slightly better if it gets
to see some data from both chunks:


In [None]:
model_selection.cross_val_score(
    clf, pd.concat([X(0), X(1)]), pd.concat([y(0), y(1)]),
    cv=model_selection.StratifiedKFold(2, shuffle=True)
)

But it still does really well when predicting an unseen chunk.

Here's one more example:

In [None]:
clf.fit(
    pd.concat([X(0), X(1)]),
    pd.concat([y(0), y(1)]),
)
clf.score(X(2), y(2))

In [None]:
model_selection.cross_val_score(
    clf, pd.concat([X(0), X(1), X(2)]), pd.concat([y(0), y(1), y(2)]),
    cv=model_selection.StratifiedKFold(3, shuffle=True)
)

And yet we can use the features to separate the chunks from one another. Does that mean that
we're using... noise to do that?

Is this why I'm not getting that much out of XGBoost in this competition? Maybe it's fitting
the noise that separates the chunks, to learn `mean(y)` for the different chunks? It seems to
be much better at fitting that noise than the other models I've tried here.
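
The fit-on-some-chunks, score-on-another experiments in this section can be run for every held-out chunk at once with sklearn's `LeaveOneGroupOut` splitter. Here's a sketch on synthetic data. The toy chunks, the per-chunk shift and the label rule are all assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_chunks, n_per_chunk = 4, 1000
chunk_ids = np.repeat(np.arange(n_chunks), n_per_chunk)

# Toy features: the first feature gets a different random offset in each chunk.
X_toy = rng.normal(size=(n_chunks * n_per_chunk, 3))
X_toy[:, 0] += rng.normal(0, 0.5, size=n_chunks)[chunk_ids]

# The same noisy linear rule produces the labels in every chunk.
signal = X_toy @ np.array([1.0, -1.0, 0.5])
y_toy = (signal + rng.normal(0, 0.5, size=signal.shape[0]) > 0).astype(int)

# One score per chunk: fit on all the other chunks, evaluate on the held-out one.
scores = cross_val_score(
    LogisticRegression(), X_toy, y_toy,
    groups=chunk_ids, cv=LeaveOneGroupOut(),
)
print(scores.round(3))
```

On the notebook's data the equivalent call would be `model_selection.cross_val_score(clf, train[feats], train.target, groups=train.chunk, cv=model_selection.LeaveOneGroupOut())`, which returns one accuracy per training chunk.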

Analyzing label flips in chunks
==

Recall the distribution of labels in chunk 0 from earlier:

In [None]:
train.loc[train.chunk == 0, 'target'].value_counts(normalize=True)

I believe that if 25% of the labels were flipped, the features should suggest around
35% positive samples (`2 * .425 - .5 = .35`), not 42.5%.

In [None]:
clf = linear_model.LogisticRegression()
clf.fit(X(0), y(0))
pred = clf.predict(X(0))
np.mean(pred)

Let's check the next chunk:

In [None]:
train.loc[train.chunk == 1, 'target'].value_counts(normalize=True)

Side note: this is basically the inverse of chunk 0. I wonder if that's a coincidence?

Anyway, I think these features should suggest around 66% positive samples:

In [None]:
clf = linear_model.LogisticRegression()
clf.fit(X(1), y(1))
pred = clf.predict(X(1))
np.mean(pred)

To me, this reinforces the notion that 25% of the labels were flipped, because I
think that this `LogisticRegression` actually manages to fit almost all the data in the chunk.

Actually, let's do this exercise for each chunk and see how that lines up with our expectation
that the features distribute like `2 * mean(target) - .5`:

In [None]:
y_mean = [
    clf.fit(X(i), y(i)).predict(X(i)).mean()
    for i in tqdm(range(train.chunk.nunique()))
]

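# Use a new name here so the y(chunk) helper isn't overwritten and the cell can be re-run
target_mean = train.groupby('chunk').target.mean()

pd.DataFrame({
    'mean(target)_after': target_mean,
    'mean(target)_before': (2 * target_mean - .5),
    'mean(logistic_regression)': y_mean
}).plot.bar(barmode='group')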

Is this some huge coincidence, or does this line up with what we'd expect to see if 25%
of the labels had just been flipped?

I *have* tried to flip the labels back, using a classifier with 74.83% out-of-fold accuracy,
and the label distribution by chunk there looks almost like this.
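
To show what flipping labels back with out-of-fold predictions can look like, here's a sketch on synthetic data. It is not the classifier I used. The toy features, the hypothetical `weights`, the threshold and the 25% flip rate are all assumptions, chosen so that the clean positive rate is roughly 35%.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 20_000

# Toy "chunk": the clean labels follow a deterministic linear rule on the features.
X_toy = rng.normal(size=(n, 5))
weights = np.array([1.0, -1.0, 0.5, 0.0, 0.5])  # hypothetical
clean = (X_toy @ weights > 0.6).astype(int)

# Flip 25% of the labels at random; the features stay untouched.
flipped = rng.random(n) < 0.25
noisy = np.where(flipped, 1 - clean, clean)

# Out-of-fold predictions from a model that only ever sees the noisy labels.
oof_pred = cross_val_predict(LogisticRegression(), X_toy, noisy, cv=5)

oof_accuracy = (oof_pred == noisy).mean()  # agreement with the noisy labels
suspects = oof_pred != noisy               # candidates for "this label was flipped"

print(f'mean(noisy)    = {noisy.mean():.3f}')
print(f'mean(clean)    = {clean.mean():.3f}')
print(f'mean(oof_pred) = {oof_pred.mean():.3f}')
print(f'oof accuracy   = {oof_accuracy:.3f}')
print(f'suspect share  = {suspects.mean():.3f}')
```

In this toy setup the out-of-fold accuracy against the noisy labels should land somewhere near 75%, because a model that recovers the clean rule can't agree with the flipped quarter. The mean of the predictions should end up much closer to the clean rate than to `mean(noisy)`. Note that this only works here because the noisy training labels are available, which is exactly what's missing for the test set.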

Unfortunately, I can still see no way to leverage any of this information to find out
which labels in the test set are affected.
