What the flip is happening?
==

The distribution of flips across chunks is different than it should be. The samples were
certainly not given a uniform 25% chance to flip.

In [None]:
%pip install -U -qq scikit-learn

import os

import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier
from lightgbm.sklearn import LGBMClassifier

sns.set(
    style='darkgrid', context='notebook', rc={
        'figure.frameon': False,
        'legend.frameon': False,
        'figure.figsize': (12, 8)
    }
)

old_y = pd.read_csv('../input/november21/train.csv', usecols=['target']).target.astype(np.float32)
df = pd.read_parquet(
    '../input/tps-nov-2021-parquet/data.pq'
).dropna().rename(
    columns={'target': 'after_flip'}
).assign(
    before_flip=old_y,
    chunk=lambda df: df.id // 60000,
    flipped=lambda df: df.after_flip != df.before_flip,
    flips=lambda df: df.flipped.cumsum()
)
folds = StratifiedKFold(5, shuffle=True, random_state=64)

A nice, straight line
==

At a distance, everything looks OK. Let's plot the cumulative sum of `before_flip != after_flip` (the `flips` column).

This should be a fairly straight line. The slope would be close to the flip chance.

In [None]:
df.flips.plot.line(title='cumsum(flipped)');

My brain is telling me that this line is exactly straight. The plot helps too: at id 400 000
it certainly looks like we've got very close to 100 000 flips, and we could
easily have 125 000 flips at id 500 000.

You wouldn't believe these chunks
--

In [this](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/286731) post,
it's shown that the data is chunked at 60 000 element borders. I've also noticed that the flip count
per chunk looks like it has a lot of variability. Let's draw this again:

In [None]:
df.groupby('chunk').flipped.sum().plot.bar(title='flip count by chunk');

The variance here is much higher than it should be for so many random draws, if they were "fair".
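
For a rough sense of scale, here is what "fair" draws would give. This is only a sketch: it assumes every sample flips independently with probability 0.25 and that each chunk holds exactly 60 000 samples. `p` and `chunk_size` are illustrative names, not values read from the data.

```python
import numpy as np

p = 0.25             # assumed per-sample flip probability
chunk_size = 60_000  # assumed number of samples per chunk

# Under independence, flips per chunk would follow Binomial(chunk_size, p)
expected_flips = chunk_size * p
std_flips = np.sqrt(chunk_size * p * (1 - p))

print(f'expected flips per chunk: {expected_flips:.0f}')
print(f'standard deviation:       {std_flips:.1f}')
```

Comparing `df.groupby('chunk').flipped.sum().std()` against `std_flips` would put a number on how much extra spread the bar chart shows.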

Zooming in on the nice straight line
--

And in fact, if we just look a bit closer, it looks even worse.

Let's plot that nice straight line from earlier again, for only the first few hundred IDs, together
with a couple of slopes of known inclination:

In [None]:
ixs = 300
df.loc[df.id < ixs, 'flips'].plot.line(label='cumsum(flipped)')
plt.plot(np.arange(ixs), .26 * np.arange(ixs), label='.26 slope')
plt.plot(np.arange(ixs), .24 * np.arange(ixs), label='.24 slope')
plt.legend();

No! Bad line! You should be between these two slopes! Go back!

Anyhow, it's random chance, right? This might happen eventually. But it seems odd to me.
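
How odd is it, though? A quick sketch, assuming independent flips with probability 0.25, compares the gap between the .24 and .26 guide lines with the spread that chance alone would produce. The helper `band_vs_sigma` is made up for this illustration.

```python
import numpy as np


def band_vs_sigma(n, p=0.25, slope_gap=0.01):
    """Return (sigma, band), both measured in flips after n samples.

    sigma: standard deviation of the flip count for independent flips.
    band:  distance between the p slope and the p +/- slope_gap guide lines.
    """
    sigma = np.sqrt(n * p * (1 - p))
    band = slope_gap * n
    return sigma, band


for n in (300, 1_000, 25_000):
    sigma, band = band_vs_sigma(n)
    print(f'n={n:>6}: one sigma = {sigma:6.1f} flips, guide lines at +/- {band:6.1f} flips')
```

Wherever `band` is smaller than `sigma`, leaving the two guide lines is nothing unusual for a fair process. The guide lines only become a meaningful check once `band` is a few times larger than `sigma`.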

The total number of flips is around 25.12% of the samples total, and we can verify that
matches the last `flips` value:

In [None]:
df.flipped.mean(), df.flips.iloc[-1] / len(df)

Seems pretty close to me!

But we were only looking at a few hundred samples, let's zoom out a bit more and look at
a thousand, surely it would converge to a .25 slope by then:

In [None]:
ixs = 1000
df.loc[df.id < ixs, 'flips'].plot.line(label='cumsum(flipped)')
plt.plot(np.arange(ixs), .26 * np.arange(ixs), label='.26 slope')
plt.plot(np.arange(ixs), .24 * np.arange(ixs), label='.24 slope')
plt.legend();

Right? Please? How about 25000?

In [None]:
ixs = 25000
df.loc[df.id < ixs, 'flips'].plot.line(label='cumsum(flipped)')
plt.plot(np.arange(ixs), .26 * np.arange(ixs), label='.26 slope')
plt.plot(np.arange(ixs), .24 * np.arange(ixs), label='.24 slope')
plt.legend();

Wait, it's going the _other way_ now? This thing isn't a straight line, it's a bunch of bananas
stuck together with tape and chewing gum.

What might make the overall behaviour easier to notice is to subtract the expected slope
(.2512, the overall flip rate) from this particular line and plot it again:

In [None]:
(df.flips - .2512 * df.id).plot.line();

A uniform 25.12% chance of a flip would cause an almost straight line that was oscillating
around 0 here, but this is clearly something else. Let's zoom in on the boundary between the second
and third chunks:

In [None]:
(df.flips - .2512 * df.id).iloc[100000:140000].plot.line();

And the boundary around id = 300 000 looks interesting too:

In [None]:
(df.flips - .2512 * df.id).iloc[280000:320000].plot.line();

I'm not getting any wiser, but there's some pattern here that we wouldn't expect to see
if the labels were being flipped with independent chance.
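
For comparison, it may help to see what the same curve looks like for flips that really are independent. The sketch below simulates them; the sample count of 600 000 and the flip probability of 0.2512 are assumptions, and `sim_flipped` / `sim_residual` are names invented here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 600_000, 0.2512  # assumed sample count and flip probability

# Simulate independent flips and build the same "actual - expected" curve
sim_flipped = rng.random(n) < p
sim_residual = sim_flipped.cumsum() - p * np.arange(1, n + 1)

# Under independence the residual at position i has std sqrt(i * p * (1 - p))
sigma_end = np.sqrt(n * p * (1 - p))
print(f'largest simulated deviation: {np.abs(sim_residual).max():.0f} flips')
print(f'one sigma at the last id:    {sigma_end:.0f} flips')
```

Plotting `sim_residual` next to `(df.flips - .2512 * df.id)` gives a reference for what chance alone tends to look like. Keep in mind that even a fair process wanders here: the residual is a random walk, so some drift away from 0 is expected.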

Identifying label-flipping streaks
==

There are some very long streaks here. In fact, let's check how long the streaks are.

In [None]:
last_flipped = df.flipped.shift()
# There was a change whenever the current `flipped` differs from the last
state_change = df.flipped != last_flipped
# Each sample that has the same `state_change.cumsum()` belongs to the same streak
sample_by_streak_number = state_change.cumsum()

So what we just did was assign a streak number to every sample. The streak number is
incremented every time a sample has a different flipped-state from the previous sample.
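
The numbering is easier to follow on a tiny example. This is a sketch on made-up data (`toy_flipped` is hypothetical), using the same three steps as the cell above:

```python
import pandas as pd

# Hypothetical toy data: two non-flips, three flips, one non-flip, one flip
toy_flipped = pd.Series([False, False, True, True, True, False, True])

# A streak starts wherever the value differs from the previous one
# (the first sample always starts a streak, since it has no predecessor)
toy_change = toy_flipped != toy_flipped.shift()
toy_streak_number = toy_change.cumsum()

# One streak number per sample, then the number of samples per streak
print(toy_streak_number.tolist())
print(toy_streak_number.value_counts().sort_index().tolist())
```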

That means we can find out how many samples are in each streak by just counting how many
samples share a streak number. Let's do that:

In [None]:
streak_lengths = sample_by_streak_number.value_counts()

sns.countplot(x=streak_lengths).set(
    title='Streak length volume',
    ylabel='Count of streaks', xlabel='Streak length'
);

Of course, we're looking at an awful lot of random draws here. But my gut feeling tells me we
would still not expect to see 42 samples in a row with no flipped labels unless there
was some non-random component to how the flipped labels were chosen. The long streaks aren't
incredibly uncommon.

Do the indexes of the state changes look familiar?

In [None]:
# Use the full option name; the short 'max_rows' is ambiguous in newer pandas versions
pd.set_option('display.max_rows', 20)
df.loc[state_change, 'id'].iloc[:20]

I've tried searching for some variations of these numbers in the integer sequence database at [OEIS](https://oeis.org/),
but obviously that wasn't it.

Some more stats about streak lengths:

In [None]:
len(streak_lengths[streak_lengths > 30]), len(streak_lengths[streak_lengths > 20])

There are 22 streaks longer than 30, 375 streaks longer than 20. If we were using a
fair die to decide whether to flip or not, I think the chance of a 42-long streak starting
at any given sample (let alone two of them) would be about `.75 ** 42`, which doesn't seem very likely.
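
To put those counts in context, it may help to look at the expected number of long streaks rather than the probability of one specific streak, since there are a lot of places where a streak could start. A minimal sketch, assuming independent flips with probability 0.25 and roughly 600 000 samples (both are assumptions, and `expected_long_streaks` is a helper written for this illustration):

```python
def expected_long_streaks(n, p_streak, min_length):
    """Expected number of maximal runs of at least `min_length` identical
    outcomes in `n` independent draws, where the outcome that forms the
    run has probability `p_streak`."""
    if min_length > n:
        return 0.0
    p_run = p_streak ** min_length
    # A run starts at position 0, or right after an outcome of the other kind
    return p_run + (n - min_length) * (1 - p_streak) * p_run


n_samples = 600_000  # assumed number of samples
p_no_flip = 0.75     # assumed chance that a label is left alone

for min_length in (21, 31, 42):
    expected = expected_long_streaks(n_samples, p_no_flip, min_length)
    print(f'no-flip streaks of {min_length}+: about {expected:.1f} expected')
```

The same helper with `p_streak=0.25` gives the corresponding baseline for streaks of flipped labels.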

How long are positive streaks typically (i.e. streaks of flipped labels)?

In [None]:
pos_streak_lengths = sample_by_streak_number[
    df.flipped
].value_counts()

sns.countplot(x=pos_streak_lengths).set(title='Length of repeated flips streaks');

We actually have up to 12 in a row. `.25 ** 12` also seems like it shouldn't happen very often.

Can we predict the label flips?
==

Nothing I've tried so far indicates that we can do that based on features or the old _or_ new
labels.

Encoding `id` with 16 bits
--

I've tried with some chunk-based stats, but there's one thing I haven't tried yet, which
is to directly make use of `id`, so let's do that. We'll simply binary-encode the last 16 bits of
it and see if that can do something:

In [None]:
n_jobs = min(5, os.cpu_count())
always_no_accuracy = (1 - df.flipped.mean())
lreg = LogisticRegression(class_weight='balanced')
tree = DecisionTreeClassifier(class_weight='balanced', max_depth=12)

bit_pattern = pd.DataFrame({
    f'id_{i}': (df.id & (2 ** i)) > 0 for i in range(16)
})

cross_val_score(
    lreg, bit_pattern, df.flipped, cv=folds, n_jobs=n_jobs
)

50% accuracy with a balanced class weight, I think that means we're doing random guessing.
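
As a reference point, this is roughly what pure coin-flip guessing looks like on labels with the same class balance. It's a sketch on synthetic data, not on the competition data: the sample count and the 25.12% flip rate are assumptions, and `y_true` / `y_guess` are made-up names.

```python
import numpy as np

rng = np.random.default_rng(0)
n, flip_rate = 600_000, 0.2512  # assumed sample count and share of flipped labels

y_true = rng.random(n) < flip_rate  # synthetic "flipped" labels
y_guess = rng.random(n) < 0.5       # coin-flip predictions, unrelated to y_true

accuracy = (y_true == y_guess).mean()
recall = y_guess[y_true].mean()     # share of real flips that were guessed
precision = y_true[y_guess].mean()  # share of guesses that were real flips

print(f'accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}')
```

A model that scores close to these numbers is not extracting anything from its features.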

Let's investigate a bit more:

In [None]:
y_pred = cross_val_predict(lreg, bit_pattern, df.flipped, cv=folds, n_jobs=n_jobs)

print(classification_report(df.flipped, y_pred))

Right, we simply have 50% recall and class frequency precision. I bet we'd get the same
result using a tree?

In [None]:
cross_val_score(
    tree, bit_pattern, df.flipped, cv=folds, n_jobs=n_jobs
)

Seems about the same. Can we have higher precision for any level of recall at all?

In [None]:
from sklearn.metrics import PrecisionRecallDisplay

y_pred_proba = cross_val_predict(
    lreg, bit_pattern, df.flipped, cv=folds, n_jobs=n_jobs, method='predict_proba'
)


PrecisionRecallDisplay.from_predictions(df.flipped, y_pred_proba[:, 1]);

That doesn't seem so promising. Does the tree get anything right?

In [None]:
y_pred_proba = cross_val_predict(
    tree, bit_pattern, df.flipped, cv=folds, n_jobs=n_jobs, method='predict_proba'
)

PrecisionRecallDisplay.from_predictions(df.flipped, y_pred_proba[:, 1]);

So the bit pattern alone probably can't help us. It's probably just chance that there exists some
threshold where we have really high precision.
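
For reference, here's what the encoding does on a few made-up ids (a sketch; `toy_id` and `toy_bits` are illustrative names). With 16 bits, ids that are 65 536 apart get exactly the same pattern, so the models only ever see `id` modulo 65 536.

```python
import pandas as pd

# Hypothetical ids, picked so that two of them differ by exactly 2 ** 16
toy_id = pd.Series([5, 6, 5 + 2 ** 16])

toy_bits = pd.DataFrame({
    f'id_{i}': (toy_id & (2 ** i)) > 0 for i in range(16)
})

# Rows 0 and 2 only differ in bit 16, which the 16-bit encoding drops
print(toy_bits.astype(int).to_string())
print('rows 0 and 2 identical:', (toy_bits.iloc[0] == toy_bits.iloc[2]).all())
```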

Encoding `id` with 18 bits
--

Let's use a wider bit pattern and try again:

In [None]:
bit_pattern = pd.DataFrame({
    f'id_{i}': (df.id & (2 ** i)) > 0 for i in range(18)
})

cross_val_score(lreg, bit_pattern, df.flipped, cv=folds, n_jobs=n_jobs)

In [None]:
cross_val_score(
    tree, bit_pattern, df.flipped, cv=folds, n_jobs=n_jobs
)

Use label and label balance as features
--

This is the same result. Let's add in two more features: the pre-flip label, and the cumulative
sum of the pre-flip label minus the expected number of positive labels (scaled to the 0-1 range).

In [None]:
cum_class_balance = old_y.cumsum() - np.arange(len(old_y)) * old_y.mean()
cum_class_balance = (cum_class_balance - cum_class_balance.min()) / (cum_class_balance.max() - cum_class_balance.min())

plt.plot(cum_class_balance)
plt.title('Cumulative class balance')
plt.xlabel('id');

In [None]:
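X = pd.concat([
    bit_pattern,
    # Give both series distinct names; they would otherwise both be called 'target'
    old_y.rename('before_flip'),
    cum_class_balance.rename('cum_class_balance'),
], axis=1)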

cross_val_score(
    lreg, X, df.flipped, cv=folds, n_jobs=n_jobs
)

That doesn't seem to help much. Same result with the tree?

In [None]:
cross_val_score(
    tree, X, df.flipped, cv=folds, n_jobs=n_jobs
)

That seems to have harmed the tree somehow. In what way?

In [None]:
y_pred_proba = cross_val_predict(
    tree, X, df.flipped, cv=folds, n_jobs=n_jobs, method='predict_proba'
)

PrecisionRecallDisplay.from_predictions(df.flipped, y_pred_proba[:, 1])

print(classification_report(df.flipped, y_pred_proba.argmax(axis=1)))

This result doesn't seem useful, yet.

Use features as features
--

How about just adding all of the features?

In [None]:
feats = [f'f{i}' for i in range(100)]
X = pd.concat([X, df[feats]], axis=1)

cross_val_score(
    lreg, X, df.flipped, cv=folds, n_jobs=n_jobs
)

In [None]:
cross_val_score(
    tree, X, df.flipped, cv=folds, n_jobs=n_jobs
)

Also doesn't help. Let's check a booster as well, just to get that out of our system:

In [None]:
booster = LGBMClassifier(
    learning_rate=.05, n_estimators=1000, n_jobs=n_jobs, is_unbalance=True
)

y_pred_proba = cross_val_predict(
    booster, X.to_numpy(), df.flipped, cv=folds, n_jobs=1, method='predict_proba'
)
print(classification_report(df.flipped, y_pred_proba.argmax(axis=1)))

PrecisionRecallDisplay.from_predictions(df.flipped, y_pred_proba[:, 1]);

Right, so this could find 39% of the flipped labels, but only 26% of the suggestions would be right.

That's also not very useful.

Looking at flip count by id
==

One more time, let's look at flip count by id, together with the label balance for both
new and old labels in case some insight jumps out:

In [None]:
fig, (ax0, ax1, ax2) = plt.subplots(3)

(df.flips - .2512 * df.id).plot.line(ax=ax0)
ax0.set_title('Actual flips - expected flips')
ax0.set_xlabel('id')

(old_y.cumsum() - old_y.mean() * df.id).plot.line(ax=ax1)
ax1.set_title('Old labels: positive - expected positive')
ax1.set_xlabel('id')

(df.after_flip.cumsum() - df.after_flip.mean() * df.id).plot.line(ax=ax2)
ax2.set_title('New labels: positive - expected positive')
ax2.set_xlabel('id')

plt.subplots_adjust(hspace=.75);

Nope, still not seeing anything.