TPS202112 - Do you need all that data?
==

In this notebook, I will show that about 80% of the training set is very uniform and could probably be represented by *far* fewer samples. I will use a simple three-step procedure to do this:

1. Train a model and record the out-of-fold probability predictions.
2. Flag the most confident Cover_Type=1 and Cover_Type=2 predictions as 'easy'.
3. Show that, given 10% of the easy samples, we can achieve over 99% accuracy on the other 90% of the easy samples.

I'll use LightGBM for this because it's convenient, it outputs probabilities, and it's relatively fast to train into something that isn't horrible.

In [None]:
import os
import random
import numpy as np
import pandas as pd
import lightgbm as lgbm
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_recall_curve
from tqdm.notebook import tqdm

sns.set(style='darkgrid', context='notebook', rc={
    'figure.figsize': (16, 12),
    'figure.frameon': False,
    'legend.frameon': False,
})

random.seed(64)
np.random.seed(64)
cv = StratifiedKFold(shuffle=True, random_state=64)

data_root = os.environ.get('KAGGLE_DIR', '../input')
df = pd.read_parquet(f'{data_root}/tpsdec2021parquet/train.pq')
df = df.loc[df.Cover_Type != 5]
label_encoder = LabelEncoder()
X = df.drop(columns=['Id', 'Cover_Type']).astype(np.float32).to_numpy()
y = label_encoder.fit_transform(df.Cover_Type)

X.shape, y.shape

First, we need some out-of-fold probabilities. It's not important that these are fantastic, so we won't tune anything for accuracy and will mostly stick with the defaults:

In [None]:
params = {
    'objective': 'multiclass',
    'metric': ['multi_error', 'multi_logloss'],
    'first_metric_only': True,
    'seed': 64,
    'num_class': df.Cover_Type.nunique(),
    'verbosity': -1,
}

oof_proba = np.zeros((y.shape[0], df.Cover_Type.nunique()))

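for train_idx, val_idx in tqdm(cv.split(X, y), total=cv.n_splits):
    booster = lgbm.train(
        params,
        train_set=lgbm.Dataset(X[train_idx], label=y[train_idx]),
        valid_sets=[lgbm.Dataset(X[val_idx], label=y[val_idx])],
        num_boost_round=100,
        # Early stopping and logging are passed as callbacks; the
        # early_stopping_rounds / verbose_eval keywords were removed in LightGBM 4.0.
        callbacks=[
            lgbm.early_stopping(5, first_metric_only=True),
            lgbm.log_evaluation(20),
        ],
    )
    # Store predictions for the rows this fold's model never saw.
    oof_proba[val_idx] = booster.predict(X[val_idx])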
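
To make the out-of-fold bookkeeping above concrete, here is a minimal NumPy-only sketch. The toy data, the hand-rolled fold assignment, and the `centroid_proba` helper are all assumptions made for illustration. The helper stands in for LightGBM only to show that each row is scored exactly once, by a model that never saw it.

```python
import numpy as np

rng = np.random.default_rng(64)

# Toy data (an assumption for illustration): two well-separated 2-D blobs.
X_toy = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y_toy = np.repeat([0, 1], 50)

def centroid_proba(X_fit, y_fit, X_new):
    """Hypothetical stand-in model: softmax over negative distances to class centroids."""
    centroids = np.stack([X_fit[y_fit == k].mean(axis=0) for k in (0, 1)])
    dist = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)

# Assign every row to one of 5 folds (shuffled, not stratified, to keep this short).
n_folds = 5
fold_of_row = rng.permutation(len(y_toy)) % n_folds

oof_toy = np.zeros((len(y_toy), 2))
times_scored = np.zeros(len(y_toy), dtype=int)

for fold in range(n_folds):
    val_mask = fold_of_row == fold
    # Fit on the other folds only, then score the held-out rows.
    oof_toy[val_mask] = centroid_proba(X_toy[~val_mask], y_toy[~val_mask], X_toy[val_mask])
    times_scored[val_mask] += 1

oof_accuracy = np.mean(oof_toy.argmax(axis=1) == y_toy)
print(f'every row scored once: {np.all(times_scored == 1)}, OOF accuracy: {oof_accuracy:.3f}')
```

Because no row is ever scored by a model that trained on it, the resulting probabilities are an honest estimate of how confident a model can be on unseen data, which is what makes them usable for flagging easy samples.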

Now we have some probabilities. We're mostly interested in the first two classes, since they're so overrepresented. Let's plot their probability distributions:

In [None]:
probas = pd.DataFrame(
    oof_proba[:, 0:2], columns=['Cover_Type=1', 'Cover_Type=2']
).melt(var_name='class_', value_name='probability')

sns.displot(
    x=probas.probability, col=probas.class_, bins=50, kde=False
);

Note how many samples have a very high probability, especially for `Cover_Type=2`. Let's compute precision-recall curves so we can find, for each of the two classes, the lowest threshold at which precision exceeds 99.5%:

In [None]:
prec_0, _, thres_0 = precision_recall_curve(y == 0, oof_proba[:, 0])
prec_1, _, thres_1 = precision_recall_curve(y == 1, oof_proba[:, 1])

t_0 = thres_0[prec_0[:-1] > .995][0]
t_1 = thres_1[prec_1[:-1] > .995][0]

print(f'Choose threshold={t_0:.4f} for Cover_Type=1, threshold={t_1:.4f} for Cover_Type=2')
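
Precision is not guaranteed to rise monotonically with the threshold, and `precision_recall_curve` counts scores greater than *or equal to* each threshold, whereas a later strict `>` comparison drops the boundary sample. It can therefore be worth checking what precision a chosen cut-off actually delivers. The sketch below does that with plain NumPy on made-up scores. The synthetic labels and scores are assumptions for illustration, and the ranking-based search is a stand-in for the scikit-learn call above, not a replacement for it.

```python
import numpy as np

rng = np.random.default_rng(64)

# Made-up labels and scores (assumptions): positives tend to score higher.
is_pos = rng.random(2000) < 0.4
toy_scores = np.where(is_pos, 0.7, 0.3) + rng.normal(0.0, 0.15, 2000)

# Rank samples from most to least confident and track the precision
# of the "top k" selection for every k.
order = np.argsort(-toy_scores)
top_k_precision = np.cumsum(is_pos[order]) / np.arange(1, len(order) + 1)

# Largest selection whose precision still exceeds the target,
# which corresponds to the lowest qualifying threshold.
target = 0.995
k = np.flatnonzero(top_k_precision > target).max()
threshold = toy_scores[order][k]

# Realized precision and coverage when selecting with a strict `>`.
selected = toy_scores > threshold
realized_precision = is_pos[selected].mean()
print(
    f'threshold={threshold:.4f}, '
    f'realized precision={realized_precision:.4f}, '
    f'share selected={selected.mean():.1%}'
)
```

With a strict comparison the realized precision can land marginally on either side of the target, so it is the number to look at before trusting the flag.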

In [None]:
is_easy = (oof_proba[:, 0] > t_0) | (oof_proba[:, 1] > t_1)
is_easy.mean()

This simple classifier considers about 80% of the data to be "easy". Let's check how easy it really is by training on 10% of the easy samples and validating on the other 90%:

In [None]:
easy_X, easy_y = X[is_easy], y[is_easy]

X_train, X_val, y_train, y_val = train_test_split(easy_X, easy_y, shuffle=True, random_state=64, test_size=.9)

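booster = lgbm.train(
    params,
    train_set=lgbm.Dataset(X_train, label=y_train),
    valid_sets=[lgbm.Dataset(X_val, label=y_val)],
    num_boost_round=100,
    # Callbacks replace the early_stopping_rounds / verbose_eval keywords removed in LightGBM 4.0.
    callbacks=[
        lgbm.early_stopping(5, first_metric_only=True),
        lgbm.log_evaluation(20),
    ],
)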

Less than 1% validation error, from training on only 10% of the easy data.

My conclusion is that we don't need so *many* of the easy samples. Using a similar method, I ended up throwing out 90% of the easy samples, which reduced the size of the data from ~4 million rows to ~1 million rows. As far as I can tell, this has not harmed the CV or LB score of my models, and I'm obviously experimenting much faster now.
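
The downsampling described above can be sketched roughly as follows, assuming a boolean `is_easy` mask like the one computed earlier is available. The mask here is simulated, and `downsample_easy` and `keep_fraction` are hypothetical names. This is an illustrative sketch rather than the exact code behind the numbers above.

```python
import numpy as np

def downsample_easy(is_easy, keep_fraction=0.1, seed=64):
    """Return sorted row indices: every hard row plus a random share of the easy rows."""
    rng = np.random.default_rng(seed)
    easy_idx = np.flatnonzero(is_easy)
    hard_idx = np.flatnonzero(~is_easy)
    n_keep = int(round(keep_fraction * len(easy_idx)))
    kept_easy = rng.choice(easy_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([hard_idx, kept_easy]))

# Simulated mask (an assumption): 4,000 rows of which exactly 80% are flagged
# easy, mirroring the proportions described above at 1/1000 of the scale.
toy_is_easy = np.arange(4_000) < 3_200
keep_idx = downsample_easy(toy_is_easy)

print(f'{len(toy_is_easy)} rows -> {len(keep_idx)} rows')  # 4000 rows -> 1120 rows
```

With the real arrays, the reduced training set would then be `X[keep_idx], y[keep_idx]`.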

Just to demonstrate how much harder the hard samples are, here's a similar experiment on them, this time training on 90% of the data:

In [None]:
hard_X, hard_y = X[~is_easy], y[~is_easy]

X_train, X_val, y_train, y_val = train_test_split(hard_X, hard_y, shuffle=True, random_state=64, test_size=.1)

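booster = lgbm.train(
    params,
    train_set=lgbm.Dataset(X_train, label=y_train),
    valid_sets=[lgbm.Dataset(X_val, label=y_val)],
    num_boost_round=100,
    # Callbacks replace the early_stopping_rounds / verbose_eval keywords removed in LightGBM 4.0.
    callbacks=[
        lgbm.early_stopping(5, first_metric_only=True),
        lgbm.log_evaluation(20),
    ],
)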

This booster does not have incredible performance on the easy data:

In [None]:
np.mean(booster.predict(easy_X).argmax(axis=1) == easy_y)

But then, that would be easy to fix by including some, but not all, of the easy data in our training set.

In case you'd like to play around with this, I'm writing the data out again together with the `is_easy` flag, so it's easy to experiment with different sampling strategies based on it:

In [None]:
df.assign(
    is_easy=is_easy
).to_parquet('train.pq', index=False)

To load it, add this notebook as a data source to your notebook and run this code:

```python
import pandas as pd

df = pd.read_parquet('../input/tps202112-do-you-need-all-that-data/train.pq')
```

You should now have the `is_easy` column, and the rows it flags are going to be mostly `Cover_Type=1` and `Cover_Type=2`:

In [None]:
sns.catplot(
    data=df.assign(is_easy=is_easy).astype({'Cover_Type': 'string'}),
    x='Cover_Type', kind='count', col='is_easy', sharey=False
);