TPS202112 - Optuna torch
==

In this notebook, I rely on the downsampled training data from [TPS202112 - Downsample easy cases](https://www.kaggle.com/kaaveland/tps202112-downsample-easy-cases) to be able to run hyper-parameter search over a simple NN architecture with reasonable speed.

The input data set here has been cut down from around 4 million samples to around 1 million, which naturally makes it faster to train.

First, let's upgrade torch so we can use the RAdam optimizer, and get some imports out of the way:

In [None]:
import os
import getpass
import random
import tempfile

if getpass.getuser() == 'root': # kaggle
    %pip install -qU scikit-learn torch
    n_jobs = os.cpu_count()
    !cp -v ../input/optunatorchsearches/optuna.db .
    storage = 'sqlite:///optuna.db'
else:
    n_jobs = os.cpu_count() // 2 # hyper threading
    storage = 'postgresql://localhost/optuna?host=/var/run/postgresql'

import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler
import optuna

import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.optim import lr_scheduler
from torch.backends import cudnn
from tqdm.notebook import tqdm, trange

Next up, we'll need to set the random seeds for reproducibility.

I've already decided on a batch size to use for all my trials, so I'll also set `torch.backends.cudnn.benchmark = True` for a speed boost. Note that if you're varying batch sizes, this might make your code slower.

We'll also set up an optuna study, and our CV splits. I'm using the exact same CV setup for all my models, so I can more easily compare results.

In [None]:
random.seed(64)
np.random.seed(64)
torch.manual_seed(64)
cudnn.benchmark = True

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=64)
study = optuna.create_study(
    storage=storage, study_name='tps202112-mlp', load_if_exists=True, direction='maximize',
)

I've made a small utility here, that I use to keep track of model weights. At the end of training, I call the `restore()` method on this object to restore the model weights to the ones that had the best validation accuracy. There are probably tons of ways of doing this, but this is simple and explicit:

In [None]:
class TrackBestWeights:

    def __init__(self, model):
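        # mktemp() is deprecated and race-prone; mkstemp() creates the file for us
        fd, self.weights = tempfile.mkstemp(suffix='.pt')
        os.close(fd)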
        self.best_accuracy = None
        self.model = model

    def step(self, validation_accuracy):
        if self.best_accuracy is None or validation_accuracy > self.best_accuracy:
            self.best_accuracy = validation_accuracy
            torch.save(self.model.state_dict(), self.weights)

    def restore(self):
        self.model.load_state_dict(torch.load(self.weights))
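
One of the other ways of doing this is to keep the best weights in memory instead of on disk. The sketch below is not what this notebook uses; `TrackBestWeightsInMemory` is a hypothetical name, and it assumes the model is small enough that an extra copy of its `state_dict()` is cheap:

```python
import copy

import torch
from torch import nn


class TrackBestWeightsInMemory:
    """Hypothetical in-memory variant: same interface as TrackBestWeights."""

    def __init__(self, model):
        self.model = model
        self.best_accuracy = None
        self.best_state = None

    def step(self, validation_accuracy):
        if self.best_accuracy is None or validation_accuracy > self.best_accuracy:
            self.best_accuracy = validation_accuracy
            # deepcopy, so later optimizer steps cannot change the snapshot
            self.best_state = copy.deepcopy(self.model.state_dict())

    def restore(self):
        self.model.load_state_dict(self.best_state)


model = nn.Linear(2, 1)
tracker = TrackBestWeightsInMemory(model)
tracker.step(0.80)
snapshot = model.weight.detach().clone()

with torch.no_grad():
    model.weight.add_(1.0)  # pretend training continued and got worse
tracker.step(0.75)          # lower accuracy, so the snapshot is kept
tracker.restore()
restored_ok = torch.equal(model.weight, snapshot)
```

The file-based version has the advantage that the snapshot does not take up GPU memory, which may matter for bigger networks.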

Next, we're reading in data. We're using the downsampled training data set, which contains 10% of the easy cases, and all the hard cases. See [TPS202112 - Downsample easy cases](https://www.kaggle.com/kaaveland/tps202112-downsample-easy-cases) for more detail about how that works. 

We're fitting a `RobustScaler` to _both_ train/test data here. Normally you wouldn't do this, but for kaggle TPS, I think it's fine. I tried a few other scalers, but couldn't get as good results with those.

We're also converting the whole dataset to torch tensors right away, and we're using `LabelEncoder` to ensure that our labels fall in `[0, n_classes)`, which is what torch expects:

In [None]:
data_root = os.environ.get('KAGGLE_DIR', '../input')
sampled_cases = pd.read_parquet(f'{data_root}/tps202112-downsample-easy-cases/train.pq', columns=['Id'])
df = pd.read_parquet(f'{data_root}/tpsdec2021parquet/train_fe.pq').loc[sampled_cases.Id.to_numpy()].assign(
    Id=sampled_cases.Id.to_numpy()
)

label_encoder = LabelEncoder()
X, y = df.drop(columns=['Id', 'Cover_Type']), label_encoder.fit_transform(df.Cover_Type)
scaler = RobustScaler()
df_test = pd.read_parquet(f'{data_root}/tpsdec2021parquet/test_fe.pq')
X_test = df_test
scaler.fit(pd.concat([X, X_test]))

X_test = torch.from_numpy(scaler.transform(X_test).astype(np.float32))
X = torch.from_numpy(scaler.transform(X).astype(np.float32))
y = torch.from_numpy(y)
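
To make the two preprocessing steps less of a black box, here is a rough numpy sketch of what they do under default settings: `RobustScaler` centers each column on its median and divides by the interquartile range, and `LabelEncoder` maps the sorted unique labels to consecutive integers. The helper names and the toy data are made up for illustration, and the sketch skips edge cases that scikit-learn handles (such as constant columns):

```python
import numpy as np


def robust_scale(X):
    """Center each column on its median and divide by its interquartile range."""
    median = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    return (X - median) / (q75 - q25)


def encode_labels(labels):
    """Map arbitrary class labels to consecutive integers starting at 0."""
    classes, codes = np.unique(labels, return_inverse=True)
    return classes, codes


# Toy feature matrix with one extreme outlier in the first column.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [1000.0, 50.0]])
X_scaled = robust_scale(X_toy)

# Hypothetical labels with a gap, to show that the codes are still consecutive.
classes, codes = encode_labels(np.array([7, 1, 2, 2, 6, 3, 4]))
```

Because the median and quartiles barely move when a single value is extreme, the outlier in the first column does not squash the other rows together the way mean/std scaling would.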

This notebook is going to run very slowly without a GPU. Let's check which one we've got here, if any:

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

if torch.cuda.is_available():
    !nvidia-smi
else:
    print('Running on cpu')

Here are some known good configurations I've already found on my laptop, so I'll enqueue these for optuna trials, to guide the early search:

In [None]:
known_good = [
 {'optim_cls': 'radam',
  'weight_decay': 3.04703905856014e-05,
  'activate': 'selu',
  'patience': 10,
  'lr': 0.033486004840786954},
 {'optim_cls': 'radam',
  'weight_decay': 1.4526958400830988e-05,
  'activate': 'lrelu',
  'patience': 15,
  'lr': 0.029178728238702562},
 {'optim_cls': 'radam',
  'weight_decay': 3.237655170070728e-05,
  'activate': 'relu',
  'patience': 5,
  'lr': 0.030077591918879664},
 {'optim_cls': 'adam',
  'weight_decay': 1.8181989055907467e-05,
  'activate': 'silu',
  'patience': 15,
  'lr': 0.04638419754304867},
 {'optim_cls': 'adam',
  'weight_decay': 7.148732191726605e-06,
  'activate': 'lrelu',
  'patience': 15,
  'lr': 0.02068886968901973},
 {'optim_cls': 'radam',
  'weight_decay': 8.594159652310458e-05,
  'activate': 'silu',
  'patience': 15,
  'lr': 0.037170759932989746},
 {'optim_cls': 'adamw',
  'weight_decay': 8.902236013675666e-06,
  'activate': 'lrelu',
  'patience': 15,
  'lr': 0.019625012905909813},
 {'optim_cls': 'adam',
  'weight_decay': 0.00020925505006661127,
  'activate': 'relu',
  'patience': 5,
  'lr': 0.002733050595602287},
 {'optim_cls': 'sgd',
  'momentum': 0.9671834793043608,
  'nesterov': False,
  'weight_decay': 1.1579798382953368e-06,
  'activate': 'silu',
  'patience': 15,
  'lr': 0.033480530452169636},
 {'optim_cls': 'adamw',
  'weight_decay': 0.0004631111057079639,
  'activate': 'silu',
  'patience': 15,
  'lr': 0.003572919342126105},
 {'optim_cls': 'adamw',
  'weight_decay': 1.6264423522028222e-06,
  'activate': 'relu',
  'patience': 10,
  'lr': 0.03731454868352696},
 {'optim_cls': 'sgd',
  'momentum': 0.7985899323111467,
  'nesterov': True,
  'weight_decay': 2.1503954623704132e-05,
  'activate': 'relu',
  'patience': 5,
  'lr': 0.033291598506096955}]

if len(study.trials) < 10:
    for params in known_good: 
        study.enqueue_trial(params)

Here's my model creation code. Note that `make_model` accepts parameters similar to the ones generated by optuna. `set_trial_params` returns a function that will generate a model with selected params on demand -- we want to use that, because we're doing multiple folds, and therefore need to create multiple models.

Our setup is not very advanced. We're letting optuna find the best optimizer, learning rate, activation function and patience for `ReduceLROnPlateau`. In my local testing, `optim.RAdam` consistently gave the best results, so it might make more sense to remove the other ones.

If you're tuning only a single parameter of a neural network, that parameter should probably be your learning rate. The best value for the learning rate depends on your batch size, so finding a good value is going to take less time if you decide on a batch size, and stick with that. With the kind of tabular data we have here, we can use huge batches, much bigger than what I'm doing here. But I had good results with 4096 right from the start, so I have a good idea of what the learning rate should be for this value.
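
Learning rates are usually searched on a log scale, which is why the search space for `lr` is log-uniform. As a rough illustration of what that means, here is a standard-library sketch (the helper name is made up, and this is not optuna's actual sampler) that draws values whose logarithm is uniform between the two bounds:

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(64)
samples = [sample_log_uniform(rng, 5e-4, 6e-2) for _ in range(10_000)]

# About half of the draws fall below the geometric mean of the bounds,
# whereas a plain uniform draw would put only around 8% of them there.
geometric_mean = math.sqrt(5e-4 * 6e-2)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
```

In other words, small learning rates get as much attention as large ones, instead of almost all trials landing near the upper bound.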

In [None]:
def init(layer):
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
        nn.init.zeros_(layer.bias)

def make_model(
        device='cpu', lr=8e-3, optim_cls=optim.Adam, activate=nn.ReLU, patience=5, **optim_params
):
    model = nn.Sequential(
        nn.BatchNorm1d(X.shape[1]),
        nn.Linear(X.shape[1], 128),
        activate(),
        nn.BatchNorm1d(128),
        nn.Linear(128, 64),
        activate(),
        nn.BatchNorm1d(64),
        nn.Linear(64, 32),
        activate(),
        nn.BatchNorm1d(32),
        nn.Linear(32, 16),
        activate(),
        nn.BatchNorm1d(16),
        nn.Linear(16, len(label_encoder.classes_)),
    ).to(device)
    model.apply(init)
    optimizer = optim_cls(model.parameters(), lr=lr, **optim_params)
    tracker = TrackBestWeights(model)
    scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.2, patience=patience)
    return dict(
        net=model,
        tracker=tracker,
        scheduler=scheduler,
        optimizer=optimizer
    )

def set_trial_params(device, trial: optuna.Trial):
    optim_cls = {
        'adam': optim.Adam,
        'adamw': optim.AdamW,
        'radam': optim.RAdam,
        'sgd': optim.SGD,
    }[trial.suggest_categorical('optim_cls', ['adam', 'adamw', 'radam', 'sgd'])]
    if optim_cls == optim.SGD:
        optim_kwargs = dict(
            momentum=trial.suggest_uniform('momentum', .7, .99),
            nesterov=trial.suggest_categorical('nesterov', [False, True]),
        )
    else:
        optim_kwargs = dict()
    optim_kwargs['weight_decay'] = trial.suggest_loguniform('weight_decay', 1e-6, 1e-2)
    activate = {
        'relu': nn.ReLU,
        'lrelu': nn.LeakyReLU,
        'silu': nn.SiLU,
        'selu': nn.SELU,
    }[trial.suggest_categorical('activate', ['relu', 'lrelu', 'silu', 'selu'])]
    patience = trial.suggest_categorical('patience', [5, 10, 15])
    def return_model():
        return make_model(
            device,
            lr=trial.suggest_loguniform('lr', 5e-4, 6e-2),
            optim_cls=optim_cls,
            activate=activate, patience=patience,
            **optim_kwargs
        )
    return return_model
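
Since `patience` is one of the things we're tuning, it helps to see what it does. Below is a simplified pure-Python simulation of the plateau rule with `factor=0.2`: the learning rate is cut once the monitored loss has gone more than `patience` epochs without a new best. This is only a sketch; it ignores the improvement threshold, cooldown and minimum learning rate that the real `ReduceLROnPlateau` also supports, and the loss values are invented:

```python
def simulate_reduce_on_plateau(losses, lr, factor=0.2, patience=5):
    """Simplified plateau rule: cut lr after more than `patience` epochs
    without a new best loss. Ignores thresholds, cooldown and min_lr."""
    best = float('inf')
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation losses: three improving epochs, then a long plateau.
losses = [1.0, 0.8, 0.7] + [0.7] * 12
lr_history = simulate_reduce_on_plateau(losses, lr=0.03, patience=5)
```

With `patience=5`, the first cut happens on the sixth stagnant epoch, and the second one six epochs after that. A larger patience gives the optimizer more time at each learning rate, at the cost of reaching the small learning rates later within a fixed number of epochs.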

Here's the function we'll ask optuna to optimize for us. It's fairly big, because it contains a whole training loop and loops over the folds too. 

Ideally, we'd make this even more complex by logging more metrics, like training loss and training accuracy. We could log them on the `trial` object by doing `trial.set_user_attr('training_losses', training_losses)`.

We're setting the torch seed for each trial, to ensure the models we create are reproducible. Optionally, we'll return the out of fold predictions, the test predictions and the `cv.n_splits` neural nets we trained. Optuna won't use that, it'll use only the validation accuracy -- but later on, when we're retraining the best model we found, we'll use it.

In [None]:
def run_experiment(trial: optuna.Trial, return_preds=False, progress='folds', cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=64)):
    oof_preds = torch.zeros(len(X), len(label_encoder.classes_), device=device)
    test_preds = torch.zeros(X_test.shape[0], len(label_encoder.classes_), device=device)
    n_epochs = 140

    torch.manual_seed(64)
    make_model = set_trial_params(device, trial)
    nets = []

    if progress == 'folds':
        folds = tqdm(cv.split(X, y), total=cv.n_splits)
    else:
        folds = cv.split(X, y)
    for cv_no, (train_idx, val_idx) in enumerate(folds):
        X_train, y_train = X[train_idx].to(device), y[train_idx].to(device)
        X_val, y_val = X[val_idx].to(device), y[val_idx].to(device)

        model = make_model()

        it = range(n_epochs) if progress == 'folds' else trange(n_epochs)
        for epoch in it:
            x_order = torch.randperm(X_train.shape[0], device=device)
            x_batches = torch.split(x_order, 4096)
        
            model['net'].train()
        
            for batch_idx in x_batches:
                model['optimizer'].zero_grad()
                X_b, y_b = X_train[batch_idx], y_train[batch_idx]
                y_hat = model['net'](X_b)
                loss = F.cross_entropy(y_hat, y_b)
                loss.backward()
                model['optimizer'].step()
            
            model['net'].eval()
            accurate = 0
            loss = 0
            with torch.no_grad():
                for batch_idx in torch.split(torch.arange(y_val.shape[0], device=device), 8192):
                    X_b, y_b = X_val[batch_idx], y_val[batch_idx]
                    y_hat = model['net'](X_b)
                    loss += F.cross_entropy(y_hat, y_b, reduction='sum').item()
                    accurate += (F.softmax(y_hat, dim=1).argmax(axis=1) == y_b).sum().item()
            accurate = accurate / y_val.shape[0]
            loss = loss / y_val.shape[0]
            model['scheduler'].step(loss)
            model['tracker'].step(accurate)
            if progress != 'folds':
                it.set_description(f'acc={accurate:.4f} best_acc={model["tracker"].best_accuracy:.4f} loss={loss:.4f} ')

        model['tracker'].restore()
        nets.append(model['net'])
        with torch.no_grad():
            oof_preds[val_idx] = F.softmax(model['net'](X_val), dim=1)
            val_acc = (oof_preds[val_idx].argmax(axis=1).cpu() == y[val_idx]).float().mean().item()
            if progress == 'folds':
                folds.set_description(f'val_acc = {val_acc:.4f}')
            trial.report(
                val_acc, cv_no
            )
            test_preds += F.softmax(model['net'](X_test.to(device)), dim=1) / cv.n_splits
            if trial.should_prune():
                raise optuna.TrialPruned()
    with torch.no_grad():
        if not return_preds:
            # .item() hands optuna a plain Python float instead of a 0-dim tensor
            return (oof_preds.argmax(axis=1).cpu() == y).float().mean().item()
        else:
            return oof_preds, test_preds, nets

Now we simply need to ask optuna to find the best parameters, and give it a timeout. I'll give it 4 hours here. On kaggle, the folds take around 100 seconds with a GPU, so we should be able to do 36 folds / hour, or about 7 trials per hour, which works out to roughly 29 full trials. But we're also actively pruning trials that are not impressing us after 1 fold, so we should be able to do many more than that:

In [None]:
study.optimize(run_experiment, timeout=4 * 60 * 60)
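
The back-of-the-envelope budget can be written down as a small helper. The function name is made up; the inputs are the ~100 seconds per fold and 5 folds per trial from above, and the 4 hours come from the `timeout=4 * 60 * 60` passed to `study.optimize`:

```python
def trial_budget(timeout_hours, seconds_per_fold=100, folds_per_trial=5):
    """Estimate how many unpruned trials fit into a time budget."""
    folds_per_hour = 3600 / seconds_per_fold
    trials_per_hour = folds_per_hour / folds_per_trial
    return folds_per_hour, trials_per_hour, trials_per_hour * timeout_hours


folds_per_hour, trials_per_hour, full_trials = trial_budget(timeout_hours=4)
# Every trial pruned after its first fold costs 1 fold instead of 5,
# so the same budget covers more trials once pruning kicks in.
```

That gives 36 folds per hour, 7.2 trials per hour and about 29 full trials as a lower bound when nothing gets pruned.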

At this point, we've found the best parameters, let's fetch the predictions and models for those:

In [None]:
oof_preds, test_preds, nets = run_experiment(study.best_trial, return_preds=True, progress='epochs', cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=64))

We will predict on the ~3 million samples we removed by downsampling, to verify that our model isn't terrible on those:

In [None]:
all_df = pd.read_parquet(f'{data_root}/tpsdec2021parquet/train_fe.pq').assign(
    Id=pd.read_parquet(f'{data_root}/tpsdec2021parquet/train.pq', columns=['Id']).Id.to_numpy()
)

all_df = all_df.loc[all_df.Cover_Type != 5]
not_predicted = all_df.loc[~all_df.Id.isin(df.Id)]

not_predicted_X = torch.from_numpy(scaler.transform(not_predicted.drop(columns=['Id', 'Cover_Type'])).astype(np.float32))
not_predicted_y = torch.from_numpy(label_encoder.transform(not_predicted.Cover_Type))
preds = torch.zeros(len(not_predicted), len(label_encoder.classes_))

with torch.no_grad():
    for net in tqdm(nets):
        net.eval()
        y_hat = torch.cat([F.softmax(net(X_b.to(device)), dim=1) for X_b in torch.split(not_predicted_X, 16384)], dim=0).cpu()
        preds += y_hat / len(nets)

(preds.argmax(dim=1) == not_predicted_y).float().mean().item()

As expected, the easy cases are easy and that's why we excluded most of them from training -- we can go much faster like this.

We should store the complete oof as well, not just the oof for the samples we trained on. This is a bit annoying to recombine; I can't think of any better way than to concatenate both sets of predictions and sort them by Id:

In [None]:
all_oof = torch.cat([preds, oof_preds.cpu()], dim=0).numpy()
i = np.arange(len(label_encoder.classes_))
col_names = [f'Cover_Type={c}' for c in label_encoder.inverse_transform(i)]

all_oof = pd.DataFrame(all_oof, columns=col_names).assign(
    Id=np.concatenate([not_predicted.Id.to_numpy(), df.Id.to_numpy()], axis=0)
)
all_oof = all_oof.sort_values(by='Id')
all_oof.drop(columns=['Id']).to_parquet('oof_mlp_proba.pq', index=False)

oof_pred_label = label_encoder.inverse_transform(all_oof.drop(columns=['Id']).to_numpy().argmax(axis=1))

oof_acc = np.mean(oof_pred_label == all_df.Cover_Type)

print(f'oof_acc = {oof_acc:.4f}')
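
The recombination step is easier to check on a toy example. Here is a numpy-only sketch of the same idea, with made-up Ids and probabilities: concatenate the held-out and trained rows in the same order as their Ids, then sort both by Id so the rows line up with the original data again:

```python
import numpy as np

# Hypothetical Ids: rows 1 and 3 were used for training, the rest were held out.
trained_ids = np.array([3, 1])
held_out_ids = np.array([4, 0, 2])

# One probability row per sample; the first column simply encodes Id / 10
# so the final ordering is easy to check.
trained_proba = np.array([[0.3, 0.7], [0.1, 0.9]])
held_out_proba = np.array([[0.4, 0.6], [0.0, 1.0], [0.2, 0.8]])

ids = np.concatenate([held_out_ids, trained_ids])
proba = np.concatenate([held_out_proba, trained_proba], axis=0)

# Sorting by Id puts every probability row back at its original position.
order = np.argsort(ids, kind='stable')
ids_sorted, proba_sorted = ids[order], proba[order]
```

The important part is that the Ids and the probabilities are concatenated in the same order before sorting, otherwise the rows end up silently misaligned.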

It's worth remembering that ~.86 out of fold accuracy on the downsampled data set corresponds to ~.962 out of fold accuracy on the complete data set. We're ignoring around 73.5% of the data, because we expect to be able to do 99.9% accuracy on it. So the math here is easy enough: `0.735 * 0.999 + oof_acc * (1 - 0.735)`, which comes out to 96.22% for oof_acc = 86%.
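
Written as code, with the percentages expressed as fractions, the weighted average looks like this (the function name is just for illustration, and the 73.5% and 99.9% figures are the approximate ones quoted above):

```python
def full_data_accuracy(oof_acc, easy_share=0.735, easy_acc=0.999):
    """Weighted average of accuracy on the easy (skipped) and hard (trained) parts."""
    return easy_share * easy_acc + (1 - easy_share) * oof_acc


estimate = full_data_accuracy(0.86)
print(f'{estimate:.4f}')  # 0.9622
```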

Let's store the probabilities for our test predictions so we can easily blend later:

In [None]:
test_proba = pd.DataFrame(test_preds.cpu().numpy(), columns=col_names)
test_proba.to_parquet('mlp_test_proba.pq', index=False)

Now I've stored the probabilities of both the out of fold predictions, and test predictions. Let's check if we benefit from blending with the booster from the downsampling notebook:

In [None]:
blend = pd.read_parquet('oof_mlp_proba.pq').to_numpy() + pd.read_parquet(f'{data_root}/tps202112-downsample-easy-cases/oof_proba_out.pq').to_numpy()
blend_pred = label_encoder.inverse_transform(blend.argmax(axis=1))
np.mean(blend_pred == all_df.Cover_Type)
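
To show why adding probabilities can change predictions, here is a toy numpy sketch with invented numbers for two models and three samples. Summing and averaging give the same `argmax`, so the unnormalized sum is enough:

```python
import numpy as np

# Hypothetical class probabilities from two models for three samples.
nn_proba = np.array([[0.6, 0.4], [0.45, 0.55], [0.2, 0.8]])
booster_proba = np.array([[0.7, 0.3], [0.7, 0.3], [0.1, 0.9]])

nn_pred = nn_proba.argmax(axis=1)
blend_pred = (nn_proba + booster_proba).argmax(axis=1)
```

On the second sample the NN is only slightly in favour of class 1, while the booster is fairly confident in class 0, so the blend flips that prediction. Whether such flips help on real data is exactly what the accuracy check above measures.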

That's an improvement, so we'll submit the blend, rather than just the NN:

In [None]:
blend = (
    pd.read_parquet('mlp_test_proba.pq').to_numpy() 
    + pd.read_parquet(f'{data_root}/tps202112-downsample-easy-cases/test_proba_out.pq').to_numpy()
)
blend_pred = label_encoder.inverse_transform(blend.argmax(axis=1))  # use the blend, not just the NN probabilities
sub = df_test[['Id']].assign(Cover_Type=blend_pred)
sub.head()

In [None]:
sub.to_csv('submission.csv', index=False)