Simple models that do OK
==

To my great joy, I quickly discovered that this time around, I won't be tuning boosters for weeks. :-)

I found a number of other promising approaches that are quick to train and don't have as many hyperparameters to tune -- which should free up time to work on feature engineering instead. In this notebook, I've set up simple demos of how that works using cross-validation, but I'm not storing any output here. I've submitted a few of these models, and they do worse on the public LB than on these CVs, but they should still mostly be around AUC `.745`, and there's even one model achieving AUC `.747` here.

There's no EDA in this notebook, only things that are plug-and-play -- I had spent some time looking at features before I started testing these models, and got reasonable results with no feature engineering. It looks like the input data is probably generated from TfIdf vectors, so I went looking for models that people would normally combine with those. Towards the end, I did some simple blending experiments.

Let's get the rig out of the way first. Here are my usual-suspect imports, plus a scikit-learn upgrade to match what I have on my own machine:

In [None]:
%pip install -q -U scikit-learn

import random
import pandas as pd
import numpy as np
import seaborn as sns
from multiprocessing import cpu_count
from matplotlib import pyplot as plt

from sklearn import (
    metrics,
    model_selection,
    linear_model,
    pipeline,
    preprocessing,
    base
)

Setting up folds
==

I reuse the same folds across all of the models here, in order to be able to compare results.

I also set some variables I'll want to use later for checking metrics: `y_true` holds the targets in out-of-fold order (the validation folds concatenated one after another), so I can check model performance by calling `metrics.roc_auc_score(y_true, y_score)`.

I'm using 5 folds here, somewhat arbitrarily chosen. 

In [None]:
np.random.seed(42)
random.seed(42)
n_jobs = cpu_count()

sns.set(
    style='whitegrid',
    context='notebook',
    rc={'figure.frameon': False, 'legend.frameon': False, 'figure.figsize': (12, 8)}
)

data_root = '/kaggle/input/tabular-playground-series-nov-2021'

df = pd.read_csv(f'{data_root}/train.csv', dtype=np.float32).astype({'id': np.int32}).sample(frac=1) # shuffle
df_test = pd.read_csv(f'{data_root}/test.csv', dtype=np.float32).astype({'id': np.int32})
X_test = df_test.drop(columns=['id']).to_numpy()

ids, X, y = df['id'].to_numpy(), df.drop(columns=['id', 'target']).to_numpy(), df.target.to_numpy()

folds_idx = list(model_selection.StratifiedKFold(n_splits=5).split(X, y))
val_order = np.concatenate([val_idx for _, val_idx in folds_idx], axis=0)
y_true = y[val_order]
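
To make the ordering concrete, here's a minimal sketch on made-up data (`X_toy`, `y_toy` and the other names are stand-ins for illustration, not part of the rig above). It checks that concatenating the validation indices gives a permutation of the rows, and that anything collected fold by fold lines up with `y[val_order]` rather than with `y` itself:

```python
import numpy as np
from sklearn import model_selection

# Toy stand-ins for the notebook's X and y (hypothetical data, not the competition set).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = np.repeat([0, 1], 10)

folds = list(model_selection.StratifiedKFold(n_splits=5).split(X_toy, y_toy))
order = np.concatenate([val_idx for _, val_idx in folds])

# Every row lands in exactly one validation fold, so `order` is a permutation of the rows.
is_permutation = np.array_equal(np.sort(order), np.arange(len(y_toy)))

# Pretend the "predictions" are the targets themselves, collected fold by fold.
fake_oof = np.concatenate([y_toy[val_idx] for _, val_idx in folds])

# They line up with y_toy[order], which is what y_true plays the role of above.
aligned = np.array_equal(fake_oof, y_toy[order])
```

So anything collected fold by fold should be scored (or fitted) against `y_true`, not against `y`.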

Making a rig to check CV score
==

This is something you could just use `sklearn.model_selection.cross_val_score()` for. But this way of doing it is useful if you plan to do something with the estimators that are trained -- for example, store their test predictions and out of fold predictions (which I do, just not in this notebook).

In [None]:
def predict(model, X):
    """We'd prefer using predict_proba, but can fall back to decision_function or predict if need be."""

    try:
        return model.predict_proba(X)[:, 1]
    except (AttributeError, ValueError) as e:
        # It doesn't have it, or it is SGD with loss='hinge', which has, but doesn't support it
        if hasattr(model, 'decision_function'):
            return model.decision_function(X)
        else:
            return model.predict(X)
        

def score_model(model: base.BaseEstimator, n_jobs=n_jobs):
    """
    Returns (auc, oof_predictions, test_predictions)
    
    auc is for oof_predictions, but it should not be *too* far off from test_predictions
    
    """
    report = model_selection.cross_validate(
        model, X, y, 
        cv=folds_idx, # we always use the same folds
        scoring='roc_auc',
        return_estimator=True, # this gives us the trained estimators in the report
        n_jobs=n_jobs,
    )
    oof, test = [], []
    
    for est, (_, val_idx) in zip(report['estimator'], folds_idx):
        oof.append(predict(est, X[val_idx]))
        test.append(predict(est, X_test))
        
    oof = np.concatenate(oof, axis=0)
    test = np.c_[test].mean(axis=0)
    return metrics.roc_auc_score(y_true, oof), oof, test
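
For comparison, here's roughly what the plain scikit-learn helpers give you, sketched on toy data (the `*_toy` names and the `LogisticRegression` stand-in are assumptions for illustration). `cross_val_score` returns one AUC per fold and `cross_val_predict` returns out-of-fold predictions, but neither hands back the fitted estimators:

```python
import numpy as np
from sklearn import datasets, linear_model, metrics, model_selection

# Hypothetical toy data standing in for the notebook's X and y.
X_toy, y_toy = datasets.make_classification(n_samples=500, n_features=10, random_state=0)
folds = list(model_selection.StratifiedKFold(n_splits=5).split(X_toy, y_toy))
model = linear_model.LogisticRegression(max_iter=1000)

# One AUC per fold; the fitted estimators are thrown away.
fold_aucs = model_selection.cross_val_score(model, X_toy, y_toy, cv=folds, scoring='roc_auc')

# Out-of-fold probabilities, returned in the original row order (so score against y_toy).
oof_toy = model_selection.cross_val_predict(
    model, X_toy, y_toy, cv=folds, method='predict_proba'
)[:, 1]
pooled_auc = metrics.roc_auc_score(y_toy, oof_toy)
```

The mean of the per-fold AUCs and the pooled out-of-fold AUC are usually close but not identical, since the pooled version ranks predictions from different fold models against each other.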

RidgeClassifier
==

Ridge Regression is supposed to work well when features are highly correlated, which should be the case for TfIdf vectors, e.g. multiple terms occurring together.

In [None]:
rdg = pipeline.make_pipeline(
    preprocessing.RobustScaler(),
    linear_model.RidgeClassifierCV()
)

%time rdg_auc, rdg_oof, rdg_preds = score_model(rdg)
rdg_auc

That's a pretty good score for a model that's so fast to fit.
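
Here's a small sketch of why the L2 penalty helps with correlated features, on made-up data with two nearly identical columns (all names here are hypothetical). It uses plain `Ridge` regression rather than `RidgeClassifier`, on the assumption that the classifier behaves similarly since it fits a ridge regression on -1/+1 targets:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
signal = rng.normal(size=200)

# Two nearly identical columns, like two terms that almost always occur together.
X_toy = np.c_[signal, signal + rng.normal(scale=1e-3, size=200)]
y_toy = signal + rng.normal(scale=0.1, size=200)

ols = linear_model.LinearRegression().fit(X_toy, y_toy)
ridge = linear_model.Ridge(alpha=1.0).fit(X_toy, y_toy)

# Unpenalized least squares is free to put large, offsetting weights on the twin columns;
# the ridge penalty shrinks the weights and splits them almost evenly instead.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
ridge_gap = abs(ridge.coef_[0] - ridge.coef_[1])
```

Printing `ols.coef_` next to `ridge.coef_` shows the difference directly.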

SGD
==

Now we can check hinge-loss SGD, which is similar to a linear SVM. SVMs were state-of-the-art for natural language processing at some point, and the synthetic data is generated based on an email dataset, so it's perhaps not surprising that it works well here.

In [None]:
sgd = pipeline.make_pipeline(
    preprocessing.RobustScaler(), 
    linear_model.SGDClassifier(loss='hinge', learning_rate='adaptive', penalty='l2', alpha=1e-3, eta0=0.025)
)

%time sgd_auc, sgd_oof, sgd_preds = score_model(sgd)
sgd_auc

I tried several other approaches for SVM, but had issues getting them to work due to the number of samples we have.

These parameters weren't chosen at random. I did a hyperparameter search here for about an hour, based on a hunch that SGD would be great on this dataset. SGD is reasonably fast to experiment with.
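
To illustrate the "similar to a linear SVM" point, here's a rough sketch that fits hinge-loss SGD and `LinearSVC` side by side on toy data (the data, the split and the parameter values are assumptions for illustration, not the tuned settings above):

```python
from sklearn import datasets, linear_model, metrics, model_selection, pipeline, preprocessing, svm

# Hypothetical toy data, much smaller than the competition set.
X_toy, y_toy = datasets.make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
X_tr, X_va, y_tr, y_va = model_selection.train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

models = {
    'sgd_hinge': linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=0),
    'linear_svc': svm.LinearSVC(C=1.0),
}

aucs = {}
for name, clf in models.items():
    pipe = pipeline.make_pipeline(preprocessing.StandardScaler(), clf).fit(X_tr, y_tr)
    # Neither model offers usable probabilities here; AUC only needs a ranking,
    # so the signed margins from decision_function are enough.
    aucs[name] = metrics.roc_auc_score(y_va, pipe.decision_function(X_va))
```

Both optimize an L2-regularized hinge loss over a linear decision function; SGD just gets there with stochastic updates, which is what makes it practical at this sample count.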

Linear Discriminant Analysis
==

Doesn't get much simpler to tune than this one:

In [None]:
from sklearn import discriminant_analysis

lda = pipeline.make_pipeline(
    preprocessing.StandardScaler(), 
    discriminant_analysis.LinearDiscriminantAnalysis(),
)

%time lda_auc, lda_oof, lda_preds = score_model(lda)
lda_auc

I believe this is also a classic for NLP tasks with TfIdf vectors. I've used it for feature transformations in the past, but for this dataset it works out to be a reasonably good classifier on its own, no tuning required.
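
As a sketch of the feature-transformation use, LDA can also act as a supervised projection step. The toy data and names below are hypothetical; with two classes there's only one discriminant axis, so the output is a single column:

```python
from sklearn import datasets, discriminant_analysis, pipeline, preprocessing

# Hypothetical toy data with two classes.
X_toy, y_toy = datasets.make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

lda_features = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    discriminant_analysis.LinearDiscriminantAnalysis(),
)

# LDA projects onto at most n_classes - 1 axes, so a binary target yields one feature.
Z = lda_features.fit_transform(X_toy, y_toy)
```

That single column could then be appended to the original features for another model.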

Logistic Regression
==

This one isn't too bad here either:

In [None]:
lreg = pipeline.make_pipeline(preprocessing.RobustScaler(), linear_model.LogisticRegression(C=0.0015, penalty='l2'))
%time lreg_auc, lreg_oof, lreg_preds = score_model(lreg)
lreg_auc

I did a very quick hyperparameter search here, and got better scores using `RobustScaler` than `StandardScaler`.

With the very similar scores so far, it seems likely that these linear models are all finding the same solution.
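
One way to probe that hunch is to look at the rank correlation between the models' out-of-fold scores. Here's a minimal sketch, where `oof_rank_correlation` is a hypothetical helper and the demo scores are made up:

```python
import numpy as np
import pandas as pd


def oof_rank_correlation(oof_by_model):
    """Spearman correlation between out-of-fold scores (dict of name -> 1-D array)."""
    return pd.DataFrame(oof_by_model).corr(method='spearman')


# Made-up scores: two near-duplicate models and one unrelated one.
rng = np.random.default_rng(0)
base_scores = rng.normal(size=1000)
corr = oof_rank_correlation({
    'model_a': base_scores,
    'model_b': 2.0 * base_scores + rng.normal(scale=0.05, size=1000),
    'model_c': rng.normal(size=1000),
})
```

In this notebook it could be called as `oof_rank_correlation({'rdg': rdg_oof, 'sgd': sgd_oof, 'lda': lda_oof, 'lreg': lreg_oof})`. Values very close to 1 would support the idea that the linear models rank the rows almost identically, and would suggest that blending them adds little.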

Naive Bayes
==

This would've been the classic for working with email (spam detection), but unfortunately it doesn't quite get there. Maybe with some tuning, it could?

In [None]:
from sklearn import naive_bayes

nb = pipeline.make_pipeline(
    preprocessing.MinMaxScaler(feature_range=(0, 1)),
    naive_bayes.MultinomialNB()
)

%time nb_auc, nb_oof, nb_preds = score_model(nb)
nb_auc

`.725` is probably not even good enough to include in blending. It's possible that we could tune it, or do better with some feature engineering?

Either way, it's blazing fast, so if we were to experiment with it, we should be able to do it rapidly.
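
A note on the `MinMaxScaler` in that pipeline: `MultinomialNB` refuses negative inputs, so the features need to be mapped to a non-negative range first. Here's a small sketch of that behavior on made-up data (names are hypothetical):

```python
import numpy as np
from sklearn import naive_bayes, pipeline, preprocessing

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))        # contains negative values
y_toy = (X_toy[:, 0] > 0).astype(int)

try:
    naive_bayes.MultinomialNB().fit(X_toy, y_toy)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True                # negative inputs are rejected at fit time

# Scaling each feature into [0, 1] first makes the same data acceptable.
scaled_nb = pipeline.make_pipeline(
    preprocessing.MinMaxScaler(feature_range=(0, 1)),
    naive_bayes.MultinomialNB(),
).fit(X_toy, y_toy)
proba = scaled_nb.predict_proba(X_toy)[:, 1]
```

Whether min-max scaled values are a sensible stand-in for term counts is a separate question, and may be part of why the score lags here.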


MLP
==

This one, I would normally implement with torch or tensorflow and run it on the GPU, but for the sake of just showing that it works pretty well here, I've left the scikit-learn implementation running, since this isn't a particularly expensive NN to evaluate:

In [None]:
from sklearn import neural_network

mlp = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neural_network.MLPClassifier(hidden_layer_sizes=(100, 50, 10), batch_size=256, early_stopping=True)
)
%time mlp_auc, mlp_oof, mlp_preds = score_model(mlp)

mlp_auc

This is the first set of parameters I tried, so there's no reason to believe we couldn't do better with some tuning, or maybe a different NN architecture than MLP.

Ensembling / blending
==

Since we left all the out-of-fold predictions and test predictions in memory, we can easily start experimenting with ensembling at this point. We have a reasonably diverse selection of models here, and every model was scored on the same folds in the same order, so it's easy to combine their predictions and check what our best combination might be.

We'd do that by combining the out of fold predictions we kept, then score that against `y_true`. Here's a simple example of averaging everything except Naive Bayes:

In [None]:
scaler = preprocessing.MinMaxScaler()
oof_preds = np.c_[[sgd_oof, lda_oof, lreg_oof, mlp_oof, rdg_oof]].swapaxes(0, 1)
oof_blend = np.mean(scaler.fit_transform(oof_preds), axis=1)
metrics.roc_auc_score(y_true, oof_blend)
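
An alternative to min-max scaling before averaging is to average ranks, since AUC only depends on the ordering of the scores. This is a sketch of that swapped-in technique, not what the cell above does; `rank_blend` and the toy scores are hypothetical:

```python
import numpy as np
from scipy import stats
from sklearn import metrics


def rank_blend(score_columns):
    """Average per-model ranks; score_columns has shape (n_samples, n_models)."""
    ranks = np.apply_along_axis(stats.rankdata, 0, score_columns)
    return ranks.mean(axis=1) / len(score_columns)


rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=500)

# Two made-up models on very different scales: a raw margin and a probability.
margins = y_toy + rng.normal(scale=1.0, size=500)
probs = 1.0 / (1.0 + np.exp(-(y_toy + rng.normal(scale=1.5, size=500))))
blend = rank_blend(np.c_[margins, probs])

# Ranking a single model's scores leaves its AUC unchanged.
auc_raw = metrics.roc_auc_score(y_toy, margins)
auc_ranked = metrics.roc_auc_score(y_toy, stats.rankdata(margins))
blend_auc = metrics.roc_auc_score(y_toy, blend)
```

Rank averaging puts decision-function margins and probabilities on the same footing, and is less sensitive to a few extreme scores than min-max scaling.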

We could also try to grid search over some simple estimators:

In [None]:
X_blend = np.c_[[sgd_oof, lda_oof, lreg_oof, mlp_oof, nb_oof, rdg_oof]].swapaxes(0, 1)

blend = pipeline.Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('predict', linear_model.LinearRegression())
])

param_grid = {
    'scaler': [preprocessing.RobustScaler(), preprocessing.StandardScaler(), preprocessing.MinMaxScaler()],
    'predict': [
        linear_model.RidgeClassifier(), linear_model.LogisticRegression(), 
        linear_model.BayesianRidge(), linear_model.LinearRegression()
    ]
}

grid_search = model_selection.GridSearchCV(blend, param_grid, n_jobs=-1, refit=True, scoring='roc_auc', cv=folds_idx)
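# X_blend rows are in out-of-fold order, so the targets must be y_true (= y[val_order]), not y
%time grid_search.fit(X_blend, y_true)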
pd.DataFrame(grid_search.cv_results_)[['param_predict', 'param_scaler', 'mean_test_score']]

In [None]:
grid_search.best_score_

Or, since I also kept our estimators around, we could use the sklearn stacking ensemble, but this will retrain everything, so it'll take more time than the rest of the notebook up until this point:

In [None]:
from sklearn import ensemble

stack = ensemble.StackingClassifier(
    estimators=[('sgd', sgd), ('lda', lda), ('lreg', lreg), ('mlp', mlp), ('rdg', rdg), ('nb', nb)], 
    final_estimator=pipeline.make_pipeline(preprocessing.StandardScaler(), linear_model.LogisticRegression())
)
%time stack_auc, stack_oof, stack_preds = score_model(stack)
stack_auc

In [None]:
df_test[['id']].assign(
    target=stack_preds
).to_csv('/kaggle/working/simple_stacker.csv', index=False)

I'm sending in the submission from that stack, just to associate a score with this notebook.

Done
==

Note that this highly overestimates our LB score due to a target leak: since we're using our out-of-fold data for things like early stopping and choosing the best models, we can't also reuse the folds to estimate the blend score. One way around that would be to split off some data from train before making the folds, then use that data to measure the blends. It shouldn't be too hard to make that adaptation.
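
Here's a rough sketch of what that could look like, on toy data with two stand-in base models (every name and setting below is an assumption for illustration, not something from this notebook's rig):

```python
import numpy as np
from sklearn import datasets, discriminant_analysis, linear_model, metrics, model_selection

# Hypothetical toy data standing in for train.csv.
X_toy, y_toy = datasets.make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# 1. Set the holdout aside before any folds are made.
X_dev, X_hold, y_dev, y_hold = model_selection.train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

base_models = [
    linear_model.LogisticRegression(max_iter=1000),
    discriminant_analysis.LinearDiscriminantAnalysis(),
]

# 2. Out-of-fold predictions on the development part only (in the row order of X_dev).
oof_cols = np.column_stack([
    model_selection.cross_val_predict(m, X_dev, y_dev, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# 3. Fit the blender on the OOF columns, then refit the base models on all of X_dev.
blender = linear_model.LogisticRegression().fit(oof_cols, y_dev)
hold_cols = np.column_stack([
    m.fit(X_dev, y_dev).predict_proba(X_hold)[:, 1] for m in base_models
])

# 4. The holdout took no part in fitting or model selection, so this AUC is a cleaner estimate.
holdout_auc = metrics.roc_auc_score(y_hold, blender.predict_proba(hold_cols)[:, 1])
```

The price is that the base models see less training data, so the holdout fraction is a trade-off between a trustworthy blend score and slightly weaker models.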

So, where to go from here? One option is to train many models (possibly some of these), using a setup much like this, then find the best way of blending, or stacking them. Eventually, I think that's probably the direction I will be heading. There are also lots of NN architectures that could be worth a shot here, and there's almost certainly a way to tune a booster so that it could compete with these models we found. But for now, I think I'd like to try doing some work on our features. :-)