Grid search
==

You're in the first half hour of looking at a new problem. What's the first thing you should do?

Well, that's exploratory data analysis, or EDA for short. To me, that involves running a quick and dirty search for models that will work **while** I'm doing the analysis. I'll do a quick check for missing data and, if needed, impute or fill some of it. Then I'll start the search, with the goal of getting a promising lead on a model I should pursue later.

At this point, I barely know the data at all. But it's better to start this job now, while I have other things to do than wait for it to finish, than later, when I have time to work but don't know where to take it. So I think this is often a good idea for one of the first steps. Grid searches are not only for hyperparameter tuning; they're also not bad for finding out where you should start.

First, I'll install a more recent version of scikit-learn; namely, I want one that has `HistGradientBoostingClassifier`, because that makes it super fast to set up an experiment that checks the classifiers I'll most commonly want to try. So, here's that:

In [None]:
%pip install -U -q scikit-learn

Then I do lots of imports, set the most important seeds (for reproducible results), and read in the data. I also make sure to always use the same cross-validation folds:

In [None]:
import random
import pandas as pd
import numpy as np

from sklearn import (
    linear_model,
    neural_network,
    model_selection,
    pipeline,
    preprocessing,
    svm,
    ensemble,
    metrics
)

random.seed(42)
np.random.seed(42)
folds = model_selection.StratifiedKFold(5, shuffle=True, random_state=42)

df = pd.read_csv(
    '../input/tabular-playground-series-nov-2021/train.csv', dtype=np.float32
).astype({'id': np.int32})
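
As a small sanity check on the "same folds" idea, here is a sketch on made-up toy data (the `*_toy` names are not part of the notebook). It assumes only that a splitter created with an integer `random_state` produces identical splits each time `.split()` is called:

```python
import numpy as np
from sklearn import model_selection

# Toy data: 20 samples with a balanced binary target (made up for illustration)
X_toy = np.arange(40, dtype=np.float32).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

folds_toy = model_selection.StratifiedKFold(5, shuffle=True, random_state=42)

# Collect the test indices of every fold, twice, from the same splitter
first = [test.tolist() for _, test in folds_toy.split(X_toy, y_toy)]
second = [test.tolist() for _, test in folds_toy.split(X_toy, y_toy)]

# With an integer random_state, both passes give the same folds
same_folds = first == second
print(same_folds)
```

That is what lets me compare scores from different experiments fold by fold later on.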

The next thing I do, if I know the data has no glaring problems, is set up a `sklearn.pipeline.Pipeline` and verify that it works by running it through `cross_val_score`:

In [None]:
clf = pipeline.Pipeline([
    ('scaler', preprocessing.StandardScaler()), # name of the step to the left
    ('clf', linear_model.LogisticRegression()), # name of the step to the left
])

X = df.drop(columns=['id', 'target'])
y = df.target

model_selection.cross_val_score(clf, X, y, cv=folds)

`LogisticRegression` with `StandardScaler` is very often a good first choice for tabular problems that are already numeric. It's fast to evaluate and can often get you decent models. With categorical features involved, I might try some kind of tree first, since it's fast to get started.
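
`cross_val_score` returns one score per fold, so it's usually easier to read as a mean and a spread. Here's a minimal sketch of that; it uses a synthetic dataset from `make_classification` as a stand-in (an assumption, not the competition data), and the `baseline_*` names are made up for the example:

```python
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Synthetic stand-in data (an assumption, not the competition set)
X_base, y_base = datasets.make_classification(
    n_samples=500, n_features=10, random_state=42
)

baseline_clf = pipeline.Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('clf', linear_model.LogisticRegression()),
])
baseline_folds = model_selection.StratifiedKFold(5, shuffle=True, random_state=42)

# One score per fold; without a `scoring` argument a classifier is scored on accuracy
baseline_scores = model_selection.cross_val_score(
    baseline_clf, X_base, y_base, cv=baseline_folds
)
print(f'accuracy: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}')
```

If the spread across folds is large compared to the differences between models, I'd be careful about reading too much into a single ranking.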

Setting up a GridSearchCV
==

Once you have a pipeline that works, it's super easy to have `sklearn.model_selection.GridSearchCV` try swapping parts of the pipeline for other parts, to find the combination that works best.

This is fully automatic, and fully exhaustive. E.g. if your grid contains 3 options with 4 choices each, that's 4 x 4 x 4 = 64 combinations that it'll search, and each one is fitted once per cross-validation fold. So this could take a while. Which is a good reason to start the job early on, while you have other things to do than wait for it!
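
To make that arithmetic concrete, here's a tiny sketch using only the standard library. The grid and its option names are hypothetical; the point is just that an exhaustive search visits the full Cartesian product of the choices:

```python
from itertools import product

# Hypothetical grid: 3 options with 4 choices each (names are made up)
toy_grid = {
    'option_a': [1, 2, 3, 4],
    'option_b': ['w', 'x', 'y', 'z'],
    'option_c': [0.1, 0.2, 0.3, 0.4],
}

# An exhaustive search tries every element of the Cartesian product
combinations = list(product(*toy_grid.values()))
n_folds = 5

print(len(combinations))            # 64 candidate settings
print(len(combinations) * n_folds)  # 320 model fits with 5-fold cross-validation
```

When the grid is a list of dicts, the candidates from each dict are simply added together, so splitting a grid into several small dicts is a cheap way to keep the total down.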

Usually you should try a few different preprocessing steps. I tried to keep the clutter out of this demo, though.

In [None]:
grid = [ # A grid can be a single dict, or a list of dicts
    dict(
        # All combinations of options in this dict will be tried together
        # 3 here -- 1 scaler times 3 different models
        scaler=[preprocessing.StandardScaler()], # name of the pipeline step!
        clf=[ # name of the pipeline step!
            # These classifiers prefer centered input, so they work well with StandardScaler
            svm.LinearSVC(), 
            linear_model.LogisticRegression(),
            neural_network.MLPClassifier(early_stopping=True, hidden_layer_sizes=(32, 8)),
        ]
    ),
    dict( 
        # You can pass as many dicts as you'd like, 
        # let's try using different scalers with SGDClassifier
        scaler=[preprocessing.StandardScaler(), preprocessing.MinMaxScaler()],
        clf=[linear_model.SGDClassifier()],
        clf__loss=['hinge', 'log_loss'], # run SGD twice, once with hinge loss and once with log loss ('log_loss' was spelled 'log' before scikit-learn 1.1)
    ),
    dict(
        scaler=[None], # These two don't care about scaling, so we can run it without
        clf=[ensemble.RandomForestClassifier(), ensemble.HistGradientBoostingClassifier()]
    )
]
grid
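
The keys in those dicts follow the pipeline's parameter naming: a bare step name swaps out the whole step, and `<step>__<parameter>` reaches into that step. As far as I understand it, the search applies each candidate through `set_params`, so here's a sketch of doing one such swap by hand. The `naming_pipe` name and the `'modified_huber'` value are only for illustration:

```python
from sklearn import linear_model, pipeline, preprocessing

naming_pipe = pipeline.Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('clf', linear_model.LogisticRegression()),
])

# 'clf' replaces the whole step; 'clf__loss' sets a parameter on the new step
naming_pipe.set_params(
    clf=linear_model.SGDClassifier(),
    clf__loss='modified_huber',
)

step_name = type(naming_pipe.named_steps['clf']).__name__
step_loss = naming_pipe.get_params()['clf__loss']
print(step_name, step_loss)
```
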

You can use the same mechanism to pass different kinds of options to the different classifiers -- for SGD, we're trying two different loss functions here. Now we can set up the GridSearchCV estimator:

In [None]:
gridsearch = model_selection.GridSearchCV(
    clf, # the pipeline
    grid, # the grid,
    cv=folds, # the folds
    scoring='roc_auc', # optionally the scoring
    verbose=True, # If you want more output
)

And that's just a normal estimator, so we use it by calling `.fit()`. This will take a while, depending on how many models you asked it to check:

In [None]:
%time gridsearch.fit(X, y)

Once fitted, `GridSearchCV` has a convenient `cv_results_` dict that can be turned into a `pd.DataFrame` and inspected:

In [None]:
pd.DataFrame(gridsearch.cv_results_)
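
The full frame has a lot of columns, so I usually narrow it down. Here's a self-contained sketch of one way to do that, on a small synthetic dataset and a deliberately tiny grid (both are assumptions for the example, as are the `small_*` names); the column names are the ones `cv_results_` provides:

```python
import pandas as pd
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Synthetic stand-in data (an assumption, not the competition set)
X_small, y_small = datasets.make_classification(
    n_samples=300, n_features=8, random_state=42
)

small_pipe = pipeline.Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('clf', linear_model.LogisticRegression()),
])
small_search = model_selection.GridSearchCV(
    small_pipe,
    {
        'scaler': [preprocessing.StandardScaler(), preprocessing.MinMaxScaler()],
        'clf__C': [0.1, 1.0],
    },
    cv=3,
    scoring='roc_auc',
)
small_search.fit(X_small, y_small)

# Keep the columns that matter for picking a lead, best candidate first
small_results = pd.DataFrame(small_search.cv_results_)
summary = (
    small_results[
        ['params', 'mean_test_score', 'std_test_score', 'mean_fit_time', 'rank_test_score']
    ]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(summary)
```

Looking at `mean_fit_time` next to the score is handy at this stage: a model that is slightly worse but much faster can be the better lead to iterate on.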

And you can easily retrieve the best estimator it found:

In [None]:
chosen = gridsearch.best_estimator_
chosen, chosen.get_params()

And that's how I ended up spending more energy on neural networks (`MLPClassifier` is a simple NN) than on gradient boosters this time around.

In [None]:
%time chosen.fit(X, y)

In [None]:
df_test = pd.read_csv(
    '../input/tabular-playground-series-nov-2021/test.csv', dtype=np.float32
).astype({'id': np.int32})
X_test = df_test.drop(columns=['id'])
df_test[['id']].assign(
    target=chosen.predict_proba(X_test)[:, 1]
).to_csv('submission.csv', index=False)