# What does this notebook try to do?
I use the well-known Titanic dataset, where the task is to predict which passengers survived the disaster, to illustrate hyperparameter tuning. What I discuss:
* basics of how [**LightGBM**](https://lightgbm.readthedocs.io/en/latest/) and other tree-boosting approaches (e.g. [xgboost](https://xgboost.readthedocs.io/en/latest/)) work
* hyperparameter tuning using the [**optuna** package](https://optuna.readthedocs.io/en/stable/)

Why do I primarily use LightGBM? Mostly because of its popularity on Kaggle, which it owes largely to its speed: faster iteration allows more extensive hyperparameter tuning. This matters little on a small dataset such as the Titanic data, but it becomes important on larger data.

What I **do not** try to do
* extensive feature engineering/generating very insightful features, but rather hyperparameter tuning given a set of features
* stacking or blending multiple models to improve what a single model can achieve
* achieving a really good score on the leaderboard (see the two points above)
* explaining model predictions using [**SHAP**](https://github.com/slundberg/shap) (SHapley Additive exPlanations)

I avoid these things because my primary aim is clear code and clear explanations of hyperparameter tuning, without complex code and lengthy feature engineering getting in the way.

# Load the data
First, we load the data and derive some basic features.

Note that we did not necessarily have to impute missing data, e.g. for passenger age, because LightGBM can handle missing values automatically by assigning them a split direction at each split (as xgboost does). However, it is plausible that the imputation I used does better, because a missing value may not have the same implication across passenger classes, ages and genders.
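To make the split-direction idea concrete, here is a minimal sketch. It is a toy illustration and not LightGBM's actual implementation (which works with gradient statistics rather than a plain squared error): for one candidate threshold it tries sending the rows with a missing value to the left and to the right, and keeps the direction with the lower loss. The function name `best_missing_direction` and the toy data are made up for this example.

```python
import math

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_missing_direction(x, y, threshold):
    """Return 'left' or 'right': where rows with missing x should go at this split."""
    left = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if not math.isnan(xi) and xi > threshold]
    missing = [yi for xi, yi in zip(x, y) if math.isnan(xi)]
    # Evaluate the split twice, once per possible home for the missing rows.
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"

# Toy data: young passengers mostly survive, older ones mostly do not,
# and the passengers with unknown age look like the older group.
age = [4.0, 8.0, 12.0, 30.0, 45.0, 60.0, float("nan"), float("nan")]
survived = [1, 1, 1, 0, 0, 1, 0, 0]
direction = best_missing_direction(age, survived, threshold=16.0)
print(direction)
```

In this toy case the missing ages are sent to the "older" side of the split, because that keeps both child nodes more homogeneous.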

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import RepeatedKFold

train = pd.read_csv('../input/titanic/train.csv')
train['train'] = True
test = pd.read_csv('../input/titanic/test.csv')
test['train'] = False

alldata = pd.concat([train, test]) # DataFrame.append was removed in pandas 2.0

# Features to be created (for some of the ideas see: https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial/comments)
alldata['Embarked'] = alldata['Embarked'].map({'S':2, 'C':1, 'Q': 0, np.nan: 2}).astype(np.int8) # Missing values assigned to majority class (Southampton)
alldata['FamilyMembers'] = alldata['SibSp'] + alldata['Parch']
alldata['AdjFare'] = alldata.groupby('Ticket')['Fare'].transform(lambda x: x/len(x))
alldata['AdjFare'] = alldata.groupby(['Pclass', 'FamilyMembers'])['AdjFare'].transform(lambda x: x.fillna(x.median()))
alldata['AdjFareV2'] = pd.qcut(alldata['AdjFare'], 13, labels=False) # integer bin codes instead of Interval objects, so the feature matrix stays numeric
alldata['Female'] = alldata['Sex'].map({'male': 0, 'female': 1}).astype(np.int8)
alldata['Pclass_cat'] = (alldata['Pclass']-1).astype(np.int8)
alldata['Male3rd'] = (alldata['Sex'].map({'male': 1, 'female': 0}) * alldata['Pclass'].map({3:1, 1:0, 2:0})).astype(np.int8)
# Looking at titles of people, I suspect those with missing age are mostly adult
alldata['Adult'] = ((alldata['Age']>16) | alldata['Age'].isna()).astype(np.int8)
alldata['MissingAge'] = alldata['Age'].isna().astype(np.int8) # computed on all data, before age is imputed below
alldata['NonAdult1st2nd'] = ((1 - alldata['Adult']) * alldata['Pclass'].map({3:0, 1:1, 2:1})).astype(np.int8)
alldata['Female1st2nd'] = (alldata['Female'] * alldata['Pclass'].map({3:0, 1:1, 2:1})).astype(np.int8)
alldata['Ticket_Frequency'] = alldata.groupby('Ticket')['Ticket'].transform('count')

alldata['Title'] = alldata['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
alldata['Is_Married_Woman'] = (alldata['Title'] == 'Mrs').astype(np.int8) # avoids chained assignment
alldata['Title'] = alldata['Title'].replace(['Miss', 'Mrs','Ms', 'Mlle', 'Lady', 'Mme', 'the Countess', 'Dona'], 'Miss/Mrs/Ms')
alldata['Title'] = alldata['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Dr/Military/Noble/Clergy')
alldata['Title2'] = alldata['Title'].map({'Mr':0, 'Dr/Military/Noble/Clergy':1, 'Master':2, 'Miss/Mrs/Ms':3})

alldata['Deck'] = alldata['Cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M')
alldata['Deck'] = alldata['Deck'].map({'A':0, 'T':0, 'B':1, 'C':2, 'D':3, 'E':4, 'F':5, 'G':5, 'M': 6})

# Imputation of age
alldata['Age'] = alldata.groupby(['Pclass', 'Female'])['Age'].transform(lambda x: x.fillna(x.median()))

# Family member by age interaction
alldata['FamilyAge'] = alldata['FamilyMembers'] + alldata['Age']/60
# Taking a guess as to which passengers are parents
alldata['father'] = 1 * (alldata['Age']>=18) * (alldata['Parch']>0) * (alldata['Sex']=='male')
alldata['mother'] = 1 * (alldata['Age']>=18) * (alldata['Parch']>0) * (alldata['Sex']=='female')
alldata['parent'] = alldata['father'] + alldata['mother']


alldata['title_type2'] = [ any([title in Name for title in ['Capt.', 'Col.', 'Major.', 'Rev.']]) for Name in alldata['Name']]
alldata['title_type1'] = [ any([title in Name for title in ['Master.', 'Mme.', 'Dona.', 'Countess.', 'Lady.', 'Miss.', 'Mlle.']]) for Name in alldata['Name']]

alldata['title_type'] = alldata['title_type1']*1 + alldata['title_type2']*2

alldata['AgeGroup'] = 1 * (alldata['Age']<=2) + 1 * (alldata['Age']<=6) + 1 * (alldata['Age']<=17) + 1 * (alldata['Age']<=60)



In [None]:
alldata['Title'].value_counts()

In [None]:
# Lists of features to be used later
continuous_features = ['Pclass', 'Age', 'FamilyMembers', 'AdjFareV2', 'Ticket_Frequency', 'FamilyAge']
discrete_features = ['Female', 'Male3rd', 'Embarked', 'Adult', 'Title2', 'Is_Married_Woman', 'AgeGroup', 
                     'NonAdult1st2nd', 'Female1st2nd', 'parent', 'Deck']
ids_of_categorical = [0,1,2,3,4,5,6,7,8,9,10]

In [None]:
# Split all data into training and test data
# .copy() so that adding columns later does not write into a view of alldata
train = alldata[alldata['train']==True].copy()
test = alldata[alldata['train']==False].copy()

This is what the data then look like:

In [None]:
pd.options.display.max_colwidth = 250
pd.options.display.max_columns = 50
train.head(20)

# Hyperparameter optimization with cross-validation
Previously, we used an absurdly simple model for illustration purposes. Now let's take this seriously and create a single LightGBM model of appropriate complexity. For that, we turn to hyperparameter optimization using the **optuna** package.

**optuna** comes with a generic ability to tune hyperparameters for any machine learning algorithm, but specifically for LightGBM there is an integration via the **LightGBMTunerCV** class. It implements a stepwise tuning strategy that is known to be sensible for LightGBM, tuning the following parameters in order:
* feature_fraction
* num_leaves
* bagging_fraction and bagging_freq
* feature_fraction (again)
* regularization factors (i.e. 'lambda_l1' and 'lambda_l2')
* min_child_samples
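
The "one group of parameters at a time" idea behind this list can be sketched as follows. This is a toy illustration under made-up assumptions, not the code of **LightGBMTunerCV**: `toy_cv_error` is a hypothetical stand-in for a cross-validated error, and the candidate values in `search_grid` are invented. Each parameter is tuned in turn while the others stay fixed at the best values found so far.

```python
def toy_cv_error(params):
    """Hypothetical stand-in for a cross-validated error; lower is better."""
    return (
        (params["feature_fraction"] - 0.6) ** 2
        + (params["num_leaves"] - 31) ** 2 / 1000.0
        + (params["bagging_fraction"] - 0.8) ** 2
    )

# Parameters are tuned one at a time, in this order (dicts keep insertion order).
search_grid = {
    "feature_fraction": [0.4, 0.6, 0.8, 1.0],
    "num_leaves": [7, 15, 31, 63, 127],
    "bagging_fraction": [0.6, 0.8, 1.0],
}

# Starting values before any tuning has happened.
best = {"feature_fraction": 1.0, "num_leaves": 127, "bagging_fraction": 1.0}

for name, candidates in search_grid.items():
    scores = {}
    for value in candidates:
        trial_params = dict(best, **{name: value})  # change only the current parameter
        scores[value] = toy_cv_error(trial_params)
    best[name] = min(scores, key=scores.get)  # freeze the winner before moving on

print(best)
```

Such a stepwise search needs far fewer evaluations than a joint search over all combinations, at the price of possibly missing interactions between parameters. That is one reason to follow it with a broader search.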

We can either rely entirely on this tuner, or run some further hyperparameter search afterwards. Here we will do the latter. Arguably, you could even skip the automatic LightGBM tuner if you have a large enough time budget, while it is particularly attractive when time is short.

How do we tune? In each training run we actually optimize the log-loss, but we pick the hyperparameters that optimize the metric of interest. For the Titanic dataset on Kaggle that is accuracy. You might argue that accuracy is not such a great metric, because it is not a proper scoring rule and only "cares" whether a prediction is correct (survival predicted and the passenger survived, or non-survival predicted and the passenger did not survive) or wrong (non-survival predicted and the passenger survived, or survival predicted and the passenger did not survive). That leads to a rather sparse signal: for a passenger who survived, a predicted probability of 0.51 is just as good as 0.999, as long as it is above 0.5 or whatever other threshold you use to predict survival. In any case, we will maximize accuracy, or in fact minimize binary error (= 1 - accuracy), which is completely equivalent.
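
A small sketch of this point, with made-up predictions for four passengers: both sets of predicted probabilities get every passenger right, so accuracy cannot tell them apart, while the log-loss clearly prefers the confident one. The helper functions are written out by hand here purely for illustration.

```python
import math

def accuracy(y_true, p_pred, threshold=0.5):
    """Share of cases where the thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

def log_loss(y_true, p_pred):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(y_true, p_pred))
    return -total / len(y_true)

y_true = [1, 1, 0, 0]
hesitant = [0.51, 0.51, 0.49, 0.49]       # barely on the right side of 0.5
confident = [0.999, 0.999, 0.001, 0.001]  # very sure, and right

print(accuracy(y_true, hesitant), accuracy(y_true, confident))
print(log_loss(y_true, hesitant), log_loss(y_true, confident))
```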

For a full list of parameters we can tweak and objective functions/metrics, see the [LightGBM documentation](https://lightgbm.readthedocs.io/en/latest/Parameters.html). Note that you should typically use the main name for a metric, e.g. 'l1' and not an alias such as 'MAE', because optuna and LightGBM seem to interact in unfortunate ways if you don't, at least in optuna version 2.2.0 with LightGBM version 2.3.1 (used in this notebook).

Since this is a pretty small dataset, we can afford to operate with really low learning rates in LightGBM, which then require more trees, but tend to increase performance.
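
As a sketch of what such a schedule looks like, here is the piecewise-constant list of learning rates that this notebook passes to `lgb.reset_parameter`, built without LightGBM. As far as I know, a list-valued schedule must have exactly one entry per boosting round. The helper `total_shrinkage` is made up for this example; the sum of learning rates is only a rough heuristic for how far the model can have moved after a given number of trees.

```python
num_boost_round = 10000

# Piecewise-constant schedule: 200 rounds at 0.005, then 0.001 for the rest.
learning_rates = [0.005] * 200 + [0.001] * (num_boost_round - 200)

# One entry per boosting round.
assert len(learning_rates) == num_boost_round

def total_shrinkage(n_rounds):
    """Sum of the learning rates over the first n_rounds trees."""
    return sum(learning_rates[:n_rounds])

# Roughly 1.0 after 200 trees and 2.0 after 1200 trees (up to float rounding).
print(total_shrinkage(200))
print(total_shrinkage(1200))
```

By the same heuristic, 10 trees at a learning rate of 0.1 add up to the same total as the first 200 trees here, which is why low learning rates need many more rounds.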

We will use repeated 7-fold cross-validation (3 repeats) with random splits, but depending on how the split into training and test data was done you might want to do something different. This approach is most appropriate if the training-test split was simply done at random (like this cross-validation scheme).
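
To show what a repeated K-fold scheme produces, here is a hand-rolled sketch for the 891 rows of the Titanic training data, using the same settings as the `RepeatedKFold(n_splits=7, n_repeats=3)` object in the code cell that follows. It is not scikit-learn's implementation (the fold assignment differs in detail), and the function name `repeated_kfold` is made up.

```python
import random

def repeated_kfold(n_samples, n_splits, n_repeats, seed=42):
    """Yield (train_idx, valid_idx) pairs; each repeat reshuffles the data."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_splits):
            valid = indices[fold::n_splits]  # every n_splits-th shuffled index
            valid_set = set(valid)
            train = [i for i in indices if i not in valid_set]
            yield train, valid

splits = list(repeated_kfold(n_samples=891, n_splits=7, n_repeats=3))

# Number of train/validation splits and the validation fold sizes that occur.
print(len(splits))
print(sorted({len(valid) for _, valid in splits}))
```

Every passenger ends up in a validation fold exactly once per repeat, so each hyperparameter setting is scored on 21 splits in total, which makes the mean and the spread of the fold errors more stable than a single 7-fold run.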

In [None]:
import optuna.integration.lightgbm as lgb
import optuna

rkf = RepeatedKFold(n_splits=7, n_repeats=3, random_state=42)

params = {
        "objective": "binary",
        "metric": "binary_error",
        "verbosity": -1,
        "boosting_type": "gbdt",                
        "seed": 42
    }

X = np.array( train[discrete_features + continuous_features] )    
y = np.array( train['Survived'] ).flatten()

study_tuner = optuna.create_study(direction='minimize')
dtrain = lgb.Dataset(X, label=y)

# Suppress information only outputs - otherwise optuna is 
# quite verbose, which can be nice, but takes up a lot of space
optuna.logging.set_verbosity(optuna.logging.WARNING) 

# Run optuna LightGBMTunerCV tuning of LightGBM with cross-validation
tuner = lgb.LightGBMTunerCV(params, 
                            dtrain, 
                            categorical_feature=ids_of_categorical,
                            study=study_tuner,
                            verbose_eval=False,                            
                            early_stopping_rounds=250,
                            time_budget=19800, # Time budget of 5.5 hours (19800 seconds), we will not really need it
                            seed = 42,
                            folds=rkf,
                            num_boost_round=10000,
                            callbacks=[lgb.reset_parameter(learning_rate = [0.005]*200 + [0.001]*9800) ]
                           )

tuner.run()

Here are our interim results after the automatic tuner.

In [None]:
print(tuner.best_params)
# Classification error
print(tuner.best_score)
# Or expressed as accuracy
print(1.0-tuner.best_score)

In [None]:
# Set-up a temporary set of best parameters that we will use as a starting point below.
# Note that optuna will complain about values on the edge of the search space, so we move 
# such values a tiny little bit inside the search space.
tmp_best_params = tuner.best_params
if tmp_best_params['feature_fraction']==1:
    tmp_best_params['feature_fraction']=1.0-1e-9
if tmp_best_params['feature_fraction']==0:
    tmp_best_params['feature_fraction']=1e-9
if tmp_best_params['bagging_fraction']==1:
    tmp_best_params['bagging_fraction']=1.0-1e-9
if tmp_best_params['bagging_fraction']==0:
    tmp_best_params['bagging_fraction']=1e-9  

Now let's search more broadly, with the results of the automatic tuner as a starting point. Annoyingly, optuna does not have an automatic way of continuing the study we started. So we re-run the best parameters we found via the **enqueue_trial** method, which lets us force the search to look at our best guesses first (in this case just one, based on the tuner we used above).

Note that we could optimize the mean binary error, but here I choose to minimize the mean binary error plus the standard deviation of the binary errors across CV folds. This aims to favor hyperparameter settings that produce consistently good results.
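
A small sketch of this criterion with made-up fold errors: both settings have the same mean error, but the penalized score prefers the one whose folds agree with each other. I use the population standard deviation here, which I believe matches what `lgb.cv` reports as `stdv`, but treat that as an assumption; the name `penalized_score` is made up.

```python
import statistics

def penalized_score(fold_errors):
    """Mean error plus one (population) standard deviation across folds."""
    return statistics.mean(fold_errors) + statistics.pstdev(fold_errors)

steady = [0.17, 0.18, 0.17, 0.18, 0.17]   # similar error in every fold
erratic = [0.10, 0.25, 0.12, 0.24, 0.16]  # same mean, much larger spread

print(statistics.mean(steady), statistics.mean(erratic))
print(penalized_score(steady), penalized_score(erratic))
```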

How will optuna now select values? We use the TPE (Tree-structured Parzen Estimator) algorithm (this is the default setting, which we could explicitly select by passing **sampler=optuna.samplers.TPESampler()** to **optuna.create_study()**). For more on TPE, you can read this [NeurIPS paper](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf) or e.g. this more accessible [blog post](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f). At each step, TPE constructs an approximate surrogate model of how the performance of LightGBM depends on the hyperparameters, based on the performance obtained for the values tried so far (note: technically, this is rewritten using Bayes' rule, but let's skip those details). It then picks new hyperparameter values to test based on this model, choosing those expected to improve our metric of interest the most.
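
The following toy sketch shows the core TPE idea for a single hyperparameter in [0, 1]. It is a heavily simplified illustration, not optuna's implementation: the past trials are split into a "good" and a "bad" group, a kernel density is placed on each, candidates are drawn from the "good" density, and the candidate with the highest ratio of good to bad density is suggested. All names (`kde`, `tpe_suggest`), the fixed bandwidth and the toy loss are assumptions made for this example.

```python
import math
import random

def kde(x, centers, bandwidth):
    """Density at x of an equal-weight mixture of Gaussians placed on `centers`."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers) / (norm * len(centers))

def tpe_suggest(history, gamma=0.25, n_candidates=50, bandwidth=0.1, seed=0):
    """Suggest the next value in [0, 1] from (value, loss) pairs; lower loss is better."""
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]  # best fraction of trials, modelled by l(x)
    bad = [x for x, _ in ranked[n_good:]]   # remaining trials, modelled by g(x)
    # Draw candidates from l(x): pick a good value and add Gaussian noise.
    candidates = [min(1.0, max(0.0, rng.choice(good) + rng.gauss(0.0, bandwidth)))
                  for _ in range(n_candidates)]
    # Keep the candidate where the "good" density is largest relative to the "bad" one.
    return max(candidates, key=lambda c: kde(c, good, bandwidth) / (kde(c, bad, bandwidth) + 1e-12))

# Toy problem: one hyperparameter whose loss is smallest at 0.8, already tried on a grid.
history = [(i / 20, (i / 20 - 0.8) ** 2) for i in range(21)]
suggestion = tpe_suggest(history)
print(suggestion)
```

The suggestion lands in the region around 0.8 where the good trials cluster. The real sampler handles several hyperparameters, different distributions (log-uniform, integer) and adaptive bandwidths.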

In [None]:
import lightgbm as lgb
dtrain = lgb.Dataset(X, label=y)

# We will track how many training rounds we needed for our best score.
# We will use that number of rounds later.
best_score = 999
training_rounds = 10000

# Declare how we evaluate how good a set of hyperparameters are, i.e.
# declare an objective function.
def objective(trial):
    # Specify a search space using distributions across plausible values of hyperparameters.
    param = {
        "objective": "binary",
        "metric": "binary_error",
        "verbosity": -1,
        "boosting_type": "gbdt",                
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 2, 512),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.1, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 0, 15),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 100),
        'seed': 1979
    }
    
    # Run LightGBM for the hyperparameter values
    lgbcv = lgb.cv(param,
                   dtrain,
                   categorical_feature=ids_of_categorical,
                   folds=rkf,
                   verbose_eval=False,                   
                   early_stopping_rounds=250,                   
                   num_boost_round=10000,                    
                   callbacks=[lgb.reset_parameter(learning_rate = [0.005]*200 + [0.001]*9800) ]
                  )
    
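    cv_score = lgbcv['binary_error-mean'][-1] + lgbcv['binary_error-stdv'][-1]

    # Update the module-level trackers; without the global statement the
    # assignments below would only create local variables.
    global best_score, training_rounds
    if cv_score<best_score:
        best_score = cv_score
        training_rounds = len( list(lgbcv.values())[0] )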
    
    # Return metric of interest
    return cv_score

# Suppress information only outputs - otherwise optuna is 
# quite verbose, which can be nice, but takes up a lot of space
optuna.logging.set_verbosity(optuna.logging.WARNING) 

# We search for another 90 min (5400 seconds).
# We could instead use e.g. n_trials=1000 to try 1000 hyperparameter settings chosen
# by optuna, or set neither timeout nor n_trials so that we keep going until
# the user interrupts ("Cancel run").
study = optuna.create_study(direction='minimize')  
study.enqueue_trial(tmp_best_params)
study.optimize(objective, timeout=5400) 


# Visualizing the hyperparameter optimization

**optuna** also provides some nice visualizations for your optimization study. I show a few here, but the **optuna** [documentation](https://optuna.readthedocs.io/en/v1.0.0/reference/visualization.html) has the full list with examples of what each plot looks like. In practice, we might optimize for a bit and then look at these plots to narrow down the search space, for example to bagging fractions between 0.75 and 0.95, lower feature fractions, lower values of the minimum number of child samples, and a number of leaves between 100 and 300. We can iterate that process to potentially arrive at increasingly better solutions that may outperform just running **optuna** longer.

As we can see, the gains from more optimizations plateau somewhat after a number of trials, which does not mean that we would not try to squeeze for the very last few decimals on Kaggle, but perhaps in a practical application there is not much point in going beyond, say, a few hundred or thousand trials with different hyperparameters.

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
optuna.visualization.plot_slice(study)

In [None]:
optuna.visualization.plot_param_importances(study)

# Results of hyperparameter optimization

The tuning strategy in the previous section gets us to these hyperparameters for LightGBM:

In [None]:
print(study.best_params)

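The achieved CV score (mean binary error plus one standard deviation across folds) is

In [None]:
# Objective value: mean classification error + standard deviation across folds
print(study.best_value)
# 1 minus that value, i.e. a pessimistic version of the mean CV accuracy
print(1.0-study.best_value)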

Now create a dictionary with the best parameter values so that we can use those in training a final model below.

In [None]:
best_params = {
    "objective": "binary",
    "metric": "binary_error",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "seed": 42} 
best_params.update(study.best_params)
best_params


# Train a tuned model that we will submit to the leaderboard

Now let's actually train it.

In [None]:
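# Reuse the learning-rate schedule from cross-validation, truncated to the number of
# rounds; otherwise LightGBM's default learning rate would be used for the final fit.
lr_schedule = ([0.005]*200 + [0.001]*9800)[:training_rounds]

lgbfit = lgb.train(best_params,
                   dtrain,
                   categorical_feature=ids_of_categorical,
                   verbose_eval=False,
                   num_boost_round=training_rounds,
                   callbacks=[lgb.reset_parameter(learning_rate=lr_schedule)])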

# Create test set predictions

In [None]:
X_test = np.array( test[discrete_features + continuous_features] )    

Now we take the predicted probabilities from LightGBM and turn them into class predictions (below 0.5 we predict 0, at or above 0.5 we predict 1):

In [None]:
# An explicit threshold is used because np.round rounds exactly 0.5 down to 0
test['Survived'] = (lgbfit.predict(X_test) >= 0.5).astype(np.int8)
submission = test[['PassengerId', 'Survived']]

So, our predictions now look like this:

In [None]:
submission

Now we save a *submission.csv* file, but without the index column - this is the expected format, as illustrated in the example submission file (*gender_submission.csv*).

In [None]:
submission.to_csv('submission.csv', index=False)