# Tabular Playground May 2022 - LightGBM

[LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a popular alternative to 
[XGBoost](https://xgboost.readthedocs.io/en/latest/) and 
[CatBoost](https://catboost.ai/en/docs/),
and it performs surprisingly well here, scoring `0.97134` in regression mode.


# Dataset

In [None]:
import numpy  as np 
import pandas as pd 
import re
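import sklearn
import sklearn.metrics          # submodules are not guaranteed to load with a bare `import sklearn`
import sklearn.model_selection
import scipy
import scipy.stats              # needed for scipy.stats.describe() below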
import lightgbm
import catboost
import xgboost

pd.options.display.max_columns = 999
pd.options.display.max_rows    = 6

In [None]:
%%time
category_dtype = 'category'  # NOTE: catboost doesn't like the category dtype
col_dtypes = {
  "f_00": "float32",
  "f_01": "float32",
  "f_02": "float32",
  "f_03": "float32",
  "f_04": "float32",
  "f_05": "float32",
  "f_06": "float32",
  "f_07": "int32",
  "f_08": "int32",
  "f_09": "int32",
  "f_10": "int32",
  "f_11": "int32",
  "f_12": "int32",
  "f_13": "int32",
  "f_14": "int32",
  "f_15": "int32",
  "f_16": "int32",
  "f_17": "int32",
  "f_18": "int32",
  "f_19": "float32",
  "f_20": "float32",
  "f_21": "float32",
  "f_22": "float32",
  "f_23": "float32",
  "f_24": "float32",
  "f_25": "float32",
  "f_26": "float32",
  "f_27": category_dtype,
  "f_28": "float32",
  "f_29": "int32",
  "f_30": "int32",
  "target": "int32",
}
def preprocess_df(df):
    # Split the f_27 string into one categorical column per character
    df[['f_27_0','f_27_1','f_27_2','f_27_3','f_27_4','f_27_5','f_27_6','f_27_7','f_27_8','f_27_9','f_27_10','f_27_00']] \
        = df['f_27'].str.split('',expand=True).astype(category_dtype)
    del df['f_27']     # very high cardinality | BUGFIX: LightGBMError: bin size 672 cannot run on GPU
    del df['f_27_0']   # str.split('') adds empty columns on either side
    del df['f_27_00']  # str.split('') adds empty columns on either side
    return df

def fix_missing_columns(train_df, test_df):
    # Find all columns present in one dataframe, but not in the other
    missing_cols = (set(train_df.columns) - set(test_df.columns))  \
                 | (set(test_df.columns)  - set(train_df.columns)) 
    missing_cols -= set(["target"])
    for col in missing_cols:
        train_df[col] = train_df.get(col,0)  # add zeros column if missing
        test_df[col]  = test_df.get(col,0)   # add zeros column if missing
        
    assert set(train_df.columns) - set(test_df.columns) == set(["target"])
    return train_df, test_df

train_df = pd.read_csv('../input/tabular-playground-series-may-2022/train.csv', index_col='id', dtype=col_dtypes)
test_df  = pd.read_csv('../input/tabular-playground-series-may-2022/test.csv',  index_col='id', dtype=col_dtypes)
train_df = preprocess_df(train_df)
test_df  = preprocess_df(test_df)
fix_missing_columns(train_df, test_df)

columns = test_df.columns
X       = train_df[columns]
Y       = train_df['target']
X_train, X_valid, Y_train, Y_valid = sklearn.model_selection.train_test_split(X, Y, test_size=0.2, random_state=42)
X_test  = test_df[columns]

display('train_df')
train_df.info(verbose=True, memory_usage="deep")  # info() prints directly and returns None
display( train_df )
display('test_df')
# display( test_df.info(verbose=True, memory_usage="deep") )
display( test_df )
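
The `f_27` handling in `preprocess_df` is easy to misread because of the two empty columns that `str.split('')` produces. As a minimal sketch, assuming every `f_27` value is a fixed-length string of 10 characters (inferred from the column names above, not checked against the data), the same per-character columns could be built by indexing each position directly. The `split_f27` helper and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def split_f27(df, width=10):
    """Return a copy of df with f_27 replaced by one categorical column per character."""
    out = df.copy()
    for i in range(width):
        # .str[i] picks the i-th character of every string in the column
        out[f"f_27_{i + 1}"] = out["f_27"].str[i].astype("category")
    return out.drop(columns=["f_27"])

# Made-up rows, for illustration only
toy_df = pd.DataFrame({"f_00": [0.1, 0.2], "f_27": ["ABCDEFGHIJ", "BBAACCDDEE"]})
split_df = split_f27(toy_df)
print(split_df.columns.tolist())
```

This version needs no cleanup of empty edge columns, at the cost of hard-coding the string width.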

# LightGBM

In [None]:
%%time
import warnings
warnings.filterwarnings("ignore")

best_rmse     = 9999999999
best_params   = {}
best_lightgbm = None

def train_lightgbm(parameters, default_params):    
    # global best_rmse, best_params, best_lightgbm
    # DOCS: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html
    model = lightgbm.train(
        {
            **default_params,
            **parameters,
        },
        train_set  = lightgbm.Dataset(X_train, label=Y_train),
        valid_sets = lightgbm.Dataset(X_valid, label=Y_valid),
        num_boost_round       = 5000,
        early_stopping_rounds = 100,
        verbose_eval          = False,
    )
    # np.sqrt(MSE) avoids the `squared=False` argument, which newer scikit-learn releases no longer accept
    rmse = np.sqrt(sklearn.metrics.mean_squared_error(Y_valid, model.predict(X_valid)))
    
    print(f'rmse: {rmse:.5f} | parameters: {parameters}')
    return rmse, model
    
    
# NOTE: Reusing hyperparameters from TPS Jan 2021
for seed in [42]:
    # for boosting in ['gbdt', 'goss', 'dart']:                     # 
    # for max_depth in [1,2,4,6,8,10,12,16,32,64,-1]:               # 
    # for tree_learner in ['serial', 'feature', 'data', 'voting']:  # was: no effect
    # for extra_trees in [True, False]:                             # was: no effect
    # for learning_rate in [0.001, 0.01, 0.1, 0.5, 0.9]:            # 
    # for max_bin in [64,128,256]: # ,512,1024,2048]:               # gpu max_bin = 255
    # for num_leaves in [32, 64, 128, 256, 512, 1024, 2048, 4096]:  # 

    # DOCS: https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst
    # DOCS: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
    default_params = {
        'device':         'gpu',   
        'boosting_type':  'gbdt',  # default
        'objective':      'regression',
        'metric':         'rmse',
        'learning_rate':   0.1,                     
        'max_depth':       16,
        'max_bin':         256-1,  # gpu max_bin = 255
        'num_leaves':      64-1,
        'seed':            42,
        'verbose':         -1,
    }
    parameters = {
        # 'boosting_type':   boosting,
        # 'max_depth':       max_depth, 
        # 'tree_learner':    tree_learner,
        # 'extra_trees':     extra_trees,
        # 'learning_rate':   learning_rate,
        # 'max_bin':         max_bin-1,
        # 'num_leaves':      num_leaves-1,
    }
    rmse, model = train_lightgbm(parameters, default_params)

    if rmse < best_rmse:
        best_rmse     = rmse
        best_params   = parameters
        best_lightgbm = model

print()
print(f'BEST rmse: {best_rmse:.5f} | parameters: {best_params} | model: {best_lightgbm}')

In [None]:
%%time
prediction_X_train = best_lightgbm.predict(X_train)
prediction_X_valid = best_lightgbm.predict(X_valid)
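
The two prediction arrays above are computed but never compared. One way to use them is to score each against its labels and look at the gap, since a validation RMSE far above the training RMSE hints at overfitting. A minimal sketch with made-up numbers follows; the `rmse` helper and the `demo_*` arrays are hypothetical stand-ins for `Y_train`, `Y_valid`, and the predictions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up labels and predictions, for illustration only
demo_y_train    = np.array([0, 1, 1, 0])
demo_pred_train = np.array([0.1, 0.9, 0.8, 0.2])
demo_y_valid    = np.array([1, 0, 1, 0])
demo_pred_valid = np.array([0.6, 0.4, 0.7, 0.5])

rmse_train = rmse(demo_y_train, demo_pred_train)
rmse_valid = rmse(demo_y_valid, demo_pred_valid)
print(f"train rmse: {rmse_train:.5f} | valid rmse: {rmse_valid:.5f} | gap: {rmse_valid - rmse_train:.5f}")
```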

# Submission

In [None]:
X_test

In [None]:
%%time 
predictions = best_lightgbm.predict(X_test)
predictions

In [None]:
scipy.stats.describe(predictions)

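Because the model is trained with the `regression` objective, the LightGBM predictions come out as continuous values rather than integers. The cell below writes them to the submission file as they are, without rounding.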

In [None]:
submission_df = pd.read_csv('../input/tabular-playground-series-may-2022/sample_submission.csv', index_col='id')
submission_df['target'] = predictions
submission_df.to_csv('submission.csv')
!head submission.csv
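
If integer labels were wanted instead of continuous values, the predictions could be clipped and rounded before being written out. A minimal sketch, assuming the target is a binary 0/1 label (an assumption, since the notebook only declares it as `int32`) and using made-up values in place of `predictions`:

```python
import numpy as np

# Made-up regression outputs, for illustration only
raw_predictions = np.array([-0.12, 0.03, 0.48, 0.61, 1.07])

clipped = np.clip(raw_predictions, 0.0, 1.0)   # squash into the [0, 1] label range
labels  = np.rint(clipped).astype(int)         # round to the nearest integer label
print(labels)
```

Rounding throws away the ordering between predictions, so whether it helps depends on how the competition scores submissions.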

# Further Reading

This notebook is part of a series exploring Tabular Data:

[Titanic](https://www.kaggle.com/competitions/titanic)
- [Profilereport EDA](https://www.kaggle.com/code/jamesmcguigan/titanic-profilereport-eda)

[Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic)
- [Profilereport EDA](https://www.kaggle.com/code/jamesmcguigan/titanic-profilereport-eda)
- 0.69932 - [XGBoost](https://www.kaggle.com/code/jamesmcguigan/spaceship-titanic-xgboost)

[Tabular Playground - Jan 2021](https://www.kaggle.com/c/tabular-playground-series-jan-2021)
- 0.72746 / 0.72935 - [scikit-learn Ensemble](https://www.kaggle.com/jamesmcguigan/tabular-playground-scikit-learn-ensemble)
- 0.71552 / 0.71659 - [Fast.ai Tabular Solver](https://www.kaggle.com/jamesmcguigan/fast-ai-tabular-solver)
- 0.70317 / 0.70426 - [XGBoost](https://www.kaggle.com/jamesmcguigan/tabular-playground-xgboost)
- 0.70011 / 0.70181 - [LightGBM](https://www.kaggle.com/jamesmcguigan/tabular-playground-lightgbm)

[Tabular Playground - Feb 2021](https://www.kaggle.com/c/tabular-playground-series-feb-2021)
- 0.84452 - [PyCaret2 AutoML Regression](https://www.kaggle.com/jamesmcguigan/tps-pycaret2-automl-regression)

[Tabular Playground - May 2022](https://www.kaggle.com/c/tabular-playground-series-may-2022)
- 0.97134 - [LightGBM](https://www.kaggle.com/jamesmcguigan/tps-may-2022-lightgbm-regression)
- [LGBM + XGB + CB - Regression](https://www.kaggle.com/jamesmcguigan/tps-may-2022-lgbm-xgb-cb-regression)
- [LGBM + XGB + CB - Classification](https://www.kaggle.com/jamesmcguigan/tps-may-2022-lgbm-xgb-cb-classification)

If you found this notebook useful or learnt something new, then please upvote!