# **Introduction**

I previously made a simple KFold + LightGBM notebook for the Titanic competition.
This new notebook builds on what I learned from that one.

This time, I use sklearn's OneHotEncoder instead of pandas' get_dummies to encode the categorical variables.


The old notebook is here:
https://www.kaggle.com/code/sasakic/titanic-simple-lgbm-kfold


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.preprocessing as sp

import lightgbm as lgb

plt.style.use('ggplot')
import warnings
warnings.filterwarnings('ignore')

# **Checking data**

Most of the feature columns contain missing values, in both the train and test sets.

In [None]:
train = pd.read_csv('../input/spaceship-titanic/train.csv')
test = pd.read_csv('../input/spaceship-titanic/test.csv')

In [None]:
train.head()

In [None]:
train.dtypes

In [None]:
print(train.isnull().sum())

In [None]:
print(test.isnull().sum())

# **Preparing Features**

In [None]:
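# Default values used to fill missing entries in each column
fillvalues = {
    'HomePlanet':'',
    'CryoSleep':False,
    'Cabin':'',
    'Destination':'',
    'Age':0,
    'VIP':False,
    'RoomService':0,
    'FoodCourt':0,
    'ShoppingMall':0,
    'Spa':0,
    'VRDeck':0
}

def prepare_features(passengers):
    df = passengers.copy()
    df = df.drop(['PassengerId','Name'], axis=1)

    # Fill missing values first: fillna returns a new frame, and casting
    # NaN to bool would otherwise silently turn missing entries into True
    df = df.fillna(fillvalues)

    df['CryoSleep'] = df['CryoSleep'].astype('bool')
    df['VIP'] = df['VIP'].astype('bool')

    # The remaining object columns are treated as categorical
    objcols = df.select_dtypes(include=object).columns
    df[objcols] = df[objcols].astype('category')

    return df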
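
One detail worth checking in a function like `prepare_features` is the order of filling and casting. The toy example below uses a made-up Series `s` (not taken from the competition data) to suggest why: casting a column that still contains NaN straight to bool turns the missing entries into True, so filling should come first.

```python
import numpy as np
import pandas as pd

# Hypothetical column mimicking CryoSleep after read_csv: booleans plus a missing value
s = pd.Series([True, np.nan, False], dtype=object)

# Casting first: NaN is truthy, so the missing entry silently becomes True
cast_first = s.astype('bool').tolist()

# Filling first: the missing entry takes the intended default (False)
fill_first = s.fillna(False).astype('bool').tolist()

print(cast_first)
print(fill_first)
```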


I used scikit-learn's OneHotEncoder with handle_unknown='ignore' for one-hot encoding.
As a result, any category that appears in the test data but not in the training data is ignored at prediction time: it is encoded as all zeros instead of getting its own column.

Accuracy may be slightly lower, but I think this is the more practical approach,
because there is no guarantee that every category will appear in the training data.

In [None]:
train_ft = prepare_features(train)
train_ft.drop('Transported', axis=1, inplace=True)

catcols = train_ft.select_dtypes(include='category').columns
availcols = train_ft.select_dtypes(exclude='category').columns

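# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
enc = sp.OneHotEncoder(sparse_output=False, handle_unknown='ignore')

train_feats = pd.concat([train_ft[availcols],
                         pd.DataFrame(enc.fit_transform(train_ft[catcols]),
                                      columns=enc.get_feature_names_out(catcols),
                                      index=train_ft.index)],
                        axis=1)

# Reuse the fitted encoder so the test columns match the training columns
test_ft = prepare_features(test)
test_feats = pd.concat([test_ft[availcols],
                        pd.DataFrame(enc.transform(test_ft[catcols]),
                                     columns=enc.get_feature_names_out(catcols),
                                     index=test_ft.index)],
                       axis=1)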

# **Create Fold**

Parameters are not optimized.

In [None]:
cfg = {
    'TARGET' : 'Transported',
    'N_FOLDS' : 5,
    'RANDOM_STATE': 529,
    'N_ESTIMATORS' : 50_000,
    'LEARNING_RATE': 0.1
}

train_passes = train['PassengerId'].unique()

In [None]:
# Copy so that adding the fold column does not touch the original frame
train_fold = train[['PassengerId','Transported']].copy()
kf = KFold(n_splits=cfg['N_FOLDS'],
           shuffle=True,
           random_state=cfg['RANDOM_STATE'])

# Create Folds
fold = 1
for tr_idx, val_idx in kf.split(train_passes):
    fold_passes = train_passes[val_idx]
    train_fold.loc[train_fold['PassengerId'].isin(fold_passes), 'fold'] = fold
    fold += 1
train_fold['fold'] = train_fold['fold'].astype('int')
train_fold['fold'].value_counts()
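
The fold assignment above can be checked on a small example. The sketch below uses a made-up frame of ten IDs (hypothetical) and `enumerate` in place of the manual counter; each row should end up in exactly one validation fold.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy frame with ten unique IDs
toy = pd.DataFrame({'PassengerId': [f'P{i:02d}' for i in range(10)]})
ids = toy['PassengerId'].unique()

kf_toy = KFold(n_splits=5, shuffle=True, random_state=529)

# Label each row with the fold in which it serves as validation data
for fold_no, (_, val_idx) in enumerate(kf_toy.split(ids), start=1):
    toy.loc[toy['PassengerId'].isin(ids[val_idx]), 'fold'] = fold_no

toy['fold'] = toy['fold'].astype('int')
fold_sizes = toy['fold'].value_counts().sort_index()
print(fold_sizes)
```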

# **LGBM Classification**

In [None]:
FEATURES = train_feats.columns.values
TARGET = 'Transported'  # a plain column name gives a 1-D target
train_feats = pd.concat([train_feats, train_fold], axis=1)

submission_df = test[['PassengerId']].copy()

In [None]:
regs = []
fis = []

for fold in range(1, cfg['N_FOLDS'] + 1):
    print(f'===== Running for fold {fold} =====')
    # Split train / val
    X_tr = train_feats.query('fold != @fold')[FEATURES]
    y_tr = train_feats.query('fold != @fold')[TARGET]
    X_val = train_feats.query('fold == @fold')[FEATURES]
    y_val = train_feats.query('fold == @fold')[TARGET]
    print(X_tr.shape, y_tr.shape, X_val.shape, y_val.shape)

    reg = lgb.LGBMClassifier(n_estimators=cfg['N_ESTIMATORS'],
                            learning_rate=cfg['LEARNING_RATE'],
                            objective='binary',
                            metric=['binary_logloss'],
                            importance_type='gain'
                            #importance_type='split'
                           )
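    # LightGBM >= 4.0 takes early stopping and logging as callbacks
    reg.fit(X_tr, y_tr,
            eval_set=[(X_val, y_val)],
            callbacks=[lgb.early_stopping(stopping_rounds=500),
                       lgb.log_evaluation(period=200)],
           )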

    fold_preds = reg.predict(X_val,
                             num_iteration=reg.best_iteration_)
    train_fold.loc[train_fold['fold'] == fold, 'preds'] = fold_preds

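    # Score the current fold (MAE on 0/1 labels is the misclassification rate)
    fold_score = mean_absolute_error(
        train_fold.query('fold == @fold')['Transported'].astype(int),
        train_fold.query('fold == @fold')['preds'].astype(int)
    )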

    fi = pd.DataFrame(index=reg.feature_name_,
                 data=reg.feature_importances_,
                 columns=[f'{fold}_importance'])

    fold_test_pred = reg.predict(test_feats,
                num_iteration=reg.best_iteration_)
    submission_df[f'pred_{fold}'] = fold_test_pred.astype(bool)
    print(f'Score of this fold is {fold_score:0.6f}')
    regs.append(reg)
    fis.append(fi)

In [None]:
score = mean_absolute_error(train_fold['Transported'].astype(int),
                            train_fold['preds'].astype(int))
print(f'Out of fold score {score:0.6f}')
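
Because the score is a mean absolute error over True/False labels, lower is better, and it can be read as the share of misclassified passengers (one minus accuracy). A small worked example with made-up labels, shown only to illustrate the relationship:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical labels and predictions, cast from bool to 0/1
y_true = np.array([True, False, True, True, False]).astype(int)
y_pred = np.array([True, True, True, False, False]).astype(int)

mae = mean_absolute_error(y_true, y_pred)   # fraction of wrong predictions
acc = accuracy_score(y_true, y_pred)        # fraction of correct predictions

print(f'MAE {mae:0.2f}, accuracy {acc:0.2f}')
```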

# **Make Submission**

In [None]:
# Display only: PassengerId must stay a regular column for the CSV export
submission_df.set_index('PassengerId')

In [None]:
pred_cols = [c for c in submission_df.columns if c.startswith('pred_')]
# Majority vote across folds: mode() returns a DataFrame, so take its first column
submission_df['Transported'] = submission_df[pred_cols].mode(axis=1)[0].astype(bool)

submission_df[['PassengerId','Transported']].to_csv('submission.csv', index=False)
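
The final prediction is a majority vote over the per-fold predictions. Here is a minimal sketch of that vote on made-up data (the `votes` frame is hypothetical), with a mean-based check added for comparison. With an odd number of folds and boolean predictions there are no ties, so the row-wise mode has a single column.

```python
import pandas as pd

# Hypothetical per-fold predictions for three test passengers
votes = pd.DataFrame({
    'pred_1': [True, False, True],
    'pred_2': [True, False, False],
    'pred_3': [False, False, True],
})

# Row-wise mode: the most frequent value in each row (column 0 of the result)
by_mode = votes.mode(axis=1)[0].astype(bool)

# Equivalent check: more than half of the folds voted True
by_mean = votes.mean(axis=1) > 0.5

print(by_mode.tolist())
```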