# Spaceship Titanic - XGBoost

The go-to method for most tabular data problems is [XGBoost](https://xgboost.readthedocs.io/en/latest/), so let's see how it performs on this leaderboard.

# Dataset

In [None]:
import re
import numpy  as np 
import pandas as pd 
from pandas import Categorical
from xgboost import XGBRegressor, XGBClassifier
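import sklearn.metrics          # a bare `import sklearn` is not guaranteed to load submodules
import sklearn.model_selection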

pd.options.display.max_rows = 6

In [None]:
train_df = pd.read_csv('../input/spaceship-titanic/train.csv', index_col='PassengerId')
test_df  = pd.read_csv('../input/spaceship-titanic/test.csv',  index_col='PassengerId')

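# Capture train statistics once, from the raw data, before enhance() mutates train_df in place.
# Otherwise the test set would be filled and scaled with already-normalized train values (max == 1.0).
NUMERIC_COLS = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
TRAIN_MEAN   = train_df[NUMERIC_COLS].mean()
TRAIN_MAX    = train_df[NUMERIC_COLS].max()

def enhance(df):
    for col in ['HomePlanet', 'Cabin', 'Destination', 'Name']:
        df[col] = df[col].astype('category')
    for col in ['CryoSleep', 'VIP']:
        df[col] = df[col].fillna(False).astype(bool)
    for col in NUMERIC_COLS:
        # Normalizing ints improves score from 0.50783 -> 0.69932
        # FillNA(mean) -> FillNA(0) reduces score 0.69932 -> 0.65583
        # Log Normalize Ints has exact same score as linear normalization
        df[col] = df[col].fillna(TRAIN_MEAN[col])                    # Fill NA with the raw train mean
        df[col] = df[col] / TRAIN_MAX[col]                           # Normalize to range [0-1] on the train scale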
        # df[col] = np.log(df[col]+1) / np.log(train_df[col]+1).max()  # Log Normalize Ints
        # df[col] = df[col].astype(int)    
        
    # Splitting FirstName + Surname reduces score 0.69932 -> 0.50713
    # df['FirstName'] = df['Name'].str.split(' ', 1).str[0].astype('category')
    # df['LastName']  = df['Name'].str.split(' ', 1).str[-1].astype('category')
    # del df['Name']
    
    # Split Cabin info into Deck/Num/Side
    # Reduces score 0.69932 -> 0.51765 | don't understand, this should improve score
    # df['Cabin/Deck'] = df['Cabin'].str.split('/', 2).str[0].astype('category')
    # df['Cabin/Num']  = df['Cabin'].str.split('/', 2).str[1].astype('category')
    # df['Cabin/Side'] = df['Cabin'].str.split('/', 2).str[2].astype('category')    
    return df

train_df = enhance(train_df)
test_df  = enhance(test_df)

columns = test_df.columns
X       = train_df[columns]
Y       = train_df['Transported']
X_train, X_valid, Y_train, Y_valid = sklearn.model_selection.train_test_split(X, Y, test_size=0.01, random_state=42)
X_test  = test_df[columns]

display('train_df')
display( train_df )
# display('test_df')
# display( test_df )
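
The commented-out Cabin split above casts `Cabin/Num` to a category, which likely leaves a high-cardinality column much like the raw `Cabin` string. One possible variant, sketched here on a made-up three-row frame, keeps Deck and Side categorical but parses Num as a number. Whether this improves the leaderboard score is untested; the helper name `split_cabin` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def split_cabin(df):
    """Split 'Deck/Num/Side' cabin strings into three columns (hypothetical helper)."""
    # astype(object) lets this work whether Cabin is still raw text or already a category.
    # expand=True returns one column per '/'-separated part; missing cabins stay missing.
    parts = df['Cabin'].astype(object).str.split('/', expand=True)
    out = df.copy()
    out['Cabin/Deck'] = parts[0].astype('category')
    out['Cabin/Num']  = pd.to_numeric(parts[1], errors='coerce')   # numeric, not categorical
    out['Cabin/Side'] = parts[2].astype('category')
    return out.drop(columns=['Cabin'])

# Toy data in the dataset's Cabin format (values are illustrative only)
demo   = pd.DataFrame({'Cabin': ['B/0/P', 'F/12/S', None]})
result = split_cabin(demo)
print(result)
```

Treating the cabin number as numeric lets the trees split on ranges of cabins instead of memorizing individual ones.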

# XGBoost

To start with, let's just try out the XGBoost default settings.

Documentation
- https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
- https://xgboost.readthedocs.io/en/latest/parameter.html

In [None]:
xgb = XGBClassifier(
    n_jobs=-1,
    verbosity=0,
    random_state=42,
    tree_method="gpu_hist",  # XGBoost >= 2.0 deprecates this in favor of tree_method="hist", device="cuda"
    enable_categorical=True
)
xgb.fit(
    X_train, Y_train, 
    eval_set=[(X_valid, Y_valid)],
    verbose=False
)

In [None]:
# np.sqrt avoids the `squared=False` argument, which newer scikit-learn releases no longer accept
rmse = np.sqrt(sklearn.metrics.mean_squared_error(Y_valid, xgb.predict(X_valid)))
print(rmse)
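
The competition leaderboard is scored on classification accuracy, while RMSE is a regression metric. For 0/1 predictions the two are related: each row's squared error is 1 when the prediction is wrong and 0 otherwise, so RMSE is the square root of the error rate. A small sketch with made-up labels (not taken from the dataset) illustrates the relationship:

```python
import numpy as np

# Made-up labels and predictions, for illustration only
y_true = np.array([True, False, True,  True, False, False, True, False])
y_pred = np.array([True, False, False, True, False, True,  True, False])

accuracy = np.mean(y_true == y_pred)   # fraction of correctly classified rows
rmse     = np.sqrt(np.mean((y_true.astype(int) - y_pred.astype(int)) ** 2))

print(f"accuracy    = {accuracy:.3f}")
print(f"rmse        = {rmse:.3f}")
print(f"1 - rmse**2 = {1 - rmse**2:.3f}")   # equals accuracy for 0/1 labels
```

In the notebook itself, the same quantity could be obtained with `sklearn.metrics.accuracy_score(Y_valid, xgb.predict(X_valid))`, which is directly comparable to the leaderboard score.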

# Submission

In [None]:
predictions   = xgb.predict(X_test)

submission_df = pd.read_csv('../input/spaceship-titanic/sample_submission.csv', index_col='PassengerId')
submission_df['Transported'] = predictions.astype(bool)
submission_df.to_csv('submission.csv')
!head submission.csv

# Further Reading

This notebook is part of a series exploring the [Tabular Playground](https://www.kaggle.com/c/tabular-playground-series-jan-2021)
- 0.72935 - [scikit-learn Ensemble](https://www.kaggle.com/jamesmcguigan/tabular-playground-scikit-learn-ensemble)
- 0.71423 - [Fast.ai Tabular Solver](https://www.kaggle.com/jamesmcguigan/fast-ai-tabular-solver)
- 0.70426 - [XGBoost](https://www.kaggle.com/jamesmcguigan/tabular-playground-xgboost)