This is a relatively simple data imputation and processing notebook with an Optuna-tuned XGBoost model. It is the result of many earlier attempts at working with the data and trying different models. 

Since only about 2% of the data is missing in both the train and test files, extensive imputation work didn't add much to the score. Filling missing continuous values with the mean and missing categorical values with the mode worked best. 

In all the models I tried, a total expenses feature that sums all the individual expense columns had the largest impact on the predictions. None of the other data manipulations had nearly as much of an effect. 

In [1]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import xgboost as xgb
from sklearn.utils import shuffle

In [2]:
test = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')
train = pd.read_csv('train.csv')

Since both datasets have a similar structure and a similar share of missing data, it is easier to concatenate them for the imputation and processing work. 

In [3]:
train_test = pd.concat([train, test], ignore_index=True)
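
The missing-data rate quoted in the introduction can be checked with a per-column null fraction. The sketch below is only an illustration: `toy` is a made-up stand-in for `train_test`, and its values are not taken from the competition files.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_test; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Europa"],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
})

# Fraction of missing values per column, largest first.
missing_rate = toy.isna().mean().sort_values(ascending=False)
print(missing_rate)
```

Running the same two lines on `train_test` (excluding `Transported`, which is missing for every test row by construction) would give the actual rates.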

Creating a total expenses category as a sum of all the other expenses ended up having a strong impact on the various models I tried. 

In [4]:
Expenses_columns = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
train_test['Expenses'] = train_test.loc[:,Expenses_columns].sum(axis=1)
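
One detail worth knowing about the row-wise sum: pandas skips missing values by default, so a passenger whose expense columns are all missing still gets a total of 0. The sketch below shows this on made-up rows; `min_count=1` is an optional pandas argument that this notebook does not use, shown only for comparison.

```python
import numpy as np
import pandas as pd

expense_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Made-up rows: one with a single missing value, one with every value missing.
toy = pd.DataFrame(
    [
        [10.0, np.nan, 5.0, 0.0, 0.0],
        [np.nan] * 5,
    ],
    columns=expense_cols,
)

# Default behaviour: NaN is skipped, so the all-missing row sums to 0.
default_total = toy[expense_cols].sum(axis=1)

# With min_count=1 the total stays missing when no expense value is present.
strict_total = toy[expense_cols].sum(axis=1, min_count=1)

print(default_total.tolist())
print(strict_total.tolist())
```

This matters for the next step, because a total of 0 is treated there as evidence of cryosleep.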

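I saw in a few notebooks that people linked cryosleep to spending, since a passenger in cryosleep cannot spend anything. The cell below uses that link to fill gaps: when CryoSleep is missing and the total expenses are zero, the passenger is marked as being in cryosleep. This made sense logically and did have a positive impact on the score. 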

In [5]:
# Fill CryoSleep with True only where it is missing and the passenger spent nothing
cryo_mask = train_test['CryoSleep'].isna() & (train_test['Expenses'] == 0)
train_test.loc[cryo_mask, 'CryoSleep'] = True

Separating the Group out of the PassengerId and breaking the Cabin column into its component parts (deck, number and side) had a positive impact on the score. 

In [6]:
train_test['Group'] = train_test['PassengerId'].str[:4]
train_test[['Deck', 'Number', 'Side']] = train_test['Cabin'].str.split('/', expand=True)
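
The sketch below shows what these two operations produce on a few made-up rows, including a missing Cabin. The frame `toy` is hypothetical. Note that the split parts are strings (so `Number` is `'1'`, not `1`) and that a missing Cabin stays missing in all three new columns, which is why `Deck` and `Side` still need imputing later.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same format as the competition columns.
toy = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", np.nan, "F/1/S"],
})

# The first four characters of PassengerId identify the travel group.
toy["Group"] = toy["PassengerId"].str[:4]

# Cabin is formatted as deck/number/side; a missing Cabin yields three missing parts.
toy[["Deck", "Number", "Side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```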

In [7]:
train_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Expenses,Group,Deck,Number,Side
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,0.0,1,B,0,P
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,736.0,2,F,0,S
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,10383.0,3,A,0,S
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,5176.0,3,A,0,S
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,1091.0,4,F,1,S


Using PermutationImportance from eli5 in earlier notebooks showed that only a few features had strong predictive value for the outcome. Keeping only the columns listed below, and then removing some of the extra columns created by one hot encoding (the "drop_list" further down), had the most positive impact on the score. 

In [8]:
num_cols = ['ShoppingMall','FoodCourt','RoomService','Spa','VRDeck','Expenses','Age']
cat_cols = ['CryoSleep','Deck','Side','VIP','HomePlanet','Destination']
transported=['Transported']

In [9]:
train_test = train_test[num_cols+cat_cols+transported].copy()
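
The feature screening mentioned above was done with eli5 in earlier notebooks that are not shown here. As a rough illustration of the idea, the sketch below uses scikit-learn's `permutation_importance` instead (a swapped-in tool, not the author's code), on synthetic data with a random forest as a stand-in model. The target is built to depend only on `Expenses`, so the `Noise` column should score near zero.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on Expenses, Noise is irrelevant.
rng = np.random.default_rng(0)
n = 400
toy_X = pd.DataFrame({
    "Expenses": rng.exponential(1000.0, n),
    "Noise": rng.normal(size=n),
})
toy_y = (toy_X["Expenses"] < 500).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(toy_X, toy_y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the validation set and measure the drop in accuracy.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Columns whose shuffling barely changes the validation score are candidates for removal.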

Since the missing data was only about 2% of the total data in both the train and test datasets, this simple imputation had the most beneficial effect on the score. 

In [10]:
num_imp = SimpleImputer(strategy='mean')
cat_imp = SimpleImputer(strategy='most_frequent')

In [11]:
train_test[num_cols] = pd.DataFrame(num_imp.fit_transform(train_test[num_cols]),columns=num_cols)
train_test[cat_cols] = pd.DataFrame(cat_imp.fit_transform(train_test[cat_cols]),columns=cat_cols)
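
To make the two strategies concrete, the sketch below runs the same imputers on tiny made-up arrays. `ages` and `planets` are hypothetical. The notebook fits the imputers on the combined train and test rows, so the fill values reflect both files.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up columns, each with one missing entry.
ages = np.array([[10.0], [np.nan], [30.0], [20.0]])
planets = np.array([["Earth"], ["Earth"], [np.nan], ["Mars"]], dtype=object)

num_imp = SimpleImputer(strategy="mean")
cat_imp = SimpleImputer(strategy="most_frequent")

# The missing age becomes the mean of the observed ages.
filled_ages = num_imp.fit_transform(ages)

# The missing planet becomes the most common observed planet.
filled_planets = cat_imp.fit_transform(planets)

print(filled_ages.ravel())
print(filled_planets.ravel())
```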

Using one hot encoding for the categorical data was the most effective approach I could find, especially since most of the categorical data ended up not being used. 

In [12]:
# sparse_output replaces the older `sparse` argument (scikit-learn 1.2 and later)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
temp_train = pd.DataFrame(ohe.fit_transform(train_test[cat_cols]),columns=ohe.get_feature_names_out())
train_test = train_test.drop(cat_cols,axis=1)
train_test = pd.concat([train_test,temp_train],axis=1)

In [13]:
train_test

Unnamed: 0,ShoppingMall,FoodCourt,RoomService,Spa,VRDeck,Expenses,Age,Transported,CryoSleep_False,CryoSleep_True,...,Side_P,Side_S,VIP_False,VIP_True,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0.0,0.0,0.0,0.0,0.0,0.0,39.000000,False,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,25.0,9.0,109.0,549.0,44.0,736.0,24.000000,True,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,0.0,3576.0,43.0,6715.0,49.0,10383.0,58.000000,False,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,371.0,1283.0,0.0,3329.0,193.0,5176.0,33.000000,False,1.0,0.0,...,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,151.0,70.0,303.0,565.0,2.0,1091.0,16.000000,True,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12965,0.0,0.0,0.0,0.0,0.0,0.0,34.000000,,0.0,1.0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
12966,17.0,847.0,0.0,10.0,144.0,1018.0,42.000000,,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
12967,0.0,0.0,0.0,0.0,0.0,0.0,28.771969,,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
12968,0.0,2680.0,0.0,0.0,523.0,3203.0,28.771969,,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


Next, split the combined data back into the train and test datasets and create the X and y components. 

In [14]:
train = train_test[train_test['Transported'].notnull()].copy()
train.Transported =train.Transported.astype('int')
test = train_test[train_test['Transported'].isnull()].drop("Transported",axis=1)

In [15]:
X = train.drop('Transported',axis=1)
y = train.Transported

As mentioned above, dropping columns that did not have a strong predictive impact improved the score. 

In [16]:
drop_list=['ShoppingMall','Age','CryoSleep_True','HomePlanet_Earth','HomePlanet_Europa',
'VIP_True','HomePlanet_Mars','Destination_PSO J318.5-22','VIP_False',
'Destination_55 Cancri e','FoodCourt','Destination_TRAPPIST-1e']

In [17]:
X=X.drop(drop_list,axis=1)
test=test.drop(drop_list,axis=1)

I used Optuna to tune the XGBoost hyperparameters on this data. The best values it found are listed below. 

In [18]:
params_xgb_best= {'lambda': 3.0610042624477543, 
             'alpha': 4.581902571574289, 
             'colsample_bytree': 0.9241969052729379, 
             'subsample': 0.9527591724824661, 
             'learning_rate': 0.06672065863100594, 
             'n_estimators': 730,
             'max_depth': 5, 
             'min_child_weight': 1, 
             'num_parallel_tree': 1}

In earlier notebooks I found that training on the full train dataset, rather than splitting it into training and validation subsets, gave better results. This may be because the dataset in this contest is rather small and simple. 

In [19]:
X,y = shuffle(X,y, random_state=42)
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
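
Training on every row leaves no held-out estimate of accuracy before submitting. One common way to get such an estimate without permanently setting data aside is k-fold cross-validation. This is not what the notebook does; the sketch below is only an illustration on synthetic data, with scikit-learn's `GradientBoostingClassifier` standing in for the XGBoost model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared X and y.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# Five stratified folds: each row is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(random_state=42)

scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The final model can still be refit on all rows afterwards, as in the cell that follows.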

In [20]:
model = xgb.XGBClassifier(**params_xgb_best).fit(X, y)
pred_xgb_best = model.predict(test)

# predict returns 0/1 labels; convert them to True/False values for the submission
sample['Transported'] = pred_xgb_best.astype(bool)
sample.to_csv('submit_xgb_best_data.csv', index=False)