# Introduction

I am a student from Indonesia who is currently learning Machine Learning and Data Science.

In this notebook, I want to show how I approach this problem.

Spaceship Titanic is basically Titanic - Machine Learning from Disaster 2.0.

It shares some similarities with that competition: the background, the features, etc.

# Importing Library

We start by importing the necessary libraries.

In [None]:
import numpy as np # Dealing with array manipulation
import pandas as pd # Data manipulation
from matplotlib import pyplot as plt # Making graph
plt.rcParams['figure.figsize'] = (17, 11)
plt.style.use('seaborn-v0_8-darkgrid') # On matplotlib older than 3.6 this style is named 'seaborn-darkgrid'
import seaborn as sns # Same as matplotlib, but more simple
sns.set_style('darkgrid')

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder, LabelEncoder # Preprocessing the data
from sklearn.model_selection import KFold, RepeatedKFold, RepeatedStratifiedKFold, StratifiedKFold, train_test_split # To split the data for train and validation
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, balanced_accuracy_score # Metrics (To measure how accurate our predictions are)

from catboost import CatBoostClassifier # The algorithm we use

from warnings import filterwarnings, simplefilter
filterwarnings('ignore') # To shut the warnings (Red-y thing but not an Error)
simplefilter('ignore')
import gc # Garbage collector, to free memory
gc.enable()
from tqdm.auto import tqdm # To give us a progress bar
from IPython.display import clear_output # To clear the output when it's too much

# Reading the Data

In [None]:
train = pd.read_csv('../input/spaceship-titanic/train.csv') # Train, the data we have
test = pd.read_csv('../input/spaceship-titanic/test.csv') # Test, the data we have to predict
sub = pd.read_csv('../input/spaceship-titanic/sample_submission.csv') # Sample submission, how we should submit our predictions

# Taking the Group out of the PassengerId column

We're going to take each passenger's group out of the PassengerId and use it to impute missing values.

> People in a group are often family members, but not always.

That's quoted from the Spaceship Titanic data description. It means that people in the same group tend to share the same "something", and we'll see what this something is.

In [None]:
train[['Group', 'Id']] = train['PassengerId'].str.split('_', expand = True)
test[['Group', 'Id']] = test['PassengerId'].str.split('_', expand = True)
train.drop(['PassengerId'], axis = 1, inplace = True)
test.drop(['PassengerId'], axis = 1, inplace = True)
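To make the split more concrete, here is a small sketch on made-up IDs. The `toy` DataFrame and the `GroupSize` column are only for illustration (they are not part of the notebook), and I'm assuming the IDs follow the `gggg_pp` pattern from the data description.

```python
import pandas as pd

# Hypothetical toy IDs in the gggg_pp format (group, then number within the group)
toy = pd.DataFrame({'PassengerId': ['0001_01', '0002_01', '0002_02', '0003_01']})

# Same split as in the notebook: the part before '_' is the group
toy[['Group', 'Id']] = toy['PassengerId'].str.split('_', expand = True)

# Count how many passengers share each group
group_size = toy['Group'].map(toy['Group'].value_counts())
print(toy.assign(GroupSize = group_size))
```

Here only group `0002` has more than one passenger, so only that group can borrow values from a travel companion.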

We concatenate train and test because some people belong to the same group but ended up on different sides of the train/test split.

In [None]:
data = pd.concat([train, test]).reset_index(drop = True) # DataFrame.append was removed in pandas 2.0
data = data.sort_values('Group') # We sort by group so the imputing will be easier
data

# Making columns that tell if a column has a NaN value

Before we impute the missing values, we **mark** where they were.

We're not sure our imputer is going to be accurate, so we add an indicator column for each feature.

In [None]:
data_na = data.isna().astype(int)
data_na.drop(['Transported', 'Group', 'Id'], axis = 1, inplace = True) # Drop these: Transported is only NaN for the test rows (it's the target), and Group and Id are never NaN
data = data.join(data_na, rsuffix = '_nan')
data
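Here is a tiny sketch of what the indicator columns look like. The `toy` data is made up, but the `isna` and `join` calls are the same ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({'Age': [30.0, np.nan, 12.0], 'VIP': [False, True, np.nan]})

toy_na = toy.isna().astype(int)           # 1 where the value is missing, 0 otherwise
toy = toy.join(toy_na, rsuffix = '_nan')  # Indicator columns get the '_nan' suffix
print(toy)
```

The original columns keep their names, and each one gets a `*_nan` twin that survives the imputation later on.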

# Making a new DataFrame for the groups that are not alone

So here we are. We'll start the imputing by taking the groups that have more than one person.

In [None]:
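group_col = data.columns.get_loc('Group') # Position of the Group column
rows = []
for i in tqdm(range(len(data) - 1)) : # Compare each row with the next one (data is sorted by Group)
    if data.iloc[i, group_col] == data.iloc[i+1, group_col] :
        rows.append(data.iloc[i])
        rows.append(data.iloc[i+1])
new_df = pd.DataFrame(rows) # Rows in the middle of a group are added twice, so expect duplicates here
display(new_df)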

I have to drop the duplicated rows because my code above isn't effective or efficient (not good).

You guys may fix it so it'll be more effective and efficient.
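One possible fix, as a sketch: pandas can flag every row whose Group appears more than once with `duplicated(keep = False)`, so no loop and no `drop_duplicates` are needed. The `toy` DataFrame and the name `shared` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: groups 0002 and 0004 have more than one passenger
toy = pd.DataFrame({
    'Group': ['0001', '0002', '0002', '0003', '0004', '0004', '0004'],
    'HomePlanet': ['Earth', 'Mars', None, 'Europa', 'Earth', 'Earth', None],
})

# keep = False marks ALL rows of a repeated Group, not just the later ones
shared = toy[toy.duplicated('Group', keep = False)]
print(shared)
```

On the real data this would be `data[data.duplicated('Group', keep = False)]`, which should select the same passengers as the loop followed by `drop_duplicates`.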

In [None]:
new_df = new_df.drop_duplicates().sort_values('Group') # Again, we sort by group to make things right
new_df

As you can see, before we dropped the duplicates we had 7380 rows. Now we have 5825, indicating that there WERE duplicated rows.

Now, we'll make a list of the columns that we will impute.

In [None]:
inspected_col = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Name', 'VIP']
inspected_col_spend = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

I make one list for the categorical columns and one for the continuous columns.

We also take the unique values of the group, because we're going to iterate over them.

In [None]:
group = new_df['Group'].unique()
group

# The Imputing

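And now the time comes. This is how the imputing works: for every group, we fill the missing values of a column with the most frequent value of that column within the group.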

In [None]:
for col in tqdm(inspected_col) : # Iterate over the categorical columns
    for g in tqdm(group) : # Iterate over the groups with more than one person
        try :
            tofill = data.loc[data['Group'] == g, col].mode()[0] # Most frequent value within the group
        except (KeyError, IndexError) : # .mode() is empty when every value in the group is NaN
            print(f'In {col}, one group has only NaN values, imputing the most frequent value overall')
            tofill = data[col].mode()[0] # Fall back to the mode of the whole data
        data.loc[data['Group'] == g, col] = data.loc[data['Group'] == g, col].fillna(tofill) # Apply the change to our "data" DataFrame
        new_df.loc[new_df['Group'] == g, col] = new_df.loc[new_df['Group'] == g, col].fillna(tofill) # And to new_df as well
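The nested loop can take a while. As a rough sketch of the same idea without the inner loop, we can compute one mode per group and map it back onto the rows. The `toy` data and the helper `group_mode` are made up for illustration, this is not the code the notebook actually runs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 0002 is entirely NaN
toy = pd.DataFrame({
    'Group': ['0001', '0001', '0002', '0002', '0003', '0003', '0003'],
    'HomePlanet': ['Earth', np.nan, np.nan, np.nan, 'Mars', 'Mars', np.nan],
})

def group_mode(values):
    # Most frequent non-missing value, or NaN when the group has none
    modes = values.mode()
    return modes.iloc[0] if len(modes) else np.nan

global_mode = toy['HomePlanet'].mode().iloc[0]                 # Fallback for all-NaN groups
modes = toy.groupby('Group')['HomePlanet'].agg(group_mode)     # One mode per group
per_row = toy['Group'].map(modes)                              # Broadcast back to the rows
toy['HomePlanet'] = toy['HomePlanet'].fillna(per_row).fillna(global_mode)
print(toy)
```

Group `0001` is filled with its own mode, while the all-NaN group `0002` falls back to the overall mode, just like the `except` branch above.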

The spending columns go the same way, but we fill with the smallest value in the group instead of the mode.

In [None]:
for col in tqdm(inspected_col_spend) : # Iterate over the continuous (spending) columns
    for g in tqdm(group) : # Iterate over the groups
        tofill = data.loc[data['Group'] == g, col].min() # Smallest value in the group
                                                         # You could also try .mean(), .max() or .median()
        if pd.isna(tofill) : # .min() returns NaN (it does not raise) when the whole group is NaN
            print(f'In {col}, one group has only NaN values, imputing the overall median')
            tofill = data[col].median() # Fall back to the median of the whole data
                                        # Again, you could use mean or max here too
        data.loc[data['Group'] == g, col] = data.loc[data['Group'] == g, col].fillna(tofill) # Apply the change to "data"
        new_df.loc[new_df['Group'] == g, col] = new_df.loc[new_df['Group'] == g, col].fillna(tofill) # And to new_df

In [None]:
new_df.isna().sum() # Checking the NaN value on the new_df DataFrame

Great, none of the columns that we wanted to impute has NaN values anymore.

In [None]:
data.isna().sum()

However, in `data`, there are still some NaN values.

This is because `data` also contains the people who are alone in their group, and their NaN values have not been touched yet.

For those, we'll just use a simple imputing method.

In [None]:
data

Before we impute those values, I want to split the Cabin and the Name first.

In [None]:
data[['deck', 'num', 'side']] = data['Cabin'].str.split('/', expand = True) # Split the cabin into 3 parts: deck, num, and side
data[['FirstName', 'LastName']] = data['Name'].str.split(' ', n = 1, expand = True) # Split the name into 2 parts: first and last name (n = 1 splits on the first space only)
data.drop(['Cabin', 'FirstName', 'Name', 'Group', 'Id'], axis = 1, inplace = True) # Drop the columns we no longer need
data['num'] = data['num'].astype(float) # Treat num as numeric (as a category it would have very high cardinality)

We make two lists of columns: one for the continuous columns and one for the categorical ones.

In [None]:
cont = data.select_dtypes(float).columns.tolist()
cat = [col for col in data.columns if col not in cont and col != 'Transported']

# Simple Imputing

We use the regular imputing method. 

* We fill continuous with its median
* And categorical with its mode

In [None]:
for col in cont :
    data[col] = data[col].fillna(data[col].median()) # Assign back, inplace fillna on a column is unreliable in recent pandas

for col in cat :
    data[col] = data[col].fillna(data[col].mode()[0])

# Making a "Relatives" Feature

This feature just counts how many other passengers share each person's last name.

In [None]:
tqdm.pandas()
data['Relatives'] = data['LastName'].progress_apply(lambda x : sum(x == data['LastName']) - 1)
data.drop(['LastName'], axis = 1, inplace = True)
cat.remove('LastName')
data
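The `progress_apply` above compares every name with every other name, which gets slow on larger data. Here is a sketch of a faster way to get the same counts with `value_counts`. The `toy` names are made up, and `slow` / `fast` are just illustrative variable names.

```python
import pandas as pd

# Hypothetical toy last names
toy = pd.DataFrame({'LastName': ['Ofracculy', 'Vines', 'Vines', 'Susent', 'Vines']})

# The notebook's approach: one full comparison per row
slow = toy['LastName'].apply(lambda x : sum(x == toy['LastName']) - 1)

# Count each name once, then look the count up for every row
fast = toy['LastName'].map(toy['LastName'].value_counts()) - 1
print(fast.tolist())
```

Both give the number of *other* passengers with the same last name, which is why we subtract 1.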

# Label Encoding

This just turns the categorical columns into integers, so they'll be readable by our Machine Learning algorithm. No big deal!

In [None]:
for col in cat :
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])

# Total Spending

We also make a feature that is just the sum of every bill that each passenger paid.

In [None]:
spending = ['RoomService', 'FoodCourt', 'ShoppingMall', 'VRDeck', 'Spa']
data['TotalSpend'] = data[spending].sum(axis = 1)

And this feature indicates whether each passenger is an adult (older than 18) or not.

In [None]:
data['Adult'] = (data['Age'] > 18).astype(int)
cat = cat + ['Adult']

We sort by the index because we sorted by Group before. This way we'll be able to split the data into train and test again.

In [None]:
data = data.sort_index()
data

# Re-splitting the variable "data"

As we have done many things with the concatenation of train and test, we separate them again.

In [None]:
n_train = len(train) # Number of rows that came from train
train = data.iloc[:n_train].copy() # .copy() so later in-place changes don't touch "data"
test = data.iloc[n_train:].reset_index(drop = True)

# Checking for NaN Values

In [None]:
train.isna().sum()

In [None]:
test.isna().sum()

Hooray! No more NaN values.

The `test` DataFrame only has NaN in Transported, because that's what we are trying to predict.

So we'll drop that column.

In [None]:
test.drop(['Transported'], axis = 1, inplace= True)

Now, we pop the Transported column from `train`, just to make things easier.

In [None]:
y = train.pop('Transported').astype(int)
y

# KFold

What I am using here is RepeatedKFold instead of a plain KFold. Repeating the split with different shuffles gives my algorithm many different train-and-validation combinations, so the validation score estimate won't have a high variance.

In [None]:
kf = RepeatedKFold(n_splits = 10, n_repeats = 50, random_state = 27)

I'm using 10 folds and 50 repeats here, so we'll train and validate 500 times, which sounds bonkers xD

But use whatever you guys want, it's okay...
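If you want to check how many rounds a setting gives before committing to it, here is a small sketch. `X_toy` and `rkf` are made-up names, and the small numbers are only there so it runs instantly.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X_toy = np.arange(20).reshape(10, 2) # 10 hypothetical rows

# 5 folds repeated 3 times, each repeat with a different shuffle
rkf = RepeatedKFold(n_splits = 5, n_repeats = 3, random_state = 27)
n_rounds = rkf.get_n_splits(X_toy)                     # n_splits * n_repeats
val_sizes = [len(v) for _, v in rkf.split(X_toy)]      # Size of each validation fold
print(n_rounds, val_sizes)

# The setting used in this notebook
n_rounds_notebook = RepeatedKFold(n_splits = 10, n_repeats = 50, random_state = 27).get_n_splits()
print(n_rounds_notebook)
```

The number of rounds is always `n_splits * n_repeats`, so lowering `n_repeats` is the easy way to cut the run time.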

Note : I am using CatBoost with (mostly) default settings, because when I tried a CatBoost with hyperparameters optimized by Optuna, it did not do well :(

In [None]:
scores = []
test_preds = []
for i, (t, v) in tqdm(enumerate(kf.split(train)), total = 500) :
    xtrain = train.iloc[t, :] # Take the indices that KFold return as train set
    xval = train.iloc[v, :] # Take the indices that KFold return as validation set
    xtest = test.copy() # Make a copy of the test
    ytrain = y.iloc[t] # Take the indices that KFold return as y-train set
    yval = y.iloc[v] # Take the indices that KFold return as y-validation set
    
    model = CatBoostClassifier( # Define the model
        iterations = 2000, # Maximum number of boosting iterations
        random_state = 0, # Fixed seed for reproducibility
        verbose = 0, # Silence the training log (increase it if you want to see progress)
        boost_from_average = True, # The initial guess starts from the average of the target
        eval_metric = 'Accuracy', # Metric tracked on the eval_set to stop the training
        cat_features = cat # The categorical features
    )
    model.fit(
        xtrain, ytrain, # We fit the model to xtrain and ytrain
        early_stopping_rounds = 1000, # If there's no improvement in 1000 iterations, the training stops
        use_best_model = True, # After training stops, keep the iteration with the best eval metric
        eval_set = (xval, yval) # The validation data used to stop the training
    )
    yhat = model.predict(xval) # We predict the validation set
    score = balanced_accuracy_score(yval, yhat) # And then measure it with balanced accuracy
    scores.append(score) # We add the score to the list of scores that is defined outside the KFold iterations
    ypred = model.predict_proba(xtest)[:, 1] # Predict the probability for xtest (because we'll take the mean of those predictions)
    test_preds.append(ypred) # Append the prediction to the test_preds list
    print(f'FOLD {i} : {score}')
    del xtrain, xval, xtest, ytrain, yval, model # We delete all the variable to free our memory
    gc.collect() # using gc to free memory

After that, we see the mean and the standard deviation of our scores in validation.

In [None]:
print(np.mean(scores), np.std(scores))

Update :

Here, I just want to look at the distribution of the scores: is it normal or not?

In [None]:
_ = sns.histplot(scores, kde = True, stat = 'density') # distplot is deprecated in recent seaborn versions
plt.axvline(np.median(scores), color = 'red')
plt.annotate('Median', (np.median(scores) + .0001, 30), (np.median(scores) + .005, 32), arrowprops = {'color' : 'black', 'arrowstyle' : 'simple'})

plt.axvline(np.mean(scores), color = 'red')
plt.annotate(
    'Mean', (np.mean(scores) - .0001, 27), (np.mean(scores) - .005, 24), arrowprops = {'color' : 'black', 'arrowstyle' : '->'}
)
plt.show()

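Turns out it looks like a normal distribution (judging by eye, I did not run a formal test).

Now, I'll try to compute the normal PDF (probability density function) for these scores.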

In [None]:
# The PDF of Normal Distribution
def normalcurve(data) :
    data = np.array(data)
    z = (data - np.mean(data)) / np.std(data)
    numerator = np.exp(-.5*z**2)
    denominator = np.std(data)*np.sqrt(2*np.pi)
    return numerator/denominator

In [None]:
score_pdf = normalcurve(scores)

In [None]:
_ = sns.lineplot(x = scores, y = score_pdf)

DANG!!! That's a perfect curve right there...

We can approximate the range of our score by taking mean +/- (2 * std), which covers about 95% of the values under a normal distribution.
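Where does the 95% come from? For a normal distribution, the share of values within k standard deviations of the mean is erf(k / sqrt(2)). A quick check using only the standard library (the function name `normal_coverage` is made up):

```python
import math

def normal_coverage(k):
    # P(|X - mean| <= k * std) when X is normally distributed
    return math.erf(k / math.sqrt(2))

print(round(normal_coverage(2), 4))    # 0.9545
print(round(normal_coverage(1.96), 4)) # 0.95
```

So 2 standard deviations actually cover a bit more than 95%, and the exact 95% interval uses 1.96. The difference is small enough to ignore here.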

Our worst score could be :

In [None]:
worst = np.mean(scores) - 2 * np.std(scores)
print('Worst Score Approximation :', worst)

And our best score could be :

In [None]:
best = np.mean(scores) + 2 * np.std(scores)
print('Best Score Approximation :', best)

Note : They are not actually the **worst** or **best** scores, they're just approximations.

The chance that a score falls beyond those bounds is low, but not zero. If you ask me, it's around 5%.

And note that the density can reach a value of around 30 because the range of our values is sooo narrow.

In the graph, it only spans 0.79 to 0.85. And the narrower the range of the values, the taller the curve.
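To see why the curve gets that tall: the highest point of a normal PDF is 1 / (std * sqrt(2 * pi)), so it grows as the std shrinks. A small sketch, where `peak_density` is a made-up helper and 0.012 is just an assumed std of roughly the size our scores have, not a measured value:

```python
import math

def peak_density(std):
    # Height of the normal PDF at its mean
    return 1 / (std * math.sqrt(2 * math.pi))

for std in (1.0, 0.1, 0.012):
    print(std, round(peak_density(std), 2))
```

A density is not a probability, so a value above 1 is fine. Only the area under the curve has to equal 1.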

# Submission

## OVERFITTING ALERT !1!!1!

Remember that we have 500 predictions for the test set, so we average them with NumPy's `np.mean(*, axis = 0)`.

And I want the distribution of the submission to be the SAME as the distribution of the test set.

But... how?

This is what's called Leaderboard Probing, and many Kagglers have done it (I recall people doing it in last year's TPS April to get good scores).

So, I accidentally submitted a bunch of ones and got 0.50689.

And because the metric is **accuracy**, I know that **0.50689** of the test set is ones, and the rest is zeros.
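Why does an all-ones submission reveal the share of ones? Every row labelled one is counted as correct and every row labelled zero as wrong, so the accuracy equals the share of ones. A sketch with made-up labels (`y_hidden` is hypothetical, the real test labels are of course not available):

```python
import numpy as np

rng = np.random.default_rng(0)
y_hidden = rng.random(1000) < 0.5        # Hypothetical hidden labels (True / False)
all_ones = np.ones_like(y_hidden)        # A submission that predicts True everywhere

accuracy = np.mean(all_ones == y_hidden) # What the leaderboard would report
positive_share = np.mean(y_hidden)       # The actual share of True labels
print(accuracy, positive_share)
```

The two numbers are always identical, whatever the labels are.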

So, let's do this nasty thing.

First, we'll make a pandas Series that uses the regular threshold (0.5) to determine whether it's True or False.

In [None]:
regular = pd.Series(np.round(np.mean(test_preds, axis = 0)).astype(bool)) # You can just do this by using np.round
regular

And we'll see the distribution

In [None]:
print(regular.value_counts() / len(regular))

We can adjust the distribution in our submission so that it matches the test distribution (0.50689 and 0.49311) by adjusting the threshold.

In [None]:
thresh = .512 # Adjustable!
our_preds = np.mean(test_preds, axis = 0) > thresh
# Basically saying 'average the fold predictions; if the value is more than thresh, set it to True. If not, then False'
our_preds_series = pd.Series(our_preds) # We wrap it in pandas Series so we can count the distribution
print(our_preds_series.value_counts() / len(our_preds_series)) # And show our distribution!

The code above shows the distribution in the submission. If you want to make it the same as the test set's distribution, just adjust the `thresh` variable!

In this case, mine is around 0.512
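Instead of adjusting `thresh` by hand, you could also read it off a quantile of the averaged predictions. This is only a sketch with random numbers standing in for the real predictions, and `mean_pred` and `target_share` are made-up names:

```python
import numpy as np

rng = np.random.default_rng(0)
mean_pred = rng.random(10_000)   # Hypothetical averaged probabilities, one per test row
target_share = 0.50689           # Share of True we want in the submission

# If target_share of the rows should be True, the threshold is the (1 - target_share) quantile
thresh = np.quantile(mean_pred, 1 - target_share)
share = np.mean(mean_pred > thresh)
print(thresh, share)
```

With the real predictions you would pass `np.mean(test_preds, axis = 0)` in place of `mean_pred`.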

And now, we just put the series into the sample submission so it has the right format!

In [None]:
sub['Transported'] = our_preds_series
sub.to_csv('submission.csv', index = False)
sub

In [None]:
print('HUGE OVERFITTING!')

Note : I said that our worst score could be around 0.799 and our best score around 0.846, based on the normal curve above.

That is a different story once we change the threshold, because in the KFold loop we scored with the `predict` method, not `predict_proba`, and `predict` uses 0.5 as the threshold by default. So those worst/best score approximations don't apply here.

And I truly apologize if my explanation is confusing :".