# Spaceship Titanic
### A Kaggle competition
In this notebook we take a shot at the [Spaceship Titanic competition](https://www.kaggle.com/competitions/spaceship-titanic) on Kaggle. 
We are provided with a synthetic dataset about passengers on a spaceship that disappeared in space. About half of its passengers vanished into another dimension.
We are tasked with predicting which passengers were transported to another dimension.  
The data is split into two parts: train and test. The test set has no Transported column. We make our predictions on the test set, write them to a CSV file and upload it to Kaggle, which evaluates the predictions and gives us a score.

# Table of Contents

1. [Imports](#Imports)
2. [Data visualization](#Data-visualization)
3. [Feature engineering](#Feature-engineering)
4. [Feature selection](#Feature-selection)
5. [Encoding feature values](#Encoding-feature-values)
6. [Imputing NaN values](#Imputing-NaN-values)
7. [Modeling](#Modeling)
8. [Results](#Results)

# Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate, cross_val_score
from sklearn.feature_selection import  chi2, SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Data visualization

We'll first visualize the data before manipulating it.  
Let's open the dataset. We'll combine the training and test data, so that our data manipulations and feature engineering apply to both datasets.

In [None]:
df_train = pd.read_csv('../input/spaceship-titanic/train.csv')
df_test = pd.read_csv('../input/spaceship-titanic/test.csv')
df = pd.concat([df_train,df_test])
df

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df.describe()

Let's check how the Transported classes are distributed, overall and in relation to the features.

In [None]:
target_value_counts = df['Transported'].value_counts()
target_value_counts

In [None]:
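# Take the labels from the index: value_counts() sorts by frequency, not by label
plt.bar(target_value_counts.index.astype(str), target_value_counts.values)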
plt.show()

The counts of True and False values are about the same, so the target labels are balanced.

Next, we look at how the numerical features are distributed for transported and non-transported passengers.

In [None]:
columns = ['Age', 'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(20,17))
fig.tight_layout(pad=4)
for i, col in enumerate(columns):
    # One histogram per numerical feature, split by the target
    sns.histplot(data=df[[col, 'Transported']].dropna(axis=0), x=col, hue="Transported", ax=axs[i % 3, i // 3], bins=10)

plt.show()

In [None]:
sns.boxplot(data=df[['Age', 'Transported']].dropna(axis=0), y='Age', x='Transported', width=0.5)

Children make up a higher share of the transported passengers than of the others. ShoppingMall shows no real difference between the two classes, so it may be removed during feature selection.
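To put a number on the age effect, one option is to bin the ages and compute the share of transported passengers per bin. The sketch below uses a small made-up frame (`toy`) rather than the competition data, and the bin edges are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the training set
toy = pd.DataFrame({
    "Age": [2, 8, 11, 15, 25, 34, 40, 61],
    "Transported": [True, True, False, True, False, True, False, False],
})

# Assumed bin edges: children (<13), teens (13-17), adults (18+)
bins = [0, 13, 18, 120]
labels = ["child", "teen", "adult"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# The mean of a boolean column is the fraction of True values
rate_by_age = toy.groupby("AgeGroup", observed=True)["Transported"].mean()
print(rate_by_age)
```

On the real data, the same `groupby` could be run on the rows where Transported is not missing.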

In [None]:
columns = ['HomePlanet','CryoSleep','Destination','VIP']
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(10,17))
fig.tight_layout(pad=10)
for i,col in enumerate(columns):
    axs[i%4,0].set_title(col+" Transported")
    df[df['Transported'] == True][col].value_counts().plot(kind='bar', ax=axs[i%4,0])
    
    axs[i%4,1].set_title(col+" Not Transported")
    df[df['Transported'] == False][col].value_counts().plot(kind='bar', ax=axs[i%4,1])

plt.show()

Passengers in cryosleep were much less likely to be transported.  
VIP status barely changed the proportion of transported passengers, though slightly more of the passengers who were not transported were VIPs.
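Bar charts of raw counts can be hard to compare when the categories have different sizes. A row-normalized cross-tabulation gives the proportions directly. This is only a sketch on made-up data (`toy` and its values are hypothetical), not the competition dataset.

```python
import pandas as pd

# Hypothetical toy data: 4 passengers per planet
toy = pd.DataFrame({
    "HomePlanet": ["Earth"] * 4 + ["Europa"] * 4,
    "Transported": [True, False, False, False, True, True, True, False],
})

# normalize="index" makes each row sum to 1, i.e. proportions within a category
rates = pd.crosstab(toy["HomePlanet"], toy["Transported"], normalize="index")
print(rates)
```

Each row then reads as "of the passengers in this category, which fraction was transported".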

# Feature engineering

In [None]:
df.reset_index()

In [None]:
df = df.reset_index().drop('index', axis=1)

In [None]:
df.head()

In [None]:
df['Group'] = df['PassengerId'].str.split('_', expand=True)[0].astype(str)

In [None]:
df['Cabin']

In [None]:
df[['Deck','CabinNumber','Side']] = df['Cabin'].str.split('/', expand=True)

In [None]:
df.head()

In [None]:
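# astype(int, errors='ignore') silently does nothing when the column has NaN;
# to_numeric converts the numbers and keeps missing cabins as NaN
df['CabinNumber'] = pd.to_numeric(df['CabinNumber'], errors='coerce')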

In [None]:
df['LastName'] = df['Name'].str.split(' ', expand=True)[1]
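The splits above can be tried on a tiny frame to see what each new column contains and how missing values behave. The rows below are made up for illustration (the names and IDs are hypothetical); only the column layout follows the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the PassengerId, Cabin and Name formats
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", np.nan, "F/12/S"],
    "Name": ["Maham Ofracculy", None, "Juanna Vines"],
})

# "gggg_pp" -> group number
toy["Group"] = toy["PassengerId"].str.split("_", expand=True)[0]
# "deck/num/side" -> three columns; a missing cabin gives NaN in all three
toy[["Deck", "CabinNumber", "Side"]] = toy["Cabin"].str.split("/", expand=True)
toy["CabinNumber"] = pd.to_numeric(toy["CabinNumber"], errors="coerce")
# "First Last" -> last name
toy["LastName"] = toy["Name"].str.split(" ", expand=True)[1]
print(toy)
```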

In [None]:
df.shape

In [None]:
df.set_index('PassengerId', inplace=True)

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.dtypes

Then we check for outliers.
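One way to run such a check, sketched here on made-up numbers, is to compare the upper percentiles with the maximum and to count the values above Tukey's fence (third quartile plus 1.5 times the interquartile range). The frame `toy` and the choice of rule are assumptions for illustration, not necessarily the check that was originally run.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt"]
# Hypothetical spending values with one extreme entry
toy = pd.DataFrame({
    "RoomService": [0, 0, 10, 20, 30, 5000],
    "FoodCourt": [0, 5, 5, 10, 15, np.nan],
})

# A large gap between the upper percentiles and the max hints at outliers
summary = toy[spend_cols].describe(percentiles=[0.5, 0.95, 0.99])
print(summary.loc[["50%", "95%", "99%", "max"]])

# Count the values above the fence q3 + 1.5 * IQR for each column
q1 = toy[spend_cols].quantile(0.25)
q3 = toy[spend_cols].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
n_outliers = (toy[spend_cols] > upper_fence).sum()
print(n_outliers)
```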

We then cap the outlier values at the 95th percentile of each spending column.

In [None]:
def quantile_remap(df):
    """Cap each spending column at its 95th percentile (modifies df in place)."""
    spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
    quantile_values = df[spend_cols].quantile(0.95)

    for col in spend_cols:
        threshold = quantile_values[col]
        # NaN > threshold is False, so missing values are left as they are
        df[col] = np.where(df[col].values > threshold, threshold, df[col].values)

In [None]:
quantile_remap(df)

We can then recheck for outliers.
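A quick recheck could confirm that no value exceeds its cap any more. The sketch below works on a made-up frame and uses `DataFrame.clip` as a compact stand-in for the capping function. It is an assumed equivalent for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "Spa"]
# Hypothetical spending values, including a missing one
toy = pd.DataFrame({
    "RoomService": [0.0, 10.0, 20.0, 30.0, 5000.0],
    "Spa": [0.0, 1.0, np.nan, 3.0, 900.0],
})

# Per-column caps, then clip each column at its own cap
thresholds = toy[spend_cols].quantile(0.95)
capped = toy[spend_cols].clip(upper=thresholds, axis=1)

# After capping, each column maximum should not exceed its threshold
print(capped.max() <= thresholds)
# clip leaves missing values untouched
print(capped.isna().sum())
```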

We create two more features that sum the regular spendings and the luxury ones.

In [None]:
df["Regular"] = df["FoodCourt"] + df["ShoppingMall"] 
df["Luxury"] = df["RoomService"] + df["Spa"] + df["VRDeck"]

We'll also count the number of people in each group and the number of family members, using the last name as a proxy for the family.

In [None]:
n_group_members = df['Group'].value_counts().reset_index()
n_group_members.columns = ['Group', 'N_group_members']
n_group_members

In [None]:
n_family_members = df['LastName'].value_counts().reset_index()
n_family_members.columns = ['LastName', 'N_family_members']
n_family_members

In [None]:
df

In [None]:
df = df.reset_index().merge(n_family_members, how = 'left', on = ['LastName'])
df = df.merge(n_group_members, how = 'left', on = ['Group'])
df = df.set_index('PassengerId')
df

In [None]:
df['TotalSpendings'] = df[['Luxury','Regular']].sum(axis=1)
df

Cabin and Name are no longer needed now that we have extracted their parts, so we drop them.

In [None]:
df = df.drop(['Cabin', 'Name'],axis=1)
df

In [None]:
df.dtypes

We have many object types in our dataframe. We need to convert these values.  
Let's start by converting the cabin number.

In [None]:
# to_numeric keeps missing cabin numbers as NaN instead of failing silently
df['CabinNumber'] = pd.to_numeric(df['CabinNumber'], errors='coerce')
df['CabinNumber']

# Feature selection

When we visualized the data, we saw that some variables had nearly the same distribution for both values of the target. Such variables are unlikely to help the prediction, so they are candidates for removal.
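A statistical test can back up the visual impression. The sketch below scores features with scikit-learn's `chi2` and keeps the best one with `SelectKBest`. It runs on synthetic data generated for the example (the names `informative` and `noise` are made up), and `chi2` assumes non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# One feature that depends on the target and one that does not
informative = y * 3 + rng.integers(0, 2, size=n)
noise = rng.integers(0, 4, size=n)
X = np.column_stack([informative, noise])

# Keep the single feature with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
print(selector.scores_)
print(selector.get_support())
```

`get_support()` returns a boolean mask over the columns, which could be used to filter a dataframe's columns.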

# Encoding feature values

We will now encode the string values as numerical values, using:
- the LabelEncoder for the label (Transported column)
- the OrdinalEncoder for non-numerical values
  
OrdinalEncoder is usually used with data that can be ordered. In my case, because I want to use a Random Forest, I figured I could use OrdinalEncoder for every categorical feature.

In [None]:
label_enc = LabelEncoder()
ord_enc = OrdinalEncoder()
oh_enc = OneHotEncoder(drop='first', sparse_output=False)

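LabelEncoder cannot skip NaN values on its own. However, we need to keep them: the NaN values in Transported mark the test rows, and the NaN values in the features are imputed later. So we fit a LabelEncoder on the non-null values of each column and leave the NaN values untouched.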

In [None]:
label_enc.fit([True,False])

In [None]:
label_enc.classes_, label_enc.transform(label_enc.classes_)

In [None]:
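# Label-encode the non-null values of every column (numeric ones included);
# rows that were NaN are absent from the returned Series, so they stay NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))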
df['Transported'].notna().sum(), df['Transported'].isna().sum()
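The effect of encoding only the non-null entries is easier to see on a single small column. The helper `encode_non_null` below is a hypothetical name introduced for this sketch. It follows the same idea as the cell above, on made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_non_null(series):
    """Label-encode the non-null entries of a Series and keep NaN as NaN."""
    mask = series.notnull()
    encoded = pd.Series(np.nan, index=series.index, dtype=float)
    encoded[mask] = LabelEncoder().fit_transform(series[mask])
    return encoded

# Hypothetical column with one missing value
toy = pd.Series(["Mars", "Earth", np.nan, "Mars"], name="HomePlanet")
codes = encode_non_null(toy)
print(codes.tolist())
```

The classes are sorted before being numbered, so "Earth" gets code 0 and "Mars" gets code 1, while the missing entry stays NaN.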

Then we fit and apply the ordinal encoder to turn the categorical values into numbers. As noted above, because I plan to use a Random Forest, I allow myself to use the ordinal encoder for unordered values too: the trees only split on thresholds, so the codes are just grouped and compared at each node.

In [None]:
df[['HomePlanet','CryoSleep','Destination','LastName','Deck','Side','VIP']] = ord_enc.fit_transform(df[['HomePlanet','CryoSleep','Destination','LastName','Deck','Side','VIP']])

We will split the dataframe back into the train and test sets after handling the missing values.

# Imputing NaN values

In [None]:
df[['TotalSpendings','CryoSleep']].groupby('CryoSleep').value_counts()

Passengers in CryoSleep couldn't spend money on the starship, so their NaN values in the spending columns can be set to 0.

In [None]:
df[df['Age'] < 13 ][['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].sum()

Children under 13 didn't spend any money either, so the same applies to them.

In [None]:
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
# Children under 13 and passengers in cryosleep get 0 in every spending column
no_spend = (df["Age"] < 13) | (df["CryoSleep"] == True)
for col in spend_cols:
    df[col] = np.where(no_spend, 0.0, df[col].astype(float))
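The cell above sets the spending to 0 for every matching row, whether or not the value was missing. A stricter variant, sketched below on made-up rows, only fills the entries that are actually NaN. The frame `toy` and the encoded value `1` for CryoSleep are assumptions for the example.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "Spa"]
# Hypothetical rows: a sleeper, a child, an awake adult, an unknown sleeper
toy = pd.DataFrame({
    "Age": [30.0, 5.0, 40.0, 22.0],
    "CryoSleep": [1.0, 0.0, 0.0, np.nan],
    "RoomService": [np.nan, np.nan, np.nan, 12.0],
    "Spa": [0.0, np.nan, 50.0, np.nan],
})

no_spend = (toy["Age"] < 13) | (toy["CryoSleep"] == 1)
for col in spend_cols:
    # Only touch the cells that are missing AND belong to a no-spend passenger
    fill_here = no_spend & toy[col].isna()
    toy.loc[fill_here, col] = 0.0
print(toy)
```

The remaining NaN values (awake adults, or passengers whose CryoSleep status is unknown) are left for the imputer.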

We split the data back into the test and train sets.

In [None]:
# Test rows are the ones whose Transported value is missing (NaN was kept by the encoding)
df_test = df[df['Transported'].isna()]
df_test

In [None]:
df_train = df[df['Transported'].notna()]

In [None]:
df_train.shape

We drop the transported column in the test dataframe

In [None]:
df_test = df_test.drop('Transported',axis=1)

Let's split the data into our features and our target.

In [None]:
x_train = df_train.drop('Transported', axis=1)
y_train = df_train['Transported']

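Using KNNImputer, we can impute the NaN values with the k-nearest neighbors algorithm: each missing value is filled from the rows that are closest on the features that are present.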

In [None]:
imp = KNNImputer()

In [None]:
df_train.columns

In [None]:
cols_wo_class = list(df_train.columns)
cols_wo_class.remove('Transported')

df[cols_wo_class] = imp.fit_transform(df[cols_wo_class])
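To see what the imputer does, here is a minimal sketch on a made-up array with a single missing value. `n_neighbors=2` is chosen for the example (the cell above uses the default).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second row is missing its second feature
X = np.array([
    [1.0, 2.0],
    [1.0, np.nan],
    [9.0, 10.0],
    [1.2, 2.2],
])

# The missing value is replaced by the mean of that feature
# over the 2 nearest rows, measured on the features that are present
imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(imputed)
```

Here the two nearest rows are the first and the last, so the gap is filled with the average of 2.0 and 2.2.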

# Modeling

We will now build our random forest and perform a grid search to find the best parameters for it. I decided to search for optimal values of the max depth, criterion, class weight and number of estimators.

In [None]:
df_test = df[df['Transported'].isna()]
df_train = df[df['Transported'].notna()]
(df_train.shape, df_test.shape)

In [None]:
x_train = df_train.drop('Transported', axis=1)
y_train = df_train['Transported']

In [None]:
df_test = df_test.drop('Transported', axis=1)

## Random Forest

In [None]:
rfc = RandomForestClassifier(n_jobs=-1, random_state=0)

In [None]:
scores = cross_val_score(rfc, x_train, y_train, scoring='accuracy', n_jobs=-1)
print(scores)
print(np.mean(scores))

Around 74% accuracy. Not bad. Let's improve it with a grid search.

In [None]:
gs = GridSearchCV(
    rfc,
    {
        "max_depth":[1, 10, 50, 100, None],
        "criterion":["gini", "entropy", "log_loss"],
        "class_weight":["balanced", "balanced_subsample", None],
        "n_estimators":[5, 50, 100, 150]
    },
    scoring="accuracy",
    n_jobs=-1
)

In [None]:
#THIS CELL TAKES A LONG TIME TO RUN
gs.fit(x_train,y_train)
# Note: this scores the refit model on the training data, so it is optimistic
gs.score(x_train,y_train)
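Scoring the fitted search on the training data tends to overstate performance, because the best model has already seen those rows. The cross-validated estimate is available as `best_score_`. The sketch below shows both numbers on a small synthetic problem. The dataset and the tiny parameter grid are assumptions made to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": [2, None]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

train_accuracy = search.score(X, y)  # measured on data the model has seen
cv_accuracy = search.best_score_     # mean accuracy over the held-out folds
print(train_accuracy, cv_accuracy)
print(search.best_params_)
```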

In [None]:
gs.best_params_

In [None]:
best_rf = gs.best_estimator_

In [None]:
score = cross_val_score(best_rf, x_train, y_train, scoring='accuracy', n_jobs=-1)
print(score)
np.mean(score)

Accuracy of 78%. We managed to improve our random forest. We can now try a related family of algorithms: boosted trees.

## LGBM

In [None]:
from lightgbm import LGBMClassifier

In [None]:
booster = LGBMClassifier(random_state=0, n_jobs=-1)

In [None]:
scores = cross_val_score(booster, x_train, y_train, scoring='accuracy', n_jobs=-1)
print(scores)
print(np.mean(scores))

We got 66%, which is worse than the random forest. Let's try grid searching.

In [None]:
gs = GridSearchCV(
    booster,
    {
        "n_estimators":[100,50,150],
        "learning_rate":[0.1, 1e-2, 1e-3],
        #"num_leaves":[2,10,30],
        "objective":["binary"]
    },
    scoring="accuracy",
    n_jobs=-1
)

In [None]:
#This cell takes a minute to run
gs.fit(x_train,y_train)
gs.score(x_train,y_train)

In [None]:
gs.best_params_

In [None]:
best_booster = gs.best_estimator_

In [None]:
# Evaluate the tuned model found by the grid search, not the default booster
scores = cross_val_score(best_booster, x_train, y_train, scoring='accuracy', n_jobs=-1)
print(scores)
print(np.mean(scores))

In [None]:
pred = best_booster.predict(df_test)

# Results

We'll now write our results to a CSV file and upload it to Kaggle.

In [None]:
# The predictions are 0/1 codes; the submission file needs booleans
results = pd.DataFrame({'Transported': pred.astype(bool)}, index=df_test.index)
results.to_csv('../working/submission.csv')
results
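Before uploading, it can be worth checking that the file has the two expected columns, PassengerId and Transported, and that the target is boolean. The sketch below writes a made-up submission to an in-memory buffer and reads it back. The IDs and predictions are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical predictions and passenger IDs
pred = np.array([1.0, 0.0, 1.0])
ids = pd.Index(["0013_01", "0018_01", "0019_01"], name="PassengerId")

submission = pd.DataFrame({"Transported": pred.astype(bool)}, index=ids)

# Round-trip through CSV to inspect what would actually be uploaded
buffer = io.StringIO()
submission.to_csv(buffer)
buffer.seek(0)
check = pd.read_csv(buffer, dtype={"PassengerId": str})
print(check)
```

Because the index is named PassengerId, `to_csv` writes it as the first column without any extra step.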

TODO : PCA, Chi2, FS, DNN