# Intro
Welcome to the [Spaceship Titanic](https://www.kaggle.com/c/spaceship-titanic/overview) competition.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/34377/logos/header.png)


# Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings("ignore")

# Path

In [None]:
path = "/kaggle/input/spaceship-titanic/"
os.listdir(path)

# Load Data

In [None]:
train_data = pd.read_csv(path+'train.csv')
test_data = pd.read_csv(path+'test.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')

# Functions
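We define a helper function that plots, side by side, how a feature is distributed among transported and not-transported passengers.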

In [None]:
def plot_bar_transported(data, feature, rot=False):
    """Compare a feature's distribution (in %) between transported and not-transported passengers."""
    
    df_not_survived = data[data['Transported']==0]
    df_survived = data[data['Transported']==1]
    
    survived_label = df_survived[feature].value_counts().sort_index()
    dict_survived = dict(zip(survived_label.keys(), ((100*(survived_label)/len(df_survived.index)).tolist())))
    survived_names = list(dict_survived.keys())
    survived_values = list(dict_survived.values())
    
    not_survived_label = df_not_survived[feature].value_counts().sort_index()
    dict_not_survived = dict(zip(not_survived_label.keys(), ((100*(not_survived_label)/len(df_not_survived.index)).tolist())))
    not_survived_names = list(dict_not_survived.keys())
    not_survived_values = list(dict_not_survived.values())
    
    fig, axs = plt.subplots(1, 2, figsize=(9, 3), sharey=True)
    
    axs[0].bar(survived_names, survived_values, color='yellowgreen')
    axs[1].bar(not_survived_names, not_survived_values, color='sandybrown')
    axs[0].grid()
    axs[1].grid()
    axs[0].set_title('Transported')
    axs[1].set_title('Not Transported')
    axs[0].set_ylabel('%')
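    if rot:
        # Rotate the existing tick labels instead of overwriting them,
        # which avoids a mismatch between tick positions and labels.
        axs[0].tick_params(axis='x', labelrotation=45)
        axs[1].tick_params(axis='x', labelrotation=45)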
    plt.show()

# Overview
In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

**train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
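
The `gggg_pp` structure of PassengerId described above can be split into a group id and a within-group number. The following is a minimal sketch on a hand-made frame; the ids are made up, and the column names `Group`, `GroupNum` and `GroupSize` are illustrative and not used elsewhere in this notebook.

```python
import pandas as pd

# Toy ids in the gggg_pp format (hypothetical values, not competition data).
toy = pd.DataFrame({'PassengerId': ['0001_01', '0002_01', '0002_02', '0003_01']})

# Split once on the underscore: left part is the group, right part the number within it.
parts = toy['PassengerId'].str.split('_', expand=True)
toy['Group'] = parts[0]
toy['GroupNum'] = parts[1].astype(int)

# Number of passengers that share the same group id.
toy['GroupSize'] = toy.groupby('Group')['PassengerId'].transform('count')
print(toy)
```

A group size of 1 would mark a passenger travelling alone under this reading of the id.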

In [None]:
print('Number train samples:', len(train_data.index))
train_data.head()

**test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

In [None]:
print('Number test samples:', len(test_data.index))
test_data.head()

**sample_submission.csv** - A submission file in the correct format.
* PassengerId - Id for each passenger in the test set.
* Transported - The target. For each passenger, predict either True or False.

In [None]:
samp_subm.head()

# Exploratory Data Analysis

## Target Label

In [None]:
train_data['Transported'].value_counts()

## Feature CryoSleep

In [None]:
plot_bar_transported(train_data, 'CryoSleep')

## Feature Age

In [None]:
plot_bar_transported(train_data, 'Age')

## Feature Spa

In [None]:
#plot_bar_transported(train_data, 'Spa')

# Prepare Data

## Handle Missing Values

In [None]:
cols_with_missing_train = [col for col in train_data.columns if train_data[col].isnull().any()]
cols_with_missing_test = [col for col in test_data.columns if test_data[col].isnull().any()]
print('train columns with missing data:', cols_with_missing_train)
print('test columns with missing data:', cols_with_missing_test)

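We fill missing values of Age with the mean of the training data and missing values of all other features with their most frequent training value. The same fill values are applied to the test data: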

In [None]:
for col in cols_with_missing_train:
    if col=='Age':
        fill = train_data[col].mean()
    else:
        fill = train_data[col].value_counts().index[0]
    train_data[col] = train_data[col].fillna(fill)
    test_data[col] = test_data[col].fillna(fill)
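
The same idea can be sketched on a toy example: the fill values are computed from the training frame only and then applied to both frames. The frames `toy_train` and `toy_test` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames with gaps (hypothetical values).
toy_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0],
                          'HomePlanet': ['Earth', 'Earth', np.nan]})
toy_test = pd.DataFrame({'Age': [np.nan, 30.0],
                         'HomePlanet': [np.nan, 'Mars']})

# Fill values come from the training frame only.
fill_values = {'Age': toy_train['Age'].mean(),
               'HomePlanet': toy_train['HomePlanet'].mode()[0]}

# A dict passed to fillna is applied column by column.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
print(toy_test)
```

Computing the statistics on the training data alone keeps the test data from influencing the preprocessing.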

## Split Cabin
We extract the features deck, num and side from the cabin string, which has the form deck/num/side:


In [None]:
def extract_deck(s):
    return s.split('/')[0]

def extract_num(s):
    return s.split('/')[1]

def extract_side(s):
    return s.split('/')[2]

train_data['Deck'] = train_data['Cabin'].apply(extract_deck)
train_data['Num'] = train_data['Cabin'].apply(extract_num)
train_data['Side'] = train_data['Cabin'].apply(extract_side)

test_data['Deck'] = test_data['Cabin'].apply(extract_deck)
test_data['Num'] = test_data['Cabin'].apply(extract_num)
test_data['Side'] = test_data['Cabin'].apply(extract_side)
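
As an alternative to three separate helper functions, the split can probably be done in one vectorised step with the pandas string accessor. This is only a sketch on made-up cabin strings, not the code used in this notebook.

```python
import pandas as pd

# Toy cabins in the deck/num/side format (hypothetical values).
toy = pd.DataFrame({'Cabin': ['B/0/P', 'F/12/S', 'G/734/S']})

# expand=True returns one column per part of the split string.
toy[['Deck', 'Num', 'Side']] = toy['Cabin'].str.split('/', expand=True)

# The parts are strings, so the cabin number still has to be cast.
toy['Num'] = toy['Num'].astype(int)
print(toy)
```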

## Encode
We encode the categorical data:

In [None]:
data = pd.concat([train_data[test_data.columns], test_data])
features_cat = ['HomePlanet', 'Destination', 'Deck', 'Side']
for feature in features_cat:
    # Build the one-hot columns once and attach them to the combined frame.
    dummies = pd.get_dummies(data[feature], prefix=feature)
    data[dummies.columns] = dummies
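
Encoding the concatenated frame has the side effect that train and test rows end up with the same dummy columns, even if a category occurs in only one of the two sets. A small sketch with made-up deck values shows the difference; `train_toy` and `test_toy` are hypothetical frames.

```python
import pandas as pd

# Deck 'T' appears only in the toy training data (hypothetical values).
train_toy = pd.DataFrame({'Deck': ['A', 'B', 'T']})
test_toy = pd.DataFrame({'Deck': ['A', 'B']})

# Encoding the test frame on its own misses the Deck_T column.
separate = pd.get_dummies(test_toy['Deck'], prefix='Deck').columns.tolist()

# Encoding the combined frame and slicing afterwards keeps all columns.
combined = pd.concat([train_toy, test_toy])
joint = pd.get_dummies(combined['Deck'], prefix='Deck')
test_part = joint.iloc[len(train_toy):]

print(separate)
print(test_part.columns.tolist())
```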

## Feature Name

In [None]:
def extract_last_name(s):
    return s.split(' ')[-1]

data['LastName'] = data['Name'].apply(extract_last_name)

dict_names = data['LastName'].value_counts().to_dict()

def same_name(s):
    return dict_names[s]-1

data['SameName'] = data['LastName'].apply(same_name)
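
The SameName feature counts how many other passengers share a last name. The same count can be sketched without a lookup dictionary by mapping the value counts back onto the column; the names below are made up.

```python
import pandas as pd

# Toy names (hypothetical values).
toy = pd.DataFrame({'Name': ['Ann Smith', 'Bob Jones', 'Cy Smith']})

# Last token of the name is taken as the last name.
toy['LastName'] = toy['Name'].str.split(' ').str[-1]

# Occurrences of each last name, minus the passenger themselves.
toy['SameName'] = toy['LastName'].map(toy['LastName'].value_counts()) - 1
print(toy)
```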

In [None]:
#data['Alone'] = np.where(data['SameName']==0, 1, 0)
#data['Reedall'] = np.where(data['LastName']=='Reedall', 1, 0)

## Feature Age

In [None]:
def age_group(s):
    # Age 0 gets its own code; ages above 80 match no branch and return None.
    if s == 0:
        return -1
    elif 0 < s <= 13:
        return 1
    elif 13 < s <= 20:
        return 2
    elif 20 < s <= 30:
        return 3
    elif 30 < s <= 40:
        return 4
    elif 40 < s <= 50:
        return 5
    elif 50 < s <= 60:
        return 6
    elif 60 < s <= 70:
        return 7
    elif 70 < s <= 80:
        return 8
    
data['AgeGroup'] = data['Age'].apply(age_group)

#data['Childreen'] = np.where(data['Age']<=14, 1, 0)
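
The same binning can be sketched with `pd.cut`, which assigns each age to a right-closed interval. This is an illustrative alternative on made-up ages, not the code used above; the names `bins`, `codes` and `group` are hypothetical.

```python
import pandas as pd

# Bin edges matching the intervals (0, 13], (13, 20], ..., (70, 80].
bins = [0, 13, 20, 30, 40, 50, 60, 70, 80]
ages = pd.Series([0, 5, 13, 14, 35, 79])

# labels=False returns the bin index (0..7); adding 1 gives the codes 1..8.
# Ages outside (0, 80] become NaN.
codes = pd.cut(ages, bins=bins, labels=False) + 1

# Age 0 lies outside the first interval and gets the special code -1.
group = codes.where(ages != 0, -1)
print(group.tolist())
```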

## Drop Feature

We drop the features we do not want to use, including the original categorical columns that are now one-hot encoded:

In [None]:
features_drop = features_cat+['Name', 'PassengerId', 'Cabin', 'LastName', 'Age']
data.drop(features_drop, axis=1, inplace=True)

We cast the feature Num to integer:

In [None]:
data['Num'] = data['Num'].astype('int')

## Split Data
We split the data into train, validation and test data. The validation set holds out 1% of the training samples:

In [None]:
X = data[:len(train_data)]
y = train_data['Transported']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.01, random_state=2022)
X_test = data[len(train_data):]
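
With such a small validation set the validation score can be noisy. One possible variation, shown here only as a sketch on synthetic data, is a larger hold-out that is stratified on the target so both classes keep their proportions. The 20% size and the toy arrays are assumptions, not settings from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with a perfectly balanced binary target.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0, 1] * 50)

# stratify=y_toy keeps the class ratio the same in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2022)

print(len(X_tr), len(X_va), y_va.mean())
```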

# Model
We use a simple XGBoost classifier and tune the number of estimators and the learning rate with a 5-fold grid search:

In [None]:
param_grid = {'n_estimators': [10, 25, 50, 75, 100],
              'learning_rate': [0.2, 0.15, 0.1, 0.05],
              'eval_metric': ['logloss']}  # binary target, so log loss instead of the multiclass 'mlogloss'
grid = GridSearchCV(XGBClassifier(), param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
best_params = grid.best_params_
print('Best score of cross validation: {:.2f}'.format(grid.best_score_))
print('Best parameters:', best_params)
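
To get a feeling for the cost of the search, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. The sketch below repeats only the two varied parameters; single-valued entries such as the evaluation metric do not change the count. The name `param_grid_demo` is hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# The two parameters that are actually varied in the grid search.
param_grid_demo = {'n_estimators': [10, 25, 50, 75, 100],
                   'learning_rate': [0.2, 0.15, 0.1, 0.05]}

n_candidates = len(ParameterGrid(param_grid_demo))  # 5 * 4 combinations
n_fits = n_candidates * 5                           # one fit per fold with cv=5

print(n_candidates, n_fits)
```

By default `GridSearchCV` adds one more fit on top of this, because it refits the best setting on the full data it was given.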

In [None]:
model = XGBClassifier()
model.set_params(**best_params)
model.fit(X_train, y_train)

Validate training:

In [None]:
y_val_pred = model.predict(X_val)
print('Validation Score:', accuracy_score(y_val, y_val_pred))

Predict test data:

In [None]:
y_test = model.predict(X_test)

# Analyse Training

In [None]:
importance = model.feature_importances_
fig = plt.figure(figsize=(10, 8))
x = X_train.columns.values
plt.barh(x, 100*importance)
plt.title('Feature Importance', loc='left')
plt.xlabel('Percentage')
plt.grid()
plt.show()

# Export

In [None]:
samp_subm['Transported'] = y_test
samp_subm['Transported'].value_counts()
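
The sample submission described above expects True or False for each passenger. If the predictions come back as 0/1 integers rather than booleans (an assumption about the classifier output that is worth checking, for example with `samp_subm.dtypes`), they can be cast before writing the file. The ids and predictions below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 predictions and passenger ids.
preds = np.array([1, 0, 1])
sub = pd.DataFrame({'PassengerId': ['0013_01', '0018_01', '0019_01']})

# Cast to bool so the file contains True/False instead of 1/0.
sub['Transported'] = preds.astype(bool)

# Without a path, to_csv returns the file content as a string.
csv_text = sub.to_csv(index=False)
print(csv_text)
```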

In [None]:
samp_subm.to_csv('submission.csv', index=False)