# Spaceship Titanic

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, we are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

pd.options.display.max_columns = 99

## Train Dataset

In [None]:
train = pd.read_csv('../input/spaceship-titanic/train.csv')
len(train)

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe().T

## Test Dataset

In [None]:
test = pd.read_csv('../input/spaceship-titanic/test.csv')
len(test)

In [None]:
test.info()

In [None]:
test.describe().T

In both the train and test datasets, nearly every attribute (all except PassengerId and, in train, the Transported target) has missing values that will need to be imputed.

In summary, there are **8693** passengers in the train dataset and **4277** passengers in the test dataset.

In the train dataset, 4378 of the 8693 passengers (~50%) were transported to the alternate dimension, so the target is well balanced.
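
As a quick check of the balance figure, the share of transported passengers is simply the mean of the boolean target. The sketch below uses a small made-up frame (`toy_train` is a hypothetical stand-in for `train`), plus the arithmetic on the counts quoted above.

```python
import pandas as pd

# Toy stand-in for the train dataframe; the real one is read from train.csv.
toy_train = pd.DataFrame({"Transported": [True, False, True, True, False, False]})

# The mean of a boolean column is the fraction of True values.
share = toy_train["Transported"].mean()
print(share)

# Same calculation on the counts reported for the real train dataset.
print(round(4378 / 8693, 4))  # 0.5036
```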

## Summary of Attributes

* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
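
To make the two composite formats above concrete, here is a minimal sketch on made-up rows. The column names `group` and `position` are illustrative only and are not the names used later in this notebook.

```python
import pandas as pd

# Made-up rows that follow the documented formats.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/1/S", None],
})

# PassengerId has the form gggg_pp: the group, then the number within the group.
toy[["group", "position"]] = toy["PassengerId"].str.split("_", expand=True)

# Cabin has the form deck/num/side; a missing cabin leaves all three parts missing.
toy[["deck", "num", "side"]] = toy["Cabin"].str.split("/", expand=True)
print(toy)
```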

In [None]:
combine = pd.concat([train, test], axis = 0, ignore_index = True)
len(combine)

## Preliminary Feature Engineering

In [None]:
# creating new column for groupId
combine['groupId'] = combine['PassengerId'].str[:4]

In [None]:
combine['id'] = combine['PassengerId'].str[5:].astype(int)

In [None]:
# create new columns for first and last name
combine[['firstname', 'lastname']] =  pd.DataFrame(combine['Name'].str.split(expand = True))

In [None]:
# create deck, num, and side
combine[['deck', 'num', 'side']] = pd.DataFrame(combine['Cabin'].str.split('/', expand = True))

## Imputing Missing Values

In [None]:
combine.isnull().sum()

### Filling Missing Luxury Amenity Values with 0 and Calculating Total Spending

In [None]:
luxamen = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
combine[luxamen] = combine[luxamen].fillna(0)

In [None]:
combine['totalspend'] = combine[luxamen].sum(axis = 1)

In [None]:
col_to_drop = combine.columns

### HomePlanet

In [None]:
hp = combine.copy()

In [None]:
hp['HomePlanet'].unique()

In [None]:
hp['Destination'].unique()

Here we see that HomePlanet takes only three values. There are a few ways to infer a missing home planet:
* Group number reference
* Destination - could either be leaving their home planet, or returning to home planet
* Name - certain first/last name associated with home planet

In [None]:
hp[hp['HomePlanet'].isnull()]

First, we try using the groupId as a reference, with the assumption that passengers who travel together are from the same home planet.

In [None]:
def fill_hp(x):
    
    if pd.isnull(x['HomePlanet']): # check if HomePlanet for the row entry is missing
        gid = x['groupId'] # get the groupId for the row entry with missing HomePlanet
        df = hp[hp['groupId'] == gid] # index the dataframe to only select entries with the same groupId
        if (len(df) > 1) & (df['HomePlanet'].notnull().sum() > 0):
            return_hp = df['HomePlanet'].mode()[0]
            return return_hp
        
    else:
        return x['HomePlanet']

In [None]:
hp['HomePlanet'].isnull().sum()

In [None]:
hp['HomePlanet'] = hp.apply(fill_hp, axis = 1)

In [None]:
hp['HomePlanet'].isnull().sum()
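
The row-wise `apply` above rescans the whole dataframe for every missing entry. A vectorized variant of the same idea might look like the sketch below. `fill_from_group` is a hypothetical helper, not part of the original notebook, and it is shown on made-up data.

```python
import pandas as pd

def fill_from_group(df, key, target):
    """Fill missing `target` values with the most common known value sharing `key`."""
    known = df.dropna(subset=[target])
    # One mode per key; keys with no known value are simply absent here.
    modes = known.groupby(key)[target].agg(lambda s: s.mode().iloc[0])
    # Rows whose key has no known value stay missing after the fill.
    return df[target].fillna(df[key].map(modes))

toy = pd.DataFrame({
    "groupId": ["0001", "0001", "0002", "0003", "0003"],
    "HomePlanet": ["Earth", None, None, "Mars", "Mars"],
})
toy["HomePlanet"] = fill_from_group(toy, "groupId", "HomePlanet")
print(toy)
```

Under these assumptions the same helper could be reused with `lastname` or `firstname` as the key, or with `Destination` as the target.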

Next, we try to identify the home planet from the last name, on the assumption that last names differ between home planets. For passengers without a matching last name, we then try the first name.

In [None]:
def fill_hp_name(x):
    
    if pd.isnull(x['HomePlanet']): # check if HomePlanet for the row entry is missing
        name = x['lastname'] # get the lastname for the row entry with missing HomePlanet
        df = hp[hp['lastname'] == name] # index the dataframe to only select entries with the same lastname
        if (len(df) > 1) & (df['HomePlanet'].notnull().sum() > 0):
            return df['HomePlanet'].mode()[0]
        
    else:
        return x['HomePlanet']

In [None]:
hp['HomePlanet'] = hp.apply(fill_hp_name, axis = 1)

In [None]:
hp['HomePlanet'].isnull().sum()

In [None]:
def fill_hp_firstname(x):
    
    if pd.isnull(x['HomePlanet']): # check if HomePlanet for the row entry is missing
        name = x['firstname'] # get the firstname for the row entry with missing HomePlanet
        df = hp[hp['firstname'] == name] # index the dataframe to only select entries with the same firstname
        if (len(df) > 1) & (df['HomePlanet'].notnull().sum() > 0):
            return df['HomePlanet'].mode()[0]
        
    else:
        return x['HomePlanet']

In [None]:
hp['HomePlanet'] = hp.apply(fill_hp_firstname, axis = 1)

In [None]:
hp['HomePlanet'].isnull().sum()

In [None]:
hp[hp['HomePlanet'].isnull()]

Here we see that all of the remaining passengers share the destination TRAPPIST-1e. We will assume that they are from Earth, as TRAPPIST-1e is the most popular destination for passengers from Earth.

In [None]:
hp.groupby('Destination')['HomePlanet'].value_counts()
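
The inference really needs the share of each home planet among passengers bound for a given destination. One way to read that off is a row-normalized cross-tabulation, sketched here on made-up data. The counts are illustrative and are not the real dataset's.

```python
import pandas as pd

# Made-up passengers; the real proportions come from the full dataframe.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa", "Mars", "Earth"],
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e",
                    "55 Cancri e", "TRAPPIST-1e", "55 Cancri e"],
})

# Rows are destinations, columns are home planets, and each row sums to 1.
shares = pd.crosstab(toy["Destination"], toy["HomePlanet"], normalize="index")
print(shares)

# Most likely home planet for someone heading to TRAPPIST-1e in this toy data.
print(shares.loc["TRAPPIST-1e"].idxmax())
```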

In [None]:
hp['HomePlanet'] = hp['HomePlanet'].fillna('Earth')

In [None]:
hp['HomePlanet'].isnull().sum()

In [None]:
hp = hp[['PassengerId', 'HomePlanet']]

### Cabin

In [None]:
cb = combine.copy()

In [None]:
cb['num'].unique()

Since we cannot reliably infer the cabin number from the other data, we fill in an arbitrary value to denote that the passenger might not have been assigned a cabin. As the largest cabin number is 1890, we use an arbitrary 9999. For deck and side, we use an arbitrary Z.

In [None]:
cb['num'] = cb['num'].fillna('9999')

In [None]:
cb['num'] = cb['num'].astype(int)

In [None]:
cb['deck'].unique()

In [None]:
cb['deck'] = cb['deck'].fillna('Z')
cb['side'] = cb['side'].fillna('Z')
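
A sentinel such as 9999 is indistinguishable from a real (if implausible) cabin number once it is in the column. One optional safeguard is to record an explicit missing flag before filling. The sketch below does this on made-up rows. `cabin_missing` is a hypothetical column name that the rest of this notebook does not use.

```python
import pandas as pd

toy = pd.DataFrame({"Cabin": ["B/0/P", None, "F/1/S"]})
toy[["deck", "num", "side"]] = toy["Cabin"].str.split("/", expand=True)

# Hypothetical extra feature: remember which rows had no cabin record.
toy["cabin_missing"] = toy["Cabin"].isna().astype(int)

# Same sentinel values as in the cells above.
toy["num"] = toy["num"].fillna("9999").astype(int)
toy[["deck", "side"]] = toy[["deck", "side"]].fillna("Z")
print(toy)
```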

In [None]:
cb1 = cb.copy()

In [None]:
cb = cb[['PassengerId', 'deck', 'num', 'side']]

### VIP

Here we assume that VIP status correlates with how much money a passenger spends.

In [None]:
vip = combine.copy()

In [None]:
vip.head()

In [None]:
vip_y = vip[vip['VIP'] == True]

In [None]:
vip_y['totalspend'].median()

In [None]:
vip_n = vip[vip['VIP'] == False]

In [None]:
vip_n['totalspend'].median()

In [None]:
def fill_vip(x):
    if pd.isnull(x['VIP']):
        if x['totalspend'] >= 2743:
            return True
        else:
            return False
        
    else:
        return x['VIP']

In [None]:
vip['VIP'] = vip.apply(fill_vip, axis = 1)

In [None]:
vip['VIP'].isnull().sum()
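
The cut-off in `fill_vip` is hard-coded. Assuming it was read off the VIP median computed earlier (the notebook does not say so explicitly), the threshold could instead be derived from the data, as in this sketch on made-up values.

```python
import pandas as pd

toy = pd.DataFrame({
    "VIP": [True, True, False, False, None, None],
    "totalspend": [3000.0, 2000.0, 100.0, 0.0, 2600.0, 50.0],
})

# Median total spend among known VIPs, used as the cut-off.
threshold = toy.loc[toy["VIP"].eq(True), "totalspend"].median()

# Only rows with a missing VIP flag are filled; known values are left alone.
missing = toy["VIP"].isna()
toy.loc[missing, "VIP"] = toy.loc[missing, "totalspend"] >= threshold
print(threshold)
print(toy)
```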

In [None]:
vip = vip[['PassengerId', 'VIP']]

### CryoSleep

In [None]:
sns.boxplot(data = combine, x = 'CryoSleep', y = 'totalspend')

From the boxplot, we can see that passengers who chose to be put into suspended animation spent nothing. This makes sense, as they would be asleep throughout the journey.

We also assume that those without a cabin cannot be put into suspended animation.

In [None]:
cs = combine.copy()

In [None]:
cs['CryoSleep'].isnull().sum()

In [None]:
def fill_sleep(x):
    if pd.isnull(x['CryoSleep']):
        if x['totalspend'] > 0:
            return False
        elif pd.isnull(x['Cabin']):
            return False
    else:
        return x['CryoSleep']

In [None]:
cs['CryoSleep'] = cs.apply(fill_sleep, axis = 1)

In [None]:
cs['CryoSleep'].isnull().sum()
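
The two rules in `fill_sleep` can also be expressed as boolean masks. This is a minimal sketch on made-up rows that follows the same assumptions as the function above.

```python
import pandas as pd

toy = pd.DataFrame({
    "CryoSleep": [True, None, None, None],
    "totalspend": [0.0, 120.0, 0.0, 0.0],
    "Cabin": ["B/0/P", "F/1/S", None, "G/3/S"],
})

missing = toy["CryoSleep"].isna()
# Rule 1: any spending means the passenger was awake.
# Rule 2: no cabin record is assumed to mean no cryosleep.
awake = (toy["totalspend"] > 0) | toy["Cabin"].isna()
toy.loc[missing & awake, "CryoSleep"] = False

# The last row (zero spend, cabin known) matches neither rule and stays missing.
print(toy)
```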

As there are no further reasonable assumptions to make, we fill the remaining missing values with False, the value held by the larger proportion of passengers.

In [None]:
cs['CryoSleep'] = cs['CryoSleep'].fillna('False')

In [None]:
cs['CryoSleep'].isnull().sum()

In [None]:
def conv_false(x):
    if x == 'False':
        return False
    else:
        return x

In [None]:
cs['CryoSleep'] = cs['CryoSleep'].apply(conv_false)

In [None]:
cs = cs[['PassengerId', 'CryoSleep']]

### Destination

In [None]:
dest = combine.copy()

In [None]:
dest['Destination'].isnull().sum()

First, we assume that groups travelling together will have the same destination.

In [None]:
def fill_dest(x):
    
    if pd.isnull(x['Destination']): # check if Destination for the row entry is missing
        gid = x['groupId'] # get the groupId for the row entry with missing Destination
        df = dest[dest['groupId'] == gid] # index the dataframe to only select entries with the same groupId
        if (len(df) > 1) & (df['Destination'].notnull().sum() > 0):
            return df['Destination'].mode()[0]
        
    else:
        return x['Destination']

In [None]:
dest['Destination'] = dest.apply(fill_dest, axis = 1)

In [None]:
dest['Destination'].isnull().sum()

In [None]:
dest[dest['Destination'].isnull()]

In [None]:
dest.groupby('HomePlanet')['Destination'].value_counts()

As there are no further reasonable assumptions to make, we fill the remaining missing values with the most popular destination (TRAPPIST-1e).

In [None]:
dest['Destination'] = dest['Destination'].fillna('TRAPPIST-1e')

In [None]:
dest = dest[['PassengerId', 'Destination']]

### Age

In [None]:
sns.boxplot(data = combine, x = 'deck', y = 'Age')

In [None]:
age = cb1.copy()

In [None]:
age.head()

In [None]:
age['Age'] = age['Age'].fillna(age.groupby('deck')['Age'].transform('median'))
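
The per-deck median fill leaves a value missing if a deck has no known ages at all. A possible safeguard, shown here on made-up data, is to chain a second fill with the overall median. Whether this case occurs in the real data is not checked here.

```python
import pandas as pd

toy = pd.DataFrame({
    "deck": ["A", "A", "A", "B", "B", "Z"],
    "Age": [30.0, 40.0, None, 20.0, None, None],
})

# Median age per deck, aligned row by row with the original frame.
deck_median = toy.groupby("deck")["Age"].transform("median")

# Deck "Z" has no known ages here, so its median is NaN;
# the second fillna falls back to the overall median of the known ages.
toy["Age"] = toy["Age"].fillna(deck_median).fillna(toy["Age"].median())
print(toy)
```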

In [None]:
age = age[['PassengerId', 'Age']]

## Combining the Datasets

In [None]:
col_to_drop

In [None]:
combine = combine[['PassengerId','RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
                  'Name', 'Transported', 'groupId', 'id', 'firstname', 'lastname','totalspend']]

In [None]:
final = pd.merge(combine, hp, how = 'left', on = 'PassengerId')
final = pd.merge(final, cb, how = 'left', on = 'PassengerId')
final = pd.merge(final, vip, how = 'left', on = 'PassengerId')
final = pd.merge(final, cs, how = 'left', on = 'PassengerId')
final = pd.merge(final, dest, how = 'left', on = 'PassengerId')
final = pd.merge(final, age, how = 'left', on = 'PassengerId')
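
Each of these merges relies on `PassengerId` being unique on both sides. Otherwise a left join silently duplicates rows. The `validate` argument of `pd.merge` can make that assumption explicit, as in this sketch on made-up frames.

```python
import pandas as pd

left = pd.DataFrame({"PassengerId": ["0001_01", "0002_01"], "totalspend": [0.0, 736.0]})
right = pd.DataFrame({"PassengerId": ["0001_01", "0002_01"], "HomePlanet": ["Europa", "Earth"]})

# validate raises pandas.errors.MergeError if PassengerId repeats on either side.
merged = pd.merge(left, right, how="left", on="PassengerId", validate="one_to_one")
print(merged)

# A duplicated key on the right side is rejected instead of inflating the row count.
dup = pd.concat([right, right], ignore_index=True)
try:
    pd.merge(left, dup, how="left", on="PassengerId", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```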

## Additional Feature Engineering

### Mean, Median, Std of Spending

In [None]:
final.head()

In [None]:
final['mean_spend'] = final[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].mean(axis = 1)
final['median_spend'] = final[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].median(axis = 1)
final['std_spend'] = final[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].std(axis = 1)

### Spending Pattern

In [None]:
final['food'] = final['RoomService'] + final['FoodCourt']
final['luxury'] = final['ShoppingMall'] + final['Spa'] + final['VRDeck']

### HomePlanet Destination Relationship

In [None]:
final['Destination'].unique()

In [None]:
def home_dest(x):
    if (x['HomePlanet'] == 'Europa') & (x['Destination'] == 'TRAPPIST-1e'):
        return 'EUTR'
    elif (x['HomePlanet'] == 'Europa') & (x['Destination'] == 'PSO J318.5-22'):
        return 'EUPS'
    elif (x['HomePlanet'] == 'Europa') & (x['Destination'] == '55 Cancri e'):
        return 'EU55'
    elif (x['HomePlanet'] == 'Earth') & (x['Destination'] == 'TRAPPIST-1e'):
        return 'EATR'
    elif (x['HomePlanet'] == 'Earth') & (x['Destination'] == 'PSO J318.5-22'):
        return 'EAPS'
    elif (x['HomePlanet'] == 'Earth') & (x['Destination'] == '55 Cancri e'):
        return 'EA55'
    elif (x['HomePlanet'] == 'Mars') & (x['Destination'] == 'TRAPPIST-1e'):
        return 'MATR'
    elif (x['HomePlanet'] == 'Mars') & (x['Destination'] == 'PSO J318.5-22'):
        return 'MAPS'
    elif (x['HomePlanet'] == 'Mars') & (x['Destination'] == '55 Cancri e'):
        return 'MA55'
    

In [None]:
final['home_dest'] = final.apply(home_dest, axis = 1)
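
Every code returned by `home_dest` is the first two characters of the home planet followed by the first two characters of the destination, upper-cased. Under that observation the nine branches can be sketched as a single string expression, shown here on made-up rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "Destination": ["TRAPPIST-1e", "PSO J318.5-22", "55 Cancri e"],
})

# First two characters of each value, concatenated and upper-cased.
toy["home_dest"] = (toy["HomePlanet"].str[:2] + toy["Destination"].str[:2]).str.upper()
print(toy["home_dest"].tolist())
```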

### Group Size

In [None]:
group_size = final.groupby('groupId').size().reset_index().rename({0: 'groupsize'}, axis = 1)

In [None]:
final = pd.merge(final, group_size, how = 'left', on = 'groupId')

### Family Size

In [None]:
family = final.groupby(['groupId', 'lastname']).size().reset_index().rename({0: 'familysize'}, axis = 1)

In [None]:
final = pd.merge(final, family, how = 'left', on = ['groupId', 'lastname'])

In [None]:
final['familysize'] = final['familysize'].fillna(1)
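
The group-size column built above with `size()` and a merge can also be produced in one step with a grouped `transform`, which returns a value per original row. This is a minimal sketch on made-up rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"],
    "groupId": ["0001", "0002", "0002", "0003"],
})

# transform("size") broadcasts each group's row count back onto its members.
toy["groupsize"] = toy.groupby("groupId")["PassengerId"].transform("size")
print(toy)
```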

### Alone

In [None]:
def alone(x):
    if x == 1:
        return 1
    else:
        return 0

In [None]:
final['alone'] = final['familysize'].apply(alone)

### Encoding for Categorical Columns

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
final['HomePlanet'] = le.fit_transform(final['HomePlanet'])
final['deck'] = le.fit_transform(final['deck'])
final['num'] = le.fit_transform(final['num'])
final['side'] = le.fit_transform(final['side'])
final['Destination'] = le.fit_transform(final['Destination'])
final['VIP'] = le.fit_transform(final['VIP'])
final['home_dest'] = le.fit_transform(final['home_dest'])
final['CryoSleep'] = le.fit_transform(final['CryoSleep'])
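
Because the single `le` object is refit for every column, only the last column's mapping remains available afterwards. If the mappings need to be kept, one alternative (swapped in here for illustration, and not what the notebook uses) is `pd.factorize` with `sort=True`, which returns both the integer codes and the sorted categories.

```python
import pandas as pd

toy = pd.DataFrame({"deck": ["B", "A", "Z", "B"]})

# codes are positions into uniques; sort=True orders the categories alphabetically.
codes, uniques = pd.factorize(toy["deck"], sort=True)
toy["deck_code"] = codes

# Keeping uniques makes it possible to map codes back to labels later.
decoded = uniques.take(codes)
print(toy)
print(list(uniques))
```

Note that integer codes imply an ordering between categories. Tree-based models usually tolerate this, while linear models may not.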

In [None]:
# the first len(train) rows of the combined frame are the train set (8693 rows)
train_final = final[:len(train)]
test_final = final[len(train):]

In [None]:
train_final.columns

In [None]:
def conv_binary(x):
    if x:
        return 1
    else:
        return 0

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# setting up train test data using k-fold cross validation
from sklearn.model_selection import KFold

kf = KFold(n_splits = 10, shuffle = True, random_state = 1)

features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa',
            'VRDeck',  'id', 'totalspend', 'HomePlanet', 'deck', 'num', 'side', 'VIP',
            'CryoSleep', 'Destination', 'Age', 'mean_spend', 'median_spend',
            'std_spend', 'food', 'luxury', 'home_dest', 'groupsize', 'familysize',
            'alone']

X = train_final[features]
y = train_final['Transported']
y = y.apply(conv_binary)

rfc = RandomForestClassifier(random_state = 1)
boost = xgb.XGBClassifier(n_estimators = 350, learning_rate = 0.1, max_depth = 8, colsample_bytree = 0.8, gamma = 6,
                          random_state = 1, verbosity = 0, use_label_encoder=False)

model = boost

k_fold_auc = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test).astype(int)
    auc = roc_auc_score(y_test, y_pred)
    k_fold_auc.append(auc)

final_auc = np.mean(k_fold_auc)
print(final_auc)
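
The loop above passes hard class predictions to `roc_auc_score`. AUC is normally computed from scores or probabilities, and the Kaggle leaderboard for this competition scores classification accuracy, so it may be useful to track both. The sketch below shows that pattern on synthetic data, with a `RandomForestClassifier` standing in for the XGBoost model and a stratified split swapped in for `KFold`. None of these numbers relate to the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the engineered feature matrix and binary target.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=1)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

aucs, accs = [], []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    clf = RandomForestClassifier(n_estimators=50, random_state=1)
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    # Probability of the positive class, used for AUC.
    proba = clf.predict_proba(X_toy[test_idx])[:, 1]
    aucs.append(roc_auc_score(y_toy[test_idx], proba))
    # Thresholded predictions, used for accuracy.
    accs.append(accuracy_score(y_toy[test_idx], (proba >= 0.5).astype(int)))

mean_auc, mean_acc = np.mean(aucs), np.mean(accs)
print(mean_auc, mean_acc)
```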

In [None]:
len(model.feature_importances_)

In [None]:
(model.feature_importances_)

In [None]:
final_model = xgb.XGBClassifier(n_estimators = 350, learning_rate = 0.1, max_depth = 8, colsample_bytree = 0.8, gamma = 6,
                          random_state = 1, verbosity = 0, use_label_encoder=False)

final_model.fit(X, y)

In [None]:
y_pred = final_model.predict(test_final[features]).astype(int)

In [None]:
def conv_TF(x):
    final = []
    for i in x:
        if i == 1:
            final.append(True)
        else:
            final.append(False)
    return final

In [None]:
y_pred = conv_TF(y_pred)
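
The element-by-element conversion in `conv_TF` can also be done with a single NumPy cast, since the predictions are an integer array of 0s and 1s. This is a minimal sketch on a made-up array.

```python
import numpy as np

# Made-up predictions in the same 0/1 integer form as the model output.
y_pred_toy = np.array([1, 0, 1])

# Casting to bool maps 1 to True and 0 to False.
as_bool = y_pred_toy.astype(bool)
print(as_bool.tolist())
```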

In [None]:
submission = pd.DataFrame({"PassengerId": test_final["PassengerId"], "Transported": y_pred})
submission.to_csv('./submission.csv', index=False)