### Dataset Description

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

### File and Data Field Descriptions

- train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    - PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp, where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
    - HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
    - CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    - Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
    - Destination - The planet the passenger will be debarking to.
    - Age - The age of the passenger.
    - VIP - Whether the passenger has paid for special VIP service during the voyage.
    - RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
    - Name - The first and last names of the passenger.
    - Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
- test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
- sample_submission.csv - A submission file in the correct format.
    - PassengerId - Id for each passenger in the test set.
    - Transported - The target. For each passenger, predict either True or False.
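
The `gggg_pp` and `deck/num/side` formats described above can be unpacked into separate columns. The following is a minimal sketch on a toy frame, not part of the original notebook; the column names `Group`, `NumInGroup`, `Deck`, `CabinNum`, `Side` and `GroupSize` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy rows in the documented formats (illustrative values, not the real data)
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
})

# gggg_pp -> group id and number within the group
toy[["Group", "NumInGroup"]] = toy["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three columns; a missing Cabin stays missing in all three
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

# Group size: how many passengers share the same gggg prefix
toy["GroupSize"] = toy.groupby("Group")["PassengerId"].transform("count")
print(toy)
```

Keeping the pieces as separate columns makes it possible to group by deck, side, or travel group independently.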

In [1]:
# Import Libraries

# Data wrangling
import pandas as pd
import numpy as np
from collections import Counter

# Data visualisation
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

In [13]:
# Import Files
train = pd.read_csv("train.csv")
test = pd.read_csv('test.csv')

In [3]:
# Inspect
train.head(3)

|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |


In [None]:
test.head(3)

In [14]:
# Extract the cabin group (deck + side) from a Cabin value, e.g. "B/0/P" -> "BP"
def cabin_split(x):
    # Missing cabins are NaN (a float), so .split raises AttributeError;
    # the except branch returns them unchanged
    try:
        u = x.split('/')
        return str(u[0] + u[2])
    except AttributeError:
        return x

train["cabin_grp"] = train.Cabin.apply(cabin_split)
test["cabin_grp"] = test.Cabin.apply(cabin_split) 
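
The same deck-plus-side label can also be built without `apply`, using pandas string methods. This is a sketch on toy values rather than the notebook's method; the variable names `cabins`, `parts` and `cabin_grp` are illustrative.

```python
import pandas as pd

# Toy Cabin values in the deck/num/side format; None stands in for a missing cabin
cabins = pd.Series(["B/0/P", "F/0/S", None, "G/12/S"])

# Split once into three columns: 0 = deck, 1 = num, 2 = side
parts = cabins.str.split("/", expand=True)

# Concatenate deck and side; rows with a missing cabin stay missing
cabin_grp = parts[0].str.cat(parts[2])
print(cabin_grp.tolist())
```

The vectorised form avoids the try/except, because the `.str` accessor propagates missing values on its own.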


In [None]:
train.head(2)

In [None]:
test.head(2)

In [None]:
train.isnull().sum().sort_values(ascending=False)

In [None]:
test.isnull().sum().sort_values(ascending=False)

## EXPLORATORY DATA ANALYSIS


### Categorical Variables: HomePlanet, CryoSleep, Destination, VIP


##### 1. HomePlanet

In [None]:
train.HomePlanet.value_counts(dropna=False)

In [None]:
#Mean "Transported" by HomePlanet

train[['HomePlanet', 'Transported']].groupby('HomePlanet', as_index=False).mean().sort_values(by= 'Transported', ascending=False)

# Europa, Mars, Earth

In [None]:
sns.barplot(x = 'HomePlanet', y ='Transported', data = train)
plt.ylabel('Transported Probability')
plt.title('Transported Probability by Home Planet')

# Comment: Passengers from Europa are the most likely to be transported, followed by Mars, then Earth

##### 2. Destination

In [None]:
train['Destination'].value_counts(dropna=False)

In [None]:
# Mean "Transported" by Destination
train[["Destination", "Transported"]].groupby('Destination', as_index=False).mean().sort_values(by="Transported", ascending=False)

# 55 Cancri e, PSO J318.5-22, TRAPPIST-1e

In [None]:
sns.barplot(x="Destination", y="Transported", data=train)
plt.ylabel('Transported Probability')
plt.title('Transported Probability by Destination')

##### 3. CryoSleep

In [None]:
train.CryoSleep.value_counts(dropna=False)

In [None]:
# Mean "Transported" by CryoSleep

train[['CryoSleep', 'Transported']].groupby("CryoSleep", as_index=False).mean().sort_values(by="Transported", ascending=False)

# True, False

In [None]:
sns.barplot(x="CryoSleep", y="Transported", data=train)
plt.ylabel("Transported Probability")
plt.title("Transported Probability by CryoSleep")

##### 4. VIP

In [None]:
train.VIP.value_counts(dropna=False)

In [None]:
# Mean "Transported" by VIP 

train[['VIP', 'Transported']].groupby("VIP", as_index=False).mean().sort_values(by="Transported", ascending=False)

# False, True

In [None]:
sns.barplot(x="VIP", y="Transported", data=train)
plt.ylabel("Transported Probability")
plt.title("Transported Probability by VIP")
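
The four per-column summaries above follow the same pattern, so they can be folded into one helper. The sketch below runs on a made-up frame standing in for `train`; `transported_rate` is a hypothetical helper name, and it also reports the row count per category so that rates based on very few passengers are easy to spot.

```python
import pandas as pd

# Toy stand-in for `train`; the values are made up for illustration only
toy = pd.DataFrame({
    "HomePlanet": ["Europa", "Earth", "Earth", "Mars"],
    "CryoSleep": [True, False, False, True],
    "Transported": [True, False, True, True],
})

def transported_rate(df, col):
    """Share of transported passengers and row count per category of `col`."""
    return (df.groupby(col)["Transported"]
              .agg(rate="mean", n="count")
              .sort_values("rate", ascending=False))

for col in ["HomePlanet", "CryoSleep"]:
    print(transported_rate(toy, col), end="\n\n")
```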

### FILL MISSING TEST VALUES

##### Based on the EDA done earlier by Rasheed, the functions below fill missing values conditionally

In [19]:
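# Fill missing target_column where cond_column1 >= cond_value1 (a numeric threshold) and cond_column2 == cond_value2
def fill_missing_1(data, target_column: str, cond_column1: str, cond_column2: str, cond_value1: float, cond_value2, fill):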
    common= data[target_column].isna()
    condition= [(data[cond_column1]>= cond_value1) & (data[cond_column2]== cond_value2) & (common)]
    fill_with= [fill]
    data[target_column]= np.select(condition, fill_with, default= data[target_column].values)

# Fill missing target_column where cond_column <= cond_value
# (used for ShoppingMall with Age <= 12 and VIP with Age <= 20)
def fill_missing_2(data, target_column: str, cond_column: str, cond_value:int, fill):
    common= data[target_column].isna()
    cond= [(data[cond_column] <= cond_value) &(common)]
    fill_with= [fill]
    data[target_column]= np.select(cond, fill_with, default= data[target_column].values)

def fill_missing_3(data, target_column: str, cond_column1: str, cond_column2: str, cond_value1: str, cond_value2, fill):
    common= data[target_column].isna()
    condition= [(data[cond_column1]== cond_value1) & (data[cond_column2]== cond_value2) & (common)]
    fill_with= [fill]
    data[target_column]= np.select(condition, fill_with, default= data[target_column].values)

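def fill_missing_4(data, target_column: str, cond_column1: str,  cond_value1: str, fill):
    """
    Fill missing values of target_column where cond_column1 equals cond_value1.

    Arguments: data: dataframe, target_column: column to be filled,
    cond_column1: column to test, cond_value1: value it must equal, fill: fill value.
    """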
    common= data[target_column].isna()
    condition= [(data[cond_column1]== cond_value1)  & (common)]
    fill_with= [fill]
    data[target_column]= np.select(condition, fill_with, default= data[target_column].values)
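
The four functions differ only in the condition they test, so one way to express them is a single helper that takes a precomputed boolean mask. This is a sketch under that assumption, not the notebook's code; `fill_where` is a hypothetical name and the toy frame is made up.

```python
import pandas as pd

def fill_where(data, target_column, mask, fill):
    """Fill missing values of `target_column` in place on rows where `mask` is True."""
    rows = data[target_column].isna() & mask
    data.loc[rows, target_column] = fill

toy = pd.DataFrame({
    "HomePlanet": [None, None, "Mars"],
    "Age": [45.0, 30.0, 50.0],
    "cabin_grp": ["GS", "GS", "GS"],
})

# Same rule as fill_missing_1(toy, 'HomePlanet', 'Age', 'cabin_grp', 40, 'GS', 'Earth')
fill_where(toy, "HomePlanet", (toy["Age"] >= 40) & (toy["cabin_grp"] == "GS"), "Earth")
print(toy["HomePlanet"].tolist())
```

Only the first row is filled: the second fails the age condition and the third already has a value.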

In [20]:
# Check whether train and test have the same number of unique cabin_grp values

len(train.cabin_grp.unique()) == len(test.cabin_grp.unique())

True
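
Equal counts do not guarantee that the two sets contain the same labels. A stricter check compares the sets themselves, sketched here on made-up values; `train_grp` and `test_grp` stand in for `train.cabin_grp` and `test.cabin_grp`.

```python
import pandas as pd

# Made-up values: equal numbers of unique groups, but different members
train_grp = pd.Series(["AP", "BS", "TP", None])
test_grp = pd.Series(["AP", "BS", "GS", None])

# nunique() ignores missing values by default
same_count = train_grp.nunique() == test_grp.nunique()

# Labels that appear in one set but not the other
only_train = set(train_grp.dropna()) - set(test_grp.dropna())
only_test = set(test_grp.dropna()) - set(train_grp.dropna())
print(same_count, only_train, only_test)
```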

In [21]:
### For ages 40 and above with cabin_grp AP, BP, BS, CS, CP, HomePlanet is Europa
### For ages 40 and above with cabin_grp GS, GP, HomePlanet is Earth
for grp in ["AP","BP", "BS", "CS" , "CP", "GS", "GP"]:
    if grp in ["GS", "GP"]:
        fill_missing_1(train, 'HomePlanet', "Age", 'cabin_grp', 40, grp, 'Earth')
        fill_missing_1(test, 'HomePlanet', "Age", 'cabin_grp', 40, grp, 'Earth')
    else:
        fill_missing_1(train, 'HomePlanet', "Age", 'cabin_grp', 40, grp, 'Europa')
        fill_missing_1(test, 'HomePlanet', "Age", 'cabin_grp', 40, grp, 'Europa')


In [22]:
# Fill ShoppingMall for Age <= 12 and VIP for Age <= 20
for data in [train, test]:
    fill_missing_2(data, 'ShoppingMall', 'Age', 12, 0)
    fill_missing_2(data, 'VIP', 'Age', 20, False)

In [23]:
for data in [train, test]:
    fill_missing_3(data, 'HomePlanet', 'cabin_grp', 'Destination', 'ES', 'TRAPPIST-1e', 'Mars')
    fill_missing_3(data, 'HomePlanet', 'cabin_grp', 'Destination', 'ES', 'PSO J318.5-22', 'Earth')
    fill_missing_3(data, 'HomePlanet', 'cabin_grp', 'Destination', 'ES', '55 Cancri e', 'Europa')
    fill_missing_3(data, 'HomePlanet', 'cabin_grp', 'Destination', 'DS', '55 Cancri e', 'Europa')
    fill_missing_3(data, 'HomePlanet', 'cabin_grp', 'Destination', 'DP', '55 Cancri e', 'Europa')


    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'AS', 'Europa')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'AP', 'Europa')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'BS', 'Europa')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'BP', 'Europa')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'CS', 'Europa')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'CP', 'Europa')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'TP', 'Europa')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'FS', 'Earth')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'GS', 'Earth')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'GP', 'Earth')
    fill_missing_4(data, 'HomePlanet', 'cabin_grp', 'EP', 'Earth')

In [24]:
for data in [train, test]:
    data['HomePlanet']= data['HomePlanet'].fillna('Mars')

    fill_missing_4(data, 'CryoSleep', 'cabin_grp', 'BS', True)
    fill_missing_3(data, 'CryoSleep', 'cabin_grp', 'Destination', 'GP', '55 Cancri e', True )
    fill_missing_3(data, 'CryoSleep', 'cabin_grp', 'Destination', 'GS', '55 Cancri e', True )
    ## fill the remaining missing values with False
    data['CryoSleep'] = data['CryoSleep'].fillna(False)

    ## fill the missing VIP values with False
    data['VIP']= data['VIP'].fillna(False)

    ## fill Destination with TRAPPIST-1e
    data['Destination']= data['Destination'].fillna('TRAPPIST-1e')


## Drop rows with no cabin_grp
train = train[train.cabin_grp.notnull()]

In [25]:
train.isnull().sum().sort_values(ascending=False)

Name            198
ShoppingMall    186
VRDeck          184
Spa             181
FoodCourt       178
RoomService     177
Age             175
PassengerId       0
HomePlanet        0
CryoSleep         0
Cabin             0
Destination       0
VIP               0
Transported       0
cabin_grp         0
dtype: int64

In [26]:
test.isnull().sum().sort_values(ascending=False)

FoodCourt       106
Spa             101
Cabin           100
cabin_grp       100
Name             94
ShoppingMall     92
Age              91
RoomService      82
VRDeck           80
PassengerId       0
HomePlanet        0
CryoSleep         0
Destination       0
VIP               0
dtype: int64

In [30]:
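# Fill each numeric column with the median of its (HomePlanet, cabin_grp, Destination) group.
# Rows with a missing cabin_grp belong to no group, so these columns come back as NaN for them,
# which is why 100 missing values remain per column in the test output that follows.
for data in [train, test]:
    for col in ['Spa', 'VRDeck', 'ShoppingMall', 'RoomService', 'Age', 'FoodCourt']:
        data[col]= data.groupby(['HomePlanet','cabin_grp', 'Destination'])[col].apply(lambda x: x.fillna(x.median()))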
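
An alternative way to write a group-median fill is to compute the medians with `transform` and pass them to `fillna`, so that only the gaps are touched and rows outside every group keep their observed values. The sketch below uses a made-up frame with two grouping keys instead of three; the fallback to the overall median is an added assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Mars"],
    "cabin_grp": ["GS", "GS", None, "FP"],
    "Spa": [10.0, np.nan, 7.0, np.nan],
})

keys = ["HomePlanet", "cabin_grp"]
# Per-group median, aligned to the original row index
group_median = toy.groupby(keys)["Spa"].transform("median")

# Fill only the gaps, so the observed value in row 2 (missing cabin_grp) is kept;
# fall back to the overall median where the group has no observed value (row 3)
toy["Spa"] = toy["Spa"].fillna(group_median).fillna(toy["Spa"].median())
print(toy["Spa"].tolist())
```

Row 1 takes its group's median, row 2 keeps its own value, and row 3 takes the overall median of the observed values.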

In [32]:
test.isnull().sum().sort_values(ascending=False)

Cabin           100
Age             100
RoomService     100
FoodCourt       100
ShoppingMall    100
Spa             100
VRDeck          100
cabin_grp       100
Name             94
PassengerId       0
HomePlanet        0
CryoSleep         0
Destination       0
VIP               0
dtype: int64

In [33]:
test.shape

(4277, 14)