# Spaceship Titanic

## The Problem

Welcome to the year 2912; data science skills are needed to solve a cosmic mystery. 
The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.
While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly to an alternative dimension using records recovered from the spaceship’s damaged computer system.
Help save them and change history!


In [None]:
pip install pycaret

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pycaret.datasets import get_data
from pycaret.classification import *
import missingno
import warnings
warnings.filterwarnings('ignore')

## Data Preparation

In [None]:
test = pd.read_csv("../input/spaceship-titanic/test.csv")
train = pd.read_csv("../input/spaceship-titanic/train.csv")

In [None]:
train.head(3)

* **train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
* **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
* **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* **Destination** - The planet the passenger will be debarking to.
* **Age** - The age of the passenger.
* **VIP** - Whether the passenger has paid for special VIP service during the voyage.
* **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* **Name** - The first and last names of the passenger.
* **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
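
The two structured fields described above, PassengerId (gggg_pp) and Cabin (deck/num/side), can be unpacked with plain string splits. The following is a minimal sketch on made-up records; the toy values and the new column names (Group, NumberInGroup, Deck, CabinNum, Side) are assumptions for illustration only.

```python
import pandas as pd

# Toy records that follow the documented formats; the values are made up
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/1/S", None],
})

# gggg_pp -> group id and position within the group
toy[["Group", "NumberInGroup"]] = toy["PassengerId"].str.split("_", expand=True)

# deck/num/side -> three columns; a missing Cabin stays missing in all three
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```

Passengers that share a Group value travel together, which is the basis for a group-level feature later on.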

In [None]:
test.head(3)

In [None]:
train.info()

### Missing Value Analysis

Missing data and missing-data patterns:
This section examines how much data is missing in the train and test datasets and how the gaps are distributed.

In [None]:
train.isnull().sum().sort_values(ascending=False)/len(train)

In [None]:
missingno.matrix(train,figsize=(10,5), fontsize=9)
plt.title("Missing Value Distribution in Train Dataset");

The matrix above shows missing values in every column except PassengerId and the target variable Transported. The gaps are also not confined to the same set of observations. Although each column is missing less than 3% of its values, dropping every row with a missing value using dropna() would discard a substantial share of the dataset. The missing data will therefore be imputed feature by feature, using the method best suited to each, inside the wrangle function.
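
To see why small per-column gaps add up, here is a rough simulation on synthetic data. It assumes 12 columns that each lose about 2% of their values independently of one another; the row count, column count and rate are rounded assumptions, not figures measured on the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols, missing_rate = 8700, 12, 0.02  # assumed, rounded figures

# Synthetic frame with values knocked out independently in every column
values = rng.normal(size=(n_rows, n_cols))
knocked_out = rng.random((n_rows, n_cols)) < missing_rate
toy = pd.DataFrame(values).mask(knocked_out)

per_column = toy.isnull().mean()             # share missing in each column
rows_lost = toy.isnull().any(axis=1).mean()  # share of rows dropna() would remove

print("largest per-column share:", per_column.round(3).max())
print("share of rows lost to dropna():", round(rows_lost, 3))
print("expected under independence:", round(1 - (1 - missing_rate) ** n_cols, 3))
```

The same two expressions, `isnull().mean()` and `isnull().any(axis=1).mean()`, can be run on the real train frame to get the actual figures.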

In [None]:
test.isnull().sum().sort_values(ascending=False)/len(test)

In [None]:
missingno.matrix(test,figsize=(10,5), fontsize=9)
plt.title("Missing Value Distribution in Test Dataset");

### Identify & Handle Outliers

In [None]:
train.describe()

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
sns.boxplot(x= train["Age"], ax=ax[0])
ax[1].hist(train["Age"].dropna());

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
sns.boxplot(x= train["RoomService"], ax=ax[0])
ax[1].hist(train["RoomService"].dropna());

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
sns.boxplot(x= train["FoodCourt"], ax=ax[0])
ax[1].hist(train["FoodCourt"].dropna());

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
sns.boxplot(x= train["ShoppingMall"], ax=ax[0])
ax[1].hist(train["ShoppingMall"].dropna());

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
sns.boxplot(x= train["Spa"], ax=ax[0])
ax[1].hist(train["Spa"].dropna());

In [None]:
fig,ax = plt.subplots(1,2, figsize=(15,6))
sns.boxplot(x= train["VRDeck"], ax=ax[0])
ax[1].hist(train["VRDeck"].dropna());

A close look at the data shows that few passengers spent money on luxury services. Those who did look like outliers, but the extreme values will not be removed in this analysis: some passengers typically spend large sums on services aboard cruise ships, and a similar trend appears here. We will examine whether this set of passengers has other attributes in common.
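
One way to see why the spenders show up as outliers in the boxplots is to apply the usual 1.5 × IQR rule to a zero-heavy column. This is a sketch on a made-up spend column (the numbers are invented); when most values are zero, both quartiles are zero and every positive amount lands above the upper fence.

```python
import pandas as pd

# Made-up spend column: most passengers spend nothing, a few spend a lot
spend = pd.Series([0] * 80 + [50, 120, 300, 900, 2500] * 4, name="Spa")

q1, q3 = spend.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

flagged = spend[spend > upper_fence]
print(f"Q1={q1}, Q3={q3}, upper fence={upper_fence}")
print(f"{len(flagged)} of {(spend > 0).sum()} spenders fall above the fence")
```

Removing everything above the fence would therefore remove all spending behaviour, which supports keeping the extreme values.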


### Identify Features with High Cardinality and Multicollinearity

In [None]:
train.select_dtypes("object").nunique()

The data was analyzed for cardinality. The Cabin column can be split to expose the deck and the port or starboard side of the ship, which could plausibly affect a passenger's outcome. PassengerId and Name have high cardinality and will be dropped. The Name column was also examined as a way to determine each passenger's gender, but no reliable method was found to infer gender from the name.
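
A cardinality check can be expressed as the ratio of unique values to rows. The sketch below uses a tiny made-up frame, and the 0.9 cut-off is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Made-up records; only the shape of the data matters here
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"],
    "Name": ["A B", "C D", "E D", "F G"],
    "HomePlanet": ["Earth", "Europa", "Europa", "Earth"],
})

# Share of distinct values per column; values near 1.0 behave like identifiers
cardinality = toy.nunique() / len(toy)
high = cardinality[cardinality > 0.9].index.tolist()  # 0.9 is an assumed cut-off

print(cardinality)
print("high-cardinality columns:", high)
```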

In [None]:
correlation = train.select_dtypes("number").corr()
correlation

In [None]:
sns.heatmap(correlation);

There are no strong correlations between any two feature variables, so no column will be dropped for multicollinearity in this analysis.
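
Reading a heatmap by eye can be backed up by listing the feature pairs whose absolute correlation exceeds a threshold. This is a sketch on synthetic columns; the 0.8 cut-off and the column names a, b and c are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                 # unrelated to the others
})

corr = toy.corr().abs()
# Keep the upper triangle only so that each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs > 0.8]  # 0.8 is an assumed cut-off

print(strong)
```

An empty result on the real numeric columns would match the conclusion that no column needs to be dropped.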

A wrangle function is defined to centralize the data cleaning process.

In [None]:
def wrangle(filepath):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(filepath)

    # Split Cabin (deck/num/side) into three columns, then drop the original
    df[["deck", "Num", "Side"]] = df["Cabin"].str.split("/", expand=True)
    df = df.drop(columns="Cabin")

    # Flag passengers whose group id (the gggg part of PassengerId) is shared
    group = df["PassengerId"].str.split("_", expand=True)[0]
    df["PaxInGroup"] = group.duplicated(keep=False)

    # Impute Age with the mean and the spend columns with the median
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
    df[spend_cols] = df[spend_cols].fillna(df[spend_cols].median())

    # Fill the remaining gaps with each column's most frequent value
    df = df.fillna(df.mode().iloc[0])

    # Cast the True/False flags to bool dtype
    df["CryoSleep"] = df["CryoSleep"].astype(bool)
    df["VIP"] = df["VIP"].astype(bool)

    # Drop columns with high cardinality
    df = df.drop(columns=["Name", "Num"])

    return df

In [None]:
data = wrangle("../input/spaceship-titanic/train.csv")
data.head(3)
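
Two steps inside wrangle are easy to misread: the group flag and the median imputation of the spend columns. The sketch below replays both on a made-up four-row frame (the values are invented for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"],
    "Spa": [0.0, None, 0.0, 4000.0],
})

# The group id is the part before the underscore; keep=False marks every
# member of a repeated group, not just the second and later occurrences
group = toy["PassengerId"].str.split("_", expand=True)[0]
toy["PaxInGroup"] = group.duplicated(keep=False)

# On a skewed spend column the median stays at the typical value,
# while the mean is pulled up by the few large spenders
print("median:", toy["Spa"].median(), "mean:", round(toy["Spa"].mean(), 2))
toy["Spa"] = toy["Spa"].fillna(toy["Spa"].median())

print(toy)
```

Note that the flag is computed within a single file, so a group is only detected when its members appear in the same CSV.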

In [None]:
data.drop(columns="PassengerId",inplace=True)

In [None]:
data.info()

## Exploring Relationships Between Categorical Features

In [None]:
data.select_dtypes("object").nunique()

In [None]:
sns.countplot(data= data, x="deck")
plt.title("Distribution of Pax by Deck");

In [None]:
sns.countplot(hue="Transported", x="deck", data=data)
plt.title('Number of passengers transported by deck');

In [None]:
sns.countplot(data= data, x="Side");

In [None]:
sns.countplot(hue="Transported", x="Side", data=data)
plt.title('Number of passengers transported by Side');

In [None]:
sns.countplot(data= data, x="Destination");

In [None]:
sns.countplot(hue="Transported", x="Destination", data=data)
plt.title('Number of passengers transported by Destination');

In [None]:
sns.countplot(data= data, x="HomePlanet");

In [None]:
sns.countplot(hue="Transported", x="HomePlanet", data=data)
plt.title('Number of passengers transported by HomePlanet');

In [None]:
sns.countplot(data= data, x="CryoSleep");

In [None]:
sns.countplot(hue="Transported", x="CryoSleep", data=data)
plt.title('Number of passengers transported by CryoSleep');

In [None]:
sns.countplot(data= data, x="VIP");

In [None]:
sns.countplot(hue="Transported", x="VIP", data=data)
plt.title('Number of passengers transported by VIP');

In [None]:
sns.countplot(data= data, x="PaxInGroup");

In [None]:
sns.countplot(hue="Transported", x="PaxInGroup", data=data)
plt.title('Distribution of Pax in Groups transported');

Several relationships exist between the plotted categorical features and the target:
* More passengers were transported from Decks B, C and G. Decks B and C have a relatively higher ratio of transported passengers than the other decks.
* Passengers on the Port side of the ship were transported in higher numbers than those on the Starboard side.
* Passengers whose destination was 55 Cancri e had a higher chance of being transported than passengers bound for other destinations.
* Passengers from HomePlanet Europa were transported to the alternate dimension at a higher ratio than passengers from the other home planets.
* CryoSleep seems to be a big factor in whether a passenger was transported. More than 75% of passengers in CryoSleep were transported.
* Finally, passengers travelling in a group (mostly friends and family) had a higher chance of being transported than those without companions.
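
Count plots show absolute numbers, while several of the observations above are about ratios. A per-category transport rate makes the ratio explicit. This is a minimal sketch on made-up data; the eight rows below are illustrative and not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "CryoSleep":   [True, True, True, False, False, False, False, False],
    "Transported": [True, True, False, True, False, False, False, True],
})

# Counts per combination, as a count plot would show them
counts = pd.crosstab(toy["CryoSleep"], toy["Transported"])

# Share of transported passengers within each CryoSleep category
rates = toy.groupby("CryoSleep")["Transported"].mean()

print(counts)
print(rates)
```

Replacing "CryoSleep" with "deck", "Side", "Destination", "HomePlanet" or "PaxInGroup" on the wrangled frame gives the corresponding rates for the other features.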

## Exploring Relationships Between Numerical Features

In [None]:
data.select_dtypes(["number", "bool"]).corr()

In [None]:
sns.heatmap(data.select_dtypes(["number", "bool"]).corr().round(decimals=2), annot=True);

The strongest relationship is between the CryoSleep feature and the target. Most of the other relationships are weak, judging by the plots and the correlation coefficients.

In [None]:
data["Transported"].value_counts(normalize=True).plot(
    kind="bar", xlabel="Transported", ylabel="Relative Frequency", title="Distribution Passengers Transported to Alt Dimension"
);

The target "Transported" is evenly distributed: each class makes up almost 50% of the dataset. This is helpful because most classification models perform well on balanced data.
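
With balanced classes, always predicting the majority class scores only about 50% accuracy, which gives a simple reference point for the models that follow. A sketch with a made-up target (the 504/496 split is invented):

```python
import pandas as pd

# Made-up target with a nearly even split
target = pd.Series([True] * 504 + [False] * 496, name="Transported")

shares = target.value_counts(normalize=True)
baseline_accuracy = shares.max()  # accuracy of always predicting the majority class

print(shares)
print("majority-class baseline:", baseline_accuracy)
```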

## Modeling

### Initialize PyCaret Environment

In [None]:
class_space = setup(data = data, target = 'Transported', train_size = 0.8,normalize = True, session_id = 3934)

In [None]:
get_config("X").head()

In [None]:
compare_models(sort='Accuracy')

In [None]:
model_cat = create_model('catboost', verbose = False)

params = {'iterations': np.arange(100, 1000, 100),
        'max_depth': np.arange(1, 10),
        'learning_rate': np.arange(0.01, 1, 0.01),
        'random_strength': np.arange(0.1, 1.0, 0.1),
        'l2_leaf_reg': np.arange(1, 100),
        'border_count': np.arange(1, 256)}

tuned_model = tune_model(model_cat, optimize = 'Accuracy', fold = 10,
            tuner_verbose = False, search_library = 'scikit-optimize',
            custom_grid = params, n_iter = 50)
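
For a sense of scale, the candidate values above can be counted as if they formed a discrete grid. This sketch only does that arithmetic; how the scikit-optimize backend actually samples the space is not covered here.

```python
import math
import numpy as np

# Same candidate values as the params dictionary in the tuning cell
params = {
    "iterations": np.arange(100, 1000, 100),
    "max_depth": np.arange(1, 10),
    "learning_rate": np.arange(0.01, 1, 0.01),
    "random_strength": np.arange(0.1, 1.0, 0.1),
    "l2_leaf_reg": np.arange(1, 100),
    "border_count": np.arange(1, 256),
}

n_iter = 50
sizes = {name: len(values) for name, values in params.items()}
total = math.prod(sizes.values())  # number of distinct combinations

print(sizes)
print(f"{total:,} combinations; {n_iter} iterations cover {n_iter / total:.1e} of them")
```

The search therefore visits only a tiny fraction of the space, so the tuned model is best read as a good candidate and not as an exhaustive optimum.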

### Making Predictions

Predictions are made on the hold-out data that PyCaret set aside during setup (20% of the rows, given train_size = 0.8).

In [None]:
predictions = predict_model(tuned_model)
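
The idea of a hold-out set can be illustrated with a plain random 80/20 split in pandas. This is only a sketch on a made-up frame; PyCaret's internal split may differ in detail (for example, it may stratify by the target), so the rows selected here are not the ones PyCaret uses.

```python
import pandas as pd

# Made-up frame with 1000 rows
toy = pd.DataFrame({"feature": range(1000), "Transported": [True, False] * 500})

train_part = toy.sample(frac=0.8, random_state=3934)  # rows used for fitting
holdout = toy.drop(train_part.index)                  # rows kept for evaluation

print(len(train_part), len(holdout))
```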

In [None]:
final_model = finalize_model(tuned_model)
save_model(final_model, 'catboost_classification_model')

In [None]:
plot_model(final_model, plot = "ks")

#### Predictions on the test dataset

In [None]:
test_data = wrangle("../input/spaceship-titanic/test.csv")
test_data.head(3)

In [None]:
test_predictions = predict_model(final_model, data=test_data)
test_predictions.head(3)

In [None]:
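# PyCaret 2.x names the prediction column "Label" (3.x uses "prediction_label");
# the submission file needs it under the target name "Transported"
submission_final = test_predictions[["PassengerId","Label"]].rename(columns={"Label": "Transported"})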
print(submission_final.shape)
submission_final.head()

In [None]:
submission_final.to_csv("submission_final.csv", index=False)
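
Before uploading, a few sanity checks on the submission frame can catch simple mistakes. The helper below is hypothetical (check_submission is not part of pandas or PyCaret), and it assumes the file should contain exactly two columns, PassengerId and the target Transported, with one row per test passenger.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Hypothetical helper: basic sanity checks before writing the CSV."""
    assert list(df.columns) == ["PassengerId", "Transported"], "unexpected columns"
    assert len(df) == expected_rows, "row count differs from the test set"
    assert df["PassengerId"].is_unique, "duplicate passenger ids"
    assert df["Transported"].notna().all(), "missing predictions"
    return True

# Made-up submission with three rows
toy_submission = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],
    "Transported": [True, False, True],
})

ok = check_submission(toy_submission, expected_rows=3)
print("submission looks consistent:", ok)
```

On the real data the call would take the submission frame and `len(test)` as its arguments.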