# Spaceship Titanic Prediction

In this project, our task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. 

## File and Data Field Descriptions

- train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    - PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
    - HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
    - CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    - Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
    - Destination - The planet the passenger will be debarking to.
    - Age - The age of the passenger.
    - VIP - Whether the passenger has paid for special VIP service during the voyage.
    - RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
    - Name - The first and last names of the passenger.
    - Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
- test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
- sample_submission.csv - A submission file in the correct format.
    - PassengerId - Id for each passenger in the test set.
    - Transported - The target. For each passenger, predict either True or False.
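
Because `PassengerId` encodes the travel group, the group can be recovered with a simple string split. The following is only a minimal sketch on made-up IDs; the column names `Group`, `NumberInGroup` and `GroupSize` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy IDs in the documented gggg_pp format (made-up values)
ids = pd.Series(['0001_01', '0002_01', '0002_02', '0003_01'], name='PassengerId')

# Split once on the underscore: left part is the group, right part the number within it
parts = ids.str.split('_', expand=True)
demo = pd.DataFrame({
    'PassengerId': ids,
    'Group': parts[0],
    'NumberInGroup': parts[1].astype(int),
})

# Group size = how many passengers share the same group prefix
demo['GroupSize'] = demo.groupby('Group')['PassengerId'].transform('count')
print(demo)
```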


First, let's import the libraries needed for the analysis and then load 'train.csv' and 'test.csv'.

## Library importing

In [None]:
# Preparing Libraries 

import warnings
warnings.filterwarnings('ignore')

import pandas as pd 
import numpy as np

pd.options.display.max_columns = 200
import cufflinks as cf
cf.go_offline(connected = True)

import plotly.express as px 
import plotly.graph_objects as go 
import plotly.offline as pyo
pyo.init_notebook_mode() 

## Dataset importing

In [None]:
train_df = pd.read_csv('../input/spaceship-titanic/train.csv', index_col = 'PassengerId')
test_df = pd.read_csv('../input/spaceship-titanic/test.csv', index_col = 'PassengerId')

train_df.head(1)

## Glimpse dataset

To get more details, we write a function for viewing the statistics of a dataset. Our function 'data_glimpse(df)' shows a dataset preview, column information, missing data, unique values, the describe table and the info table.

In [None]:
def missing(df) : 
    """
    This function shows the number of missing values per column and their share of all rows (as a fraction between 0 and 1).
    """
    missing_number = df.isnull().sum().sort_values(ascending = False)
    missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending = False)
    missing_values = pd.concat([missing_number, missing_percent], axis = 1, keys = ['Missing_number', 'Missing_percent'])
    return missing_values 

def categorize(df) :
    """
    This function shows number of features by dtypes.
    The result is not always accurate because dtypes are estimated before preprocessing.
    """
    Quantitative_features = df.select_dtypes([np.number]).columns.tolist()
    Categorical_features = df.select_dtypes(exclude = [np.number]).columns.tolist()
    # Numeric columns with fewer than 10 distinct values are treated as discrete
    Discrete_features = [col for col in Quantitative_features if len(df[col].unique()) < 10]
    Continuous_features = [col for col in Quantitative_features if col not in Discrete_features]
    print(f"Quantitative features : {Quantitative_features} \nDiscrete features : {Discrete_features} \nContinuous features : {Continuous_features} \nCategorical features : {Categorical_features}\n")
    print(f"Number of quantitative features : {len(Quantitative_features)} \nNumber of discrete features : {len(Discrete_features)} \nNumber of continuous features : {len(Continuous_features)} \nNumber of categorical features : {len(Categorical_features)}")
    
def unique(df) : 
    """
    This function returns a table with the number of unique values per column and five randomly sampled rows.
    """
    tb1 = pd.DataFrame({'Columns' : df.columns, 'Number_of_Unique' : df.nunique().values.tolist(),
                       'Sample1' : df.sample(1).values.tolist()[0], 'Sample2' : df.sample(1).values.tolist()[0], 
                       'Sample3' : df.sample(1).values.tolist()[0],
                       'Sample4' : df.sample(1).values.tolist()[0], 'Sample5' : df.sample(1).values.tolist()[0]})
    return tb1    

def data_glimpse(df) :   
    
    # Dataset preview 
    print("1. Dataset Preview \n")
    display(df.head())
    print("-------------------------------------------------------------------------------\n")
    
    # Column information
    print("2. Column Information \n")
    print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns")
    print("\n") 
    print(f"Dataset Column name : {df.columns.values}")
    print("\n")
    categorize(df)
    print("-------------------------------------------------------------------------------\n")
    
    # Basic information table 
    print("3. Missing data table : \n")
    display(missing(df))
    print("-------------------------------------------------------------------------------\n")
    
    print("4. Number of unique value by column : \n")
    display(unique(df))
    print("-------------------------------------------------------------------------------\n")
    
    print("5. Describe table : \n")
    display(df.describe())
    print("-------------------------------------------------------------------------------\n")
    
    # df.info() prints its report itself and returns None, so it is not wrapped in print()
    df.info()
    print("-------------------------------------------------------------------------------\n")

In [None]:
data_glimpse(train_df)

The training set has 8693 passengers and 13 columns, with 'PassengerId' used as the index. There are 6 continuous features ('Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck') and 7 non-numeric columns ('HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Name', 'Transported').

Every column except the 'PassengerId' index and the 'Transported' target has some missing values. We need to understand why they occur and impute them.

The unique-value table also shows that there are only three home planets and three destinations.
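
The missing-value table boils down to a count and a share per column. As a minimal sketch on a made-up frame (the values and the `toy` name are assumptions for illustration), the share can be read directly as the mean of the boolean null mask:

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing ages and one missing VIP flag
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 35.0, np.nan],
    'VIP': [False, True, np.nan, False],
})

# The mean of the boolean null mask is the fraction of missing rows per column
summary = pd.DataFrame({
    'Missing_number': toy.isnull().sum(),
    'Missing_percent': toy.isnull().mean(),
}).sort_values('Missing_number', ascending=False)
print(summary)
```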

## Data cleaning for analysis

In the data cleaning section, our goal is to clean the dataset, parse values to gain more insight and reduce the data size.

Based on the result of data_glimpse(), we need to do the following: 
- parse column 'Cabin' into Deck, DeckNumber and Side columns
- drop column 'Name'
- fill missing values 
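
As a rough illustration of the first step, pandas can split the deck/num/side string into three columns in one call. This sketch uses made-up cabin values and `expand=True`, which is an alternative to the index-based helper used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy cabins in the documented deck/num/side format, including one missing value
cabins = pd.DataFrame({'Cabin': ['B/0/P', 'F/12/S', np.nan]})

# expand=True returns one column per piece; a missing cabin stays missing in all three
pieces = cabins['Cabin'].str.split('/', expand=True)
pieces.columns = ['Deck', 'DeckNumber', 'Side']
print(pieces)
```

Note that `DeckNumber` is still a string at this point and has to be converted to a number separately.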

### Parsing column 'Cabin'

In [None]:
# Parsing value from cabin

def parsing_from(dataset, idx) : 
    return dataset['Cabin'].str.split('/').str[idx]

train_df['Deck'] = parsing_from(train_df, 0)
train_df['DeckNumber'] = parsing_from(train_df, 1)
train_df['Side'] = parsing_from(train_df, 2)

# Check results

train_df.head(1)

In [None]:
# Drop the original column now that it has been parsed

train_df.drop(['Cabin'], axis = 1, inplace = True)

train_df.columns

In [None]:
# Apply process to test_df 

test_df['Deck'] = parsing_from(test_df, 0)
test_df['DeckNumber'] = parsing_from(test_df, 1)
test_df['Side'] = parsing_from(test_df, 2)
test_df.drop(['Cabin'], axis = 1, inplace = True)

### Drop column 'Name'

In [None]:
train_df.drop(['Name'], axis = 1, inplace = True)

# Apply process to test_df 

test_df.drop(['Name'], axis = 1, inplace = True)

### Fill missing values

In [None]:
# Shared layout helper for figure objects.
# Note: it updates the global `fig`, so create the figure before calling it.

def fig_layout(title, xaxis, yaxis) : 
    fig.update_layout(
    {
        "title": {
            "text": title,
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "xaxis": {
            "title": xaxis,
            "showticklabels":True,
            "tickfont": {
                "size": 9            
            }
        },
        "yaxis": {
            "title": yaxis,
            "tickfont": {
                "size": 10                
            }
        },
        "template":'plotly_dark'
    }
    )
    

In [None]:
# plot for viewing missing values 

missing_val = train_df.isnull().sum().sort_values(ascending = False)

fig = go.Figure()

fig.add_trace(
    go.Bar(
        x = missing_val.index,
        y = missing_val,
        text = missing_val
    )
)

title = "<b>Count of missing values by features</b>"
xaxis = "Variables"
yaxis = "Count of missing values"

fig_layout(title, xaxis, yaxis)
fig.show()

Apart from the target variable 'Transported' and the 'PassengerId' index, each feature has about 190 missing values, which is only about 2% of the rows. We will impute the missing values with the most frequent value of each column using SimpleImputer().
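
To make the effect of the 'most_frequent' strategy concrete, here is a minimal sketch on a made-up frame (the values and the `toy` name are assumptions). It also shows why numeric columns have to be converted back afterwards: with mixed input, the imputer returns an object array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame mixing a categorical and a numeric column (made-up values)
toy = pd.DataFrame({
    'HomePlanet': pd.Series(['Earth', 'Earth', np.nan, 'Mars'], dtype=object),
    'Age': [24.0, np.nan, 24.0, 30.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# Mixed input comes back as an object array, so numeric columns need converting back
filled['Age'] = pd.to_numeric(filled['Age'])
print(filled)
```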

In [None]:
# Set the target aside so that SimpleImputer is fitted on the features only

y = train_df['Transported']
train_df = train_df.drop(columns = ['Transported'])

print(f"Shape of train dataset : {train_df.shape}")
print(f"Shape of test dataset : {test_df.shape}")

In [None]:
# Fit SimpleImputer to train dataset and apply it to test dataset

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')

train_df = pd.DataFrame(imputer.fit_transform(train_df), columns = train_df.columns, index = train_df.index)
test_df = pd.DataFrame(imputer.transform(test_df), columns = test_df.columns, index = test_df.index)

# Change type object to numeric

train_df[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 
          'Spa', 'VRDeck', 'DeckNumber']] = train_df[['Age', 'RoomService', 'FoodCourt', 
                                                       'ShoppingMall', 'Spa', 'VRDeck', 'DeckNumber']].apply(pd.to_numeric)
test_df[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 
          'Spa', 'VRDeck', 'DeckNumber']] = test_df[['Age', 'RoomService', 'FoodCourt', 
                                                       'ShoppingMall', 'Spa', 'VRDeck', 'DeckNumber']].apply(pd.to_numeric)

In [None]:
print(f"Missing values of train_df : {train_df.isnull().sum().sum()}")
print(f"Missing values of test_df : {test_df.isnull().sum().sum()}")

In [None]:
# Concat target to train_df

train_df = pd.concat([train_df, y], axis = 1)
train_df.head(1)

We finished filling missing values with most frequent values. 

## Explore dataset 

After parsing and cleaning the dataset, we now explore it to understand the distribution of each column and to find outliers. We will look at the statistics and plots of each column.

### Statistics and Visualizations of each column

To check whether the numerical features contain outliers, we need to see the statistics and visualizations of each column. First, we divide the columns into continuous and categorical features.

- continuous_fea = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']  
- categorical_fea = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side'] ('DeckNumber' was converted to a numeric column above and is not plotted)
- target_fea = 'Transported'

Let's look at the distribution of the target feature first and then at the independent variables. 

In [None]:
fig = px.histogram(train_df, "Transported", color = "Transported")

title = "Histogram of Target feature"
xaxis = "Transported"
yaxis = "Count"
fig_layout(title, xaxis, yaxis)
fig.show()

The target variable is balanced: 4315 passengers are False and 4378 are True.

### Continuous feature 

In [None]:
continuous_fea = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'] 
categorical_fea = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side']

train_df[continuous_fea].describe()

The standard deviations of 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa' and 'VRDeck' are very large, which points to extreme values, so we need to deal with those outliers. 

In [None]:
# Distplot of the continuous feature 'Age' ('RoomService' follows further below)

import plotly.figure_factory as ff 

fig = ff.create_distplot([train_df['Age']], ['Age'], bin_size = 5, 
                         curve_type = 'normal')

title = "<b>Distplot of Age</b>"
xaxis = "Age"
yaxis = "%"
fig_layout(title, xaxis, yaxis)

fig.show()

As we saw in the describe table, the typical passenger on the spaceship is about 28 years old. 

In [None]:
# Age by target

fig = px.histogram(train_df, x = 'Age', color = 'Transported')

title = "<b>Histogram of Age by target</b>"
xaxis = "Age"
yaxis = "Count"
fig_layout(title, xaxis, yaxis)

fig.show()

In [None]:
fig = ff.create_distplot([np.sqrt(train_df['RoomService'])], ['RoomService'], bin_size = 10, 
                         curve_type = 'normal')

title = "<b>Distplot of RoomService</b>"
xaxis = "RoomService"
yaxis = "%"
fig_layout(title, xaxis, yaxis)

fig.show()

In [None]:
from scipy.stats import skew

for fea in continuous_fea : 
    print(f"Skewness of {fea} is {skew(np.array(train_df[fea]))}")

The difference between the minimum and maximum values of 'RoomService' is so large that I had to apply np.sqrt() to make the distplot readable. When we check the skewness of the features in continuous_fea, every feature except 'Age' has a skewness higher than 7. We will cap the outliers using the IQR rule.
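
For reference, `scipy.stats.skew` with its default settings computes the biased Fisher-Pearson coefficient, that is, the third central moment divided by the variance raised to the power 1.5. The sketch below re-implements that formula with NumPy on made-up spending values (the helper name `sample_skewness` and the data are assumptions) to show how a long right tail inflates the statistic and how a square root reduces it.

```python
import numpy as np

def sample_skewness(values):
    """Biased Fisher-Pearson skewness: third central moment over variance ** 1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

# Made-up bills: mostly small amounts plus a few very large values
bills = np.array([0, 1, 4, 9, 16, 25, 100, 400], dtype=float)

raw_skew = sample_skewness(bills)
sqrt_skew = sample_skewness(np.sqrt(bills))
print(f"raw: {raw_skew:.2f}, after sqrt: {sqrt_skew:.2f}")
```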

### Process Outlier

In [None]:
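def impute_outlier(col) : 
    """
    Return the upper fence Q3 + 1.5 * IQR of a train_df column.
    Values above this fence are treated as outliers.
    """
    Q1 = train_df[col].quantile(0.25) 
    Q3 = train_df[col].quantile(0.75)
    IQR = Q3 - Q1
    
    return Q3 + (1.5*IQR)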

In [None]:
outlier_fea = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'] 

for fea in outlier_fea : 
    max_val = impute_outlier(fea)
    train_df[fea] = train_df[fea].map(lambda x : max_val if x > max_val else x)
    # Apply result to test dataset
    test_df[fea] = test_df[fea].map(lambda x : max_val if x > max_val else x)   

In [None]:
fig = ff.create_distplot([np.sqrt(train_df['RoomService'])], ['RoomService'], bin_size = 3, 
                         curve_type = 'normal')

title = "<b>Distplot of RoomService</b>"
xaxis = "RoomService"
yaxis = "%"
fig_layout(title, xaxis, yaxis)

fig.show()

Now every value above the upper fence is replaced by $Q3 + 1.5 \times IQR$. 
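
As a small worked example of this cap, the sketch below computes the upper fence on made-up spending values and applies it with `Series.clip`, which behaves like the `map` with a lambda used above. The data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up spending values with one extreme bill
spend = pd.Series([0.0, 0.0, 10.0, 20.0, 30.0, 40.0, 5000.0])

q1 = spend.quantile(0.25)   # 5.0 with the default linear interpolation
q3 = spend.quantile(0.75)   # 35.0
upper_fence = q3 + 1.5 * (q3 - q1)

# clip(upper=...) caps values above the fence and leaves the rest untouched
capped = spend.clip(upper=upper_fence)
print(upper_fence, capped.tolist())
```

Here the fence is 35 + 1.5 × 30 = 80, so only the 5000 bill is changed.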

### Categorical features

We will look at a univariate count plot of each categorical feature and at a comparison by target. 

- Univariate count plot of each column
- Count plot of each column split by target 

In [None]:
from plotly.subplots import make_subplots

def create_count_plot(fea) : 
    grouped_df = train_df.groupby(fea).size().reset_index()
    grouped_df.columns = [fea, 'Count']
    
    grouped_df_target = train_df.groupby([fea, 'Transported']).size().reset_index()
    grouped_df_target.columns = [fea, 'Transported', 'Count']
    
    fig = make_subplots(rows=1, cols=2)

    fig.add_trace(go.Bar(
        x = grouped_df[fea],
        y = grouped_df["Count"],
        name = fea 
    ), row = 1, col = 1)
    
    for trans in train_df['Transported'].unique() : 
        plot_df = grouped_df_target[grouped_df_target['Transported'] == trans]
        fig.add_trace(go.Bar(
            x = plot_df[fea],
            y = plot_df["Count"],
            name = f"Transported {trans}"
        ), row = 1, col = 2)
        
    fig.update_layout(
    {
        "title": {
            "text": f"Countplots of {fea}",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "yaxis": {
            "title": "Count",
            "tickfont": {
                "size": 10                
            }
        },
        "template":'plotly_dark'
    }
    )  
    
    fig.update_xaxes(title_text=fea, row=1, col=1)
    fig.update_xaxes(title_text=fea, row=1, col=2)
        
    fig.show()

In [None]:
# Distribution of "HomePlanet"

create_count_plot("HomePlanet")

Earth has the most passengers on the Spaceship Titanic, followed by Europa and Mars. Earth also has the most passengers who were not transported, while Europa has the most passengers who were transported.

In [None]:
# Distribution of "CryoSleep"

create_count_plot("CryoSleep")

Interestingly, passengers who were in CryoSleep were transported more often than those who were not.

In [None]:
# Distribution of "Destination"

create_count_plot("Destination")

In [None]:
create_count_plot("VIP")

There isn't a big difference between passengers with and without VIP service. 

In [None]:
create_count_plot("Deck")

Passengers on decks B and C were transported more often than those on other decks. 

In [None]:
create_count_plot("Side")

# Modeling 

## Preprocessing 

Before we build a model for prediction, we need to choose features that improve the model's accuracy. First, we will separate the target variable from train_df.

In [None]:
# Separate the target from train_df and encode it as 0 and 1

y = np.where(train_df['Transported'] == True, 1, 0)
X = train_df.drop(columns = ['Transported'])
X_test = test_df

print(f"Size of each table : y = {y.shape}, X = {X.shape}, X_test = {X_test.shape}")

### Feature Selection

We will use a correlation matrix to choose variables among the continuous features.
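
Since a correlation matrix is symmetric, it is common to hide the diagonal and the upper triangle so that each pair appears once. A minimal sketch of that masking step on made-up data (the frame `toy` and its columns are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up numeric data: 'b' is exactly 2 * 'a', 'target' decreases as 'a' grows
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'target': [1, 1, 0, 0],
})

corr = toy.corr()

# np.triu marks the diagonal and upper triangle; masking them leaves each pair once
mask = np.triu(np.ones_like(corr, dtype=bool))
lower = corr.mask(mask)
print(lower.round(3))
```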

In [None]:
train_df['Transported'] = np.where(train_df['Transported'] == True, 1, 0)
# Build a new list so that continuous_fea itself is not modified
corr_fea = continuous_fea + ['Transported']
corr = train_df[corr_fea].corr()
mask = np.triu(np.ones_like(corr, dtype = bool))
df_mask = corr.mask(mask)

fig = ff.create_annotated_heatmap(z=df_mask.round(3).to_numpy(), 
                                  x=df_mask.columns.tolist(),
                                  y=df_mask.columns.tolist(),
                                  colorscale=px.colors.diverging.RdBu,
                                  hoverinfo="none",
                                  showscale=True, ygap=1, xgap=1
                                 )

fig.update_xaxes(side="bottom")

fig.update_layout(
    title_text='Heatmap', 
    title_x=0.5, 
    width=1000, 
    height=1000,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    xaxis_zeroline=False,
    yaxis_zeroline=False,
    yaxis_autorange='reversed',
    template='plotly_dark'
)

for i in range(len(fig.layout.annotations)):
    if fig.layout.annotations[i].text == 'nan':
        fig.layout.annotations[i].text = ""
    fig.layout.annotations[i].font.size = 10

fig.show()

In [None]:
# OneHotEncoding

from sklearn.preprocessing import OneHotEncoder

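# `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use that name on newer versions
OH_encoder = OneHotEncoder(handle_unknown = 'ignore', sparse = False)

OH_encoder.fit(X[categorical_fea])

# Name the encoded columns so that every column label is a string after concatenation
# (get_feature_names_out requires scikit-learn >= 1.0)
OH_cols = OH_encoder.get_feature_names_out(categorical_fea)

OH_X_train = pd.DataFrame(OH_encoder.transform(X[categorical_fea]), columns = OH_cols, index = X.index)
OH_X_test = pd.DataFrame(OH_encoder.transform(X_test[categorical_fea]), columns = OH_cols, index = X_test.index)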

num_X = X.drop(categorical_fea, axis = 1)
num_X_test = X_test.drop(categorical_fea, axis = 1)

X = pd.concat([num_X, OH_X_train], axis = 1)
X_test = pd.concat([num_X_test, OH_X_test], axis = 1)

# Standard scaling (zero mean, unit variance)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler() 

X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)

print(f"Size of each table : y = {y.shape}, X = {X.shape}, X_test = {X_test.shape}")

# Divide dataset into train, valid dataset

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size = 0.8, test_size = 0.2)

We will use Logistic Regression, Support Vector Machine, Random Forest, Gradient Boosting and LightGBM classifiers. We will evaluate each model using accuracy, F1-score, ROC AUC and a confusion matrix. 
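
To make these metrics concrete, here is a small sketch on made-up labels and predicted probabilities (all values are assumptions for illustration). Accuracy, F1 and the confusion matrix use hard predictions, while ROC AUC is computed from the raw scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# Made-up labels and predicted probabilities for six passengers
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.6, 0.3, 0.7, 0.9])

# Hard predictions use a 0.5 threshold; ROC AUC needs the raw scores instead
y_pred = (y_score >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)   # rows = actual class, columns = predicted class
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
print(cm, acc, f1, auc)
```

In this toy case four of six predictions are correct, and seven of the nine positive-negative pairs are ranked correctly, which is what the AUC measures.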

In [None]:
# Importing libraries for metrics

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, roc_curve, accuracy_score, f1_score, roc_auc_score


# Confusion matrix 

def accuracy_plots(y_valid, preds, yscore) : 
    
    fig = make_subplots(rows=1, cols=2)
    
    # Confusion matrix
    
    z = confusion_matrix(y_valid, preds)
    # confusion_matrix puts actual classes on rows and predicted classes on columns
    x = ['Predicted 0', 'Predicted 1']
    y = ['Actual 0', 'Actual 1']    
    
    fig.add_trace(go.Heatmap(
        z=z,
        x=x,
        y=y,
        text = z,
        reversescale = True
    ), row = 1, col = 1)
    
    # Roc-Auc curve
    
    fpr, tpr, thresholds = roc_curve(y_valid, yscore) 
    
    fig.add_trace(go.Scatter(
        x = fpr, y = tpr,
        fill = 'tozeroy'
    ), row = 1, col = 2)
    
    fig.update_layout(
    {
        "title": {
            "text": f"Confusion Matrix and ROC Curve",
            "x": 0.5,
            "y": 0.9,
            "font": {
                "size": 15
            }
        },
        "template":'plotly_dark'
    }
    )  
    
    fig.update_xaxes(title_text="Confusion Matrix", row=1, col=1)
    fig.update_yaxes(autorange="reversed", row = 1, col = 1) 
    fig.update_xaxes(title_text="Roc Curve", row=1, col=2)
    
    fig.show()
    
    
# Accuracy Metrics

def metrics(estimators, params, X_train, y_train, X_valid, y_valid) : 
    """
    Tune each estimator with GridSearchCV, score the best model on the validation set
    and return one row of scores per estimator.
    """
    rows = []
    
    for name, model in estimators.items() : 
        grid_model = GridSearchCV(model, params[name], cv = 3)
        grid_model.fit(X_train, y_train)
        best_model = grid_model.best_estimator_
        preds = best_model.predict(X_valid)
        yscore = best_model.predict_proba(X_valid)[:, 1]
        print(f"Best estimator for {name} : {best_model}")
            
        scores = {}
        scores['accuracy_score'] = accuracy_score(y_valid, preds)
        scores['f1_score'] = f1_score(y_valid, preds)
        scores['roc_auc_score'] = roc_auc_score(y_valid, yscore)
        
        rows.append(scores) 
        accuracy_plots(y_valid, preds, yscore)
        
    # Row labels follow the insertion order of the `estimators` dictionary
    metrics_df = pd.DataFrame(rows, index = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Gradient Boosting Classifier', 'LGBM Classifier'])
        
    return metrics_df 

In [None]:
# Model importing

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier

lgr = LogisticRegression(random_state = 0)
svc = SVC(random_state = 0, probability = True)
rfc = RandomForestClassifier(random_state = 0)
gbc = GradientBoostingClassifier(random_state = 0)
lgbm = LGBMClassifier(random_state = 0) 

estimators = { 
    'LGR' : lgr,
    'SVC' : svc,
    'RFC' : rfc,
    'GBC' : gbc,
    'lgbm' : lgbm
}

params = {
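    'LGR' : {
        'C' : [0.01, 0.1, 0.5, 1],
        # The default lbfgs solver rejects 'l1', and 'elastic' is not a valid penalty name
        # ('elasticnet' would need solver='saga'); liblinear supports both 'l1' and 'l2'
        'solver' : ['liblinear'],
        'penalty' : ['l1', 'l2']
    },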
    'SVC' : {
        'C' : [0.1 ,1 ,2, 5],
        'kernel' : ['linear', 'poly'],
        'degree' : [2, 3]
    },
    'RFC' : {
        'n_estimators' : [50, 100, 150, 200, 250],
        'max_depth' : [5, 6, 7]   
    },
    'GBC' : { 
        "n_estimators": range(50, 100, 25), 
        "max_depth": [4, 5, 10]
    },
    'lgbm' : {
        'num_leaves':[5, 10, 15], 
        'min_child_samples':[10,15, 20],
        'max_depth':[5, 8, 10],
        'reg_alpha':[0.01,0.03]

    }
}

In [None]:
metrics(estimators, params, X_train, y_train, X_valid, y_valid) 

## Conclusion

The model with the best accuracy score is the Gradient Boosting Classifier, so we will use it to predict the test dataset. 

In [None]:
final_model = GradientBoostingClassifier(max_depth=4, n_estimators=75, random_state=0)
final_model.fit(X_train, y_train)
predicts = final_model.predict(X_test)

submissions = pd.DataFrame({'PassengerId' : test_df.index,
                            'Transported' : predicts})
submissions['Transported'] = np.where(submissions['Transported'] == 1, True, False) 
submissions.to_csv('./submissions.csv', index = False)