# **Spaceship Titanic Passenger Destiny**
## **Table of Contents**

  * [Dataset Information: Spaceship Titanic Data Set](#DatasetInformation)<br></br>
  * [Load & Explore the Spaceship Titanic dataset](#Dataset)<br></br>
  * [Data Preprocessing and Visualization](#Preprocessing)<br></br>
  * [Model Development & Evaluation](#Model)<br></br>
  * [Prediction](#Prediction)<br></br>
    


### **Import Libraries & Primary modules**

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt 

## **Dataset Information: Spaceship Titanic Data Set**<a name="DatasetInformation"></a>

Spaceship Titanic was en route with about 13,000 passengers from three points of origin: Earth, Europa and Mars. Three destinations were recorded for this voyage:<br>
TRAPPIST-1 e, about 40 light-years away; 55 Cancri e, also known as Janssen, which orbits a star called Copernicus about 41 light-years away; and the farthest, PSO J318.5-22, a free-floating exoplanet about 80 light-years from Earth.<br>
On its maiden voyage, the Spaceship Titanic collided with a spacetime anomaly. The aim of this project is to predict whether a passenger travelling on this spaceship was transported to an alternate dimension, using the records of ~8700 passengers whose fate is known. The fate of the remaining one-third is still a mystery, and machine learning algorithms are used here to predict what happened to those passengers. <br>

**File and Data Field Descriptions**<br>
1. Passengers with known fate: Personal records for about two-thirds (~8700) of the passengers, to be used as training data
  
    * PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
    * HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
    * CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    * Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
    * Destination - The planet the passenger will be debarking to.
    * Age - The age of the passenger.
    * VIP - Whether the passenger has paid for special VIP service during the voyage.
    * RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
    * Name - The first and last names of the passenger.
    * Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
    
2. Passengers with unknown fate: Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. The task is to predict the value of Transported for the passengers in this set.

Transported - The target. For each passenger, predict either True or False.
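
The `gggg_pp` layout of `PassengerId` described above can be unpacked with plain string operations. The following is only a minimal sketch on made-up IDs; the column names `group`, `number_in_group` and `group_size` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up IDs in the gggg_pp format (not taken from the real data)
ids = pd.DataFrame({"PassengerId": ["9001_01", "9002_01", "9002_02", "9003_01"]})

# Split on the underscore: the left part is the group, the right part the position in it
parts = ids["PassengerId"].str.split("_", expand=True)
ids["group"] = parts[0]
ids["number_in_group"] = parts[1].astype(int)

# Hypothetical extra feature: how many passengers share the same group code
ids["group_size"] = ids.groupby("group")["PassengerId"].transform("count")
print(ids)
```

Whether such a group-size feature helps the models is not tested here; it only illustrates how the ID can be read.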

## **Load & Explore the Spaceship Titanic datasets**<a name = "Dataset"></a>

In [None]:
df = pd.read_csv('../input/spaceship-titanic/train.csv')
df.head()

In [None]:
df_test = pd.read_csv('../input/spaceship-titanic/test.csv')
df_test.head()

In [None]:
print(df.shape)
print(df_test.shape)

In [None]:
df.info(), df_test.info()

## **Data Preprocessing and Visualization**<a name = 'Preprocessing'></a>
### **Missing data cleaning**

In [None]:
df.isnull().sum(), df_test.isnull().sum()

The missing values can be easily spotted through sns.heatmap 

In [None]:
plt.figure(figsize = (12,6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap = 'cividis');

In [None]:
plt.figure(figsize = (12,6))
sns.heatmap(df_test.isnull(), yticklabels=False, cbar=False, cmap = 'plasma');
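
Alongside the heatmaps, the share of missing values per column can be read off as a number. This is a small sketch on a made-up frame (the values are illustrative, not from the competition data); on the real data the same call would be applied to `df` and `df_test`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, standing in for the real train set
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Earth", "Europa", None, "Mars"],
    "Spa": [0.0, 549.0, np.nan, np.nan],
})

# isnull() gives a boolean frame; its column means are the missing fractions
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```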

#### **Handling Cabin Column**
The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard. The missing values are replaced by a new value 'X/0.0/x'. The filled 'Cabin' column is then split into 'deck', 'num' and 'side'.

In [None]:
df['Cabin']= df['Cabin'].fillna(value = 'X/0.0/x')
df_test['Cabin']= df_test['Cabin'].fillna(value = 'X/0.0/x')
df[['deck', 'num', 'side']]= df['Cabin'].str.split('/', expand = True)
df_test[['deck', 'num', 'side']]= df_test['Cabin'].str.split('/', expand = True)
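
To see what the fill-and-split step produces, here is a minimal sketch on a three-row toy column (made-up cabins). The final conversion of `num` to a number is an optional extra step assumed for illustration; the notebook itself keeps `num` as a string and encodes it later.

```python
import numpy as np
import pandas as pd

# Made-up cabins, one of them missing
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", np.nan]})

# Same placeholder as in the cell above, then split into three columns
toy["Cabin"] = toy["Cabin"].fillna("X/0.0/x")
toy[["deck", "num", "side"]] = toy["Cabin"].str.split("/", expand=True)

# str.split returns strings, so 'num' is converted explicitly here
toy["num"] = toy["num"].astype(float)
print(toy)
```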

#### **Filling up age column:** 
Missing ages are filled with the mean age of the passenger's VIP class rather than with a single overall mean.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='VIP',y='Age',data=df,palette='winter');

In [None]:
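# numeric_only=True skips text columns, which recent pandas versions refuse to average
df.groupby('VIP').mean(numeric_only=True)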

In [None]:
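# numeric_only=True skips text columns, which recent pandas versions refuse to average
df_test.groupby('VIP').mean(numeric_only=True)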

In [None]:
df['Age'].mean()

For the train set the mean age of VIP passengers is found to be 38, and in the test set it is 35, so separate functions are defined for the train and test sets. The mean age of all passengers, and of those not on the VIP list, is found to be 29; that value is used for non-VIP passengers and for passengers whose VIP status is missing.

In [None]:
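def impute_age(cols):
    # cols is one row holding the 'Age' and 'VIP' values; index by label, not position
    Age = cols['Age']
    VIP = cols['VIP']
    if pd.isnull(Age):
        # 38 is the mean age of VIP passengers in the train set
        if VIP == True:
            return 38
        # non-VIP passengers and passengers with missing VIP status get 29
        return 29
    return Age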

In [None]:
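def impute_age_test(cols):
    # cols is one row holding the 'Age' and 'VIP' values; index by label, not position
    Age = cols['Age']
    VIP = cols['VIP']
    if pd.isnull(Age):
        # 35 is the mean age of VIP passengers in the test set
        if VIP == True:
            return 35
        # non-VIP passengers and passengers with missing VIP status get 29
        return 29
    return Age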

In [None]:
# Now apply those functions
df['Age'] = df[['Age','VIP']].apply(impute_age,axis=1)
print(df['Age'].isnull().sum())
df_test['Age'] = df_test[['Age','VIP']].apply(impute_age_test,axis=1)
print(df_test['Age'].isnull().sum())
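
The same idea (fill a missing age with the mean of the passenger's VIP class) can also be written without hard-coded numbers. The sketch below is an alternative on a made-up frame, not the method used in this notebook; the column name `Age_filled` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: VIP status and age, both with gaps
toy = pd.DataFrame({
    "VIP": [True, True, False, False, False, np.nan],
    "Age": [40.0, np.nan, 20.0, 30.0, np.nan, np.nan],
})

# Mean age per VIP class (rows with missing VIP are ignored by groupby)
vip_means = toy.groupby("VIP")["Age"].mean().to_dict()
overall_mean = toy["Age"].mean()

# First try the class mean, then fall back to the overall mean when VIP is missing
toy["Age_filled"] = (
    toy["Age"]
    .fillna(toy["VIP"].map(vip_means))
    .fillna(overall_mean)
)
print(toy)
```

Computing the means from the training data and reusing them for the test set would keep both sets on the same fill values; that choice is left open here.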

The missing categorical values are filled as 'Unknown' and numerical values as 0, except for the age. 

In [None]:
df[df.select_dtypes(include=['object']).columns] = df[df.select_dtypes(include=['object']).columns].fillna('Unknown')
df = df.fillna(0.0)
df_test[df_test.select_dtypes(include=['object']).columns] = df_test[df_test.select_dtypes(include=['object']).columns].fillna('Unknown')
df_test = df_test.fillna(0.0)

In [None]:
df.isnull().values.any()

In [None]:
df_test.isnull().values.any()

### **Converting Categorical Features**

Encoders require their input to be uniformly strings or numbers. The 'CryoSleep' and 'VIP' columns contain Boolean values (plus the 'Unknown' placeholder), so they are converted to the string data type before encoding.

In [None]:
convert_dict = {'CryoSleep': str,
                'VIP': str
               }
df = df.astype(convert_dict)
df_test = df_test.astype(convert_dict)

In [None]:
X = df.drop(["PassengerId", "Cabin", "Transported"], axis = 1)
y = df["Transported"]
titanic_test = df_test.drop(["PassengerId", "Cabin"], axis = 1)

print(X.shape)
print(y.shape)
print (titanic_test.shape)

OrdinalEncoder from scikit-learn, which supports multi-column encoding, can be used to convert the categorical features to a numerical data type.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
# Categories unseen during fit (e.g. new names or cabin numbers in the test set) are encoded as -1
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
cat_cols = X.select_dtypes(include=['object']).columns
X[cat_cols] = enc.fit_transform(X[cat_cols])
# Reuse the encoder fitted on the training data so both sets share the same codes
titanic_test[cat_cols] = enc.transform(titanic_test[cat_cols])
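
Why the encoder should be fitted once matters because an ordinal code only has meaning relative to the categories seen during `fit`. The following sketch uses made-up one-column frames to compare reusing a fitted encoder with refitting on the second frame; the `-1` code for unseen categories is a choice assumed for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up frames standing in for the train and test sets
train = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars"]})
test = pd.DataFrame({"HomePlanet": ["Mars", "Earth", "Unknown"]})

# Fit on train only; categories never seen during fit are mapped to -1
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train)
test_codes = enc.transform(test)

# Refitting on the test frame builds a different mapping for the same labels
refit_codes = OrdinalEncoder().fit_transform(test)

print(test_codes.ravel())   # codes consistent with the train mapping
print(refit_codes.ravel())  # codes from a mapping the model never saw
```

With the refitted encoder, "Mars" receives a different code than it had during training, so a model trained on the first mapping would misread it.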

In [None]:
X.info(), titanic_test.info()

### **Train Test Split**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

### **Standardize the Variables**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
titanic_test = scaler.transform(titanic_test)
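
What `fit_transform` followed by `transform` does can be reproduced by hand: the column means and standard deviations come from the training rows only and are then reused for any other rows. A minimal NumPy sketch on made-up numbers:

```python
import numpy as np

# Made-up training rows and one unseen row
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
new = np.array([[2.0, 40.0]])

# Statistics are learned from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# The same mean and std are applied to both sets
train_scaled = (train - mean) / std
new_scaled = (new - mean) / std

print(train_scaled)
print(new_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the unseen row is simply expressed in those units.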

## **Model Development** <a name = 'Model'></a>
The following algorithms can be used:

* K Nearest Neighbor(KNN)
* Random Forest Classifier
* Support Vector Machine - Linear Kernel
* Support Vector Machine - rbf Kernel
* XGBoost Classifier
* Gradient Boosting Classifier
* Artificial Neural Network

### **Import the required packages**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score

In [None]:
Models=[("KNN",KNeighborsClassifier(n_neighbors=7, weights= 'uniform')),
        ("Random Forest",RandomForestClassifier(n_estimators = 100, criterion = "entropy", max_depth= 13, random_state = 0)),
        ("Random Forest",RandomForestClassifier(n_estimators = 100, criterion = "gini", max_depth= 15, random_state = 0)),
        ("SVM_Linear",svm.SVC(kernel='linear')),
        ("SVM_rbf",svm.SVC(kernel='rbf')),
        ("XGB", XGBClassifier(booster = 'dart', learning_rate=0.05, n_estimators=50, objective='binary:logistic',  use_label_encoder=False,  disable_default_eval_metric = True)), 
        ("XGB", XGBClassifier(booster= 'gbtree', learning_rate=0.05, n_estimators=100, objective='binary:logistic',  use_label_encoder=False,  disable_default_eval_metric = True)),
        ("Gradient Boosting", GradientBoostingClassifier(criterion = 'friedman_mse', max_depth = 2, n_estimators= 500))]

Model_output=[]
for name,model in Models:
    yhat=model.fit(X_train, y_train).predict(X_test)
    Train_set_Accuracy = accuracy_score(y_train, model.predict(X_train))
    F1_score = f1_score(y_test, yhat, average='weighted')
    Accuracy = accuracy_score(y_test, yhat)
    Model_output.append((name, Train_set_Accuracy, Accuracy, F1_score))

# Build the report once, after all models have been evaluated
final_Report=pd.DataFrame(Model_output, columns=['Algorithm','Train_set_Accuracy', 'Accuracy','F1-score'])

# One label per entry in Models, in the same order
Parameter= ["k=7", "criterion_entropy", "criterion_gini", "Linear", "RBF", "dart", "gbtree", "criterion_friedman_mse"]
final_Report['Parameter'] = Parameter
final_Report = final_Report[['Algorithm','Parameter', 'Train_set_Accuracy', 'Accuracy','F1-score']]
final_Report

### **K Nearest Neighbor(KNN)**

In [None]:
# To find the best k
error_rate = []
for i in range (1,10):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    error_rate.append(np.mean(y_test != y_pred))

plt.figure(figsize = (10,4))
plt.plot(range(1,10), error_rate, color = "blue", ls = "--", marker ="o", markersize = 10, markerfacecolor ="red")
plt.title("Error Rate vs K Value")
plt.ylabel("Error Rate")
plt.xlabel("K Value");
print( "The least error_rate was ", min(error_rate), "with k=", np.argmin(error_rate)+1)

In [None]:
k = 5
#Train Model and Predict  
knn = KNeighborsClassifier(n_neighbors = k)
knn.fit(X_train, y_train)

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_test)
print("Train set Accuracy: ", accuracy_score(y_train, knn.predict(X_train)))
print('Accuracy of KNN Classification Model is ', accuracy_score(y_test, y_pred))
print('\n', '\n','Confusion Matrix of KNN Classification Model:' '\n', confusion_matrix(y_test, y_pred))
print('\n', '\n','Classification Report for KNN Classification Model:' '\n',classification_report(y_test, y_pred))
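
The evaluation cells in the following sections repeat the same fit, predict and score steps for each model. As a sketch only, those steps could be wrapped in one function; `evaluate_classifier` is a hypothetical helper, and the synthetic data from `make_classification` merely stands in for the notebook's `X_train`/`X_test`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_classifier(model, X_tr, y_tr, X_te, y_te):
    """Fit `model` and return train accuracy, test accuracy and weighted F1."""
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    return {
        "train_accuracy": accuracy_score(y_tr, model.predict(X_tr)),
        "test_accuracy": accuracy_score(y_te, y_hat),
        "f1_weighted": f1_score(y_te, y_hat, average="weighted"),
    }


# Synthetic stand-in data, split the same way as in the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=101)

report = evaluate_classifier(KNeighborsClassifier(n_neighbors=5), X_tr, y_tr, X_te, y_te)
print(report)
```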

### **Random Forest Classifier with gini criterion**

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
RF= RandomForestClassifier(criterion= 'gini', max_depth= 14, n_estimators = 100, random_state = 0)
RF.fit(X_train, y_train) 
y_pred = RF.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Train set Accuracy: ", accuracy_score(y_train, RF.predict(X_train)))
accuracy = (y_pred == y_test).sum() / len(y_test)
# Accuracies are fractions in [0, 1]; multiply by 100 before printing them as percentages
print('Accuracy of RandomForest Classifier calculated manually is ', round(100 * accuracy, 2), '%')
print('Accuracy of RandomForest Classifier is {:.2f} % '.format(100 * accuracy_score(y_test, y_pred)))
print('\n', '\n','Confusion Matrix of RandomForest Classifier:' '\n', confusion_matrix(y_test, y_pred))
print('\n', '\n','Classification Report for RandomForest Classifier:' '\n',classification_report(y_test, y_pred))
plt.figure()
plt.title('Confusion matrix')
# fmt='d' prints the counts as integers instead of scientific notation
sns.heatmap(cm, annot=True, fmt='d', cmap = 'plasma',  linecolor='black', linewidths=1)
plt.xlabel("Predicted")
plt.ylabel("Actual");
plt.xticks(np.arange(0.5, 2.5), ['False', 'True'])
plt.yticks(np.arange(0.5, 2.5), ['False', 'True']);
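
The manually computed accuracy and the confusion matrix are two views of the same counts: accuracy is the sum of the diagonal divided by the total. A small NumPy sketch on made-up labels, using the same row/column layout as scikit-learn (rows are actual classes, columns are predicted classes, `False` before `True`):

```python
import numpy as np

# Made-up actual and predicted labels
y_true = np.array([True, True, False, False, True, False])
y_hat = np.array([True, False, False, True, True, False])

# Count the four outcomes
tn = np.sum(~y_true & ~y_hat)  # actual False, predicted False
fp = np.sum(~y_true & y_hat)   # actual False, predicted True
fn = np.sum(y_true & ~y_hat)   # actual True, predicted False
tp = np.sum(y_true & y_hat)    # actual True, predicted True

cm = np.array([[tn, fp], [fn, tp]])

# Accuracy is the share of predictions on the diagonal
accuracy = (tp + tn) / cm.sum()
print(cm)
print(accuracy)
```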

### **Random Forest Classifier with entropy criterion**

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
RF= RandomForestClassifier(criterion= 'entropy', max_depth= 13, n_estimators = 100, random_state = 0)
RF.fit(X_train, y_train) 
y_pred = RF.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Train set Accuracy: ", accuracy_score(y_train, RF.predict(X_train)))
accuracy = (y_pred == y_test).sum() / len(y_test)
# Accuracies are fractions in [0, 1]; multiply by 100 before printing them as percentages
print('Accuracy of RandomForest Classifier calculated manually is ', round(100 * accuracy, 2), '%')
print('Accuracy of RandomForest Classifier is {:.2f} % '.format(100 * accuracy_score(y_test, y_pred)))
print('\n', '\n','Confusion Matrix of RandomForest Classifier:' '\n', confusion_matrix(y_test, y_pred))
print('\n', '\n','Classification Report for RandomForest Classifier:' '\n',classification_report(y_test, y_pred))
plt.figure()
plt.title('Confusion matrix')
# fmt='d' prints the counts as integers instead of scientific notation
sns.heatmap(cm, annot=True, fmt='d', cmap = 'plasma',  linecolor='black', linewidths=1)
plt.xlabel("Predicted")
plt.ylabel("Actual");
plt.xticks(np.arange(0.5, 2.5), ['False', 'True'])
plt.yticks(np.arange(0.5, 2.5), ['False', 'True']);

### **Support Vector Machine - Linear Kernel**

In [None]:
from sklearn import svm
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
svm_linear = svm.SVC(kernel='linear')
svm_linear.fit(X_train, y_train) 
y_pred = svm_linear.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Train set Accuracy: ", accuracy_score(y_train, svm_linear.predict(X_train)))
accuracy = (y_pred == y_test).sum() / len(y_test)
# Accuracies are fractions in [0, 1]; multiply by 100 before printing them as percentages
print('Accuracy of Support Vector Classifier (SVC) calculated manually is ', round(100 * accuracy, 2), '%')
print('Accuracy of Support Vector Classifier (SVC) is {:.2f} % '.format(100 * accuracy_score(y_test, y_pred)))
print('\n', '\n','Confusion Matrix of Support Vector Classifier (SVC):' '\n', confusion_matrix(y_test, y_pred))
print('\n', '\n','Classification Report for Support Vector Classifier (SVC):' '\n',classification_report(y_test, y_pred))
plt.figure()
plt.title('Confusion matrix')
# fmt='d' prints the counts as integers instead of scientific notation
sns.heatmap(cm, annot=True, fmt='d', cmap = 'plasma',  linecolor='black', linewidths=1)
plt.xlabel("Predicted")
plt.ylabel("Actual");
plt.xticks(np.arange(0.5, 2.5), ['False', 'True'])
plt.yticks(np.arange(0.5, 2.5), ['False', 'True']);

### **Support Vector Machine - Radial Basis Function (RBF) Kernel**

In [None]:
from sklearn import svm
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
svm_rbf = svm.SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train) 
y_pred = svm_rbf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print("Train set Accuracy: ", accuracy_score(y_train, svm_rbf.predict(X_train)))
accuracy = (y_pred == y_test).sum() / len(y_test)
# Accuracies are fractions in [0, 1]; multiply by 100 before printing them as percentages
print('Accuracy of Support Vector Classifier (SVC) calculated manually is ', round(100 * accuracy, 2), '%')
print('Accuracy of Support Vector Classifier (SVC) is {:.2f} % '.format(100 * accuracy_score(y_test, y_pred)))
print('\n', '\n','Confusion Matrix of Support Vector Classifier (SVC):' '\n', confusion_matrix(y_test, y_pred))
print('\n', '\n','Classification Report for Support Vector Classifier (SVC):' '\n',classification_report(y_test, y_pred))
plt.figure()
plt.title('Confusion matrix')
# fmt='d' prints the counts as integers instead of scientific notation
sns.heatmap(cm, annot=True, fmt='d', cmap = 'plasma',  linecolor='black', linewidths=1)
plt.xlabel("Predicted")
plt.ylabel("Actual");
plt.xticks(np.arange(0.5, 2.5), ['False', 'True'])
plt.yticks(np.arange(0.5, 2.5), ['False', 'True']);

### **XGBoost Classifier with dart booster**

In [None]:
from xgboost import XGBClassifier
XGB = XGBClassifier(booster= 'dart', learning_rate=0.05, n_estimators=50, objective='binary:logistic', 
                          use_label_encoder=False,  disable_default_eval_metric = True)
XGB.fit(X_train, y_train)
y_pred_XGB = XGB.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
print("Train set Accuracy: ", accuracy_score(y_train, XGB.predict(X_train)))
accuracy = (y_pred_XGB == y_test).sum() / len(y_test)
print('Accuracy of XGBoost Classifier calculated manually is ', accuracy.round(2))
print('Accuracy of XGBoost Classifier is ', accuracy_score(y_test, y_pred_XGB))
print('\n', '\n','Confusion Matrix of XGBoost Classifier:' '\n', confusion_matrix(y_test, y_pred_XGB))
print('\n', '\n','Classification Report for XGBoost Classifier:' '\n',classification_report(y_test, y_pred_XGB))

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred_XGB)
cmd = ConfusionMatrixDisplay(cm, display_labels=['False', 'True'])
fig, ax = plt.subplots(figsize=(9,9))
cmd.plot(ax=ax)
cmd.ax_.set(xlabel='Predicted', ylabel='Actual', title='Confusion Matrix Actual vs Predicted');

### **XGBoost Classifier with gbtree booster**

In [None]:
from xgboost import XGBClassifier
XGB = XGBClassifier(booster= 'gbtree', learning_rate=0.05, n_estimators=100, objective='binary:logistic', 
                          use_label_encoder=False,  disable_default_eval_metric = True)
XGB.fit(X_train, y_train)
y_pred_XGB = XGB.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
print("Train set Accuracy: ", accuracy_score(y_train, XGB.predict(X_train)))
accuracy = (y_pred_XGB == y_test).sum() / len(y_test)
print('Accuracy of XGBoost Classifier calculated manually is ', accuracy.round(2))
print('Accuracy of XGBoost Classifier is ', accuracy_score(y_test, y_pred_XGB))
print('\n', '\n','Confusion Matrix of XGBoost Classifier:' '\n', confusion_matrix(y_test, y_pred_XGB))
print('\n', '\n','Classification Report for XGBoost Classifier:' '\n',classification_report(y_test, y_pred_XGB))

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred_XGB)
cmd = ConfusionMatrixDisplay(cm, display_labels=['False', 'True'])
fig, ax = plt.subplots(figsize=(9,9))
cmd.plot(ax=ax)
cmd.ax_.set(xlabel='Predicted', ylabel='Actual', title='Confusion Matrix Actual vs Predicted');

### **GradientBoosting Classifier**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
GB = GradientBoostingClassifier(criterion = 'friedman_mse', max_depth = 5, n_estimators= 100)
GB.fit(X_train, y_train)
y_pred_GB = GB.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
print("Train set Accuracy: ", accuracy_score(y_train, GB.predict(X_train)))
accuracy = (y_pred_GB == y_test).sum() / len(y_test)
print('Accuracy of GradientBoosting Classifier calculated manually is ', accuracy.round(2))
print('Accuracy of GradientBoosting Classifier is ', accuracy_score(y_test, y_pred_GB))
print('\n', '\n','Confusion Matrix of GradientBoosting Classifier:' '\n', confusion_matrix(y_test, y_pred_GB))
print('\n', '\n','Classification Report for GradientBoosting Classifier:' '\n',classification_report(y_test, y_pred_GB))

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred_GB)
cmd = ConfusionMatrixDisplay(cm, display_labels=['False', 'True'])
fig, ax = plt.subplots(figsize=(9,9))
cmd.plot(ax=ax)
cmd.ax_.set(xlabel='Predicted', ylabel='Actual', title='Confusion Matrix Actual vs Predicted');

### **Artificial Neural Network**

In [None]:
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.constraints import max_norm
model = Sequential()
# first hidden layer (the input shape is inferred from the data on the first fit call)
model.add(Dense(70,  activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(70, activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(70, activation='relu'))
model.add(Dropout(0.2))

# output layer
model.add(Dense(units=1,activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(x=X_train, 
          y=y_train, 
          epochs=25,
          verbose = 0,
          batch_size=256,
          validation_data=(X_test, y_test), 
          )

In [None]:
losses = pd.DataFrame(model.history.history)
losses[['loss','val_loss']].plot();

In [None]:
predictions = model.predict(X_test) > 0.5
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import accuracy_score
print('Accuracy of the ANN Model is ', accuracy_score(y_test,predictions))
print('\n', '\n','Confusion Matrix of ANN Model:' '\n', confusion_matrix(y_test,predictions))
print('\n', '\n','Classification Report for ANN Model:' '\n',classification_report(y_test,predictions))
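
The sigmoid output layer returns one probability per passenger in an array of shape `(n, 1)`, and the comparison with 0.5 turns it into class labels. A minimal sketch with made-up probabilities; the `ravel()` call, which flattens the result to one dimension, is an optional extra step assumed here for clarity.

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1), as a Keras model with one output unit returns
probs = np.array([[0.91], [0.12], [0.50], [0.67]])

# Strictly greater than 0.5 counts as True; exactly 0.5 therefore maps to False
labels = (probs > 0.5).ravel()
print(labels)
```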

## **Final Prediction**<a name = 'Prediction'></a>

In [None]:
# Prediction using RF Classifier
RF= RandomForestClassifier(criterion= 'entropy', max_depth= 13, n_estimators = 100, random_state = 0)
RF.fit(X_train, y_train) 
submission_preds = RF.predict(titanic_test)
test_ids = df_test['PassengerId']
# Use a new name so the training DataFrame `df` is not overwritten
submission = pd.DataFrame({'PassengerId': test_ids.values, 'Transported': submission_preds})
submission.to_csv('submission.csv', index = False)

In [None]:
submission_preds
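
Before writing the file, it can be worth checking that the submission has one row per test passenger and Boolean predictions. The sketch below uses made-up IDs and predictions in place of `df_test['PassengerId']` and `submission_preds`, and writes to a string instead of a file.

```python
import numpy as np
import pandas as pd

# Made-up IDs and predictions standing in for the real test set
ids = pd.Series(["9001_01", "9002_01", "9002_02"], name="PassengerId")
preds = np.array([True, False, True])

submission = pd.DataFrame({"PassengerId": ids.values, "Transported": preds.astype(bool)})

# Basic sanity checks on shape and dtype
assert len(submission) == len(ids)
assert submission["Transported"].dtype == bool

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```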