Spaceship Titanic : Exploratory Data Analysis (EDA)
-----------------
Predict which passengers are transported to an alternate dimension

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

####  File and Data Field Descriptions
--------------------------------
train.csv - Personal records for about two-thirds (8700) of the passengers, to be used as training data.

PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
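
The gggg_pp layout described above can be split into its two parts with pandas. The following is a minimal sketch on toy IDs (illustrative values, not real records); the column names `Group`, `NumberInGroup`, `GroupSize` and `IsAlone` are assumptions made for this example and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy IDs in the gggg_pp format (illustrative only).
ids = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"]})

# Split once on "_" into the group part (gggg) and the within-group number (pp).
parts = ids["PassengerId"].str.split("_", expand=True)
ids["Group"] = parts[0]
ids["NumberInGroup"] = parts[1].astype(int)

# Group size: how many passengers share the same gggg prefix.
ids["GroupSize"] = ids.groupby("Group")["PassengerId"].transform("count")

# A passenger travels alone when the group has exactly one member.
ids["IsAlone"] = ids["GroupSize"] == 1
print(ids)
```

Note that pp equal to 01 only marks the first member of a group; whether a passenger travels alone depends on the group size.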

HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
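
As a rough illustration of the deck/num/side format, the sketch below splits a few toy cabin strings (made-up values) into three columns. `Cabin_num` is a hypothetical column name used only here; missing cabins simply stay missing in every derived column.

```python
import numpy as np
import pandas as pd

# Toy cabins in deck/num/side form, including one missing value.
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", np.nan]})

# expand=True returns one column per piece; a missing cabin yields missing pieces.
parts = cabins["Cabin"].str.split("/", expand=True)
cabins["Cabin_deck"] = parts[0]
cabins["Cabin_num"] = pd.to_numeric(parts[1])
cabins["Cabin_side"] = parts[2]
print(cabins)
```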

Destination - The planet the passenger will be debarking to.

Age - The age of the passenger.

VIP - Whether the passenger has paid for special VIP service during the voyage.
RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name - The first and last names of the passenger.
Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

sample_submission.csv - A submission file in the correct format.
PassengerId - Id for each passenger in the test set.
Transported - The target. For each passenger, predict either True or False.

Download Kaggle Dataset Steps

https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
os.listdir('/kaggle/input')

In [None]:
train = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
train.shape

In [None]:
train.head()

In [None]:
train["PassengerGroup"] = train["PassengerId"].apply(lambda x : x.split("_")[1])
# Binary encoding of the pp part of the Id:
# 1 = first-listed member of a group (pp == "01"), 2 = any additional member
train["PassengerGroup"] = np.where(train["PassengerGroup"]=="01",1,2)

In [None]:
# Drop Name and PassengerId because they are unique per passenger.
train.drop(columns=["PassengerId","Name"],inplace=True)

Univariate, Bivariate and Multivariate Analysis

In [None]:
"""
Note : From a visual inspection of the data:

1. HomePlanet, CryoSleep, Cabin, Destination, VIP and Transported are categorical features.
2. Age and the amounts spent on RoomService, FoodCourt, ShoppingMall, Spa and VRDeck are numerical features.
3. Transported is categorical and is also the target class.
4. CryoSleep, VIP and Transported are binary (two-class) features.

"""

In [None]:
# Let's check whether there are missing values
print("Missing Value in Each Variable\n")
for feature in train.columns:
  print(f"{feature} : {train[feature].isna().sum()}")
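
The per-column loop above can also be expressed as a small summary table. This is a sketch under the assumption that a sorted view with percentages is useful; `missing_summary` is a hypothetical helper and the demo frame holds made-up values.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame with a few gaps (illustrative values only).
demo = pd.DataFrame({
    "Age": [27.0, np.nan, 31.0, np.nan],
    "VIP": [False, True, np.nan, False],
})
summary = missing_summary(demo)
print(summary)
```

On the real data this would be called as `missing_summary(train)`.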

In [None]:
"""
Categorical Features
"""

In [None]:
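# Convert the boolean classes to numerical values (missing values stay NaN).
# Assigning the result back avoids relying on inplace changes through attribute access.
train["CryoSleep"] = train["CryoSleep"].map({True:1,False:0})
train["VIP"] = train["VIP"].map({True:1,False:0})
train["Transported"] = train["Transported"].map({True:1,False:0})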

In [None]:
train.CryoSleep.value_counts(dropna=False)

In [None]:
"""
Note : As a naive approach,
missing values are replaced with 0/False, assuming
that these people were not in cryosleep.
"""
train["CryoSleep"] = train["CryoSleep"].fillna(0)
# Post Imputing
train.CryoSleep.value_counts(dropna=False)

In [None]:
"""
Note : As a naive approach,
missing values are replaced with 0/False, assuming
that these people were not VIP.
"""
train["VIP"] = train["VIP"].fillna(0)
# Post Imputing
train.VIP.value_counts(dropna=False)

In [None]:
train["HomePlanet"].value_counts(dropna=False)

In [None]:
"""
Note : As a naive approach,
missing HomePlanet values are replaced with "Earth", assuming
that these people are from Earth.
"""
train["HomePlanet"] = train["HomePlanet"].fillna("Earth")
# Post Imputing
train.HomePlanet.value_counts(dropna=False)
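
The constant-fill cells above hard-code the replacement value. One possible alternative, shown here only as a sketch, derives the fill value from the data by taking the most frequent observed category. `fill_with_mode` is a hypothetical helper and the series below is toy data.

```python
import numpy as np
import pandas as pd

def fill_with_mode(series):
    """Fill missing entries with the most frequent observed value."""
    # mode() can return several values on ties; take the first one.
    mode_value = series.mode(dropna=True).iloc[0]
    return series.fillna(mode_value)

# Toy column with two gaps (illustrative values only).
planets = pd.Series(["Earth", "Europa", np.nan, "Earth", "Mars", np.nan], name="HomePlanet")
filled = fill_with_mode(planets)
print(filled.value_counts(dropna=False))
```

Whether the mode is a sensible fill depends on how dominant the top category is; it remains a naive choice, like the constant fill.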

In [None]:
# Multi-class categorical features are converted to numerical values later;
# EDA is performed before the conversion.
sns.countplot(x="HomePlanet",hue="Transported",data=train)

In [None]:
train["Destination"].value_counts(dropna=False)

In [None]:
"""
Note : As a naive approach,
missing Destination values are replaced with "TRAPPIST-1e", assuming
that these people planned to go to TRAPPIST-1e.
"""
train["Destination"] = train["Destination"].fillna("TRAPPIST-1e")
# Post Imputing
train.Destination.value_counts(dropna=False)

In [None]:
sns.countplot(x="Destination",hue="Transported",data=train)

In [None]:
train["Cabin"].value_counts(dropna=False)

In [None]:
# Analysis Functions
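def extract_cabin_deck(cabin):
    """Return the deck part of a deck/num/side string, or NaN if unavailable."""
    try:
        return cabin.split('/')[0]
    except (AttributeError, IndexError):
        # Missing cabins are NaN floats, which have no .split
        return np.nan

def extract_cabin_side(cabin):
    """Return the side part (P or S) of a deck/num/side string, or NaN if unavailable."""
    try:
        return cabin.split('/')[2]
    except (AttributeError, IndexError):
        return np.nan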

In [None]:
train["Cabin_deck"]=train["Cabin"].apply(lambda x : extract_cabin_deck(x))

In [None]:
train["Cabin_deck"].value_counts(dropna=False)

In [None]:
train["Cabin_deck"] = train["Cabin_deck"].fillna("F")

In [None]:
sns.countplot(x="Cabin_deck",hue="Transported",data=train)

In [None]:
train["Cabin_side"]=train["Cabin"].apply(lambda x : extract_cabin_side(x))

In [None]:
train["Cabin_side"].value_counts(dropna=False)

In [None]:
# Note: the Starboard and Port sides have a roughly 50:50 split, so missing values are imputed at random
np.random.seed(43)
data = np.random.choice(a=list(train["Cabin_side"].value_counts().index),
                        size=train["Cabin_side"].isna().sum(),  # one draw per missing value
                        p=[0.5,0.5])

In [None]:
missing_index=train[train["Cabin_side"].isnull()].index.tolist()

In [None]:
fill = pd.DataFrame(index = train.index[train.Cabin_side.isnull()], data= data,columns=["Cabin_side"])
train.fillna(fill,inplace=True)
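
The random imputation above assumes an exact 50:50 split. A variant that draws from the observed category frequencies instead could look like the following sketch; `impute_by_observed_frequency` is a hypothetical helper, and the seed and toy series are arbitrary.

```python
import numpy as np
import pandas as pd

def impute_by_observed_frequency(series, seed=43):
    """Fill NaNs with random draws that follow the observed category shares."""
    observed = series.value_counts(normalize=True)  # shares sum to 1
    missing = series.isna()
    rng = np.random.default_rng(seed)
    draws = rng.choice(
        observed.index.to_numpy(),
        size=int(missing.sum()),        # one draw per missing value
        p=observed.to_numpy(),
    )
    result = series.copy()
    result[missing] = draws
    return result

# Toy column with two gaps (illustrative values only).
sides = pd.Series(["P", "S", np.nan, "S", "P", np.nan], name="Cabin_side")
imputed = impute_by_observed_frequency(sides)
print(imputed.tolist())
```

Observed values are left untouched; only the gaps receive random draws.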

In [None]:
"""
Note : Cabin takes the form deck/num/side, where side is either P for Port or S for Starboard.
Since num behaves like a unique feature, the original Cabin variable is split into only two features, Cabin_deck and Cabin_side.
The Cabin feature is then removed from the dataset to avoid redundancy.
"""
train.drop(columns=["Cabin"],inplace=True)

In [None]:
sns.countplot(x="Cabin_side",hue="Transported",data=train)

In [None]:
# Convert the Categorical Feature to Numerical Features

#custom binary encoding
train["Cabin_side"] = np.where(train["Cabin_side"]=="P", 1, 0)
train["Transported"] = np.where(train["Transported"]==True, 1, 0)

#Label Encoding
# Import label encoder
from sklearn import preprocessing
# label_encoder object knows how to understand word labels.
label_encoder_cd = preprocessing.LabelEncoder()
label_encoder_dest = preprocessing.LabelEncoder()
label_encoder_hp = preprocessing.LabelEncoder()
# Encode labels in the Cabin_deck, Destination and HomePlanet columns.
train["Cabin_deck"] = label_encoder_cd.fit_transform(train["Cabin_deck"])
train["Destination"] = label_encoder_dest.fit_transform(train["Destination"])
train["HomePlanet"] = label_encoder_hp.fit_transform(train["HomePlanet"])
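
A label encoder learns its mapping from the data it is fitted on, so the same fitted encoder should be reused on any later data. The sketch below, on toy deck letters, shows how refitting on a different set of values silently changes the codes. The variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy deck letters (illustrative values only).
train_decks = pd.Series(["F", "G", "B", "F"])
test_decks = pd.Series(["G", "B"])  # contains fewer distinct values than train

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_decks)  # learns the mapping from train
test_codes = encoder.transform(test_decks)        # reuses that mapping

# classes_ lists the labels in code order (sorted alphabetically).
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Refitting on the test values alone gives "G" a different code.
refit_codes = LabelEncoder().fit_transform(test_decks)
print(list(test_codes), list(refit_codes))
```

`transform` raises a ValueError for labels that were not seen during fitting, which makes a mismatch visible instead of hiding it.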

In [None]:
"""
Numerical Features
"""

In [None]:
#Age
plt.figure(figsize=(20,8))
plt.subplot(161)
sns.violinplot(data=train["Age"])
 
# And now add something in the second part:
plt.subplot(162)
sns.violinplot(data=train["RoomService"])

plt.subplot(163)
sns.violinplot(data=train["FoodCourt"])

plt.subplot(164)
sns.violinplot(data=train["ShoppingMall"])

plt.subplot(165)
sns.violinplot(data=train["Spa"])

plt.subplot(166)
sns.violinplot(data=train["VRDeck"])

# Show the graph
plt.show()

In [None]:
train[["Age","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]].describe()

In [None]:
sns.histplot(train["Age"])

In [None]:
"""
Note :
1. The most populated age bin is roughly the 18-32 range, with a mean of about 28 and a median of 27.
2. "RoomService", "FoodCourt", "ShoppingMall", "Spa" and "VRDeck" all have a median expenditure of 0,
suggesting that most passengers don't spend much on the amenities.
Missing ages are therefore filled with the median (27) and missing expenditures with 0.
"""

train["Age"] = train["Age"].fillna(27)
train["RoomService"] = train["RoomService"].fillna(0)
train["FoodCourt"] = train["FoodCourt"].fillna(0)
train["ShoppingMall"] = train["ShoppingMall"].fillna(0)
train["Spa"] = train["Spa"].fillna(0)
train["VRDeck"] = train["VRDeck"].fillna(0)

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(161)
sns.violinplot(data=np.log(train["Age"]+1))
 
# And now add something in the second part:
plt.subplot(162)
sns.violinplot(data=np.log(train["RoomService"]+1))

plt.subplot(163)
sns.violinplot(data=np.log(train["FoodCourt"]+1))

plt.subplot(164)
sns.violinplot(data=np.log(train["ShoppingMall"]+1))

plt.subplot(165)
sns.violinplot(data=np.log(train["Spa"]+1))

plt.subplot(166)
sns.violinplot(data=np.log(train["VRDeck"]+1))

# Show the graph
plt.show()
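
The `np.log(x + 1)` transform used in the plots above has a dedicated NumPy function, `np.log1p`, with `np.expm1` as its inverse. A minimal sketch on toy spending values:

```python
import numpy as np
import pandas as pd

# Toy spending values with a heavy right tail (illustrative only).
spend = pd.Series([0.0, 9.0, 99.0, 9999.0], name="RoomService")

# log1p(x) computes log(1 + x); zero spending maps to exactly 0.
log_spend = np.log1p(spend)

# expm1 undoes the transform, recovering the original scale.
restored = np.expm1(log_spend)
print(log_spend.round(3).tolist())
```

The transform keeps the ordering of values while compressing the long tail, which is why the violin plots become easier to read.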

In [None]:
sns.histplot(train["Age"],bins=5)

Correlation Plot

In [None]:
plt.figure(figsize=(8,4))
corr=train[["PassengerGroup","Age","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]].corr()
sns.heatmap(corr,cmap="Greens",annot=True)
plt.show()

In [None]:
""""
plt.figure(figsize=(4,4))
train["Expenditure"]=train["RoomService"]+train["FoodCourt"]+train["ShoppingMall"]+train["Spa"]+train["VRDeck"]
corr=train[["Age","Expenditure"]].corr()
sns.heatmap(corr,cmap="Greens",annot=True)
plt.show()
"""
train["Expenditure"]=train["RoomService"]+train["FoodCourt"]+train["ShoppingMall"]+train["Spa"]+train["VRDeck"]
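
The total above is built with `+`, which is safe here because missing amounts were already filled with 0. As a sketch of the difference on toy data that still contains a gap, a row-wise `sum(axis=1)` skips NaN by default, while `+` propagates it. `Expenditure_plus` is a hypothetical column name used only for the comparison.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy bills; the second row has one missing amount (illustrative values only).
bills = pd.DataFrame(
    [[10.0, 0.0, 5.0, 0.0, 0.0],
     [0.0, np.nan, 0.0, 20.0, 1.0]],
    columns=spend_cols,
)

# Chained "+" yields NaN as soon as any term is missing.
bills["Expenditure_plus"] = (
    bills["RoomService"] + bills["FoodCourt"] + bills["ShoppingMall"]
    + bills["Spa"] + bills["VRDeck"]
)

# Row-wise sum ignores NaN by default (skipna=True).
bills["Expenditure"] = bills[spend_cols].sum(axis=1)
print(bills[["Expenditure_plus", "Expenditure"]])
```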

In [None]:
"""
Note:
Option: drop the "RoomService","FoodCourt","ShoppingMall","Spa","VRDeck" features and go forward with the single Expenditure feature to keep the model as simple as possible.
The drop below is left commented out, so the individual columns are kept.
"""
#train.drop(columns=["RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"],inplace=True)

In [None]:
# Reordering the columns for ease of visualization
train = train[["PassengerGroup","HomePlanet","CryoSleep","Destination","Age","VIP","Cabin_deck","Cabin_side","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck","Transported"]]
train

In [None]:
sns.countplot(x="Transported",data=train)

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(train.corr(),cmap="Greens",annot=True)
plt.show()

Pair Plot

In [None]:
# pairplot creates its own figure, so a separate plt.figure call is not needed
sns.pairplot(train)
plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler,MinMaxScaler
#Std_Scaler = StandardScaler()
#Std_Scaler_train=Std_Scaler.fit(train[["PassengerGroup","HomePlanet","CryoSleep","Destination","Age","VIP","Cabin_deck","Cabin_side","Expenditure"]])
#vect_data=Std_Scaler_train.transform(train[["PassengerGroup","HomePlanet","CryoSleep","Destination","Age","VIP","Cabin_deck","Cabin_side","Expenditure"]])

In [None]:
min_max_Scaler = MinMaxScaler()
min_max_train=min_max_Scaler.fit(train[["PassengerGroup","HomePlanet","CryoSleep","Destination","Age","VIP","Cabin_deck","Cabin_side","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]])
vect_data=min_max_train.transform(train[["PassengerGroup","HomePlanet","CryoSleep","Destination","Age","VIP","Cabin_deck","Cabin_side","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]])
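
Min-max scaling maps each column to (x - min) / (max - min), using the minimum and maximum learned during `fit`. The sketch below, on toy numbers, checks the scaler against that formula and shows that data outside the fitted range can fall outside [0, 1]. The frame names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data: the scaler is fitted on one frame and applied to another.
train_part = pd.DataFrame({"Age": [0.0, 20.0, 40.0], "Spa": [0.0, 50.0, 100.0]})
new_part = pd.DataFrame({"Age": [10.0, 60.0], "Spa": [25.0, 100.0]})

scaler = MinMaxScaler().fit(train_part)   # learns per-column min and max
scaled_new = scaler.transform(new_part)   # applies them to unseen rows

# The same computation written out by hand.
col_min = train_part.min()
col_max = train_part.max()
manual = (new_part - col_min) / (col_max - col_min)

print(scaled_new)
```

An age of 60 exceeds the fitted maximum of 40, so its scaled value is above 1; this is expected when the scaler is reused on new data.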

In [None]:
vect_train=pd.DataFrame(columns=["PassengerGroup","HomePlanet","CryoSleep","Destination","Age","VIP","Cabin_deck","Cabin_side","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"],data=vect_data)
vect_train["Transported"] = train["Transported"]
vect_train

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(vect_train.corr(),cmap="Greens",annot=True)
plt.show()

In [None]:
vect_train.to_csv("vectorized_train.csv",index=False)

Modelling

In [None]:
"""
Logistic Regression
"""

In [None]:
from sklearn.model_selection import train_test_split 

In [None]:
train = pd.read_csv("vectorized_train.csv")

X = train[["PassengerGroup","HomePlanet","CryoSleep","Destination","Age","VIP","Cabin_deck","Cabin_side","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]]
y = train.pop("Transported")


In [None]:
print(X.shape , y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.33 ,random_state = 42)
print(X_train.shape , X_test.shape , y_train.shape , y_test.shape)

In [None]:
"""
Model 1 : Logistic Regression (Default Regularization : l2 Penalty {Ridge})
"""
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix , accuracy_score , f1_score , precision_score , recall_score

lr = LogisticRegression()
lr.fit(X_train,y_train)


In [None]:
y_train_pred = lr.predict(X_train)
print("confusion matrix : " ,confusion_matrix(y_train, y_train_pred))
print("accuracy : ",round(accuracy_score(y_train, y_train_pred),2))
print("f1-score : ",round(f1_score(y_train, y_train_pred),2))
print("precision : ",round(precision_score(y_train, y_train_pred),2))
print("recall : " ,round(recall_score(y_train, y_train_pred),2))

In [None]:
y_pred = lr.predict(X_test)
y_proba = lr.predict_proba(X_test)

In [None]:
print("confusion matrix : " ,confusion_matrix(y_test, y_pred))
print("accuracy : ",round(accuracy_score(y_test, y_pred),2))
print("f1-score : ",round(f1_score(y_test, y_pred),2))
print("precision : ",round(precision_score(y_test, y_pred),2))
print("recall : " ,round(recall_score(y_test, y_pred),2))
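
The same five metric lines are repeated for every model and split. One way to avoid the repetition, shown as a sketch only, is a small helper; `report_metrics` is a hypothetical name and the labels below are toy values.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def report_metrics(y_true, y_pred, ndigits=2):
    """Print the confusion matrix and return rounded summary metrics."""
    metrics = {
        "accuracy": round(float(accuracy_score(y_true, y_pred)), ndigits),
        "f1-score": round(float(f1_score(y_true, y_pred)), ndigits),
        "precision": round(float(precision_score(y_true, y_pred)), ndigits),
        "recall": round(float(recall_score(y_true, y_pred)), ndigits),
    }
    print("confusion matrix :\n", confusion_matrix(y_true, y_pred))
    for name, value in metrics.items():
        print(f"{name} : {value}")
    return metrics

# Toy labels: 2 true positives, 2 true negatives, 1 false positive, 1 false negative.
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]
demo_metrics = report_metrics(y_true_demo, y_pred_demo)
```

In the notebook it could be called as `report_metrics(y_train, y_train_pred)` and `report_metrics(y_test, y_pred)`.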

In [None]:
"""
Model 2 : Logistic Regression (L1 Norm/Penalty {Lasso})
"""
lr = LogisticRegression(penalty="l1",solver="liblinear")
lr.fit(X_train,y_train)

In [None]:
y_train_pred = lr.predict(X_train)
print("confusion matrix : " ,confusion_matrix(y_train, y_train_pred))
print("accuracy : ",round(accuracy_score(y_train, y_train_pred),2))
print("f1-score : ",round(f1_score(y_train, y_train_pred),2))
print("precision : ",round(precision_score(y_train, y_train_pred),2))
print("recall : " ,round(recall_score(y_train, y_train_pred),2))

In [None]:
y_pred = lr.predict(X_test)
y_proba = lr.predict_proba(X_test)

In [None]:
print("confusion matrix : " ,confusion_matrix(y_test, y_pred))
print("accuracy : ",round(accuracy_score(y_test, y_pred),2))
print("f1-score : ",round(f1_score(y_test, y_pred),2))
print("precision : ",round(precision_score(y_test, y_pred),2))
print("recall : " ,round(recall_score(y_test, y_pred),2))

In [None]:
lr.feature_names_in_

In [None]:
print("Model Beta Coefficients:\n")
print("Beta0 :", lr.intercept_[0])

# Start at 1 so the first feature coefficient does not reuse the intercept's Beta0 label
for x,beta in enumerate(lr.coef_[0], start=1):
  print(f"Beta{x}/{lr.feature_names_in_[x-1]} : {round(beta,2)}")

In [None]:
lr.predict_proba(X_test)[:,0]

This gave a Kaggle submission accuracy of 78%.

In [None]:
from sklearn.metrics import roc_curve

# roc_curve expects the score of the positive class, i.e. column 1 of predict_proba
fpr , tpr , threshold = roc_curve(y_test,lr.predict_proba(X_test)[:,1])
roc_df = pd.DataFrame({"recall":tpr,"specificity":1-fpr})
ax = roc_df.plot(x="specificity",y="recall",figsize=(4,4),legend=False)
ax.set_ylim(0,1)
ax.set_xlim(1,0)
ax.plot((1,0),(0,1))
ax.set_xlabel("specificity")
ax.set_ylabel("recall")
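
`predict_proba` returns one column per class, ordered as in `classes_`, and ROC functions expect the score of the positive class. The sketch below uses synthetic data (not the competition data) to show how picking the wrong column mirrors the result; all names in it are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary problem: the first feature drives the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Locate the column that belongs to the positive class (label 1).
pos_col = list(model.classes_).index(1)
proba = model.predict_proba(X_demo)

auc_pos = roc_auc_score(y_demo, proba[:, pos_col])      # correct score column
auc_neg = roc_auc_score(y_demo, proba[:, 1 - pos_col])  # wrong column: mirrored AUC
print(round(auc_pos, 3), round(auc_neg, 3))
```

With the wrong column the AUC becomes one minus the correct value, so a good model looks worse than random.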

In [None]:
"""
Tree Model
"""

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(criterion="entropy", min_impurity_decrease=0.01,min_samples_split=10,min_samples_leaf=10)
dt_clf.fit(X_train,y_train)

In [None]:
y_train_pred=dt_clf.predict(X_train)
print("confusion matrix : " ,confusion_matrix(y_train, y_train_pred))
print("accuracy : ",round(accuracy_score(y_train, y_train_pred),2))
print("f1-score : ",round(f1_score(y_train, y_train_pred),2))
print("precision : ",round(precision_score(y_train, y_train_pred),2))
print("recall : " ,round(recall_score(y_train, y_train_pred),2))

In [None]:
y_pred=dt_clf.predict(X_test)
print("confusion matrix : " ,confusion_matrix(y_test, y_pred))
print("accuracy : ",round(accuracy_score(y_test, y_pred),2))
print("f1-score : ",round(f1_score(y_test, y_pred),2))
print("precision : ",round(precision_score(y_test, y_pred),2))
print("recall : " ,round(recall_score(y_test, y_pred),2))

In [None]:
from sklearn.metrics import roc_curve

# roc_curve expects the score of the positive class, i.e. column 1 of predict_proba
fpr , tpr , threshold = roc_curve(y_test,dt_clf.predict_proba(X_test)[:,1])
roc_df = pd.DataFrame({"recall":tpr,"specificity":1-fpr})
ax = roc_df.plot(x="specificity",y="recall",figsize=(4,4),legend=False)
ax.set_ylim(0,1)
ax.set_xlim(1,0)
ax.plot((1,0),(0,1))
ax.set_xlabel("specificity")
ax.set_ylabel("recall")

Submission steps

In [None]:
# Submission Document 
submit_test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
# lets Check if there are missing values
print("Missing Value in Each Variable\n")
for feature in submit_test.columns:
  print(f"{feature} : {submit_test[feature].isna().sum()}")

In [None]:
# Apply the same preprocessing to the test set that was applied to train.
submit_test["PassengerGroup"] = submit_test["PassengerId"].apply(lambda x : x.split("_")[1])
submit_test["PassengerGroup"] = np.where(submit_test["PassengerGroup"]=="01",1,2)

submit_test["HomePlanet"] = submit_test["HomePlanet"].fillna("Earth")
submit_test["Destination"] = submit_test["Destination"].fillna("TRAPPIST-1e")
submit_test["Cabin_side"]=submit_test["Cabin"].apply(extract_cabin_side)
# Missing decks are filled with "F", as in training.
submit_test["Cabin_deck"]=submit_test["Cabin"].apply(extract_cabin_deck).fillna("F")

# Booleans to 0/1; missing values are treated as 0/False, as in training.
submit_test["CryoSleep"] = submit_test["CryoSleep"].map({True:1,False:0}).fillna(0)
submit_test["VIP"] = submit_test["VIP"].map({True:1,False:0}).fillna(0)

submit_test["Age"] = submit_test["Age"].fillna(27)
spend_cols = ["RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]
submit_test[spend_cols] = submit_test[spend_cols].fillna(0)
submit_test["Expenditure"]=submit_test[spend_cols].sum(axis=1)

# Note: unlike training (random P/S imputation), a missing Cabin_side becomes 0 (S) here.
submit_test["Cabin_side"] = np.where(submit_test["Cabin_side"]=="P", 1, 0)
# Reuse the encoders fitted on train (transform, not fit_transform) so the codes match.
# transform raises a ValueError for labels that were not seen during fitting.
submit_test["Cabin_deck"] = label_encoder_cd.transform(submit_test["Cabin_deck"])
submit_test["Destination"] = label_encoder_dest.transform(submit_test["Destination"])
submit_test["HomePlanet"] = label_encoder_hp.transform(submit_test["HomePlanet"])

feature_cols = ["PassengerGroup","HomePlanet","CryoSleep","Destination","Age","VIP","Cabin_deck","Cabin_side","RoomService","FoodCourt","ShoppingMall","Spa","VRDeck"]
submit_test = min_max_train.transform(submit_test[feature_cols])

In [None]:
submit_test = pd.DataFrame(data = submit_test, columns =X_train.columns)#.drop(columns=["Destination","Age","VIP","Cabin_side"])

In [None]:
y_submit_pred = lr.predict(submit_test)

In [None]:
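submit=pd.DataFrame(data=y_submit_pred,columns=["Transported"])
# Reload the raw test file to recover PassengerId (submit_test now holds scaled features).
submit_test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
submit["PassengerId"]=submit_test["PassengerId"]
# The submission expects True/False rather than 1/0.
submit["Transported"] = submit["Transported"].astype(bool)
# Column order as listed for sample_submission.csv.
submit = submit[["PassengerId","Transported"]]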

In [None]:
submit.to_csv("submission2.csv",index=False)
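
As a quick sanity check of the submission layout (PassengerId first, then a True/False Transported column), the frame can be built and inspected in memory before writing it. The IDs and predictions below are toy values, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy IDs and 0/1 predictions (illustrative only).
test_ids = pd.Series(["0013_01", "0018_01", "0019_01"], name="PassengerId")
predictions = np.array([1, 0, 1])

submission = pd.DataFrame({
    "PassengerId": test_ids,
    "Transported": predictions.astype(bool),  # 1/0 -> True/False
})

# One prediction per test passenger, with no gaps.
assert len(submission) == len(test_ids)
assert submission["Transported"].notna().all()

# to_csv without a path returns the CSV text, handy for a quick look.
csv_text = submission.to_csv(index=False)
print(csv_text)
```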