# **Spaceship Titanic**

# **Problem statement**

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

# **Features**

1. **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
2. **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.
3. **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
4. **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
5. **Destination** - The planet the passenger will be debarking to.
6. **Age** - The age of the passenger.
7. **VIP** - Whether the passenger has paid for special VIP service during the voyage.
8. **RoomService, FoodCourt, ShoppingMall, Spa, VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
9. **Name** - The first and last names of the passenger.
10. **Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
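The `PassengerId` and `Cabin` formats described above can be split into their parts with pandas string methods. The snippet below is a minimal sketch on a made-up two-row frame; the column names `Group`, `GroupNum`, `Deck`, `CabinNum`, and `Side` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Toy rows in the documented formats (values are made up for illustration)
demo = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S"],
})

# gggg_pp -> group id and number within the group
demo[["Group", "GroupNum"]] = demo["PassengerId"].str.split("_", expand=True).astype(int)

# deck/num/side -> three separate columns
demo[["Deck", "CabinNum", "Side"]] = demo["Cabin"].str.split("/", expand=True)

print(demo)
```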

# **1. Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# **2. Loading Files**

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **3. Reading Files**

In [None]:
train = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
submission = pd.read_csv("/kaggle/input/spaceship-titanic/sample_submission.csv")

In [None]:
train

In [None]:
test

In [None]:
submission

# **4. EDA - Exploratory Data Analysis**

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
group_id = train.PassengerId.str.split("_", expand=True)[0].astype("int32")
group_sizes = group_id.value_counts().to_dict()
GroupSize = group_id.map(lambda x: group_sizes[x]).rename("GroupSize")
pd.concat([GroupSize, train.Transported], axis=1).groupby("GroupSize").mean().plot.bar()

* Passengers in medium-sized groups were more likely to be transported than solo travellers and passengers in very large groups
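The group size used in the plot above can also be derived with `groupby(...).transform("size")`, which avoids the intermediate dictionary. A minimal sketch on made-up IDs (the `Group` and `GroupSize` column names are assumptions for illustration):

```python
import pandas as pd

demo = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"]})

# Group id is the part before the underscore
demo["Group"] = demo["PassengerId"].str.split("_").str[0]

# Number of passengers sharing each group id, broadcast back to every row
demo["GroupSize"] = demo.groupby("Group")["Group"].transform("size")

print(demo)
```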

In [None]:
pd.DataFrame(train.groupby(["HomePlanet"])["Transported"].mean())

In [None]:
print("Raw value Cabin:")
print("-"*10)
print("num of unique:", end=' ')
print(train.Cabin.nunique())
print("*"*20)
print("Deck value:")
print("-"*10)
print("num of unique:", end=' ')
deck = train.Cabin.str.split("/", expand=True)[0].rename("Deck")
print(deck.nunique())
print("*"*20)
print("Side value:")
print("-"*10)
print("num of unique:", end=' ')
side = train.Cabin.str.split("/", expand=True)[2].rename("Side")
print(side.nunique())
print("*"*20)
deck = pd.concat([deck, train.Transported], axis=1)
side = pd.concat([side, train.Transported], axis=1)

In [None]:
print(deck.groupby("Deck")["Transported"].mean())
print("-"*20)
print(side.groupby("Side")["Transported"].mean())

* Deck and Side affect the chance to be Transported

In [None]:
pd.DataFrame(train.groupby("CryoSleep")["Transported"].mean())

* Passengers in CryoSleep are more likely to be transported

In [None]:
train.groupby("Destination")["Transported"].mean()

* Destination also affects the chance to be Transported

In [None]:
train.groupby("Age")["Transported"].mean().plot.line()
print("age <= 18 chance:")
print(train[train.Age.isin(range(0, 19))]["Transported"].mean())
print("age 19-25 chance:")
print(train[train.Age.isin(range(19, 26))]["Transported"].mean())
print("age > 25 chance:")
print(train[train.Age.isin(range(26, 100))]["Transported"].mean())

* Passengers aged 18 or under are more likely to be transported.
* Passengers aged 19-25 are less likely to be transported; passengers over 25 have about the same chance.
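The three age bands compared above can also be built with `pd.cut`, which handles non-integer ages and leaves missing ages as NaN instead of silently skipping them. This is a sketch on made-up data; the band labels and the `AgeBand` column name are illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Age": [5, 18, 22, 25, 40, np.nan],
    "Transported": [True, True, False, False, True, False],
})

# Right-closed bins: (-1, 18], (18, 25], (25, 100]
demo["AgeBand"] = pd.cut(demo["Age"], bins=[-1, 18, 25, 100],
                         labels=["<=18", "19-25", ">25"])

# Share of transported passengers per band (rows with missing Age are excluded)
rates = demo.groupby("AgeBand", observed=True)["Transported"].mean()
print(rates)
```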

In [None]:
train.groupby("VIP").Transported.mean()

* VIPs have a much lower chance of being transported

In [None]:
train["MS"] = train[["RoomService","FoodCourt","ShoppingMall", "Spa", "VRDeck"]].sum(axis=1)
print("money spent = 0:")
print(train[train["MS"] == 0].Transported.mean())
print("money spent > 0:")
print(train[train["MS"] > 0].Transported.mean())

* Passengers who have not spent money have a much higher chance of being transported

## **a. Transported (Target)**

* Balanced Dataset

In [None]:
ax = train['Transported'].value_counts().plot(kind = 'bar')
ax.set_ylabel('No of Passengers')
ax.set_xlabel('Transported')
ax.set_title('Target Distribution')

## **b. Home Planet**

In [None]:
fig, ax = plt.subplots(1,2, figsize = (16, 5))

ax0 = train['HomePlanet'].value_counts().plot(kind = 'bar', ax = ax[0])
ax0.set_xlabel('Home Planet')
ax0.set_ylabel('No of Passengers')
ax0.set_title('No of Passengers vs Home Planet')

ax1 = train.groupby('HomePlanet').agg({'Transported':'mean'}).plot(kind = 'bar', ax = ax[1])
ax1.set_xlabel('Home Planet')
ax1.set_ylabel('Proportion of Passengers')
ax1.set_title('Proportion of Transported Passengers vs Home Planet')

* Almost half of the passengers are from Earth, with the rest from Europa (one of Jupiter's moons) or Mars
* ~60% of passengers from Europa were transported, compared to ~50% from Mars and ~40% from Earth.

## **c. Cryo Sleep**

In [None]:
fig, ax = plt.subplots(1,2, figsize = (16, 5))

ax0 = train['CryoSleep'].value_counts().plot(kind = 'bar', ax = ax[0])
ax0.set_xlabel('Cryo Sleep')
ax0.set_ylabel('No of Passengers')
ax0.set_title('No of Passengers vs Cryo Sleep')

ax1 = train.groupby('CryoSleep').agg({'Transported':'mean'}).plot(kind = 'bar', ax = ax[1])
ax1.set_xlabel('Cryo Sleep')
ax1.set_ylabel('Proportion of Passengers')
ax1.set_title('Proportion of Transported Passengers vs Cryo Sleep')

* About 1/3 of passengers are in cryo sleep
* Almost 80% of passengers in cryo sleep were transported, compared to about 30% of passengers not in cryo sleep
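Rates like these can be read off a row-normalized contingency table. The sketch below uses `pd.crosstab` on made-up rows (not the competition data) to show the pattern:

```python
import pandas as pd

demo = pd.DataFrame({
    "CryoSleep":   [True, True, True, False, False, False, False, False],
    "Transported": [True, True, False, False, False, True, False, False],
})

# normalize="index" makes each row sum to 1, i.e. the share transported within each CryoSleep value
table = pd.crosstab(demo["CryoSleep"], demo["Transported"], normalize="index")
print(table)
```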

## **d. Cabin**

In [None]:
train[['Deck', 'Num', 'Side']] = train['Cabin'].str.split('/', expand = True).fillna('Missing')

### **d.1 Deck**

In [None]:
fig, ax = plt.subplots(1,2, figsize = (16, 5))

ax0 = train['Deck'].value_counts().sort_index().plot(kind = 'bar', ax = ax[0])
ax0.set_xlabel('Deck')
ax0.set_ylabel('No of Passengers')
ax0.set_title('No of Passengers vs Deck')

ax1 = train.groupby('Deck').agg({'Transported':'mean'}).plot(kind = 'bar', ax = ax[1])
ax1.set_xlabel('Deck')
ax1.set_ylabel('Proportion of Passengers')
ax1.set_title('Proportion of Transported Passengers vs Deck')

### **d.2 Side**

In [None]:
fig, ax = plt.subplots(1,2, figsize = (16, 5))

ax0 = train['Side'].value_counts().sort_index().plot(kind = 'bar', ax = ax[0])
ax0.set_xlabel('Side')
ax0.set_ylabel('No of Passengers')
ax0.set_title('No of Passengers vs Side')

ax1 = train.groupby('Side').agg({'Transported':'mean'}).plot(kind = 'bar', ax = ax[1])
ax1.set_xlabel('Side')
ax1.set_ylabel('Proportion of Passengers')
ax1.set_title('Proportion of Transported Passengers vs Side')

## **e. Destination**

In [None]:
fig, ax = plt.subplots(1,2, figsize = (16, 5))

ax0 = train['Destination'].value_counts().sort_index().plot(kind = 'bar', ax = ax[0])
ax0.set_xlabel('Destination')
ax0.set_ylabel('No of Passengers')
ax0.set_title('No of Passengers vs Destination')

ax1 = train.groupby('Destination').agg({'Transported':'mean'}).plot(kind = 'bar', ax = ax[1])
ax1.set_xlabel('Destination')
ax1.set_ylabel('Proportion of Passengers')
ax1.set_title('Proportion of Transported Passengers vs Destination')

## **f. VIP**

In [None]:
fig, ax = plt.subplots(1,2, figsize = (16, 5))

ax0 = train['VIP'].value_counts().sort_index().plot(kind = 'bar', ax = ax[0])
ax0.set_xlabel('VIP')
ax0.set_ylabel('No of Passengers')
ax0.set_title('No of Passengers vs VIP')

ax1 = train.groupby('VIP').agg({'Transported':'mean'}).plot(kind = 'bar', ax = ax[1])
ax1.set_xlabel('VIP')
ax1.set_ylabel('Proportion of Passengers')
ax1.set_title('Proportion of Transported Passengers vs VIP')

## **g. Age**

In [None]:
ax = sns.histplot(train, x = 'Age', binwidth = 1)
ax.set_ylabel('No of Passengers')
ax.set_xlabel('Age')
ax.set_title('Passengers Age Distribution')

In [None]:
train['Age'].median()

In [None]:
train["Age"].describe()

In [None]:
agegroup_mapper = {0:'0-9', 1:'10-19', 2:'20-29', 3:'30-39', 4:'40-49', 5:'50-59', 6:'60-69', 7:'70-79', 8:'80-89'}
train['AgeGroup'] = train['Age'].apply(lambda x: np.floor(x/10)).map(agegroup_mapper)
ax = (pd.pivot_table(train, index = 'AgeGroup', columns = 'Transported', values = 'PassengerId', aggfunc = 'count')
      .rename(columns = {True: 'True', False:'False'})
      .assign(PctTransported = lambda x: x['True']/(x['True']+x['False'])*100)
      .reset_index()
      .plot(kind = 'bar', x = 'AgeGroup', y = 'PctTransported'))
ax.set_ylabel('% of Passengers')
ax.set_xlabel('Age Group')
ax.set_title('% of Passengers Transported by Age Group')

* The median age is 27 years
* Passengers aged 0-9 have a higher probability of being transported

## **h. Spending**

In [None]:
fill_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
train[fill_cols] = train[fill_cols].fillna(0)
train['TotalSpend'] = train['RoomService'] + train['FoodCourt'] + train['ShoppingMall'] + train['Spa'] + train['VRDeck']

In [None]:
ax = sns.histplot(train, x = 'TotalSpend', hue = 'Transported', binwidth = 1000)
ax.set_ylabel('No of Passengers')
ax.set_xlabel('Spending')
ax.set_title('Passengers Total Spending Distribution')
ax.set_xlim(0,10000)

In [None]:
train.groupby('Transported')['TotalSpend'].describe()

# **5. Analyzing training and testing data**

In [None]:
train.isnull().sum()

In [None]:
# Work on an explicit copy so later column assignments do not target a view of train
train1 = train.dropna().copy()
train1

In [None]:
# Encode the boolean target as 0/1 (assigned back instead of a chained inplace replace)
train1['Transported'] = train1['Transported'].astype(int)
train1['Transported']

In [None]:
sns.displot(train1['Transported'])

In [None]:
trans_count = train1['Transported'].value_counts()
trans_count

In [None]:
trans_percent = trans_count / len(train1)
trans_percent

In [None]:
plt.figure(figsize=(25, 7))
ax = plt.subplot()
# Build both subsets from train1 so the boolean masks match the rows being indexed
transported = train1[train1['Transported'] == 1]
not_transported = train1[train1['Transported'] == 0]
# Age vs FoodCourt spending; marker size encodes VRDeck spending
ax.scatter(transported['Age'], transported['FoodCourt'], c='green', s=transported['VRDeck'])
ax.scatter(not_transported['Age'], not_transported['FoodCourt'], c='red', s=not_transported['VRDeck']);

In [None]:
target = train1['Transported']

train1 = train1.drop(columns=['Transported'])
train1

# **6. Combining training and testing data**

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
combi = pd.concat([train1, test])
combi

In [None]:
combi.info()

In [None]:
combi.describe()

In [None]:
combi.isnull().sum()

In [None]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer

imp = IterativeImputer(random_state=42)

date = pd.Timestamp('2200-01-01')

for col in combi.columns:
    if combi[col].dtype == "object":
        # Text and boolean-with-NaN columns: flag missing values explicitly
        combi[col] = combi[col].fillna("not listed")
    elif pd.api.types.is_integer_dtype(combi[col]):
        combi[col] = combi[col].fillna(combi[col].mean())
    elif pd.api.types.is_float_dtype(combi[col]):
        # Imputed one column at a time, so no cross-column information is used
        combi[col] = imp.fit_transform(combi[col].to_numpy().reshape(-1, 1)).ravel()
    elif pd.api.types.is_datetime64_any_dtype(combi[col]):
        combi[col] = combi[col].fillna(date)
combi

In [None]:
combi.isnull().sum()
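A note on the imputation loop above: when `IterativeImputer` is fitted on a single column it has no other features to regress on, so the result should match plain mean imputation. The sketch below checks this on a small made-up array; it assumes a scikit-learn version where `IterativeImputer` is still behind the experimental import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# One made-up column with two missing entries
col = np.array([1.0, 2.0, np.nan, 4.0, np.nan]).reshape(-1, 1)

iterative = IterativeImputer(random_state=42).fit_transform(col)
mean_only = SimpleImputer(strategy="mean").fit_transform(col)

# Compare the two imputed columns element by element
print(np.allclose(iterative, mean_only))
```

To let the imputer use relationships between columns, it would need to be fitted on several numeric columns at once.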

# **7. Analyzing Features**

## **a. Home Planet**

In [None]:
sns.displot(combi['HomePlanet'])

In [None]:
home_count = combi['HomePlanet'].value_counts()
home_count

In [None]:
home_percent = home_count / len(combi)
home_percent

In [None]:
# Take the labels from the data so they always match the slice order
plt.pie(home_percent, labels=home_percent.index)
plt.show()

In [None]:
combi['HomePlanet'] = combi['HomePlanet'].map({"Earth": 1, "Europa": 2, "Mars": 3, "not listed": 4})
combi['HomePlanet']

## **b. Cryo Sleep**

In [None]:
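# Treat missing CryoSleep values as False (.loc avoids chained assignment)
combi.loc[combi['CryoSleep'] == 'not listed', 'CryoSleep'] = False

In [None]:
print(combi.iloc[6674])

In [None]:
# Preview only: the result is not assigned; the conversion happens below via astype(int)
combi['CryoSleep'].replace({False: 0, True: 1})

In [None]:
# sns.distplot is deprecated; a count plot suits a binary column
sns.countplot(x=combi['CryoSleep'].astype(int))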

In [None]:
sleep_count = combi['CryoSleep'].value_counts()
sleep_count

In [None]:
sleep_percent = sleep_count / len(combi)
sleep_percent

In [None]:
combi['CryoSleep'] = combi['CryoSleep'].astype(int)
combi['CryoSleep']

## **c. Destination**

In [None]:
sns.displot(combi['Destination'])

In [None]:
dest_count = combi['Destination'].value_counts()
dest_count

In [None]:
dest_percent = dest_count / len(combi)
dest_percent

In [None]:
# Take the labels from the data so they always match the slice order
plt.pie(dest_percent, labels=dest_percent.index)
plt.show()

In [None]:
combi['Destination'] = combi['Destination'].map({"TRAPPIST-1e": 1, "55 Cancri e": 2, "PSO J318.5-22": 3, "not listed": 4})
combi['Destination']

## **d. Age**

In [None]:
combi['Age_group'] = pd.cut(x=combi['Age'], bins=[-1, 18, 40, 65, 100], labels=['child', 'young adult', 'middle age', 'pensioner'])
combi['Age_group']

In [None]:
sns.displot(combi['Age_group'])

In [None]:
age_count = combi['Age_group'].value_counts()
age_count

In [None]:
age_percent = age_count / len(combi)
age_percent

In [None]:
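# Take the labels from the data: value_counts() orders by frequency, not by a fixed label list
plt.pie(age_percent, labels=age_percent.index)
plt.show()

In [None]:
# Age_group is categorical (from pd.cut), so rename its categories to the integer codes
combi['Age_group'] = combi['Age_group'].cat.rename_categories({"young adult": 1, "child": 2, "middle age": 3, "pensioner": 4})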
combi['Age_group']

In [None]:
combi['Age_group'] = combi['Age_group'].astype(int)

## **e. VIP**

In [None]:
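# Treat missing VIP values as False (.loc avoids chained assignment)
combi.loc[combi['VIP'] == 'not listed', 'VIP'] = False

In [None]:
# Preview only: the result is not assigned; the conversion happens below via astype(int)
combi['VIP'].replace({False: 0, True: 1})

In [None]:
# sns.distplot is deprecated; a count plot suits a binary column
sns.countplot(x=combi['VIP'].astype(int))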

In [None]:
vip_count = combi['VIP'].value_counts()
vip_count

In [None]:
vip_percent = vip_count / len(combi)
vip_percent

In [None]:
combi['VIP'] = combi['VIP'].astype(int)
combi['VIP']

## **f. Room Service**

In [None]:
sns.violinplot(x=combi['RoomService'])

In [None]:
rm_service_high = combi['RoomService'].max()
print(rm_service_high)

In [None]:
# Open-ended top bin so unusually large amounts are not left as NaN
combi['Room_Service_group'] = pd.cut(x=combi['RoomService'], bins=[-1, 2000, 8000, np.inf], labels=['low', 'med', 'high'])
combi['Room_Service_group']
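One property of `pd.cut` worth keeping in mind for the spending bins in this section: values outside the outermost edges come back as NaN. A small sketch with made-up amounts, using `np.inf` as an open-ended top edge as one possible safeguard:

```python
import numpy as np
import pandas as pd

spend = pd.Series([0.0, 1500.0, 9000.0, 14000.0])  # made-up amounts

# Fixed top edge: 14000 lies outside (8000, 12000] and becomes NaN
fixed = pd.cut(spend, bins=[-1, 2000, 8000, 12000], labels=["low", "med", "high"])

# Open-ended top edge: every non-negative amount gets a label
open_top = pd.cut(spend, bins=[-1, 2000, 8000, np.inf], labels=["low", "med", "high"])

print(fixed.tolist())
print(open_top.tolist())
```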

In [None]:
sns.displot(combi['Room_Service_group'])

In [None]:
rm_service_count = combi['Room_Service_group'].value_counts()
rm_service_count

In [None]:
rm_service_percent = rm_service_count / len(combi)
rm_service_percent

In [None]:
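# Take the labels from the data so they always match the slice order
plt.pie(rm_service_percent, labels=rm_service_percent.index)
plt.show()

In [None]:
# The group column is categorical (from pd.cut), so rename its categories to integer codes
combi['Room_Service_group'] = combi['Room_Service_group'].cat.rename_categories({"low": 1, "med": 2, "high": 3})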
combi['Room_Service_group']

## **g. Food Court**

In [None]:
sns.violinplot(x=combi['FoodCourt'])

In [None]:
food_high = combi['FoodCourt'].max()
print(food_high)

In [None]:
# Open-ended top bin so unusually large amounts are not left as NaN
combi['Food_Court_group'] = pd.cut(x=combi['FoodCourt'], bins=[-1, 5000, 20000, np.inf], labels=['low', 'med', 'high'])
combi['Food_Court_group']

In [None]:
sns.displot(combi['Food_Court_group'])

In [None]:
fd_court_count = combi['Food_Court_group'].value_counts()
fd_court_count

In [None]:
fd_court_percent = fd_court_count / len(combi)
fd_court_percent

In [None]:
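# Take the labels from the data so they always match the slice order
plt.pie(fd_court_percent, labels=fd_court_percent.index)
plt.show()

In [None]:
# The group column is categorical (from pd.cut), so rename its categories to integer codes
combi['Food_Court_group'] = combi['Food_Court_group'].cat.rename_categories({"low": 1, "med": 2, "high": 3})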
combi['Food_Court_group']

## **h. Shopping Mall**

In [None]:
sns.violinplot(x=combi['ShoppingMall'])

In [None]:
shop_high = combi['ShoppingMall'].max()
print(shop_high)

In [None]:
# Open-ended top bin so unusually large amounts are not left as NaN
combi['Shopping_group'] = pd.cut(x=combi['ShoppingMall'], bins=[-1, 2000, 8000, np.inf], labels=['low', 'med', 'high'])
combi['Shopping_group']

In [None]:
sns.displot(combi['Shopping_group'])

In [None]:
shopping_count = combi['Shopping_group'].value_counts()
shopping_count

In [None]:
shopping_percent = shopping_count / len(combi)
shopping_percent

In [None]:
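# Take the labels from the data so they always match the slice order
plt.pie(shopping_percent, labels=shopping_percent.index)
plt.show()

In [None]:
# The group column is categorical (from pd.cut), so rename its categories to integer codes
combi['Shopping_group'] = combi['Shopping_group'].cat.rename_categories({"low": 1, "med": 2, "high": 3})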
combi['Shopping_group']

## **i. Spa**

In [None]:
sns.violinplot(x=combi['Spa'])

In [None]:
spa_high = combi['Spa'].max()
print(spa_high)

In [None]:
# Open-ended top bin so unusually large amounts are not left as NaN
combi['Spa_group'] = pd.cut(x=combi['Spa'], bins=[-1, 5000, 15000, np.inf], labels=['low', 'med', 'high'])
combi['Spa_group']

In [None]:
sns.displot(combi['Spa_group'])

In [None]:
spa_count = combi['Spa_group'].value_counts()
spa_count

In [None]:
spa_percent = spa_count / len(combi)
spa_percent

In [None]:
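# Take the labels from the data so they always match the slice order
plt.pie(spa_percent, labels=spa_percent.index)
plt.show()

In [None]:
# The group column is categorical (from pd.cut), so rename its categories to integer codes
combi['Spa_group'] = combi['Spa_group'].cat.rename_categories({"low": 1, "med": 2, "high": 3})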
combi['Spa_group']

## **j. VR Deck**

In [None]:
sns.violinplot(x=combi['VRDeck'])

In [None]:
vr_high = combi['VRDeck'].max()
print(vr_high)

In [None]:
# Open-ended top bin so unusually large amounts are not left as NaN
combi['VR_group'] = pd.cut(x=combi['VRDeck'], bins=[-1, 5000, 15000, np.inf], labels=['low', 'med', 'high'])
combi['VR_group']

In [None]:
sns.displot(combi['VR_group'])

In [None]:
vr_count = combi['VR_group'].value_counts()
vr_count

In [None]:
vr_percent = vr_count / len(combi)
vr_percent

In [None]:
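# Take the labels from the data so they always match the slice order
plt.pie(vr_percent, labels=vr_percent.index)
plt.show()

In [None]:
# The group column is categorical (from pd.cut), so rename its categories to integer codes
combi['VR_group'] = combi['VR_group'].cat.rename_categories({"low": 1, "med": 2, "high": 3})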
combi['VR_group']

# **8. Assigning Features**

In [None]:
combi.info()

# **9. Prediction**

## **a. Defining X & Y**

In [None]:
features = ["HomePlanet", "CryoSleep", "Destination", "Age_group", "Room_Service_group", "Food_Court_group", "Shopping_group", "Spa_group", "VR_group"]

y = target
X = combi[features][: len(train1)]
X_test = combi[features][len(train1) :]

In [None]:
# Heatmap

cmap = combi[features].corr()
sns.heatmap(cmap)

## **b. Splitting data for training & validation**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=1, stratify=y, shuffle=True)
X_train.shape, X_val.shape, y_train.shape, y_val.shape, X_test.shape

## **c. Selecting Model**

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42, C=10).fit(X_train, y_train)
print(model.score(X_train, y_train))

## **d. Predicting Validation**

In [None]:
y_pred = model.predict(X_val)
print(model.score(X_val, y_val))

## **e. Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_val, y_pred))
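For a binary target, scikit-learn's `confusion_matrix` puts the true labels in rows and the predicted labels in columns, so the four cells can be unpacked with `ravel()`. The sketch below uses a made-up matrix (not this model's output) to derive accuracy, precision, and recall:

```python
import numpy as np

# Hypothetical 2x2 matrix: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[40, 10],
               [5, 45]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that are found

print(accuracy, precision, recall)
```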

In [None]:
# Predict on X_test and convert the 0/1 labels to booleans for the submission file
predictions = model.predict(X_test).astype(bool)
predictions

# **10. Submission**

In [None]:
submission['Transported'] = predictions
submission.to_csv('submission.csv', index=False)
my_submission = pd.read_csv("submission.csv")
my_submission
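The sample submission stores `Transported` as `True`/`False`. Boolean columns are written that way by `to_csv`, which the sketch below shows on made-up IDs and predictions, using an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

pred = np.array([1, 0, 1])  # made-up 0/1 model outputs

demo = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],  # illustrative IDs
    "Transported": pred.astype(bool),
})

# Write to an in-memory buffer to inspect the exact CSV text
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```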