# Spaceship Titanic Kaggle Competition

## Dataset Description
In this competition, your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

## File and Data Field Descriptions
### `train.csv` 
Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

`PassengerId` - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

`HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.

`CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

`Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

`Destination` - The planet the passenger will be debarking to.

`Age` - The age of the passenger.

`VIP` - Whether the passenger has paid for special VIP service during the voyage.

`RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

`Name` - The first and last names of the passenger.

`Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

### `test.csv` 
Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

### `sample_submission.csv` 
A submission file in the correct format.

`PassengerId` - Id for each passenger in the test set.

`Transported` - The target. For each passenger, predict either True or False.
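
As a rough illustration of the submission format described above, the sketch below builds a two-column file in memory. The passenger IDs and predictions are made up for the example; a real submission would use the `PassengerId` values from `test.csv` and a model's predictions.

```python
import io

import pandas as pd

# Hypothetical IDs and predictions, for illustration only
passenger_ids = ["0013_01", "0018_01", "0019_01"]
predictions = [True, False, True]

submission = pd.DataFrame(
    {"PassengerId": passenger_ids, "Transported": predictions}
)

# index=False keeps the file to exactly the two expected columns
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
submission_csv = buffer.getvalue()
print(submission_csv)
```

Writing to a path instead (for example `submission.to_csv('submission.csv', index=False)`) produces the same content on disk.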

### Import libraries and data

In [127]:
import pandas as pd

# Define the CSV file path
train_file = 'train.csv' 
test_file = 'test.csv' 

# Read the CSV file into a pandas DataFrame
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)

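# Use the nullable 'boolean' dtype: astype(bool) would turn missing values (NaN) into True
train_df['CryoSleep'] = train_df['CryoSleep'].astype('boolean')
train_df['VIP'] = train_df['VIP'].astype('boolean')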

# Display the DataFrame
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
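
When casting flag columns such as `CryoSleep` and `VIP`, it helps to know how each cast treats missing values. The sketch below uses a made-up three-element Series (not the competition data) to compare plain `bool` with pandas' nullable `'boolean'` dtype.

```python
import numpy as np
import pandas as pd

# Toy flag column with one missing value, stored as object like a CSV-loaded column
flags = pd.Series([True, False, np.nan], dtype=object)

# Plain bool has no way to represent "missing": NaN is truthy, so it becomes True
as_plain_bool = flags.astype(bool)

# The nullable 'boolean' dtype keeps the missing entry as <NA>
as_nullable = flags.astype("boolean")

print(as_plain_bool.tolist())
print(as_nullable.isna().tolist())
```

With plain `bool`, a later `isnull()` check on the column can no longer find the rows that were originally missing.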


Add a field called `GroupId` to the data: the first four characters of `PassengerId` (the gggg part), converted to an integer.

In [128]:
train_df['GroupId'] = train_df['PassengerId'].str[:4].astype(int)

# Display the updated DataFrame
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,GroupId
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,2
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,3
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,3
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,4
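
An alternative to slicing by position is to split `PassengerId` on the underscore, which also recovers the passenger's number within the group. This is a minimal sketch on three sample IDs; the `GroupMember` column name is an assumption made for the example and does not appear elsewhere in this notebook.

```python
import pandas as pd

ids = pd.DataFrame({"PassengerId": ["0001_01", "0003_01", "0003_02"]})

# Split "gggg_pp" into its two parts
parts = ids["PassengerId"].str.split("_", expand=True)

ids["GroupId"] = parts[0].astype(int)
ids["GroupMember"] = parts[1].astype(int)  # hypothetical column name

print(ids)
```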


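Infer `CryoSleep` from spending, since passengers in cryosleep are confined to their cabins. Where `CryoSleep` is missing and `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, and `VRDeck` are all 0, set it to true. Where any of these amounts is greater than 0, set it to false, even if a value was already recorded.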

In [129]:
# Columns holding the amounts billed at each amenity
spend_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Rows where 'CryoSleep' is null and every amenity amount is exactly 0
condition = train_df['CryoSleep'].isnull() & (train_df[spend_cols] == 0).all(axis=1)

# Set 'CryoSleep' to True where the condition is True
train_df.loc[condition, 'CryoSleep'] = True

# Rows where any amenity amount is greater than 0
condition = (train_df[spend_cols] > 0).any(axis=1)

# Set 'CryoSleep' to False where the condition is True
train_df.loc[condition, 'CryoSleep'] = False


# Display the updated DataFrame
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,GroupId
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,2
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,3
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,3
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,4
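
The effect of this rule is easier to see on a small made-up frame. The sketch below assumes `CryoSleep` is stored with the nullable `'boolean'` dtype so that missing values are still detectable; the four rows are invented for illustration.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

toy = pd.DataFrame({
    "CryoSleep": pd.array([pd.NA, pd.NA, True, pd.NA], dtype="boolean"),
    "RoomService": [0.0, 10.0, 0.0, np.nan],
    "FoodCourt": [0.0, 0.0, 0.0, 0.0],
    "ShoppingMall": [0.0, 0.0, 0.0, 0.0],
    "Spa": [0.0, 0.0, 0.0, 0.0],
    "VRDeck": [0.0, 0.0, 0.0, 0.0],
})

no_spend = toy[spend_cols].eq(0).all(axis=1)   # every amount is exactly 0
any_spend = toy[spend_cols].gt(0).any(axis=1)  # at least one positive amount

# Missing CryoSleep with no spending is inferred as True
toy.loc[toy["CryoSleep"].isna() & no_spend, "CryoSleep"] = True
# Any spending implies the passenger was awake
toy.loc[any_spend, "CryoSleep"] = False

print(toy["CryoSleep"])
```

The last row stays missing: one of its amounts is unknown, so neither condition holds.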


If the `Cabin` field is missing, copy it from another row with the same `GroupId` that has a known cabin.

In [130]:
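# Define a function to copy 'Cabin' from another row with the same 'GroupId'
def copy_cabin(row):
    if pd.isna(row['Cabin']):
        # Only consider group members whose 'Cabin' is known. The group always
        # contains the row itself, so taking its first row could return NaN again.
        known = train_df.loc[train_df['GroupId'] == row['GroupId'], 'Cabin'].dropna()
        if not known.empty:
            return known.iloc[0]
    return row['Cabin']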

# Apply the function to fill missing 'Cabin' values
train_df['Cabin'] = train_df.apply(copy_cabin, axis=1)

train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,GroupId
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,2
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,3
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,3
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,4
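
Row-wise `apply` rescans the whole frame for every missing cabin, which can be slow on larger data. One possible vectorized alternative, sketched here on invented rows, uses `groupby(...).transform('first')`, which returns the first non-null `Cabin` in each group.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GroupId": [3, 3, 4, 5, 5],
    "Cabin": [np.nan, "A/0/S", np.nan, "F/1/S", np.nan],
})

# First known cabin per group, broadcast back to every row of that group
group_cabin = toy.groupby("GroupId")["Cabin"].transform("first")

# Only rows with a missing cabin are changed
toy["Cabin"] = toy["Cabin"].fillna(group_cabin)

print(toy)
```

Group 4 has no member with a known cabin, so its value stays missing.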


Split field `Cabin` into `CabinDeck`, `CabinNum` and `CabinSide`.

In [131]:
# Split the 'Cabin' column into three new columns: 'CabinDeck', 'CabinNum', and 'CabinSide'
# A raw string keeps '\d' from being read as an (invalid) string escape
train_df[['CabinDeck', 'CabinNum', 'CabinSide']] = train_df['Cabin'].str.extract(r'([A-Za-z]+)/(\d+)/([A-Za-z]+)')

# Display the updated DataFrame
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,GroupId,CabinDeck,CabinNum,CabinSide
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,1,B,0,P
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,2,F,0,S
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,3,A,0,S
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,3,A,0,S
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,4,F,1,S
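
Because `Cabin` always has the form deck/num/side, splitting on `/` is another way to get the three parts. Both `str.extract` and `str.split` return strings, so the sketch below also converts `CabinNum` to a nullable integer; that conversion is an optional extra shown on made-up rows, not a step this notebook performs.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Cabin": ["B/0/P", "F/1/S", np.nan]})

# Split into three columns; a missing Cabin yields missing values in all three
parts = toy["Cabin"].str.split("/", expand=True)
parts.columns = ["CabinDeck", "CabinNum", "CabinSide"]

# Convert the cabin number from text to a nullable integer
parts["CabinNum"] = pd.to_numeric(parts["CabinNum"]).astype("Int64")

toy = toy.join(parts)
print(toy)
```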


Remove the fields `HomePlanet`, `Destination`, `Cabin`, and `Name`.

In [132]:
# List of columns to remove
columns_to_remove = ['HomePlanet', 'Destination', 'Cabin', 'Name']

# Drop the specified columns from the DataFrame
train_df = train_df.drop(columns=columns_to_remove)

# Display the updated DataFrame
train_df.head()

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,GroupId,CabinDeck,CabinNum,CabinSide
0,0001_01,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,1,B,0,P
1,0002_01,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,2,F,0,S
2,0003_01,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,3,A,0,S
3,0003_02,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,3,A,0,S
4,0004_01,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,4,F,1,S


If `CryoSleep` is true, set `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, and `VRDeck` to 0.
Then fill any values still missing in each of these columns with the mean of that column's non-zero values, truncated to an integer.

In [133]:
# Set the specified columns to 0 where 'CryoSleep' is true
train_df.loc[train_df['CryoSleep'], ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = 0

# Calculate the means of the columns
mean_room_service = int(train_df[train_df['RoomService'] != 0]['RoomService'].mean())
mean_food_court = int(train_df[train_df['FoodCourt'] != 0]['FoodCourt'].mean())
mean_shopping_mall = int(train_df[train_df['ShoppingMall'] != 0]['ShoppingMall'].mean())
mean_spa = int(train_df[train_df['Spa'] != 0]['Spa'].mean())
mean_vr_deck = int(train_df[train_df['VRDeck'] != 0]['VRDeck'].mean())

# Fill missing values with the calculated means.
# Assign the result back: fillna(inplace=True) on a column selection may not update train_df.
train_df['RoomService'] = train_df['RoomService'].fillna(mean_room_service)
train_df['FoodCourt'] = train_df['FoodCourt'].fillna(mean_food_court)
train_df['ShoppingMall'] = train_df['ShoppingMall'].fillna(mean_shopping_mall)
train_df['Spa'] = train_df['Spa'].fillna(mean_spa)
train_df['VRDeck'] = train_df['VRDeck'].fillna(mean_vr_deck)

# Display the updated DataFrame
train_df.head()

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,GroupId,CabinDeck,CabinNum,CabinSide
0,0001_01,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,1,B,0,P
1,0002_01,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,2,F,0,S
2,0003_01,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,3,A,0,S
3,0003_02,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,3,A,0,S
4,0004_01,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,4,F,1,S
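
The two-stage fill can be traced on a small invented frame. This sketch uses only two of the five amenity columns and a plain boolean `CryoSleep` column to keep the example short.

```python
import numpy as np
import pandas as pd

spend_cols = ["RoomService", "Spa"]

toy = pd.DataFrame({
    "CryoSleep": [True, False, False, False],
    "RoomService": [np.nan, 100.0, np.nan, 0.0],
    "Spa": [np.nan, np.nan, 50.0, 25.0],
})

# Stage 1: passengers in cryosleep cannot spend, so their amounts are 0
toy.loc[toy["CryoSleep"], spend_cols] = 0

# Stage 2: fill what is still missing with the truncated mean of non-zero values
for col in spend_cols:
    nonzero_mean = int(toy.loc[toy[col] != 0, col].mean())
    toy[col] = toy[col].fillna(nonzero_mean)

print(toy)
```

For `Spa`, the non-zero values are 50 and 25, so the fill value is `int(37.5)`, which is 37.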


Replace missing ages with the mean age.

In [134]:
mean_age = train_df['Age'].mean()

# Fill missing values with the calculated mean, then truncate to whole years.
# Assign the result back rather than calling fillna(inplace=True) on a column selection.
train_df['Age'] = train_df['Age'].fillna(mean_age).astype(int)

# Display the updated DataFrame
train_df.head()

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,GroupId,CabinDeck,CabinNum,CabinSide
0,0001_01,False,39,False,0.0,0.0,0.0,0.0,0.0,False,1,B,0,P
1,0002_01,False,24,False,109.0,9.0,25.0,549.0,44.0,True,2,F,0,S
2,0003_01,False,58,True,43.0,3576.0,0.0,6715.0,49.0,False,3,A,0,S
3,0003_02,False,33,False,0.0,1283.0,371.0,3329.0,193.0,False,3,A,0,S
4,0004_01,False,16,False,303.0,70.0,151.0,565.0,2.0,True,4,F,1,S
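
A short made-up example of the same fill shows that `astype(int)` truncates the fractional mean rather than rounding it.

```python
import numpy as np
import pandas as pd

ages = pd.Series([39.0, np.nan, 24.0, 58.0])

# The mean skips the missing entry: (39 + 24 + 58) / 3 = 40.33...
mean_age = ages.mean()

# Fill, then truncate toward zero when casting to int
filled = ages.fillna(mean_age).astype(int)

print(filled.tolist())
```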
