# Spaceship Titanic raw data preprocessing

## File and Data Field Descriptions

train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.\
    * `PassengerId` - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.\
    * `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.\
    * `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.\
    * `Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.\
    * `Destination` - The planet the passenger will be debarking to.\
    * `Age` - The age of the passenger.\
    * `VIP` - Whether the passenger has paid for special VIP service during the voyage.\
    * `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.\
    * `Name` - The first and last names of the passenger.\
    * `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.\
\
test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.\
\
sample_submission.csv - A submission file in the correct format.\
    *PassengerId* - Id for each passenger in the test set.\
    *Transported* - The target. For each passenger, predict either True or False.\
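
The `gggg_pp` structure of `PassengerId` described above can be unpacked into separate features. The following is a minimal sketch on a toy series; the helper name `split_passenger_id` and the derived column names `Group`, `NumberInGroup` and `GroupSize` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd


def split_passenger_id(passenger_id: pd.Series) -> pd.DataFrame:
    """Split ids of the form gggg_pp into group, number in group and group size."""
    parts = passenger_id.str.split("_", expand=True)
    result = pd.DataFrame(
        {"Group": parts[0], "NumberInGroup": parts[1].astype(int)},
        index=passenger_id.index,
    )
    # Group size = how many passengers share the same gggg prefix.
    result["GroupSize"] = result["Group"].map(result["Group"].value_counts())
    return result


# Toy ids copied from the first rows of train.csv.
ids = pd.Series(["0001_01", "0002_01", "0003_01", "0003_02"])
id_features = split_passenger_id(ids)
print(id_features)
```

Keeping the group as a string preserves the leading zeros, so it can still be matched against the original ids.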


In [30]:
import os
import dotenv
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 1 - Load environment variables

In [31]:
project_dir = str(Path().resolve().parents[0])
dotenv_path = os.path.join(project_dir, '.env')
env_var = dotenv.load_dotenv(dotenv_path)
raw_data_path = os.environ.get("RAW_DATA_PATH")
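
If `RAW_DATA_PATH` is missing from the `.env` file, `os.environ.get` returns `None`, and the later `os.path.join` call fails with a `TypeError` that does not mention the variable. A small guard can make that failure easier to diagnose. This is only a sketch: the helper name `resolve_raw_data_dir` and its `env` parameter are assumptions introduced for illustration.

```python
import os
from pathlib import Path


def resolve_raw_data_dir(project_dir, env=None):
    """Return the raw data directory, failing early if RAW_DATA_PATH is unset."""
    env = os.environ if env is None else env
    raw_data_path = env.get("RAW_DATA_PATH")
    if not raw_data_path:
        raise RuntimeError(
            "RAW_DATA_PATH is not set; check that the .env file exists and was loaded"
        )
    return Path(project_dir) / raw_data_path


# Usage with an explicit mapping instead of the real environment.
raw_dir = resolve_raw_data_dir("/project", {"RAW_DATA_PATH": "data/raw"})
print(raw_dir)
```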

## 2 - Read raw data
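`read_csv` already infers a dtype for each column; the chained `infer_objects` call additionally attempts a soft conversion of any remaining `object` columns to a more specific dtype.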

In [32]:
raw_data = pd.read_csv(os.path.join(project_dir, raw_data_path, "train.csv")).infer_objects()
raw_data.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


### 2.1 Checking column dtypes

In [33]:
print(f"Number of columns: {raw_data.shape[1]}")
raw_data.dtypes

Number of columns: 14


PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
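
In the listing above, `CryoSleep` and `VIP` remain `object` even though they hold True/False values. A likely reason is that both columns contain missing values, which a plain `bool` column cannot store, so `infer_objects` leaves them unchanged. The sketch below reproduces this on a toy column and shows `convert_dtypes` as one possible alternative; it is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy column mimicking CryoSleep: booleans plus one missing value.
toy = pd.DataFrame({"CryoSleep": pd.Series([True, False, np.nan], dtype=object)})

# Soft conversion: the column stays object because of the missing value.
after_infer = toy.infer_objects()

# convert_dtypes switches to pandas' nullable "boolean" dtype instead.
after_convert = toy.convert_dtypes()

print(after_infer.dtypes)
print(after_convert.dtypes)
```

Note that `convert_dtypes` also changes other columns to nullable extension dtypes, which would affect the `select_dtypes` calls used later in this notebook.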

In [34]:
numerical_cols = raw_data.select_dtypes(include="float64").columns.to_list()
categorical_cols = raw_data.select_dtypes(include="object").columns.to_list()
target_col = raw_data.select_dtypes(include="bool").columns.to_list()[0]
print(f"Numerical columns: {numerical_cols}")
print(f"Categorical columns: {categorical_cols}")
print(f"Target column: {target_col}")

Numerical columns: ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
Categorical columns: ['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Name']
Target column: Transported


## 3 - Removing ID type columns

In [47]:
print(raw_data[categorical_cols].nunique())
id_types_cols = raw_data.columns[raw_data.nunique() == raw_data.shape[0]].to_list()
pre_processed_data = raw_data.set_index(id_types_cols)
print(f"Removed ID type columns: {id_types_cols}")
# The Name column seems to contain repeated names.
# Are there several entries for the same person?
# Are these entries identical or different?
# We must check for duplicated rows, and for duplicated rows with different classes.

PassengerId    8693
HomePlanet        3
CryoSleep         2
Cabin          6560
Destination       3
VIP               2
Name           8473
dtype: int64
Removed ID type columns: ['PassengerId']
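
The open questions in the comments (duplicated rows, and duplicated rows with conflicting classes) can be checked along the following lines. This is a sketch on toy data, not a result for the real dataset; the helper name `find_duplicate_rows` is an assumption.

```python
import pandas as pd


def find_duplicate_rows(df, target_col):
    """Return rows whose features are duplicated and the number of label conflicts."""
    feature_cols = [c for c in df.columns if c != target_col]
    # keep=False marks every member of a duplicated group, not only the repeats.
    dup_mask = df.duplicated(subset=feature_cols, keep=False)
    duplicates = df[dup_mask]
    # A group with more than one distinct target value has conflicting classes.
    labels_per_group = duplicates.groupby(feature_cols, dropna=False)[target_col].nunique()
    n_conflicting = int((labels_per_group > 1).sum())
    return duplicates, n_conflicting


toy = pd.DataFrame(
    {
        "Name": ["A B", "A B", "C D", "C D", "E F"],
        "Age": [30.0, 30.0, 41.0, 41.0, 22.0],
        "Transported": [True, True, True, False, False],
    }
)
duplicates, n_conflicting = find_duplicate_rows(toy, "Transported")
print(len(duplicates), n_conflicting)
```

On the notebook's data the same call would be `find_duplicate_rows(pre_processed_data, target_col)`, since `PassengerId` has already been moved to the index.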


## 4 - Missing data imputation

In [48]:

print(f"Number of NaN values: {pre_processed_data.isna().sum().sum()}/{pre_processed_data.size}")
print(f"Number of columns with NaN: {pre_processed_data.isna().any().sum()}/{pre_processed_data.shape[1]}")
print(f"Number of rows with NaN: {pre_processed_data.isna().any(axis=1).sum()}/{pre_processed_data.shape[0]}")
# ToDo:
# We need to impute missing data for all columns.
# We need to check whether this is possible.


Number of NaN values: 2324/113009
Number of columns with NaN: 12/13
Number of rows with NaN: 2087/8693
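
One simple baseline for the imputation listed in the ToDo is to fill numerical columns with the median and categorical columns with the most frequent value. The sketch below assumes that strategy purely for illustration; the notebook does not state which method will be used, and the helper name `impute_simple` is hypothetical.

```python
import pandas as pd


def impute_simple(df, numerical_cols, categorical_cols):
    """Fill numerical NaNs with the median and categorical NaNs with the mode."""
    out = df.copy()
    for col in numerical_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        # mode() ignores NaN by default; take the first value if there are ties.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


toy = pd.DataFrame(
    {
        "Age": [20.0, None, 40.0],
        "HomePlanet": ["Earth", "Earth", None],
    }
)
imputed = impute_simple(toy, numerical_cols=["Age"], categorical_cols=["HomePlanet"])
print(imputed)
```

Free-text columns such as `Name` and `Cabin` are poor candidates for mode imputation, so they would need a separate decision.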


## 5 - Standardize the data by fixed terms

In [49]:
mean_numerical_cols = pre_processed_data[numerical_cols].mean()
std_numerical_cols = pre_processed_data[numerical_cols].std()
pre_processed_data_standard = (pre_processed_data[numerical_cols] - mean_numerical_cols) / std_numerical_cols

print("Mean used for standardization is:")
print(mean_numerical_cols)
print("Standard deviation used for standardization is:")
print(std_numerical_cols)

# Display the standardized numerical columns
pre_processed_data_standard



Mean used for standardization is:
Age              28.827930
RoomService     224.687617
FoodCourt       458.077203
ShoppingMall    173.729169
Spa             311.138778
VRDeck          304.854791
dtype: float64
Standard deviation used for standardization is:
Age               14.489021
RoomService      666.717663
FoodCourt       1611.489240
ShoppingMall     604.696458
Spa             1136.705535
VRDeck          1145.717189
dtype: float64


PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
0001_01,0.702054,-0.337006,-0.284257,-0.287300,-0.273720,-0.266082
0002_01,-0.333213,-0.173518,-0.278672,-0.245957,0.209255,-0.227678
0003_01,2.013391,-0.272511,1.934808,-0.287300,5.633703,-0.223314
0003_02,0.287947,-0.337006,0.511901,0.326231,2.654919,-0.097629
0004_01,-0.885355,0.117460,-0.240819,-0.037588,0.223331,-0.264336
...,...,...,...,...,...,...
9276_01,0.840089,-0.337006,3.947233,-0.287300,1.171685,-0.201494
9278_01,-0.747320,-0.337006,-0.284257,-0.287300,-0.273720,-0.266082
9279_01,-0.195177,-0.337006,-0.284257,2.808468,-0.272840,-0.266082
9280_01,0.218929,-0.337006,0.366694,-0.287300,0.036827,2.557477
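
Each value above is computed as z = (x - mean) / std, using the sample standard deviation that pandas returns by default. Assuming "fixed terms" means that the mean and standard deviation are computed once on the training data and then reused, the same statistics can be applied to other data such as `test.csv`. The helper names below are illustrative assumptions.

```python
import pandas as pd


def fit_standardizer(train, cols):
    """Compute the per-column mean and standard deviation on the training data."""
    return train[cols].mean(), train[cols].std()


def apply_standardizer(df, cols, mean, std):
    """Standardize df with previously computed (fixed) statistics."""
    return (df[cols] - mean) / std


train = pd.DataFrame({"Age": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"Age": [40.0]})

mean, std = fit_standardizer(train, ["Age"])
train_std = apply_standardizer(train, ["Age"], mean, std)
test_std = apply_standardizer(test, ["Age"], mean, std)
print(train_std)
print(test_std)
```

Reusing the training statistics keeps both sets on the same scale and avoids leaking information from the test set into the preprocessing.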


## 6 - Convert categorical columns to numerical

In [52]:
pre_processed_data_categorical_convert = pd.get_dummies(pre_processed_data)

In [54]:
print(pre_processed_data_categorical_convert.columns)

Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported', 'HomePlanet_Earth', 'HomePlanet_Europa',
       'HomePlanet_Mars',
       ...
       'Name_Zinopus Spandisket', 'Name_Zinor Axlentindy',
       'Name_Zinor Proorbeng', 'Name_Zinoth Lansuffle', 'Name_Zosmark Trattle',
       'Name_Zosmark Unaasor', 'Name_Zosmas Ineedeve',
       'Name_Zosmas Mormonized', 'Name_Zubeneb Flesping',
       'Name_Zubeneb Pasharne'],
      dtype='object', length=15050)
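
The 15050 resulting columns come mostly from one-hot encoding the high-cardinality `Name` and `Cabin` columns (8473 and 6560 unique values respectively). One possible way to keep the encoding compact is to split `Cabin` into its `deck/num/side` parts, encode only the deck and side, and leave `Name` out. The sketch below assumes that approach; the helper name `encode_compact` and the columns `CabinDeck` and `CabinSide` are hypothetical.

```python
import pandas as pd


def encode_compact(df, categorical_cols, drop_cols):
    """One-hot encode low-cardinality columns after splitting Cabin into parts."""
    out = df.copy()
    # Cabin has the form deck/num/side; keep only the deck and the side.
    cabin_parts = out["Cabin"].str.split("/", expand=True)
    out["CabinDeck"] = cabin_parts[0]
    out["CabinSide"] = cabin_parts[2]
    out = out.drop(columns=["Cabin"] + list(drop_cols))
    encode_cols = list(categorical_cols) + ["CabinDeck", "CabinSide"]
    return pd.get_dummies(out, columns=encode_cols)


toy = pd.DataFrame(
    {
        "HomePlanet": ["Earth", "Europa", "Earth"],
        "Cabin": ["F/0/S", "B/0/P", None],
        "Name": ["Juanna Vines", "Maham Ofracculy", "Willy Santantines"],
        "Age": [24.0, 39.0, 16.0],
    }
)
encoded = encode_compact(toy, categorical_cols=["HomePlanet"], drop_cols=["Name"])
print(encoded.columns.to_list())
```

Rows with a missing `Cabin` simply get zeros in every `CabinDeck_*` and `CabinSide_*` column, so imputation can still be handled separately.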
