# Which passengers were transported to an alternate dimension?

First, import the training data.

In [1]:
import pandas as pd

# Forward slashes work on Windows, macOS, and Linux alike.
raw_train_data = pd.read_csv("spaceship_titanic_data/train.csv")

## Cleaning

The data has many missing values. The info() method shows that the only columns without missing data are PassengerId and Transported. It also shows that many of the columns have the object data type.

In [2]:
raw_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
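
The non-null counts above can also be turned into explicit missing-value counts. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and only stands in for the real file), using `isna().sum()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (hypothetical values, not the real file).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "HomePlanet": ["Europa", None, "Earth", "Earth"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "Transported": [False, True, False, True],
})

# Number of missing entries in each column.
missing_counts = toy.isna().sum()

# Keep only the columns that actually have gaps.
columns_with_gaps = missing_counts[missing_counts > 0]
print(columns_with_gaps)
```

Applied to raw_train_data, the same two lines would list each column with its number of missing entries.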


In [3]:
cleaned_train_data = raw_train_data.copy()
categories = ["HomePlanet", "CryoSleep", "Destination", "VIP", "Transported"]
for column in categories:
    cleaned_train_data[column] = cleaned_train_data[column].astype("category")
cleaned_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   PassengerId   8693 non-null   object  
 1   HomePlanet    8492 non-null   category
 2   CryoSleep     8476 non-null   category
 3   Cabin         8494 non-null   object  
 4   Destination   8511 non-null   category
 5   Age           8514 non-null   float64 
 6   VIP           8490 non-null   category
 7   RoomService   8512 non-null   float64 
 8   FoodCourt     8510 non-null   float64 
 9   ShoppingMall  8485 non-null   float64 
 10  Spa           8510 non-null   float64 
 11  VRDeck        8505 non-null   float64 
 12  Name          8493 non-null   object  
 13  Transported   8693 non-null   category
dtypes: category(5), float64(6), object(3)
memory usage: 654.4+ KB


We can impute the columns with missing data. For HomePlanet, CryoSleep, Destination, and VIP we replace the missing entries with the most frequent value (the mode). For Cabin and Name we replace the missing entries with "Unknown". For Age, RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck we replace the missing entries with the column mean.

In [4]:
for column in ["HomePlanet", "CryoSleep", "Destination", "VIP"]:
    # Fill with the most frequent value. Assigning the result back avoids
    # calling fillna(inplace=True) on a column selection, which newer pandas warns about.
    cleaned_train_data[column] = cleaned_train_data[column].fillna(cleaned_train_data[column].mode()[0])
    # code found at https://stackoverflow.com/questions/40619445/how-to-replace-na-values-with-mode-of-a-dataframe-column-in-python
for column in ["Cabin", "Name"]:
    # Text columns: mark missing entries as "Unknown".
    cleaned_train_data.loc[cleaned_train_data[column].isna(), column] = "Unknown"
for column in ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]:
    # Numeric columns: fill with the column mean.
    cleaned_train_data.loc[cleaned_train_data[column].isna(), column] = cleaned_train_data[column].mean()
cleaned_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   PassengerId   8693 non-null   object  
 1   HomePlanet    8693 non-null   category
 2   CryoSleep     8693 non-null   category
 3   Cabin         8693 non-null   object  
 4   Destination   8693 non-null   category
 5   Age           8693 non-null   float64 
 6   VIP           8693 non-null   category
 7   RoomService   8693 non-null   float64 
 8   FoodCourt     8693 non-null   float64 
 9   ShoppingMall  8693 non-null   float64 
 10  Spa           8693 non-null   float64 
 11  VRDeck        8693 non-null   float64 
 12  Name          8693 non-null   object  
 13  Transported   8693 non-null   category
dtypes: category(5), float64(6), object(3)
memory usage: 654.4+ KB


Since the Cabin entries have the form deck/num/side, we add two separate columns for the deck and the side. For the "Unknown" entries we use "Unknown" again for both the deck and the side. Then we change the data types of the new columns to "category".

In [5]:
deck_values = []
side_values = []
for cabin in cleaned_train_data["Cabin"]:
    if cabin != "Unknown":
        # Split deck/num/side on "/": the deck is the first part, the side is the last.
        # (Indexing cabin[2] would return the first digit of num, not the side.)
        parts = cabin.split("/")
        deck_values.append(parts[0])
        side_values.append(parts[-1])
    else:
        deck_values.append("Unknown")
        side_values.append("Unknown")
cleaned_train_data["Deck"] = deck_values
cleaned_train_data["Side"] = side_values
for column in ["Deck", "Side"]:
    cleaned_train_data[column] = cleaned_train_data[column].astype("category")
cleaned_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   PassengerId   8693 non-null   object  
 1   HomePlanet    8693 non-null   category
 2   CryoSleep     8693 non-null   category
 3   Cabin         8693 non-null   object  
 4   Destination   8693 non-null   category
 5   Age           8693 non-null   float64 
 6   VIP           8693 non-null   category
 7   RoomService   8693 non-null   float64 
 8   FoodCourt     8693 non-null   float64 
 9   ShoppingMall  8693 non-null   float64 
 10  Spa           8693 non-null   float64 
 11  VRDeck        8693 non-null   float64 
 12  Name          8693 non-null   object  
 13  Transported   8693 non-null   category
 14  Deck          8693 non-null   category
 15  Side          8693 non-null   category
dtypes: category(7), float64(6), object(3)
memory usage: 672.1+ KB
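
The deck/num/side split could also be written with pandas string methods instead of a Python loop. This is only a sketch on a few made-up cabin strings (the `cabins` series is hypothetical); it also handles multi-digit cabin numbers, because it splits on "/" rather than indexing characters:

```python
import pandas as pd

# Toy cabins in the deck/num/side form, plus the "Unknown" placeholder.
cabins = pd.Series(["B/0/P", "F/1234/S", "Unknown"], name="Cabin")

# Split into three columns; "Unknown" contains no "/", so its num and side are missing.
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

# Rows without a side are the "Unknown" cabins; label deck and side "Unknown" there.
unknown = parts["Side"].isna()
parts.loc[unknown, ["Deck", "Side"]] = "Unknown"

deck = parts["Deck"].astype("category")
side = parts["Side"].astype("category")
print(pd.DataFrame({"Cabin": cabins, "Deck": deck, "Side": side}))
```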


## Algorithm

Now that we've cleaned the data somewhat, we can apply the prediction algorithm. I've chosen to apply the standard $k$-nearest neighbors classifier to all the columns except for PassengerId, Cabin, Name, and Transported (the target we are predicting). However, we still have categorical features that we need to convert to numerical ones. We use one-hot encoding for that.
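
One way this plan could look in scikit-learn is sketched below on a small made-up frame. The `toy` data, the choice of `n_neighbors=3`, and the `StandardScaler` step are all my assumptions and not part of the notebook; the scaler is included only because $k$-nearest neighbors is distance-based. The point of the sketch is that `ColumnTransformer` applies `OneHotEncoder` to the categorical columns only and leaves the numeric columns numeric:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the cleaned data (hypothetical values).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth", "Europa", "Mars"],
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e",
                    "PSO J318.5-22", "55 Cancri e", "TRAPPIST-1e"],
    "Age": [24.0, 39.0, 58.0, 33.0, 16.0, 44.0],
    "Spa": [549.0, 0.0, 6715.0, 3329.0, 0.0, 0.0],
    "Transported": [True, True, False, False, True, True],
})

categorical = ["HomePlanet", "Destination"]
numeric = ["Age", "Spa"]
features = toy[categorical + numeric]
target = toy["Transported"]

# One-hot encode only the categorical columns; scale the numeric ones (assumption).
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
])

# n_neighbors=3 is an arbitrary choice for this tiny example.
model = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=3))
model.fit(features, target)

# Encoded matrix: 3 + 3 one-hot columns plus 2 numeric columns.
encoded = model[:-1].transform(features)
predictions = model.predict(features)
print(encoded.shape, predictions)
```

On the real data, the categorical list would also include CryoSleep, VIP, Deck, and Side, and the predictions should be evaluated on held-out rows rather than on the training rows.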

_Alternatively:_ Since we are using one-hot encoding, every feature becomes numerical, so we can apply a PCA transform to the data. That would reveal correlated groups of features, and we can check whether "Transported" is among them.
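
A rough sketch of the PCA idea, again on made-up data: the `toy` frame is hypothetical, and using `pd.get_dummies` for the encoding and standardizing the columns before PCA are my assumptions. Transported is included as a column so that its loadings can be compared with those of the other features:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned data (hypothetical values).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars", "Earth", "Europa", "Mars"],
    "Age": [24.0, 39.0, 58.0, 33.0, 16.0, 44.0],
    "Spa": [549.0, 0.0, 6715.0, 3329.0, 0.0, 0.0],
    "Transported": [True, True, False, False, True, True],
})

# One-hot encode HomePlanet and make every column a float.
encoded = pd.get_dummies(toy[["HomePlanet", "Age", "Spa"]], columns=["HomePlanet"]).astype(float)
encoded["Transported"] = toy["Transported"].astype(float)

# Standardize so that large-valued columns such as Spa do not dominate (assumption).
scaled = StandardScaler().fit_transform(encoded)

pca = PCA(n_components=2)
pca.fit(scaled)

# Loadings: how strongly each original column contributes to each component.
loadings = pd.DataFrame(pca.components_.T, index=encoded.columns, columns=["PC1", "PC2"])
print(loadings)
print(pca.explained_variance_ratio_)
```

Columns with large loadings of the same sign on a component vary together; a large loading for Transported alongside another feature would hint at a relationship worth checking.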

In [23]:
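# Tried to use OneHotEncoder, couldn't finish it.
# Note: the encoder is fit on every remaining column, so the numeric columns
# (Age, RoomService, ...) are treated as categorical too, and each distinct
# value shows up as its own category in the output.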

from sklearn.preprocessing import OneHotEncoder

to_train = cleaned_train_data[cleaned_train_data.columns[1:]]
to_train = to_train.drop(columns = ["Cabin", "Name", "Transported"])
true_values = cleaned_train_data["Transported"]

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(to_train)
enc.categories_

[array(['Earth', 'Europa', 'Mars'], dtype=object),
 array([False, True], dtype=object),
 array(['55 Cancri e', 'PSO J318.5-22', 'TRAPPIST-1e'], dtype=object),
 array([ 0.        ,  1.        ,  2.        ,  3.        ,  4.        ,
         5.        ,  6.        ,  7.        ,  8.        ,  9.        ,
        10.        , 11.        , 12.        , 13.        , 14.        ,
        15.        , 16.        , 17.        , 18.        , 19.        ,
        20.        , 21.        , 22.        , 23.        , 24.        ,
        25.        , 26.        , 27.        , 28.        , 28.82793047,
        29.        , 30.        , 31.        , 32.        , 33.        ,
        34.        , 35.        , 36.        , 37.        , 38.        ,
        39.        , 40.        , 41.        , 42.        , 43.        ,
        44.        , 45.        , 46.        , 47.        , 48.        ,
        49.        , 50.        , 51.        , 52.        , 53.        ,
        54.        , 55.        , 56. 