# Spaceship Titanic Competition

*Created by: Taylor Daugherty*

*Created on: 5/23/2023    Last updated: 6/13/2023 - Clean up notebook*

This notebook practices cross validation on the Spaceship Titanic dataset.


**Input File:** train.csv and test.csv from the spaceship-titanic competition

- train.csv contains a training set of data with features and true values

- test.csv contains a testing set of data with only features. This is what will be used to make predictions and submit to the competition


**Purpose of notebook:** Gain more practice with cross validation and make submissions to the spaceship titanic competition

## Table of Contents

1. Import packages

2. Useful functions

3. Initial Model

4. Improvement 1: Remove 'Converted' group

5. Improvement 2: Impute NaN values

    a. Constant
    
    b. Mean
    
    c. Median

## Import packages

Import the packages necessary to run the notebook

1. numpy: used for linear algebra

2. pandas: used for dataframe creation and manipulation

3. `SimpleImputer`: used to fill NaN values

4. `LogisticRegression`: a classification model using logistic regression

5. `KNeighborsClassifier`: a classification model using KNN

6. `SVC`: a classification model using SVM

7. `StratifiedKFold` and `cross_val_score`: used for cross validation

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.model_selection import StratifiedKFold, cross_val_score

In [2]:
train_filepath = '/kaggle/input/spaceship-titanic/train.csv'
test_filepath = '/kaggle/input/spaceship-titanic/test.csv'

### List of models

These are the models that will be used during cross-validation. 

**Models:**

- Logistic Regression

- KNN

- SVM

In [3]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC()))

### apply_imputer()

This function is used in the Improvement 2 section for all three imputation methods. It applies the imputer to the dataframe and restores the original column names, since `fit_transform` returns a plain array without them.

**Input:** A simple imputer to apply and a dataframe to impute
           
**Output:** The imputed dataframe

In [4]:
def apply_imputer(si, df):
    # fit_transform returns a bare array, so restore the column names and index
    df_imputed = pd.DataFrame(si.fit_transform(df), columns=df.columns, index=df.index)
    return df_imputed
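
A minimal sketch of how `apply_imputer()` might be called with the three strategies (constant, mean, median). The toy dataframe, its column values, and the index labels are made up for illustration, and the helper is redefined so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def apply_imputer(si, df):
    # fit_transform returns a bare NumPy array, so restore labels afterwards
    imputed = si.fit_transform(df)
    return pd.DataFrame(imputed, columns=df.columns, index=df.index)


# Hypothetical toy data with one missing value per column
toy = pd.DataFrame(
    {"Age": [20.0, np.nan, 40.0], "Spa": [0.0, 100.0, np.nan]},
    index=["p1", "p2", "p3"],
)

toy_const = apply_imputer(SimpleImputer(strategy="constant", fill_value=0), toy)
toy_mean = apply_imputer(SimpleImputer(strategy="mean"), toy)
toy_median = apply_imputer(SimpleImputer(strategy="median"), toy)

print(toy_mean)
```

Only the `strategy` argument changes between the three calls, so the same helper covers every imputation method.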

### perform_cross_validation()

This function goes through the steps to perform K-fold cross validation using the list of models described above.

**Input:** A dataframe containing the features used to build the model and a Series of the true values associated with those features

**Output:** Printed result for the mean and standard deviation of each model

In [5]:
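def perform_cross_validation(X_train, y_train):
    # The same stratified 10-fold splitter is reused for every model
    kfold = StratifiedKFold(n_splits=10)
    results = dict()

    for name, model in models:
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
        results[name] = (cv_results.mean(), cv_results.std())

    # Print one row per model so the values line up under the header
    print('Model\t\tCV Mean\t\tCV std')
    for name, (mean, std) in results.items():
        print(f'{name}\t\t{mean:.4f}\t\t{std:.4f}')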
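
The cross-validation step can be tried in isolation with the sketch below. It runs stratified 10-fold cross validation for a single model on synthetic data from `make_classification`, which stands in for the competition features. The sample size, feature count, and the `shuffle=True` / `random_state=0` settings are assumptions made for this example, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the competition data (assumed: 200 rows, 5 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Stratified folds keep the class balance roughly equal in every split;
# shuffling with a fixed seed makes the fold assignment reproducible
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=kfold, scoring="accuracy")

# One accuracy value per fold, summarised by mean and standard deviation
print(f"LR\t\t{scores.mean():.4f}\t\t{scores.std():.4f}")
```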

In [6]:
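def expand_cabin(df):
    # Split 'Cabin' (deck/num/side) into three columns; astype(str) turns a
    # missing cabin into the string 'nan', which ends up in 'Deck'
    df[['Deck','Num','Side']] = df['Cabin'].astype(str).str.split('/', expand=True)
    # Keep only 'Deck' and 'Side'; the raw cabin string and number are dropped
    df = df.drop(['Cabin', 'Num'], axis=1)
    return df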
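
To illustrate what the split produces, the sketch below runs the same `str.split('/', expand=True)` call on three made-up cabin strings. The last entry is the literal string `'nan'`, which is assumed to be what a missing `Cabin` looks like after `astype(str)` (the later `replace('nan','F')` call suggests this).

```python
import pandas as pd

# Hypothetical cabins in deck/num/side form; the last entry is the literal
# string 'nan', standing in for a missing Cabin after astype(str)
cabins = pd.Series(["B/0/P", "F/12/S", "nan"], name="Cabin")

# expand=True returns one column per piece; rows with fewer pieces are padded
parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

print(parts)
```

For the `'nan'` row only `Deck` receives a value, and `Num` and `Side` are left missing, which is why `Deck` and `Side` need different treatment when missing values are filled later.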

In [7]:
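def fill_na(df, features, fill_vals):
    # Work on a copy so the caller's dataframe is left untouched
    df_filled = df.copy()

    # Each feature is paired with the value used to fill its NaNs
    for feature, fill_val in zip(features, fill_vals):
        df_filled[feature] = df_filled[feature].fillna(fill_val)

    return df_filled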
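
As a point of comparison, the same per-column filling can be written with a single `DataFrame.fillna` call that takes a dictionary mapping column names to fill values. This is a sketch on made-up data, not a replacement the notebook itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical and one numeric column with gaps
toy = pd.DataFrame(
    {"HomePlanet": ["Mars", None, "Europa"], "Age": [20.0, np.nan, 40.0]}
)

features = ["HomePlanet", "Age"]
fill_vals = ["Earth", toy["Age"].mean()]

# dict(zip(...)) pairs each feature with its fill value;
# fillna returns a new dataframe and leaves `toy` unchanged
filled = toy.fillna(dict(zip(features, fill_vals)))

print(filled)
```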

In [8]:
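def map_cat(df, features, dictionaries):
    # Work on a copy so the caller's dataframe is left untouched
    df_mapped = df.copy()

    # Each categorical feature is replaced by the integer codes in its dictionary
    for feature, dictionary in zip(features, dictionaries):
        df_mapped[feature] = df_mapped[feature].map(dictionary)

    return df_mapped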
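
One behaviour of `Series.map` with a dictionary is worth knowing when using this helper: any value that is not a key in the dictionary becomes NaN. The sketch below shows this with a made-up category (`'Venus'`) that is not part of the competition data.

```python
import pandas as pd

planet_codes = {"Earth": 2, "Europa": 1, "Mars": 0}

# 'Venus' is a made-up value with no entry in the dictionary
planets = pd.Series(["Earth", "Mars", "Venus"])

codes = planets.map(planet_codes)

# Values that failed to map show up as NaN and can be listed for inspection
unmapped = planets[codes.isna()].unique()

print(codes)
print(unmapped)
```

Checking `unmapped` after mapping is a cheap way to catch typos or unexpected categories before they reach the model as NaN.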

In [9]:
def normalize(feature):
    '''
    This function normalizes a Series
    
    Input: A feature of type Series
    
    Output: The normalized feature of type Series
    '''
    return (feature - feature.mean())/feature.std()

In [10]:
def normalize_features(df):
    '''
    This function normalizes all features in a dataframe
    
    Input: A pandas dataframe
    
    Output: The normalized dataframe
    '''
    for column in df.columns:
        df[column] = normalize(df[column])
    return df
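
A small worked example of the normalization formula `(feature - feature.mean())/feature.std()`, using three made-up ages. Note that pandas computes `std()` with `ddof=1` (the sample standard deviation) by default.

```python
import pandas as pd

# Made-up values: mean is 20, sample standard deviation is 10
ages = pd.Series([10.0, 20.0, 30.0])

# Same formula as normalize(): subtract the mean, divide by the std
z = (ages - ages.mean()) / ages.std()

print(z.tolist())
```

The normalized values are centred on zero, and each unit corresponds to one standard deviation of the original feature.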

## Initial Model

In [11]:
df_train = pd.read_csv(train_filepath, index_col='PassengerId')
df_test = pd.read_csv(test_filepath, index_col='PassengerId')

In [12]:
df_train.head(10)

| PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
| 0005_01 | Earth | False | F/0/P | PSO J318.5-22 | 44.0 | False | 0.0 | 483.0 | 0.0 | 291.0 | 0.0 | Sandie Hinetthews | True |
| 0006_01 | Earth | False | F/2/S | TRAPPIST-1e | 26.0 | False | 42.0 | 1539.0 | 3.0 | 0.0 | 0.0 | Billex Jacostaffey | True |
| 0006_02 | Earth | True | G/0/S | TRAPPIST-1e | 28.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | NaN | Candra Jacostaffey | True |
| 0007_01 | Earth | False | F/3/S | TRAPPIST-1e | 35.0 | False | 0.0 | 785.0 | 17.0 | 216.0 | 0.0 | Andona Beston | True |
| 0008_01 | Europa | True | B/1/P | 55 Cancri e | 14.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Erraiam Flatic | True |


In [13]:
df_train = df_train.drop('Name', axis=1)
df_test = df_test.drop('Name', axis=1)

In [14]:
df_train = expand_cabin(df_train)
df_test = expand_cabin(df_test)

In [15]:
X = df_train.drop('Transported', axis=1)

y = df_train['Transported']

In [16]:
X['Deck'] = X['Deck'].replace('nan','F')
df_test['Deck'] = df_test['Deck'].replace('nan','F')
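
If `'F'` was chosen as the replacement because it is the most frequent deck (an assumption; the notebook does not state the reason), the fill value could be derived from the data instead of hard-coded. A sketch on made-up deck values:

```python
import pandas as pd

# Made-up deck values; 'nan' is the string left behind by a missing Cabin
decks = pd.Series(["F", "G", "F", "nan", "B"])

# Most frequent deck among the known values
most_common = decks[decks != "nan"].mode().iloc[0]

# Replace the 'nan' placeholder with that deck
decks_filled = decks.replace("nan", most_common)

print(most_common, decks_filled.tolist())
```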

In [17]:
rs_mean = X['RoomService'].mean()
fc_mean = X['FoodCourt'].mean()
sm_mean = X['ShoppingMall'].mean()
spa_mean = X['Spa'].mean()
vr_mean = X['VRDeck'].mean()
age_mean = X['Age'].mean()

rs_test_mean = df_test['RoomService'].mean()
fc_test_mean = df_test['FoodCourt'].mean()
sm_test_mean = df_test['ShoppingMall'].mean()
spa_test_mean = df_test['Spa'].mean()
vr_test_mean = df_test['VRDeck'].mean()
age_test_mean = df_test['Age'].mean()

In [18]:
fill_features = ['HomePlanet', 'Destination', 'Side', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age']
fill_vals = ['Earth', 'TRAPPIST-1e', 'S', False, False]

fill_vals_train = fill_vals + [rs_mean, fc_mean, sm_mean, spa_mean, vr_mean, age_mean]
fill_vals_test = fill_vals + [rs_test_mean, fc_test_mean, sm_test_mean, spa_test_mean, vr_test_mean, age_test_mean]
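
The cells above compute separate fill values for the training and test sets. A common alternative, shown here only as a sketch on made-up data and not what this notebook does, is to compute the statistics once on the training set and reuse them for the test set, so both sets are filled with the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames with missing numeric values
train = pd.DataFrame({"Spa": [0.0, 100.0, np.nan], "Age": [20.0, np.nan, 40.0]})
test = pd.DataFrame({"Spa": [np.nan, 10.0], "Age": [np.nan, 50.0]})

num_cols = ["Spa", "Age"]

# Series of per-column means, indexed by column name
train_means = train[num_cols].mean()

# fillna with a Series fills each column using the matching label
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)  # test gaps get the training means

print(test_filled)
```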

In [19]:
X_imputed = fill_na(X, fill_features, fill_vals_train)
X_test_imputed = fill_na(df_test, fill_features, fill_vals_test)
X_imputed.head(10)

| PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Deck | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001_01 | Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | B | P |
| 0002_01 | Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | F | S |
| 0003_01 | Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | A | S |
| 0003_02 | Europa | False | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | A | S |
| 0004_01 | Earth | False | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | F | S |
| 0005_01 | Earth | False | PSO J318.5-22 | 44.0 | False | 0.0 | 483.0 | 0.0 | 291.0 | 0.0 | F | P |
| 0006_01 | Earth | False | TRAPPIST-1e | 26.0 | False | 42.0 | 1539.0 | 3.0 | 0.0 | 0.0 | F | S |
| 0006_02 | Earth | True | TRAPPIST-1e | 28.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 304.854791 | G | S |
| 0007_01 | Earth | False | TRAPPIST-1e | 35.0 | False | 0.0 | 785.0 | 17.0 | 216.0 | 0.0 | F | S |
| 0008_01 | Europa | True | 55 Cancri e | 14.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | B | P |


In [20]:
map_features = ['HomePlanet','Destination','Deck','Side']

map_dictionaries = [{'Earth':2, 'Europa':1, 'Mars':0}, 
                    {'TRAPPIST-1e':2, '55 Cancri e':1, 'PSO J318.5-22':0}, 
                    {'T':0, 'A':1, 'D':2, 'C':3, 'B':4, 'E':5, 'G':6, 'F':7}, 
                    {'P':0, 'S':1}]

In [21]:
X_mapped = map_cat(X_imputed, map_features, map_dictionaries)
X_test_mapped = map_cat(X_test_imputed, map_features, map_dictionaries)

X_mapped.head()

| PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Deck | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001_01 | 1 | False | 2 | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 | 0 |
| 0002_01 | 2 | False | 2 | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 7 | 1 |
| 0003_01 | 1 | False | 2 | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 1 | 1 |
| 0003_02 | 1 | False | 2 | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 1 | 1 |
| 0004_01 | 2 | False | 2 | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 7 | 1 |


In [22]:
X_normalized = normalize_features(X_mapped)
X_test_normalized = normalize_features(X_test_mapped)

In [23]:
X_normalized.head()

| PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Deck | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001_01 | -0.44036 | -0.732728 | 0.601283 | 0.709396 | -0.153054 | -0.34057 | -0.287298 | -0.2908 | -0.276648 | -0.269007 | -0.843795 | -1.032805 |
| 0002_01 | 0.817212 | -0.732728 | 0.601283 | -0.336698 | -0.153054 | -0.175354 | -0.281653 | -0.248954 | 0.211493 | -0.230181 | 0.91917 | 0.968125 |
| 0003_01 | -0.44036 | -0.732728 | 0.601283 | 2.034449 | 6.532879 | -0.275393 | 1.955503 | -0.2908 | 5.693962 | -0.225769 | -2.606761 | 0.968125 |
| 0003_02 | -0.44036 | -0.732728 | 0.601283 | 0.290958 | -0.153054 | -0.34057 | 0.517376 | 0.330206 | 2.683316 | -0.098702 | -2.606761 | 0.968125 |
| 0004_01 | 0.817212 | -0.732728 | 0.601283 | -0.894615 | -0.153054 | 0.118702 | -0.243395 | -0.038046 | 0.225719 | -0.267242 | 0.91917 | 0.968125 |


In [24]:
# perform_cross_validation(X_normalized, y)
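
The cells above normalize the full training set before it is passed to cross validation. A possible variant, sketched here on synthetic data and not the approach this notebook takes, puts the scaling step inside a scikit-learn pipeline so that each fold is scaled using only its own training portion. The dataset size and the 5-fold setting are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared features (assumed: 150 rows, 6 features)
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=1)

# The scaler is refit on the training part of every fold before the SVC is trained
pipe = make_pipeline(StandardScaler(), SVC())

cv = StratifiedKFold(n_splits=5)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"SVM\t\t{pipe_scores.mean():.4f}\t\t{pipe_scores.std():.4f}")
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), so its output differs slightly from the pandas-based `normalize()` helper.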

In [25]:
clf = SVC().fit(X_normalized, y)

predictions = clf.predict(X_test_normalized)

submission1 = pd.DataFrame(data={'Transported':predictions}, index=df_test.index)

submission1.head()

| PassengerId | Transported |
|---|---|
| 0013_01 | True |
| 0018_01 | False |
| 0019_01 | True |
| 0021_01 | True |
| 0023_01 | True |


In [26]:
submission1.to_csv('2023-6-13_Submission1.csv')
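
Before uploading, it can help to confirm the layout of the file that `to_csv` writes. The sketch below builds a two-row toy submission (the predictions are made up) and writes it to an in-memory buffer instead of disk. It assumes the competition expects a `PassengerId` column followed by a `Transported` column.

```python
import io

import pandas as pd

# Toy submission: the index carries the passenger IDs, as in the notebook
toy_submission = pd.DataFrame(
    {"Transported": [True, False]},
    index=pd.Index(["0013_01", "0018_01"], name="PassengerId"),
)

# Write to an in-memory buffer so nothing is created on disk
buffer = io.StringIO()
toy_submission.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines)
```

Because the index is named `PassengerId`, `to_csv` writes it as the first column with that header, so no extra `index_label` argument is needed.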

**--------------------------------------------------------------------------------**

In [27]:
df_train2 = pd.read_csv(train_filepath, index_col='PassengerId')
df_test2 = pd.read_csv(test_filepath, index_col='PassengerId')

In [28]:
df_train2 = df_train2.drop('Name', axis=1)
df_test2 = df_test2.drop('Name', axis=1)

In [29]:
df_train2 = expand_cabin(df_train2)
df_test2 = expand_cabin(df_test2)

In [30]:
X2 = df_train2.drop('Transported', axis=1)

y2 = df_train2['Transported']

In [31]:
X2['Deck'] = X2['Deck'].replace('nan','F')
df_test2['Deck'] = df_test2['Deck'].replace('nan','F')

In [32]:
rs_med = X2['RoomService'].median()
fc_med = X2['FoodCourt'].median()
sm_med = X2['ShoppingMall'].median()
spa_med = X2['Spa'].median()
vr_med = X2['VRDeck'].median()
age_med = X2['Age'].median()

rs_test_med = df_test2['RoomService'].median()
fc_test_med = df_test2['FoodCourt'].median()
sm_test_med = df_test2['ShoppingMall'].median()
spa_test_med = df_test2['Spa'].median()
vr_test_med = df_test2['VRDeck'].median()
age_test_med = df_test2['Age'].median()

In [33]:
fill_features = ['HomePlanet', 'Destination', 'Side', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age']
fill_vals = ['Earth', 'TRAPPIST-1e', 'S', False, False]

fill_vals_train_med = fill_vals + [rs_med, fc_med, sm_med, spa_med, vr_med, age_med]
fill_vals_test_med = fill_vals + [rs_test_med, fc_test_med, sm_test_med, spa_test_med, vr_test_med, age_test_med]

In [34]:
X_imputed2 = fill_na(X2, fill_features, fill_vals_train_med)
X_test_imputed2 = fill_na(df_test2, fill_features, fill_vals_test_med)

In [35]:
map_features = ['HomePlanet','Destination','Deck','Side']

map_dictionaries = [{'Earth':2, 'Europa':1, 'Mars':0}, 
                    {'TRAPPIST-1e':2, '55 Cancri e':1, 'PSO J318.5-22':0}, 
                    {'T':0, 'A':1, 'D':2, 'C':3, 'B':4, 'E':5, 'G':6, 'F':7}, 
                    {'P':0, 'S':1}]

In [36]:
X_mapped2 = map_cat(X_imputed2, map_features, map_dictionaries)
X_test_mapped2 = map_cat(X_test_imputed2, map_features, map_dictionaries)

In [37]:
X_normalized2 = normalize_features(X_mapped2)
X_test_normalized2 = normalize_features(X_test_mapped2)

In [38]:
# perform_cross_validation(X_normalized2, y2)

In [39]:
clf_med = SVC().fit(X_normalized2, y2)

# Predict with the model trained on the median-imputed data
predictions_med = clf_med.predict(X_test_normalized2)

submission2 = pd.DataFrame(data={'Transported':predictions_med}, index=df_test2.index)

submission2.head()

| PassengerId | Transported |
|---|---|
| 0013_01 | True |
| 0018_01 | False |
| 0019_01 | True |
| 0021_01 | True |
| 0023_01 | True |


In [40]:
submission2.to_csv('2023-6-13_Submission2.csv')