# Spaceship Titanic Competition

*Created by: Taylor Daugherty*

*Created on: 5/23/2023    Last updated: 6/13/2023 - Clean up notebook*

This notebook is used to practice using cross validation.


**Input File:** train.csv and test.csv from the spaceship-titanic competition

- train.csv contains a training set of data with features and true values

- test.csv contains a testing set of data with only features. This is what will be used to make predictions and submit to the competition


**Purpose of notebook:** Gain more practice with cross validation and make submissions to the spaceship titanic competition

## Table of Contents

1. Import packages

2. Useful functions

3. Initial Model

4. Improvement 1: Remove 'Converted' group

5. Improvement 2: Impute NaN values

    a. Constant
    
    b. Mean
    
    c. Median

## Import packages

Import the packages necessary to run the notebook

1. numpy: used for linear algebra

2. pandas: used for dataframe creation and manipulation

4. `SimpleImputer`: Used to fill NaN values

5. `LogisticRegression`: a classification model using logistic regression

6. `KNeighborsClassifier`: a classification model using KNN

7. `SVC`: a classification model using SVM

8. `KFold` and `cross_val_score`: used for cross validation

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.model_selection import StratifiedKFold, cross_val_score

### File paths

These are the filepaths to the original data

In [2]:
train_filepath = '/kaggle/input/spaceship-titanic/train.csv'
test_filepath = '/kaggle/input/spaceship-titanic/test.csv'

In [3]:
df_train = pd.read_csv(train_filepath)
df_test = pd.read_csv(test_filepath)

### Fill Lists

There are many features with missing values in the dataset. For this reason, almost every column needs to fill missing values.

These two lists are designed to identify features and common values to fill the NaN values with.

- The first variable is a list of feature names that have missing values and will need to be filled

- The second variable is a list of the most frequent values for their respective columns

In [4]:
fill_features = ['HomePlanet', 'Destination', 'Side', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age']
fill_freq = ['Earth', 'TRAPPIST-1e', 'S', False, False]

### Mapping Lists

There are many columns in the dataset that contain purely categorical data. Categorical data cannot be used in machine learning models, so the data must be converted to numbers

These two lists are designed to identify features and common values to map the categorical data.

- The first variable is a list of feature names that have categorical values and will need to be mapped

- The second variable is a list of dictionaries to use in mapping. The mapping values are based on frequency in the dataset with higher numbers corresponding to higher frequency

In [5]:
map_features = ['HomePlanet','Destination','Deck','Side']

map_dictionaries = [{'Earth':2, 'Europa':1, 'Mars':0}, 
                    {'TRAPPIST-1e':2, '55 Cancri e':1, 'PSO J318.5-22':0}, 
                    {'T':0, 'A':1, 'D':2, 'C':3, 'B':4, 'E':5, 'G':6, 'F':7}, 
                    {'P':0, 'S':1}]

### List of models

These are the models that will be used during cross-validation. 

**Models:**

- Logistic Regression

- KNN

- SVM

In [6]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC()))

In [7]:
rs_mean = df_train['RoomService'].mean()
fc_mean = df_train['FoodCourt'].mean()
sm_mean = df_train['ShoppingMall'].mean()
spa_mean = df_train['Spa'].mean()
vr_mean = df_train['VRDeck'].mean()
age_mean = df_train['Age'].mean()

rs_test_mean = df_test['RoomService'].mean()
fc_test_mean = df_test['FoodCourt'].mean()
sm_test_mean = df_test['ShoppingMall'].mean()
spa_test_mean = df_test['Spa'].mean()
vr_test_mean = df_test['VRDeck'].mean()
age_test_mean = df_test['Age'].mean()

mean_nan_fill_train = [rs_mean, fc_mean, sm_mean, spa_mean, vr_mean, age_mean]
mean_nan_fill_test = [rs_test_mean, fc_test_mean, sm_test_mean, spa_test_mean, vr_test_mean, age_test_mean]

In [8]:
rs_med = df_train['RoomService'].median()
fc_med = df_train['FoodCourt'].median()
sm_med = df_train['ShoppingMall'].median()
spa_med = df_train['Spa'].median()
vr_med = df_train['VRDeck'].median()
age_med = df_train['Age'].median()

rs_test_med = df_test['RoomService'].median()
fc_test_med = df_test['FoodCourt'].median()
sm_test_med = df_test['ShoppingMall'].median()
spa_test_med = df_test['Spa'].median()
vr_test_med = df_test['VRDeck'].median()
age_test_med = df_test['Age'].median()

med_nan_fill_train = [rs_med, fc_med, sm_med, spa_med, vr_med, age_med]
med_nan_fill_test = [rs_test_med, fc_test_med, sm_test_med, spa_test_med, vr_test_med, age_test_med]

### perform_cross_validation()

This function goes through the steps to perform Stratified K-fold cross validation using the list of models described above.

**Input:** A dataframe containing the features use to build the model, a Series of the true values associated with the feature list

**Output:** Printed result for the mean and standard deviation of each model

In [9]:
def perform_cross_validation(X_train, y_train):
    '''
    This function goes through the steps to perform Stratified K-fold cross validation using the list of models described above.
    
    Input: 
        - A dataframe containing the features use to build the model
        - A Series of the true values associated with the feature list
    
    Output: Printed result for the mean and standard deviation of each model
    '''
    results = dict()

    for name, model in models:
        kfold = StratifiedKFold(n_splits=10)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
        results[name] = (cv_results.mean(), cv_results.std())

    print('Model\t\tCV Mean\t\tCV std')
    print(results)

### expand_cabin()

This expands the 'Cabin' column into the 'Deck' and 'Side'. 

The 'Num' feature is removed because there are two many unique values. The 'Cabin' feature is removed since it is not needed after the expansion

**Input:** A dataframe to with the Cabin feature to expand

**Output:** A new dataframe with the cabin expanded to 'Deck' and 'Side' and without 'Cabin' or 'Num'

In [10]:
def expand_cabin(df):
    '''
    This expands the 'Cabin' column into the 'Deck' and 'Side'.
    
    Input: A dataframe to with the Cabin feature to expand
    
    Output: A new dataframe with the cabin expanded to 'Deck' and 'Side' and without 'Cabin' or 'Num'
    '''
    df[['Deck','Num','Side']] = df['Cabin'].astype(str).str.split('/', expand=True)
    df = df.drop('Cabin', axis=1)
    df = df.drop('Num', axis=1)
    return df

### fill_na()

Fill the NaN values of the features with the fill values

**Input:** A dataframe, a list of features, a list of fill values

- Dataframe to fill the missing values in

- A list of features to look for missing values

- A list of values to fill the missing values of the feature with

**Output:** A dataframe with the found missing values filled

In [11]:
def fill_na(df, features, fill_vals):
    '''
    Fill the NaN values of the features with the fill values
    
    Input:
        - Dataframe to fill the missing values in
        - A list of features to look for missing values
        - A list of values to fill the missing values of the feature with
    
    Output: A dataframe with the found missing values filled
    '''
    df_filled = df
    
    for i in range(len(features)):
        feature = features[i]
        fill_val = fill_vals[i]
        
        df_filled[feature] = df_filled[feature].fillna(fill_val)
        
    return df_filled

### map_cat()

Map categorical data to be numeric

**Input:** A dataframe, a list of features, a list of dictionaries

- Dataframe for the mapping to take place in

- List of features that contain categorical data

- List of dictionaries to govern how to map the feature

**Output:** A dataframe with the given features mapped according to the dictionaries

In [12]:
def map_cat(df, features, dictionaries):
    '''
    Map categorical data to be numeric
    
    Input:
        - Dataframe for the mapping to take place in
        - List of features that contain categorical data
        - List of dictionaries to govern how to map the feature
    
    Output: A dataframe with the given features mapped according to the dictionaries
        
    '''
    df_mapped = df
    
    for i in range(len(features)):
        feature = features[i]
        dictionary = dictionaries[i]
        
        df_mapped[feature] = df[feature].map(dictionary)
        
    return df_mapped

### normalize()

Normalizes a Series

**Input:** A feature of type Series

**Output:** The normalized feature of type Series

In [13]:
def normalize(feature):
    '''
    This function normalizes a Series
    
    Input: A feature of type Series
    
    Output: The normalized feature of type Series
    '''
    return (feature - feature.mean())/feature.std()

### normalize_features()

Normalizes all features in a given dataframe. This will normalize ALL features, so ensure that the inputted dataframe consists only of numeric values.

**Input:** A dataframe to normalize

**Output:** A normalized dataframe

In [14]:
def normalize_features(df):
    '''
    This function normalizes all features in a dataframe
    
    Input: A pandas dataframe
    
    Output: The normalized dataframe
    '''
    for column in df.columns:
        df[column] = normalize(df[column])
    return df

In [15]:
def prepare_dataframe(df, fill_nans):
    df = df.drop('Name', axis=1)
    df = expand_cabin(df)
    df['Deck'] = df['Deck'].replace('nan','F')
    
    fill_vals = fill_freq + fill_nans
    df_filled = fill_na(df, fill_features, fill_vals)
    
    df_mapped = map_cat(df_filled, map_features, map_dictionaries)
    return df_mapped

**--------------------------------------------------------------------------------**

## Fill with Mean

This model fills the missing values for spending features with the mean of the feature

**Unique attributes:**

- Fill NaN with feature mean

**Result of CV:** The model used to make predictions was SVM because it had the highest accuracy score

**Accuracy:** The accuracy after submitting the predictions

    0.79541

In [16]:
df_train_mean = pd.read_csv(train_filepath, index_col='PassengerId')

df_train_mean = prepare_dataframe(df_train_mean, mean_nan_fill_train)

X_mean_train = df_train_mean.drop('Transported', axis=1)
y_mean = df_train_mean['Transported']

X_mean_norm_train = normalize(X_mean_train)

In [17]:
df_test_mean = pd.read_csv(test_filepath, index_col='PassengerId')

df_test_mean = prepare_dataframe(df_test_mean, mean_nan_fill_train)

df_test_mean_norm = normalize(df_test_mean)

In [18]:
# perform_cross_validation(X_mean_norm_train, y_mean)

In [19]:
clf = SVC().fit(X_mean_norm_train, y_mean)

predictions = clf.predict(df_test_mean_norm)

submission1 = pd.DataFrame(data={'Transported':predictions}, index=df_test.index)

submission1.head()

Unnamed: 0,Transported
0,True
1,False
2,True
3,True
4,True


In [20]:
submission1.to_csv('2023-6-13_Submission1.csv')

Result: 0.79541

**--------------------------------------------------------------------------------**

In [21]:
df_train2 = pd.read_csv(train_filepath, index_col='PassengerId')
df_test2 = pd.read_csv(test_filepath, index_col='PassengerId')

In [22]:
df_train2 = df_train2.drop('Name', axis=1)
df_test2 = df_test2.drop('Name', axis=1)

In [23]:
df_train2 = expand_cabin(df_train2)
df_test2 = expand_cabin(df_test2)

In [24]:
X2 = df_train2.drop('Transported', axis=1)

y2 = df_train2['Transported']

In [25]:
X2['Deck'] = X2['Deck'].replace('nan','F')
df_test2['Deck'] = df_test2['Deck'].replace('nan','F')

In [26]:
rs_med = X2['RoomService'].median()
fc_med = X2['FoodCourt'].median()
sm_med = X2['ShoppingMall'].median()
spa_med = X2['Spa'].median()
vr_med = X2['VRDeck'].median()
age_med = X2['Age'].median()

rs_test_med = df_test2['RoomService'].median()
fc_test_med = df_test2['FoodCourt'].median()
sm_test_med = df_test2['ShoppingMall'].median()
spa_test_med = df_test2['Spa'].median()
vr_test_med = df_test2['VRDeck'].median()
age_test_med = df_test2['Age'].median()

In [27]:
fill_features = ['HomePlanet', 'Destination', 'Side', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age']
fill_vals = ['Earth', 'TRAPPIST-1e', 'S', False, False]

fill_vals_train_med = fill_vals + [rs_med, fc_med, sm_med, spa_med, vr_med, age_med]
fill_vals_test_med = fill_vals + [rs_test_med, fc_test_med, sm_test_med, spa_test_med, vr_test_med, age_test_med]

In [28]:
X_imputed2 = fill_na(X2, fill_features, fill_vals_train_med)
X_test_imputed2 = fill_na(df_test2, fill_features, fill_vals_test_med)

In [29]:
map_features = ['HomePlanet','Destination','Deck','Side']

map_dictionaries = [{'Earth':2, 'Europa':1, 'Mars':0}, 
                    {'TRAPPIST-1e':2, '55 Cancri e':1, 'PSO J318.5-22':0}, 
                    {'T':0, 'A':1, 'D':2, 'C':3, 'B':4, 'E':5, 'G':6, 'F':7}, 
                    {'P':0, 'S':1}]

In [30]:
X_mapped2 = map_cat(X_imputed2, map_features, map_dictionaries)
X_test_mapped2 = map_cat(X_test_imputed2, map_features, map_dictionaries)

In [31]:
X_normalized2 = normalize_features(X_mapped2)
X_test_normalized2 = normalize_features(X_test_mapped2)

In [32]:
# perform_cross_validation(X_normalized2, y2)

In [33]:
clf_med = SVC().fit(X_normalized2, y2)

predictions_med = clf.predict(X_test_normalized2)

submission2 = pd.DataFrame(data={'Transported':predictions_med}, index=df_test.index)

submission2.head()

Unnamed: 0,Transported
0,True
1,False
2,True
3,True
4,True


In [34]:
submission2.to_csv('2023-6-13_Submission2.csv')

Result: 0.79611