# Spaceship Titanic Competition

*Created by: Taylor Daugherty*

*Created on: 5/23/2023    Last updated: 6/13/2023 - Clean up notebook*

This notebook contains code to make predictions for the Spaceship Titanic competition. The notebook attempts to determine the best imputation method for most accurate competition results.

**Input File:** train.csv and test.csv from the spaceship-titanic competition

- train.csv contains a training set of data with features and true values

- test.csv contains a testing set of data with only features. This is what will be used to make predictions and submit to the competition


**Purpose of notebook:** Gain more practice with cross validation and make submissions to the spaceship titanic competition

## Table of Contents

1. **Universal Application**

    a. Imports
    
    b. Lists
    
    b. Functions

2. **Models**

    a. With Mean
    
    b. With Median

**----------------------------------------------------------------------------------------------------------------------------------------------------------------**

# Univeral Application

In this section is all of the code that is universally applicable throughout the notebook. This includes imports, lists, and functions that make the code much more readable

## Contents:

1. Imports

2. Lists

3. Functions

## Imports

The program relies on importing basic functionalities and data. In this section is the most essential information for the notebook to run

### Contents:

1. Packages

2. Filenames

3. Initial Dataframe

### Packages

Import the packages necessary to run the notebook

1. numpy: used for linear algebra

2. pandas: used for dataframe creation and manipulation

4. `SimpleImputer`: Used to fill NaN values

5. `LogisticRegression`: a classification model using logistic regression

6. `KNeighborsClassifier`: a classification model using KNN

7. `SVC`: a classification model using SVM

8. `KFold` and `cross_val_score`: used for cross validation

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.model_selection import StratifiedKFold, cross_val_score

### File paths

These are the filepaths to the original data

In [2]:
train_filepath = '/kaggle/input/spaceship-titanic/train.csv'
test_filepath = '/kaggle/input/spaceship-titanic/test.csv'

### Dataframes

This is an initial import of the data. There has been no special adaptations to this dataframe that may exist in later imports

In [3]:
df_train = pd.read_csv(train_filepath)
df_test = pd.read_csv(test_filepath)

**--------------------------------------------------------------------------------**


## Lists

There are many cases in this notebook that call for a list. This is so prevalent that a section is needed for it.

### Contents:

1. Fill NaN

2. Mapping

3. Models

### Fill NaN

This section contains all lists associated with filling missing values in the dataframe.

There are many features with missing values in the dataset. For this reason, almost every column needs to fill missing values that are unique to the features

#### Basic fills

These two lists are designed to identify features and common values to fill the NaN values with.

- The first variable is a list of feature names that have missing values and will need to be filled

- The second variable is a list of the most frequent values for their respective columns

In [4]:
fill_features = ['HomePlanet', 'Destination', 'Side', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Age']
fill_freq = ['Earth', 'TRAPPIST-1e', 'S', False, False]

#### Mean fills

These are the means of the features containing numeric data. In this code are the means for both the training and testing sets. 

For simplicity, the values have been combined into a lists (one for training, one for testing

In [5]:
# Training means
rs_mean = df_train['RoomService'].mean()
fc_mean = df_train['FoodCourt'].mean()
sm_mean = df_train['ShoppingMall'].mean()
spa_mean = df_train['Spa'].mean()
vr_mean = df_train['VRDeck'].mean()
age_mean = df_train['Age'].mean()

# Testing means
rs_test_mean = df_test['RoomService'].mean()
fc_test_mean = df_test['FoodCourt'].mean()
sm_test_mean = df_test['ShoppingMall'].mean()
spa_test_mean = df_test['Spa'].mean()
vr_test_mean = df_test['VRDeck'].mean()
age_test_mean = df_test['Age'].mean()

# Mean lists
mean_nan_fill_train = [rs_mean, fc_mean, sm_mean, spa_mean, vr_mean, age_mean]
mean_nan_fill_test = [rs_test_mean, fc_test_mean, sm_test_mean, spa_test_mean, vr_test_mean, age_test_mean]

#### Median fills

These are the medians of the features containing numeric data. In this code are the means for both the training and testing sets. 

For simplicity, the values have been combined into a lists (one for training, one for testing

In [6]:
# Training medians
rs_med = df_train['RoomService'].median()
fc_med = df_train['FoodCourt'].median()
sm_med = df_train['ShoppingMall'].median()
spa_med = df_train['Spa'].median()
vr_med = df_train['VRDeck'].median()
age_med = df_train['Age'].median()

# Testing medians
rs_test_med = df_test['RoomService'].median()
fc_test_med = df_test['FoodCourt'].median()
sm_test_med = df_test['ShoppingMall'].median()
spa_test_med = df_test['Spa'].median()
vr_test_med = df_test['VRDeck'].median()
age_test_med = df_test['Age'].median()

# Median lists
med_nan_fill_train = [rs_med, fc_med, sm_med, spa_med, vr_med, age_med]
med_nan_fill_test = [rs_test_med, fc_test_med, sm_test_med, spa_test_med, vr_test_med, age_test_med]

### Mapping

There are many columns in the dataset that contain purely categorical data. Categorical data cannot be used in machine learning models, so the data must be converted to numbers

These two lists are designed to identify features and common values to map the categorical data.

- The first variable is a list of feature names that have categorical values and will need to be mapped

- The second variable is a list of dictionaries to use in mapping. The mapping values are based on frequency in the dataset with higher numbers corresponding to higher frequency

In [7]:
map_features = ['HomePlanet','Destination','Deck','Side']

map_dictionaries = [{'Earth':2, 'Europa':1, 'Mars':0}, 
                    {'TRAPPIST-1e':2, '55 Cancri e':1, 'PSO J318.5-22':0}, 
                    {'T':0, 'A':1, 'D':2, 'C':3, 'B':4, 'E':5, 'G':6, 'F':7}, 
                    {'P':0, 'S':1}]

### Models

These are the models that will be used during cross-validation. 

**Models:**

- Logistic Regression

- KNN

- SVM

In [8]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC()))

**--------------------------------------------------------------------------------**

## Functions

This notebook is heavily dependent on the functions that were created special for it. 

These functions improve both the readability of the notebook as well as the recreatability of its results. 

### Contents:

1. Dataframe Adjustments

2. Cross Validation

3. Transform Dataframe

### Dataframe Adjustments

These functions all relate to making changes to the dataframe

#### expand_cabin()

This expands the 'Cabin' column into the 'Deck' and 'Side'. 

The 'Num' feature is removed because there are two many unique values. The 'Cabin' feature is removed since it is not needed after the expansion

**Input:** A dataframe to with the Cabin feature to expand

**Output:** A new dataframe with the cabin expanded to 'Deck' and 'Side' and without 'Cabin' or 'Num'

In [9]:
def expand_cabin(df):
    '''
    This expands the 'Cabin' column into the 'Deck' and 'Side'.
    
    Input: A dataframe to with the Cabin feature to expand
    
    Output: A new dataframe with the cabin expanded to 'Deck' and 'Side' and without 'Cabin' or 'Num'
    '''
    df[['Deck','Num','Side']] = df['Cabin'].astype(str).str.split('/', expand=True)
    df = df.drop('Cabin', axis=1)
    df = df.drop('Num', axis=1)
    return df

#### fill_na()

Fill the NaN values of the features with the fill values

**Input:** A dataframe, a list of features, a list of fill values

- Dataframe to fill the missing values in

- A list of features to look for missing values

- A list of values to fill the missing values of the feature with

**Output:** A dataframe with the found missing values filled

In [10]:
def fill_na(df, features, fill_vals):
    '''
    Fill the NaN values of the features with the fill values
    
    Input:
        - Dataframe to fill the missing values in
        - A list of features to look for missing values
        - A list of values to fill the missing values of the feature with
    
    Output: A dataframe with the found missing values filled
    '''
    df_filled = df
    
    for i in range(len(features)):
        feature = features[i]
        fill_val = fill_vals[i]
        
        df_filled[feature] = df_filled[feature].fillna(fill_val)
        
    return df_filled

#### map_cat()

Map categorical data to be numeric

**Input:** A dataframe, a list of features, a list of dictionaries

- Dataframe for the mapping to take place in

- List of features that contain categorical data

- List of dictionaries to govern how to map the feature

**Output:** A dataframe with the given features mapped according to the dictionaries

In [11]:
def map_cat(df, features, dictionaries):
    '''
    Map categorical data to be numeric
    
    Input:
        - Dataframe for the mapping to take place in
        - List of features that contain categorical data
        - List of dictionaries to govern how to map the feature
    
    Output: A dataframe with the given features mapped according to the dictionaries
        
    '''
    df_mapped = df
    
    for i in range(len(features)):
        feature = features[i]
        dictionary = dictionaries[i]
        
        df_mapped[feature] = df[feature].map(dictionary)
        
    return df_mapped

#### normalize()

Normalizes a Series

**Input:** A feature of type Series

**Output:** The normalized feature of type Series

In [12]:
def normalize(feature):
    '''
    This function normalizes a Series
    
    Input: A feature of type Series
    
    Output: The normalized feature of type Series
    '''
    return (feature - feature.mean())/feature.std()

#### normalize_features()

Normalizes all features in a given dataframe. This will normalize ALL features, so ensure that the inputted dataframe consists only of numeric values.

**Input:** A dataframe to normalize

**Output:** A normalized dataframe

In [13]:
def normalize_features(df):
    '''
    This function normalizes all features in a dataframe
    
    Input: A pandas dataframe
    
    Output: The normalized dataframe
    '''
    for column in df.columns:
        df[column] = normalize(df[column])
    return df

### Cross Validation

This function relates to performing cross validation on the dataset using the models listed earlier in the notebook

#### perform_cross_validation()

This function goes through the steps to perform Stratified K-fold cross validation using the list of models described above.

**Input:** A dataframe containing the features use to build the model, a Series of the true values associated with the feature list

**Output:** Printed result for the mean and standard deviation of each model

In [14]:
def perform_cross_validation(X_train, y_train):
    '''
    This function goes through the steps to perform Stratified K-fold cross validation using the list of models described above.
    
    Input: 
        - A dataframe containing the features use to build the model
        - A Series of the true values associated with the feature list
    
    Output: Printed result for the mean and standard deviation of each model
    '''
    results = dict()

    for name, model in models:
        kfold = StratifiedKFold(n_splits=10)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
        results[name] = (cv_results.mean(), cv_results.std())

    print('Model\t\tCV Mean\t\tCV std')
    print(results)

### Transform Dataframe

Apply dataframe adjustment functions in a single place to transform the dataframe in one step.

#### prepare_dataframe()

Prepare the dataframe for splitting into feature and target. This combines most steps from previous file versions into a single function. This is intended to improve readability and reduce the length of the code.

**Input:** A dataframe, list of fill values

- A dataframe to perform functions on. Works with training and testing dataframes

- A list of values to fill the missing datapoints with

    - This was added to give freedom for train and test dataframes as well as different imputation methods
    
**Output:** A dataframe ready to be normalizedPrepare the dataframe for splitting into feature and target

In [15]:
def prepare_dataframe(df, fill_nans):
    '''
    Prepare the dataframe for splitting into feature and target
    
    Input:
        - Prepare the dataframe for splitting into feature and target
        - A list of values to fill the missing datapoints with
        
    Output: A dataframe ready to be normalizedPrepare the dataframe for splitting into feature and target
    '''
    df = df.drop('Name', axis=1)
    df = expand_cabin(df)
    df['Deck'] = df['Deck'].replace('nan','F')
    
    fill_vals = fill_freq + fill_nans
    df_filled = fill_na(df, fill_features, fill_vals)
    
    df_mapped = map_cat(df_filled, map_features, map_dictionaries)
    return df_mapped

**----------------------------------------------------------------------------------------------------------------------------------------------------------------**

# Models

This notebook investigates the accuracy of two models on the Spaceship Titanic Competition dataset.

Given the larget number of missing values in the dataset, the models will differ in the method they use to impute those missing points

## Contents:

1. Fill with Mean

2. Fill with Median

## Results:

The best imputation method was median with a competition accuracy score = 0.79635 

**--------------------------------------------------------------------------------**

## Fill with Mean

This model fills the missing values for spending features with the mean of the feature

**Unique attributes:**

- Fill NaN with feature mean

**Result of CV:** The model used to make predictions was SVM because it had the highest accuracy score

**Accuracy:** The accuracy after submitting the predictions

    0.79611

### Prepare Training

Here, the training data is loaded, prepared, separated into features and target, and the features are normalized

In [16]:
df_train_mean = pd.read_csv(train_filepath, index_col='PassengerId')

df_train_mean = prepare_dataframe(df_train_mean, mean_nan_fill_train)

X_mean_train = df_train_mean.drop('Transported', axis=1)
y_mean = df_train_mean['Transported']

X_mean_norm_train = normalize_features(X_mean_train)

### Prepare testing

Here, the testing data is loaded, prepared, and normalized

In [17]:
df_test_mean = pd.read_csv(test_filepath, index_col='PassengerId')

df_test_mean = prepare_dataframe(df_test_mean, med_nan_fill_test)

df_test_mean_norm = normalize_features(df_test_mean)

### Cross Validation

Uncomment this cell to see the results of cross validation on this training set

In [18]:
# perform_cross_validation(X_mean_norm_train, y_mean)

### Predictions

Using the most accurate model from cross validation (SVM), build and fit a model and make predictions based on the normalized test set.

The predictions are then put into a dataframe to be submitted to the competition

In [19]:
clf_mean = SVC().fit(X_mean_norm_train, y_mean)

predictions_mean = clf_mean.predict(df_test_mean_norm)

submission_mean = pd.DataFrame(data={'Transported':predictions_mean}, index=df_test_mean_norm.index)

submission_mean.head()

Unnamed: 0_level_0,Transported
PassengerId,Unnamed: 1_level_1
0013_01,True
0018_01,False
0019_01,True
0021_01,True
0023_01,True


### Final File

This is the final file ready for submission

In [20]:
submission_mean.to_csv('2023-6-13_Mean-fill-Nan_Submission1.csv')

## Result

The submissions scored an accuracy of **0.79611** on the competition website

**--------------------------------------------------------------------------------**

## Fill with Median

This model fills the missing values for spending features with the median of the feature

**Unique attributes:**

- Fill NaN with feature median

**Result of CV:** The model used to make predictions was SVM because it had the highest accuracy score

**Accuracy:** The accuracy after submitting the predictions

    0.79635

### Prepare Training

Here, the training data is loaded, prepared, separated into features and target, and the features are normalized

In [21]:
df_train_med = pd.read_csv(train_filepath, index_col='PassengerId')

df_train_med = prepare_dataframe(df_train_med, med_nan_fill_train)

X_med_train = df_train_med.drop('Transported', axis=1)
y_med = df_train_med['Transported']

X_med_norm_train = normalize_features(X_med_train)

### Prepare testing

Here, the testing data is loaded, prepared, and normalized

In [22]:
df_test_med = pd.read_csv(test_filepath, index_col='PassengerId')

df_test_med = prepare_dataframe(df_test_med, med_nan_fill_test)

df_test_med_norm = normalize_features(df_test_med)

### Cross Validation

Uncomment this cell to see the results of cross validation on this training set

In [23]:
# perform_cross_validation(X_med_norm_train, y_mean)

### Predictions

Using the most accurate model from cross validation (SVM), build and fit a model and make predictions based on the normalized test set.

The predictions are then put into a dataframe to be submitted to the competition

In [24]:
clf_med = SVC().fit(X_med_train, y_med)

predictions_med = clf_med.predict(df_test_med_norm)

submission_med = pd.DataFrame(data={'Transported':predictions_med}, index=df_test_med_norm.index)

submission_med.head()

Unnamed: 0_level_0,Transported
PassengerId,Unnamed: 1_level_1
0013_01,True
0018_01,False
0019_01,True
0021_01,True
0023_01,True


### Final File

This is the final file ready for submission

In [25]:
submission_med.to_csv('2023-6-13_Median-fill-Nan_Submission1.csv')

## Result

The submissions scored an accuracy of **0.79635** on the competition website