<a href="https://www.kaggle.com/code/taylordaugherty/titanic-competition-norm-2023-6-14?scriptVersionId=137170355" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Titanic Competition

*Created by: Taylor Daugherty*

*Created on: 5/23/2023    Last updated: 7/18/2023 - Experiment with ensemble methods*

This notebook contains code to make predictions for the Titanic competition. The notebook attempts to determine the best imputation method for most accurate competition results.

**Input File:** train.csv and test.csv from the titanic competition

- train.csv contains a training set of data with features and true values

- test.csv contains a testing set of data with only features. This is what will be used to make predictions and submit to the competition


**Purpose of notebook:** Gain more practice with cross validation and make submissions to the titanic competition

<img src="https://wallpapercave.com/wp/0swzmR9.jpg" alt="Titanic" width="600"/>

## Table of Contents

1. **Universal Application**

    a. Imports
    
    b. Lists
    
    b. Functions

2. **Models**

    a. Without Age
    
    b. With Age

## Results

The highest scoring model of the notebook scored **0.7791** in the competition. This was acheived by filling the missing 'Age' values with the mean of the feature. 

This can be found using the following sections:

    ML Models > Models With Age > Fill Age with Mean

**----------------------------------------------------------------------------------------------------------------------------------------------------------------**

# Univeral Application

In this section is all of the code that is universally applicable throughout the notebook. This includes imports, lists, and functions that make the code much more readable

## Contents:

1. Imports

2. Imputing values

3. Lists

3. Functions

**--------------------------------------------------------------------------------**

### Packages

Import the packages necessary to run the notebook

1. numpy: used for linear algebra

2. pandas: used for dataframe creation and manipulation

5. `LogisticRegression`: a classification model using logistic regression

6. `KNeighborsClassifier`: a classification model using KNN

7. `SVC`: a classification model using SVM

8. `KFold` and `cross_val_score`: used for cross validation

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import VotingClassifier

### File paths

These are the filepaths to the original data

In [2]:
train_filepath = '/kaggle/input/titanic/train.csv'
test_filepath = '/kaggle/input/titanic/test.csv'

### Dataframes

This is an initial import of the data. There has been no special adaptations to this dataframe that may exist in later imports

In [3]:
df_train = pd.read_csv(train_filepath)
df_test = pd.read_csv(test_filepath)

**--------------------------------------------------------------------------------**

## Imputing Values

This dataset contains a number of missing values. This notebook focuses on finding the best imputations for the 'Fare' and 'Age' features.

### Fare

There is a single value missing in the 'Fare' column of the testing dataset. This can be imputed by using the mean or median of the feature. Both options are shown in the cell below

In [4]:
fill_fare_mean = df_test['Fare'].mean()
fill_fare_med = df_test['Fare'].median()

### Age

The 'Age' column is missing many data points. There are three ways to impute these missing values: mean, median, mode. These are all shown below.

The mode had an issue during implementation, so the integer literal of the mode for each dataset was used instead.

In [5]:
# Training
age_mean_train = df_train['Age'].mean()
age_med_train = df_train['Age'].median()
age_mode_train = 24

# Testing
age_mean_test = df_test['Age'].mean()
age_med_test = df_test['Age'].median()
age_mode_test = 24

**--------------------------------------------------------------------------------**

## Lists

There are many cases in this notebook that call for a list. This is so prevalent that a section is needed for it.

### Contents:

1. Dropping

2. Mapping

3. Models

### Dropping

Some features in this dataset will not be useful for making predictions. The names of the three features that will be excluded from the models are listed below.

In [6]:
drop_feature_names = ['Name', 'Cabin', 'Ticket']

### Mapping

There are many columns in the dataset that contain purely categorical data. Categorical data cannot be used in machine learning models, so the data must be converted to numbers

These two lists are designed to identify features and common values to map the categorical data.

- The first variable is a list of feature names that have categorical values and will need to be mapped

- The second variable is a list of dictionaries to use in mapping. The mapping values are based on frequency in the dataset with higher numbers corresponding to higher frequency

In [7]:
map_features = ['Sex', 'Embarked']
map_dicts = [{'male':0, 'female':1}, {'Q':0, 'C':1, 'S':2}]

### Models

These are the models that will be used during cross-validation. 

**Models:**

- Logistic Regression

- KNN

- SVM

In [8]:
models = []

models.append(('SVM', SVC()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('LR', LogisticRegression()))

**--------------------------------------------------------------------------------**

## Functions

This notebook is heavily dependent on the functions that were created special for it. 

These functions improve both the readability of the notebook as well as the recreatability of its results. 

### Contents:

1. Dataframe Adjustments

2. Cross Validation

3. Transform Dataframe

### Dataframe Adjustments

These functions all relate to making changes to the dataframe

#### fill_na()

Fill the NaN values of the features with the fill values

**Input:** A dataframe, a list of features, a list of fill values

- Dataframe to fill the missing values in

- A list of features to look for missing values

- A list of values to fill the missing values of the feature with

**Output:** A dataframe with the found missing values filled

In [9]:
def fill_na(df, features, vals):
    '''
    Fill the NaN values of the features with the fill values
    
    Input:
        - Dataframe to fill the missing values in
        - A list of features to look for missing values
        - A list of values to fill the missing values of the feature with
    
    Output: A dataframe with the found missing values filled
    '''
    df_filled = df
    
    for i in range(len(features)):
        feature = features[i]
        fill_val = vals[i]
        
        df_filled[feature] = df_filled[feature].fillna(fill_val)
        
    return df_filled

### drop_features()

This drops a list of features from a dataframe. The purpose of this function is to reduce the number of lines in the program by removing all unwanted features at once.

**Input:** A dataframe, a list of features

- A dataframe to remove the features from

- A list of features to remove from the dataframe

**Output:** A dataframe without the listed features

In [10]:
def drop_features(df, d_features):
    '''
    Drops a list of features from a dataframe
    
    Input:
        - A dataframe to remove the features from
        - A list of features to remove from the dataframe
        
    Output: A dataframe without the listed features
    '''
    df_dropped = df
    
    for feature in d_features:
        df_dropped = df_dropped.drop(str(feature), axis=1)
        
    return df_dropped

#### map_cat()

Map categorical data to be numeric

**Input:** A dataframe, a list of features, a list of dictionaries

- Dataframe for the mapping to take place in

- List of features that contain categorical data

- List of dictionaries to govern how to map the feature

**Output:** A dataframe with the given features mapped according to the dictionaries

In [11]:
def map_cat(df, features, dictionaries):
    '''
    Map categorical data to be numeric
    
    Input:
        - Dataframe for the mapping to take place in
        - List of features that contain categorical data
        - List of dictionaries to govern how to map the feature
    
    Output: A dataframe with the given features mapped according to the dictionaries
        
    '''
    df_mapped = df
    
    for i in range(len(features)):
        feature = features[i]
        dictionary = dictionaries[i]
        
        df_mapped[feature] = df[feature].map(dictionary)
        
    return df_mapped

#### normalize()

Normalizes a Series

**Input:** A feature of type Series

**Output:** The normalized feature of type Series

In [12]:
def normalize(feature):
    '''
    This function normalizes a Series
    
    Input: A feature of type Series
    
    Output: The normalized feature of type Series
    '''
    return (feature - feature.mean())/feature.std()

#### normalize_features()

Normalizes all features in a given dataframe. This will normalize ALL features, so ensure that the inputted dataframe consists only of numeric values.

**Input:** A dataframe to normalize

**Output:** A normalized dataframe

In [13]:
def normalize_features(df):
    '''
    This function normalizes all features in a dataframe
    
    Input: A pandas dataframe
    
    Output: The normalized dataframe
    '''
    for column in df.columns:
        df[column] = normalize(df[column])
    return df

### Cross Validation

This function relates to performing cross validation on the dataset using the models listed earlier in the notebook

#### perform_cross_validation()

This function goes through the steps to perform Stratified K-fold cross validation using the list of models described above.

**Input:** A dataframe containing the features use to build the model, a Series of the true values associated with the feature list

**Output:** Printed result for the mean and standard deviation of each model

In [14]:
def perform_cross_validation(X_train, y_train):
    '''
    This function goes through the steps to perform Stratified K-fold cross validation using the list of models described above.
    
    Input: 
        - A dataframe containing the features use to build the model
        - A Series of the true values associated with the feature list
    
    Output: Printed result for the mean and standard deviation of each model
    '''
    results = dict()

    for name, model in models:
        kfold = StratifiedKFold(n_splits=10)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
        results[name] = (cv_results.mean(), cv_results.std())

    print('Model\t\tCV Mean\t\tCV std')
    print(results)

### Transform Dataframe

Apply dataframe adjustment functions in a single place to transform the dataframe in one step.

#### prepare_dataframe()

Prepare the dataframe for splitting into feature and target. This combines most steps from previous file versions into a single function. This is intended to improve readability and reduce the length of the code.

**Input:** A dataframe, list of features to drop, fill value for fare, fill value for age

- A dataframe to perform functions on. Works with training and testing dataframes

- A list of features to drop from the dataframe

- A value to fill the missing fare point with

- A value to fill the missing age points with
    
**Output:** A dataframe ready to be normalized

In [15]:
def prepare_dataframe(df, d_features, fill_fare, fill_age=-1):
    '''
    Prepare the dataframe for splitting into feature and target.
    
    Input:
        - dataframe to perform operations on
        - List of features to drop from dataframe
        - Value to fill missing fare point
        - Value to fill missing age points
        
    Output: A dataframe ready to be normalized
    '''
    # Fill NaN
    fill_features = ['Fare', 'Embarked', 'Age']
    fill_vals = [fill_fare, 'S', fill_age]
    df_filled = fill_na(df, fill_features, fill_vals)
    
    # Drop features
    df_dropped = drop_features(df_filled, d_features)
    
    # Map categorical features
    df_mapped = map_cat(df_dropped, map_features, map_dicts)
    
    # Return the finished dataframe
    return df_mapped

**----------------------------------------------------------------------------------------------------------------------------------------------------------------**

# ML Models

This notebook investigates the accuracy of 5 models on the Titanic Competition dataset.

Given the larget number of missing values in the dataset, the models will differ in the method they use to impute those missing points

## Contents:

1. Models Without Age

2. Models With Age

## Results:

The highest scoring model of the notebook scored **0.7791** in the competition. This was acheived by filling the missing 'Age' values with the mean of the feature. 

This can be found using the following sections:

    Models With Age > Fill Age with Mean

**----------------------------------------------------------------------------------------------------------------------------------------------------------------**

# Models without Age

This notebook investigates the accuracy of two models on the Titanic Competition dataset.

Given the large number of missing values in the dataset, the models will differ in the method they use to impute the missing point in the Fare feature

## Contents:

1. With mean

2. With median

## Results:

### Score: 0.76794

There was no difference between the different imputation methods, implying that the method to imput the missing value in 'Fare' is insignificant. For this reason, the mean fare will be used in the reaminder of the notebook

**--------------------------------------------------------------------------------**

## Fill Fare with Mean

This model fills the missing value in Fare with the mean for the feature. This model does not include age in its predictions

**Unique attributes:**

- Fill NaN with feature mean

- Drop 'Age' from the dataframe

**Result of CV:** The model used to make predictions was KNN because it had the highest accuracy score

**Accuracy:** The accuracy after submitting the predictions

    0.76794

### Load dataframes

Use the filepath and `read_csv()` to load the files for training and testing into their respective dataframes

In [16]:
df_train_noAge = pd.read_csv(train_filepath, index_col='PassengerId')
df_test_noAge = pd.read_csv(test_filepath, index_col='PassengerId')

### Drop age

Since this model doesn't include the 'Age' feature, it is added to the list of features to drop from the dataframe

In [17]:
drop_feature_names_w_age = drop_feature_names + ['Age']

### Prepare Training

Prepare the dataframes to be split into target and features and later normalized

In [18]:
df_train_noAge = prepare_dataframe(df_train_noAge, drop_feature_names_w_age, fill_fare_mean)
df_test_noAge = prepare_dataframe(df_test_noAge, drop_feature_names_w_age, fill_fare_mean)

### Split Training Data

Separate the training data into the features and target for cross-validation and model building

In [19]:
X_noAge = df_train_noAge.drop('Survived', axis=1)
y_noAge = df_train_noAge['Survived']

### Normalize data

Normalize the features of the training data and the testing data

In [20]:
X_noAge = normalize_features(X_noAge)
df_test_noAge = normalize_features(df_test_noAge)

### Perform cross-validation

Perform cross validation to determine the best model to use for the data

In [21]:
perform_cross_validation(X_noAge, y_noAge)

Model		CV Mean		CV std
{'SVM': (0.8058052434456927, 0.02817776516195646), 'KNN': (0.8092384519350813, 0.04749914265913915), 'LR': (0.7934706616729088, 0.024861946815253916)}


Since KNN had the highest accuracy, this is what will be used to build the model.

### Make predictions

The model is built and fit using the KNN classifier.

The model is used to make predictions of the test set.

The predictions are formatted into a dataframe to prepare for exporting

In [22]:
clf_noAge = KNeighborsClassifier().fit(X_noAge,y_noAge)

predictions_noAge = clf_noAge.predict(df_test_noAge)

submission1_noAge = pd.DataFrame(data={'Survived':predictions_noAge}, index=df_test_noAge.index)

submission1_noAge.head()

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,0
896,1


### Export

Export the dataframe from above as a `.csv` file

In [23]:
submission1_noAge.to_csv('Titanic_Submission-No_Age-Mean_Fare-2023_6_14.csv')

### Result: 

The submissions scored an accuracy of **0.76794** on the competition website

**--------------------------------------------------------------------------------**

## Fill Fare with Median

This model fills the missing value in Fare with the median for the feature. This model does not include age in its predictions

**Unique attributes:**

- Fill NaN with feature median

- Drop 'Age' from the dataframe

**Result of CV:** The model used to make predictions was KNN because it had the highest accuracy score

**Accuracy:** The accuracy after submitting the predictions

    0.76794

### Load dataframes

Use the filepath and `read_csv()` to load the files for training and testing into their respective dataframes

In [24]:
df_train_noAge = pd.read_csv(train_filepath, index_col='PassengerId')
df_test_noAge = pd.read_csv(test_filepath, index_col='PassengerId')

### Drop age

Since this model doesn't include the 'Age' feature, it is added to the list of features to drop from the dataframe

In [25]:
drop_feature_names_w_age = drop_feature_names + ['Age']

### Prepare Training

Prepare the dataframes to be split into target and features and later normalized

In [26]:
df_train_noAge = prepare_dataframe(df_train_noAge, drop_feature_names_w_age, fill_fare_med)
df_test_noAge = prepare_dataframe(df_test_noAge, drop_feature_names_w_age, fill_fare_med)

### Split Training Data

Separate the training data into the features and target for cross-validation and model building

In [27]:
X_noAge = df_train_noAge.drop('Survived', axis=1)
y_noAge = df_train_noAge['Survived']

### Normalize data

Normalize the features of the training data and the testing data

In [28]:
X_noAge = normalize_features(X_noAge)
df_test_noAge = normalize_features(df_test_noAge)

### Perform cross-validation

Perform cross validation to determine the best model to use for the data

In [29]:
perform_cross_validation(X_noAge, y_noAge)

Model		CV Mean		CV std
{'SVM': (0.8058052434456927, 0.02817776516195646), 'KNN': (0.8092384519350813, 0.04749914265913915), 'LR': (0.7934706616729088, 0.024861946815253916)}


Since KNN had the highest accuracy, this is what will be used to build the model.

### Make predictions

The model is built and fit using the KNN classifier.

The model is used to make predictions of the test set.

The predictions are formatted into a dataframe to prepare for exporting

In [30]:
clf_noAge = KNeighborsClassifier().fit(X_noAge,y_noAge)

predictions_noAge = clf_noAge.predict(df_test_noAge)

submission2_noAge = pd.DataFrame(data={'Survived':predictions_noAge}, index=df_test_noAge.index)

submission2_noAge.head()

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,0
896,1


### Export

Export the dataframe from above as a `.csv` file

In [31]:
submission2_noAge.to_csv('Titanic_Submission-No_Age-Med_Fare-2023_6_14.csv')

### Result: 

The submissions scored an accuracy of **0.76794** on the competition website

**----------------------------------------------------------------------------------------------------------------------------------------------------------------**

# Models with Age

This section investigates the accuracy of three models on the Titanic Competition dataset.

Given the large number of missing values in the dataset, the models will differ in the method they use to impute the missing points in the Age feature

## Contents:

1. With mean

2. With median

3. With mode

## Results:

### Score: 0.7791

The highest score from these models was acheived by filling the missing Age values with the mean. This scored **0.7791**

There was no difference between the other two imputation methods. Both models scored **0.77751**

**--------------------------------------------------------------------------------**

## Fill Age with Mean

This model fills the missing value in Fare with the mean for the feature. This model fills the missing values in Age with the feature's mean.

**Unique attributes:**

- Fill NaN in Fare with feature mean

- Fill NaN in Age with feature mean

**Result of CV:** The model used to make predictions was SVM because it had the highest accuracy score

**Accuracy:** The accuracy after submitting the predictions

    0.7791

### Load dataframes

Use the filepath and `read_csv()` to load the files for training and testing into their respective dataframes

In [32]:
df_train_meanAge = pd.read_csv(train_filepath, index_col='PassengerId')
df_test_meanAge = pd.read_csv(test_filepath, index_col='PassengerId')

### Prepare Training

Prepare the dataframes to be split into target and features and later normalized

In [33]:
df_train_meanAge = prepare_dataframe(df_train_meanAge, drop_feature_names, fill_fare_mean, age_mean_train)
df_test_meanAge = prepare_dataframe(df_test_meanAge, drop_feature_names, fill_fare_mean, age_mean_test)

### Split Training Data

Separate the training data into the features and target for cross-validation and model building

In [34]:
X_meanAge = df_train_meanAge.drop('Survived', axis=1)
y_meanAge = df_train_meanAge['Survived']

### Normalize data

Normalize the features of the training data and the testing data

In [35]:
X_meanAge = normalize_features(X_meanAge)
df_test_meanAge = normalize_features(df_test_meanAge)

### Perform cross-validation

Perform cross validation to determine the best model to use for the data

In [36]:
perform_cross_validation(X_meanAge, y_meanAge)

Model		CV Mean		CV std
{'SVM': (0.8249313358302122, 0.03690856840480252), 'KNN': (0.809250936329588, 0.044119166613680365), 'LR': (0.7946192259675404, 0.02242820343899094)}


Since SVM had the highest accuracy, this is what will be used to build the model.

### Make predictions

The model is built and fit using the SVM classifier.

The model is used to make predictions of the test set.

The predictions are formatted into a dataframe to prepare for exporting

In [37]:
clf_meanAge = SVC().fit(X_meanAge, y_meanAge)

predictions_meanAge = clf_meanAge.predict(df_test_meanAge)

submission1_meanAge = pd.DataFrame(data={'Survived':predictions_meanAge}, index=df_test_meanAge.index)

submission1_meanAge.head()

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,0
896,0


### Export

Export the dataframe from above as a `.csv` file

In [38]:
submission1_meanAge.to_csv('Titanic_Submission-Mean_Age-2023_6_14.csv')

### Result: 

The submissions scored an accuracy of **0.7791** on the competition website

**--------------------------------------------------------------------------------**

## Fill Age with Median

This model fills the missing value in Fare with the mean for the feature. This model fills the missing values in Age with the feature's median.

**Unique attributes:**

- Fill NaN in Fare with feature mean

- Fill NaN in Age with feature median

**Result of CV:** The model used to make predictions was SVM because it had the highest accuracy score

**Accuracy:** The accuracy after submitting the predictions

    0.77751

### Load dataframes

Use the filepath and `read_csv()` to load the files for training and testing into their respective dataframes

In [39]:
df_train_medAge = pd.read_csv(train_filepath, index_col='PassengerId')
df_test_medAge = pd.read_csv(test_filepath, index_col='PassengerId')

### Prepare Training

Prepare the dataframes to be split into target and features and later normalized

In [40]:
df_train_medAge = prepare_dataframe(df_train_medAge, drop_feature_names, fill_fare_mean, age_med_train)
df_test_medAge = prepare_dataframe(df_test_medAge, drop_feature_names, fill_fare_mean, age_med_test)

### Split Training Data

Separate the training data into the features and target for cross-validation and model building

In [41]:
X_medAge = df_train_medAge.drop('Survived', axis=1)
y_medAge = df_train_medAge['Survived']

### Normalize data

Normalize the features of the training data and the testing data

In [42]:
X_medAge = normalize_features(X_medAge)
df_test_meanAge = normalize_features(df_test_medAge)

### Perform cross-validation

Perform cross validation to determine the best model to use for the data

In [43]:
perform_cross_validation(X_medAge, y_medAge)

Model		CV Mean		CV std
{'SVM': (0.8249313358302122, 0.03690856840480252), 'KNN': (0.8114731585518102, 0.04093208033293136), 'LR': (0.7946192259675405, 0.02702347765082892)}


Since SVM had the highest accuracy, this is what will be used to build the model.

### Make predictions

The model is built and fit using the SVM classifier.

The model is used to make predictions of the test set.

The predictions are formatted into a dataframe to prepare for exporting

In [44]:
clf_medAge = SVC().fit(X_medAge, y_medAge)

predictions_medAge = clf_medAge.predict(df_test_medAge)

submission1_medAge = pd.DataFrame(data={'Survived':predictions_medAge}, index=df_test_medAge.index)

submission1_medAge.head()

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,0
896,0


### Export

Export the dataframe from above as a `.csv` file

In [45]:
submission1_medAge.to_csv('Titanic_Submission-Median_Age-2023_6_14.csv')

### Result: 

The submissions scored an accuracy of **0.77751** on the competition website

**--------------------------------------------------------------------------------**

## Fill Age with Mode

This model fills the missing value in Fare with the mean for the feature. This model fills the missing values in Age with the feature's mode.

**Unique attributes:**

- Fill NaN in Fare with feature mean

- Fill NaN in Age with feature mode

**Result of CV:** The model used to make predictions was SVM because it had the highest accuracy score

**Accuracy:** The accuracy after submitting the predictions

    0.77751

### Load dataframes

Use the filepath and `read_csv()` to load the files for training and testing into their respective dataframes

In [46]:
df_train_modeAge = pd.read_csv(train_filepath, index_col='PassengerId')
df_test_modeAge = pd.read_csv(test_filepath, index_col='PassengerId')

### Prepare Training

Prepare the dataframes to be split into target and features and later normalized

In [47]:
df_train_modeAge = prepare_dataframe(df_train_modeAge, drop_feature_names, fill_fare_mean, age_mode_train)
df_test_modeAge = prepare_dataframe(df_test_modeAge, drop_feature_names, fill_fare_mean, age_mode_test)

### Split Training Data

Separate the training data into the features and target for cross-validation and model building

In [48]:
X_modeAge = df_train_modeAge.drop('Survived', axis=1)
y_modeAge = df_train_modeAge['Survived']

### Normalize data

Normalize the features of the training data and the testing data

In [49]:
X_modeAge = normalize_features(X_modeAge)
df_test_modeAge = normalize_features(df_test_modeAge)

### Perform cross-validation

Perform cross validation to determine the best model to use for the data

In [50]:
perform_cross_validation(X_modeAge, y_modeAge)

Model		CV Mean		CV std
{'SVM': (0.8260549313358302, 0.03793120045875092), 'KNN': (0.8070037453183521, 0.03824776974876488), 'LR': (0.7923720349563046, 0.028051434354094185)}


Since SVM had the highest accuracy, this is what will be used to build the model.

### Make predictions

The model is built and fit using the SVM classifier.

The model is used to make predictions of the test set.

The predictions are formatted into a dataframe to prepare for exporting

In [51]:
clf_modeAge = SVC().fit(X_modeAge, y_modeAge)

predictions_modeAge = clf_modeAge.predict(df_test_modeAge)

submission1_modeAge = pd.DataFrame(data={'Survived':predictions_medAge}, index=df_test_modeAge.index)

submission1_modeAge.head()

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,0
896,0


### Export

Export the dataframe from above as a `.csv` file

In [52]:
submission1_modeAge.to_csv('Titanic_Submission-Mode_Age-2023_6_14.csv')

### Result: 

The submissions scored an accuracy of **0.77751** on the competition website

# Ensemble Models

It is possible that the most accurate method is a combination of multiple machine learning models. 
This section explores how to use ensemble models to make predictions and evaluate its accuracy.

### Load dataframes

The best model previously came from using the mean age to fill null values, so this is what will be used for ensemble models.

In [53]:
df_train_ensemble = df_train_meanAge.copy()
df_test_ensemble = df_test_meanAge.copy()

### Split Training Data

Separate the training data into the features and target for cross-validation and model building

In [54]:
X_ensemble = df_train_ensemble.drop('Survived', axis=1)
y_ensemble = df_train_ensemble['Survived']

### Separate data training data

To get an estimated accuracy, use train/test split for training and validation data.

In [55]:
X_ensemble_train, X_ensemble_val, y_ensemble_train, y_ensemble_val = train_test_split(X_ensemble, y_ensemble)

### Evaluate Model

Create a model and use it to evaluate its performance.

In [56]:
ensemble_model = VotingClassifier([('knn', KNeighborsClassifier()), 
                                   ('lr', LogisticRegression()), 
                                   ('svm', SVC()), 
                                   ('lda', LinearDiscriminantAnalysis())])

ensemble_model.fit(X_ensemble_train, y_ensemble_train)
ensemble_model.score(X_ensemble_val, y_ensemble_val)

0.7219730941704036

### Make predictions

The model is built and fit using the SVM classifier.

The model is used to make predictions of the test set.

The predictions are formatted into a dataframe to prepare for exporting

In [57]:
ensemble_model.fit(X_ensemble, y_ensemble)

predictions_ensemble = clf_modeAge.predict(df_test_ensemble)

submission1_ensemble = pd.DataFrame(data={'Survived':predictions_ensemble}, index=df_test_ensemble.index)

submission1_ensemble.head()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,0
896,0


### Export

Export the dataframe from above as a `.csv` file

In [58]:
submission1_ensemble.to_csv('Titanic_Submission-Ensemble1-2023_7_18.csv')