# CHAD Challenge

For this challenge you will try to predict who survives on the Titanic. We give you a template notebook that modifies the features, performs cross validation, and generates the result csv for Kaggle.

**IMPORTANT NOTE** Kaggle limits you to 5 submissions per day, so please use them wisely.

Running the code below should produce a `report.csv` file.

In [1]:
import os
from copy import deepcopy
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

In [2]:
train_df = pd.read_csv(os.path.join("titanic", "train.csv"))
test_df = pd.read_csv(os.path.join("titanic", "test.csv"))

## Feature Engineering

A very important part of machine learning is feature engineering. This means analyzing and modifying the features (or columns) of the dataset so that they work with the model and help it learn the patterns it needs to generate a prediction.

Below is some fairly basic feature engineering that does the following:
* Removes columns that the model does not need to learn from
* Encodes categorical features
* Fills in values that are NaN
* Separates the features from the labels
* Normalizes the numerical features

### Keeping Only Relevant Columns

"Put the women and children in and lower away" -- Captain Edward Smith (Titanic, 1997)

Knowing some background on the Titanic (or just watching the movie), it can be inferred that certain features like sex, age, pclass, and fare are good indicators of whether a person survived on the Titanic. Features like the person's name do not give us, or the model, any insight into whether the person survived. The following features will be used by the model to predict if the person survived:
* Pclass
* Sex
* Age
* Fare

Do you think any other features should be kept in the dataset?

In [3]:
def remove_features(df, features=["PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin", "Embarked"]):
    return df.drop(features, axis=1)

## Fixing Categorical Values

A categorical feature is a feature that has classes instead of numerical values. For example, Sex and Embarked are categorical features because they are not floats and have distinct values that represent each possible class (for example male or female). Pclass could be considered a categorical feature too because it is not continuous. Name is also categorical, but it has one class per observation, so it is not helpful.

Two ways to encode a categorical feature are to:
* [Ordinally Encode](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) the feature, by assigning an integer to each class in the feature. scikit-learn's `OrdinalEncoder` numbers the classes in sorted order (alphabetical for strings), not by how frequent each class is.

For example, Ordinally Encoding `X = [["apple"], ["orange"], ["orange"], ["pear"]]` results in `[[0], [1], [1], [2]]`

* [One Hot Encode](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) the feature, which spreads the feature out into one new feature per class. Each new feature is 1 if the observation belongs to that class and 0 otherwise.

For example, One Hot Encoding `X = [["apple"], ["orange"], ["orange"], ["pear"]]` results in `[[1,0,0], [0,1,0], [0,1,0], [0,0,1]]`

Below I provide functions to encode the dataframe both ways, but I will be using Ordinal encoding for my feature engineering. There are also other categorical encoders like Target Encoders and Label Encoders, but I would encourage experimenting with these two encoders for now.

In [4]:
def ordinal_encode(df, feature, encoders):
    df = deepcopy(df)
    if encoders.get(feature) is None:
        enc = OrdinalEncoder()
        enc.fit(df[[feature]].values)
        encoders[feature] = enc
    else:
        enc = encoders.get(feature)
    df[feature] = enc.transform(df[[feature]].values)
    return df

def one_hot_encode(df, feature, encoders):
    df = deepcopy(df)
    if encoders.get(feature) is None:
        enc = OneHotEncoder()
        encoders[feature] = enc
        enc.fit(df[[feature]].values)
    else:
        enc = encoders.get(feature)

    # transform the same kind of input (a plain array) that the encoder was fit on
    one_hot_enc = enc.transform(df[[feature]].values).toarray()
    df = df.drop([feature], axis=1)
    for i, new_feat in enumerate(enc.get_feature_names_out([feature])):
        df[new_feat] = one_hot_enc[:,i]
    return df

## Handling NaNs

A NaN (Not a Number) is how pandas represents a missing value, much like a null. NaNs can really muck up the machine learning model, so it is best to deal with them before training. As seen below, only three features have NaNs:
* Age
* Cabin, but this is removed as a feature
* Embarked, but this is also removed as a feature

Removing observations with NaN is not ideal because we throw away valuable data just because one value is missing. Instead we can "fill in" the NaN with a value in one of the following ways:
* If the feature is a categorical feature, fill in the NaN with the most common class (the mode) for that feature
* If the feature is a numerical feature, fill in the NaN with the average value of the feature

For Age we will fill in the NaNs with the average age. However, you can group by other features in the dataset to get the average of that group. This is a smarter NaN fill, but it will be left as an exercise for the reader.
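
As a starting point for that exercise, a grouped fill on a toy table might look like the sketch below. The table `toy_df` and its values are invented for illustration, and grouping by `Pclass` is just one possible choice; it is not the notebook's `fill_na` function.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: two passenger classes, one missing Age in each
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, np.nan, 30.0],
})

# average Age of each Pclass group, aligned row by row with toy_df
# (NaNs are skipped when the mean is computed)
group_mean_age = toy_df.groupby("Pclass")["Age"].transform("mean")

# only the NaN rows take their group's average; known ages are untouched
toy_df["Age"] = toy_df["Age"].fillna(group_mean_age)
print(toy_df)
```

Here the missing first-class age becomes 45 and the missing third-class age becomes 25, instead of both receiving the overall average of 35.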

In [5]:
train_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
def fill_na(df, feature):
    df = deepcopy(df)
    if df[feature].dtype == object:
        # get the mode for categorical values
        mode = df[feature].mode()[0]
        df[feature] = df[feature].fillna(mode)
    elif df[feature].dtype == float:
        # get the average for numerical values
        avr = df[feature].mean().item()
        df[feature] = df[feature].fillna(avr)
    return df

## Get Labels

The training dataset has features, which we use as inputs when classifying whether a person survived, and a label, which the model learns to associate with those features. It is important that we split the two up before training the model.

In [7]:
def get_label(df, label):
    y = df[label].values
    X = df.drop([label], axis=1).values
    return X, y

## Scale the Data

Finally we have all the data as numbers and there are no NaNs, so we want to scale the data. Keeping the data within a manageable range helps the model because it puts every feature on the same scale, so a feature with large values (like Fare) does not dominate one with small values (like Pclass). Make sure to avoid scaling the label. For this we are using a MinMaxScaler, which has the following formula:

$$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
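
To make the formula concrete, here is a small sketch on made-up numbers. It compares `MinMaxScaler` against the formula computed by hand and also scales new values with the already fitted scaler, which is what happens when a fitted scaler is reused on another dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_fit = np.array([[1.0], [3.0], [5.0]])  # min = 1, max = 5
X_new = np.array([[2.0], [7.0]])         # values the scaler was not fit on

scaler = MinMaxScaler()
scaler.fit(X_fit)

scaled_fit = scaler.transform(X_fit)
scaled_new = scaler.transform(X_new)

# the same formula computed by hand: (x - x_min) / (x_max - x_min)
manual = (X_fit - X_fit.min()) / (X_fit.max() - X_fit.min())

print(scaled_fit.ravel())  # the fitted data lands in [0, 1]
print(scaled_new.ravel())  # new data can fall outside [0, 1]
print(np.allclose(scaled_fit, manual))
```

Note that 7 is larger than the maximum seen during fitting, so it scales to 1.5. Values outside the fitted range are not clipped by default.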



In [8]:
def scale_data(X, encoders):
    if encoders.get("scaler") is None:
        enc = MinMaxScaler()
        encoders["scaler"] = enc
        enc.fit(X)
    else:
        enc = encoders.get("scaler")

    return enc.transform(X)

In [9]:
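def engineer_features(df, label=None, encoders=None):
    # start with a fresh dict per call; a mutable default ({}) would be
    # shared between calls and silently reuse stale encoders
    if encoders is None:
        encoders = {}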
    # remove the features
    rem_feature_list = ["PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin", "Embarked"]
    df = remove_features(df, rem_feature_list)

    # encode the categorical data 'Sex'
    # TODO: replace with a better encoder or leave as is
    df = ordinal_encode(df, 'Sex', encoders)

    # fill NaN values
    df = fill_na(df, 'Age')
    df = fill_na(df, 'Fare')

    if label is not None:
        # get label
        X, y = get_label(df, label)
    else:
        X = df.values
        y = None

    # scale the data
    X = scale_data(X, encoders)
    
    return X, y, encoders

## Training the Model

With the features in a more suitable state, let's train our model. In this learning session you learned about three different model types for classification:
* [Support Vector Classifiers (SVC)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [Random Forest Classifiers](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [K Nearest Neighbor Classifiers (KNN)](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

Below I use SVC, but I encourage you to try the other models above. Also try out different model parameters to get the best performance.
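
One way to experiment is to build the classifier from a name so that models can be swapped without touching the rest of the code. The sketch below is only an illustration: `make_classifier` is a hypothetical helper, the parameter values are arbitrary starting points rather than tuned choices, and the data is synthetic instead of the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def make_classifier(name):
    """Hypothetical helper: return an untrained classifier by name."""
    if name == "svc":
        return SVC(C=1.0, kernel="rbf")
    if name == "random_forest":
        return RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
    if name == "knn":
        return KNeighborsClassifier(n_neighbors=5)
    raise ValueError(f"unknown model name: {name}")

# synthetic stand-in data with 4 features, like the 4 kept Titanic columns
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)

fitted = {}
for name in ["svc", "random_forest", "knn"]:
    clf = make_classifier(name)
    clf.fit(X_toy, y_toy)
    fitted[name] = clf
    print(name, clf.predict(X_toy[:5]))
```

All three classifiers share the same `fit` and `predict` interface, so swapping one for another inside `train_model` should only require changing the line that constructs the model.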

In [10]:
def train_model(X, y):
    clf = SVC()
    clf.fit(X, y)
    return clf

## Evaluating the Model

In Machine Learning, you would initially be given a dataset, and you would split it up into training and test datasets. Kaggle does that for us by giving us the labeled training dataset `train.csv` and the unlabelled test dataset `test.csv`. However, we want to see how the model would perform on data it has never seen before. This involves splitting the training dataset into a train dataset and a validation dataset. We use the train dataset to train the model and the validation dataset to check if the model is ready for the test dataset. The test dataset's accuracy is what your CHAD award will be judged on.

There are two ways to split up the dataset. One is to do a train/test split, which splits the data by a ratio that is usually 80% train dataset and 20% test/validation dataset. However, a better method is K fold cross validation, which splits the dataset similarly to the train/test split but does it K times, using a different fold as the validation dataset each time. This way every observation is used for training in some rounds and for validation in exactly one round. Both methods are provided.
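
The difference between the two approaches is easiest to see on the row indices themselves. The sketch below uses ten made-up observations; `random_state=0` is added only so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(10).reshape(-1, 1)  # ten toy observations
y = np.array([0, 1] * 5)

# train/test split: a single 80% / 20% cut
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_val))

# K fold: five different cuts, each fold is the validation set once
kf = KFold(n_splits=5)
val_seen = []
for fold, (train_index, val_index) in enumerate(kf.split(X)):
    print(f"fold {fold}: train={train_index.tolist()} val={val_index.tolist()}")
    val_seen.extend(val_index.tolist())

# every observation was used for validation exactly once
print(sorted(val_seen) == list(range(10)))
```

With the single split only 2 of the 10 rows are ever used for validation, while with 5 folds all 10 rows are, which is why the averaged K fold accuracy tends to be a steadier estimate.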

In [11]:
def run_train_test_split(X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    model = train_model(X_train, y_train)

    y_pred = model.predict(X_val)
    acc = accuracy_score(y_val, y_pred)
    print(f"Model Accuracy: {acc:.4f}")

def run_cross_validation(X, y):
    accuracies = []
    kf = KFold(n_splits=5)
    for i, (train_index, val_index) in enumerate(kf.split(X)):
        X_train, y_train = X[train_index], y[train_index]
        X_val, y_val = X[val_index], y[val_index]

        model = train_model(X_train, y_train)
        y_pred = model.predict(X_val)
        acc = accuracy_score(y_val, y_pred)
        print(f"K Fold {i} Accuracy: \t{acc:.4f}")
        accuracies.append(acc)
    print("-"*30)
    print(f"Average Accuracy: \t{sum(accuracies)/len(accuracies):.4f}")

## Putting it All Together

Let's train the model with the modified features and evaluate its performance.

In [12]:
def run():
    X_train, y_train, encoders = engineer_features(train_df, 'Survived')
    run_cross_validation(X_train, y_train)
    model = train_model(X_train, y_train)
    return model, encoders
clf_model, clf_encoders = run()

K Fold 0 Accuracy: 	0.7877
K Fold 1 Accuracy: 	0.7640
K Fold 2 Accuracy: 	0.7753
K Fold 3 Accuracy: 	0.7360
K Fold 4 Accuracy: 	0.8034
------------------------------
Average Accuracy: 	0.7733


In [13]:
def generate_report(model, encoders):
    # copy so that adding a column does not modify (or warn about) a view of test_df
    report_df = test_df[['PassengerId']].copy()
    
    X_test, _, _ = engineer_features(test_df, encoders=encoders)
    y_preds = model.predict(X_test)
    report_df['Survived'] = y_preds

    report_df.to_csv("report.csv", index=False)
    
    return report_df

generate_report(clf_model, clf_encoders)

     PassengerId  Survived
0            892         0
1            893         0
2            894         0
3            895         0
4            896         1
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
