# The Big Titanic Challenge - Who survived, who died?

It's time to get our hands dirty and do some Machine Learning!
Who will find the best model to predict survival and death on the Titanic? Let's find out...

## Shoutouts:

This homework is based on the Titanic challenge on Kaggle: https://www.kaggle.com/c/titanic 

## General Description (taken from Kaggle)

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

## The Task

This notebook has several parts that correspond to important steps of the Machine Learning pipeline: selecting a good model, eliminating useless features, and tuning hyperparameters. There's a lot of assistance, and much of the work is already done for you. Still, you will have plenty of opportunities to try things out and improve your models.

The overall goal is to find the best possible model to predict death and survival for a **test set of unseen data**. We will use accuracy (i.e., the percentage of our predictions that were correct) as our performance measure.

**Once you're done, post your high score on the [leaderboard on Miro](https://miro.com/app/board/o9J_kplKUVg=/)!**

Have fun, play around, and squeeze those last percentage points ;-)

## Prerequisites

To run this notebook you should:
- run on Python 3.6
- have numpy, pandas and scikit-learn (imported as `sklearn`) installed (e.g., by using pip or anaconda)

## Imports

In [12]:
# Data Processing 
import numpy as np 
import pandas as pd 

# ML Algorithms
import sklearn 
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## The dataset

For this exercise we will load data that has been preprocessed already. It is based on the data from the Kaggle challenge, but has been processed to make it easier to work with.

Here are the features of this dataset:
* Survived: 0=died, 1=survived
* Sex: 0=male, 1=female
* Embarked: Port of embarkation, 0=Southampton, 1=Cherbourg, 2=Queenstown
* Title: 0=Mr, 1=Mrs, 2=Miss, 3=Master, 4=Royalty, 5=Officer
* AgeBin: 0=Baby, 1=Child, 2=Teenager, 3=Young Adult, 4=Adult, 5=Senior
* Family_Alone {Family_Small} [Family_Large]: 1=Passenger travelled alone {with 1 to 3 family members aboard} [with more than 4 family members aboard], 0=otherwise
* Pclass_ordinal: Ticket class, 1=1st class, 2=2nd class, 3=3rd class
* The rest is [dummy codings](https://en.wikiversity.org/wiki/Dummy_variable_(statistics)) of Pclass_ordinal, AgeBin, and Title. Wonder why one dummy is always missing? Read [here](https://www.algosome.com/articles/dummy-variable-trap-regression.html) about the dummy variable trap.
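To make the missing dummy concrete, here is a minimal sketch on a toy column. The toy values and the `Pclass` prefix are assumptions for illustration and may not match the column names in the actual CSV.

```python
import pandas as pd

# Hypothetical toy data: five passengers with a ticket class each
toy = pd.DataFrame({"Pclass_ordinal": [1, 2, 3, 3, 1]})

# Full dummy coding: one column per category (three columns here)
full = pd.get_dummies(toy["Pclass_ordinal"], prefix="Pclass", dtype=int)

# drop_first=True leaves out the first category, which becomes the baseline
reduced = pd.get_dummies(
    toy["Pclass_ordinal"], prefix="Pclass", drop_first=True, dtype=int
)

print(full)
print(reduced)
```

In the full coding, every row sums to 1, so any one column can be computed from the others. Dropping one column removes that redundancy without losing information.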

In [7]:
# Read the preprocessed data into a pandas DataFrame
# (it is split into train and test sets further below)
data = pd.read_csv("/Users/suadkamardeen/Documents/CODE/SE/SE14 - AI Basics/machine-learning/AI_Guild_Titanic_Data.csv")

# Drop the FareClass column, which is not used in this notebook
data = data.drop(columns=["FareClass"])

# print five random data points as example
data.sample(5)

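| Unnamed: 0.1 | Unnamed: 0 | Survived | Sex | Embarked | Title | AgeBin | Family_Alone | Family_Small | Family_Large | Pclass_ordinal | ... | Title_1 | Title_2 | Title_3 | Title_4 | Title_5 | Age_1 | Age_2 | Age_3 | Age_4 | Age_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 584 | 584 | 0 | 0 | 1 | 0 | 3 | 1 | 0 | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 246 | 246 | 0 | 1 | 0 | 2 | 3 | 1 | 0 | 0 | 3 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 695 | 695 | 0 | 0 | 0 | 0 | 4 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 582 | 582 | 0 | 0 | 0 | 0 | 4 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 270 | 270 | 0 | 0 | 0 | 0 | 4 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |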


The loaded data is part of the training data from the Kaggle challenge. 

We'll do some more preprocessing steps here:
- Reduce the dimensionality of the data (pick specific features only)
- Split into input (X) and output (y) data.
- Split into training data (X_train, y_train) and test data (X_test, y_test).

In [9]:
# Pick the features we would like to use for our model
feature_cols = ["Sex", "Embarked", "Title", "AgeBin", "Family_Alone", "Family_Small", "Family_Large", "Pclass_ordinal"]

# split into input (X) and output (y)
X, y = data[feature_cols], data["Survived"]

# Automatically split the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

## A simple model: Logistic Regression

Now let's train a simple model: Logistic Regression.
Since we are using sklearn, this is quite easy, as the most common machine learning models have already been implemented. So all we have to do is:
- Instantiate the model and configure it with [hyper-parameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
- Train the model
- Print out the result

In [10]:
# Define the model and its hyper parameters
model = LogisticRegression(C=1.0, max_iter=1000)

# train the model 
model.fit(X_train, y_train)

# print training set accuracy
train_acc_log = round(model.score(X_train, y_train) * 100, 2)
print("Prediction accuracy (Training Set):", train_acc_log, "%\n")

# Check and print prediction accuracy on the test set
test_acc_log = round(model.score(X_test, y_test) * 100, 2)
print("Prediction accuracy (Test Set):", test_acc_log, "%\n")


Prediction accuracy (Training Set): 83.29 %

Prediction accuracy (Test Set): 79.89 %
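If you also want to look at the learned model parameters, a fitted `LogisticRegression` exposes them as `coef_` and `intercept_`. The sketch below uses synthetic stand-in data and hypothetical feature names (`f0` to `f3`) so that it runs on its own. In this notebook you would read `model.coef_[0]` alongside `feature_cols` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_cols = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

demo_model = LogisticRegression(C=1.0, max_iter=1000).fit(X_demo, y_demo)

# One learned weight per feature; the sign shows the direction of the effect
for name, coef in zip(demo_cols, demo_model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
print("intercept:", round(float(demo_model.intercept_[0]), 3))
```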



## Your Playground

### EASY: Play around with different models

Here is a place for you to play around and find a machine learning model that performs well. Funnily enough, as long as you stick with sklearn, this mostly means altering a single line of the logistic regression cell to specify a different combination of model and hyperparameters.

Here is a list of all supervised machine learning models sklearn has to offer: https://scikit-learn.org/stable/supervised_learning.html

Keep in mind that this is a classification problem, so pick classifiers. Regression models need a bit of extra work before they can be applied here.

Also, remember that you have to import the models you want to use first.

Have fun :-)

In [17]:
# Define the model and its hyper parameters
model = svm.SVC()

# train the model 
model.fit(X_train, y_train)

# print training set accuracy
train_acc_log = round(model.score(X_train, y_train) * 100, 2)
print("Prediction accuracy (Training Set):", train_acc_log, "%\n")

# Check and print prediction accuracy 
test_acc_log = round(model.score(X_test, y_test) * 100, 2)
print("Prediction accuracy (Test Set):", test_acc_log, "%\n")

Prediction accuracy (Training Set): 83.99 %

Prediction accuracy (Test Set): 80.45 %
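Because all sklearn classifiers share the same `fit` / `score` interface, several candidates can be compared in one loop. The following is a rough sketch on synthetic stand-in data, not the Titanic data. The choice of models and their settings is arbitrary and only meant as a starting point.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with X_train, X_test, ... above)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Arbitrary selection of classifiers with illustrative hyperparameters
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=3, random_state=1),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=1),
}

results = {}
for name, clf in candidates.items():
    clf.fit(Xd_train, yd_train)
    train_acc = clf.score(Xd_train, yd_train)
    test_acc = clf.score(Xd_test, yd_test)
    results[name] = (train_acc, test_acc)
    print(f"{name}: train {train_acc:.2%}, test {test_acc:.2%}")
```

A large gap between training and test accuracy is a hint that a model is overfitting.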



### MEDIUM: Automatic Feature Selection

To prevent overfitting, only the most important features should be used within a model. When selecting an optimal set of features, we seek to find the sweet spot between high variance and high bias. In other words, we don't want to throw away relevant information, but neither do we want to include spurious relations.

In Recursive Feature Elimination with Cross-Validation (RFECV), features are ranked according to their importance for making predictions. Then two steps are repeated until only a minimum number of features remains: (1) models are trained and cross-validated (e.g., using accuracy as the performance metric), and (2) the least important feature is eliminated. In the end, the feature set with the best average out-of-sample prediction performance is selected automatically.
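To illustrate the idea, the loop can be written out by hand. This is a simplified sketch and not how sklearn's `RFECV` is implemented: it ranks features once per round on the full data, whereas `RFECV` repeats the elimination inside each cross-validation fold. It uses synthetic stand-in data and takes the absolute logistic regression coefficient as the importance measure, which assumes features on comparable scales.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: NOT the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # indices of features still in play
history = []  # (mean CV accuracy, feature indices) for each round

while remaining:
    clf = LogisticRegression(max_iter=1000)

    # Step 1: cross-validate a model on the current feature set
    cv_acc = cross_val_score(
        clf, X_demo[:, remaining], y_demo, cv=5, scoring="accuracy"
    ).mean()
    history.append((cv_acc, list(remaining)))
    if len(remaining) == 1:
        break

    # Step 2: drop the feature with the smallest absolute coefficient
    clf.fit(X_demo[:, remaining], y_demo)
    weakest = int(np.argmin(np.abs(clf.coef_[0])))
    del remaining[weakest]

# Pick the feature set with the best mean cross-validated accuracy
best_acc, best_features = max(history, key=lambda item: item[0])
print("Best CV accuracy:", round(best_acc * 100, 2), "%")
print("Selected feature indices:", best_features)
```

Note that `RFECV` needs an estimator that exposes `coef_` or `feature_importances_`, so not every model can be plugged in directly.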

If you're up for a little challenge, look at the documentation for [Scikit's own RFECV module](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html) and try to implement it for one of your models.

In [None]:
from sklearn.feature_selection import RFECV

model =  # TODO: Add your model here

# Create RFECV instance and let it do the work (TODO: fill out the gaps accordingly)
selector = RFECV(estimator=..., scoring="accuracy")  
selector.fit(...)  

# The rest of this cell was written for your convenience
# No need to modify it

# Print the selected features
support = selector.support_ # List of Booleans: Is feature selected?
optimal_features = np.array(feature_cols)[support] # Selected feature names (in column order, not ranked by importance)
print("Selected features:", optimal_features)

# Check and print prediction accuracy after automatic feature selection
test_acc = model.fit(X_train[optimal_features], y_train) \
                .score(X_test[optimal_features], y_test)
print("Out-of-sample prediction accuracy:", round(test_acc * 100, 2), "%")

### HARD: Hyperparameter Tuning with Grid Search

Hyperparameters (HPs) are meta settings of an ML algorithm that are not learned, but need to be specified before the learning process even starts. Examples of HPs for the simple logistic regression model that you dealt with above are the regularization parameter C and the maximum number of iterations during the optimization process, max_iter. You can find many more HPs [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Most algorithms offer a myriad of HPs, and for optimal model performance we want to find the best combination of them. The most popular methods for automatic HP optimization (or 'tuning') are exhaustive grid search and random search. Both are straightforward, model-free methods.

In [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), we just define for each HP which values we're interested in and then check all the possible combinations of HP values! This can take some time, but we'll be sure to find the best solution out of those tested. 
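The "check all possible combinations" idea can be spelled out with a plain loop. This is a hand-rolled sketch on synthetic stand-in data, meant to show what happens conceptually. It is not a replacement for `GridSearchCV`, and the search space values are arbitrary.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: NOT the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative search space: 3 x 2 = 6 combinations
demo_space = {"C": [0.1, 1, 10], "max_iter": [200, 1000]}
names = list(demo_space)

grid_results = []
for values in product(*(demo_space[name] for name in names)):
    params = dict(zip(names, values))
    # Evaluate this combination with 5-fold cross-validation
    score = cross_val_score(
        LogisticRegression(**params), X_demo, y_demo, cv=5, scoring="accuracy"
    ).mean()
    grid_results.append((score, params))

best_score, best_params = max(grid_results, key=lambda item: item[0])
print("Best combination:", best_params)
print("Mean CV accuracy:", round(best_score * 100, 2), "%")
```

The number of combinations is the product of the list lengths, so the runtime grows quickly with every additional HP.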

Now it's your turn again: Take a simple model (like logistic regression), pick at least two hyperparameters (like C and max_iter), define the HP search spaces, and implement grid search. 

In [None]:
from sklearn.model_selection import GridSearchCV

# TODO: Define search space for at least two hyperparameters
search_space = {
    "C": [0.1, 1, 10, 100],
    ...
}

# Run automatic HP tuning (TODO: Fill the gaps accordingly)
classifier = GridSearchCV(estimator=..., param_grid=...)
classifier.fit(...)

# The rest of this cell was written for your convenience
# No need to modify it

# Print the selected HP values
print(classifier.best_params_)

# Check and print prediction accuracy after hyperparameter tuning
test_acc = classifier.best_estimator_.score(X_test, y_test)
print("Out-of-sample prediction accuracy:", round(test_acc * 100, 2), "%")

### VERY HARD: Hyperparameter Tuning with Random Search

Apart from Grid Search, another algorithm to perform automatic hyperparameter tuning is called Random Search.

In random search, we repeatedly pick a random value for each HP and then test the resulting combination. We do this n times and go with the best performer. 
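The procedure described above can be sketched by hand in a few lines. This is only an illustration of the idea on synthetic stand-in data. The model (a decision tree), the HP ranges, and the number of trials are arbitrary assumptions, and the challenge below still asks you to use `RandomizedSearchCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: NOT the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

n_trials = 10
trials = []
for _ in range(n_trials):
    # Draw one random value per hyperparameter (upper bounds are exclusive)
    params = {
        "max_depth": int(rng.integers(1, 11)),
        "min_samples_leaf": int(rng.integers(1, 21)),
        "criterion": str(rng.choice(["gini", "entropy"])),
    }
    clf = DecisionTreeClassifier(random_state=0, **params)
    score = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    trials.append((score, params))

# Go with the best performer among the n sampled combinations
best_score, best_params = max(trials, key=lambda item: item[0])
print("Best combination:", best_params)
print("Mean CV accuracy:", round(best_score * 100, 2), "%")
```

Unlike grid search, the cost here is fixed by `n_trials`, no matter how many HPs or candidate values are involved.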

For this challenge, implement [Random Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) without further assistance! 

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# TODO: Implement Random Search for a model of your choice.
# Tune at least three hyperparameters and determine the out-of-sample prediction accuracy. 

## How did you do?

What was the best model you found? Post your high score on the leaderboard on Miro and be prepared to explain briefly how you arrived there :-) Good luck!

### BONUS: Evaluation Metrics

How well are our models doing? That's a super important question which is not always easy to answer! Throughout this notebook, we have (somewhat arbitrarily) picked a very simple evaluation metric, the accuracy (i.e., what percentage of our predictions were correct?).

If you're curious and have some more time, read [this article](https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226) and find out about alternatives. 
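As a small taste, here is a sketch that computes a few common alternatives with `sklearn.metrics`. The labels are made up for illustration (1 = survived, 0 = died). In this notebook you would compare `y_test` against `model.predict(X_test)` instead.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions for ten passengers
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)    # share of correct predictions
prec = precision_score(y_true, y_pred)  # of predicted survivors, how many survived?
rec = recall_score(y_true, y_pred)      # of actual survivors, how many were found?
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print("Accuracy: ", round(acc, 3))
print("Precision:", round(prec, 3))
print("Recall:   ", round(rec, 3))
print("F1 score: ", round(f1, 3))
print(cm)
```

In this toy example the accuracy is 0.8, while precision and recall for the "survived" class are both only 2/3. The gap arises because most passengers belong to the "died" class, which is easier to get right.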