In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import read_csv, DataFrame
from sklearn.preprocessing import StandardScaler
from src.preprocess import Preprocess
from src.models.mlp import MLP

## Problem 2 - I2A2

The goal of this project is to apply the CRISP-DS methodology to a well-known dataset, the Titanic survival dataset. Following CRISP-DS, the solution is divided into 5 steps, since no deployment is needed for this task. The steps are listed below as notebook sections.

1. [Business Understanding](problem2.ipynb#1.-Business-Understanding)
2. [Data Understanding](problem2.ipynb#2.-Data-Understanding)

    2.1 [Data Dictionary](problem2.ipynb#Data-Dictionary)
    
    2.2 [Exploratory Data Analysis](problem2.ipynb#Exploratory-Data-Analysis)
3. [Data Preparation](problem2.ipynb#3.-Data-Preparation)
4. [Modeling](problem2.ipynb#4.-Modeling)
5. [Evaluation](problem2.ipynb#5.-Evaluation)

[References](problem2.ipynb#References)

### 1. Business Understanding

Titanic was a luxury passenger liner that sank on April 14–15, 1912, during its maiden voyage, en route to New York City from Southampton, England, killing about 1,500 passengers and ship personnel. One of the most famous tragedies in modern history, it inspired numerous stories, several films, and a musical and has been the subject of much scholarship and scientific speculation.

This project aims to predict whether a passenger survived or not based on the available data.

### 2. Data Understanding

In this step the data is read and we take a first look at its properties, aiming for a better understanding of the information provided.

A sample from the training dataset is shown below.

In [None]:
raw_train_df = read_csv('data/train.csv', index_col=0)
raw_test_df = read_csv('data/test.csv', index_col=0)
y_train = raw_train_df['Survived']
raw_train_df.head()  # display a sample of the training data

In [None]:
print(f'The shapes of the datasets are: train: {raw_train_df.shape}, test: {raw_test_df.shape}')

The test dataset has one fewer column than the training dataset because the `Survived` column is not provided; it is the value to be predicted.
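
One way to confirm which column is missing is to compare the column sets directly. The snippet below is a minimal sketch on two small hypothetical frames (`train_demo` and `test_demo` are illustrative stand-ins, not real rows); the same expression should apply to `raw_train_df` and `raw_test_df`.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the train and test sets (not real rows).
train_demo = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Age": [22.0, 38.0]})
test_demo = pd.DataFrame({"Pclass": [3, 2], "Age": [34.5, 47.0]})

# Columns present in the training frame but absent from the test frame.
missing_in_test = train_demo.columns.difference(test_demo.columns)
print(list(missing_in_test))
```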

#### Data Dictionary

Variable|Definition|Key
--------|----------|---
survival|Survival|0 = No, 1 = Yes
pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd
sex|Sex|
age|Age in years|
sibsp|# of siblings / spouses aboard the Titanic|
parch|# of parents / children aboard the Titanic|
ticket|Ticket number|
fare|Passenger fare|
cabin|Cabin number|
embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

- `pclass`: A proxy for socio-economic status (SES).
    - 1st = Upper
    - 2nd = Middle
    - 3rd = Lower
- `age`: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
- `sibsp`: The dataset defines family relations in this way:
    - Sibling = brother, sister, stepbrother, stepsister
    - Spouse = husband, wife (mistresses and fiancés were ignored)
- `parch`: The dataset defines family relations in this way:
    - Parent = mother, father
    - Child = daughter, son, stepdaughter, stepson
    - Some children travelled only with a nanny, therefore parch=0 for them.

#### Exploratory Data Analysis

Printing the type of each column shows that some columns that should be treated as categorical, such as `Pclass` and `Survived`, are read as numeric.

The methods `describe()` and `hist()` are also used to inspect how the features are distributed.

In [None]:
raw_train_df.dtypes
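
If the numeric reading of categorical columns becomes a problem, pandas can reinterpret them with `astype`. The following is a minimal sketch on a small hypothetical frame (`demo_df` is illustrative, not the real data, and the column names are assumed to match the Kaggle CSV header):

```python
import pandas as pd

# Hypothetical rows; integer codes are read as int64 by default.
demo_df = pd.DataFrame({"Survived": [0, 1, 1], "Pclass": [3, 1, 2], "Age": [22.0, 38.0, 26.0]})

# Reinterpret the coded columns as categorical; other columns keep their dtype.
demo_df = demo_df.astype({"Survived": "category", "Pclass": "category"})
print(demo_df.dtypes)
```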

In [None]:
raw_train_df.describe(include='all')

In [None]:
hist = raw_train_df.hist(figsize=(12,8))
hist

In [None]:
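fig, ax = plt.subplots(figsize=(12,8))
# Correlate numeric columns only; text columns such as names cannot be correlated
sns.heatmap(raw_train_df.select_dtypes(include='number').corr(), ax=ax)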

### 3. Data Preparation

Based on the Business Understanding step, some features are more likely than others to be relevant to the prediction. For example, `age` is more relevant to the odds of survival than `name`. As a first iteration, the less relevant features are therefore dropped.

Missing values in the remaining features are filled with the mean of the feature.

In [None]:
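# Set up preprocessing with the training data and a StandardScaler
preprocesser = Preprocess(raw_train_df, StandardScaler())
# Apply the same preprocessing to the training and test sets
X_train = preprocesser.transform(raw_train_df)
X_test = preprocesser.transform(raw_test_df)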
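
The `Preprocess` class is imported from `src.preprocess` and is not reproduced in this notebook. As a rough sketch of the steps described above (drop less relevant columns, fill missing values with the mean, standardize), here is a plain pandas version. The names `prepare_features` and `DROPPED_COLUMNS` are hypothetical, the list of dropped columns is an assumption, and the sketch keeps numeric columns only, so the encoding of columns such as `Sex` is left out.

```python
import pandas as pd

# Hypothetical choice of columns to discard; the project's actual list may differ.
DROPPED_COLUMNS = ["Name", "Ticket", "Cabin"]


def prepare_features(df: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Drop columns, mean-impute, and standardize `df` with statistics from `reference`."""
    kept = reference.drop(columns=DROPPED_COLUMNS, errors="ignore")
    numeric_columns = kept.select_dtypes(include="number").columns

    # Column means computed on the reference (training) data; NaN values are skipped.
    means = reference[numeric_columns].mean()
    # Standard deviation of the imputed reference data (ddof=0, as StandardScaler uses).
    stds = reference[numeric_columns].fillna(means).std(ddof=0)

    features = df[numeric_columns].fillna(means)  # mean imputation
    return (features - means) / stds              # standardization


# Illustrative rows only, not taken from the real dataset.
train_demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [20.0, None, 40.0, 30.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})
scaled = prepare_features(train_demo, reference=train_demo)
print(scaled)
```

Passing the training frame as `reference` for both sets keeps the test data from influencing the imputation and scaling statistics.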

### 4. Modeling

A simple MLP (multilayer perceptron) model was chosen, and the predictions were generated.

In [None]:
mlp = MLP()
scores = mlp.train(X_train, y_train)
scores

y_pred = mlp.predict(X_test)

predictions: DataFrame = DataFrame()
predictions['PassengerId'] = raw_test_df.index
predictions['Survived'] = y_pred
predictions
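
The `MLP` wrapper is imported from `src.models.mlp` and its internals are not shown here. For readers without that module, a comparable baseline could be sketched with scikit-learn's `MLPClassifier` and `cross_val_score`. This is a substitute technique run on synthetic data, not the project's implementation, and the hyperparameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic, already-scaled features and binary labels standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Hyperparameters are illustrative assumptions, not tuned values.
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)

# 5-fold cross-validated accuracy on the training data.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores.mean())

# Fit on all training data, then predict labels for unseen rows.
model.fit(X_demo, y_demo)
y_pred_demo = model.predict(rng.normal(size=(5, 4)))
print(y_pred_demo)
```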

### 5. Evaluation

Since the true labels for the test set are not provided, the predictions were evaluated by submitting them to the official Kaggle page for this challenge.

The simple model obtained a score of 0.74.

![Kaggle Results](kaggle_results.png)
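
The Kaggle score for this competition is classification accuracy, that is, the fraction of passengers whose label was predicted correctly. A small sketch with made-up labels (the arrays are illustrative, not the actual submission):

```python
import numpy as np

# Made-up ground truth and predictions for four passengers.
y_true_demo = np.array([0, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0])

# Accuracy = number of correct predictions / number of predictions.
accuracy = float((y_true_demo == y_pred_demo).mean())
print(accuracy)  # 0.75
```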

### References

1. Encyclopædia Britannica, inc. (n.d.). Titanic. Encyclopædia Britannica. Retrieved December 25, 2021, from https://www.britannica.com/topic/Titanic.

2. Titanic - machine learning from disaster. Kaggle. (n.d.). Retrieved December 25, 2021, from https://www.kaggle.com/c/titanic/data 