### The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. There weren’t enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (e.g. name, age, gender, socio-economic class).

### Import basic libraries

In [2]:
import pandas as pd
import numpy as np

### Data Loading

In [105]:
train = pd.read_csv('data/Titanic - Machine Learning from Disaster/train.csv')
test = pd.read_csv('data/Titanic - Machine Learning from Disaster/test.csv')

### Feature Engineering

In [106]:
#Check columns with NaN
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
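
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (`toy` and `missing_share` are hypothetical names, not part of the notebook) to show how `isna().mean()` gives the fraction of missing values per column; the same call should work on `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'Q'],
})

# isna() yields booleans; their column-wise mean is the missing fraction.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty, as `Cabin` is in the counts above, is usually a candidate for dropping rather than imputing.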

In [107]:
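#Transform Embarked into a numerical feature
def embarked_num(i):
    #Encode the port of embarkation: S -> 1, C -> 2, Q -> 3
    if i == 'S':
        return 1
    elif i == 'C':
        return 2
    elif i == 'Q':
        return 3
    #Any other value (including NaN) returns None, so the entry stays
    #missing after map() and is filled later by the imputer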
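
The same encoding can also be written as a dictionary passed to `Series.map`. This is only an equivalent sketch on a toy series; `EMBARKED_CODES` and `embarked_encoded` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table mirroring embarked_num: S -> 1, C -> 2, Q -> 3.
EMBARKED_CODES = {'S': 1, 'C': 2, 'Q': 3}

# Toy stand-in for train['Embarked'], including one missing value.
embarked = pd.Series(['S', 'C', np.nan, 'Q'])

# Values that are not keys of the dict (NaN included) come out as NaN.
embarked_encoded = embarked.map(EMBARKED_CODES)
print(embarked_encoded.tolist())
```

The missing entry stays missing, so a later imputation step can still fill it.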

In [108]:
#train['Age'] = train['Age'].fillna(-1)
train['Embarked'] = train['Embarked'].map(embarked_num)
train = pd.get_dummies(train, columns=['Sex'])

In [109]:
#Select numerical features only
variables = ['Pclass','Age','SibSp','Parch','Fare','Embarked','Sex_female','Sex_male','Survived']

In [110]:
train = train[variables]

In [111]:
#Fill all NaN values with iterative (model-based) imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, random_state=0)
train_clean = pd.DataFrame(imp.fit_transform(train),columns=variables)
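
The cell above fits the imputer on every column in `variables`, which includes `Survived`. One possible variation, sketched here on made-up data, fits the imputer on the feature columns only. That keeps the target out of the imputed values and lets the same fitted imputer be applied to data without a `Survived` column, such as `test.csv`. The names `toy`, `feature_cols`, `features_clean` and `toy_clean` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy stand-in for the notebook's `train` frame; the values are invented.
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Age': [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    'Survived': [0, 1, 1, 1, 0, 0],
})

# Hypothetical list of predictors; the target is deliberately left out.
feature_cols = ['Pclass', 'Age', 'Fare']

# Fit the imputer on the features only, then rebuild a DataFrame.
imp = IterativeImputer(max_iter=10, random_state=0)
features_clean = pd.DataFrame(
    imp.fit_transform(toy[feature_cols]),
    columns=feature_cols,
    index=toy.index,
)

# Re-attach the untouched target column.
toy_clean = features_clean.assign(Survived=toy['Survived'])
print(toy_clean.isna().sum().sum())
```

Whether this changes the final accuracy on the real data is not something this sketch can show; it only illustrates the mechanics.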

### Features and Target

In [114]:
#Build the feature matrix X and the target vector y
X = train_clean[['Pclass','Age','SibSp','Parch','Fare','Embarked','Sex_female','Sex_male']]
y = train_clean['Survived']

### Model

In [115]:
#Import libraries for modeling
from sklearn.model_selection import train_test_split
from sklearn import metrics
from xgboost import XGBClassifier

In [116]:
#Train and Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

model = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')

# fit model
model.fit(X_train, y_train)

# make predictions
y_pred = model.predict(X_test)

#Metrics
metrics.accuracy_score(y_test, y_pred)

0.7937219730941704
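
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent class. It runs on made-up labels, and `majority_baseline_accuracy`, `y_toy` and `baseline` are hypothetical names; calling the helper on `y_test` would give the baseline for the split used above.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return counts.max() / counts.sum()

# Toy labels standing in for y_test (five zeros, three ones).
y_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0])

baseline = majority_baseline_accuracy(y_toy)
print(baseline)  # 0.625
```

If the model's accuracy is only slightly above this baseline, the model has learned little beyond the class balance.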