### The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

### Import basic libraries

In [468]:
import pandas as pd
import numpy as np

### Data Loading

In [469]:
train = pd.read_csv('data/Titanic - Machine Learning from Disaster/train.csv')
test = pd.read_csv('data/Titanic - Machine Learning from Disaster/test.csv')

### Feature Engineering

In [470]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [471]:
# Count missing values (NaN) per column
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [472]:
# Map Embarked to a numeric code (currently unused: one-hot encoding is applied in the next cell instead)
def embarked_num(i):
    if i == 'S':
        return 1

    elif i == 'C':
        return 2

    elif i == 'Q':
        return 3

    #elif np.isnan(i):
        #return -1

In [473]:
#train['Age'] = train['Age'].fillna(-1)
#train['Embarked'] = train['Embarked'].map(embarked_num)

train = pd.get_dummies(train, columns=['Sex','Pclass','Embarked'])
test = pd.get_dummies(test, columns=['Sex','Pclass','Embarked'])
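
Because `pd.get_dummies` is applied to the train and test frames separately, the two only end up with the same dummy columns if every category appears in both files. The following is a minimal sketch of one way to guard against a mismatch, assuming toy frames (`toy_train`, `toy_test` and their values are illustrative and not taken from the notebook's data):

```python
import pandas as pd

# Toy data (hypothetical): 'Q' appears only in the training frame.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({"Embarked": ["S", "C"], "Fare": [7.90, 53.10]})

train_enc = pd.get_dummies(toy_train, columns=["Embarked"])
test_enc = pd.get_dummies(toy_test, columns=["Embarked"])

# Reindex the test frame onto the training columns; dummy columns that are
# missing from the test split are created and filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

In this notebook the explicit column lists selected a few cells later would raise a `KeyError` if a dummy column were missing, so the reindex step is only a safeguard.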

In [474]:
train.head(1)

Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
0,1,0,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,0,1,0,0,1,0,0,1


In [475]:
# Select the numerical features only
variables_train = ['Age','SibSp','Parch','Fare','Sex_female','Sex_male',
'Pclass_1','Pclass_2','Pclass_3','Embarked_C','Embarked_Q','Embarked_S','Survived']

variables_test = ['Age','SibSp','Parch','Fare','Sex_female','Sex_male',
'Pclass_1','Pclass_2','Pclass_3','Embarked_C','Embarked_Q','Embarked_S']

In [476]:
train = train[variables_train]

test = test[variables_test]

In [477]:
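# Fill all NaN values with a model-based (iterative) imputer
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer

# min_value=1 bounds the imputed values from below; values already present are left untouched
imp = IterativeImputer(max_iter=10, random_state=0, min_value=1)
# Note: the train fit includes the Survived column, and the imputer is refit separately on test
train_clean = pd.DataFrame(imp.fit_transform(train),columns=variables_train)
test_clean = pd.DataFrame(imp.fit_transform(test),columns=variables_test)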
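
The cell above fits the imputer twice: once on the training frame (target column included) and once on the test frame. A common alternative is to fit on the training features only and reuse the fitted imputer on the test features. The sketch below illustrates that variant on toy data; the frames, their values, and the name `feature_cols` are assumptions made for the example, and results on the real data would differ from those reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

feature_cols = ["Age", "SibSp", "Fare"]  # hypothetical subset of features

toy_train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "SibSp": [1, 1, 0, 1, 0, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Survived": [0, 1, 1, 1, 0, 0],
})
toy_test = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0],
    "SibSp": [0, 1, 0],
    "Fare": [7.83, 7.00, 9.69],
})

imputer = IterativeImputer(max_iter=10, random_state=0)

# Learn the imputation model from the training features only (no target).
train_filled = pd.DataFrame(
    imputer.fit_transform(toy_train[feature_cols]), columns=feature_cols
)
train_filled["Survived"] = toy_train["Survived"].to_numpy()

# Reuse the fitted imputer on the test features instead of refitting.
test_filled = pd.DataFrame(
    imputer.transform(toy_test[feature_cols]), columns=feature_cols
)
```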

In [494]:
train_clean[train_clean['Age']<1]

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Survived
78,0.83,0.0,2.0,29.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
305,0.92,1.0,2.0,151.55,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
469,0.75,2.0,1.0,19.2583,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
644,0.75,2.0,1.0,19.2583,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
755,0.67,1.0,1.0,14.5,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
803,0.42,0.0,1.0,8.5167,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
831,0.83,1.0,1.0,18.75,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0


### Split Features and Target

In [478]:
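# Build the feature matrix X and the target vector y
X = train_clean[['Age','SibSp','Parch','Fare','Sex_female','Sex_male',
'Pclass_1','Pclass_2','Pclass_3','Embarked_C','Embarked_Q','Embarked_S']]
# The imputer returns floats, so cast the labels back to integers
y = train_clean['Survived'].astype(int)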

### Model

In [479]:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, ShuffleSplit

#model = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
model = XGBClassifier()

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

data = {'Model Name': [model.__class__.__name__], 'Accuracy': scores.mean()}  
results = pd.DataFrame(data)

results


Unnamed: 0,Model Name,Accuracy
0,XGBClassifier,0.820382
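
The table above reports only the mean accuracy across folds. Looking at the spread between folds as well can help judge whether a difference between two models is meaningful. A minimal sketch, assuming synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` (both are substitutions made for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (not the Titanic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold.
fold_scores = cross_val_score(clf, X_demo, y_demo, cv=cv_demo, scoring="accuracy")

print(f"accuracy: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```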


### Pipeline

In [480]:
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.pipeline import make_pipeline

selection = SelectPercentile(chi2, percentile=50)

pipe = make_pipeline(selection, model)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')

pipe_model = pipe.fit(X, y)

data = {'Model Name': [pipe_model.__class__.__name__], 'Accuracy': scores.mean()}  
results = pd.DataFrame(data)

results

Unnamed: 0,Model Name,Accuracy
0,Pipeline,0.813653
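
`SelectPercentile(chi2, percentile=50)` keeps roughly the top half of the features ranked by their chi-squared score, and `chi2` requires non-negative feature values. To see which columns survive the selection step, the fitted selector's `get_support()` mask can be inspected. The sketch below uses random toy data and `LogisticRegression` as a stand-in classifier; the feature names `f0`..`f3` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
feature_names = np.array(["f0", "f1", "f2", "f3"])

# chi2 needs non-negative inputs, so use non-negative integer-valued features.
X_demo = rng.integers(0, 10, size=(100, 4)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 9).astype(int)

demo_pipe = make_pipeline(SelectPercentile(chi2, percentile=50), LogisticRegression())
demo_pipe.fit(X_demo, y_demo)

# make_pipeline names the step after the lowercased class name.
mask = demo_pipe.named_steps["selectpercentile"].get_support()
print(feature_names[mask])
```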


### Tuning Hyperparameters

In [487]:
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, RandomizedSearchCV

# Define the search space
param_grid = { 
    # Learning rate shrinks the weights to make the boosting process more conservative
    "learning_rate": [0.0001,0.001, 0.01, 0.1, 1],

    # Maximum depth of the tree, increasing it increases the model complexity.
    "max_depth": range(3,21,3),

    # Number of Estimators

    #'n_estimators': [100,200,300,400,500],

    # Gamma specifies the minimum loss reduction required to make a split.
    "gamma": [i/10.0 for i in range(0,5)],

    # Fraction of columns to be randomly sampled for each tree.
    "colsample_bytree": [i/10.0 for i in range(3,10)],

    # reg_alpha provides L1 regularization on the weights; higher values result in more conservative models
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 10, 100],

    # reg_lambda provides L2 regularization on the weights; higher values result in more conservative models
    "reg_lambda": [1e-5, 1e-2, 0.1, 1, 10, 100]
    }

# Set up score
scoring = ['accuracy']

# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
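
The search space above uses bare classifier parameter names. If a search is run over a scikit-learn `Pipeline` rather than the classifier itself, each key has to be prefixed with the step name and a double underscore. The sketch below shows how to look up the valid names; it uses `GradientBoostingClassifier` as a stand-in for `XGBClassifier`, and `prefixed_grid` is a hypothetical search space.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.pipeline import make_pipeline

demo_pipe = make_pipeline(
    SelectPercentile(chi2, percentile=50), GradientBoostingClassifier()
)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

# Hypothetical search space written with step-prefixed parameter names.
prefixed_grid = {
    "gradientboostingclassifier__learning_rate": [0.01, 0.1],
    "gradientboostingclassifier__max_depth": [3, 6],
}

# Every key must be one of the names reported by get_params().
valid_names = set(demo_pipe.get_params().keys())
all_valid = all(key in valid_names for key in prefixed_grid)
print(all_valid)
```

An unprefixed key such as `learning_rate` is not a parameter of the pipeline object, so passing it to a search over the pipeline raises an error.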

In [489]:
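# Define random search
# The search space uses bare XGBClassifier parameter names, so the search runs on the
# classifier itself; searching over the pipeline would require step-prefixed names.
random_search = RandomizedSearchCV(estimator=model,
                                   param_distributions=param_grid,
                                   n_iter=48,
                                   scoring=scoring,
                                   refit='accuracy',
                                   n_jobs=-1,
                                   cv=kfold,
                                   verbose=0)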
                
#grid_search = GridSearchCV(model,
#                param_grid,
#                scoring='accuracy',
#                cv = kfold)

# Fit the random search
random_result = random_search.fit(X, y)
# Print the best score and the corresponding hyperparameters
print(f'The best score is {random_result.best_score_:.4f}')
#print('The best score standard deviation is', round(random_result.cv_results_['std_test_accuracy'][random_result.best_index_], 4))
print(f'The best hyperparameters are {random_result.best_params_}')

The best score is 0.8451
The best hyperparameters are {'max_depth': 6, 'learning_rate': 0.1, 'gamma': 0.4, 'colsample_bytree': 0.9}
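
When `scoring` is passed as a list, the entries of `cv_results_` are keyed by metric name, for example `mean_test_accuracy` and `std_test_accuracy`. The following sketch reads the mean and standard deviation of the best candidate; it assumes synthetic data and a `DecisionTreeClassifier` with a small hypothetical search space in place of the XGBoost model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the Titanic set).
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]},
    n_iter=4,
    scoring=["accuracy"],
    refit="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_demo, y_demo)

# Look up the cross-validated mean and spread of the best candidate.
best = search.best_index_
mean_acc = search.cv_results_["mean_test_accuracy"][best]
std_acc = search.cv_results_["std_test_accuracy"][best]
print(f"best accuracy: {mean_acc:.4f} (std {std_acc:.4f})")
```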


### Output

In [492]:
y_pred = random_result.predict(test_clean)

In [493]:
df = pd.DataFrame(pd.read_csv('data/Titanic - Machine Learning from Disaster/test.csv')['PassengerId'])
# Make sure the predicted labels are written as integers (0 or 1)
df['Survived'] = y_pred.astype(int)
df.to_csv("./submission.csv", index = False)
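
Before uploading, it can be worth reloading the file and checking its shape and contents. A minimal sketch of such a check on a toy frame, assuming the expected layout is a `PassengerId` column plus a binary `Survived` column (the toy values and the temporary path are illustrative):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the submission frame built above.
toy_submission = pd.DataFrame(
    {"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]}
)

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Basic sanity checks on the written file.
columns_ok = list(reloaded.columns) == ["PassengerId", "Survived"]
labels_ok = bool(reloaded["Survived"].isin([0, 1]).all())
ids_ok = bool(reloaded["PassengerId"].is_unique)
print(columns_ok, labels_ok, ids_ok)
```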