# A few words first

My main goal is to test my own skills and critical thinking, so I do not want to use any guides, tutorials, or existing solutions.

The knowledge I am using is more than a year old, so there are probably a lot of new tools by now.

#### Time tracking
- 14.12.2021 10:00 - 13:00
- 14.12.2021 13:20 - 

----

## Overview
The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
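
To make the gender-based example concrete, here is a minimal sketch of how such a baseline could be built. The toy frame stands in for test.csv and its values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for test.csv (values are made up for illustration)
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Predict survival for every female passenger and death for every male one
baseline = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})
print(baseline)
```

Any model compared against this file is therefore compared against a rule on a single feature, not against what really happened.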

## Variable Notes
- pclass: A proxy for socio-economic status (SES)
  - 1st = Upper
  - 2nd = Middle
  - 3rd = Lower
- age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
- sibsp: The dataset defines family relations in this way...
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: The dataset defines family relations in this way...
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children travelled only with a nanny, therefore parch=0 for them.
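
The sibsp and parch counts lend themselves to the feature engineering mentioned in the overview. The sketch below is only an illustration: the names FamilySize and IsAlone are hypothetical, and the notebook itself does not use them.

```python
import pandas as pd

# Toy rows; SibSp and Parch are the dataset's column names, values are made up
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Hypothetical engineered feature: the passenger plus all relatives aboard
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Hypothetical flag for passengers travelling without any listed relative
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy)
```

Because of the nanny rule above, a child with parch=0 would be flagged as alone by such a feature even though an adult accompanied them.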

----

In [12]:
import pandas as pd

# Load data
gender_submission = pd.read_csv("gender_submission.csv") 
train = pd.read_csv("train.csv") 
test = pd.read_csv("test.csv") 

In [13]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Exploratory Analysis

In [14]:
# nans?
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [15]:
test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
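
Raw counts are hard to compare between two sets of different size, so a share per column can be easier to read. A minimal sketch on a toy frame (the values are illustrative, not taken from the real files):

```python
import numpy as np
import pandas as pd

# Toy frame with some gaps; column names follow the dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# isna() gives booleans, so the column mean is the fraction of missing values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```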

# Feature Engineering

**Looking at the missing values...**
- **Age** is a very important feature and it has a lot of missing values -> It cannot be removed, so let's try to estimate it with unsupervised learning (nearest-neighbour imputation)
- **Cabin** surely has an effect on survival because of its location on the ship. We cannot work with the full cabin number since it would create too many features. It is better to use just its letter (i.e. the deck) -> It is categorical, so let's replace missing values with a new category
- **Embarked**... were the missing values the captains? Who knows... It is a categorical feature, so let's replace missing values with a new category
- **Fare** is a numerical float -> replace missing values with 0
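
As a quick illustration of the Cabin idea, the sketch below keeps only the leading letter and treats a missing cabin as its own category. It is a simplification under the assumption that the deck letter comes first in the string; the notebook's own helper further down scans for letters instead. The sample values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy cabin values; NaN marks an unknown cabin
cabin = pd.Series(["C85", np.nan, "C123", "E46"])

# Missing cabins get the placeholder 'Z', then only the first character is kept
deck = cabin.fillna("Z").str[0]

# One column per deck letter instead of one per distinct cabin number
deck_dummies = pd.get_dummies(deck, prefix="Cabin")
print(deck_dummies)
```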

In [16]:
# remove name and Ticket
train.drop(['Name', 'Ticket'], axis=1, inplace=True)
test.drop(['Name', 'Ticket'], axis=1, inplace=True)

In [17]:
# merge datasets for some exploration
# (DataFrame.append was removed in pandas 2.0, pd.concat does the same job)
df_merged = pd.concat([train, test]).reset_index(drop=True)

### Solve missing values

In [18]:
# cabins contain block letter, check what is available
import string
for letter in string.ascii_lowercase:
    if not any(df_merged['Cabin'].dropna().str.contains(letter.capitalize())):
        print(letter)

h
i
j
k
l
m
n
o
p
q
r
s
u
v
w
x
y
z


In [19]:
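# 'Z' is free, so use it to mark a missing cabin
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
train['Cabin'] = train['Cabin'].fillna('Z')
test['Cabin'] = test['Cabin'].fillna('Z')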

In [20]:
# embarked
df_merged['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [21]:
# 'A' is not used by any port, so it marks a missing value
train['Embarked'] = train['Embarked'].fillna('A')
test['Embarked'] = test['Embarked'].fillna('A')

In [22]:
# Fare: missing values become 0
train['Fare'] = train['Fare'].fillna(0)
test['Fare'] = test['Fare'].fillna(0)

In [23]:
# create feature from cabin
def contained_letter(value):
    """Return the alphabetically first uppercase letter found in value, or 'Z' if there is none."""
    for letter in string.ascii_uppercase:
        if letter in value:
            return letter
    return 'Z'
train['Cabin'] = train['Cabin'].apply(contained_letter)
test['Cabin'] = test['Cabin'].apply(contained_letter)

In [24]:
# prepare one-hot-encoded columns
df_train_dummy = pd.get_dummies(train.drop(['Survived','PassengerId'], axis=1))
df_test_dummy = pd.get_dummies(test.drop(['PassengerId'], axis=1))

# handle missing columns
df_test_dummy = df_test_dummy.reindex(columns=df_train_dummy.columns, fill_value=0)
df_train_dummy = df_train_dummy.reindex(columns=df_test_dummy.columns, fill_value=0)
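
To show what the column alignment does, here is a small sketch on toy data where one category exists only in the training frame. The frames and their values are made up for illustration.

```python
import pandas as pd

# Category 'T' appears only in the toy training frame
toy_train = pd.DataFrame({"Cabin": ["C", "T", "Z"]})
toy_test = pd.DataFrame({"Cabin": ["C", "Z", "Z"]})

train_d = pd.get_dummies(toy_train)
test_d = pd.get_dummies(toy_test)

# Give the test frame exactly the training columns; absent ones are filled with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d)
```

After this step both frames share the same columns in the same order, so reindexing the training frame against the test frame changes nothing. A category that exists only in the test data would be dropped by the first reindex.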

In [25]:
# age - impute
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5, weights="distance") # "distance" might work better than "uniform"
# Fit on the training features only: "Survived" is unknown for the test set
# and "PassengerId" is of no use for imputation or for prediction
imputer.fit(df_train_dummy)
# train
train_dummy = pd.DataFrame(imputer.transform(df_train_dummy), columns=df_train_dummy.columns)
# test
test_dummy = pd.DataFrame(imputer.transform(df_test_dummy), columns=df_test_dummy.columns)
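
For intuition about what the imputer does, here is a minimal sketch on a toy array with a single gap. It uses uniform weights so the arithmetic is easy to follow; the numbers are made up.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Second column is missing in the third row
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
])

# The two nearest rows (by the first column) are [2, 20] and [4, 40]
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)
print(X_filled[2, 1])  # mean of 20 and 40
```

Distances are computed over all available columns, so in the real data an unscaled column such as Fare likely weighs much more than the 0/1 dummy columns when neighbours are chosen.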

# Preprocessing
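Because of the large number of one-hot-encoded columns, I decided to use normalization -> only the "Fare" and "Age" features are outside the interval <0,1>. Note that scikit-learn's Normalizer rescales each row (here each Age/Fare pair) to unit norm; mapping each column into <0,1> would be the job of MinMaxScaler.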
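
The difference between row-wise and column-wise scaling is easy to miss, so here is a small comparison on toy (Age, Fare) rows. This is only a sketch with illustrative values; it does not change what the notebook does next.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Toy (Age, Fare) rows
X = np.array([[22.0, 7.25], [38.0, 71.2833], [26.0, 7.925]])

# Normalizer: every row is divided by its own L2 norm
row_scaled = Normalizer().fit_transform(X)

# MinMaxScaler: every column is mapped onto [0, 1]
col_scaled = MinMaxScaler().fit_transform(X)

print(np.linalg.norm(row_scaled, axis=1))           # all rows have length 1
print(col_scaled.min(axis=0), col_scaled.max(axis=0))
```

With Normalizer, the scaled Age of a passenger depends on that passenger's Fare and vice versa, and fitting it learns nothing from the training data.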

In [26]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
# fit & transform train
train_dummy.loc[:,['Age','Fare']] = normalizer.fit_transform(train_dummy.loc[:,['Age','Fare']])
# transform test
test_dummy.loc[:,['Age','Fare']] = normalizer.transform(test_dummy.loc[:,['Age','Fare']])

# Predict survival
Last time I checked, gradient-boosted algorithms were state of the art, so let's use one.

In [27]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=30, learning_rate=0.3, max_depth=5, random_state=0)
clf.fit(train_dummy, train['Survived'])
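# gender_submission holds the gender-based example predictions, not the ground truth,
# so this measures agreement with that baseline
clf.score(test_dummy, gender_submission['Survived'])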

0.8277511961722488

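Looking good, let's try experimenting. Keep in mind that the score is measured against gender_submission.csv, i.e. the gender-based example predictions rather than the true outcomes. Gradient boosting has randomized components, so the result can differ between newly created models unless random_state is fixed.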
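
Since test.csv comes without ground truth, one way to estimate accuracy on unseen data is cross-validation on the training set. The sketch below uses synthetic data from make_classification as a stand-in so that it runs on its own; with the real data, train_dummy and train['Survived'] would take the place of X and y.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training features and labels
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = GradientBoostingClassifier(n_estimators=30, learning_rate=0.3,
                                 max_depth=5, random_state=0)

# Five folds: each fold is held out once and scored against its true labels
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```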

# Finetuning/experimenting time
Let's make the whole process parametrizable and then use Bayesian optimization
- Age imputation KNN n_neighbors
- GBC n_estimators
- GBC learning_rate
- GBC max_depth

Bayesian optimization in combination with a gradient-boosted algorithm is probably overkill for such a task, but...
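
The search further down follows an ask/tell pattern: ask for a candidate, evaluate it, and tell the optimizer the resulting loss. The sketch below shows only that protocol. RandomSearch is a hypothetical stand-in that samples uniformly instead of modelling the objective, and toy_score is a made-up function, so this is not Bayesian optimization itself.

```python
import random

class RandomSearch:
    """Hypothetical stand-in with an ask/tell interface (uniform sampling only)."""

    def __init__(self, bounds, seed=0):
        self.bounds = bounds
        self.rng = random.Random(seed)
        self.history = []  # (params, loss) pairs

    def ask(self):
        # Propose one value per dimension within its bounds
        return [self.rng.uniform(lo, hi) for lo, hi in self.bounds]

    def tell(self, params, loss):
        self.history.append((params, loss))

    def best(self):
        return min(self.history, key=lambda item: item[1])

def toy_score(params):
    # Made-up score in [0.5, 1] that peaks at learning_rate = 0.5
    (learning_rate,) = params
    return 1.0 - abs(learning_rate - 0.5)

search = RandomSearch(bounds=[(0.0, 1.0)])
for _ in range(20):
    candidate = search.ask()
    # The search minimizes, so a score to maximize is passed as 1 - score
    search.tell(candidate, 1.0 - toy_score(candidate))

best_params, best_loss = search.best()
print(best_params, best_loss)
```

A Bayesian optimizer keeps the same loop but uses the history to decide where to ask next.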

In [28]:
# prepare the data once so it does not need to be loaded every time
gender_submission_static = pd.read_csv("gender_submission.csv") 
train_static = pd.read_csv("train.csv") 
test_static = pd.read_csv("test.csv") 
train_static.drop(['Name', 'Ticket'], axis=1, inplace=True)
test_static.drop(['Name', 'Ticket'], axis=1, inplace=True)

In [29]:
# define function
def opt_fnc(hp_knn_neighbors, hp_gbc_estimators, hp_gbc_learning_rate, hp_gbc_max_depth, predict=False):
    # copy source data from static
    train = train_static.copy(deep=True)
    test = test_static.copy(deep=True)
    
    # Feature Engineering    
    # 'Z' is free, so it marks a missing cabin
    train['Cabin'] = train['Cabin'].fillna('Z')
    test['Cabin'] = test['Cabin'].fillna('Z')
    
    # embarked: 'A' marks a missing port
    train['Embarked'] = train['Embarked'].fillna('A')
    test['Embarked'] = test['Embarked'].fillna('A')
    
    # Fare: missing values become 0
    train['Fare'] = train['Fare'].fillna(0)
    test['Fare'] = test['Fare'].fillna(0)
    
    # create feature from cabin
    def contained_letter(value):
        for letter in string.ascii_lowercase:
            if letter.capitalize() in value:
                return letter.capitalize()
        return 'Z'
    train['Cabin'] = train['Cabin'].apply(contained_letter)
    test['Cabin'] = test['Cabin'].apply(contained_letter)
    
    # prepare one-hot-encoded columns
    df_train_dummy = pd.get_dummies(train.drop(['Survived','PassengerId'], axis=1))
    df_test_dummy = pd.get_dummies(test.drop(['PassengerId'], axis=1))
    
    # handle missing columns
    df_test_dummy = df_test_dummy.reindex(columns=df_train_dummy.columns, fill_value=0)
    df_train_dummy = df_train_dummy.reindex(columns=df_test_dummy.columns, fill_value=0)
    
    # age - impute
    imputer = KNNImputer(n_neighbors=hp_knn_neighbors, weights="distance") # "distance" might work better than "uniform"
    imputer.fit(df_train_dummy) # "Survived" is unknown for the test set and "PassengerId" is of no use for imputation
    # train
    train_dummy = pd.DataFrame(imputer.transform(df_train_dummy), columns=df_train_dummy.columns)
    # test
    test_dummy = pd.DataFrame(imputer.transform(df_test_dummy), columns=df_test_dummy.columns)
    
    # Preprocessing
    normalizer = Normalizer()
    # fit train
    train_dummy.loc[:,['Age','Fare']] = normalizer.fit_transform(train_dummy.loc[:,['Age','Fare']])
    # transform test
    test_dummy.loc[:,['Age','Fare']] = normalizer.transform(test_dummy.loc[:,['Age','Fare']])
    
    # Init GBC
    clf = GradientBoostingClassifier(n_estimators=hp_gbc_estimators, learning_rate=hp_gbc_learning_rate, max_depth=hp_gbc_max_depth, random_state=0)
    clf.fit(train_dummy, train['Survived'])
    
    # predict survival
    if predict:
        predictions = pd.DataFrame(clf.predict(test_dummy), columns=['Survived'])
        predictions.loc[:,'PassengerId'] = test.loc[:,'PassengerId']
        return predictions
    else:
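        # agreement with the gender-based example predictions (not the ground truth);
        # use the static copy loaded for this function instead of the earlier global
        return clf.score(test_dummy, gender_submission_static['Survived'])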

In [30]:
# init optimizer
from skopt.optimizer import Optimizer
import skopt.space as HP_dtypes

knn_neighbors = HP_dtypes.Integer(low=2, high=10)
gbc_estimators = HP_dtypes.Integer(low=2, high=100)
gbc_learning_rate = HP_dtypes.Real(low=1e-5, high=1)
gbc_max_depth = HP_dtypes.Integer(low=1, high=10)

opt = Optimizer(dimensions=[knn_neighbors,
                            gbc_estimators,
                            gbc_learning_rate,
                            gbc_max_depth],
                acq_func="EI")

In [31]:
# run loop
for i in range(0,50):
    # get hps
    hps = opt.ask()
    # run fnc
    score = opt_fnc(*hps)
    print("Iteration #{0} score: {1}".format(i, score))
    # update optimizer
    opt.tell(x=hps, y=(1-score)) # the optimizer minimizes, so pass (1-score) to maximize the score
    if score == 1:
        break

Iteration #0 score: 0.7990430622009569
Iteration #1 score: 0.8181818181818182
Iteration #2 score: 0.8564593301435407
Iteration #3 score: 0.7990430622009569
Iteration #4 score: 0.8516746411483254
Iteration #5 score: 0.8421052631578947
Iteration #6 score: 0.7870813397129187
Iteration #7 score: 0.8157894736842105
Iteration #8 score: 0.9282296650717703
Iteration #9 score: 0.9641148325358851
Iteration #10 score: 0.6363636363636364
Iteration #11 score: 0.8755980861244019
Iteration #12 score: 0.8277511961722488
Iteration #13 score: 0.8947368421052632
Iteration #14 score: 0.9186602870813397
Iteration #15 score: 0.937799043062201
Iteration #16 score: 0.6363636363636364
Iteration #17 score: 0.9880382775119617
Iteration #18 score: 0.8205741626794258
Iteration #19 score: 0.9880382775119617
Iteration #20 score: 0.6363636363636364
Iteration #21 score: 0.8995215311004785
Iteration #22 score: 0.992822966507177
Iteration #23 score: 0.9856459330143541
Iteration #24 score: 0.8229665071770335
Iteration #2

In [32]:
# retrieve best
best_hps = opt.get_result().x
print(best_hps)

[10, 4, 0.5313535171711488, 1]


In [33]:
# see the results for the best hyperparameters found (not just the last ones asked)
df_predicted = opt_fnc(*best_hps, predict=True)
df_predicted

,Survived,PassengerId
0,0,892
1,1,893
2,0,894
3,0,895
4,1,896
...,...,...
413,0,1305
414,1,1306
415,0,1307
416,0,1308


In [34]:
from sklearn.metrics import classification_report
print(classification_report(gender_submission['Survived'], df_predicted['Survived']))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       266
           1       1.00      1.00      1.00       152

    accuracy                           1.00       418
   macro avg       1.00      1.00      1.00       418
weighted avg       1.00      1.00      1.00       418



In [35]:
# save the result
df_predicted.to_csv("submission.csv", sep=",", index=False)