When we learn a new technology, we usually start with a Hello World version of it. I think Titanic can be the Hello World of machine learning. <br> <br>
You can find plenty of good blogs and tutorials about Titanic and ML on the internet. Even on Kaggle you can find very good kernels describing the whole ML solution. <br>
I have to say that I really like the concept of kernels. You can learn a lot of new ideas and solutions from other experienced people. For me, the biggest advantage of a kernel over a blog post is that I can see the whole code and, if I want, fork it, change it and run it.

Most of the kernels I have read present only the final solution, but an ML project goes through many cycles of development before it reaches a final version, if there ever is one. That's why I would like to show you a different way of writing a kernel. <br>

Titanic requires a classification model. I chose one of the simplest ones, DecisionTreeClassifier from sklearn. <br>
The entire kernel is built around three cycles of development. <br>
In the **first cycle** I will get familiar with the dataset, make some visualizations and see if a decision tree model can obtain better-than-random results. <br>
In the **second cycle** I will try different hyperparameters for the model, hoping to improve the score on the leaderboard. <br>
Based on the knowledge from the previous two cycles, in the **third cycle** I will add more features to train the model (in combination with different hyperparameters), again with the goal of obtaining a better score. <br>

I like to implement each cycle of development in three phases (similar to the lean startup concept):

1. ideas - assumptions about what can improve the score
2. implement - transform ideas into code
3. evaluate - evaluate the results of the model

Lean startup is about how to build a product. It's funny that I found it useful for creating this kernel. <br>
If you want to read about the basic principles of lean startup you can start [here](http://theleanstartup.com/principles) or you can read the entire [book](http://theleanstartup.com/book).

**Just for fun, I created a web application which shows your chance of surviving on the Titanic. <br>
You can play with it here : http://survivortitanic.com**






In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math

from matplotlib import pyplot as plt
%matplotlib inline

# 1. Cycle one


## 1.1 Ideas
The main goals of this cycle are to get familiar with the dataset, get insights from data visualizations and see if we can do better than random predictions using the simplest decision tree model possible.
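
To make "better than random" concrete, it helps to know what a trivial model would score. The sketch below uses a small made-up label array (an assumption, not the Titanic data) and computes the accuracy of always predicting the most frequent class. Any real model should beat that number.

```python
import numpy as np

# Hypothetical labels for illustration only: 0 = died, 1 = survived
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# Count how often each class occurs
class_counts = np.bincount(y)

# Always predicting the most frequent class gives this accuracy
majority_class = int(class_counts.argmax())
baseline_accuracy = class_counts.max() / len(y)

print("majority class:", majority_class)
print("baseline accuracy:", baseline_accuracy)
```

On the real data the same number could be read from `train_titanic["Survived"].value_counts(normalize=True).max()`.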

In [None]:
train_titanic = pd.read_csv("../input/train.csv")
test_titanic = pd.read_csv("../input/test.csv")

### A few stats about the datasets
#### training set

In [None]:
print("(# of rows, # of columns) " + str(train_titanic.shape))
train_titanic.describe(include="all")

The training set contains 891 examples in total. <br>
From the table above we can see that we are dealing with missing data in columns like Age and Cabin. We will handle the missing data below in the notebook.

#### test set

In [None]:
print("(# of rows, # of columns) " + str(test_titanic.shape))
test_titanic.describe(include="all")

The test set contains 418 examples in total. <br>
Here, we are also dealing with missing data in columns like age, fare and cabin.

### First visualizations
Let's see how balanced the classes of our target variable, Survived, are.

In [None]:
train_titanic.groupby(["Survived"])["Survived"].count().plot.bar()

From the bar chart above we can see that the number of people who survived is unfortunately lower than the number of those who died. <br>
0 = No, 1 = Yes
<br>
The main reasons for death were (info from http://www.eszlinger.com/titanic/titanfacts.html): <br>

* 2,208 lifeboat seats were needed and only 1,178 lifeboat seats were carried aboard.
* One of the first lifeboats to leave the Titanic carried only 28 people; it could have held 64 people.
* Very few people actually went down with the ship. Most died and drifted away in their life-jackets.
        
        

### Feature correlations

In [None]:
import seaborn as sns
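# Correlation is only defined for numeric columns, so select those explicitly
train_corr = train_titanic.select_dtypes(include="number").corr(method="spearman")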
plt.figure(figsize=(10,7))
sns.heatmap(train_corr, annot=True)

As we can see from the heatmap above, there is no strong correlation between the feature variables.
This is good news: it suggests that each feature can contribute its own information when predicting the target variable.

It is also good because our training dataset is small. When a small training set contains many correlated features, you effectively have fewer features to reflect reality, and the model is also more prone to overfitting.
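
Reading a heatmap by eye gets harder as the number of features grows. As a minimal sketch, the snippet below lists the feature pairs whose absolute Spearman correlation exceeds a threshold. The data is synthetic and the 0.8 threshold is an arbitrary assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: "b" is a monotone function of "a", "c" is unrelated
rng = np.random.RandomState(0)
a = rng.rand(200)
toy = pd.DataFrame({"a": a, "b": a ** 3, "c": rng.rand(200)})

corr = toy.corr(method="spearman").abs()

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose absolute Spearman correlation exceeds a (hypothetical) threshold
threshold = 0.8
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs > threshold]
print(strong_pairs)
```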

## 1.2 Implement

### Give it a try with Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

Let's start the easiest way possible. <br>
Start with the most relevant set of features that also contain no missing values in the training set. <br>
For me, the most relevant ones would be Pclass, Sex and Fare


In [None]:
print(train_titanic.shape) 
train_titanic[["Pclass", "Sex", "Fare"]].describe(include="all")

#### Feature visualization

#### Sex feature

In [None]:
train_titanic.groupby("Sex")["Sex"].count().plot.bar(title="Sex feature distribution")

# For the next charts I want to relate the features to the target variable, Survived.
# To do that I create two indicator columns, Alive and Not_alive.
train_titanic["Alive"] = train_titanic["Survived"]
train_titanic["Not_alive"] = 1 - train_titanic["Survived"]
# Select several columns with a list; tuple-style selection fails in recent pandas versions
train_titanic.groupby(["Sex"])[["Alive", "Not_alive"]].sum().plot.bar(title="Sex feature distribution related to target variable")

From the first chart we can see that there were almost twice as many males as females on the Titanic. From the second chart we can see that the chance of survival for a man was much lower than for a woman. <br>
Based on these charts, the phrase [Women and children first](https://en.wikipedia.org/wiki/Women_and_children_first) seems to be true.

#### Pclass feature

Based on the Kaggle description, Pclass represents the ticket class or, in other words, the socio-economic status.

In [None]:
train_titanic.groupby(["Pclass"])["Pclass"].count().plot.bar(title="Distribution of Pclass feature")
train_titanic.groupby(["Pclass"])[["Alive", "Not_alive"]].sum().plot.bar(title = "Pclass feature distribution related to target variable")

It seems that there were far more 'poor' people than 'rich' ones on the Titanic. <br>
The second chart shows that rich people (Pclass=1) had a better chance of surviving. In a way it reflects reality, because in general rich people are more influential.

#### Fare feature

The Fare feature represents the ticket price each passenger paid. <br>
Because fare is a continuous feature, I will create fare categories to visualize it better.

In [None]:
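bins = list(range(0, 110, 10))
bins.append(600)
# Label each bin with its right edge. pd.cut uses intervals of the form (left, right],
# so a fare of exactly 0 falls outside the first bin and gets NaN.
train_titanic["Fare_category"] = pd.cut(train_titanic["Fare"], bins=bins, labels=bins[1:])
train_titanic.groupby(["Fare_category"])[["Alive", "Not_alive"]].sum().plot.bar(figsize=(15,5))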

From the chart above, it seems that buying a more expensive ticket gave you a better chance of surviving. This visualization is consistent with the results from the Pclass visualizations.

In [None]:
train_titanic[["Fare"]].plot.box(vert=False, figsize=(15,5))
print ("Mean, median fare " + str(train_titanic["Fare"].mean()) + ", " + str(train_titanic["Fare"].median()))

Based on the boxplot above, we have some outliers in our training set. This can be due to data errors, or maybe some passengers simply paid a lot more than the majority.  <br>
Removing outliers from the training set is a common practice, but because the Titanic training set is so small, it is not such an easy decision.
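
Before deciding whether to drop anything, it can help to count how many points the boxplot actually flags. The sketch below applies the usual 1.5 x IQR rule (matplotlib's default whisker length) to a handful of made-up fares; the values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up fares for illustration only (not taken from the dataset)
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Values above the upper whisker are the points a boxplot draws individually
upper_whisker = q3 + 1.5 * iqr
outliers = fares[fares > upper_whisker]
print(outliers.tolist())
```

The same lines could be pointed at `train_titanic["Fare"]` to see how many rows would be affected.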

### Utility methods

In [None]:
import graphviz 
from sklearn.tree import export_graphviz

def plot_decision_tree(decision_tree, features_names) :
    dot_data = export_graphviz(decision_tree=decision_tree, out_file=None, feature_names=features_names)
    return graphviz.Source(dot_data)

def display_feature_importance(model, cols):
    featureImportance = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=True)
    featureImportance.plot(kind="barh")

def resume_wrong_predictions(train_titanic, predictions, groupby_col = ["Fare_category", "Sex", "Pclass"]) :
    wrong_predictions = train_titanic[train_titanic["Survived"] != predictions] \
        .groupby(groupby_col)['Survived'] \
        .agg(['count'])   
    training_predictions = train_titanic. \
        groupby(groupby_col)["Survived"]. \
        agg(["count"])  
    results = pd. \
        merge(wrong_predictions, training_predictions, left_index=True, right_index=True). \
        rename(columns={"count_x" : "wrong_label_count", "count_y" : "training_label_count"} )
    results["wrong_prediction_percentage"] = (100 * results["wrong_label_count"]) / results["training_label_count"]
    return results

def save_submition_file(filename, passengerId, predictions) :
    kaggle_test_submition = pd.DataFrame({"PassengerId":passengerId, "Survived":predictions})
    kaggle_test_submition.to_csv(filename, index=False)
    
from sklearn.impute import SimpleImputer

def feature_imputer_titanic(column, imputer_column, missing_values=np.nan, strategy="median"):
    # Fit the imputer on the training column, then fill the training and test columns with it
    imputer = SimpleImputer(missing_values=missing_values, strategy=strategy).fit(train_titanic[[column]])
    train_titanic[imputer_column] = imputer.transform(train_titanic[[column]]).ravel()
    test_titanic[imputer_column] = imputer.transform(test_titanic[[column]]).ravel()

### Implement first version of Decision Tree

The Sex feature is categorical, so we need to transform it into a numerical one for the DecisionTree model.

In [None]:
sexLabelEncoder = LabelEncoder()
sexLabelEncoder.fit(train_titanic["Sex"])
train_titanic["Sex_encoded"] = sexLabelEncoder.transform(train_titanic["Sex"])
test_titanic["Sex_encoded"] = sexLabelEncoder.transform(test_titanic["Sex"])

The Kaggle test set contains 418 examples in total, and the "Fare" feature has one missing value.  <br>
Most sklearn models expect all features to be numeric and to contain no missing values. Because of the missing fare value, the decision tree's predict method would fail. <br>
One way to handle missing values in sklearn is SimpleImputer, which fills in the missing values with the mean, median or most frequent value of the column.


In [None]:
# The imputer is fitted on the test set, where the single missing Fare value is found,
# and the same median is then applied to both sets.
fare_imputer = SimpleImputer(missing_values=np.nan, strategy="median").fit(test_titanic[["Fare"]])
test_titanic["Fare_median"] = fare_imputer.transform(test_titanic[["Fare"]]).ravel()
train_titanic["Fare_median"] = fare_imputer.transform(train_titanic[["Fare"]]).ravel()

Because the training set is so small, I would like to use all of it to train the model. I will check the performance of the model directly on the Kaggle test set (leaderboard). Luckily, we have 10 submissions per day.
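
If submissions ever run out, k-fold cross-validation is a common way to estimate accuracy locally while still using every training row. The sketch below is not what this kernel does. It only illustrates the idea on synthetic data generated with `make_classification`, which is an assumption standing in for the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (illustration only)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: train on 4 folds, score on the remaining one, 5 times
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```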

In [None]:
dt_col_cycle1_0 = ["Pclass", "Sex_encoded", "Fare_median"]
dt_model = DecisionTreeClassifier(random_state=1987)
dt_model.fit(train_titanic[dt_col_cycle1_0], train_titanic["Survived"])

## 1.3 Evaluate

Our first submission to Kaggle

In [None]:
dt_col_cycle1_0

In [None]:
test_titanic_predictions = dt_model.predict(test_titanic[dt_col_cycle1_0])
save_submition_file("dt_cycle_1_submission.csv", test_titanic["PassengerId"], test_titanic_predictions )

Yeeey, we obtained an accuracy score of 0.77033 on the Kaggle public leaderboard, which is a promising first step!

Below you can see the feature importances of the model.


In [None]:
display_feature_importance(dt_model, dt_col_cycle1_0)

Let's see where we get wrong predictions, using the predictions on the training set. <br>

In [None]:
resume_wrong_predictions(train_titanic, dt_model.predict(train_titanic[dt_col_cycle1_0]), groupby_col=["Survived"])

In [None]:
# I want to see the wrong predictions grouped by the features the model was trained on,
# also taking the feature importance into consideration
resume_wrong_predictions(train_titanic, dt_model.predict(train_titanic[dt_col_cycle1_0]), groupby_col=["Fare_category", "Sex", "Pclass"])

Based on the table above, I cannot see a clear pattern in where our model makes wrong predictions. This may be because surviving on the Titanic was also a matter of luck.  <br>

If you can see a pattern, I would be very happy if you left a comment.
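
Another compact view of the same errors is a confusion matrix, which splits the wrong predictions into false positives and false negatives. The sketch below uses made-up labels and predictions (assumptions for illustration only); on the real data the inputs would be `train_titanic["Survived"]` and the model's training predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the four cells in this order
tn, fp, fn, tp = cm.ravel()
print(cm)
print("false positives:", fp, "false negatives:", fn)
```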

All the predictions were made based on a single decision tree. Let's see how it looks !

In [None]:
plot_decision_tree(dt_model, dt_col_cycle1_0)

That was it for the first cycle. <br>
It's a pretty good score for the default decision tree hyperparameters and a small subset of features. <br>
Let's see if we can improve the accuracy by tuning some of the hyperparameters!

# 2. Cycle two


## 2.1 Ideas, assumptions
I suppose the score can be improved using the same set of features and only tuning the decision tree hyperparameters. <br>
In the first cycle we initialized the decision tree with its default parameters. This can produce over-complex trees which don't generalize well. <br>
The sklearn implementation of the decision tree offers multiple pruning options to avoid such problems:

* setting max_depth
* setting min_samples_leaf

You can find the docs for these parameters on the sklearn decision tree classifier page: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
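
The effect of an unrestricted tree can be seen by comparing the accuracy on the rows it was trained on with the accuracy on rows it has never seen. The sketch below does that on synthetic, deliberately noisy data (`make_classification` with `flip_y`, an assumption standing in for the Titanic set) for a default tree and for one limited to `max_depth=3`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data for illustration only (flip_y adds label noise)
X, y = make_classification(n_samples=600, n_features=8, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for max_depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy, actual depth of the fitted tree)
    results[max_depth] = (tree.score(X_train, y_train), tree.score(X_val, y_val), tree.get_depth())
    print(max_depth, results[max_depth])
```

A large gap between the first two numbers for a given setting is the usual sign that the tree has memorized the training rows.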
    


## 2.2 Implement

Let's play with different values for the max_depth hyperparameter <br>

Sklearn docs for max_depth <br>
*max_depth : int or None, optional (default=None) <br>
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.*

In [None]:
max_depth = 6
dt_depth_model = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth, random_state=1987)
dt_depth_model.fit(train_titanic[dt_col_cycle1_0], train_titanic["Survived"])

In [None]:
test_titanic_predictions = dt_depth_model.predict(test_titanic[dt_col_cycle1_0])
save_submition_file("dt_depth_6_submission.csv", test_titanic["PassengerId"], test_titanic_predictions)

The best Kaggle test set accuracy was obtained with max_depth 6! <br>
Below you can see the accuracy of the model trained with different max_depth values :  <br>
max_depth 4 -> 0.78468 <br>
max_depth 5 -> 0.78468 <br>
max_depth 6 -> 0.79425 <br>
max_depth 7 -> 0.77033 <br>

Try multiple values for min_samples_leaf <br>

Sklearn docs for min_samples_leaf : <br>
*min_samples_leaf : int, float, optional (default=1) <br>
The minimum number of samples required to be at a leaf node: <br>
    If int, then consider min_samples_leaf as the minimum number. <br>
    If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.*
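
As a small worked example of the float case, the snippet below applies the formula from the quoted docs to the 891 training rows. The fraction 0.02 is a hypothetical value chosen only for illustration.

```python
import math

n_samples = 891            # size of the training set reported earlier
min_samples_leaf = 0.02    # hypothetical fraction, chosen only for illustration

# Per the quoted docs: a float is converted with ceil(min_samples_leaf * n_samples)
min_leaf_size = math.ceil(min_samples_leaf * n_samples)
print(min_leaf_size)
```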




In [None]:
min_samples_leaf = 10
dt_leaf_model = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=min_samples_leaf, random_state=1987)
dt_leaf_model.fit(train_titanic[dt_col_cycle1_0], train_titanic["Survived"])

In [None]:
test_titanic_leaf_predictions = dt_leaf_model.predict(test_titanic[dt_col_cycle1_0])
save_submition_file("dt_min_samples_leaf_10_submission.csv", test_titanic["PassengerId"], test_titanic_leaf_predictions)

The best accuracy was obtained using min_samples_leaf = 10 <br>
Interestingly, the accuracy for min_samples_leaf = 10 is the same as for max_depth = 6 <br>
min_samples_leaf 5  -> 0.76076 <br>
min_samples_leaf 10 -> 0.79425 <br>
min_samples_leaf 15 -> 0.76555 <br>
min_samples_leaf 20 -> 0.77511 <br>
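
The values above were tried one at a time and scored on the leaderboard. A possible way to automate the search locally is `GridSearchCV`, which tries every combination and scores each one with cross-validation. The sketch below runs on synthetic stand-in data (an assumption), so its result says nothing about the Titanic scores.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustration only)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Candidate values mirror the ones tried by hand in this cycle
param_grid = {
    "max_depth": [4, 5, 6, 7],
    "min_samples_leaf": [5, 10, 15, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```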

Try the class_weight hyperparameter with value balanced.

*The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))*
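
To see what the quoted formula does, the sketch below evaluates it by hand on a toy label array (an assumption, not the Titanic labels) and compares the result with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels for illustration only: six examples of class 0, two of class 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y)

# The formula quoted above, written out by hand
manual_weights = len(y) / (len(classes) * np.bincount(y))

# The same weights as computed by scikit-learn
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual_weights)
print(sklearn_weights)
```

The rarer class gets the larger weight, so its misclassifications count more during training.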

In [None]:
dt_weigth_model = DecisionTreeClassifier(random_state=1987, min_samples_leaf=10, class_weight="balanced")
dt_weigth_model.fit(train_titanic[dt_col_cycle1_0], train_titanic["Survived"])

In [None]:
test_titanic_weight_predictions = dt_weigth_model.predict(test_titanic[dt_col_cycle1_0])
save_submition_file("dt_weight_balanced_submission.csv", test_titanic["PassengerId"], test_titanic_weight_predictions)

The accuracy for class_weight=balanced and min_samples_leaf = 10 was 0.75598 which is not an improvement.


## 2.3 Evaluate

In the first cycle we obtained accuracy = 0.77033. <br>
Just by tuning the hyperparameters we improved the accuracy to 0.79425, using min_samples_leaf = 10.

Let's visualize which features are the most influential for our best decision tree model, the one using min_samples_leaf = 10. <br>
As you can see, there are some differences compared with the earlier feature importance plot.

In [None]:
display_feature_importance(dt_leaf_model, dt_col_cycle1_0)

In [None]:
training_predictions = dt_leaf_model.predict(train_titanic[dt_col_cycle1_0])
print("train accuracy : " + str(accuracy_score(train_titanic["Survived"], training_predictions)))

In [None]:
# I want to see which class has more wrong predictions
resume_wrong_predictions(train_titanic, training_predictions, groupby_col=["Survived"])

Something interesting is happening. <br>
In the first cycle we had a total of 85 wrong **training predictions**, compared with 146 wrong predictions in cycle 2, even though the accuracy on the Kaggle test set is higher for cycle 2. Do you have any guesses why? :)





In [None]:
# I want to see the wrong predictions grouped by the features the model was trained on,
# also taking the feature importance into consideration
resume_wrong_predictions(train_titanic, training_predictions, groupby_col=["Sex", "Fare_category", "Pclass"])

In [None]:
plot_decision_tree(dt_leaf_model, dt_col_cycle1_0)


That was it for the second cycle. <br>
It seems that our assumption was a good one: trying different hyperparameter values gave us a better model.

# 3. Cycle three

## 3.1 Ideas, assumptions
Until now we used a limited set of features : Pclass, Fare and Sex. <br> 
Let's see if we can improve the accuracy if we add more features to the training process.

## 3.2 Implement

In [None]:
# Stats about all possible features from training set
train_titanic.describe(include="all")

Let's investigate the Age feature

In [None]:
# Select several columns with a list; tuple-style selection fails in recent pandas versions
train_titanic.groupby(["Age"])[["Alive", "Not_alive"]]. \
    sum(). \
    plot.bar(title="Age vs survived histogram", figsize=(20,5))

train_titanic[train_titanic["Sex"] == "female"]. \
    groupby(["Age"])[["Alive", "Not_alive"]]. \
    sum(). \
    plot.bar(title = "Female age vs survived histogram", figsize=(20,5))

train_titanic[train_titanic["Sex"] == "male"]. \
    groupby(["Age"])[["Alive", "Not_alive"]]. \
    sum(). \
    plot.bar(title = "Male age vs survived histogram", figsize=(20,5))

What is clear from the plots above is that females had a better chance of surviving than males, but we already knew that. <br>
There is no clear pattern between age and survival. Let's hope that the ML model will find a better relationship in combination with the other features. <br> <br>
Because the Age column has missing values, let's fill them with the median, mean or most frequent age value.


In [None]:
feature_imputer_titanic("Age", "Age_median", strategy="median")
feature_imputer_titanic("Age", "Age_mean", strategy="mean")
feature_imputer_titanic("Age", "Age_most_frequent", strategy="most_frequent")

Let's see how our new age vs survived histograms look after we fill in the missing values.

In [None]:
# Select several columns with a list; tuple-style selection fails in recent pandas versions
train_titanic.groupby(["Age_median"])[["Alive", "Not_alive"]]. \
    sum(). \
    plot.bar(title="Age_median vs survived histogram", figsize=(20,5))

train_titanic.groupby(["Age_mean"])[["Alive", "Not_alive"]]. \
    sum(). \
    plot.bar(title="Age_mean vs survived histogram", figsize=(20,5))

train_titanic.groupby(["Age_most_frequent"])[["Alive", "Not_alive"]]. \
    sum(). \
    plot.bar(title="Age_most_frequent vs survived histogram", figsize=(20,5))

The simplest and most commonly recommended way is to fill the missing values with the median. Let's see which strategy gets the best score.

In [None]:
dt_col_cycle3_0 = ['Pclass', 'Sex_encoded', 'Fare_median', 'Age_mean']

In [None]:
dt_features_model = DecisionTreeClassifier(random_state=1987, min_samples_leaf=10)
dt_features_model.fit(train_titanic[dt_col_cycle3_0], train_titanic["Survived"])

In [None]:
save_submition_file("dt_age_mean_feature_submission.csv", \
                    test_titanic["PassengerId"], dt_features_model.predict(test_titanic[dt_col_cycle3_0]))

Below are the scores I obtained using the mean, median and most frequent imputer strategies to fill in the missing age values. <br>
Surprisingly, the best score was obtained using the mean <br>
Age mean = 0.76076 <br>
Age median = 0.74641 <br>
Age most frequent = 0.74641 <br>

In [None]:
display_feature_importance(dt_features_model, dt_col_cycle3_0)

Above we calculated the global mean age and used it to fill the missing values. Maybe a better idea is to calculate the mean age based on groups of people. Below I will calculate the mean age for people from the same Pclass group and use this value to fill the missing values.

In [None]:
# Fill each missing age with the mean age of the passenger's Pclass group.
# transform("mean") returns one value per row, aligned with the original index.
train_titanic["Age_mean_window"] = train_titanic["Age"].fillna(
    train_titanic.groupby("Pclass")["Age"].transform("mean"))

# The test set is filled with its own per-class means
test_titanic["Age_mean_window"] = test_titanic["Age"].fillna(
    test_titanic.groupby("Pclass")["Age"].transform("mean"))
 

In [None]:
train_titanic.groupby(["Age_mean_window"])[["Alive", "Not_alive"]]. \
    sum(). \
    plot.bar(title="Age_mean_window vs survived histogram", figsize=(20,5))


In [None]:
dt_col_cycle3_1 = ['Pclass', 'Sex_encoded', 'Fare_median', 'Age_mean_window']
dt_features_model = DecisionTreeClassifier(random_state=1987, max_depth = 5, min_samples_leaf=20)
dt_features_model.fit(train_titanic[dt_col_cycle3_1], train_titanic["Survived"])

In [None]:
save_submition_file("dt_age_mean_window_submission_0.csv", \
                    test_titanic["PassengerId"], dt_features_model.predict(test_titanic[dt_col_cycle3_1]))

After using the mean age based on Pclass groups and trying many hyperparameter combinations, the best score obtained was 0.79904.

In [None]:
display_feature_importance(dt_features_model, dt_col_cycle3_1)

In [None]:
resume_wrong_predictions(train_titanic, dt_features_model.predict(train_titanic[dt_col_cycle3_1]), groupby_col=["Survived"])

Maybe most of us would think that the Name column is not useful as a source of new features. I was one of them. <br>
Reading more kernels about Titanic, I found out that the Name column contains the title of the person (Mr, Miss, Master, etc.) and that this information helps to improve the score.

In [None]:
train_titanic.Name[:5]

In [None]:
train_titanic["Title"] = train_titanic.apply(lambda row : (row.Name.split(",")[1].split(" ")[1]), axis=1)
test_titanic["Title"] = test_titanic.apply(lambda row : (row.Name.split(",")[1].split(" ")[1]), axis=1)
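
The split-based extraction above keeps the trailing period (for example "Mr.") and only the first word of a multi-word title. A possible alternative, shown here only as a sketch, is a regular expression with `str.extract`. The names below are made up, in the same "Surname, Title. Given names" layout as the dataset.

```python
import pandas as pd

# Made-up names in the same "Surname, Title. Given names" layout (illustration only)
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Miss. Jane",
    "Poe, the Countess. of Somewhere (Anna Maria)",
])

# Capture everything between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```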

In [None]:
train_titanic.groupby(["Title"])[["Title"]].count()

In [None]:
# Encode the Title values
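# Fit on the titles from both sets so that titles seen only in the test set can be encoded too
titleEncoding = LabelEncoder().fit(pd.concat([train_titanic["Title"], test_titanic["Title"]]))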
train_titanic["Title_encoded"] = titleEncoding.transform(train_titanic["Title"])
test_titanic["Title_encoded"] = titleEncoding.transform(test_titanic["Title"])

In [None]:
dt_col_cycle3_3 = ['Pclass', 'Sex_encoded', 'Fare_median', 'Age_mean_window', 'SibSp', 'Parch', "Title_encoded"]
dt_features_model = DecisionTreeClassifier(criterion="entropy", random_state=1987, max_depth = 5, min_samples_leaf=20)
# dt_features_model = DecisionTreeClassifier(criterion="entropy", min_impurity_decrease = 0.009, random_state=1987)
dt_features_model.fit(train_titanic[dt_col_cycle3_3], train_titanic["Survived"])

In [None]:
save_submition_file("dt_more_features_submission_0.csv", \
                    test_titanic["PassengerId"], dt_features_model.predict(test_titanic[dt_col_cycle3_3]))

The score is 0.81339. Wow !!! <br>
At the time of writing this kernel, this score puts us in the top 6%! <br>
My place on the leaderboard is 497/9547, even though the person in 333rd place has the same score ;)

## 3.3 Evaluate

In [None]:
# See where the model makes most wrong predictions on training set.
resume_wrong_predictions(train_titanic, dt_features_model.predict(train_titanic[dt_col_cycle3_3]), groupby_col=["Survived"])

In [None]:
resume_wrong_predictions(train_titanic, dt_features_model.predict(train_titanic[dt_col_cycle3_3]), groupby_col=["Sex", "Pclass", "Age_mean_window"])

In [None]:
display_feature_importance(dt_features_model, dt_col_cycle3_3)

In [None]:
# This is how the best decision tree looks.
plot_decision_tree(dt_features_model, dt_col_cycle3_3)

I hope you enjoyed reading this kernel ! <br>
It's awesome that, using one of the simplest machine learning algorithms, we managed to get into the top 6% of the Titanic leaderboard. <br>
If you have any suggestions for other cycles, I would be happy if you left a comment !
