# Titanic: Machine Learning from Disaster

Jonathan Lices Martín

<hr>

This notebook is a compilation of what I've learned during my Python training. Some of the functions I use here are not entirely my own, so I want to thank everyone on Kaggle for the help, the functions and the model templates :)

<hr>

Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history's deadliest peacetime commercial marine disasters ([Wikipedia](https://en.wikipedia.org/wiki/Titanic)). In this notebook I'll try to predict whether a passenger survived the disaster from the characteristics available in the dataset, using a few different models.

## FIRST PART - EDA & DATA PREPROCESS

The first thing we have to do is load the main libraries for a good exploratory data analysis.

In [None]:
# Principal libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
sns.set()

In [None]:
# Loading the dataset

data = pd.read_csv("../input/titanic/train.csv")
data_raw = data.copy() #Just in case

data.head()

### Exploratory Data Analysis

So, let's analyse the data. Before going into detail, it's always good to have a general idea of what we're going to work with. We have to understand the problem and the data, so printing the column names and getting a brief description of each feature will help. There are also categorical features that we'll have to deal with later and, of course, some non-relevant information. We can expect missing data too, so let's take a first look at it.

In [None]:
# Column names

print("The column names are:", data.columns)

In [None]:
# First look at the missing data

total = data.isnull().sum().sort_values(ascending = False)
# Share of missing values per column (a fraction between 0 and 1)
percentage = (data.isnull().sum()/data.isnull().count()).sort_values(ascending = False)

missing_data = pd.concat([total, percentage], axis = 1, keys = ["Total", "Percentage"])
missing_data

Now we have a brief summary of the data. There are some missing values in Embarked, Age and Cabin features, so we'll have to deal with them. There are many different approaches to deal with missing values, so we'll decide later which one is better for us. Anyway, we can go into more detail from here.
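As a small aside, the same kind of summary can be built on a toy frame, which makes it easier to check by hand. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() gives a boolean frame: sum() counts the gaps, mean() gives their share
missing_summary = pd.DataFrame({
    "Total": toy.isna().sum(),
    "Percentage": toy.isna().mean() * 100,
}).sort_values("Total", ascending=False)

print(missing_summary)
```

Here `isna().mean()` is just a shortcut for dividing the number of gaps by the number of rows.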

One of the first questions we have to answer is: how many people survived the disaster? Let's make a quick visualization.

In [None]:
# How many people survived plot

fig, ax = plt.subplots(1, 2, figsize = (15,5))
sns.countplot(x = "Survived", data = data, ax = ax[0])
ax[0].set_title("How many people survived?")
ax[0].set_ylabel("Count")
sns.countplot(x = "Sex", hue = "Survived", data = data, ax = ax[1])
ax[1].set_title("Survived by Sex")
ax[1].set_ylabel("Count")

plt.show()

As we can see, not many people survived the disaster. Furthermore, there were more men than women on the ship, but more women survived. This is interesting, because we can build our model on this kind of feature. Let's continue exploring the data!

Looking at the data head we printed before, the next feature worth a closer look is Pclass.

In [None]:
# Pclass analysis

fig, ax = plt.subplots(1, 2, figsize = (15,5))
sns.countplot(x = "Pclass", data = data, ax = ax[0])
ax[0].set_title("Pclass Analysis")
ax[0].set_ylabel("Count")
sns.barplot(x = "Pclass", y = "Survived", data = data, ax = ax[1])
ax[1].set_title("Survived by Pclass")
ax[1].set_ylabel("Survival rate")

plt.show()

As expected, people who travelled in FirstClass survived at a higher rate than those in ThirdClass. Over 50% of the passengers travelled in ThirdClass, while about 25% travelled in FirstClass. But let's go a bit deeper.

In [None]:
# Crosstab 

pd.crosstab(data["Pclass"], data["Survived"], margins = True)

In [None]:
# Pivot Table

data.pivot_table("Survived", index = "Sex", columns = "Pclass")

Now we can see it better. Of the 491 people who travelled in ThirdClass, just 119 survived, while of the 216 who travelled in FirstClass, just 80 died. Moreover, women who travelled in FirstClass have a survival rate of 0.968 (that is, 96.8%): only 3 of the 94 first-class women in this dataset died in the accident. We can visualize this.
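To make the comparison explicit, the counts quoted above can be turned into survival rates with a couple of divisions. This is just a quick arithmetic sketch using the numbers from the crosstab; the variable names are mine.

```python
# Counts quoted in the text above (training set)
third_total, third_survived = 491, 119
first_total, first_died = 216, 80

# Survival rate = survivors / passengers in that class
third_rate = third_survived / third_total
first_rate = (first_total - first_died) / first_total

print(f"ThirdClass survival rate: {third_rate:.1%}")
print(f"FirstClass survival rate: {first_rate:.1%}")
```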

In [None]:
# Survived by Sex and Pclass

fig, ax = plt.subplots(1, 2, figsize = (15,5))
sns.countplot(x = "Pclass", hue = "Survived", data = data, ax = ax[0])
ax[0].set_title("Pclass Analysis")
ax[0].set_ylabel("Count")
sns.countplot(x = "Sex", hue = "Pclass", data = data, ax = ax[1])
ax[1].set_title("Sex by Pclass")
ax[1].set_ylabel("Count")

plt.show()

In [None]:
# Crosstab 

pd.crosstab([data["Survived"], data["Sex"]], data["Pclass"], margins = True)

Now we can clearly see that it was better to travel in FirstClass, as we expected ^^

The passengers' names may not look relevant for our objective; we'll deal with them later. For now, the next feature to look at is the age. Did being young help in order to survive?

In [None]:
# Let's try to get some extra info about the age

data["Age"].describe()

So the oldest person on the ship was 80 years old, and the youngest one... 0.42 years? That is simply a baby of about five months, so it's not a problem for us.

In [None]:
# Violin and Box Plots

fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sns.boxplot(x = "Sex", y = "Age", hue = "Survived", data = data, ax = ax[0])
ax[0].set_title("Box Plot")
sns.violinplot(x = "Sex", y = "Age", hue = "Survived", data = data, split = True, ax = ax[1])
ax[1].set_title("Violin Plot")

plt.show()

In [None]:
# Violin plot for Age, Pclass and Survived

ax = sns.violinplot(x = "Pclass", y = "Age", hue = "Survived", split = True, data = data)
ax.set_title("Pclass and Age survival")
plt.show()

The plots suggest that people between 20 and 40 survived more, which makes age another important feature for building a model. Remember that we had 177 missing values in Age. There are many ways to deal with this problem, but I've learned a really clever one (many thanks to the Kaggle user ash316; see his notebook for this approach: [EDA To Prediction(DieTanic)](https://www.kaggle.com/ash316/eda-to-prediction-dietanic#Part3:-Predictive-Modeling)). We could fill the blanks with the overall mean age, but that could cause problems, such as assigning an adult's age to a small child. The solution is in the name (this is why it is important to keep features that don't look interesting at first). The names include salutations (Mr, Mrs, Miss, Master...), which give us a rough idea of each person's age.

In [None]:
# Extract the salutations (THANKS TO ash316)

# Capture the word that ends with a dot ("Mr.", "Miss.", ...); no loop is needed
data["Initial"] = data["Name"].str.extract(r"([A-Za-z]+)\.", expand = False)
    
data.head()

In [None]:
# Extract all the salutations

print(data["Initial"].unique())
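To see what the regular expression in the extraction cell actually captures, here is a minimal sketch with Python's built-in `re` module. The helper `extract_title` and the list `sample_names` are my own illustrative names; the sample strings follow the "Surname, Title. First names" layout of the Name column.

```python
import re

# Same pattern as in the extraction cell: letters followed by a literal dot
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")

def extract_title(name):
    """Return the first word that ends with a dot, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

# Sample names in the "Surname, Title. First names" layout
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
]
titles = [extract_title(n) for n in sample_names]
print(titles)
```

The surname is skipped because it is followed by a comma, not a dot, so the first match is always the salutation.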

In [None]:
# Now we can replace them

# Map the rare salutations onto the common ones
title_map = {
    "Mlle": "Miss", "Mme": "Miss", "Ms": "Miss",
    "Dr": "Mr", "Major": "Mr", "Capt": "Mr", "Sir": "Mr", "Don": "Mr",
    "Lady": "Mrs", "Countess": "Mrs",
    "Jonkheer": "Other", "Col": "Other", "Rev": "Other",
}
data["Initial"] = data["Initial"].replace(title_map)

data.groupby("Initial")["Age"].mean()

In [None]:
# Fill the missing ages with the rounded mean age of each salutation group

data.loc[(data["Age"].isnull())&(data["Initial"]=="Mr"), "Age"] = 33
data.loc[(data["Age"].isnull())&(data["Initial"]=="Miss"), "Age"] = 22
data.loc[(data["Age"].isnull())&(data["Initial"]=="Master"), "Age"] = 5
data.loc[(data["Age"].isnull())&(data["Initial"]=="Mrs"), "Age"] = 36
data.loc[(data["Age"].isnull())&(data["Initial"]=="Other"), "Age"] = 46
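The cell above hard-codes the ages. A possible alternative, shown here only as a sketch and not as the method used in this notebook, is to let pandas compute each group's mean and fill the gaps with it. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Toy frame with a salutation column and some missing ages (invented values)
toy = pd.DataFrame({
    "Initial": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master", "Master"],
    "Age": [30.0, 36.0, None, 20.0, None, 4.0, None],
})

# Mean age per salutation, broadcast back to every row of that group
group_mean = toy.groupby("Initial")["Age"].transform("mean")

# Keep the known ages and fill only the gaps with the rounded group mean
toy["Age"] = toy["Age"].fillna(group_mean.round())
print(toy)
```

This way the filled values follow the data automatically if the salutation mapping changes.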

In [None]:
# Take another look at the missing data

total = data.isnull().sum().sort_values(ascending = False)
percentage = (data.isnull().sum()/data.isnull().count()).sort_values(ascending = False)

missing_data = pd.concat([total, percentage], axis = 1, keys = ["Total", "Percentage"])
missing_data

Wow! That was a clever solution by ash316: the missing ages are solved! Let's take a look at some other features. Next in order of importance, the port where passengers embarked could matter.

In [None]:
# Plot Embarked and Survival

fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sns.countplot(x = "Embarked", hue = "Survived", data = data, ax = ax[0])
ax[0].set_title("Embarked and survived")
ax[0].set_ylabel("Count")
sns.countplot(x = "Embarked", hue = "Sex", data = data, ax = ax[1])
ax[1].set_title("Embarked by Sex")
ax[1].set_ylabel("Count")

plt.show()

So we can see that people who embarked at C survived at a higher rate than those from the other ports. It could be interesting to see whether those people travelled in FirstClass.

In [None]:
# Crosstab

pd.crosstab([data["Survived"], data["Embarked"]], data["Pclass"], margins = True)

Ok, the table doesn't show the clear relationship we were looking for, but at least we explored it a little more!

Remember that we had 2 missing values in Embarked. We can't fill these gaps with the mean, because Embarked is a categorical feature and doesn't behave like a number. Most people embarked at S, so let's fill the gaps with S, the most frequent value.

In [None]:
# Filling missing values

data["Embarked"] = data["Embarked"].fillna("S")
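Instead of typing the most frequent port by hand, it can also be looked up with `mode()`. The following is a minimal sketch on a made-up Series (`ports` is an illustrative name, not a column of the real data).

```python
import pandas as pd

# Made-up port column with two gaps
ports = pd.Series(["S", "C", "S", None, "Q", "S", None], name="Embarked")

# mode() ignores missing values and returns the most frequent value(s)
most_common = ports.mode()[0]

# Fill the gaps with that value
filled = ports.fillna(most_common)
print(most_common, filled.tolist())
```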

In [None]:
# Take another look at the missing data

total = data.isnull().sum().sort_values(ascending = False)
percentage = (data.isnull().sum()/data.isnull().count()).sort_values(ascending = False)

missing_data = pd.concat([total, percentage], axis = 1, keys = ["Total", "Percentage"])
missing_data

Another important feature seems to be SibSp, the number of siblings and spouses a passenger had aboard, which tells us whether the person travelled alone or with family. Let's take a look at it.

In [None]:
# SibSp plot

ax = sns.barplot(x = "SibSp", y = "Survived", data = data)
ax.set_title("SibSp and Survived")

plt.show()

We can see that passengers with 5 or more siblings/spouses aboard had a 0% survival rate. This is very interesting. Let's try to figure out why!

In [None]:
# SibSp plot with Pclass

ax = sns.countplot(x = "SibSp", hue = "Pclass", data = data)
ax.set_title("Pclass with SibSp")
ax.set_ylabel("Count")

plt.show()

As we can see, passengers with 5 or more siblings/spouses aboard were all in ThirdClass, which may explain their survival rate.

Another feature that seems to be important is the fare, maybe because of its relation with Pclass.

In [None]:
# A brief summary of Fare

data["Fare"].describe()

Well, the minimum value is 0! I think I wouldn't have taken this trip even for free ^^'

Last but not least, we have 687 missing values in Cabin. We'll have to deal with this. But first, let's see the correlation between the variables.

In [None]:
# Correlation Plot

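# Only numeric columns can be correlated, so the text columns are skipped explicitly
sns.heatmap(data.corr(numeric_only = True), annot = True, linewidths = 0.1)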
plt.show()

### Data Preprocess

Now we're ready to preprocess the data in order to build our models. We'll start by dropping the features that are not relevant for them.

In [None]:
# Removing non-relevant features

non_relevant_f = ["PassengerId", "Cabin", "Name", "Ticket", "Initial"]
data = data.drop(non_relevant_f, axis = 1)

data.head()

In [None]:
# Split the data into features (X) and target (y)

X = data.iloc[:, 1:].values  # Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
y = data.iloc[:, 0].values   # Survived

In [None]:
# Encoding categorical features

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

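# Integer-encode the two text columns in place
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1]) # "Sex"
labelencoder_X_2 = LabelEncoder()
X[:, 6] = labelencoder_X_2.fit_transform(X[:, 6]) # "Embarked" (kept as integer codes)

# One-hot encode column 1 ("Sex"); the other columns pass through unchanged
transformer = ColumnTransformer(
    transformers=[
        ("Titanic",
        OneHotEncoder(categories="auto"),
        [1]
        )
    ], remainder="passthrough"
)
X = transformer.fit_transform(X)
# Drop the first dummy column to avoid the dummy variable trap
X = X[:, 1:]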
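For comparison, a different and more compact way to encode categorical columns is `pd.get_dummies`. This is not what the cell above does; it is only a sketch of an alternative on a made-up frame (`toy` and its values are illustrative), where both Sex and Embarked get one-hot columns.

```python
import pandas as pd

# Made-up frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.75],
})

# One indicator column per category; drop_first removes the redundant first one
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(encoded)
```

With `drop_first=True`, "female" and "C" become the reference categories, so a row with all indicator columns at zero means a woman who embarked at C.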

In [None]:
# Last but not least: split into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.2,
                                                   random_state = 42)

In [None]:
# It's important to scale the features: fit the scaler on the training set only, then reuse it on the test set

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
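What the scaler does can be reproduced by hand with NumPy: subtract the training mean and divide by the training standard deviation, then apply the very same numbers to the test set. This is a sketch on made-up arrays; as far as I know, `StandardScaler` uses the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np

# Made-up feature matrices: 3 training rows, 1 test row, 2 features
train = np.array([[20.0, 7.0], [30.0, 8.0], [40.0, 72.0]])
test = np.array([[30.0, 29.0]])

# Statistics are learned from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

Using only the training statistics keeps information from the test set out of the preprocessing.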

## SECOND PART - PREDICTION MODELS

There are a lot of different ways to approach a classification problem. Here, we'll see some of them. The first one, a classic, is the logistic regression.

In [None]:
# Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
prediction_lr = model_lr.predict(X_test)

print("The accuracy of the Logistic Regression is:", metrics.accuracy_score(y_test, prediction_lr))

Another classic model is the Random Forest.

In [None]:
# Random Forests

from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=100)
model_rf.fit(X_train, y_train)
prediction_rf = model_rf.predict(X_test)

print("The accuracy of the Random Forests Classifier is:", metrics.accuracy_score(y_test, prediction_rf))

Let's see what happens if we try it with more complex models like Light GBM.

In [None]:
# LightGBM

import lightgbm as lgb
from sklearn.metrics import accuracy_score

training_data = lgb.Dataset(data = X_train, label = y_train)
params = {
    'num_leaves': 31,
    'objective': 'binary',
    'metric': ['auc', 'binary_logloss'],
}
# 100 boosting rounds ('num_trees' is an alias of this argument, so it is set in one place only)
classifier = lgb.train(params = params,
                      train_set = training_data,
                      num_boost_round = 100)

# Turn the predicted probabilities into class labels with a 0.5 threshold
prob_pred = classifier.predict(X_test)
y_pred = (prob_pred >= 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred) * 100
print("Accuracy: {:.0f} %".format(accuracy))
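The last step of the cell above, going from probabilities to labels and then to an accuracy, can be checked on a tiny example. This is a sketch with made-up numbers; `prob_pred` and `y_true` here are illustrative arrays, not the model's output.

```python
import numpy as np

# Made-up predicted probabilities and true labels
prob_pred = np.array([0.91, 0.12, 0.50, 0.49, 0.77])
y_true = np.array([1, 0, 1, 1, 1])

# Probabilities at or above 0.5 become class 1, the rest class 0
y_pred = (prob_pred >= 0.5).astype(int)

# Accuracy is the share of predictions that match the true labels
accuracy = (y_pred == y_true).mean()
print(y_pred, accuracy)
```

Only the fourth passenger is misclassified here, because 0.49 falls just below the threshold.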

We've tried enough models for today. We reached an accuracy of 84%, which is good, but it could still be improved.

Thanks to everyone for having a look at this notebook. I hope you liked it and found it useful! :)