# Titanic [EDA + Model Pipeline]

#### This kernel provides an exploratory analysis of the data to uncover its underlying structure and to better understand which modelling strategies to use for machine learning. If this notebook happens to stand out in any way, please consider **upvoting** it, as it motivates me to create better notebooks.



## Table of Contents:
1. Introduction
2. Import Libraries
3. Extraction of Basic Statistics
4. Extraction of Detailed Statistics
5. Preprocessing Data
6. Modelling Using Keras
7. Plotting Model's Performance
8. Conclusion

## 1. Introduction (The Challenge Description)

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

## 2. Import Libraries:

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import matplotlib as mpl

from matplotlib import rcParams
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dropout, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

np.random.seed(0)

# Plots the figure in the kernel rather than opening a window or tab.
%matplotlib inline

# Set the universal size for figure
rcParams['figure.figsize'] = (10, 8)
plt.style.use("ggplot")
mpl.rc("savefig", dpi = 200)

In [None]:
train_df = pd.read_csv("../input/titanic/train.csv")
test_df  = pd.read_csv("../input/titanic/test.csv")

## 3. Extraction of Basic Statistics:

Here we first look at the dataset's basic information, which tells us the data type of each column and which columns have null values that may require cleaning at a later stage. Then we use the describe method on both DataFrames to see how the data is distributed in the numerical columns. This helps us decide which numerical columns will need normalization or scaling.
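
As a small illustration of what these checks report, here is a sketch on a made-up three-row frame. The frame `toy` and its values are hypothetical and only borrow a few column names from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
    "Fare":  [7.25, 71.28, 7.92],
})

# Number of null cells per column, the same information info() hints at.
null_counts = toy.isnull().sum()
print(null_counts)

# describe() summarises only the numerical columns by default.
print(toy.describe())
```

Counting nulls per column like this makes it easy to see which columns need imputation before modelling.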

In [None]:
print("[+] Basic Information on Training Dataset: \n")
print(train_df.info())

print('')
print("[+] Basic Information on Testing Dataset: \n")
print(test_df.info())

In [None]:
print("[+] Basic Statistics on Training DataFrame: \n")
print(train_df.describe())

print('')
print("[+] Basic Statistics on Testing DataFrame: \n")
print(test_df.describe())

In [None]:
print("[+] First Five Rows of Training DataFrame:")
print("##########################################\n")

print(train_df.head(5))
print("")


print("[+] First Five Rows of Testing DataFrame:")
print("##########################################\n")

print(test_df.head(5))

## 4. Extraction of Detailed Statistics

Here we begin with a teardown of each individual column and study how the data is distributed across its categories by plotting histograms and bar charts. In later stages we study the relationships between these columns. If any of them turn out to be correlated, we can decorrelate them with the techniques at our disposal.

### [+] Study of Important Features

1. The Study of the "Name" Column
2. The Study of the "Pclass" Column
3. The Study of the "Sex" Column
4. The Study of the "Age" and "Fare" Columns

### 1. The Study of the "Name" Column

Taking a peek at this column reveals that a lot of information can be extracted from it. As this column contains the names of the people aboard the Titanic, we will first explore how these people are distributed according to their titles. We will use the regular expression library to go over every record, extract the title into a separate column named "Title", and count the occurrences of each unique title in this dataset.
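
To make the extraction step concrete, here is a minimal sketch that applies the same pattern to a few sample strings written in the "Surname, Title. Given names" format of the Name column. The list `sample_names` is an assumption made for illustration, not a slice of the actual DataFrame.

```python
import re

# Sample strings in the "Surname, Title. Given names" format (illustrative only).
sample_names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

# A space, then one capitalised word, then a literal period.
title_pattern = re.compile(r" ([A-Z][a-z]+)\.")

titles = [title_pattern.search(name).group(1) for name in sample_names]
print(titles)
```

Note that the pattern captures a single capitalised word, so a name containing "the Countess." yields "Countess" rather than "the Countess".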

In [None]:
print("[+] The Name Column of Training Dataset:")
print("#######################################\n")

print(train_df['Name'].head(5))

In [None]:
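# Extract the title: the capitalised word that is followed by a period (raw string avoids escape warnings)
train_df['Title'] = train_df["Name"].apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))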



_ = plt.figure(figsize = (12, 5))
_ = plt.xlabel("Title", fontsize = 16)
_ = plt.ylabel("Count", fontsize = 16)
_ = plt.title("Occurrences of Titles", fontsize = 20)
_ = plt.xticks(rotation = 90)

sns.countplot(x = 'Title', data = train_df, palette = "Blues_d")

plt.show()

# Repeat the same procedure for testing dataset
test_df['Title'] = test_df["Name"].apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))

In [None]:
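# Group the rare titles into broader categories.
# The regular expression above captures "Countess", not "the Countess".
Title_Dictionary = {
        "Capt":       "Officer",
        "Col":        "Officer",
        "Major":      "Officer",
        "Dr":         "Officer",
        "Rev":        "Officer",
        "Jonkheer":   "Royalty",
        "Don":        "Royalty",
        "Sir" :       "Royalty",
        "Countess":   "Royalty",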
        "Dona":       "Royalty",
        "Lady" :      "Royalty",
        "Mme":        "Mrs",
        "Ms":         "Mrs",
        "Mrs" :       "Mrs",
        "Mlle":       "Miss",
        "Miss" :      "Miss",
        "Mr" :        "Mr",
        "Master" :    "Master"
                   }

train_df['Title'] = train_df['Title'].map(Title_Dictionary)
test_df['Title'] = test_df['Title'].map(Title_Dictionary)

### 2. The Study of the "Pclass" Column

The column "Pclass" represents the class of the passengers aboard the Titanic, much like economy class and first class in modern aviation. After exploration, we can see that there are three categories named 1, 2, and 3. We can use seaborn's countplot method to visualize the distribution at a quick glance.

In [None]:
print("[+] First 5 Rows of the  'Pclass' column: ")
print("########################################\n")
print(train_df['Pclass'].head())
print("")

print("[+] Count of Categories in Tabular Format: ")
print("##########################################\n")
print(train_df['Pclass'].value_counts())

_ = plt.figure(figsize = (12, 5))
_ = plt.xlabel("Pclass", fontsize = 15)
_ = plt.ylabel("Count", fontsize = 15)
_ = plt.title("Occurrences of Pclass", fontsize = 15)
_ = plt.xticks(rotation = 0)

sns.countplot(x = 'Pclass', data = train_df, palette = "Blues_d")

plt.show()

### 3. The Study of the "Sex" Column

As with any categorical column, we count the values of the "Sex" column and find that there were more males than females aboard, with a count of 577 males and 314 females.

In [None]:
# Plotting and basic EDA on the column

print("[+] First 5 Rows of the  'Sex' column: ")
print("########################################\n")
print(train_df['Sex'].head())
print("")

print("[+] Count of Genders in Tabular Format: ")
print("##########################################\n")
print(train_df['Sex'].value_counts())

_ = plt.figure(figsize = (12, 5))
_ = plt.xlabel("Sex", fontsize = 15)
_ = plt.ylabel("Count", fontsize = 15)
_ = plt.title("Occurrences of Sex", fontsize = 15)
_ = plt.xticks(rotation = 0)

sns.countplot(x = 'Sex', data = train_df, palette = "Blues_d")

plt.show()

### 4. The Study of the "Age" and "Fare" Columns

A first look at Age with the head method shows that the data is continuous, and the column has a total of 177 null cells. We impute these with the median age of passengers who share the same Sex, Pclass and Title, which is a finer-grained choice than the median of the whole column. We then plot the distribution and observe that most passengers are between 20 and 40 years old and that the distribution is roughly bell-shaped. Numerical inputs will still need to be scaled to a mean of zero and a standard deviation of one before our model is trained on them.

We then also explore Seaborn's **FacetGrid**, which lets us split the data by a column and arrange multiple plots in one figure. In this case, we separate the plots of the age distribution according to whether the individual survived or perished. Faceting is the easiest way to make your plots multivariate.

Then we also want to explore the distribution of age according to various age groups these individuals fall into, and whether these particular individuals survived or not.
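
The imputation and binning steps can be sketched on a small made-up frame. The frame `toy` and its values are hypothetical, and it groups by Sex and Pclass only to keep the example short. The bin edges and labels are the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [22.0, 30.0, np.nan, 4.0, 38.0, np.nan],
})

# Median age of each (Sex, Pclass) group, broadcast back to every row.
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages with the median of their own group.
toy["Age"] = toy["Age"].fillna(group_median)

# Bin the continuous ages; each interval is open on the left, closed on the right.
age_intervals = (0, 5, 12, 18, 25, 35, 60, 120)
categories = ["Babies", "Children", "Teen", "Student", "Young", "Adult", "Senior"]
toy["Age_Category"] = pd.cut(toy["Age"], age_intervals, labels=categories)

print(toy)
```

A group whose ages are all missing has no median, so its rows would stay null. It is worth re-checking the null count after imputing.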

In [None]:
# Basic EDA and Imputation of Null Cells

print("[+] The first five rows of the 'Age' column: ")
print("##########################################\n")
print(train_df['Age'].head())
print("")

print("[+] Total Number of Null Values : ", train_df["Age"].isnull().sum())

# Imputing the NaN values from the column
train_df.loc[train_df.Age.isnull(), 'Age'] = train_df.groupby(['Sex','Pclass','Title'])['Age'].transform('median')
test_df.loc[test_df.Age.isnull(), 'Age']   = test_df.groupby(['Sex','Pclass','Title'])['Age'].transform('median')

print("[+] Total Null Values after Imputation: ", train_df["Age"].isnull().sum())

In [None]:
# Plotting the data

_ = plt.figure(figsize = (15,5))
_ = plt.title("Distribution and density by Age")
_ = plt.xlabel("Age")
_ = plt.xticks(rotation = 0)

sns.distplot(train_df["Age"], bins = 24,  color = 'black')

plt.show()

In [None]:
plt.figure(figsize=(15,5))

plot = sns.FacetGrid(train_df, col = "Survived", height = 6.2)  # 'size' was renamed to 'height' in newer seaborn releases
plot = plot.map(sns.distplot, "Age", color = 'black')

plt.show()

In [None]:
age_intervals = (0, 5, 12, 18, 25, 35, 60, 120)
categories    = ['Babies', 'Children', 'Teen', 'Student', 'Young', 'Adult', 'Senior']

train_df["Age_Category"] = pd.cut(train_df['Age'], age_intervals, labels = categories)
test_df["Age_Category"]  = pd.cut(test_df['Age'], age_intervals, labels = categories)

In [None]:
print(pd.crosstab(train_df['Age_Category'], train_df['Survived']))

_ = plt.figure(figsize = (15, 5))
_ = plt.ylabel("Fare Distribution", fontsize=18)
_ = plt.xlabel("Age Categories", fontsize=18)
_ = plt.title("Fare Distribution by Age Categories", fontsize=20)

sns.swarmplot(x = 'Age_Category',y = "Fare", data = train_df, hue = "Survived", palette = "PuBuGn_d")

plt.subplots_adjust(hspace = 0.5, top = 0.9)

plt.show()

In [None]:
# Figure Setup
_ = plt.figure(figsize=(15, 5))
_ = plt.ylabel("Count", fontsize = 18)
_ = plt.xlabel("Age Categories", fontsize = 18)
_ = plt.title("Age Distribution ", fontsize = 20)

sns.countplot(x = "Age_Category", data = train_df, hue = "Survived", palette = "PuBuGn_d")

# Plot the figure
plt.show()


## Exploring Relations Between the Important Features

Now that we've looked at the variables we consider important for prediction, we drop the ones we will not use with the drop method in pandas, and we do the same for the test set. Note that Age and Fare are dropped as well; Age lives on through the Age_Category column. The relations between the remaining features reveal some interesting patterns, including ones you might not expect. For example, the SibSp plot later in this section suggests that passengers with one or two siblings or a spouse aboard survived more often than those travelling alone. Features that seem unrelated to survival at first can turn out to be vital for prediction.
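
The bar plots of survival probability in this section show the mean of Survived within each group. As a rough sketch, the same number can be computed directly with a groupby. The frame `toy` below is made up and does not reproduce the real survival rates.

```python
import pandas as pd

# Hypothetical toy data; these are not the actual Titanic figures.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 2, 2],
    "Survived": [0, 0, 1, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of rows equal to 1.
survival_rate = toy.groupby("SibSp")["Survived"].mean()
print(survival_rate)
```

Printing this table next to the plot gives exact values where the bars only give a visual impression.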

In [None]:
# Remove the columns we will not use for modelling
train_dataset = train_df.drop(columns = ['Fare', 'Ticket', 'Age', 'Cabin', 'Name'])
test_dataset  = test_df.drop(columns = ['Fare', 'Ticket', 'Age', 'Cabin', 'Name'])

In [None]:
train_dataset.head()

In [None]:
_ = plt.figure(figsize = (15, 5))
_ = plt.title("Sex Distribution According to Survived or Not")
_ = plt.xlabel("Sex Distribution")
_ = plt.ylabel("Count")

sns.countplot(x = "Sex", data = train_dataset, hue = "Survived", palette = 'PuBuGn_d')

plt.show()

In [None]:
train_dataset["Embarked"] = train_dataset["Embarked"].fillna('S')
test_dataset["Embarked"]  = test_dataset["Embarked"].fillna('S')

_ = plt.figure(figsize = (15, 5))
_ = plt.title("Embarked Distribution According to Survival")
_ = plt.xlabel("Embarked")
_ = plt.ylabel("Count")

sns.countplot(x = 'Embarked', data = train_dataset, hue = 'Survived', palette = 'PuBuGn_d')

plt.show()

In [None]:
# factorplot was renamed to catplot in newer seaborn releases
plot = sns.catplot(x = 'SibSp', y = 'Survived', data = train_dataset, kind = 'bar', height = 5, aspect = 1.6, palette = "PuBuGn_d")
_    = plot.set_ylabels("Probability of Survival")
_    = plot.set_xlabels("SibSp Number")

plt.show()

## 5. Preprocessing Data

Now we encode all categorical variables so that our machine learning model can process them. Then we plot a correlation matrix to inspect the correlations between the features. Only a few features show no correlation with the others at all, and we need not worry about them.

In the next step we apply StandardScaler from sklearn.preprocessing to our datasets. The same scaler, fitted on the training data, needs to be applied to both the training and testing datasets. This step is crucial as it transitions us into the modelling phase.
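
One way to apply the same scaler to both sets is to fit it on the training data only and then reuse the learned mean and standard deviation on the test data. The sketch below uses two tiny made-up matrices (`X_train_toy` and `X_test_toy` are hypothetical names and values).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices; the numbers are invented for illustration.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# fit_transform learns the per-column mean and standard deviation from the training data.
X_train_scaled = scaler.fit_transform(X_train_toy)

# transform reuses those training statistics instead of recomputing them on the test data.
X_test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_)
print(X_test_scaled)
```

Refitting on the test set would give it a different scale from the one the model saw during training. With a single test row, every scaled feature would collapse to zero.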

Before modelling, make sure that the train and test sets have the same number of columns; otherwise our neural network will not accept the test data, since its input shape is fixed.
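
If a category is present in one set but missing from the other, separate calls to get_dummies produce different columns. A possible safeguard, sketched here on made-up frames (`train_toy` and `test_toy` are hypothetical, and `drop_first` is left out for brevity), is to reindex the test columns against the training columns.

```python
import pandas as pd

# Hypothetical frames: the test set happens to contain no "Q" value.
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test_toy = pd.DataFrame({"Embarked": ["S", "C"]})

train_enc = pd.get_dummies(train_toy, columns=["Embarked"], prefix=["Emb"])
test_enc = pd.get_dummies(test_toy, columns=["Embarked"], prefix=["Emb"])

# Give the test set exactly the training columns, in the same order;
# columns that were missing from the test set are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())
```

Reindexing also drops any column that exists only in the test set, which keeps the feature count equal to the network's input shape.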

In [None]:
train_dataset = pd.get_dummies(train_dataset, columns = ["Sex", "Embarked", "Age_Category","Title"], prefix = ["Sex", "Emb", "Age", "Prefix"], drop_first = True)
test_dataset  = pd.get_dummies(test_dataset, columns = ["Sex", "Embarked", "Age_Category","Title"], prefix = ["Sex", "Emb", "Age", "Prefix"], drop_first = True)

In [None]:
# Plotting the correlation matrix

_ = plt.figure(figsize = (18, 15))
_ = plt.title("Correlation Matrix of Features in Training Dataset")
_ = sns.heatmap(train_dataset.astype(float).corr(), vmax = 1.0, annot = False, cmap = "Blues")

plt.show()

In [None]:
train_dataset.columns.tolist()

In [None]:
test_dataset.columns.tolist()

In [None]:
training_data = train_dataset.drop(['Survived', 'PassengerId'], axis = 1)
training_target = train_dataset["Survived"]

testing_data = test_dataset.drop(["PassengerId"], axis = 1)

# Cast both feature matrices to float so they have the same dtype
X_train, y_train = training_data.values.astype(np.float64), training_target.values
X_test = testing_data.values.astype(np.float64)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)  # reuse the mean and std learned from the training data

In [None]:
print("[+] Shapes of Training and Testing Datasets: ")
print("############################################\n")

print('> Training Dataset = ', X_train.shape)
print('> Testing Dataset  = ', X_test.shape)

## 6. Modelling Using Keras

Since this is a fairly basic dataset, a basic solution is enough. Other techniques such as Random Forests and Gradient Boosting would also work, but we will stick with a small neural network here. Our model has two layers: a hidden layer with 50 neurons and relu activation, followed by a dropout of 50%, and an output layer with a single unit. The output uses a sigmoid activation because we are performing binary classification.
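
To show what this architecture computes at prediction time, here is a NumPy sketch of the forward pass. It assumes 17 input features, the input shape used in this notebook. The random weights are placeholders rather than trained values, and dropout is omitted because it is only active during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 17, 50

# Placeholder weights; a trained model would learn these values.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Dense(50, relu) followed by Dense(1, sigmoid)."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # relu
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid gives a value in (0, 1)

X_toy = rng.normal(size=(4, n_features))    # four made-up passengers
probs = forward(X_toy)

# Weights plus biases of both layers.
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)
```

The parameter count, 17 * 50 + 50 + 50 + 1 = 951, should match the total reported by model.summary() for a network with these layer sizes.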

**Callbacks** allow us to act on information from our model after each epoch of training. This is helpful as it frees us from inspecting the loss curves every time we train. We are using two callbacks in this example: Early Stopping and Model Checkpoint. Early Stopping halts training once the validation loss has failed to improve for two consecutive epochs (patience = 2). Model Checkpoint also tracks the validation loss and, because save_best_only is set, overwrites best_model.h5 only when the validation loss improves, so the file left after training holds the weights with the lowest validation loss.
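
The stopping rule can be illustrated without Keras. The function below is a simplified sketch of patience-based early stopping combined with best-epoch tracking. The name `early_stopping_epoch` and the loss values are hypothetical, and options such as `min_delta` are ignored.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both zero-based.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs; best_epoch is the epoch
    whose weights a best-only checkpoint would have kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: the last value is never reached.
val_losses = [0.70, 0.60, 0.55, 0.57, 0.56, 0.50]
result = early_stopping_epoch(val_losses, patience=2)
print(result)
```

A small patience stops training quickly but can end it before a later improvement, as the unused final loss value in this example shows.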

Then loading the best weights that we acquired during the training phase, we predict on our test set and append those predictions into our sample submission file, saving it as **TitanicPreds.csv**.
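
Because the sigmoid output is a probability rather than a class label, it has to be thresholded before it goes into the Survived column. The sketch below uses made-up probabilities and an assumed cutoff of 0.5 to compare thresholding with a plain integer cast.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as a Keras model returns.
probs = np.array([[0.08], [0.51], [0.93], [0.49]])

# A plain integer cast truncates every value below 1.0 down to 0.
truncated = probs.astype(int).ravel()

# Thresholding at 0.5 (an assumed cutoff) yields the intended 0/1 labels.
labels = (probs > 0.5).astype(int).ravel()

print(truncated)
print(labels)
```

The call to ravel flattens the (n_samples, 1) array into one dimension so that it can be assigned to a single DataFrame column.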

In [None]:
model = Sequential()
model.add(Dense(50, input_shape = (17, ), activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation = 'sigmoid'))
model.summary()

callbacks = [EarlyStopping(monitor='val_loss', patience = 2, mode = 'min'), 
             ModelCheckpoint(filepath = 'best_model.h5', monitor = 'val_loss', save_best_only = True)]

In [None]:
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
network = model.fit(X_train, y_train, batch_size = 50, epochs = 100, verbose = True, callbacks = callbacks, validation_split = 0.2)

In [None]:
model.load_weights("best_model.h5")
y_preds = model.predict(X_test)

submission = pd.read_csv('../input/titanic/gender_submission.csv', index_col = 'PassengerId')
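# Threshold the sigmoid probabilities at 0.5; a plain integer cast would truncate almost every prediction to 0
submission['Survived'] = (y_preds > 0.5).astype(int).ravel()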
submission.to_csv('/kaggle/working/TitanicPreds.csv')

## 7. Plotting Model's Performance

Plotting the training and validation accuracy, we can see that the model performs well on the validation split, reaching more than 80 percent accuracy.

In [None]:
print("[+] Available Parameters in Model's History: ")
print("#############################################\n")

for index, key in enumerate(network.history.keys()): print(str(index + 1) + ". ", key)

_ = plt.figure(figsize = (15, 8))
# Plot in the same order as the legend below: training first, then validation
_ = plt.plot(network.history['accuracy'])
_ = plt.plot(network.history['val_accuracy'])
_ = plt.title('Training and Validation Accuracy')
_ = plt.ylabel('Accuracy')
_ = plt.xlabel('Epoch')
_ = plt.legend(['train', 'validation'], loc = 'upper left')

plt.show()

## 8. Conclusion

This was my first kernel and probably my first Kaggle dataset. It is sure to contain a multitude of errors, but I hope to improve it and learn more about various deep learning techniques from top Kagglers. Please consider taking a second and **UPVOTING** this notebook as it motivates me to create more.