## Importing libraries and data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv("../input/titanic/train.csv")
print(data.shape)
data.head()

## Exploratory Data Analysis

### Missing data

First of all, let's check if we have any missing data.

In [None]:
data.isnull().sum()

687 of the 891 Cabin values are missing. Let's check all the unique values of this column.

In [None]:
data.Cabin.unique()

Cabin values contain letters as well as numbers, so we can't fill the missing ones with a mean or median. And, as we found out before, missing values make up most of this column. So we can simply drop it.

In [None]:
n_data = data.drop(columns=["Cabin"])
n_data.isnull().sum()
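As a rough illustration of the "mostly missing, so drop it" reasoning, the share of missing values per column can be computed directly. This is only a sketch on a made-up toy frame; the `threshold` value is a hypothetical cut-off, not something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, np.nan],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean()

# Keep only columns where at most half of the values are missing.
threshold = 0.5  # hypothetical cut-off
kept = toy.loc[:, missing_share <= threshold]

print(missing_share)
print(list(kept.columns))
```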

As for the Age column, we can simply fill it with the rounded mean.

In [None]:
# Mean age on the train data (NaN values are skipped by default)
mean = n_data["Age"].mean()
# round() gives the rounded mean; int() would only truncate the fraction
n_data["Age"] = n_data["Age"].fillna(round(mean))
n_data.isnull().sum()
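The difference between truncating and rounding the mean is easy to miss, so here is a small sketch on made-up ages. The numbers are illustrative only and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 35.0, 24.0])  # made-up ages

mean_age = ages.mean()      # NaN values are skipped: (22+38+35+24)/4 = 29.75
truncated = int(mean_age)   # drops the fractional part
rounded = round(mean_age)   # rounds to the nearest integer

# Fill the gap with the rounded mean.
filled = ages.fillna(rounded)
print(truncated, rounded)
```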

Only 2 values are missing in the Embarked column of the train data. Let's fill them with the mode.

In [None]:
# mode() returns a Series, so take its first element instead of str() of the whole Series
n_data["Embarked"] = n_data["Embarked"].fillna(n_data["Embarked"].mode()[0])
n_data.isnull().sum()
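A detail worth keeping in mind when filling with the mode: `Series.mode()` returns a Series (there may be ties), not a single value. The sketch below uses a made-up `ports` Series to show the difference between taking the first modal value and converting the whole Series to a string.

```python
import numpy as np
import pandas as pd

ports = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

mode_series = ports.mode()    # a Series, because ties are possible
most_common = mode_series[0]  # the first (here: only) modal value

# str() of the Series is its printed form (index, value, dtype), not "S".
as_text = str(mode_series)

filled = ports.fillna(most_common)
print(repr(most_common))
print(repr(as_text))
```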

The missing Fare value in the test data can be filled with the mean.
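No code is given for this step, so the following is only a sketch of how it might look. It assumes a separate `test` frame (here made up) and, as one possible choice, fills the gap with the mean computed on the train data rather than on the test data.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the train and test frames.
train = pd.DataFrame({"Fare": [7.25, 71.25, 8.0, 53.5]})
test = pd.DataFrame({"Fare": [7.75, np.nan, 9.0]})

# Assumption: reuse the train mean so the test set gets no statistics of its own.
fare_mean = train["Fare"].mean()
test["Fare"] = test["Fare"].fillna(fare_mean)
print(test)
```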

That's fine, we have dealt with missing data!

Now, let's also drop the PassengerId, Name and Ticket columns, as they won't be useful for our future predictions.

In [None]:
n_data = n_data.drop(columns=["PassengerId", "Name", "Ticket"])
n_data.head()

### Features visualizing and analysis

Let's take a look at our features and their distributions. We will visualize them on plots.

We will also create a new dataframe with only the survivors' data, to compare it later with the data of all passengers (both survivors and those who died).

In [None]:
sn_data = n_data[n_data.Survived == 1]
sn_data.head()

#### Sex

Let's plot the Sex column. From now on, blue bars show the survivors and red bars show all the passengers.

In [None]:
plt.hist(n_data.Sex, bins=n_data.Sex.unique().size*2 - 1, color="r")
plt.hist(sn_data.Sex, bins=n_data.Sex.unique().size + 1)
plt.show()

As can be seen from the plot, there were many more men on board. (Red histogram)

Among the survivors the situation is very different: about 2/3 of all the survivors are women. We will use this fact later. (Blue histogram)
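The histograms mix two different quantities: the share of each sex among survivors, and the survival rate within each sex. A small sketch on made-up data (not the Titanic values) shows how each could be computed.

```python
import pandas as pd

# Made-up data: 6 men (1 survived) and 4 women (3 survived).
toy = pd.DataFrame({
    "Sex": ["male"] * 6 + ["female"] * 4,
    "Survived": [0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
})

# Share of each sex among the survivors only.
share_among_survivors = toy.loc[toy["Survived"] == 1, "Sex"].value_counts(normalize=True)

# Survival rate within each sex (mean of a 0/1 column is a proportion).
survival_rate = toy.groupby("Sex")["Survived"].mean()

print(share_among_survivors)
print(survival_rate)
```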

#### Pclass

Let's visualize Pclass column.

In [None]:
plt.hist(n_data.Pclass, bins=n_data.Pclass.unique().size*2 -1, color="r")
plt.hist(sn_data.Pclass, bins=sn_data.Pclass.unique().size*2 -1)
plt.show()

The plot shows that most passengers travelled in 3rd class, which was the cheapest. The numbers of 1st and 2nd class passengers are almost the same. (Red histogram)

Let's take a look at the survivors. (Blue histogram)

The situation changes dramatically. As we can see, the largest group of survivors comes from 1st class.

Moreover, the survivors make up:

- almost 60% of 1st class passengers;
- half of the 2nd class passengers;
- a small part of the 3rd class passengers.

That means that 1st class passengers were more likely to survive.
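Per-class percentages like the ones listed above can be read from a normalized cross-tabulation instead of being estimated from the bars. The sketch below uses made-up values, so its numbers do not match the Titanic data.

```python
import pandas as pd

# Made-up data: three classes with different survival counts.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Raw counts of died (0) and survived (1) per class.
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# normalize="index" turns each row into proportions that sum to 1.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
survival_rate = rates[1]

print(counts)
print(survival_rate)
```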

#### Age

Let's visualize Age column.

In [None]:
plt.subplots(figsize=(17, 4))
plt.subplot(1, 3, 1)
sns.boxplot(n_data.Age, color="r")
plt.subplot(1, 3, 2)
sns.boxplot(sn_data.Age)
plt.subplot(1, 3, 3)
sns.distplot(n_data.Age, color="r")
sns.distplot(sn_data.Age)
plt.show()

As we can see, only a few passengers are older than 70. (Red boxplot)

Only one person aged 65-80 survived. (Blue boxplot)

The distplot shows that more children survived than died. Adults aged 20-40 were also more likely to survive, while for people older than 60 the chances of survival were really small.
For the other age groups there are no obvious age-related dependencies.

#### SibSp

Let's visualize SibSp column (number of siblings/spouses aboard the Titanic).

In [None]:
plt.subplots(figsize=(17, 4))
plt.subplot(1, 3, 1)
sns.boxplot(n_data.SibSp, color="r")
plt.subplot(1, 3, 2)
sns.boxplot(sn_data.SibSp)
plt.subplot(1, 3, 3)
plt.hist(n_data.SibSp, bins=20, color="r")
plt.hist(sn_data.SibSp, bins=10)
plt.show()

The boxplot shows that the number of people with SibSp >= 3 is really small. (Red boxplot)

People with SibSp > 4 didn't survive at all. (Blue boxplot)

There are no obvious dependencies in the histogram, so we will move on.

#### Parch

Let's visualize Parch column (number of parents / children aboard the Titanic).

In [None]:
plt.subplots(figsize=(17, 4))
plt.subplot(1, 3, 1)
sns.boxplot(n_data.Parch, color="r")
plt.subplot(1, 3, 2)
sns.boxplot(sn_data.Parch)
plt.subplot(1, 3, 3)
plt.hist(n_data.Parch, bins=17, color="r")
plt.hist(sn_data.Parch, bins=7*2)
plt.show()

Only a small number of people have Parch > 0. (Red/first boxplot)

People with Parch equal to 4 or 6 didn't survive at all. (Blue boxplot)

There are no obvious dependencies in the histogram, so we will move on.

#### Fare

Let's visualize the Fare column.

In [None]:
plt.subplots(figsize=(17, 4))
plt.subplot(1, 2, 1)
sns.distplot(n_data.Fare, color="r")
sns.distplot(sn_data.Fare)
plt.subplot(2, 2, 2)
sns.boxplot(sn_data.Fare)
plt.subplot(2, 2, 4)
sns.boxplot(n_data.Fare, color="r")
plt.show()

These plots don't really give us useful information.

#### Embarked

Let's visualize Embarked - the last column.

In [None]:
plt.hist(n_data.Embarked, bins=15, color="r")
plt.hist(sn_data.Embarked, bins=15)
plt.show()

This plot doesn't really give us useful information.

### Data encoding

It seems that all the remaining data is either numeric or categorical. Let's check whether that is true.

In [None]:
for col in n_data:
    print(col, n_data[col].dtypes)

To encode the Sex and Embarked columns we will use one-hot encoding via pd.get_dummies.

In [None]:
n_data_d = pd.get_dummies(n_data)

Let's check that it worked.

In [None]:
n_data_d

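We expect only 3 Embarked values: Q, S and C. Let's drop any other Embarked dummy column (for example, one created by a malformed fill value), as it doesn't carry any meaning.

In [None]:
# Keep the three expected port columns and drop any stray Embarked_* column
valid_embarked = ["Embarked_C", "Embarked_Q", "Embarked_S"]
extra_cols = [c for c in n_data_d.columns if c.startswith("Embarked_") and c not in valid_embarked]
n_data_d = n_data_d.drop(columns=extra_cols)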
n_data_d.head()
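To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame. The explicit `columns` and `dtype=int` arguments are choices made for this example only; the notebook calls `pd.get_dummies(n_data)` without them.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
})

# Each listed column is replaced by one 0/1 column per distinct value.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

print(list(encoded.columns))
print(encoded)
```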

### Correlation Matrix

Now, let's check whether any features are correlated with each other.

In [None]:
corr_m = n_data_d.corr()
plt.subplots(figsize=(12, 8))
sns.heatmap(corr_m, annot=True, square=True)
plt.show()

In [None]:
high_corr = corr_m.nlargest(12, 'Survived')['Survived'].drop(['Survived'])
high_corr

As we can see, Survived correlates most strongly with Sex_female, Sex_male, Pclass and Fare.
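Sorting correlations by value pushes strong negative correlations to the bottom of the list, which is why a feature like Sex_male appears last rather than near the top. A sketch on made-up data shows one way to rank by strength instead, using the absolute value. The column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "target": [0, 0, 1, 1, 0, 1],
    "pos":    [1, 2, 5, 6, 2, 7],        # rises with the target
    "neg":    [-1, -2, -5, -6, -2, -7],  # mirror image of "pos"
    "noise":  [1, 3, 2, 2, 2, 2],        # unrelated to the target
})

corr = toy.corr()["target"].drop("target")

# Sorting by value: the strongly negative feature ends up last.
by_value = corr.sort_values(ascending=False)

# Sorting by absolute value: the weakest feature ends up last.
by_strength = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(by_value)
print(by_strength)
```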

### Distribution

Let's use log to make Age and Fare columns distribution more similar to normal.

In [None]:
n_data_d.Age = np.log1p(n_data_d.Age)

In [None]:
sns.distplot(n_data_d.Age)
plt.show()

In [None]:
n_data_d.Fare = np.log1p(n_data_d.Fare)

In [None]:
sns.distplot(n_data_d.Fare)
plt.show()

This transformation can help, because some models (for example linear and distance-based ones) tend to make better predictions when features are less skewed and closer to normally distributed.
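One way to check whether the log transform actually helped is to compare skewness before and after. The sketch below uses made-up fare-like values, and `np.expm1` is shown only to illustrate that `np.log1p` is reversible.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values with a zero (log1p handles zero safely).
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.3, 263.0, 512.3])

logged = np.log1p(fares)  # log(1 + x)

skew_before = fares.skew()
skew_after = logged.skew()

# expm1 is the inverse of log1p, so the original values can be recovered.
restored = np.expm1(logged)

print(skew_before, skew_after)
```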

## Model selection and training

Import all the models and functions.

In [None]:
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score, confusion_matrix, roc_auc_score, auc, precision_recall_curve, make_scorer
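import tensorflow as tf
from tensorflow import keras
# Import layers and models from tensorflow.keras to avoid mixing it with standalone keras
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.models import Sequential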
from lightgbm import LGBMClassifier

First of all, since this is a classification task, we should check whether the two Survived classes are balanced.

In [None]:
print("Survived = {} \nDead = {}".format(n_data_d[n_data_d.Survived == 1].shape[0], n_data_d[n_data_d.Survived == 0].shape[0]))

Our classes aren't well balanced, so accuracy alone can be misleading: a model that always predicts the majority class would still get a decent accuracy. That's why we will use the ROC-AUC metric.
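The point about accuracy on imbalanced classes can be seen on a tiny made-up example: a "model" that always predicts the majority class gets a high accuracy but a ROC-AUC of 0.5, the same as random guessing. The 9-to-1 class ratio is exaggerated on purpose and is not the Titanic ratio.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: 9 negatives and 1 positive.
y_true = np.array([0] * 9 + [1])

# A trivial model that always predicts the majority class.
y_pred = np.zeros(10, dtype=int)
constant_scores = np.zeros(10)

acc = accuracy_score(y_true, y_pred)
auc_value = roc_auc_score(y_true, constant_scores)

print(acc, auc_value)
```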

### Splitting data

Before training our models, we should split the data, so that we can tune hyperparameters on one part and evaluate the models on a held-out part.

In [None]:
n_data_d.head()

In [None]:
# X holds the features we predict from, and y is the target we predict
X, y = n_data_d.drop(columns=["Survived"]), n_data_d.Survived
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
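Since the classes are not balanced, one option (not used in the cell above) is a stratified split, which keeps the class ratio the same in both parts. This is only a sketch on made-up data; `stratify` is a real `train_test_split` parameter, while the 40/60 ratio and the `test_size` value are assumptions for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 40 positives and 60 negatives.
X_demo = pd.DataFrame({"feature": range(100)})
y_demo = pd.Series([1] * 40 + [0] * 60)

# stratify=y_demo preserves the 40/60 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(y_tr.mean(), y_te.mean())
```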

### RandomForestClassifier

We will use RandomForestClassifier. bootstrap is True, as the size of the data isn't really big. We will use the ROC-AUC metric to score our models. The advantage of ROC-AUC is that it is much less sensitive to class imbalance than accuracy.

In [None]:
# Custom scorer (make_scorer takes greater_is_better, not higher_is_better)
roc_auc = make_scorer(roc_auc_score, greater_is_better=True)

rf = RandomForestClassifier(bootstrap=True, n_estimators=700, criterion='entropy')
rf.fit(X_train, y_train)

# roc_auc_score expects (y_true, y_score); score the positive-class probability
print("Score on train data: {:.4f}".format(roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1])))
print("Score on test data: {}\n".format(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])))
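Two details of `roc_auc_score` affect the numbers reported here: the argument order is `(y_true, y_score)`, and passing hard 0/1 predictions instead of probabilities discards ranking information. The sketch below uses made-up labels and probabilities to show that all three variants can give different values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.4, 0.45, 0.8])  # made-up positive-class probabilities
labels = (proba > 0.5).astype(int)       # hard predictions: [0, 0, 0, 1]

auc_proba = roc_auc_score(y_true, proba)     # ranks every positive above every negative
auc_labels = roc_auc_score(y_true, labels)   # ties between 0-labels lose information
auc_swapped = roc_auc_score(labels, y_true)  # arguments in the wrong order

print(auc_proba, auc_labels, auc_swapped)
```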

Let's also take a look at important features.

In [None]:
for i in np.arange(len(rf.feature_importances_)):
    print("{} : {:.4f}".format(X_train.columns[i], rf.feature_importances_[i]))

Seems like Fare, Age and Sex have the highest importance for our model.

### KNeighborsClassifier

We will use GridSearchCV to find the best parameters for KNeighborsClassifier.

In [None]:
params={'n_neighbors' : range(1, 20), 'leaf_size' : range(1, 50)}

knn_grid = GridSearchCV(KNeighborsClassifier(), params, scoring='roc_auc')
knn_grid.fit(X_train, y_train)

print("Best GridSearchCV params: {}".format(knn_grid.best_params_))
# best_score_ is the mean cross-validated score of the best parameter set
print("Best CV score on train data: {:.4f}".format(knn_grid.best_score_))
print("Best score on test data: {}\n".format(roc_auc_score(y_test, knn_grid.predict_proba(X_test)[:, 1])))
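KNeighborsClassifier is distance-based, so features on very different scales can dominate the distance. The imports above include `make_pipeline` and `StandardScaler`, so one possible variation (an assumption, not what the cell above does) is to scale inside a pipeline and tune that. The sketch uses synthetic data from `make_classification` and a shortened, hypothetical parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would use X_train and y_train instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaling happens inside the pipeline, so each CV fold is scaled using
# statistics from its own training part only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}

grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
grid.fit(X_demo, y_demo)

print(grid.best_params_, round(grid.best_score_, 4))
```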

It seems that the scores are actually better than RandomForestClassifier's.

### RidgeClassifier

We will use GridSearchCV to find the best parameters for RidgeClassifier.

In [None]:
# 'normalize' is left out of the grid: recent scikit-learn releases removed it from RidgeClassifier
params = {'alpha' : [0.00001, 0.0001, 0.001, 0.01, 1, 10, 100, 1000], 'random_state' : [0, 50, 100, 150, 200]}

r_grid = GridSearchCV(RidgeClassifier(), params, scoring='roc_auc')
r_grid.fit(X_train, y_train)

print("Best GridSearchCV params: {}".format(r_grid.best_params_))
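print("Best CV score on train data: {:.4f}".format(r_grid.best_score_))
# RidgeClassifier has no predict_proba, so decision_function scores are used for ROC-AUC
print("Best score on test data: {}\n".format(roc_auc_score(y_test, r_grid.decision_function(X_test))))

Let's find out which features are important.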

In [None]:
for i in np.arange(len(r_grid.best_estimator_.coef_[0])):
    print("{} : {}".format(X_train.columns[i], r_grid.best_estimator_.coef_[0][i]))

Almost the same features are important for both RandomForestClassifier and RidgeClassifier, but RidgeClassifier also treats Pclass as an important feature.

### LogisticRegression

We will use GridSearchCV to find the best parameters for LogisticRegression.

In [None]:
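# The default lbfgs solver supports only the 'l2' penalty (or no penalty), so the grid is limited to 'l2'
params = {'penalty' : ['l2'], 'C' : [0.00001, 0.0001, 0.001, 0.01, 1, 10, 100, 1000]}

# max_iter is raised to reduce convergence warnings on unscaled features
log_reg_grid = GridSearchCV(LogisticRegression(max_iter=1000), params, scoring='roc_auc')
log_reg_grid.fit(X_train, y_train)

# Report the parameters of this grid search (not of r_grid)
print("Best GridSearchCV params: {}".format(log_reg_grid.best_params_))
print("Best CV score on train data: {:.4f}".format(log_reg_grid.best_score_))
print("Best score on test data: {}\n".format(roc_auc_score(y_test, log_reg_grid.predict_proba(X_test)[:, 1])))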

As we can see, LogisticRegression gives better results than RandomForestClassifier and RidgeClassifier, but it is still worse than KNeighborsClassifier.

### Neural Network

The last model will be a neural network. We will use keras to build it.

In [None]:
nn = Sequential()

# Initializers are given by name; Flatten is not needed for 2D tabular input
nn.add(Dense(10, kernel_initializer='glorot_uniform', activation='tanh'))
nn.add(Dense(16, kernel_initializer='he_normal', activation='elu'))
nn.add(Dropout(0.3))
nn.add(Dense(32, kernel_initializer='glorot_uniform', activation='tanh'))
nn.add(Dropout(0.3))
nn.add(Dense(4, kernel_initializer='glorot_uniform', activation='tanh'))
nn.add(Dense(1, activation='sigmoid'))

# Keras expects a homogeneous float array; get_dummies may produce bool columns
X_train_nn, X_test_nn = X_train.astype("float32"), X_test.astype("float32")

# metrics is passed as a list; the tracked metric is ROC-AUC, not accuracy
nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.AUC(curve='ROC')])
nn.fit(X_train_nn, y_train, epochs=500, verbose=0)
scores = nn.evaluate(X_train_nn, y_train, verbose=0)
print("\nROC-AUC on train data : {}".format(scores[1]))
# Score the predicted probabilities directly: roc_auc_score(y_true, y_score)
nn_proba = nn.predict(X_test_nn).ravel()
print("ROC-AUC on test data : {}".format(roc_auc_score(y_test, nn_proba)))

It looks like the neural network's scores are pretty good, but still worse than KNeighborsClassifier's.

### Result

As can be seen from the results, the most important features are Pclass, Age, Sex and Fare.

The best ROC-AUC score is given by KNeighborsClassifier.