RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early hours of 15 April 1912, after colliding with an iceberg during her maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history. RMS Titanic was the largest ship afloat at the time she entered service and was the second of three Olympic-class ocean liners operated by the White Star Line.


<img src="https://miro.medium.com/max/750/0*wKmr2Sffqkr9FimI.gif" alt="vac" border="0">

“Titanic: Machine Learning from Disaster” is a classic problem for beginners in Machine Learning.
The challenge is to predict, based on a set of training data, which people survived the disaster and which did not. In practice, this means discovering correlations between passengers' features and their survival, and using them to make predictions.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Understanding the Problem / Discovering the Data
In this notebook, we will build a predictive model of what sort of people were more likely to survive the Titanic shipwreck. The data is provided to us in 2 sets: a train set and a test set.

In [None]:
#titanic_features = pd.read_csv('train.csv')
df = pd.read_csv("../input/titanic/train.csv")
df_test = pd.read_csv("../input/titanic/test.csv")

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Exploratory Data Analysis:
When you download the data from Kaggle, 2 datasets are available. One is the train set, used to fit your model; it already contains the answer of who survived and who did not. The other is the test set, which has the same features but no survival labels. But we have some problems in this data …

# Checking missing values in the dataset

In [None]:
df.isnull().sum()

The Embarked feature has only 2 missing values, which can easily be filled. The ‘Age’ feature, which has 177 missing values, will be much trickier to deal with. The ‘Cabin’ feature needs further investigation.

In [None]:
import seaborn as sns
sns.heatmap(df.isnull(), yticklabels = False, cmap="YlGnBu")

<img src="https://miro.medium.com/max/660/0*aF7XRizN9CFij6cK.gif" alt="vac" border="0">

We have little data for this problem. But that doesn’t mean we can’t solve it!

# Features:
+ PassengerId: Unique Id of a passenger
+ Survived: Survival (0 = No, 1 = Yes)
+ Pclass: Ticket class
+ Name: Passenger name
+ Sex: Sex
+ Age: Age in years
+ SibSp: # of siblings / spouses aboard the Titanic
+ Parch: # of parents / children aboard the Titanic
+ Ticket: Ticket number
+ Fare: Passenger fare
+ Cabin: Cabin number
+ Embarked: Port of Embarkation

From the features above, we can note a few things:
+ We have a few categorical variables that need to be either converted to numerical values or one-hot encoded, so that the machine learning algorithms can process them.
+ The features have widely different ranges, and we will need to convert them to roughly the same scale.
+ Some features contain missing values (NaN = not a number) that we need to deal with.
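
The three points above can be sketched on a tiny, made-up frame. The rows below are hypothetical and only mimic a few Titanic columns; the choice of `pd.get_dummies` and min-max scaling is an assumption for illustration, not necessarily what the rest of this notebook does.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (not real passenger data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 8.05],
})

# Missing values: fill the categorical gap with the most frequent value
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Categorical variables: one-hot encode into indicator columns
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])

# Different ranges: min-max scale Fare into [0, 1]
fare = encoded["Fare"]
encoded["Fare"] = (fare - fare.min()) / (fare.max() - fare.min())

print(encoded)
```

One-hot encoding avoids implying an order between categories, which integer codes would do; for tree-based models either encoding is usually workable.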

In [None]:
df.head(2)

From the observation above, we can already see that Age and Cabin have many missing values. How to deal with missing data depends on the dataset. Since our dataset is small, dropping columns might not be a good idea, but that also depends on how much the column meaningfully contributes to our dataset. Now, let's check in detail ...

Let's take an overall look at the Survived data in our training set, since the main point of this analysis is who was more likely to survive and who was not.

In [None]:
sns.set_style('ticks')
sns.countplot(x='Survived', data = df)

Using countplot gives us an overview of Survived / Not survived, and as we can see, the number of people who did not survive is higher than the number who survived :(

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=df)

Let's look at one more feature, Pclass, and see what it tells us about who survived.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Pclass', data=df)

So we can fairly conclude here that people in 1st class had a better chance of surviving, and that many people in 3rd class died.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='SibSp', data=df)

In [None]:
import pandas as pd
import plotly.express as px
temp = df.groupby(by="Cabin").count()
name = temp.Survived.index
val = temp.Survived.values

fig = px.scatter_polar(temp, r=val, theta=name,color=name, symbol=val, 
                       size=val,color_discrete_sequence=px.colors.sequential.Plasma_r, title='Survived by Cabin')
fig.show()

In [None]:
import plotly.express as px
t0 = df.iloc[(df["Survived"]==0).values]
t1 = df.iloc[(df["Survived"]==1).values]

temp0 = t0.groupby(by="Embarked").count()
name0 = temp0.Survived.index
val0 = temp0.Survived.values

temp1 = t1.groupby(by="Embarked").count()
name1 = temp1.Survived.index
val1 = temp1.Survived.values

fig = px.scatter_polar(r=val1+val0, theta=name1,color=np.round(val1/(val0+val1)*100,0), 
                       symbol=val0, size=np.round(val1/(val0+val1)*100,0),
                       color_discrete_sequence=px.colors.sequential.Plasma_r, title='Survival rate by Embarked')
fig.show()

+ Over 72% of the passengers embarked from the port ‘Southampton’, 18% from the port ‘Cherbourg’ and the rest from the port ‘Queenstown’.
+ Passengers from port ‘Southampton’ have a low survival rate of 34%, while those from the port ‘Cherbourg’ have a survival rate of 55%.
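
Percentages like the ones above can also be read off a plain table instead of a polar plot. A minimal sketch, using hypothetical toy rows rather than the real Kaggle data (with the real frame you would pass `df` instead of `toy`):

```python
import pandas as pd

# Hypothetical toy rows (not the real Kaggle data), just to show the aggregation
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", "Q"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the survival rate; count gives passengers per port
summary = toy.groupby("Embarked")["Survived"].agg(passengers="count", survival_rate="mean")
summary["share_of_passengers"] = summary["passengers"] / len(toy)
print(summary)
```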

In [None]:
import plotly.express as px
man = df.iloc[((df["Survived"]==1)&(df["Sex"]=="male")).values]
woman = df.iloc[((df["Survived"]==1)&(df["Sex"]=="female")).values]

M = man.groupby(by="Embarked").count()
name = M.Survived.index
Mn = M.Survived.values

F = woman.groupby(by="Embarked").count()
Fn = F.Survived.values

fig = px.scatter_polar(r=Mn + Fn, theta=name,color=np.round(Mn/(Mn+Fn)*100,0), 
                       symbol=Fn, size=np.round(Fn/(Mn+Fn)*100,0),
                       color_discrete_sequence=px.colors.sequential.Plasma_r, title='Survivors by Sex and Embarked')
fig.show()

+ Women have a survival rate of 65%, while men have a survival rate of about 35%.
+ Women who embarked at port Q or port S have a higher chance of survival (90%). The inverse is true if they embarked at port C.
+ Men have a high survival probability if they embarked at port C, but a low probability if they embarked at port Q or S.

In [None]:
S = df.iloc[(df["Survived"]==1).values]
temp = S.groupby(by="Pclass").count()
name = temp.Survived.index
val = temp.Survived.values
fig = px.pie(temp, values=val, names=name, title='Survived by Pclass')
fig.show()

In [None]:
S = df.iloc[(df["Survived"]==1).values]
temp = S.groupby(by="Sex").count()
name = temp.Survived.index
val = temp.Survived.values
fig = px.pie(temp, values=val, names=name, title='Survived by Sex')
fig.show()

So, among the more than two thousand people on the ship, let's look at the age of the passengers through a distribution plot.
+ You can see that men have a high probability of survival when they are between 18 and 30 years old, which is partly true for women as well.
+ For women the survival chances are higher between 14 and 40.
+ For men the probability of survival is very low between the ages of 5 and 18, but that isn’t true for women.
+ Another thing to note is that infants also have a slightly higher probability of survival.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
grid = sns.FacetGrid(df, col='Survived', row='Pclass', height=3.2, aspect=1.6)  # `size` was renamed to `height` in seaborn 0.9
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()

The plot above confirms our assumption about pclass 1, but we can also spot a high probability that a person in pclass 3 will not survive.

# Adapting the Data and solving Inconsistencies
I’ll just summarize the problems in this data and how we can solve them; the focus of this post is to build a model that works for our problem.

In [None]:
df.shape

In [None]:
df_test.shape

In [None]:
df_train = df.drop(columns="Survived")
df_train.head(2)

In [None]:
ntrain = df_train.shape[0]
ntest = df_test.shape[0]
y = df.Survived

data = pd.concat((df_train, df_test)).reset_index(drop=True) 
data.shape

In [None]:
data.head(2)

In [None]:
data.isnull().sum()

In [None]:
data.head(2)

Now it's time to deal with our Cabin column. As we saw above, Cabin is the column that contains many NaN values. As I mentioned, this data is messy but also small, without many features, so we fill in the gaps by leveraging the data instead of losing observations. I will fill the missing values in the Cabin column with the character "U" for unknown.

The remaining values consist of a letter followed by numbers, so I decided to extract just the first letter of each Cabin value to get some general information out of this mostly missing column.

We will use the Name feature to extract the Titles from the Name, so that we can build a new feature out of that.

In [None]:
data['Title'] = data['Name'].apply(lambda x: x.split(",")[1].split(".")[0].strip())  # strip() drops the space after the comma
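
To make the split logic concrete, here is a small sketch on two sample strings in the "Surname, Title. Given names" layout of the Name column. The helper name `extract_title` is hypothetical, and `strip()` is used here to drop the space that follows the comma.

```python
# Sample strings in the "Surname, Title. Given names" layout used by the Name column
names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]

def extract_title(name):
    # Take the text between the first comma and the following period,
    # then strip() the leading space left over from the comma
    return name.split(",")[1].split(".")[0].strip()

titles = [extract_title(n) for n in names]
print(titles)  # ['Mr', 'Miss']
```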

As a reminder, we have to deal with Cabin (687), Embarked (2) and Age (177). At first I thought we would have to delete the ‘Cabin’ variable, but then I found something interesting. A cabin number looks like ‘C123’, and the letter refers to the deck. Therefore we’re going to extract these letters and create a new feature that contains a person's deck. Afterwards we will convert the feature into a numeric variable.

In [None]:
data.columns

In [None]:
data['Cabin'] = data['Cabin'].fillna("U")  # "U" marks an unknown cabin, so it is not confused with deck C

#Turning cabin number into Deck
data['Deck'] = data['Cabin'].str[:1]

data["Cabin"] = data["Cabin"].factorize()[0]

data['Age'] = data['Age'].fillna(data.Age.mean())
data['Embarked'] = data['Embarked'].fillna("S")
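data['Fare'] = data['Fare'].fillna(data['Fare'].median())  # a sentinel such as -999 would become the minimum and distort the min-max scaling below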
data["Fare"] = (data["Fare"] - data["Fare"].min()) / (data["Fare"].max() - data["Fare"].min())

data["Avg_Fare"] = data["Fare"] / (1 + data["SibSp"] + data["Parch"])
data["relatives"] = data["SibSp"] + data["Parch"]

data["Sex"] = data["Sex"].factorize()[0]
data["Embarked"] = data["Embarked"].factorize()[0]
data["Name"] = data["Name"].factorize()[0]
data["Avg_Age"] = (data["Age"] - data["Age"].mean()) / data["Age"].std()
data["Ticket"] = data["Ticket"].factorize()[0]

In [None]:
data.head(2)

Assuming that PassengerId does not make any significant contribution to our data, we decide to drop it. Name has already been used to derive Title and is kept only in its factorized form.

In [None]:
data = data.drop(columns = ["PassengerId"])

In [None]:
data.isnull().sum()

In [None]:
from scipy.stats import norm, skew

numeric_feats = data.dtypes[data.dtypes != 'object'].index
skewed_feats = data[numeric_feats].apply(lambda x: skew(x)).sort_values(ascending=False)
high_skew = skewed_feats[abs(skewed_feats) > 0.5]
high_skew

We reduce the skew of these "high skew" columns by taking the logarithm (log1p of the absolute value):

In [None]:
for feature in high_skew.index:
    data[feature] = np.log1p(np.abs(data[feature]))
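
A quick way to see what the log transform does is to compare the skewness before and after. The sketch below uses synthetic right-skewed values (an assumption for illustration, not the Titanic columns):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed values (an assumption for illustration, not Titanic data)
rng = np.random.default_rng(0)
values = rng.lognormal(mean=2.0, sigma=1.0, size=1000)

before = skew(values)            # strongly positive for a long right tail
after = skew(np.log1p(values))   # log1p compresses the tail

print(before, after)
```

Note that `log1p` mainly helps with right-skewed, non-negative columns; for other shapes a different transform may be more suitable.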

Next, we label-encode the remaining categorical (object) columns, while the numeric columns keep their values:

In [None]:
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict

# names of the columns that are already numeric
cols=np.array(data.columns[data.dtypes != object])
# one LabelEncoder per column, created on first access
d = defaultdict(LabelEncoder)

# label-encode every column, then restore the numeric columns to their original values
trainL = data.apply(lambda x: d[x.name].fit_transform(x))
trainL[cols] = data[cols]

In [None]:
trainL.head(2)

In [None]:
#from sklearn.preprocessing import StandardScaler
# Capture all the numerical features so that we can scale them later
#numerical_features = list(trainL.select_dtypes(include=['int64', 'float64', 'int32']).columns)
#numerical_features

# Feature scaling - Standard scaler
#SS = StandardScaler()
#dataS = pd.DataFrame(data = trainL)
#dataS[numerical_features] = SS.fit_transform(dataS[numerical_features])

In [None]:
dataS = trainL
dataS.head(2)

In [None]:
dataS.shape

# ML Modelling

Now we have finished the worst part of the problem. It’s time to do Machine Learning, baby!

<img src="https://i.gifer.com/3Y0.gif" alt="vac" border="0">

In [None]:
X = dataS[:ntrain].values
Z = dataS[ntrain:].values
y = df['Survived'].values

In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from numpy import where
import collections

counter = collections.Counter(y)
print(counter)

smt = SMOTE(random_state=0)
# fit_sample was deprecated in favor of fit_resample and is not available in recent imbalanced-learn releases
X, y = smt.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
counter = collections.Counter(y)
print(counter)
# scatter plot of the first two feature columns, one color per class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    plt.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
plt.legend()
plt.show()

Now we create a function that fits a model and plots its confusion matrices and accuracy scores for the training split, the test split and the full data:

In [None]:
def Models(models, title, X_train, X_test, y_train, y_test, X, y):
    model = models
    model.fit(X_train,y_train)
    
    train_matrix = pd.crosstab(y_train, model.predict(X_train), rownames=['Actual'], colnames=['Predicted'])    
    test_matrix = pd.crosstab(y_test, model.predict(X_test), rownames=['Actual'], colnames=['Predicted'])
    matrix = pd.crosstab(y, model.predict(X), rownames=['Actual'], colnames=['Predicted'])
    
    f,(ax1,ax2,ax3) = plt.subplots(1,3,sharey=True, figsize=(15, 2))
    
    g1 = sns.heatmap(train_matrix, annot=True, fmt=".1f", cbar=False,annot_kws={"size": 18},ax=ax1)
    g1.set_title(title)
    g1.set_ylabel('Total = {}'.format(y_train.sum()), fontsize=14, rotation=90)
    g1.set_xlabel('Accuracy score (TrainSet): {}'.format(accuracy_score(model.predict(X_train), y_train)))
    g1.set_xticklabels(['Not Survived','Survived'],fontsize=12)

    g2 = sns.heatmap(test_matrix, annot=True, fmt=".1f",cbar=False,annot_kws={"size": 18},ax=ax2)
    g2.set_ylabel('Total = {}'.format(y_test.sum()), fontsize=14, rotation=90)
    g2.set_xlabel('Accuracy score (TestSet): {}'.format(accuracy_score(model.predict(X_test), y_test)))
    g2.set_xticklabels(['Not Survived','Survived'],fontsize=12)

    g3 = sns.heatmap(matrix, annot=True, fmt=".1f",cbar=False,annot_kws={"size": 18},ax=ax3)
    g3.set_ylabel('Total = {}'.format(y.sum()), fontsize=14, rotation=90)
    g3.set_xlabel('Accuracy score (Total): {}'.format(accuracy_score(model.predict(X), y)))
    g3.set_xticklabels(['Not Survived','Survived'],fontsize=12)
    plt.show()

And another function to show the precision-recall and ROC curves of each model:

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
def ROCs(model):
    y_scores = model.predict_proba(X_train)
    y_scores = y_scores[:,1]
    precision, recall, threshold = precision_recall_curve(y_train, y_scores)
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, y_scores)

    f,(ax1,ax2,ax3) = plt.subplots(1,3,sharey=True, figsize=(25, 7))

    ax1.plot(threshold, precision[:-1], "r-", label="precision", linewidth=2)
    ax1.plot(threshold, recall[:-1], "b", label="recall", linewidth=2)
    ax1.legend(loc="upper right", fontsize=14)
    ax1.set_xlabel("threshold", fontsize=14)
    ax1.axis([0, 1., 0, 1.])

    # recall is on the x-axis and precision on the y-axis
    ax2.plot(recall, precision, "g--", linewidth=2)
    ax2.set_ylabel("precision", fontsize=14, rotation=90)
    ax2.set_xlabel("recall", fontsize=14)
    ax2.axis([0, 1., 0, 1.])

    ax3.plot(false_positive_rate, true_positive_rate, linewidth=2)
    ax3.plot([0, 1], [0, 1], 'r', linewidth=4)
    ax3.axis([0, 1, 0, 1])
    ax3.set_xlabel('False Positive Rate (FPR)', fontsize=14)
    ax3.set_ylabel('True Positive Rate (TPR)', fontsize=14)

    plt.show()

We start first with the CatBoostClassifier model:

In [None]:
%%time
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

params = {'loss_function':'Logloss','eval_metric':'AUC','verbose': False,
          'learning_rate': 0.05,'depth': 2,'l2_leaf_reg': 1,'n_estimators': 100}
CBC = CatBoostClassifier(**params)
Models(CBC, "CatBoostClassifier", X_train, X_test, y_train, y_test, X, y)

<img src="https://i.gifer.com/4j.gif" alt="vac" border="0">

# Precision Recall Curve
For each person the model has to classify, it computes a probability and classifies the person as survived (when the score is greater than the threshold) or as not survived (when the score is smaller than the threshold). That’s why the threshold plays an important part. We will plot the precision and recall against the threshold:

In [None]:
ROCs(CBC)

+ Above you can clearly see that the recall falls off rapidly at a precision of around 82%. Because of that, you may want to select the precision/recall tradeoff before that point.
+ You are now able to choose a threshold that gives you the best precision/recall tradeoff for your current machine learning problem. If, for example, you want a precision of 80%, you can look at the plots and see that you would need a threshold of around 0.4. You could then apply that threshold to the predicted probabilities, without retraining the model, to get the desired precision.
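
Picking a threshold for a target precision can also be done programmatically. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` and a `LogisticRegression` stand-in rather than the Titanic features or the CatBoost model, and the names `X_demo`, `y_demo`, `target` and `chosen` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score

# Synthetic data and a stand-in model (assumptions; not the Titanic features or CatBoost)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
scores = clf.predict_proba(X_demo)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_demo, scores)

target = 0.80
# precision has one more entry than thresholds; drop the last one to align them
ok = np.where(precision[:-1] >= target)[0]
chosen = thresholds[ok[0]]  # lowest threshold that reaches the target precision

# Apply the threshold at prediction time; the model itself is not retrained
y_pred = (scores >= chosen).astype(int)
print(chosen, precision_score(y_demo, y_pred))
```

Choosing the lowest qualifying threshold keeps recall as high as possible for the requested precision. In practice the threshold would be chosen on held-out data rather than on the training scores.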

Another way is to plot the precision and recall against each other, as in the middle panel produced by ROCs(CBC).

# ROC AUC Curve
Another way to evaluate and compare your binary classifier is provided by the ROC AUC Curve. This curve plots the true positive rate (also called recall) against the false positive rate (ratio of incorrectly classified negative instances), instead of plotting the precision versus the recall.

+ The red line in the middle represents a purely random classifier (e.g. a coin flip), so your classifier should be as far away from it as possible. Our model seems to do a good job.
+ We also have a tradeoff here: the higher the true positive rate, the more false positives the classifier produces.

# ROC AUC Score
+ The ROC AUC Score is the score corresponding to the ROC AUC Curve. It is computed by measuring the area under the curve, which is called AUC.
+ A classifier that is 100% correct would have a ROC AUC Score of 1, and a completely random classifier would have a score of 0.5.
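
The two reference points can be checked with `roc_auc_score`. The labels and scores below are hypothetical; constant scores are used as a stand-in for a classifier that carries no ranking information, which gives exactly 0.5 (a truly random classifier would only be close to 0.5 on average).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, chosen only to illustrate the two reference points
y_true = np.array([0, 0, 1, 1, 0, 1])

# Every positive is scored above every negative
perfect_scores = np.array([0.1, 0.2, 0.8, 0.9, 0.3, 0.7])
# Identical scores for everyone: no ranking information at all
constant_scores = np.full(len(y_true), 0.5)

print(roc_auc_score(y_true, perfect_scores))   # 1.0
print(roc_auc_score(y_true, constant_scores))  # 0.5
```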

# Gradient Boost

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

param_test1 = {
    'n_estimators': [120, 130,140],
    'max_depth': [1, 2, 3],
    'subsample':[0.3, 0.5, 0.7, 1],
    'learning_rate': [0.006, 0.008, 0.01],
    'max_features': [0.3, 0.5, 0.7, 1]
}

scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(), 
                       param_grid = param_test1, scoring=scoring, cv=3, verbose = 5, refit='Accuracy')
gsearch1.fit(X_train, y_train)
gsearch1.best_params_, gsearch1.best_score_

In [None]:
GBC = GradientBoostingClassifier(learning_rate=0.008, n_estimators=130,max_depth= 3,subsample=.7, max_features=.3)
Models(GBC, "GradientBoostingClassifier", X_train, X_test, y_train, y_test, X, y)

In [None]:
ROCs(GBC)

# XGBoost

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

param_test1 = {'n_estimators': [80,90,100],'max_depth': [1, 2],
               'min_child_weight': [1,2],'subsample':[1],
               'colsample_bytree':[1],'reg_alpha':[0],
               'learning_rate': [0.013, 0.015, 0.017]}
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

gsearch1 = GridSearchCV(estimator = XGBClassifier(), 
                       param_grid = param_test1, 
                       scoring=scoring, cv=3, verbose = 5, refit='Accuracy')
gsearch1.fit(X_train, y_train)
gsearch1.best_params_, gsearch1.best_score_

In [None]:
XGBC = XGBClassifier(learning_rate=0.017, n_estimators=90, max_depth= 2, min_child_weight= 1, colsample_bytree= 1.,reg_alpha= 0,subsample= 1)
Models(XGBC, "XGBClassifier", X_train, X_test, y_train, y_test, X, y)

In [None]:
ROCs(XGBC)

In [None]:
%%time
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

params = {'loss_function':'Logloss','eval_metric':'AUC','verbose': False,
          'learning_rate': 0.025,'depth': 2,'l2_leaf_reg': 1,'n_estimators': 110}
CBC = CatBoostClassifier(**params)
Models(CBC, "CatBoostClassifier", X_train, X_test, y_train, y_test, X, y)

In [None]:
CBC.fit(X_train, y_train)
result = CBC.predict(Z)
sub = pd.DataFrame({'PassengerId':df_test.PassengerId,'Survived':result}) 
sub.to_csv('my_submission.csv', index=False)
sub.head(2)

Nice! We think that score is good enough to submit the predictions for the test set to the Kaggle leaderboard.

# Conclusion:
Throughout the data visualization, we could see that some groups of people, such as women, children and the upper class, were more likely to survive. CatBoostClassifier, XGBoost and GradientBoostingClassifier yielded similar results on the test set in this case.