## Titanic - Data Visualization and Predictions (Top 10% solution)

This notebook builds a simple solution to the Titanic problem, visualizes the data, and shows one way to structure the development of a machine learning solution.

If you find it helpful, please upvote! :)

## Imports

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold,StratifiedKFold, cross_val_score,GridSearchCV
import warnings
warnings.filterwarnings('ignore')
sns.set_style('white')

## Read data

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
train.head()

In [None]:
train.info()

In [None]:
test.info()

In [None]:
colors = ['#D92938','#A60522','#273859','#071526']

## Missing values

In [None]:
print("Train: ")
for col in train.columns:
    if train[col].isnull().values.any():
        print(col, train[col].isnull().sum())
      
print("\nTest:")
for col in test.columns:
    if test[col].isnull().values.any():
        print(col, test[col].isnull().sum())

In [None]:
#filling na values
train['Cabin'] = train['Cabin'].fillna('U')
test['Cabin'] = test['Cabin'].fillna('U')

train['Age'] = train['Age'].fillna( np.mean(train['Age']) )
test['Age'] = test['Age'].fillna( np.mean(test['Age']) )

train['Fare'] = train['Fare'].fillna( np.mean(train['Fare']) )
test['Fare'] = test['Fare'].fillna( np.mean(test['Fare']) )
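
The cell above fills each dataset with its own mean. A common variation, shown here only as a sketch and not as what this notebook does, is to compute the fill values on the training data alone and reuse them for the test data, so that both sets are treated identically. The toy frames and the choice of the median are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test (values are made up for illustration)
toy_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0],
                          "Fare": [7.25, 71.28, 8.05, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0],
                         "Fare": [np.nan, 7.0]})

# Compute the fill values once, on the training data only
fill_values = toy_train[["Age", "Fare"]].median()

# fillna with a Series fills each column with the value under the same label
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)

print(toy_test_filled)
```

The median is less sensitive than the mean to the few very expensive tickets in `Fare`.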

## EDA and some Feature Engineering

Looking at the chart below, we can notice that we got a higher percentage of males on Titanic!

In [None]:
plt.figure(figsize=(10,5))

# share of each sex in percent, in a fixed order: male first, then female
p = (train['Sex'].value_counts(normalize=True).reindex(['male','female'])*100).tolist()

g=sns.barplot(x=p,y=['Male','Female'], palette=colors,orient = 'h')

g.text(0, -0.7, 'Percentage of males and females on Titanic', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ['top', 'left', 'right', 'bottom']:
    g.spines[i].set_visible(False)
    
for i in range(2):
    g.annotate(f'{round(p[i])}%', 
                xy=(p[i]/2, i),
                ha = 'center', va='center',fontsize=50, fontweight='bold', 
                fontfamily='Serif', color='white')
    g.annotate('Male' if i==0 else 'Female', 
                xy=(p[i]/2, i+0.25),
                ha = 'center', va='center',fontsize=16, fontweight='bold', 
                fontfamily='Serif', color='white')
    
g.set(xticklabels=[],yticklabels=[])
plt.ylabel('')
plt.xlabel('')

In [None]:
#boolean column for sex
train["bool_sex"] = (train["Sex"] == "male").astype(int)
test["bool_sex"] = (test["Sex"] == "male").astype(int)

Unfortunately, most of the passengers did not survive!

In [None]:
plt.figure(figsize=(10,5))

# value_counts is indexed by label: 1 = survived, 0 = did not survive
counts = train['Survived'].value_counts(normalize=True)*100
p = [counts[1], counts[0]]

g = sns.barplot(x=p, y=['Survived','Not Survived'],palette=colors)

g.text(0, -0.7, 'Percentage of survivors and victims on Titanic', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ['top', 'left', 'right', 'bottom']:
    g.spines[i].set_visible(False)
    
for i in range(2):
    g.annotate(f'{round(p[i])}%', 
                xy=(p[i]/2, i),
                ha = 'center', va='center',fontsize=50, fontweight='bold', 
                fontfamily='Serif', color='white')
    g.annotate('Survived' if i==0 else 'Not survived', 
                xy=(p[i]/2, i+0.25),
                ha = 'center', va='center',fontsize=16, fontweight='bold', 
                fontfamily='Serif', color='white')
    
g.set(xticklabels=[],yticklabels=[])
plt.ylabel('')
plt.xlabel('')

Clearly the sex of the passengers was crucial. The percentage of women surviving is much higher than that of men surviving, as we can see!

In [None]:
fig, ax = plt.subplots(1,1, figsize=(10,5))

d = train['Sex'].value_counts()
d1 = train[ train['Sex'] == 'male' ]['Survived'].value_counts()
d2 = train[ train['Sex'] == 'female' ]['Survived'].value_counts()

g = plt.bar(d.index,d,label='Survived',color=colors[3])
g1 = plt.bar(d.index, [d1[0],d2[0]],label='Not Survived',color=colors[1])

ax.text(0, 700, 'Amount of survivors and victims by Sex ', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ['top', 'left', 'right']:
    ax.spines[i].set_visible(False)
    
for i in g1.patches:
    ax.text(i.get_x()+i.get_width()/2,
           i.get_height()/2.5,
           round(i.get_height()),
           fontsize='18',
           fontfamily='Serif',
           color='white',va='center',ha='center')
c=0
for i in g.patches:
    ax.text(i.get_x()+i.get_width()/2,
           g1.patches[c].get_height()+(i.get_height()-g1.patches[c].get_height())/2.5,
           round(i.get_height()-g1.patches[c].get_height()),
           fontsize='18',
           fontfamily='Serif',
           color='white',va='center',ha='center')
    c+=1
    
plt.legend(loc='upper right',prop={'size': 12, 'family': 'Serif'})
ax.set(xticklabels=['Male','Female'],yticklabels=[])
plt.xlabel('')
plt.ylabel('')
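
The counts in the chart can also be summarized as survival rates. Because `Survived` is coded as 0/1, the mean of the column within each group is the share of survivors. The sketch below uses a tiny made-up frame so it runs on its own; with the real data the same `groupby` would be applied to `train`.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group = survival rate per group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```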

The next chart shows that most passengers are adults up to 50 years old. Splitting the ages into groups looks like a good idea: it gives clearer visualizations and may help our model.

In [None]:
fig, ax = plt.subplots(1, figsize=(10,5))

g=plt.hist(train['Age'],color=colors[3])

ax.text(0, 430, 'Histogram of passengers age on Titanic', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ax.patches:
    ax.text(i.get_x()+i.get_width()/2,
           i.get_height()+6,
           round(i.get_height()),
           fontsize='13',
           fontfamily='Serif',va='center',ha='center')
    
for i in ['top', 'right','left']:
    ax.spines[i].set_visible(False)

ax.set(yticklabels=[])
plt.xlabel('')
plt.ylabel('')

In [None]:
train['Child'] = (train['Age'] <= 12).astype(int)
test['Child'] = (test['Age'] <= 12).astype(int)

train['Young'] = train['Age'].apply(lambda x: x> 12 and x<=18).astype(int)
test['Young'] = test['Age'].apply(lambda x: x> 12 and x<=18).astype(int)

train['Adult'] = train['Age'].apply(lambda x: x> 18 and x<=55).astype(int)
test['Adult'] = test['Age'].apply(lambda x: x> 18 and x<=55).astype(int)

train['Old'] = train['Age'].apply(lambda x: x>55).astype(int)
test['Old'] = test['Age'].apply(lambda x: x>55).astype(int)
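
The four indicator columns above can also be produced in one step. The following is a sketch of an equivalent approach using `pd.cut` with the same boundaries (12, 18, 55); the sample ages are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample ages chosen to hit every boundary
ages = pd.Series([4, 12, 15, 18, 30, 55, 60])

# Intervals are right-closed by default: (-inf, 12], (12, 18], (18, 55], (55, inf)
groups = pd.cut(ages,
                bins=[-np.inf, 12, 18, 55, np.inf],
                labels=["Child", "Young", "Adult", "Old"])

# One 0/1 column per age group, like the Child/Young/Adult/Old columns
dummies = pd.get_dummies(groups).astype(int)
print(dummies)
```

Keeping the boundaries in a single list makes it harder for the groups to overlap or leave gaps if the cut points are changed later.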

Looking again at the ages of our passengers, we can clearly see that we have a larger number of adults.



In [None]:
plt.figure(figsize=(10,5))

g= sns.barplot(x=['Child','Young','Adult','Old'],
               y=[sum(train['Child']),sum(train['Young']),
               sum(train['Adult']),sum(train['Old'])],
               palette=colors)

g.text(0, 900,'Amount of passengers in each age group', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ['top', 'left', 'right']:
    g.spines[i].set_visible(False)
    
for i in g.patches:
    g.text(i.get_x()+i.get_width()/2,
           i.get_height()+25,
           round(i.get_height()),
           fontsize='18',
           fontfamily='Serif',va='center',ha='center')
    
g.set(yticklabels=[])
plt.ylabel('')
plt.xlabel('')

The fare histogram shows that the vast majority of tickets cost less than $100, although a few tickets exceeded that amount.



In [None]:
fig, ax = plt.subplots(1, figsize=(10,5))

g=plt.hist(train['Fare'], color=colors[3])

ax.text(0, 850, 'Histogram of passengers fare on Titanic', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ax.patches:
    ax.text(i.get_x()+i.get_width()/2,
           i.get_height()+16,
           round(i.get_height()),
           fontsize='13',
           fontfamily='Serif',va='center',ha='center')
    
for i in ['top', 'right','left']:
    ax.spines[i].set_visible(False)

ax.set(yticklabels=[])
plt.xlabel('')
plt.ylabel('')

We turn the Fare column into a binary feature: above 100 or not.

In [None]:
# 1 if the ticket cost more than 100, else 0
train['cat_fare'] = (train['Fare'] > 100).astype(int)
test['cat_fare'] = (test['Fare'] > 100).astype(int)

The data also tell us the port at which each passenger embarked. A chart relating this to survival shows that the only port where the majority of passengers survived was port "C".

In [None]:
plt.figure(figsize=(10,5))

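# pass the column as x explicitly; newer seaborn versions treat the first positional argument as data
g=sns.countplot(x=train['Embarked'],hue=train['Survived'],palette=colors[1:])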

g.text(0, 550,'Amount of survivors and victims in each port', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ['top', 'left', 'right']:
    g.spines[i].set_visible(False)
    
for i in g.patches:
    g.text(i.get_x()+i.get_width()/2,
           i.get_height()+16,
           round(i.get_height()),
           fontsize='18',
           fontfamily='Serif',va='center',ha='center')
    
g.set(yticklabels=[])
plt.legend(['Not Survived','Survived'],loc='upper right',prop={'size': 15, 'family': 'Serif'})
plt.ylabel('')
plt.xlabel('')

A new feature encodes the Embarked column as an integer.

In [None]:
def emb_feature(x):
    if x=='S':
        return 0
    elif x=='C':
        return 1
    else:
        # everything else, i.e. 'Q' and also missing values, is mapped to 2
        return 2
train['emb'] = train['Embarked'].apply(emb_feature)
test['emb'] = test['Embarked'].apply(emb_feature)
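
Note that `emb_feature` sends every value other than 'S' and 'C' to 2, so missing ports end up in the same group as 'Q'. If that is not wanted, one option is to fill the missing ports first and then map with a dictionary. The sketch below assumes filling with the most frequent port, which is a choice made here for illustration and not something this notebook does.

```python
import numpy as np
import pandas as pd

# Toy column with one missing port (made up for illustration)
embarked = pd.Series(["S", "C", "Q", np.nan, "S"])

port_codes = {"S": 0, "C": 1, "Q": 2}

# Assumption: replace missing ports with the most frequent one
most_common = embarked.mode()[0]
emb = embarked.fillna(most_common).map(port_codes)
print(emb.tolist())
```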


Looking at the Pclass column, most of the passengers in the first class survived, as we could imagine. The vast majority of passengers in the third class did not survive.



In [None]:
plt.figure(figsize=(10,5))

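# pass the column as x explicitly; newer seaborn versions treat the first positional argument as data
g=sns.countplot(x=train['Pclass'],hue=train['Survived'],palette=colors[1:])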


g.text(0, 500,'Amount of survivors and victims in each class', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ['top', 'left', 'right']:
    g.spines[i].set_visible(False)
    
for i in g.patches:
    g.text(i.get_x()+i.get_width()/2,
           i.get_height()+16,
           round(i.get_height()),
           fontsize='18',
           fontfamily='Serif',va='center',ha='center')
    
plt.legend(['Not Survived','Survived'],loc='upper left',prop={'size': 15, 'family': 'Serif'})
g.set(yticklabels=[])
plt.ylabel('')
plt.xlabel('')

The "SibSp" column counts the siblings/spouses a passenger had aboard the Titanic and "Parch" counts the parents/children, so I will create a new column that flags whether the passenger was travelling alone, plus a family size column.

In [None]:
# 1 when the passenger has no siblings, spouse, parents or children aboard
train['is_alone'] = ((train['SibSp']==0) & (train['Parch']==0)).astype(int)
test['is_alone'] = ((test['SibSp']==0) & (test['Parch']==0)).astype(int)

train['fml_size'] = train['SibSp'] + train['Parch']
test['fml_size'] = test['SibSp'] + test['Parch']

The chart below shows that passengers travelling alone died in a higher proportion than those travelling with family.

In [None]:
fig, ax = plt.subplots(1,1, figsize=(10,5))

# reindex fixes the bar order (0 = not alone, 1 = alone) so it matches d1 and d2
d = train['is_alone'].value_counts().reindex([0, 1])
d1 = train[ train['is_alone'] == 0 ]['Survived'].value_counts()
d2 = train[ train['is_alone'] == 1 ]['Survived'].value_counts()

d.index = ['Not alone', 'Alone']
g = plt.bar(d.index,d,label='Survived',color=colors[3])
g1 = plt.bar(d.index, [d1[0],d2[0]],label='Not Survived',color=colors[1])

ax.text(-0.35, 700, 'Amount of survivors and victims for passengers alone and not alone', 
       fontsize=14, fontweight='bold', fontfamily='Serif',color='black')

for i in ['top', 'left', 'right']:
    ax.spines[i].set_visible(False)
    
for i in g1.patches:
    ax.text(i.get_x()+i.get_width()/2,
           i.get_height()/2.5,
           round(i.get_height()),
           fontsize='18',
           fontfamily='Serif',
           color='white',va='center',ha='center')
c=0
for i in g.patches:
    ax.text(i.get_x()+i.get_width()/2,
           g1.patches[c].get_height()+(i.get_height()-g1.patches[c].get_height())/2.5,
           round(i.get_height()-g1.patches[c].get_height()),
           fontsize='18',
           fontfamily='Serif',
           color='white',va='center',ha='center')
    c+=1
    
plt.legend(loc='upper right',prop={'size': 12, 'family': 'Serif'})
ax.set(yticklabels=[])
plt.xlabel('')
plt.ylabel('')

## Validation


In [None]:
cols = ['Pclass','bool_sex','Child','Young','Adult','Old',
        'fml_size','is_alone','cat_fare','emb','SibSp','Parch']
SEED = 30
X = train[cols]
y = train['Survived']

cv = StratifiedKFold(n_splits=5, shuffle=True)

models = [ LogisticRegression(), LinearSVC(), RandomForestClassifier()]
m_name = ['Logistic Regression', 'Linear SVC', 'Random Forest']

for name, item in zip(m_name, models):
    # re-seed before each model so the shuffled folds are the same for every model
    np.random.seed(SEED)
    results = cross_val_score(item, X, y, cv = cv,scoring = 'accuracy')
    mean = results.mean()
    dv = results.std()
    print('Accuracy - {}: {:.2f}%'.format(name, mean*100))
    print('Expected accuracy - Model {}: [{:.2f}% ~ {:.2f}%]\n'.format(name,(mean - 2*dv)*100, (mean + 2*dv)*100))
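
The "expected accuracy" range printed above is the mean of the fold scores plus and minus two standard deviations. It is a rough band rather than a formal confidence interval. The arithmetic can be checked by hand with a few hypothetical fold scores (made up for illustration):

```python
import numpy as np

# Hypothetical accuracies from 5 folds, made up for illustration
fold_scores = np.array([0.80, 0.82, 0.78, 0.84, 0.81])

mean = fold_scores.mean()
dv = fold_scores.std()  # population standard deviation (ddof=0), numpy's default

low, high = mean - 2 * dv, mean + 2 * dv
print('[{:.2f}% ~ {:.2f}%]'.format(low * 100, high * 100))  # [77.00% ~ 85.00%]
```

Here the mean is 0.81 and the standard deviation is 0.02, so the band is 0.81 ± 0.04.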

Random Forest is the best model in our validation, so we will tune its hyperparameters for the final model.

The grid search below is commented out because the cell takes a while to run (~30 minutes).

In [None]:
# param_grid = { 
#     "criterion" : ["gini", "entropy"], 
#     "min_samples_leaf" : [1, 5, 10, 25, 50, 70], 
#     "min_samples_split" : [2, 4, 10, 12, 16, 18, 25, 35], 
#     "n_estimators": [100, 400, 700, 1000, 1500]
# }
# CV_rfc = GridSearchCV(estimator=RandomForestClassifier(), 
#                       param_grid=param_grid, 
#                       cv= 5)
# CV_rfc.fit(X, y)

# CV_rfc.best_params_
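
To see why the search is slow, it helps to count how many models it trains. The sketch below copies the grid from the commented cell and multiplies the number of options per parameter by the number of folds (`cv=5`). It only does the arithmetic and does not fit anything.

```python
import math

# Same grid as in the commented-out cell
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10, 25, 50, 70],
    "min_samples_split": [2, 4, 10, 12, 16, 18, 25, 35],
    "n_estimators": [100, 400, 700, 1000, 1500],
}
n_folds = 5

# Every combination of parameter values is tried once per fold
n_combinations = math.prod(len(values) for values in param_grid.values())
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 480 2400
```

Many of those fits use forests with 1000 or 1500 trees, which accounts for most of the run time.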

## Model

In [None]:
cols = ['Pclass','bool_sex','Child','Young','Adult','Old',
        'fml_size','is_alone','cat_fare','emb','SibSp','Parch']
X = train[cols]
y = train['Survived']

Final model with tuned Random Forest Classifier.

In [None]:
model = RandomForestClassifier(
     min_samples_leaf= 1,
     min_samples_split= 4,
     criterion= 'gini',
     n_estimators= 100
)
model.fit(X,y)
p = model.predict(test[cols])

Generating submission.

In [None]:
sub = pd.Series(p,index=test['PassengerId'],name='Survived')
sub.to_csv('submission.csv',header=True)
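
Before uploading, it can be worth checking that the file has the expected two columns. The sketch below builds a submission the same way as the cell above, but from made-up IDs and predictions, writes it to an in-memory buffer instead of a file, and reads it back.

```python
import io
import pandas as pd

# Made-up passenger IDs and predictions, for illustration only
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
preds = [0, 1, 0]

# The index keeps the name of the Series it is built from
sub = pd.Series(preds, index=passenger_ids, name="Survived")

buffer = io.StringIO()
sub.to_csv(buffer, header=True)

# Read the CSV text back to inspect the header and shape
check = pd.read_csv(io.StringIO(buffer.getvalue()))
print(check.columns.tolist(), check.shape)
```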