# Titanic machine learning analysis

### Goal

The RMS Titanic was the world's largest passenger ship when it entered service, and its sinking is one of the best-known disasters in history. The dataset in our files describes the passengers on that voyage and records whether each of them survived.
Our goal is to evaluate our own model while answering the question: “what sorts of people were more likely to survive?” We will try to predict who survived the disaster. At the end, we will compare the results of our model with those of the built-in logistic regression model.

### Loading necessary modules for our task

In [1]:
# import necessary modules for data analysis and data visualization
# data analysis modules

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

import warnings                                                            # importing warnings library
warnings.filterwarnings('ignore')                                          # ignore warning

from sklearn.preprocessing import StandardScaler                           # import standard scaler for transformation
from sklearn.ensemble import RandomForestRegressor                         # import random forest regressor
from sklearn.model_selection import train_test_split                       # import train test split
from sklearn.metrics import mean_absolute_error, classification_report     # import needed functions from metrics

# Models
from core.models import LogisticRegression                                 # import our implementation of logistic regression
from sklearn.linear_model import LogisticRegression as LR                  # import built-in logistic regression model 

### Basic data description

In [2]:
# loading data to data frame
data = pd.read_csv("./datasets/titanic/titanic.csv")

| Variable Name | Description                       | Type    |
|---------------|-----------------------------------|---------|
| PassengerId   | Passenger's ID                    | int64   |
| Survived      | Survived (1) or died (0)          | int64   |
| Pclass        | Passenger’s class                 | int64   |
| Name          | Passenger’s name                  | object  |
| Sex           | Passenger’s sex                   | object  |
| Age           | Passenger’s age                   | float64 |
| SibSp         | Number of siblings/spouses aboard | int64   |
| Parch         | Number of parents/children aboard | int64   |
| Ticket        | Ticket number                     | object  |
| Fare          | Fare                              | float64 |
| Cabin         | Cabin                             | object  |
| Embarked      | Port of embarkation               | object  |

### Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

In [3]:
# number of samples (rows) and columns in the dataset
print('dataset shape: \t', data.shape)

dataset shape: 	 (891, 12)


In [4]:
# 5 samples of the dataset
data.sample(5)

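|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 182 | 183 | 0 | 3 | Asplund, Master. Clarence Gustaf Hugo | male | 9.0 | 4 | 2 | 347077 | 31.3875 | NaN | S |
| 237 | 238 | 1 | 2 | Collyer, Miss. Marjorie "Lottie" | female | 8.0 | 0 | 2 | C.A. 31921 | 26.25 | NaN | S |
| 283 | 284 | 1 | 3 | Dorking, Mr. Edward Arthur | male | 19.0 | 0 | 0 | A/5. 10482 | 8.05 | NaN | S |
| 312 | 313 | 0 | 2 | Lahtinen, Mrs. William (Anna Sylfven) | female | 26.0 | 1 | 1 | 250651 | 26.0 | NaN | S |
| 632 | 633 | 1 | 1 | Stahelin-Maeglin, Dr. Max | male | 32.0 | 0 | 0 | 13214 | 30.5 | B50 | C |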


### Data preprocessing

In [5]:
# filling the null values in the Embarked column with "C" (a hard-coded value rather than a computed mode)
data["Embarked"] = data["Embarked"].fillna("C")
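
The cell above hard-codes the fill value. If the most frequent port should instead be derived from the data, a minimal sketch could look like the following. The toy series and the variable names are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

# toy stand-in for the Embarked column (values are made up for illustration)
embarked = pd.Series(["S", "C", "S", None, "Q", "S", None])

# Series.mode() returns a Series because ties are possible; take the first entry
most_common_port = embarked.mode()[0]

# assign the result back instead of relying on inplace=True
embarked_filled = embarked.fillna(most_common_port)

print(most_common_port)
print(embarked_filled.isnull().sum())
```

Assigning the result of `fillna` back to the column avoids depending on in-place modification through attribute access, which newer pandas versions may not propagate to the parent frame.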

In [6]:
# estimating the cabin (deck) letter for NaN values

data["Cabin"] = data["Cabin"].fillna("N")             # "N" marks an unknown cabin
data.Cabin = [i[0] for i in data.Cabin]               # keep only the first letter of the cabin

with_N = data[data.Cabin == "N"].copy()               # copy, so the assignment below does not act on a view
without_N = data[data.Cabin != "N"]
data.groupby("Cabin")['Fare'].mean().sort_values()    # mean fare per cabin letter (reference for the thresholds)

def cabin_estimator(i):
    """Map a fare to a cabin letter using fixed fare thresholds."""
    if i < 16:
        return "G"
    elif i < 27:
        return "F"
    elif i < 38:
        return "T"
    elif i < 47:
        return "A"
    elif i < 53:
        return "E"
    elif i < 54:
        return "D"
    elif i < 116:
        return "C"
    else:
        return "B"

with_N['Cabin'] = with_N.Fare.apply(cabin_estimator)
data = pd.concat([with_N, without_N], axis=0)
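
The threshold chain in `cabin_estimator` can also be written as a table lookup. The sketch below uses the standard-library `bisect` module with the same bounds and letters; the names `FARE_BOUNDS`, `DECKS` and `cabin_from_fare` are hypothetical and only serve the illustration.

```python
from bisect import bisect_right

# upper fare bounds and cabin letters copied from cabin_estimator above
FARE_BOUNDS = [16, 27, 38, 47, 53, 54, 116]
DECKS = ["G", "F", "T", "A", "E", "D", "C", "B"]

def cabin_from_fare(fare):
    """Return the cabin letter whose fare band contains `fare`."""
    # bisect_right counts how many bounds are <= fare, which is the band index
    return DECKS[bisect_right(FARE_BOUNDS, fare)]

for fare in (7.25, 16, 30.5, 53.5, 115.99, 116, 263.0):
    print(fare, cabin_from_fare(fare))
```

Keeping the bounds in one list makes it easier to adjust them if the per-cabin mean fares change.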

In [7]:
# replace null Fare values with the mean fare of third-class male passengers who embarked at "S", then remove outliers

missing_value = data[(data.Pclass == 3) & (data.Embarked == "S") & (data.Sex == "male")].Fare.mean()
data["Fare"] = data["Fare"].fillna(missing_value)
data = data[data.Fare < 500]

In [8]:
# sex column mapping
# 0 for female and 1 for male

data['Sex'] = data.Sex.map({'male': 1, 'female': 0})

In [9]:
# extracting the title from the Name column ("Surname, Title. Given names")
# note: the extracted title keeps its leading space (e.g. " Mr")

data["title"] = [i.split('.')[0] for i in data.Name]
data["title"] = [i.split(',')[1] for i in data.title]


# we have to map some special titles: equivalents are merged, uncommon ones become 'rare'
title_map = {
    'Ms': 'Miss', 'Mlle': 'Miss', 'Mme': 'Mrs',
    'Dr': 'rare', 'Col': 'rare', 'Major': 'rare', 'Don': 'rare',
    'Jonkheer': 'rare', 'Sir': 'rare', 'Lady': 'rare', 'Capt': 'rare',
    'the Countess': 'rare', 'Rev': 'rare',
}
for old, new in title_map.items():                    # applied in the order listed above
    data["title"] = [i.replace(old, new) for i in data.title]
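
Because `str.replace` matches substrings, a hypothetical title such as `Dona` would become `rarea` in the cell above. A stricter variant, sketched here under the assumption that every name follows the "Surname, Title. Given names" pattern, extracts the title with a regular expression and maps it by exact match. The helper name `extract_title` and the `"unknown"` fallback are assumptions, not part of the notebook.

```python
import re

# exact-match mapping; the grouping mirrors the replacements in the cell above
TITLE_MAP = {
    "Ms": "Miss", "Mlle": "Miss", "Mme": "Mrs",
    "Dr": "rare", "Col": "rare", "Major": "rare", "Don": "rare",
    "Jonkheer": "rare", "Sir": "rare", "Lady": "rare", "Capt": "rare",
    "the Countess": "rare", "Rev": "rare",
}

def extract_title(name):
    """Return the normalized title from a 'Surname, Title. Given names' string."""
    match = re.search(r",\s*([^.]+)\.", name)
    if match is None:
        return "unknown"   # assumption: fallback label for unparseable names
    title = match.group(1).strip()
    return TITLE_MAP.get(title, title)

print(extract_title("Asplund, Master. Clarence Gustaf Hugo"))
print(extract_title("Stahelin-Maeglin, Dr. Max"))
print(extract_title('Collyer, Miss. Marjorie "Lottie"'))
```

Note that this variant strips the leading space, so the resulting dummy columns would be named `title_Master` rather than `title_ Master`.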

In [10]:
# checking who is travelling alone on board

data['family_size'] = data.SibSp + data.Parch+1

def family_group(size):
    a = ''
    if (size <= 1):
        a = 'loner'
    elif (size <= 4):
        a = 'small'
    else:
        a = 'large'
    return a

data['family_group'] = data['family_size'].map(family_group)
data['is_alone'] = [1 if i<2 else 0 for i in data.family_size]

In [11]:
# deleting info about ticket, it's just redundant
data.drop(['Ticket'], axis=1, inplace=True)

In [12]:
# calculating fare based on family size

data['calculated_fare'] = data.Fare/data.family_size

def fare_group(fare):
    a= ''
    if fare <= 4:
        a = 'Very_low'
    elif fare <= 10:
        a = 'low'
    elif fare <= 20:
        a = 'mid'
    elif fare <= 45:
        a = 'high'
    else:
        a = "very_high"
    return a

# creating fare group column
data['fare_group'] = data['calculated_fare'].map(fare_group)
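
The same binning can be expressed with `pandas.cut`, which may be easier to maintain than an if/elif chain. This is a sketch on made-up fares; the bin edges and labels are copied from `fare_group` above, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# toy per-person fares (made-up values covering every band)
calculated_fare = pd.Series([0.0, 4.0, 7.25, 10.0, 17.16, 45.0, 80.0])

# right=True (the default) makes each bin closed on the right, matching `<=`
bins = [-np.inf, 4, 10, 20, 45, np.inf]
labels = ["Very_low", "low", "mid", "high", "very_high"]
fare_group = pd.cut(calculated_fare, bins=bins, labels=labels)

print(fare_group.tolist())
```

One behavioural difference: `pandas.cut` leaves a missing fare as missing, whereas the if/elif chain would fall through to its `else` branch.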

In [13]:
# passenger ID is redundant information, it won't help us in prediction
data.drop(['PassengerId'], axis=1, inplace=True)

In [14]:
# get dummies will help our model to make use of our categorical data
data = pd.get_dummies(data, columns=['title','Pclass', 'Cabin','Embarked', 'family_group', 'fare_group'], drop_first=False)

# now we can drop redundant columns
data.drop(['family_size','Name', 'Fare'], axis=1, inplace=True)
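
To make the effect of `get_dummies` and its `drop_first` flag concrete, here is a small sketch on a toy frame. The data is illustrative, and `dtype=int` is added here so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# toy frame with one categorical column (values are illustrative)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"], "Sex": [1, 0, 0, 1]})

# one indicator column per category; dtype=int keeps 0/1 instead of booleans
full = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# drop_first=True removes the first category, which then acts as the baseline
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=False`, the indicator columns of one feature always sum to 1, so they are collinear with an intercept term. Whether that matters depends on the model and its regularization.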

In [15]:
# rearranging the columns so that we can easily use the dataframe to predict the missing age values
data = pd.concat([data[["Survived", "Age", "Sex","SibSp","Parch"]], data.loc[:,"is_alone":]], axis=1)

def predict_age(df):
    age_df = df.loc[:,"Age":]                            # getting all the features except Survived
    
    temp_train = age_df.loc[age_df.Age.notnull()]        # df with age values
    temp_test = age_df.loc[age_df.Age.isnull()]          # df without age values
    
    y = temp_train.Age.values                            # target variable (age)
    x = temp_train.loc[:, "Sex":].values                 # all remaining columns as features
    
    rfr = RandomForestRegressor(n_estimators=1500, n_jobs=-1)
    rfr.fit(x, y)
    
    predicted_age = rfr.predict(temp_test.loc[:, "Sex":].values)   # arrays for both fit and predict
    
    df.loc[df.Age.isnull(), "Age"] = predicted_age
    return df

# age prediction for null cells
data = predict_age(data)
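
The idea behind `predict_age`, fitting a regressor on the rows where the value is known and predicting it for the rest, can be shown on synthetic data. Everything in this sketch (the toy frame, the two features, `n_estimators=100`, `random_state=0`) is an assumption chosen for illustration, not the notebook's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# synthetic data: Age depends loosely on a made-up SibSp feature
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Sex": rng.integers(0, 2, 60), "SibSp": rng.integers(0, 4, 60)})
toy["Age"] = 30 - 5 * toy["SibSp"] + rng.normal(0, 2, 60)
toy.loc[::6, "Age"] = np.nan              # hide every sixth age

known = toy[toy["Age"].notnull()]
unknown = toy[toy["Age"].isnull()]
features = ["Sex", "SibSp"]

# random_state is an addition here so that repeated runs give the same ages
rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(known[features], known["Age"])
toy.loc[toy["Age"].isnull(), "Age"] = rfr.predict(unknown[features])

print(toy["Age"].isnull().sum())
```

Without a fixed `random_state`, the imputed ages (and everything derived from them) can differ slightly between runs.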

In [16]:
def age_group(age):
    a = ''
    if age <= 1:
        a = 'infant'
    elif age <= 4: 
        a = 'toddler'
    elif age <= 13:
        a = 'child'
    elif age <= 18:
        a = 'teenager'
    elif age <= 35:
        a = 'Young_Adult'
    elif age <= 45:
        a = 'adult'
    elif age <= 55:
        a = 'middle_aged'
    elif age <= 65:
        a = 'senior_citizen'
    else:
        a = 'old'
    return a
        
# applying the "age_group" function to the "Age" column
data['age_group'] = data['Age'].map(age_group)

# creating dummies for "age_group" feature
data = pd.get_dummies(data,columns=['age_group'], drop_first=True)

In [17]:
# 10 samples of our dataset
data.sample(10)

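|     | Survived | Age | Sex | SibSp | Parch | is_alone | calculated_fare | title_ Master | title_ Miss | title_ Mr | ... | fare_group_mid | fare_group_very_high | age_group_adult | age_group_child | age_group_infant | age_group_middle_aged | age_group_old | age_group_senior_citizen | age_group_teenager | age_group_toddler |
|-----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 0 | 30.742502 | 1 | 0 | 0 | 1 | 7.25 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 132 | 0 | 47.0 | 0 | 1 | 0 | 0 | 7.25 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 571 | 1 | 53.0 | 0 | 2 | 0 | 0 | 17.159733 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 729 | 0 | 25.0 | 0 | 1 | 0 | 0 | 3.9625 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 796 | 1 | 49.0 | 0 | 0 | 0 | 1 | 25.9292 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 707 | 1 | 42.0 | 1 | 0 | 0 | 1 | 26.2875 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 674 | 0 | 32.544197 | 1 | 0 | 0 | 1 | 0.0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 494 | 0 | 21.0 | 1 | 0 | 0 | 1 | 8.05 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 783 | 0 | 23.375333 | 1 | 1 | 2 | 0 | 5.8625 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 363 | 0 | 35.0 | 1 | 0 | 0 | 1 | 7.05 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |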


In [18]:
# creating our final prediction datasets
X = data.drop(['Survived'], axis = 1)
y = data['Survived']

# train-test split with test size = 0.2 and random state = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 0)
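
The split above is random without stratification. When the two classes are imbalanced, passing `stratify` keeps the class ratio equal in both partitions. The sketch below demonstrates this on synthetic labels; the arrays and names are made up, and whether stratification would change the results reported in this notebook is not something the notebook establishes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic features and labels with a 60/40 class ratio
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 30 + [1] * 20)

# stratify=y_demo keeps the 60/40 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```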

## Comparison - Our model vs built-in model

### Creating models
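
The implementation behind `core.models.LogisticRegression` is not shown in this notebook. As a rough illustration only, a basic logistic regression trained with batch gradient descent on the log-loss might look like the sketch below. The class name `SimpleLogisticRegression` and the parameters `lr` and `n_iter` are hypothetical, and nothing here should be read as the actual code in `core.models`.

```python
import numpy as np

class SimpleLogisticRegression:
    """Hypothetical minimal classifier; not the code in core.models."""

    def __init__(self, lr=0.1, n_iter=2000):
        self.lr = lr            # learning rate (assumed value)
        self.n_iter = n_iter    # number of gradient steps (assumed value)

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.w = np.zeros(X.shape[1])
        self.b = 0.0
        for _ in range(self.n_iter):
            p = self._sigmoid(X @ self.w + self.b)
            error = p - y                            # gradient of the log-loss w.r.t. the logit
            self.w -= self.lr * (X.T @ error) / len(y)
            self.b -= self.lr * error.mean()
        return self

    def predict(self, X):
        p = self._sigmoid(np.asarray(X, dtype=float) @ self.w + self.b)
        return (p >= 0.5).astype(int)                # threshold probabilities at 0.5

# tiny linearly separable example
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = SimpleLogisticRegression().fit(X_demo, y_demo)
print(model.predict(X_demo))
```

For comparison, scikit-learn's `LogisticRegression` applies L2 regularization by default, so a plain implementation like this sketch and the built-in model would not be expected to produce identical coefficients.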

In [19]:
# creating logistic regression model of our implementation
our_model = LogisticRegression()

# fit the model with "X_train" and "y_train"
our_model.fit(X_train,y_train)

# once the model is trained we want to find out how well it performs, so we test it
# we use the "X_test" portion of the data (not used to fit the model) to predict the outcome
y_pred = our_model.predict(X_test)

# the predictions are saved in the "y_pred" variable
# then we compare the predicted values ("y_pred") with the actual values ("y_test") to see how well our model performs
print ("Our model accuracy Score is: \n", classification_report(y_test, y_pred), '\n')

# once we have the result of our model we can compare it to the built-in version of logistic regression
# we create and train that model on the same data, then predict the outcome and print the report
bi_model = LR()
bi_model.fit(X_train, y_train)
y_pred1 = bi_model.predict(X_test)
print ("Built in model accuracy Score is: \n", classification_report(y_test, y_pred1))

Our model accuracy Score is: 
               precision    recall  f1-score   support

           0       0.89      0.78      0.83       115
           1       0.68      0.83      0.74        63

    accuracy                           0.80       178
   macro avg       0.78      0.80      0.79       178
weighted avg       0.81      0.80      0.80       178
 

Built in model accuracy Score is: 
               precision    recall  f1-score   support

           0       0.86      0.85      0.86       115
           1       0.73      0.75      0.74        63

    accuracy                           0.81       178
   macro avg       0.80      0.80      0.80       178
weighted avg       0.82      0.81      0.81       178
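
The per-class columns in the reports above are derived from confusion counts: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean, while support is the number of true samples of that class. A minimal sketch with made-up counts (not taken from the reports) follows; the helper name `class_metrics` is hypothetical.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# made-up counts for illustration, not taken from the reports above
precision, recall, f1 = class_metrics(tp=45, fp=15, fn=5)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```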



## Summary

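Both models were trained and evaluated on the same train and test sets. The classification reports show only a small difference between our model and the built-in one: accuracy is 0.80 versus 0.81, and the F1-score for survivors (class 1) is 0.74 for both. Our model reaches higher recall on survivors (0.83 versus 0.75) at the cost of lower precision (0.68 versus 0.73), so overall the results slightly favor the built-in model.

Given that our implementation is much more basic, we consider these results satisfactory.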