# Titanic machine learning analysis

### Goal

The RMS Titanic was the world's largest passenger ship when it entered service, and its sinking is one of the best-known disasters in history. Our dataset describes the people who took part in that voyage and records whether each of them survived.
Our goal is to evaluate our own model while answering the question: “what sorts of people were more likely to survive?”. First we preprocess the raw data, decide which columns are useful, and derive new features from the existing ones where needed. Then we try to predict who survived the disaster. Finally, we compare the results of our model with those of the built-in logistic regression model.

### Loading necessary modules for our task

In [1]:
# import necessary modules for data analysis and data visualization
# data analysis modules

import pandas as pd
import numpy as np
from time import perf_counter as clock                                     # time.clock was removed in Python 3.8
from matplotlib import pyplot as plt
import seaborn as sns

import warnings                                                            # importing warnings library
warnings.filterwarnings('ignore')                                          # ignore warning

from sklearn.preprocessing import StandardScaler                           # import standard scaler for transformation
from sklearn.ensemble import RandomForestRegressor                         # import random forest regressor
from sklearn.model_selection import train_test_split                       # import train test split
from sklearn.metrics import mean_absolute_error, classification_report     # import needed functions from metrics

# Models
from core.models import LogisticRegression                                 # import our implementation of logistic regression
from sklearn.linear_model import LogisticRegression as LR                  # import built-in logistic regression model 

### Basic data description

Loading the data into a DataFrame:

In [2]:
data = pd.read_csv("./datasets/titanic/titanic.csv")

| Variable Name | Description                       | Type    |
|---------------|-----------------------------------|---------|
| PassengerId   | Passenger's ID                    | int64   |
| Survived      | Survived (1) or died (0)          | int64   |
| Pclass        | Passenger’s class                 | int64   |
| Name          | Passenger’s name                  | object  |
| Sex           | Passenger’s sex                   | object  |
| Age           | Passenger’s age                   | float64 |
| SibSp         | Number of siblings/spouses aboard | int64   |
| Parch         | Number of parents/children aboard | int64   |
| Ticket        | Ticket number                     | object  |
| Fare          | Fare                              | float64 |
| Cabin         | Cabin                             | object  |
| Embarked      | Port of embarkation               | object  |

### Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

Number of samples in our dataset:

In [3]:
print('dataset shape: \t', data.shape)

dataset shape: 	 (891, 12)


Ten random samples from the dataset; the meaning of each column is described above:

In [4]:
data.sample(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
497,498,0,3,"Shellard, Mr. Frederick William",male,,0,0,C.A. 6212,15.1,,S
810,811,0,3,"Alexander, Mr. William",male,26.0,0,0,3474,7.8875,,S
529,530,0,2,"Hocking, Mr. Richard George",male,23.0,2,1,29104,11.5,,S
141,142,1,3,"Nysten, Miss. Anna Sofia",female,22.0,0,0,347081,7.75,,S
429,430,1,3,"Pickard, Mr. Berk (Berk Trembisky)",male,32.0,0,0,SOTON/O.Q. 392078,8.05,E10,S
400,401,1,3,"Niskanen, Mr. Juha",male,39.0,0,0,STON/O 2. 3101289,7.925,,S
475,476,0,1,"Clifford, Mr. George Quincy",male,,0,0,110465,52.0,A14,S
337,338,1,1,"Burns, Miss. Elizabeth Margaret",female,41.0,0,0,16966,134.5,E40,C
573,574,1,3,"Kelly, Miss. Mary",female,,0,0,14312,7.75,,Q
847,848,0,3,"Markoff, Mr. Marin",male,35.0,0,0,349213,7.8958,,C


### Data preprocessing

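Filling the null values in the Embarked column with "C". Note that "C" is a chosen value rather than the mode: in the standard Kaggle Titanic training set the most frequent port is "S".

In [5]:
data["Embarked"] = data["Embarked"].fillna("C")       # assign back instead of a chained inplace fillna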
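
Before choosing a fill value it can help to look at the value counts and the actual mode of the column. The following is a minimal sketch on a made-up frame (the `toy` data is hypothetical and is not the Titanic file):

```python
import pandas as pd

# hypothetical toy column with two missing ports
toy = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S", None]})

counts = toy["Embarked"].value_counts()        # NaN is excluded by default
mode_port = toy["Embarked"].mode()[0]          # most frequent non-null value
n_missing = int(toy["Embarked"].isna().sum())  # how many rows need a fill value

filled = toy["Embarked"].fillna(mode_port)     # returns a new Series
print(counts.to_dict(), mode_port, n_missing)
```

If the mode is what is wanted, `filled` can be assigned back to the column; a hand-picked value such as "C" works the same way.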

Estimating the cabin deck for passengers with a missing Cabin value. We keep only the first letter of each cabin (the deck) and assign the missing decks from the Fare column:

In [6]:
data["Cabin"] = data["Cabin"].fillna("N")             # "N" marks a missing cabin
data["Cabin"] = [i[0] for i in data.Cabin]            # keep only the deck letter

with_N = data[data.Cabin == "N"].copy()               # copy, so the assignment below does not write to a view
without_N = data[data.Cabin != "N"]
data.groupby("Cabin")['Fare'].mean().sort_values()    # inspect the mean fare per deck

def cabin_estimator(i):
    # map a fare to a deck letter; each branch is the next fare interval
    if i < 16:
        a = "G"
    elif i < 27:
        a = "F"
    elif i < 38:
        a = "T"
    elif i < 47:
        a = "A"
    elif i < 53:
        a = "E"
    elif i < 54:
        a = "D"
    elif i < 116:
        a = 'C'
    else:
        a = "B"
    return a

with_N['Cabin'] = with_N.Fare.apply(cabin_estimator)
data = pd.concat([with_N, without_N], axis=0)
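
The chain of fare thresholds can also be written as a binning step. This is a sketch of an alternative using `pd.cut`, not the notebook's method; the fares below are made up for illustration. One difference to keep in mind: `cabin_estimator` returns "B" for a NaN fare, while `pd.cut` leaves it missing.

```python
import numpy as np
import pandas as pd

# same thresholds as cabin_estimator; intervals are closed on the left
bins = [-np.inf, 16, 27, 38, 47, 53, 54, 116, np.inf]
decks = ["G", "F", "T", "A", "E", "D", "C", "B"]

# made-up fares, one per interval
fares = pd.Series([7.25, 16.0, 30.0, 40.0, 52.0, 53.1, 80.0, 263.0])
estimated = pd.cut(fares, bins=bins, labels=decks, right=False).astype(str)
print(estimated.tolist())
```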

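Replacing null fare values with the mean fare of third-class male passengers who embarked at "S", and removing outliers with a fare of 500 or more (a threshold we checked earlier):

In [7]:
missing_value = data[(data.Pclass == 3) & (data.Embarked == "S") & (data.Sex == "male")].Fare.mean()
data["Fare"] = data["Fare"].fillna(missing_value)     # assign back instead of a chained inplace fillna
data = data[data.Fare < 500]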

Mapping the Sex column. We use 0 for female and 1 for male so that our model can learn from that data:

In [8]:
data['Sex'] = data.Sex.map({'male': 1, 'female': 0})

Here we process the names. We extract each passenger's title and merge the variant and uncommon titles into a few classes that we will use later for learning. Note that the extracted titles keep the space that follows the comma (for example " Mr"), which is why the dummy columns later appear as "title_ Mr".

In [9]:
# extracting the title from the Name column ("Surname, Title. First names")
data["title"] = [i.split('.')[0] for i in data.Name]
data["title"] = [i.split(',')[1] for i in data.title]

# mapping variant and rare titles to common classes (substring replacement, applied in this order)
title_replacements = [
    ('Ms', 'Miss'), ('Mlle', 'Miss'), ('Mme', 'Mrs'),
    ('Dr', 'rare'), ('Col', 'rare'), ('Major', 'rare'), ('Don', 'rare'),
    ('Jonkheer', 'rare'), ('Sir', 'rare'), ('Lady', 'rare'), ('Capt', 'rare'),
    ('the Countess', 'rare'), ('Rev', 'rare'),
]
for old, new in title_replacements:
    data["title"] = [i.replace(old, new) for i in data.title]
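
Substring replacement can match inside longer titles, so an exact-match variant may be safer. The sketch below is an alternative under stated assumptions, not the code used in this notebook: it uses a regular expression and a partial, illustrative `title_map`, and two of the names are made up. It also strips the leading space, so the resulting dummy columns would be named differently from the ones shown later.

```python
import pandas as pd

names = pd.Series([
    "Shellard, Mr. Frederick William",
    "Nysten, Miss. Anna Sofia",
    "Hypothetical, Mlle. Example",      # made-up names for the mapped cases
    "Hypothetical, Dr. Example",
])

# capture the text between the comma and the first "."
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# exact-match mapping (illustrative subset); unlisted titles are kept unchanged
title_map = {"Ms": "Miss", "Mlle": "Miss", "Mme": "Mrs", "Dr": "rare"}
titles = titles.replace(title_map)
print(titles.tolist())
```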

Checking who is travelling alone. We think it may have an effect on the results.

In [10]:
data['family_size'] = data.SibSp + data.Parch+1

def family_group(size):
    a = ''
    if (size <= 1):
        a = 'loner'
    elif (size <= 4):
        a = 'small'
    else:
        a = 'large'
    return a

data['family_group'] = data['family_size'].map(family_group)
data['is_alone'] = [1 if i<2 else 0 for i in data.family_size]

Dropping the Ticket column; we treat it as redundant:

In [11]:
data.drop(['Ticket'], axis=1, inplace=True)

Calculating fare based on family size and creating fare group column.

In [12]:
data['calculated_fare'] = data.Fare/data.family_size

def fare_group(fare):
    a= ''
    if fare <= 4:
        a = 'Very_low'
    elif fare <= 10:
        a = 'low'
    elif fare <= 20:
        a = 'mid'
    elif fare <= 45:
        a = 'high'
    else:
        a = "very_high"
    return a

data['fare_group'] = data['calculated_fare'].map(fare_group)

PassengerId is redundant information that won't help us in prediction; it is just an integer identifier assigned to each row.

In [13]:
data.drop(['PassengerId'], axis=1, inplace=True)

Now we can encode the categorical features we prepared earlier. get_dummies turns each category into its own 0/1 column so that our model can make use of the categorical data. Afterwards we drop the columns that are now redundant.

In [14]:
data = pd.get_dummies(data, columns=['title','Pclass', 'Cabin','Embarked', 'family_group', 'fare_group'], drop_first=False)

data.drop(['family_size','Name', 'Fare'], axis=1, inplace=True)
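
To make the effect of get_dummies and of the drop_first flag concrete, here is a small sketch on a made-up column (the `toy` frame is hypothetical). Recent pandas versions return boolean columns by default, so `dtype=int` is passed to get the 0/1 form shown in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"fare_group": ["low", "mid", "low", "high"]})

# one column per category
full = pd.get_dummies(toy, columns=["fare_group"], drop_first=False, dtype=int)
# the first category in sorted order ("high") is dropped
reduced = pd.get_dummies(toy, columns=["fare_group"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(full.sum(axis=1).tolist())   # each row has exactly one 1
```

With drop_first=True, a row belonging to the dropped category has zeros in all remaining columns of that feature.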

Now we predict the age for rows where the value is NaN, so that we can make use of age later on too. For this task we use a built-in predictor: random forest regression.

In [15]:
# rearranging the columns so that we can easily use the dataframe to predict the missing age val
data = pd.concat([data[["Survived", "Age", "Sex","SibSp","Parch"]], data.loc[:,"is_alone":]], axis=1)

def predict_age(df):
    age_df = df.loc[:,"Age":]                            # getting all the features except Survived
    
    temp_train = age_df.loc[age_df.Age.notnull()]        # df with age values
    temp_test = age_df.loc[age_df.Age.isnull()]          # df without age values
    
    y = temp_train.Age.values                            # setting target variables(age) in y 
    x = temp_train.loc[:, "Sex":].values
    
    rfr = RandomForestRegressor(n_estimators=1500, n_jobs=-1)
    rfr.fit(x, y)
    
    predicted_age = rfr.predict(temp_test.loc[:, "Sex":].values)    # plain array, same as in fit
    
    df.loc[df.Age.isnull(), "Age"] = predicted_age
    return df

# age prediction for null cells
data = predict_age(data)
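
The regressor above is created without a fixed seed, so the imputed ages can differ between runs. The sketch below shows the same imputation idea with `random_state` set; the data, the column subset, and the smaller `n_estimators` are assumptions made to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
toy = pd.DataFrame({                              # made-up data, not the Titanic file
    "Age": rng.uniform(1, 70, size=40),
    "Sex": rng.integers(0, 2, size=40),
    "SibSp": rng.integers(0, 3, size=40),
})
toy.loc[[3, 10, 25], "Age"] = np.nan              # pretend three ages are missing

known = toy[toy.Age.notnull()]                    # rows used to fit the regressor
unknown = toy[toy.Age.isnull()]                   # rows whose age gets predicted
features = ["Sex", "SibSp"]

rfr = RandomForestRegressor(n_estimators=50, random_state=0)   # fixed seed for repeatable imputations
rfr.fit(known[features], known["Age"])

toy.loc[toy.Age.isnull(), "Age"] = rfr.predict(unknown[features])
print(toy.Age.isnull().sum())
```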

Now that there are no NaN cells left, we can assign an age group to everyone.

In [16]:
def age_group(age):
    a = ''
    if age <= 1:
        a = 'infant'
    elif age <= 4: 
        a = 'toddler'
    elif age <= 13:
        a = 'child'
    elif age <= 18:
        a = 'teenager'
    elif age <= 35:
        a = 'Young_Adult'
    elif age <= 45:
        a = 'adult'
    elif age <= 55:
        a = 'middle_aged'
    elif age <= 65:
        a = 'senior_citizen'
    else:
        a = 'old'
    return a
        
# applying the "age_group" function to the "Age" column
data['age_group'] = data['Age'].map(age_group)

# creating dummies for "age_group" feature
data = pd.get_dummies(data,columns=['age_group'], drop_first=True)

Here we can see the result of the work done so far. These 10 samples of our dataset have 42 columns and contain a lot of zeros. This is the result of get_dummies, which puts a one only in the column of the category the row belongs to and zeros in the other category columns. For age_group, the first category (Young_Adult) is dropped by drop_first=True, so a young adult has zeros in all age_group columns.

In [17]:
data.sample(10)

Unnamed: 0,Survived,Age,Sex,SibSp,Parch,is_alone,calculated_fare,title_ Master,title_ Miss,title_ Mr,...,fare_group_mid,fare_group_very_high,age_group_adult,age_group_child,age_group_infant,age_group_middle_aged,age_group_old,age_group_senior_citizen,age_group_teenager,age_group_toddler
823,1,27.0,0,0,1,0,6.2375,0,0,0,...,0,0,0,0,0,0,0,0,0,0
438,0,64.0,1,1,4,0,43.833333,0,0,1,...,0,0,0,0,0,0,0,1,0,0
112,0,22.0,1,0,0,1,8.05,0,0,1,...,0,0,0,0,0,0,0,0,0,0
508,0,28.0,1,0,0,1,22.525,0,0,1,...,0,0,0,0,0,0,0,0,0,0
711,0,45.366579,1,0,0,1,26.55,0,0,1,...,0,0,0,0,0,1,0,0,0,0
695,0,52.0,1,0,0,1,13.5,0,0,1,...,1,0,0,0,0,1,0,0,0,0
806,0,39.0,1,0,0,1,0.0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
316,1,24.0,0,1,0,0,13.0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
217,0,42.0,1,1,0,0,13.5,0,0,1,...,1,0,1,0,0,0,0,0,0,0
350,0,23.0,1,0,0,1,9.225,0,0,1,...,0,0,0,0,0,0,0,0,0,0


Last step: creating the final datasets, that is, the features and the labels. Then we split the data into a training set and a test set.

In [18]:
# creating our final prediction datasets
X = data.drop(['Survived'], axis = 1)
y = data['Survived']

# train-test split with test size = 0.2 and random state = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 0)
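
The split above does not stratify by label. As a sketch of one possible variation (an assumption, not what this notebook does), `stratify` keeps the class ratio similar in both parts. The arrays are made up; the row count of 888 is chosen only so that the test part has 178 rows, the support shown in the reports below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-in data: 888 rows, 2 features
X_toy = np.arange(888 * 2).reshape(888, 2)
y_toy = np.array([0, 0, 1] * 296)               # made-up labels, one third positive

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape)                   # sizes of the two parts
print(y_tr.mean(), y_te.mean())                 # share of positives in each part
```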

## Comparison - Our model vs built-in model

### Creating models

Here is the final part of our work: checking the results. We create the model, train it on the data we prepared, and check its results using a classification report.
Then we compare it with the built-in logistic regression model and measure the combined training and prediction time.

In [19]:
# creating logistic regression model of our implementation
start_my_lr = clock()
our_model = LogisticRegression()

# fit the model with "X_train" and "y_train"
our_model.fit(X_train,y_train)

# once the model is trained we want to find out how well it performs, so we test it
# we use "X_test" (this data was not used to fit the model) to predict the outcome
y_pred = our_model.predict(X_test)

end_my_lr = clock()

# the predictions are saved in the "y_pred" variable
# then we compare the predicted values ("y_pred") with the actual values ("y_test") to see how well our model performs
print ("Our model accuracy Score is: \n", classification_report(y_test, y_pred), '\n')

# with the result of our model in hand we can compare it to the built-in version of logistic regression
# we create and train that model on the same data, then predict the outcome and see the result
start_bi_lr = clock()
bi_model = LR()
bi_model.fit(X_train, y_train)
y_pred1 = bi_model.predict(X_test)
end_bi_lr = clock()
print ("Built in model accuracy Score is: \n", classification_report(y_test, y_pred1))

Our model accuracy Score is: 
               precision    recall  f1-score   support

           0       0.89      0.78      0.83       115
           1       0.68      0.83      0.74        63

    accuracy                           0.80       178
   macro avg       0.78      0.80      0.79       178
weighted avg       0.81      0.80      0.80       178
 

Built in model accuracy Score is: 
               precision    recall  f1-score   support

           0       0.87      0.85      0.86       115
           1       0.74      0.76      0.75        63

    accuracy                           0.82       178
   macro avg       0.80      0.81      0.80       178
weighted avg       0.82      0.82      0.82       178
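
The source of `core.models.LogisticRegression` is not shown in this notebook. For orientation only, the sketch below is a hypothetical minimal logistic regression trained with batch gradient descent; the class name, learning rate, iteration count, and toy data are all assumptions and may differ from our actual implementation.

```python
import numpy as np

class SimpleLogisticRegression:
    """Hypothetical minimal model; not the notebook's core.models code."""

    def __init__(self, lr=0.1, n_iter=2000):
        self.lr = lr
        self.n_iter = n_iter

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.n_iter):
            p = self._sigmoid(X @ self.w + self.b)
            # gradient of the mean log-loss with respect to w and b
            self.w -= self.lr * (X.T @ (p - y) / n_samples)
            self.b -= self.lr * np.mean(p - y)
        return self

    def predict(self, X):
        p = self._sigmoid(np.asarray(X, dtype=float) @ self.w + self.b)
        return (p >= 0.5).astype(int)            # threshold probabilities at 0.5

# made-up, well-separated toy data
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

model = SimpleLogisticRegression().fit(X_toy, y_toy)
accuracy = (model.predict(X_toy) == y_toy).mean()
print(accuracy)
```

A plain loop like this has no regularization or convergence check, which is one plausible reason a basic implementation can be slower and slightly less accurate than a library solver.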



In [20]:
my_lr_time = end_my_lr - start_my_lr
bi_lr_time = end_bi_lr - start_bi_lr

print(f'Our logistic regression time: \t{my_lr_time}')
print(f'Built-in logistic regression time: \t{bi_lr_time}')

Our logistic regression time: 	0.16573189999999993
Built-in logistic regression time: 	0.07042479999999962


As we can see above, our logistic regression took roughly 2.4 times as long as the built-in model (about 0.166 s versus 0.070 s). That is a noticeable difference, and we come back to it in the summary.
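
A single measurement of around a tenth of a second can be noisy. One way to get a steadier number is to repeat the call and keep the best time. The helper `time_call` below is a hypothetical sketch, and the matrix inversion is only a stand-in workload; in this notebook the timed call would be a model's fit or predict.

```python
from time import perf_counter

import numpy as np

def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()
        func(*args)
        best = min(best, perf_counter() - start)
    return best

# stand-in workload (assumption): inverting a random 200 x 200 matrix
a = np.random.default_rng(0).normal(size=(200, 200))
elapsed = time_call(np.linalg.inv, a)
print(f"best of 5: {elapsed:.6f} s")
```

Timing fit and predict separately with such a helper would also show which of the two steps accounts for the difference.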

## Summary

Both models were trained and evaluated on the same datasets. As the classification reports show, there is only a small difference between our model and the built-in model: an accuracy of 0.80 versus 0.82, in favor of the built-in model.

In the last cell we can see that our logistic regression model takes roughly 2.4 times as long to train and predict. That is a lot, especially when we think about predicting on larger amounts of data. On the other hand, our model is a really simple implementation of logistic regression and doesn't have any special features that would make it more efficient.

Taking into account that our implementation is much more basic, we can consider the results satisfactory, even with the longer processing time.