# Titanic Dataset

### This is a classic dataset that many beginners work through to get the hang of machine learning algorithms.
### The sinking of the Titanic is one of the most infamous shipwrecks in history.
### The goal is to build a model that predicts which passengers survived the shipwreck.


## Variable Description

<span style='font-weight:bold;color:#561225'>PassengerId:</span> The ID number of the passenger

<span style='font-weight:bold;color:#561225'>Survived:</span> Whether the passenger survived (0 = no, 1 = yes)

<span style='font-weight:bold;color:#561225'>Pclass:</span> The ticket class in which the passenger travelled (1, 2 or 3)

<span style='font-weight:bold;color:#561225'>Name:</span> Name of the passenger

<span style='font-weight:bold;color:#561225'>Sex:</span> Male or Female passenger

<span style='font-weight:bold;color:#561225'>Age:</span> Age of the passenger

<span style='font-weight:bold;color:#561225'>SibSp:</span> Number of siblings / spouses aboard the Titanic

<span style='font-weight:bold;color:#561225'>Parch:</span> Number of parents / children aboard the Titanic

<span style='font-weight:bold;color:#561225'>Ticket:</span> Ticket number

<span style='font-weight:bold;color:#561225'>Fare:</span> Passenger fare

<span style='font-weight:bold;color:#561225'>Cabin:</span> Cabin number

<span style='font-weight:bold;color:#561225'>Embarked:</span> Port of Embarkation --- C = Cherbourg, Q = Queenstown, S = Southampton

In [5]:
## Importing basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [61]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [6]:
##Reading the data
df=pd.read_csv("https://raw.githubusercontent.com/manishanker/Statistics_ML_26Aug/master/titanic_data.csv")

## Dataset link = "https://www.kaggle.com/c/titanic/data?select=train.csv"

In [7]:
df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [8]:
# Describing the data
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PassengerId | 891.0 | 446.0 | 257.353842 | 1.0 | 223.5 | 446.0 | 668.5 | 891.0 |
| Survived | 891.0 | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Pclass | 891.0 | 2.308642 | 0.836071 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| Age | 714.0 | 29.699118 | 14.526497 | 0.42 | 20.125 | 28.0 | 38.0 | 80.0 |
| SibSp | 891.0 | 0.523008 | 1.102743 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| Parch | 891.0 | 0.381594 | 0.806057 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| Fare | 891.0 | 32.204208 | 49.693429 | 0.0 | 7.9104 | 14.4542 | 31.0 | 512.3292 |


In [9]:
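# Checking the correlation of the numeric columns.
# The string columns (Name, Sex, Ticket, Cabin, Embarked) are excluded explicitly.
df.select_dtypes(include="number").corr()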

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |


In [10]:
### Checking NaN values.
print("The percentage of null values in each column is:\n")
print(df.isna().sum()/df.shape[0]*100)

The percentage of null values in each column is:

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64
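
The same percentages can be computed a little more directly. The sketch below uses a small made-up frame (the `toy` name and its values are illustrative assumptions, not the Titanic data) and relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() gives a True/False mask; its column-wise mean is the share of nulls,
# which is equivalent to isna().sum() / len(toy).
null_pct = toy.isna().mean() * 100

# Sorting puts the columns that need the most attention first.
print(null_pct.sort_values(ascending=False))
```

Sorting the result makes it easier to spot columns such as 'Cabin' where most values are missing.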


### 'PassengerId', 'Name' and 'Ticket' are identifiers that are unlikely to help the model in their raw form, and 'Cabin' is about 77% null. Hence these four columns are dropped.

In [11]:
### Dropping 'PassengerId','Name','Ticket' & 'Cabin' columns.

df.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)

### 'Age' (about 19.9% null) and 'Embarked' (about 0.2% null) still have missing values. Since these percentages are small, let's replace the nulls with the median of 'Age' and the mode of 'Embarked' respectively.

In [12]:
print(df.groupby(['Sex', 'Pclass'])['Age'].agg(['mean', 'median']).round(3))

                 mean  median
Sex    Pclass                
female 1       34.612    35.0
       2       28.723    28.0
       3       21.750    21.5
male   1       41.281    40.0
       2       30.741    30.0
       3       26.508    25.0


In [13]:
df.Age.median()

28.0

In [14]:
## Replacing null values of Age with its median
## (assigning back avoids the chained inplace fillna, which newer pandas versions warn about)
df["Age"] = df["Age"].fillna(df["Age"].median())
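
The group medians printed earlier range from 21.5 to 40.0, so one alternative to a single global median is to fill each missing age with the median of the passenger's own Sex and Pclass group. This is only a sketch of that alternative, not what the notebook does; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 42.0, np.nan, 20.0, 22.0, np.nan],
})

# transform("median") returns a Series aligned with the original rows,
# holding the median Age of each row's (Sex, Pclass) group.
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median)

# Fallback: a group with no known ages would still be null,
# so any remaining gaps get the overall median.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

Whether this improves the model is an empirical question; it would need to be checked on the held-out data.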

In [15]:
df.isna().sum()/df.shape[0]*100

Survived    0.000000
Pclass      0.000000
Sex         0.000000
Age         0.000000
SibSp       0.000000
Parch       0.000000
Fare        0.000000
Embarked    0.224467
dtype: float64

In [16]:
## Replacing null values of Embarked with its mode

df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

In [17]:
df.isna().sum()/df.shape[0]*100.

Survived    0.0
Pclass      0.0
Sex         0.0
Age         0.0
SibSp       0.0
Parch       0.0
Fare        0.0
Embarked    0.0
dtype: float64

In [18]:
# Q1 = df.quantile(0.25)
# Q3 = df.quantile(0.75)
# IQR = Q3 - Q1
# print(IQR)
# df_out = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]

### Converting categorical variables into numeric indicator (dummy) variables.

In [19]:
df

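| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S |
| 888 | 0 | 3 | female | 28.0 | 1 | 2 | 23.4500 | S |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C |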


### Here we can see that Embarked, Sex and Pclass are categorical variables.
### Let's one-hot encode them into 0/1 indicator columns.

In [20]:
# embarked=pd.get_dummies(data=df.Embarked,prefix='Embarked_')
# df=pd.concat([df,embarked],axis=1)
# df.drop("Embarked",axis=1,inplace=True)

In [21]:
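def One_hot_encoding(df,datacolumn):
    """Replace `datacolumn` with one 0/1 indicator column per category."""
    # dtype=int keeps the indicators as 0/1; recent pandas versions default to booleans
    a=pd.get_dummies(data=df[datacolumn],prefix=datacolumn,dtype=int)
    df=pd.concat([df,a],axis=1)
    df=df.drop(datacolumn,axis=1)
    return df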

In [22]:
df=One_hot_encoding(df,'Embarked')
df=One_hot_encoding(df,'Sex')
df=One_hot_encoding(df,'Pclass')
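
The three calls above can also be written as a single `pd.get_dummies` call with the `columns` argument. The sketch below runs on a made-up `toy` frame and additionally passes `drop_first=True`, which is a different choice from the helper above: it drops one level per variable so the indicators are not perfectly collinear.

```python
import pandas as pd

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
})

# Encode all three categorical columns at once.
# drop_first=True removes the first level of each variable
# (Embarked_C, Sex_female, Pclass_1) to avoid redundant columns.
encoded = pd.get_dummies(
    toy,
    columns=["Embarked", "Sex", "Pclass"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
```

Keeping every level, as the notebook does, still works with scikit-learn's `LogisticRegression` because it applies L2 regularization by default; dropping a level mainly matters for unregularized linear models.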

In [23]:
df

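| | Survived | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.2500 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 4 | 0 | 35.0 | 0 | 0 | 8.0500 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 27.0 | 0 | 0 | 13.0000 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 887 | 1 | 19.0 | 0 | 0 | 30.0000 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 888 | 0 | 28.0 | 1 | 2 | 23.4500 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 889 | 1 | 26.0 | 0 | 0 | 30.0000 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |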


### Here our target variable is 'Survived', which takes only the values 0 and 1, so this is a binary classification problem. Hence a logistic regression model is tried first.

## Logistic Regression Model
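
As a reminder of what the model computes: logistic regression passes a weighted sum of the features through the sigmoid function to get a survival probability, then predicts class 1 when that probability is at least 0.5. The sketch below illustrates this with NumPy only. The weights `w`, the intercept `b` and the two feature rows are invented for illustration; they are not coefficients fitted on this dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for two features: [Age, Sex_female].
# These numbers are made up, not fitted values.
w = np.array([-0.03, 2.5])
b = -1.0

# Two illustrative passengers: a 22-year-old male and a 38-year-old female.
x = np.array([[22.0, 0.0],
              [38.0, 1.0]])

# Linear score -> probability -> class label at the 0.5 threshold.
p = sigmoid(x @ w + b)
pred = (p >= 0.5).astype(int)

print(p.round(3), pred)
```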

In [24]:
X = df.drop(["Survived"], axis=1)
y=df["Survived"]

In [25]:
X

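| | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 1 | 0 | 7.2500 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 4 | 35.0 | 0 | 0 | 8.0500 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 27.0 | 0 | 0 | 13.0000 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 887 | 19.0 | 0 | 0 | 30.0000 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 888 | 28.0 | 1 | 2 | 23.4500 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 889 | 26.0 | 0 | 0 | 30.0000 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |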


In [26]:
y

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [46]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,random_state=45)
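
With a plain random split the class balance of the test set can drift from the full data: about 38.4% of all passengers survived, while the reports below show 90 survivors among 268 test rows (about 33.6%). Passing `stratify=y` to `train_test_split` keeps the proportions close in both parts. The sketch below shows the effect on made-up labels (`X_demo` and `y_demo` are illustrative, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with 38% positives, roughly the Titanic survival rate.
y_demo = np.array([0] * 62 + [1] * 38)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=45, stratify=y_demo
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

Switching the notebook to a stratified split would change which rows land in the test set, so the scores reported here would not carry over unchanged.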

In [47]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [48]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8208955223880597

In [49]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.85      0.86       178
           1       0.72      0.77      0.74        90

    accuracy                           0.82       268
   macro avg       0.80      0.81      0.80       268
weighted avg       0.82      0.82      0.82       268



In [50]:
from collections import Counter
Counter(y_pred)

Counter({0: 172, 1: 96})

In [51]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[151,  27],
       [ 21,  69]], dtype=int64)
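
The numbers in the classification report can be recomputed by hand from this confusion matrix. In scikit-learn's layout the rows are the true classes and the columns are the predicted classes, so the four cells are TN, FP, FN and TP in reading order. A short check for the 'survived' class:

```python
import numpy as np

# Confusion matrix of the logistic regression model, copied from the output above.
cm = np.array([[151, 27],
               [21, 69]])

# Rows = true class, columns = predicted class.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions / all predictions
precision = tp / (tp + fp)             # of predicted survivors, how many survived
recall = tp / (tp + fn)                # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The results agree with the accuracy score and with the row for class 1 in the report.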

### Standardizing the features with StandardScaler (fitted on the training set only, then applied to the test set).

In [52]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std=sc.transform(X_test)

In [53]:
model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train_std, y_train)

LogisticRegression(max_iter=1000)

In [54]:
y_pred_std = model1.predict(X_test_std)

In [55]:
accuracy_score(y_test, y_pred_std)

0.8208955223880597
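
The scale-then-fit steps can also be bundled into one estimator with a scikit-learn pipeline, which guarantees that the scaler is fitted on the training rows only. This is a sketch on synthetic data from `make_classification`, so the score it prints is unrelated to the Titanic results; `X_demo`, `y_demo` and `pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# fit() scales with statistics from X_tr only, then fits the classifier;
# score() reuses those statistics on X_te before predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)

print(f"test accuracy: {acc:.3f}")
```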

In [56]:
from sklearn.metrics import accuracy_score

## XGBoost Classifier

In [62]:
import xgboost as xgb
model=xgb.XGBClassifier(n_estimators=1000,random_state=None,learning_rate=0.01)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)



0.8582089552238806

In [63]:
# Classification report of XGBoost
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.89      0.89       178
           1       0.79      0.79      0.79        90

    accuracy                           0.86       268
   macro avg       0.84      0.84      0.84       268
weighted avg       0.86      0.86      0.86       268



In [64]:
from collections import Counter
Counter(y_pred)

Counter({0: 178, 1: 90})

In [65]:
## Confusion Matrix
confusion_matrix(y_test, y_pred)

array([[159,  19],
       [ 19,  71]], dtype=int64)

### Conclusion

1. Standardizing the features made no difference to the logistic regression accuracy on this split: it is 0.821 both before and after.
2. With the XGBoost classifier, the accuracy on the same test split rises from 0.821 to 0.858.
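
Both points rest on a single 70/30 split, so a natural follow-up is to compare the models with k-fold cross-validation. The sketch below shows the pattern on synthetic data. To keep it runnable with scikit-learn alone, it uses `GradientBoostingClassifier` as a stand-in for XGBoost; that is a substitution made for this example, and the printed scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# GradientBoostingClassifier stands in for xgb.XGBClassifier here.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, estimator in candidates.items():
    # Five accuracy scores, one per held-out fold.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two models is smaller than the spread across folds, the difference seen on one split may not be reliable.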