The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import cufflinks as cf
import chart_studio.plotly as py
%matplotlib inline

from plotly.offline import download_plotlyjs, init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
# run cufflinks in offline mode so plots render locally
cf.go_offline()

In [2]:
# Loading the data
titanic_train = pd.read_csv("train.csv")

In [3]:
# Viewing the data
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Datatypes of the variables
titanic_train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [5]:
#understanding the dataframe-numerical columns
titanic_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
#Checking null values of all variables
titanic_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [7]:
px.imshow(titanic_train.isnull())

In [8]:
# shape of the dataframe
titanic_train.shape

(891, 12)

### Understanding 1: Dropping Columns
- PassengerId, Ticket, and Name do not contribute any information to the analysis, so we can drop them.
- Of the 891 rows, 687 have no Cabin value, so it is better to drop that column as well.
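
The share of missing values per column can be checked directly before deciding what to drop. A minimal sketch on a small hypothetical frame (the `toy` data and the 0.5 threshold are assumptions for illustration, not values used in this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_train
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Fraction of missing cells per column
missing_share = toy.isnull().mean()

# Columns whose missing share exceeds an assumed threshold of 0.5
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(mostly_missing)
```

On the real data, Cabin is missing in 687 of 891 rows, which is roughly 77%.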

In [9]:
titanic_cleaned = titanic_train.copy()
titanic_cleaned.drop(["PassengerId","Name","Cabin","Ticket"], inplace = True, axis =1)

In [10]:
titanic_cleaned.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


## Categorical Columns 

In [11]:
# Inspecting Sex
# Inspecting Pclass
# Inspecting Survived - must only contain 0 or 1
# Inspecting Embarked - must only contain known port codes
print(titanic_cleaned[["Survived","Sex","Pclass"]].value_counts())
titanic_cleaned.Embarked.value_counts()

Survived  Sex     Pclass
0         male    3         300
                  2          91
1         female  1          91
0         male    1          77
          female  3          72
1         female  3          72
                  2          70
          male    3          47
                  1          45
                  2          17
0         female  2           6
                  1           3
dtype: int64


S    644
C    168
Q     77
Name: Embarked, dtype: int64

- Variable Survived has entries 0 and 1 only
- Variable Sex has entries male and female only
- Variable Pclass has entries 1, 2, 3 only
- Variable Embarked has three entries: S, C, Q

### Converting categorical columns into numerical columns for better understanding: female = 1 and male = 0,
### and Embarked as S = 1, C = 2, Q = 3

In [12]:
# Converting categorical columns into numerical columns: female = 1, male = 0; Embarked: S = 1, C = 2, Q = 3
titanic_cleaned['Sex'] = titanic_cleaned['Sex'].map({'male':0,'female':1})
titanic_cleaned['Embarked'] = titanic_cleaned['Embarked'].map({'S':1,'C':2, 'Q':3})

In [13]:
# Validating the datatypes
titanic_cleaned.dtypes

Survived      int64
Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked    float64
dtype: object
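
`Embarked` comes out as `float64` rather than `int64` because its two missing entries are still NaN after the mapping, and NaN cannot be stored in a plain integer column. A small sketch of that behaviour, using a hypothetical four-value series instead of the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing port, mimicking Embarked
embarked = pd.Series(["S", "C", np.nan, "Q"])

# Values absent from the mapping (including NaN) become NaN,
# which forces the result to a float dtype
mapped = embarked.map({"S": 1, "C": 2, "Q": 3})
print(mapped.dtype)
print(mapped.isnull().sum())
```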

# Age

In [14]:
# Outlier Analysis = Age
px.box(data_frame = titanic_cleaned, y = 'Age', color = 'Sex')

- Some male passengers were much older than the rest: the box plot above shows a range of roughly 0-66 years, with a few points beyond it. We will remove the oldest of them.

In [15]:
# 7 passengers are older than 66, the upper end of the range seen in the box plot
print(titanic_cleaned[titanic_cleaned.Age > 66])

# Removing the outliers; the threshold used here is 70, so only ages above 70 are dropped
titanic_cleaned.drop(titanic_cleaned.index[titanic_cleaned.Age > 70], inplace=True)


     Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
96          0       1    0  71.0      0      0  34.6542       2.0
116         0       3    0  70.5      0      0   7.7500       3.0
493         0       1    0  71.0      0      0  49.5042       2.0
630         1       1    0  80.0      0      0  30.0000       1.0
672         0       2    0  70.0      0      0  10.5000       1.0
745         0       1    0  70.0      1      1  71.0000       1.0
851         0       3    0  74.0      0      0   7.7750       1.0
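
The cut-off read from the box plot can also be approximated numerically. The sketch below assumes the common 1.5 × IQR rule for the whiskers; `iqr_fences` and the `ages` values are hypothetical, and Plotly may compute quartiles slightly differently, so the fences can differ a little from the plot.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages, not taken from train.csv
ages = pd.Series([2, 18, 22, 25, 28, 30, 35, 40, 45, 80], dtype=float)
lower, upper = iqr_fences(ages)

# Points outside the fences are flagged as outliers
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper)
print(outliers.tolist())
```

With the quartiles reported by `describe()` earlier (20.125 and 38.0), the same rule gives an upper fence of 64.8125 for Age across all passengers.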


In [16]:
# Handling Missing Value:
print("Total missing values in Age Variable", titanic_cleaned.Age.isnull().sum())
px.imshow(titanic_cleaned.isnull())


Total missing values in Age Variable 177


In [17]:
px.histogram(titanic_cleaned, x = 'Age')

- Age is not normally distributed, so we should not impute the NaN values with a single mean, median, or mode. We need to check whether Age is related to other variables.

In [18]:
px.box(titanic_cleaned, x = 'Embarked', y = 'Age', color = 'Sex')

In [19]:
px.box(titanic_cleaned, x = 'Sex', y = 'Age')

- The median age is 29 for males and 27 for females, so there is no significant difference

In [20]:
px.box(titanic_cleaned, x = 'Pclass', y = 'Age', color = 'Sex')

#### IMP: the graph above shows a significant relation between Age, Pclass, and Sex:
- If the Pclass is 1 and Sex is Male the median age is 40
- If the Pclass is 2 and Sex is Male the median age is 30
- If the Pclass is 3 and Sex is Male the median age is 25
- If the Pclass is 1 and Sex is Female the median age is 35
- If the Pclass is 2 and Sex is Female the median age is 28
- If the Pclass is 3 and Sex is Female the median age is 21.5
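
The group medians listed above were read off the plot; they could also be computed and applied in one step with a grouped transform. This is an alternative sketch on hypothetical toy data, not the approach this notebook takes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; Sex is already encoded (male = 0, female = 1)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Age":    [38.0, 42.0, np.nan, 20.0, 23.0, np.nan],
})

# Median age of each (Pclass, Sex) group, broadcast back to every row
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages with their group's median
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Computing the medians from the data avoids rounding errors from reading a chart, at the cost of hiding the values that the explicit list makes visible.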
 

In [21]:
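def age_NanTreatment(columns):
    # Label-based access: positional lookups like columns[0] are deprecated on a labelled Series
    Age = columns['Age']
    Pclass = columns['Pclass']
    Sex = columns['Sex']
    if pd.isnull(Age):
        # Median ages per (Pclass, Sex) read off the box plot; Sex: 0 = male, 1 = female
        if Pclass == 1 and Sex == 0:
            return 40
        elif Pclass == 2 and Sex == 0:
            return 30
        elif Pclass == 3 and Sex == 0:
            return 25
        elif Pclass == 1 and Sex == 1:
            return 35
        elif Pclass == 2 and Sex == 1:
            return 28
        elif Pclass == 3 and Sex == 1:
            return 21.5
    else:
        return Age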

In [22]:
titanic_cleaned['Age'] = titanic_cleaned[['Age', 'Pclass','Sex']].apply(age_NanTreatment, axis = 1)

In [23]:
px.imshow(titanic_cleaned.isnull())

# Fare

In [24]:
# Outlier Analysis = Fare
px.box(data_frame=titanic_cleaned , y = 'Fare')

- Reading and assumptions: some passengers were presumably travelling on a free pass, which is why their fare is recorded as zero

In [25]:
# Rows with outlying fares
titanic_cleaned[titanic_cleaned.Fare>65]
# there are 116 such rows, too many to remove

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
1,1,1,1,38.0,1,0,71.2833,2.0
27,0,1,0,19.0,3,2,263.0000,1.0
31,1,1,1,35.0,1,0,146.5208,2.0
34,0,1,0,28.0,1,0,82.1708,2.0
52,1,1,1,49.0,1,0,76.7292,2.0
...,...,...,...,...,...,...,...,...
846,0,3,0,25.0,8,2,69.5500,1.0
849,1,1,1,35.0,1,0,89.1042,2.0
856,1,1,1,45.0,1,1,164.8667,1.0
863,0,3,1,21.5,8,2,69.5500,1.0


# Embarked

In [26]:
px.imshow(titanic_cleaned.isnull())

In [27]:
titanic_cleaned.dropna(inplace =True)

In [28]:
titanic_cleaned.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

# Hurray! Our train data is finally cleaned; we will do the same with the test data

In [29]:
# let's check our test data
titanic_test = pd.read_csv('test.csv')
titanic_test.shape

(418, 11)

In [30]:
titanic_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [31]:
titanic_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [32]:
px.histogram(titanic_test, x = 'Fare')


In [33]:
# Categorical data conversion and missing-value imputation
titanic_test['Sex'] = titanic_test.Sex.map({'male':0,'female':1})
titanic_test['Embarked'] = titanic_test.Embarked.map({'S':1,'C':2, 'Q':3})
titanic_test['Age'] = titanic_test[['Age', 'Pclass','Sex']].apply(age_NanTreatment, axis = 1)
titanic_test['Fare'] = titanic_test.Fare.fillna(value = titanic_test.Fare.median())
#dropping the columns of least use ['PassengerId','Name','Ticket','Cabin']
test_x = titanic_test.drop(['PassengerId','Name','Ticket','Cabin'],axis = 1)
print(test_x.shape)
test_x.isnull().sum()

(418, 7)


Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
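
Because the train and test frames go through the same encoding and column drops, those steps could be wrapped in one helper so the two cannot drift apart. A minimal sketch, where `encode_and_trim` and the two-row `toy` frame are hypothetical names introduced only for illustration:

```python
import numpy as np
import pandas as pd

def encode_and_trim(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: encode Sex/Embarked and drop unused columns."""
    out = df.copy()
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].map({"S": 1, "C": 2, "Q": 3})
    # errors="ignore" lets the same call work if a column is already gone
    return out.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], errors="ignore")

# Hypothetical two-row frame with the same column names as test.csv
toy = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 3],
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [34.5, 47.0],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Ticket": ["330911", "363272"],
    "Fare": [7.8292, 7.0],
    "Cabin": [np.nan, np.nan],
    "Embarked": ["Q", "S"],
})
features = encode_and_trim(toy)
print(features.columns.tolist())
```

The helper works on a copy, so the original frame keeps `PassengerId`, which is still needed later when writing the predictions.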

# Training the model: logistic regression. We need to decide for passengers in titanic_test whether they survived (1) or not (0)

In [34]:
train_x = titanic_cleaned.drop('Survived', axis =1)
train_y = titanic_cleaned['Survived']


In [35]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter = 200)
model.fit(train_x,train_y)

LogisticRegression(max_iter=200)
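
The test set has no `Survived` column, so the fitted model cannot be scored on it locally. One way to get a rough accuracy estimate is to hold out part of the labelled training rows. The sketch below uses synthetic data (`X`, `y` are stand-ins, not the Titanic features), and the 20% split and `random_state` are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train_x / train_y (not the Titanic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out 20% of the labelled rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit on the remaining rows, then score on the held-out rows
model = LogisticRegression(max_iter=200)
model.fit(X_fit, y_fit)
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_accuracy:.3f}")
```

In the notebook, `train_x` and `train_y` would take the place of `X` and `y`.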

In [36]:
# prediction
test_y = model.predict(test_x)

In [37]:
titanic_test['Survived']=test_y

In [38]:
titanic_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,3,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,1,0
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,3,0
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,1,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,1,1


# writing to CSV

In [39]:
df = titanic_test[['PassengerId','Survived']]
df.to_csv('Predictions.csv', index = False, header = True)