`TITANIC`

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

`INFO ABOUT DATA`

There are two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.

train.csv contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether each of them survived, also known as the “ground truth”: 1 means the person survived and 0 means the person did not survive.

The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
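As a small illustration of how the two files differ, the sketch below builds two stand-in frames and checks which column exists only on the train side. The names `train_sample` and `test_sample` are hypothetical; the rows are copied from the `train.head(5)` output in this notebook, and the test stand-in is simply assumed to be the same columns without the ground truth.

```python
import pandas as pd

# Stand-in for train.csv: three rows copied from the train.head(5) output.
train_sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Stand-in for test.csv (assumption): same columns, ground truth removed.
test_sample = train_sample.drop(columns=["Survived"])

# Columns present in train but absent from test: the target to predict.
target_columns = sorted(set(train_sample.columns) - set(test_sample.columns))
print(target_columns)
```

The same set difference can be run on the real `train` and `test` frames once they are loaded.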

In [40]:
#Importing the libraries
import pandas as pd 
import numpy as np 

`DATA UNDERSTANDING AND EXPLORATION`

In [95]:
#Loading the data
train = pd.read_csv("/Users/sultanbeishenkulov/Programming/Projects/kaggle/titanic/train.csv")
test = pd.read_csv("/Users/sultanbeishenkulov/Programming/Projects/kaggle/titanic/test.csv")

train.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [44]:
#Checking duplicates
print('duplicates in train set: {}, duplicates in test set: {}'.format(train.duplicated().sum(), test.duplicated().sum()))

duplicates in train set: 0, duplicates in test set: 0


In [104]:
print('missing values in train set: {}\n missing values in test set:\n {}'.format(train.isnull().sum(), test.isnull().sum()))

missing values in train set: PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
 missing values in test set:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
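Raw counts are easier to compare as fractions: in the train set, 687 of 891 Cabin values (about 77%) and 177 of 891 Age values (about 20%) are missing. A minimal sketch of that calculation, using a hypothetical `sample` frame built from three columns of the five rows printed by `train.head(5)`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: three columns from the five rows shown by train.head(5).
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# isnull() yields booleans; their column-wise mean is the share of missing values.
missing_share = sample.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the full data, `train.isnull().mean()` would give the per-column shares directly.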


In [67]:
#Checking the share of men and women among the survivors
survivors = train[train['Survived']==1]
women_share = (survivors['Sex']=="female").mean()

print('share of survivors by sex: men: {} and women: {}'.format(round(1-women_share, 2), round(women_share, 2)))

share of survivors by sex: men: 0.32 and women: 0.68


In [83]:
#Survival counts and rates for each sex
man = train.loc[train.Sex =='male']['Survived']
woman = train.loc[train.Sex =='female']['Survived']
print('Out of {} men, {} survived ({:.1%})\nOut of {} women, {} survived ({:.1%})'.format(len(man), sum(man), sum(man)/len(man), len(woman), sum(woman), sum(woman)/len(woman)))


Out of 577 men, 109 survived (18.9%)
Out of 314 women, 233 survived (74.2%)
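The same per-sex breakdown can be computed in one step with a group-by. This is only a sketch on a hypothetical `sample` frame (the `Sex` and `Survived` values of the five rows from `train.head(5)`), not the full data set:

```python
import pandas as pd

# Hypothetical sample: Sex and Survived for the five rows shown by train.head(5).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# One row per sex: group size, number of survivors, and survival rate.
summary = sample.groupby("Sex")["Survived"].agg(
    total="count", survived="sum", rate="mean"
)
print(summary)
```

On the real `train` frame, the `rate` column would correspond to the survival rates printed above for men and women.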


`DATA PREPROCESSING AND FEATURE ENGINEERING`

In [25]:
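#Encoding Sex as a number: male=1, female=0 (run once; re-running maps the already encoded values to NaN)
train['Sex'] = train['Sex'].map(dict(male=1, female=0))
train['Sex']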

0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex, Length: 891, dtype: int64
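Mapping `Sex` in place is not safe to re-run: once the column holds 1 and 0, neither value is a key in the mapping, so a second run produces NaN everywhere. One possible alternative, sketched here under the assumption that a separate column is acceptable (the helper `encode_sex` and the column name `Sex_male` are hypothetical), leaves the original strings untouched:

```python
import pandas as pd

# Hypothetical sample: the Sex values of the first five passengers.
sample = pd.DataFrame({"Sex": ["male", "female", "female", "female", "male"]})

def encode_sex(df):
    """Return a copy with a 0/1 column Sex_male; the Sex column is left as-is."""
    out = df.copy()
    out["Sex_male"] = (out["Sex"] == "male").astype(int)
    return out

# Calling the helper twice gives the same result as calling it once.
encoded = encode_sex(encode_sex(sample))
print(encoded)
```

Keeping the text column also makes later group-by summaries easier to read.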

In [41]:
train[train['Age'].isnull()].head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",1,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",1,,0,0,244373,13.0,,S
19,20,1,3,"Masselmani, Mrs. Fatima",0,,0,0,2649,7.225,,C
26,27,0,3,"Emir, Mr. Farred Chehab",1,,0,0,2631,7.225,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",0,,0,0,330959,7.8792,,Q
