# Titanic Dataset Exploration

In this notebook we explore the Titanic dataset in order to complete the [Kaggle Titanic competition](https://www.kaggle.com/c/titanic).

### Goal
Our job is to predict if a passenger survived the sinking of the Titanic or not. For each PassengerId in the test set, we must predict a 0 or 1 value for the Survived variable.
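
The shape of the expected predictions can be sketched as a two-column table. This is a minimal illustration only: the `predictions` list is a hypothetical placeholder rather than model output, and the assumption is that the competition expects a CSV with `PassengerId` and `Survived` columns.

```python
import pandas as pd

# First three PassengerId values from the test set
passenger_ids = [892, 893, 894]
# Hypothetical placeholder predictions (0 = did not survive, 1 = survived)
predictions = [0, 1, 0]

# One row per test passenger, with a 0/1 value for Survived
submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})
print(submission.to_csv(index=False))
```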

### Metric
The score is the percentage of passengers we predict correctly. This is known simply as "accuracy".
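
As a rough sketch of how this metric can be computed, the helper below compares predicted labels with true labels element by element. The function name `accuracy` and the example labels are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels for five passengers (not real data)
example_true = [0, 1, 1, 0, 1]
example_pred = [0, 1, 0, 0, 1]

# Four of the five predictions match
example_score = accuracy(example_true, example_pred)
print("Accuracy: %.2f" % example_score)
```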

## Step 1 - Load the data as pandas DataFrames

In [1]:
import pandas as pd
import numpy as np
# Load train and test data into pandas dataframe
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
combine_data = [train_data, test_data]

In [2]:
# print the shape of the train and test data
print("Shape of train data: rows - %d, columns - %d"%(train_data.shape[0], train_data.shape[1]))
print("Shape of test data: rows - %d, columns - %d"%(test_data.shape[0], test_data.shape[1]))

Shape of train data: rows - 891, columns - 12
Shape of test data: rows - 418, columns - 11


## Step 2 - Analyzing the dataset

### Features in train dataset

In [3]:
for feature, data_type in zip(train_data, train_data.dtypes):
    print("%-15s -  %s"%(feature, data_type))

PassengerId     -  int64
Survived        -  int64
Pclass          -  int64
Name            -  object
Sex             -  object
Age             -  float64
SibSp           -  int64
Parch           -  int64
Ticket          -  object
Fare            -  float64
Cabin           -  object
Embarked        -  object


### Features in test dataset

In [4]:
for feature, data_type in zip(test_data, test_data.dtypes):
    print("%-15s -  %s"%(feature, data_type))

PassengerId     -  int64
Pclass          -  int64
Name            -  object
Sex             -  object
Age             -  float64
SibSp           -  int64
Parch           -  int64
Ticket          -  object
Fare            -  float64
Cabin           -  object
Embarked        -  object


In [5]:
#sample data in train_data dataframe
from IPython.display import display
display(train_data.head(5))

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1|0|3|Braund, Mr. Owen Harris|male|22.0|1|0|A/5 21171|7.25|NaN|S|
|1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C|
|2|3|1|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.925|NaN|S|
|3|4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35.0|1|0|113803|53.1|C123|S|
|4|5|0|3|Allen, Mr. William Henry|male|35.0|0|0|373450|8.05|NaN|S|


In [6]:
# sample data in test_data dataframe
from IPython.display import display
display(test_data.head(5))

| |PassengerId|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|---|---|---|---|---|---|---|---|---|---|---|---|
|0|892|3|Kelly, Mr. James|male|34.5|0|0|330911|7.8292|NaN|Q|
|1|893|3|Wilkes, Mrs. James (Ellen Needs)|female|47.0|1|0|363272|7.0|NaN|S|
|2|894|2|Myles, Mr. Thomas Francis|male|62.0|0|0|240276|9.6875|NaN|Q|
|3|895|3|Wirz, Mr. Albert|male|27.0|0|0|315154|8.6625|NaN|S|
|4|896|3|Hirvonen, Mrs. Alexander (Helga E Lindqvist)|female|22.0|1|1|3101298|12.2875|NaN|S|


In [7]:
# total number of passengers who survived in train_data
survived_train = train_data[train_data['Survived'] == 1].shape[0]
print("Number of people survived in train_data - %d"%survived_train)

Number of people survived in train_data - 342


In [31]:
# PassengerId column data attributes
print(train_data['PassengerId'].describe())


count    891.000000
mean     446.000000
std      257.353842
min        1.000000
25%      223.500000
50%      446.000000
75%      668.500000
max      891.000000
Name: PassengerId, dtype: float64


In [20]:
# Survived column data attributes
print(train_data['Survived'].value_counts())
print('-'*50)
print(train_data['Survived'].describe())

0    549
1    342
Name: Survived, dtype: int64
--------------------------------------------------
count    891.000000
mean       0.383838
std        0.486592
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64
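
The counts above also suggest a simple reference point: always predicting the majority class (0, did not survive). The sketch below computes the accuracy such a constant prediction would reach on the training set. It uses only the two counts printed above; treating this as a baseline is a suggestion, not a step from the original analysis.

```python
# Class counts printed by value_counts() above
not_survived_count = 549
survived_count = 342

# Accuracy on the training set when every passenger is predicted as the majority class
total = not_survived_count + survived_count
baseline_accuracy = max(not_survived_count, survived_count) / total
print("Majority-class baseline accuracy: %.4f" % baseline_accuracy)
```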


In [9]:
# Pclass column data attributes
print(train_data['Pclass'].value_counts())
print('-'*50)
print("count %5d"%train_data['Pclass'].count())

3    491
1    216
2    184
Name: Pclass, dtype: int64
--------------------------------------------------
count   891


In [32]:
# Name column data attributes
print(train_data['Name'].describe())

count                         891
unique                        891
top       Jarvis, Mr. John Denzil
freq                            1
Name: Name, dtype: object


In [10]:
# Sex column data attributes
print(train_data['Sex'].value_counts())
print('-'*50)
print(train_data['Sex'].describe())

male      577
female    314
Name: Sex, dtype: int64
--------------------------------------------------
count      891
unique       2
top       male
freq       577
Name: Sex, dtype: object


In [18]:
# Age column data attributes
print(train_data['Age'].describe())

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64


In [19]:
# SibSp column data attributes
print(train_data['SibSp'].value_counts())
print('-'*50)
print(train_data['SibSp'].describe())

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
--------------------------------------------------
count    891.000000
mean       0.523008
std        1.102743
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        8.000000
Name: SibSp, dtype: float64


In [15]:
# Parch column data attributes
print(train_data['Parch'].value_counts())
print('-'*50)
print(train_data['Parch'].describe())

0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
--------------------------------------------------
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: Parch, dtype: float64


In [14]:
# Fare column data attributes
train_data['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [24]:
# Cabin column data attributes
print(train_data['Cabin'].describe())

count             204
unique            147
top       C23 C25 C27
freq                4
Name: Cabin, dtype: object


In [27]:
# Embarked column data attributes
print(train_data['Embarked'].value_counts())
print('-'*50)
print(train_data['Embarked'].describe())

S    644
C    168
Q     77
Name: Embarked, dtype: int64
--------------------------------------------------
count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object


## Feature description

|Feature|Type|Description|Missing values|
|-----|:-----|:-----------:|------------|
|**PassengerId**|Continuous numeric (range 1 to 891)|Unique ID for each passenger|None|
|**Survived**|Categorical (0, 1)|Survival label: 1 = survived, 0 = did not survive|None|
|**Pclass**|Categorical (1, 2, 3)|Ticket class|None|
|**Name**|Alphabetical|Name of the person|None|
|**Sex**|Categorical (male, female)|Gender of the person|None|
|**Age**|Continuous (min: 0.42, max: 80)|Age of the person|714 present, 177 missing|
|**SibSp**|Categorical numeric (0, 1, 2, 3, 4, 5, 8)|# of siblings / spouses aboard the Titanic|None|
|**Parch**|Categorical numeric (0, 1, 2, 3, 4, 5, 6)|# of parents / children aboard the Titanic|None|
|**Ticket**|Alphanumeric|Ticket|None|
|**Fare**|Continuous numeric (range 0 to 512)|Fare paid by each person|None|
|**Cabin**|Categorical alphanumeric|Cabin number of the person|204 present, 687 missing|
|**Embarked**|Categorical (S, C, Q)|Port of embarkation|889 present, 2 missing|

### Missing values
* Missing values are present in the following features: **Age, Cabin, Embarked**
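
One way to obtain such a list directly is to count nulls per column with `isnull().sum()`. The sketch below runs on a small hypothetical frame named `sample` that stands in for `train_data`; the same two lines could be applied to the real dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for train_data
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Count missing entries per column, then keep only columns that have any
missing_counts = sample.isnull().sum()
missing_only = missing_counts[missing_counts > 0]
print(missing_only)
```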

### Continuous features
* Numerical continuous data is present in: **PassengerId, Age, Fare**

### Categorical features
* Features containing categories: **Survived, Pclass, Sex, SibSp, Parch, Cabin, Embarked**
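
A quick heuristic for spotting category-like columns is to count distinct values per column with `nunique()`. The sketch below uses a small hypothetical frame and a hypothetical threshold `max_categories`; both are assumptions for illustration, and a high-cardinality categorical such as Cabin would not be caught by this check.

```python
import pandas as pd

# Hypothetical miniature frame; values are illustrative only
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 13.0],
})

# Hypothetical cut-off: columns with at most this many distinct values
max_categories = 3

# Number of distinct values per column
unique_counts = sample.nunique()
categorical_like = unique_counts[unique_counts <= max_categories].index.tolist()
print(categorical_like)
```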


In [34]:
train_data.describe()

| |PassengerId|Survived|Pclass|Age|SibSp|Parch|Fare|
|---|---|---|---|---|---|---|---|
|count|891.0|891.0|891.0|714.0|891.0|891.0|891.0|
|mean|446.0|0.383838|2.308642|29.699118|0.523008|0.381594|32.204208|
|std|257.353842|0.486592|0.836071|14.526497|1.102743|0.806057|49.693429|
|min|1.0|0.0|1.0|0.42|0.0|0.0|0.0|
|25%|223.5|0.0|2.0|20.125|0.0|0.0|7.9104|
|50%|446.0|0.0|3.0|28.0|0.0|0.0|14.4542|
|75%|668.5|1.0|3.0|38.0|1.0|0.0|31.0|
|max|891.0|1.0|3.0|80.0|8.0|6.0|512.3292|


In [35]:
train_data.describe(include=['O'])

| |Name|Sex|Ticket|Cabin|Embarked|
|---|---|---|---|---|---|
|count|891|891|891|204|889|
|unique|891|2|681|147|3|
|top|Jarvis, Mr. John Denzil|male|CA. 2343|C23 C25 C27|S|
|freq|1|577|7|4|644|
