# Titanic: What it took to survive

## Introduction:

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after colliding with an iceberg during her maiden voyage from Southampton to New York City. There were an estimated 2,224 passengers and crew aboard the ship, and more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history.

In this report, I will analyze the Titanic dataset from Kaggle with the aim of understanding which factors played an important role in passenger survival. I will go through the entire data science process: posing a question, wrangling the data, exploring it, and finally drawing conclusions.

The dataset is available at https://www.kaggle.com/c/titanic/data.  

## Data Wrangling:

### Importing libraries:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline

Loading the data and taking a look at the first 5 entries:

In [3]:
data = pd.read_csv('titanic-data.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### A brief description of the dataset variables:

| Variable    | Description                                                           |
|-------------|-----------------------------------------------------------------------|
| PassengerId | Unique identification number assigned to each passenger               |
| Survived    | Whether the passenger survived or not (0 = No; 1 = Yes)               |
| Pclass      | Ticket class (1 = 1st; 2 = 2nd; 3 = 3rd)                              |
| Name        | Name of the passenger                                                 |
| Sex         | Passenger's sex                                                       |
| Age         | Passenger's age                                                       |
| SibSp       | Number of siblings/spouses on board                                   |
| Parch       | Number of parents/children on board                                   |
| Ticket      | Passenger's ticket number                                             |
| Fare        | Passenger's fare                                                      |
| Cabin       | Cabin number                                                          |
| Embarked    | Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)  |

#### Now that we know what each variable represents, let's get some descriptive statistics on the dataset:

In [5]:
data.describe(include = 'all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Goodwin, Master. Harold Victor",male,,,,347082.0,,C23 C25 C27,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


### Cleaning the data:


#### Removing misleading/incomplete information:

What I am most interested in here is finding out which variables are missing for a large number of passengers. I do not want those variables to factor into my analysis, since we won't have that information for many of the passengers in our data.

The `count` row of the last table gives the number of non-missing values per column out of 891 passengers. Age has 714 values, so about 20% of the ages are missing, and Cabin has only 204, so a staggering 77% of the cabin information is missing.
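The percentages quoted here can be reproduced with a few lines of pandas. The snippet below is a minimal sketch, not part of the original notebook: `missing_percentage` is a hypothetical helper name, and the hard-coded counts are simply copied from the `count` row of the `describe` output.

```python
import pandas as pd

def missing_percentage(df):
    """Percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# Same arithmetic done by hand from the `count` row of the describe output
total_rows = 891
non_null_counts = {"Age": 714, "Cabin": 204, "Embarked": 889}
missing_from_counts = {
    column: 100 * (total_rows - count) / total_rows
    for column, count in non_null_counts.items()
}
for column, share in missing_from_counts.items():
    print(f"{column}: {share:.1f}% missing")
# Age: 19.9% missing
# Cabin: 77.1% missing
# Embarked: 0.2% missing
```

Calling `missing_percentage(data)` on the loaded dataframe should report the same figures for every column at once, assuming the file matches the table above.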

With so much of it missing, I will drop the Cabin column from my dataframe:

In [None]:
# Name the column explicitly; the positional axis argument no longer works in pandas 2.x
data = data.drop(columns='Cabin')

In [9]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


Successfully dropped the Cabin column.

#### Replacing the codes in Embarked, Survived and Pclass with port names, class names and Yes/No labels:

In [13]:
def full_port_name(code):
    # Map the single-letter embarkation code to the full port name
    if code == "C":
        return "Cherbourg"
    elif code == "Q":
        return "Queenstown"
    elif code == "S":
        return "Southampton"
    else:
        # Leave anything unexpected (e.g. a missing value) unchanged
        return code

data["Embarked"] = data["Embarked"].apply(full_port_name)

In [14]:
def full_class_name(pclass):
    # Map the numeric ticket class to a descriptive label
    if pclass == 1:
        return "Upper"
    elif pclass == 2:
        return "Middle"
    elif pclass == 3:
        return "Lower"
    else:
        return pclass

data["Pclass"] = data["Pclass"].apply(full_class_name)

In [15]:
def survive(survived):
    # Map the 0/1 survival flag to "No"/"Yes"
    if survived == 1:
        return "Yes"
    elif survived == 0:
        return "No"
    else:
        return survived

data["Survived"] = data["Survived"].apply(survive)

In [16]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,No,Lower,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Southampton
1,2,Yes,Upper,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,Cherbourg
2,3,Yes,Lower,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Southampton
3,4,Yes,Upper,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,Southampton
4,5,No,Lower,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Southampton
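As an aside, the same relabelling can also be sketched with lookup dictionaries and `Series.replace`, which leaves any value not found in the dictionary (including missing values) untouched, much like the `else` branches in the functions above. This is an alternative illustration rather than the approach used in this report; `relabel`, the dictionary names and the three-row `sample` frame are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup tables mirroring the three relabelling functions
PORT_NAMES = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
CLASS_NAMES = {1: "Upper", 2: "Middle", 3: "Lower"}
SURVIVED_LABELS = {0: "No", 1: "Yes"}

def relabel(df):
    """Return a copy of df with the coded columns replaced by readable labels."""
    out = df.copy()
    out["Embarked"] = out["Embarked"].replace(PORT_NAMES)
    out["Pclass"] = out["Pclass"].replace(CLASS_NAMES)
    out["Survived"] = out["Survived"].replace(SURVIVED_LABELS)
    return out

# Tiny made-up frame with one missing port, only to show the behaviour
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "C", np.nan],
})
labelled = relabel(sample)
print(labelled)
```

Working on a copy keeps the original coded columns available, which can be handy later if numeric values are needed, for example to compute a survival rate as the mean of the 0/1 column.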


The data is in much better shape now, so we can start the analysis phase.