# Titanic Data Exploration

This project was one of the assignments for the Data Science Nanodegree on Udacity. I chose the Titanic dataset for the analysis; it contains demographics and passenger information for 891 of the 2224 passengers and crew on board the Titanic.

## Background

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. (Cited from Kaggle: https://www.kaggle.com/c/titanic/)

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. (Cited from Kaggle: https://www.kaggle.com/c/titanic/)

In this report, I analyze *what sorts of people were likely to survive, namely which variables had a strong influence on survival status.*

## Overview


In [6]:
import pandas as pd
import numpy as np
%matplotlib inline
path = r'titanic_data.csv'
titanic_data = pd.read_csv(path)
titanic_data.shape

(891, 12)

In [4]:
titanic_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |


There are 891 passengers in the dataset, with 12 features (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked).

In [6]:
titanic_data.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [5]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


Name, Ticket and Cabin are strings, whereas Survived, Pclass, Sex and Embarked are categorical data. The remaining features are numeric.
- PassengerId: a number identifying each passenger; it carries no information for the analysis.
- Survived: survival status of a passenger; 0 means deceased, 1 means survived.
- Pclass: class of a passenger; 1 means first class, 2 means second class, 3 means third class.
- Name: name of a passenger, usually including a title.
- Sex: gender, male or female.
- Age: age in years.
- SibSp: number of siblings and spouses aboard, ranging from 0 to 8.
- Parch: number of parents and children aboard, ranging from 0 to 6.
- Ticket: ticket number.
- Fare: ticket fare.
- Cabin: cabin label.
- Embarked: port where a passenger boarded (C = Cherbourg, Q = Queenstown, S = Southampton).
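
The distinction between categorical and numeric features can be made explicit in pandas. The following is a minimal sketch, assuming a small toy frame (`sample`, a hypothetical name) built from the five rows printed by `head()`; in the notebook the same conversion would be applied to `titanic_data`.

```python
import pandas as pd

# Toy frame built from the five rows shown by head(); not the full dataset.
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Columns the text describes as categorical.
categorical_cols = ['Survived', 'Pclass', 'Sex', 'Embarked']
for col in categorical_cols:
    sample[col] = sample[col].astype('category')

# Age keeps its numeric dtype; the other columns now report 'category'.
print(sample.dtypes)
```

Storing these columns as `category` is optional for the exploration that follows, but it documents the intent and keeps `describe()` from treating codes such as Pclass as continuous numbers.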

PassengerId, Name and Ticket are essentially labels for a passenger and do not provide much useful information directly, so we leave them out of the following analysis. There are also missing values in Age, Cabin and Embarked; perhaps they were not recorded at the time, or the information was lost along with the ship. Most values of Cabin are missing (only 204 of 891 are present), so we skip this feature as well.
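
Counting missing values and dropping the unused columns could look like the sketch below. It assumes a toy frame (`sample`) holding a few columns from the five rows printed by `head()`; `columns_to_drop` and `reduced` are hypothetical names, and in the notebook the same calls would be made on `titanic_data`.

```python
import numpy as np
import pandas as pd

# Toy frame: a few columns from the five rows printed by head().
# Empty Cabin cells are represented as NaN.
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 1, 1, 0],
    'Name': [
        'Braund, Mr. Owen Harris',
        'Cumings, Mrs. John Bradley (Florence Briggs Th...',  # truncated in the display
        'Heikkinen, Miss. Laina',
        'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
        'Allen, Mr. William Henry',
    ],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
})

# Number of missing values in each column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Drop the label-like columns and the mostly empty Cabin column.
columns_to_drop = ['PassengerId', 'Name', 'Ticket', 'Cabin']
reduced = sample.drop(columns_to_drop, axis=1)
print(list(reduced.columns))
```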

In this problem, we focus on which factors contributed to the survival status of passengers. Therefore, *Survived* is the dependent variable, which was determined by other factors during the crisis. We'll treat *Pclass, Sex, Age, SibSp, Parch, Fare and Embarked* as independent variables that might influence Survived, and explore the relationship between them and Survived using statistical tools and plots.
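
As a rough illustration of this setup, the sketch below separates the dependent variable from the independent ones and computes a survival rate per group. It assumes a toy frame (`sample`) made of the five rows printed by `head()`; the names `target`, `features`, `X`, `y` and `rate_by_sex` are hypothetical, and five rows are far too few to support any conclusion.

```python
import pandas as pd

# Toy frame built from the five rows shown by head(); not the full dataset.
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Parch': [0, 0, 0, 0, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})

target = 'Survived'
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

y = sample[target]    # dependent variable
X = sample[features]  # independent variables

# Because Survived is coded 0/1, its mean within a group is that
# group's survival rate.
rate_by_sex = sample.groupby('Sex')[target].mean()
print(rate_by_sex)
```

The same `groupby(...).mean()` pattern applies to any of the categorical features, such as Pclass or Embarked.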

##  Univariate Analysis

In [13]:
print('Number of Survivors:', (titanic_data['Survived'] == 1).sum(), 'Number of Deceased:', (titanic_data['Survived'] == 0).sum())

Number of Survivors: 342 Number of Deceased: 549


There are fewer survivors than deceased: 342 of the 891 passengers (about 38%) survived.
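
The same counts, together with their proportions, can be obtained with `value_counts`. The sketch below is self-contained: it rebuilds a `Survived` column from the two counts reported above (342 and 549) instead of reading `titanic_data.csv`, and the variable names are hypothetical.

```python
import pandas as pd

# Rebuild a Survived column from the reported counts:
# 342 survivors (1) and 549 deceased (0).
survived = pd.Series([1] * 342 + [0] * 549, name='Survived')

counts = survived.value_counts()               # absolute counts per status
rates = survived.value_counts(normalize=True)  # proportions per status

print(counts)
print(rates.round(3))
```

In the notebook, `titanic_data['Survived'].value_counts(normalize=True)` would give the proportions directly.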