## Background

**Data Link** : [link](https://www.kaggle.com/c/titanic/data) 

In this exercise, I investigate the Titanic data set. The data comes from Kaggle, and the sample consists of characteristics of passengers on the Titanic.

I follow the **Data Analysis Process**. After a first read of the data, I came up with the following initial set of questions:

1. How much did the class (Upper/Middle/Lower) affect the survival rate of the passengers?
2. Did travelling with siblings/parents affect a passenger's survival rate?
	- Was the effect more prominent for passengers travelling with siblings, or
	- was it more prominent for passengers travelling with parents?
3. Did younger passengers survive more often than older passengers?
4. Did the cabin number assigned to a passenger play a role in whether they survived?



In [13]:
# Standard Imports
import numpy as np
import pandas as pd

In [18]:
#Import the Titanic Data and Load a Sample 

passenger_df = pd.read_csv('P2 - Investigate a Dataset using Python/Data/train.csv')
passenger_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Step 1 Understand the Data
The data set contains **891** passengers.

Data Dictionary

1. PassengerId - Id of the passenger
2. Survived - Whether the passenger survived or died
	- 0 = Died
	- 1 = Survived
3. Pclass - Class in which the passenger travelled
	- 1 = 1st Class
	- 2 = 2nd Class
	- 3 = 3rd Class
4. Name - Name of the passenger
5. Sex - Sex (gender) of the passenger
	- male
	- female
6. Age - Age of the passenger (in years); see the note under Variable Notes
7. SibSp - Number of siblings / spouses aboard the Titanic
8. Parch - Number of parents / children aboard the Titanic
9. Ticket - Ticket number
10. Fare - Passenger fare
11. Cabin - Cabin number
12. Embarked - Port of embarkation
	- C = Cherbourg
	- Q = Queenstown
	- S = Southampton

### Variable Notes

Pclass: A proxy for socio-economic status (SES)

- 1st = Upper
- 2nd = Middle
- 3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

SibSp: The dataset defines family relations in this way:

- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way:

- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore Parch = 0 for them.
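The SibSp and Parch definitions above map naturally onto simple derived columns for the second question. The following is a minimal sketch, not part of the original analysis: the rows are hypothetical, and the column names `WithSibSp`, `WithParch` and `FamilySize` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only; these are not taken from train.csv.
sample_df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],
    "Parch": [0, 0, 2, 1],
})

# Flag passengers travelling with at least one sibling/spouse or parent/child.
sample_df["WithSibSp"] = sample_df["SibSp"] > 0
sample_df["WithParch"] = sample_df["Parch"] > 0

# Number of relatives aboard, not counting the passenger.
sample_df["FamilySize"] = sample_df["SibSp"] + sample_df["Parch"]

print(sample_df)
```

Note that SibSp counts siblings and spouses together, and Parch counts parents and children together, so these flags cannot separate a sibling from a spouse or a parent from a child.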

In [27]:
# Printing a Basic Summary 
# Will give us an idea if any cleansing / wrangling is required
passenger_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
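The non-null counts above imply 177 missing values for Age (891 - 714), 687 for Cabin (891 - 204) and 2 for Embarked (891 - 889). A quick way to get such counts directly is sketched below on a small hypothetical frame; applied to `passenger_df`, the same calls should give the per-column counts.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; these are not taken from train.csv.
sample_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number of missing values per column.
missing_counts = sample_df.isnull().sum()

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = sample_df.isnull().mean()

print(missing_counts)
print(missing_share)
```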


## Data Cleansing

Data cleansing is required for the Age, Cabin and Embarked fields, since each of them has missing values.

age_df = passenger_df['Age']
age_df = age_df.dropna()  # dropna() returns a new Series, so the result must be assigned

age_df.describe()

In [35]:
# Cleansing Age - no value can reasonably be substituted for a missing age,
# so we take the non-null ages out separately to look at them in isolation and see if we can derive any insights.
age_df = passenger_df['Age']
age_df =age_df.dropna()
print ("The No of Age Records is : ", age_df.count())
print (" The Mean is : ", age_df.mean())
age_df.describe()

('The No of Age Records is : ', 714)
(' The Mean is : ', 29.69911764705882)


count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
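For the question about younger versus older passengers, one option is to bucket ages into groups and compare survival rates per group. The sketch below uses hypothetical rows, and the bin edges and labels are arbitrary assumptions chosen for illustration, not values from the original analysis.

```python
import pandas as pd

# Hypothetical rows for illustration only; these are not taken from train.csv.
sample_df = pd.DataFrame({
    "Age": [4.0, 15.0, 28.0, 45.0, 70.0, 30.0],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Assumed age bands; each interval is open on the left and closed on the right.
age_bins = [0, 12, 18, 60, 120]
age_labels = ["Child", "Teen", "Adult", "Senior"]
sample_df["AgeGroup"] = pd.cut(sample_df["Age"], bins=age_bins, labels=age_labels)

# Survived is 0/1, so its mean per group is the group's survival rate.
survival_by_age = sample_df.groupby("AgeGroup", observed=False)["Survived"].mean()

print(survival_by_age)
```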

In [45]:
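# Data cleansing for Embarked
print ("Values for Embarked : ", passenger_df['Embarked'].unique())

# mode() returns a Series, so take its first element to get the value itself
print ("The most frequent value for Embarked is %s, so missing values are replaced with it" % passenger_df['Embarked'].mode()[0])

# Assign the result back instead of calling fillna(inplace=True) on a column selection
passenger_df['Embarked'] = passenger_df['Embarked'].fillna("S")
print (passenger_df['Embarked'].unique())

('Values for Embarked : ', array(['S', 'C', 'Q', nan], dtype=object))
The most frequent value for Embarked is S, so missing values are replaced with it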
['S' 'C' 'Q']
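Cabin is the remaining field flagged for cleansing. With only 204 of 891 values present, filling it with a single value is unlikely to be meaningful. One possible treatment, sketched here as an assumption rather than as the method used in this analysis, is to record whether a cabin is known and to keep the leading letter of the cabin number, which is commonly read as the deck. The column names `HasCabin` and `Deck` are hypothetical; the cabin values mirror the five sample rows shown earlier.

```python
import numpy as np
import pandas as pd

# Cabin values as they appear in the five sample rows shown earlier.
sample_df = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", np.nan]})

# True where a cabin number was recorded.
sample_df["HasCabin"] = sample_df["Cabin"].notnull()

# First character of the cabin number; missing cabins become "Unknown".
sample_df["Deck"] = sample_df["Cabin"].str[0].fillna("Unknown")

print(sample_df)
```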


In [46]:
#DATA EXPLORATION 
# Let us start by describing the Data Set
passenger_df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
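Because Survived is coded 0/1, its mean of 0.383838 in the summary is the overall survival rate (about 38%). The same idea, applied per group, gives a first look at the class question. The sketch below uses only the Pclass and Survived values of the five sample rows shown earlier, so the resulting rates are illustrative and say nothing about the full data set.

```python
import pandas as pd

# Pclass and Survived values from the five sample rows shown earlier.
sample_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Survived": [0, 1, 1, 1, 0],
})

# Mean of a 0/1 column per class is the survival rate for that class.
survival_by_class = sample_df.groupby("Pclass")["Survived"].mean()

print(survival_by_class)
```

On the full data the equivalent call would be `passenger_df.groupby('Pclass')['Survived'].mean()`.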
