# Data Analysis
An initial exploratory analysis of the three datasets provided for the Kaggle Titanic ML competition. I am using this competition to practise applying:
- ML Modelling
- Statistics
- Python

In [2]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
# Read in each dataset
gender_submission_df = pd.read_csv("../data/gender_submission.csv")
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

## Gender Submission Dataset
A deep dive into the gender submission dataset.

In [7]:
# Look at the head of the dataset
gender_submission_df.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


In [8]:
# Look at the shape of the dataset
gender_submission_df.shape

(418, 2)

The gender submission dataset has the following properties:
- 2 attributes: PassengerId and Survived
- PassengerId identifies each passenger uniquely
- Survived is a binary value (0 = did not survive, 1 = survived)
- Total of 418 rows
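
The properties listed above can be checked in code rather than read off the output. The sketch below is a minimal example: `check_submission` is a hypothetical helper (not part of the original notebook), and `sample_df` is a stand-in built from the five rows shown by `head()` so the block runs on its own. In the notebook, `gender_submission_df` would be passed in instead.

```python
import pandas as pd

# Stand-in for gender_submission_df: the five rows shown by head() above.
sample_df = pd.DataFrame(
    {"PassengerId": [892, 893, 894, 895, 896], "Survived": [0, 1, 0, 0, 1]}
)


def check_submission(df):
    """Return simple structural checks for a submission-style DataFrame."""
    return {
        # True if no PassengerId value is repeated
        "ids_unique": bool(df["PassengerId"].is_unique),
        # True if every Survived value is either 0 or 1
        "survived_is_binary": bool(df["Survived"].isin([0, 1]).all()),
        "n_rows": len(df),
    }


checks = check_submission(sample_df)
print(checks)
```

Running `check_submission(gender_submission_df)` on the full file should report 418 rows if the shape shown above holds.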

## Train and Test Datasets
The train and test datasets share the same structure, so they are examined together in a deep dive.

In [14]:
# Look at the head of the dataset
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [15]:
# Look at the head of the dataset
test_df.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |


In [12]:
# Look at the shape of each dataset
print(train_df.shape)
print(test_df.shape)

(891, 12)
(418, 11)


From the above, we can deduce the following:
- The train and test datasets have the same columns, except that test is missing one attribute (Survived)
- gender_submission has one row per test passenger (PassengerId 892 onwards, 418 rows), but per the competition's data description it is an example submission that assumes only female passengers survived, not the true Survived labels for the test set
- The data is split into 891 training rows and 418 test rows (roughly 68% and 32%)
- Besides PassengerId and Survived, there are 10 attributes in both datasets
- These are:
    - Pclass (Ticket Class: 1 = 1st, 2 = 2nd, 3 = 3rd)
    - Name 
    - Sex
    - Age
    - SibSp (Number of siblings/spouses aboard the Titanic)
    - Parch (Number of parents/children aboard the Titanic)
    - Ticket (Ticket number)
    - Fare (Passenger fare)
    - Cabin (Cabin number)
    - Embarked (Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton)
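
The column comparison and the split percentages above can be reproduced with a few lines of plain Python. This is a minimal sketch: the column names and row counts are copied from the `head()` and `shape` outputs above so the block runs on its own. In the notebook, `train_df.columns`, `test_df.columns`, `len(train_df)` and `len(test_df)` would be used instead of the literals.

```python
# Column names copied from the head() outputs above.
train_cols = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
              "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
test_cols = ["PassengerId", "Pclass", "Name", "Sex", "Age",
             "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Columns present in one dataset but not the other.
only_in_train = sorted(set(train_cols) - set(test_cols))
only_in_test = sorted(set(test_cols) - set(train_cols))
print(only_in_train)  # ['Survived']
print(only_in_test)   # []

# Row counts copied from the shape output above.
n_train, n_test = 891, 418
total = n_train + n_test
train_pct = round(100 * n_train / total)
test_pct = round(100 * n_test / total)
print(train_pct, test_pct)  # 68 32
```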
    
Looking at the competition details, there are the following notable details for the attributes:
- pclass is a proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower
- age is fractional if less than 1, and is in the form xx.5 if estimated
- sibsp defines family relations as follows: sibling = brother, sister, stepbrother, stepsister; spouse = husband, wife (mistresses and fiancés are ignored)
- parch defines family relations as follows: parent = mother, father; child = daughter, son, stepdaughter, stepson (some children travelled only with a nanny, so they have parch = 0)
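
The age convention above suggests a simple way to flag estimated ages. The sketch below is one possible reading, not the competition's own definition: `is_estimated_age` is a hypothetical helper, and it assumes that only ages of 1 or more ending in .5 count as estimated, since ages below 1 are fractional anyway. The ages are the five values shown in the `test_df.head()` output above.

```python
import pandas as pd

# Ages from the test_df.head() output above.
ages = pd.Series([34.5, 47.0, 62.0, 27.0, 22.0], name="Age")


def is_estimated_age(age):
    """Flag ages of the form xx.5 (assumption: only for ages >= 1)."""
    # Missing ages (NaN) compare as False, so they are never flagged.
    return (age >= 1) & ((age % 1) == 0.5)


estimated = is_estimated_age(ages)
print(estimated.tolist())  # [True, False, False, False, False]
```

In the notebook, `is_estimated_age(train_df["Age"]).sum()` would count how many training ages follow the estimated format.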

My initial thoughts:
- Some attributes will have some correlation with the variable of interest (Survived)
- Some attributes may overlap (e.g. Fare may also be a proxy for SES)

## Investigating Attributes
Looking at the correlation between each attribute and Survived to gauge which features might be interesting to explore. I will also look at the correlation between the attributes themselves to see where they might overlap. For each trend, ask the question: what might be driving it?
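
One way to start is with pairwise Pearson correlations over the numeric columns. The sketch below is a minimal example under stated assumptions: `correlations_with_target` is a hypothetical helper, it excludes PassengerId because it is only an identifier, and `demo_df` holds a subset of columns from the five `train_df.head()` rows so the block runs on its own. Five rows are far too few to draw conclusions from; in the notebook, `train_df` would be passed in instead.

```python
import pandas as pd


def correlations_with_target(df, target="Survived", ignore=("PassengerId",)):
    """Return (correlations with target, full correlation matrix)."""
    # Keep numeric columns only; text columns such as Sex are skipped here.
    numeric = df.select_dtypes(include="number")
    numeric = numeric.drop(columns=list(ignore), errors="ignore")
    corr = numeric.corr()  # Pearson correlation by default
    with_target = corr[target].drop(target)
    # Order attributes by the strength of their correlation with the target.
    order = with_target.abs().sort_values(ascending=False).index
    return with_target.loc[order], corr


# Subset of columns from the train_df.head() output above.
demo_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

target_corr, corr_matrix = correlations_with_target(demo_df)
print(target_corr)                         # attribute vs Survived
print(corr_matrix.loc["Pclass", "Fare"])   # overlap between two attributes
```

Pearson correlation only captures linear relationships between numeric columns, so categorical attributes such as Sex and Embarked would need encoding or a group-wise comparison before they can be assessed in the same way.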