# **Exploratory Data Analysis**: The Titanic Dataset (Extended version)

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this demo, we will explore the likelihood of surviving the sinking of the RMS Titanic. The Titanic hit an iceberg in 1912 and quickly sank, killing 1,502 out of the 2,224 passengers and crew on board. One of the most important reasons so many people died was that there were not enough lifeboats to serve everyone. It has also been frequently noted that the people most **_likely_** to survive the disaster were women, children, and members of the upper class. Let's see if that is true.

The Titanic case is a classic problem in data science, and it is still an ongoing [Kaggle competition](https://www.kaggle.com/c/titanic). There are many other examples of the Titanic dataset in introductory statistics and Data Science courses, so we also encourage you to look around and see how others have approached the problem.

<img src="https://upload.wikimedia.org/wikipedia/commons/9/95/Titanic_sinking%2C_painting_by_Willy_St%C3%B6wer.jpg" width="500" height="500" align="center"/>

Image source: https://upload.wikimedia.org/wikipedia/commons/9/95/Titanic_sinking%2C_painting_by_Willy_St%C3%B6wer.jpg

### Introduction

We will conduct our EDA and visualization analysis in three parts:

  1. Analyze and visualize base-rates 
  2. Calculate new predictors that can help the analysis ("feature engineering")
  3. Visualize advanced data characteristics
  
To prepare, let's first review the data tools, visualization tools, and actual data for this problem.

### Data Structures for Python

There are three basic options for loading and working with data in Python:

  * Pure `Python 3.x`
  
    In this approach, you load data directly into "pure" Python data objects, such as lists, sets, and dictionaries (or nested hierarchies of such objects, such as lists-within-lists, lists-within-dicts, dicts-within-dicts, and so on). Although operations on "pure Python" objects can be slow, this approach is extremely flexible.
    
  * `NumPy`
  
    The basic data structures for holding arrays, vectors, and matrices of data are provided by a core package called `NumPy`. NumPy also has a set of linear algebra and numerical functions, but in general such functions are now provided by another package called `scipy` (for scientific computing), and numerical computation is usually done there. `NumPy` is built on primitive routines written in `C` (or even `Fortran`), so it typically runs orders of magnitude faster than the same calculations in pure Python. Nevertheless, data access, subscripting, and slicing of elements in `NumPy` follow the same syntax as pure Python.
    
  * `pandas`
  
    Most applied Data Science projects (those whose data fits into memory/RAM) now use an "Excel-like" package called `pandas`. Pandas stores data in objects called **dataframes**. Dataframes will become the central type of data object for your work in Data Science (much as Excel spreadsheets often were for Business Analysts). Dataframes provide many different properties and methods that help you to work with your data more effectively. We will use `pandas` for most of the examples and problems in the class.   
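
As a rough illustration of the three options, the sketch below stores the same handful of made-up values (toy numbers, not rows from `titanic.csv`) in each structure and computes the mean age. All variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (not taken from titanic.csv)
ages = [22.0, 38.0, 26.0, 35.0]
fares = [7.25, 71.28, 7.93, 53.10]

# 1. Pure Python: plain lists and built-in functions
mean_age_py = sum(ages) / len(ages)

# 2. NumPy: a typed array with vectorized operations
ages_np = np.array(ages)
mean_age_np = ages_np.mean()

# 3. pandas: a dataframe with named columns
df = pd.DataFrame({"Age": ages, "Fare": fares})
mean_age_pd = df["Age"].mean()

# Slicing uses the same syntax for a list and a NumPy array
first_two_py = ages[:2]
first_two_np = ages_np[:2]

print(mean_age_py, mean_age_np, mean_age_pd)
```

All three approaches give the same answer; what changes is how much bookkeeping you do yourself and how fast the calculation runs on large data.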

### Visualization for Python

There are many ways to visualize data and results in Python. That is both a good thing and a bad thing: you have plenty of choice, but there is no single standard way to do it. Eventually you will want to learn multiple methods, as data scientists often use many different libraries. The following are the most common libraries, with links to their documentation:

  * `pandas` https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
  
    If you are in a hurry, it is also possible to generate many simple plots directly from the `pandas` library and `dataframe` object. This is a good option when you are moving fast, all you need to do is see a simple histogram or trend line, and you already have your data in a pandas dataframe.  
  
  * `seaborn` https://seaborn.pydata.org/
  
    Seaborn is a simplified and better looking interface that sits on top of the standard `matplotlib` library. Seaborn is often used because it looks great, but it still lets you drop down into `matplotlib` to customize graphs for a particular need.
  
  * `plotly` and `plotly_express` https://www.plotly.express/   
  
    The `plotly` package is a comprehensive toolkit for building interactive D3 and WebGL charts. Interactive graphs are particularly good for online (web-based) dashboards. As you can see in the plot of Python visualization options below, `plotly` is quickly rising in popularity. Early versions of plotly required a [plotly account](https://plot.ly/python/) and an active internet connection; recent releases are open source and render charts locally. We won't need all of the features in plotly, however, and so we will develop examples using just `plotly_express`. Plotly express is a free, local-to-your-machine, and easier-to-use API to plotly (in recent releases it ships inside plotly itself as `plotly.express`). [See here for documentation](https://www.plotly.express/plotly_express/) for the plotly express API.  
    
  * `matplotlib` and `pyplot` https://matplotlib.org/  
  
    Matplotlib is the core package for graphing in Python, and many other packages build on top of it (e.g., pyplot, pandas, and seaborn). The `matplotlib` object model can be somewhat confusing, which often means writing many lines of code and hours of debugging. To help, matplotlib also comes with an interface library called `pyplot` ([documentation here](https://matplotlib.org/api/pyplot_api.html)) that mimics the MATLAB approach to graphing (helpful for many engineers). In general, however, `pyplot` has now been supplanted by the other options above. Although it is faster to get started with plotting by using one of the other options above, eventually you will find that you often need to return to matplotlib in order to "tweak" a layout or to work with more complicated graphs.  

<img src="viz-options.png" width="600" height="600" align="center"/>


Image source: EPFL TIS Lab
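
To make the difference between the `pyplot` interface and the matplotlib object model more concrete, here is a minimal sketch that draws the same histogram both ways. The ages are made-up toy values, and the titles and figure sizes are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# Toy ages for illustration only (not taken from titanic.csv)
ages = [2, 14, 22, 26, 27, 35, 38, 54]

# pyplot interface: functions act on an implicit "current" figure and axes
plt.figure(figsize=(4, 3))
plt.hist(ages, bins=4)
plt.title("pyplot interface")
plt.xlabel("Age")

# Object model: keep explicit references to the Figure and Axes objects
fig, ax = plt.subplots(figsize=(4, 3))
counts, edges, _ = ax.hist(ages, bins=4)
ax.set_title("object-oriented interface")
ax.set_xlabel("Age")
```

Holding on to `fig` and `ax` is what lets you pass an axes object to pandas or seaborn (via their `ax` argument) and then keep customizing the same plot.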

### The Titanic Dataset

The data is taken from the [Kaggle Titanic Competition](https://www.kaggle.com/c/titanic). It is split between a "training" dataset (where you know the actual outcome) and a "testing" dataset (where you do not know the outcome). If we were actually competing in the Kaggle competition, then we would be trying to predict the unknown testing cases and submitting our predictions to Kaggle to see if we could win. But in this case, the objective is simply to get you started with Python and to familiarize you with the basic data structures and graphing libraries of the Data Science stack. We therefore will ignore the testing dataset and work only with the training data.

All of the data that you will need for this demo is in the `titanic.csv` file, located within the same directory as this notebook.

We don't know very much about the 891 passengers in the training dataset. The following features are available.

  Feature name | Description  |
  -------- | -------------- |
  Survived | Target variable, i.e. survival, where 0 = No, 1 = Yes |
  PassengerId | Id of the passenger |
  Pclass   | Ticket class, where 1 = 1st, 2 = 2nd, 3 = 3rd |
  Name     | Passenger name |
  Sex      | Sex |
  Age      | Age in years |
  SibSp    | Num of siblings or spouses aboard the Titanic |
  Parch    | Num of parents or children aboard the Titanic |
  Ticket   | Ticket number, i.e. record ID |
  Fare     | Passenger fare |
  Cabin    | Cabin number |
  Embarked | Port of Embarkation, where C = Cherbourg, Q = Queenstown, S = Southampton |
  


**Special Notes**

  * **Pclass**: A proxy for socio-economic status (SES): 1st = Upper class; 2nd = Middle class; 3rd = Lower class.  
  * **Age**: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5  
  * **SibSp**: The dataset defines family relations as: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)  
  * **Parch**: The dataset defines family relations as: Parent = mother, father; Child = daughter, son, stepdaughter, stepson; Some children travelled only with a nanny, therefore parch=0 for them.  
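
The encoding rules for **Age** in the notes above can be sketched as a small helper. This is only an illustration under assumptions: the function name `describe_age` and the category labels are hypothetical, and the input values are toy numbers rather than rows from the dataset.

```python
import math

def describe_age(age):
    """Classify an Age value using the encoding rules from the notes above."""
    if math.isnan(age):
        return "missing"
    if age < 1:
        return "infant (fractional age)"
    if age % 1 == 0.5:
        return "estimated"
    return "recorded"

# Toy values for illustration only
for age in [0.42, 28.5, 30.0, float("nan")]:
    print(age, "->", describe_age(age))
```

The check for infants comes first, so an age below 1 is always treated as a fractional age even if it happens to equal 0.5.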

--------

## **Part 0**: Setup

In [None]:
# Put all import statements at the top of your notebook -- import some basic and important ones here

# Standard imports
import numpy  as np
import pandas as pd
import pandas_profiling
import os
import sys

# Visualization packages 
import matplotlib
import matplotlib.pyplot as plt  
import seaborn           as sns
from plotly.offline      import init_notebook_mode
init_notebook_mode(connected=True)  

# Special code to ignore un-important warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

## **Part 1**: Analyze Base Rates and EDA

The most basic "model" to run for any problem is the "average" (or "base rate") outcome. Before we work on more complicated models, it is a good idea to understand how a very simple heuristic will perform -- and then you can determine how much a more complicated model really improves the predictions. Your first model may be too simple and therefore highly "biased" (i.e., systematically low or systematically high for different groupings or levels within the data); but your simple model should also not vary much if/when we pull a new sample from the same underlying population/data generating function.  
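
As a minimal sketch of the "base rate" idea, assuming a short list of made-up outcome labels (not the real data), the simplest heuristic always predicts the most common outcome, and its accuracy is the share of that outcome.

```python
from collections import Counter

# Toy outcome labels for illustration only (1 = survived, 0 = died)
outcomes = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Base rate: the share of positive outcomes
base_rate = sum(outcomes) / len(outcomes)

# Simplest heuristic: always predict the most common outcome
majority_class, majority_count = Counter(outcomes).most_common(1)[0]
baseline_accuracy = majority_count / len(outcomes)

print("Base rate:", base_rate)
print("Majority class:", majority_class)
print("Baseline accuracy:", baseline_accuracy)
```

Any more complicated model should beat this baseline accuracy before it is worth the extra complexity.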

We will therefore conduct some **_exploratory data analysis_** ("**EDA**") and **_visualization_** in this demo to understand the distribution of the outcome for this case.

### Open and Inspect the Data

In [None]:
# Open dataset with Pandas

data = pd.read_csv('titanic.csv')

# Count rows and columns
data.shape

## **Part 1a**: EDA - the automated approach

Instead of performing every EDA step manually, you can also run a "profile" on the dataset first, and then drill down into specific cases of interest.

In [None]:
# Use the automated pandas profiling utility to examine the dataset
data.profile_report()

## **Part 1b**: EDA - the manual approach

It is generally a good idea to go beyond the automated approach to EDA. Here are some useful steps for understanding and plotting the data in more detail.

In [None]:
# Inspect the data (variable names, count non-missing values, review variables types)

data.info()

In [None]:
# Look at the "head" of the dataset

data.head()

In [None]:
# Look at the "tail" of the dataset

data.tail()

In [None]:
# Calculate summary statistics (mean, min, max, std, etc.) for all variables and transpose the output

data.describe().T

In [None]:
# Count missing data per feature

data.isnull().sum()

In [None]:
# Plot missing data (Hint: use a seaborn heatmap to see the distribution of isnull() for the dataframe)

sns.heatmap(data.isnull(), cbar=False, cmap="YlGnBu_r")
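
Counts of missing values are easier to compare as shares. The sketch below, using a small made-up dataframe (the values are toy data, not rows from `titanic.csv`), relies on the fact that `isnull()` returns booleans, so the column mean is the fraction of missing entries.

```python
import numpy as np
import pandas as pd

# Toy dataframe for illustration only (not rows from titanic.csv)
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.28, 7.93, 8.05],
})

# isnull() yields booleans, so the column mean is the share of missing values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting the result puts the most incomplete features at the top, which is a quick way to decide which columns need imputation or should be dropped.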

### Aside: selecting columns in pandas

As we will select columns from a pandas dataframe many times, it is important to note that there are generally two equivalent ways of doing this. We will use the first approach of passing the column name as a string in square brackets. This distinguishes column selection from method calls.

In [None]:
# First approach (recommended): pass the column name as a string in square brackets

data['Survived'].describe()

In [None]:
# Second approach: access the column as an attribute of the dataframe

data.Survived.describe()
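
One reason to prefer square brackets: attribute access breaks down when a column name collides with a dataframe method or is not a valid Python identifier. The sketch below uses a toy dataframe whose column names are hypothetical and chosen deliberately to cause trouble.

```python
import pandas as pd

# Toy dataframe; the column names are hypothetical and chosen to cause trouble
toy = pd.DataFrame({"count": [3, 1, 2], "Ticket class": [1, 3, 2]})

# Bracket selection always returns the column
count_column = toy["count"]
class_column = toy["Ticket class"]
print(count_column.tolist())
print(class_column.tolist())

# Attribute access finds the DataFrame.count method, not the "count" column
attribute_is_method = callable(toy.count)
print(attribute_is_method)

# A name with a space cannot be written as an attribute at all:
# toy.Ticket class  would be a syntax error
```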

### Analyze Survival (the basic outcome)

In [None]:
# Count number of Survivals and Deaths

survivals = sum(data['Survived'])
deaths = len(data[data['Survived'] == False])
assert survivals + deaths == len(data)     # not necessary, but this should be true
assert survivals + deaths == data.shape[0] # not necessary, but this also should be true
print('Survivals: ', survivals)
print('Deaths: ', deaths)

In [None]:
print('The base-rate likelihood of survival: ', survivals/(survivals+deaths))

### Plotting the target feature "Survived": 2 approaches

In [None]:
# Plot a histogram of "Survived" using Pandas 
# The fastest way - this just uses the pandas dataframe  

data['Survived'].hist()  

In [None]:
# Plot a histogram of "Survived" using Seaborn
# Pass parameters by name: the column as a string and the dataframe as `data`

sns.countplot(x='Survived', data=data)

# Note that passing the column positionally, e.g. sns.countplot(data["Survived"]),
# is not handled the same way by every seaborn version

### Analyze Base-Rate Outcomes, by Condition

Let's start by calculating the frequency (or "base rate") of the outcome variable for different conditions of interest. You can do this most easily by using the `pd.crosstab()` function from `pandas`.

In [None]:
# Use the pandas crosstab() function to count outcomes by condition

pd.crosstab(data['Pclass'], data['Survived'], margins=False)

In [None]:
# Now show the totals (so you can calculate a conditional marginal base rate).

pd.crosstab(data['Pclass'], data['Survived'], margins=True)

In [None]:
# Use the .style accessor (chain it onto the end of prior code) to also overlay a heatmap 
# i.e., you can still do this in one line of code with .style.background_gradient()

pd.crosstab(data['Pclass'], data['Survived'], margins=True).style.background_gradient()

In [None]:
# If you can't read the results, select your own color scheme

pd.crosstab(data['Pclass'], data['Survived'], margins=True).style.background_gradient(cmap='autumn_r')

In [None]:
# Now do a "three-way" crosstab for Class, Sex, Survival

pd.crosstab([data['Sex'], data['Survived']], data['Pclass'], margins=True)
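
The crosstabs above return counts. To read conditional base rates directly, one option is the `normalize` argument of `pd.crosstab()`, or a `groupby` on a 0/1 outcome. The sketch below uses a small made-up dataframe (toy values, not the real passenger data).

```python
import pandas as pd

# Toy dataframe for illustration only (not rows from titanic.csv)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Row-normalized crosstab: each row sums to 1, so the entries are
# the shares of each outcome within a class
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates)

# Equivalent shortcut: the mean of a 0/1 column within each group is a rate
by_class = toy.groupby("Pclass")["Survived"].mean()
print(by_class)
```

The column labelled `1` in `rates` matches `by_class`: both give the survival rate conditional on ticket class.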

In [None]:
# Use the pandas dataframe to plot a histogram of Age 

data['Age'].hist()

In [None]:
# Increase the number of histogram bins

data['Age'].hist(bins = 40)

In [None]:
# Use seaborn to plot the kernel density (a kdeplot) for Age

facet = sns.FacetGrid(data, aspect = 4)
facet.map(sns.kdeplot,'Age', shade = True)
facet.set(xlim = (0, data['Age'].max()))

In [None]:
# Use Seaborn to plot the kernel density of Fare

facet = sns.FacetGrid(data, aspect=4)
facet.map(sns.kdeplot,'Fare', shade=True)
facet.set(xlim = (0, data['Fare'].max()))
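
To see what a kernel density estimate computes, here is a hand-rolled sketch with NumPy: a Gaussian "bump" is centered on each observation and the bumps are averaged. The ages are toy values and the bandwidth is fixed by hand as an assumption; seaborn chooses a bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

# Toy ages for illustration only (not taken from titanic.csv)
ages = np.array([2.0, 22.0, 24.0, 26.0, 35.0, 38.0, 54.0])
bandwidth = 5.0  # assumption: fixed by hand for this sketch

# Points at which the density is evaluated
grid = np.linspace(-30, 90, 601)

# Center one Gaussian kernel on each observation, then average them
z = (grid[:, None] - ages[None, :]) / bandwidth
kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
density = kernels.mean(axis=1)

# A density should integrate to (approximately) 1 over the grid
step = grid[1] - grid[0]
area = density.sum() * step
peak = grid[density.argmax()]
print("Area under the curve:", round(area, 3))
print("Location of the peak:", round(peak, 1))
```

A smaller bandwidth gives a bumpier curve that follows individual observations, while a larger one smooths them together.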

In [None]:
# Redo the above in just 1 line of code, but show both a frequency histogram of counts, 
# and a kernel density of the probability density function

# There usually is a simple way to do it with seaborn...

facet = sns.distplot(data['Fare'])

In [None]:
# Use pandas to plot a histogram of Survived, separated by Class
# Hint: figsize is defined in inches

lived = data[data['Survived'] == 1]['Pclass'].value_counts()
died  = data[data['Survived'] == 0]['Pclass'].value_counts()
df = pd.DataFrame([lived, died])
df.index = ['Lived', 'Died']
df.plot(kind = 'bar', stacked=True, figsize=(12, 5), title='Survival by Socio-Economic Class (1st, 2nd, 3rd)')

In [None]:
# Use seaborn barplots to plot Survival as a Function of Class

sns.barplot(x='Pclass', y='Survived', data=data)
plt.ylabel("Survival Rate")
plt.title("Survival as function of Pclass")
plt.show() # this removes the annoying line that references the final object, e.g. "<matplotlib.axes._subplots.AxesSubplot at 0x1a1d5f4e48>"

In [None]:
# Use pandas to plot a histogram of Survived, separated by Sex

lived = data[data['Survived'] == 1]['Sex'].value_counts()
died  = data[data['Survived'] == 0]['Sex'].value_counts()
df = pd.DataFrame([lived, died])
df.index = ['Lived', 'Died']
df.plot(kind = 'bar', stacked = True, figsize = (12, 5), title = 'Survival by Gender')
plt.show()  

In [None]:
# Use Pandas to draw basic pie charts of Survival by Sex (one chart per subplot)

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(16,7))
data['Survived'][data['Sex'] == 'male'].value_counts().plot.pie(ax = ax1)
data['Survived'][data['Sex'] == 'female'].value_counts().plot.pie(ax = ax2, colors = ['C1', 'C0'])

In [None]:
# Now use some matplotlib customization to make your previous plot look cool 

f, ax = plt.subplots(1, 2, figsize = (16, 7))
data['Survived'][data['Sex'] == 'male'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax = ax[0], shadow = True)
data['Survived'][data['Sex'] == 'female'].value_counts().plot.pie(explode=[0,0.2], autopct='%1.1f%%', ax = ax[1], shadow = True, colors = ['C1', 'C0'])
ax[0].set_title('Survived (male)')
ax[1].set_title('Survived (female)')
plt.show()

In [None]:
# Use a seaborn facet grid to jointly examine Sex, Class, and Survival

g = sns.FacetGrid(data, row = 'Sex', col = 'Pclass', hue = 'Survived', margin_titles = True, height = 3, aspect = 1.1)
g.map(sns.distplot, 'Age', kde = False, bins = np.arange(0, 80, 5), hist_kws = dict(alpha=0.6))
g.add_legend()  
plt.show()  

In [None]:
# Examine the distribution of Fare as a function of Pclass, Sex and Survived

g = sns.FacetGrid(data, row = 'Sex', col = 'Pclass', hue = 'Survived', margin_titles = True, height = 3, aspect = 1.1)
g.map(sns.distplot, 'Fare', kde = False, bins = np.arange(0, 550, 50), hist_kws = dict(alpha = 0.6))
g.add_legend()  
plt.show()  

In [None]:
# Use the plt.subplots() function from pyplot to capture the figure and subplot objects
# so you can work with both seaborn and matplotlib to make a fully customized distribution plot

LABEL_SURVIVED = 'Survived'
LABEL_DIED = 'Did Not Survive'
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 6))
women = data[data['Sex'] == 'female']
men = data[data['Sex'] == 'male']

ax = sns.distplot(women[women['Survived'] == 1]['Age'].dropna(), bins = 18, label = LABEL_SURVIVED, ax = axes[0], kde = False)
ax = sns.distplot(women[women['Survived'] == 0]['Age'].dropna(), bins = 40, label = LABEL_DIED, ax = axes[0], kde = False)
ax.legend()
ax.set_title('Female')

ax = sns.distplot(men[men['Survived'] == 1]['Age'].dropna(), bins = 18, label = LABEL_SURVIVED, ax = axes[1], kde = False)
ax = sns.distplot(men[men['Survived'] == 0]['Age'].dropna(), bins = 40, label = LABEL_DIED, ax = axes[1], kde = False)
ax.legend()
_ = ax.set_title('Male')

In [None]:
# ADVANCED: combine histograms for many variables related to survival into one composite figure

R = 2
C = 3
fields = ['Survived', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']

fig, axs = plt.subplots(R, C, figsize = (12, 8))

for row in range(0, R):
    for col in range(0, C):  
        i = row * C + col       
        ax = axs[row][col]
        sns.countplot(x = fields[i], hue = "Survived", data = data, ax = ax)
        ax.set_title(fields[i], fontsize = 14)
        ax.legend(title = "survived", loc = 'upper center') 
        
plt.tight_layout()  
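
The index arithmetic `i = row * C + col` in the loop above maps each grid cell to a position in the `fields` list. A small sketch of the same mapping in the other direction, using `divmod` (the dictionary name `positions` is an illustrative assumption):

```python
R, C = 2, 3
fields = ['Survived', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']

# divmod(i, C) is the inverse of i = row * C + col
positions = {}
for i, name in enumerate(fields):
    row, col = divmod(i, C)
    positions[name] = (row, col)
    print(f"{name:>8} -> axs[{row}][{col}]")
```

With matplotlib you can often skip the arithmetic entirely by iterating over `axs.flat`, which visits the subplots in the same row-by-row order.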

## **Part 2**: Feature Engineering

Using domain knowledge, we can create new features that might improve performance of our model at a later stage. 

In [None]:
# Extract the leading "title" from the passenger name, and summarize (count) the different titles 

data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
data['Title'].value_counts()
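
To see how the title pattern works, the sketch below applies the same regular expression with Python's `re` module to a few hypothetical names written in the "Surname, Title. Given names" layout. The names are invented for illustration, and `extract_title` is an assumed helper name.

```python
import re

# Same pattern as above: a space, then letters (captured), then a literal period
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

# Hypothetical names for illustration only
names = [
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Ann Smith)",
    "Poe, Master. Tim",
    "No Title Here",
]

def extract_title(name):
    """Return the first word that follows a space and ends with a period."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

titles = [extract_title(name) for name in names]
print(titles)
```

When the pattern does not match, this helper returns `None`; the pandas `.str.extract()` call marks such rows as missing (`NaN`) instead.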

## **Part 3**: Explore Swarm and Violin Plots

Swarm and violin plots show the same data as plots of counts and/or density distributions (graphs you did earlier), but they also happen to look very cool and can also draw attention to details that you do not see in other plots. If you have extra time, try to make a few of these below.

In [None]:
# Define some constants (such as PALETTE and FIGSIZE) so that all of your figures look consistent

PALETTE = ["lightgreen" , "lightblue"]  # you can set a custom palette with a simple list of named color values
FIGSIZE = (13, 7)

In [None]:
# Use a seaborn "swarmplot" to examine survival by age and class.

fig, ax = plt.subplots(figsize = FIGSIZE)
sns.swarmplot(x = 'Pclass', y = 'Age', hue = 'Survived', dodge = True, data = data, palette = PALETTE, size = 7, ax = ax)
plt.title('Survival Events by Age and Class ')
plt.show()

In [None]:
# Use a seaborn "violinplot" to examine survival by age and class. 

fig, ax = plt.subplots(figsize = FIGSIZE)
sns.violinplot(x = "Pclass", y = "Age", hue = 'Survived', data=data, split=True, bw = 0.05 , palette = PALETTE, ax = ax)
plt.title('Survival Distributions by Age and Class ')
plt.show()

In [None]:
# Use the catplot function (in just one line of code!) to show the comparable distributions for Class, Age, Sex and Survived

g = sns.catplot(x = "Pclass", y = "Age", hue = "Survived", col = "Sex", data = data, kind = "violin", split = True, bw = 0.05, palette = PALETTE, height = 7, aspect = 0.9)

---------------

## Further Reading

- Recap of basic Python visualization methods with matplotlib: https://machinelearningmastery.com/data-visualization-methods-in-python/ 
- Guide to Bokeh, an increasingly popular Python interactive visualization package: https://realpython.com/python-data-visualization-bokeh/