## Prequels/sequels

- **ChaiEDA sessions: Titanic: initial (manual) EDA with explanations**
- [ChaiEDA sessions: Titanic - Using tools](https://www.kaggle.com/neomatrix369/chaieda-sessions-titanic-using-tools/): analysis using ready-to-use analysis and profiling tools

<a id='ToC'></a>

----------

# Table of contents

- [Tasks](#tasks)
- [Conjectures and observations](#conjectures)
  - [Tips](#conjectures)
- [Data dictionary | Variable notes](#Data-defininition)
- [Potential stats/graphs from fields in the training dataset](#Potential-stats)
   - [Field combinations for stats/graphs](#field-combinations)
- [Summary](#summary)
- [Import libraries/packages](#import-libraries)
- [Training Dataset](#training-dataset)
- Analysis
  - [General (correlated to survival)](#analysis-general)
  - [Circumstantial/chance (only stats)](#analysis-circumstantial) 
- [Test dataset](#test-dataset)
- [Dataset summary](#dataset-summary) 
- [Resources](#resources)
- [Credits](#credits)


<a id='tasks'></a>

----------

# Tasks

- [x] Read competition description
- [X] Read dataset fields/columns descriptions
- [X] Open datasets using pandas.read_csv
   - [X] Compute the total number of passengers from the given data
- [X] List of fields to plot (continuous process)
- [X] Graph of fields and combination of fields (continuous process)
   - [X] Add more stats (even if they don't correlate to Survival) - just to see the stats 
     - [X] Age (3-levels: young, adult, elderly and age group categories)
     - [X] Fare (price-range categories)
     - [X] Marital status (even if a large number of them are unknown)
     - [X] Maybe sort Ticket number and Pclass to see if fares are not sequential (maybe we can see who paid more to get a ticket/bought late)
     - [X] Travelling with company or alone (3/4 categories)?

<a id='conjectures'></a>

----------

# Conjectures and observations

- 1502 out of 2224 passengers and crew died
- 891 entries in training dataset
- 418 entries in test dataset
- total entries: 1309
- difference = 2224 - 1309 = 915: where are the remaining 915 entries?
- Purpose of the gender_submission file: slightly misleading name, biased towards gender (mild comment)
- unsure if the dataset contains crew information and how many of the crew survived or died
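
The bookkeeping in the list above is simple enough to check directly. A minimal sketch using only the figures quoted in the list (the variable names are illustrative, not taken from the kernel):

```python
# Figures quoted in the observations above
total_on_board = 2224     # passengers and crew
training_entries = 891
test_entries = 418

total_entries = training_entries + test_entries
unaccounted = total_on_board - total_entries

print("total entries:", total_entries)
print("entries not in either dataset:", unaccounted)
print("share of people on board covered: {:.1%}".format(total_entries / total_on_board))
```

Whether the missing entries are crew, passengers, or a mix of both cannot be told from these numbers alone.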


## Tips

- Age, gender and class show survival rate
- Look into pivot tables and usage of percentile with describe()
- See https://github.com/ashleysmart/mlgym/blob/master/tuts/Onboarding_Fundementals.ipynb
- Apply interesting graphs/stats from Lokesh's notebook
- Add more stats (even if they don't correlate to Survival) - just to see the stats
  - Age (3-levels: young, adult, elderly and age group categories)
  - Fare (price-range categories)
  - Marital status (even if a large number of them are unknown)
  - Maybe sort Ticket number and Pclass to see if fares are not sequential (maybe we can see who paid more to get a ticket/bought late)
  - Travelling with company or alone (3/4 categories)?

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='Data-defininition'></a>

----------

# Data Dictionary
|Variable|Definition|Key|
|---|---|---|
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex||
|Age|Age in years||
|sibsp|# of siblings / spouses aboard the Titanic||
|parch|# of parents / children aboard the Titanic||
|ticket|Ticket number||
|fare|Passenger fare||
|cabin|Cabin number||
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

# Variable Notes

**pclass**: A proxy for socio-economic status (SES)

      1st = Upper
      2nd = Middle
      3rd = Lower

**age**: 

      Age is fractional if less than 1. 
      If the age is estimated, it is in the form of xx.5
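
The two rules in the age note can be turned into a small check. This is only a sketch: `describe_age` is a hypothetical helper, not part of the kernel, and it assumes that any age of 1 or more ending in .5 is an estimate.

```python
import math

def describe_age(age):
    """Classify an Age value according to the variable notes (a sketch)."""
    if age is None or math.isnan(age):
        return "missing"
    if age < 1:
        return "infant (fractional age)"
    # Assumption: an age of 1 or more with a fractional part of .5 is an estimate
    if age % 1 == 0.5:
        return "estimated"
    return "recorded"

for value in [0.42, 28.5, 30.0, float("nan")]:
    print(value, "->", describe_age(value))
```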

**sibsp**: The dataset defines family relations in this way...

      Sibling = brother, sister, stepbrother, stepsister

      Spouse = husband, wife (mistresses and fiancés were ignored)


**parch**: The dataset defines family relations in this way...

     Parent = mother, father

     Child = daughter, son, stepdaughter, stepson

     Some children travelled only with a nanny, therefore parch=0 for them.

<a id='Potential-stats'></a>

----------


# Potential stats/graphs from fields in the training dataset

## General (correlated to survival)
- Survived/died (and combinations with other fields)
- PClass (and combinations with other fields)
- Gender (and combinations with other fields)
- Age (and combinations with other fields)
- Fare (and combinations with other fields)

## Circumstantial/chance (only stats)
- SibSp and Parch (and combinations with other fields) - summing them up to check for total number of passengers present (open to discussion)
- Married or single female passengers (age >= 18), rest minor children
- Extracting tokens from ticket number
- Embarked (and combinations with other fields)
- Cabin (and combinations with other fields)

<a id='field-combinations'></a>

----------


## Field combinations for stats/graphs

Use https://textmechanic.com/text-tools/combination-permutation-tools/combination-generator/

### General (correlated to survival)
- Survived/died
- PClass 
- Gender 
- Age
- Fare
- Survived/died, PClass
- Survived/died, Gender 
- Survived/died, Age
- Survived/died, Fare
- _...(others)..._

### Circumstantial/chance (only stats) 
- Survived/died, Married or single female passengers (age >= 18)
- Survived/died, minor female children
- Survived/died, Embarked
- Survived/died, Cabin 
- _...(others)..._
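
The field combinations listed above can also be generated in code rather than with an online tool. A minimal sketch using the standard library; `field_combinations` and its `max_size` parameter are illustrative names, and the field list is the "General" list from this section:

```python
from itertools import combinations

general_fields = ["Survived/died", "PClass", "Gender", "Age", "Fare"]

def field_combinations(fields, max_size=2):
    """Return all combinations of 1..max_size fields, smallest first."""
    result = []
    for size in range(1, max_size + 1):
        result.extend(combinations(fields, size))
    return result

for combo in field_combinations(general_fields):
    print(", ".join(combo))
```

With five fields and `max_size=2` this yields 15 combinations (5 single fields plus 10 pairs); raising `max_size` to 5 yields all 31 non-empty combinations.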

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='summary'></a>

----------

> # Summary

> I'll start off with a summary of the ideas I will be presenting in the next handful of sections:

> From a data quality point of view the dataset has some missing data, but this does not affect the analysis much: as we will see, some of the feature engineering adopted in the kernel helps to work around these gaps and gives a clearer picture. There is, however, a discrepancy between the number of tickets and passengers in the records (training plus test datasets) and the number of people reported to have travelled on board, and this might need rectification. There is also no information about the crew members, so nothing can be known about them; but the goal of the analysis is only to find which passengers survived, as opposed to crew and passengers together.

> More information would have helped to predict survival, but from the various numbers, tables and graphs we can conclude:
> - more men were lost on board, for reasons we will cover soon
> - most adult men between the ages of 18 and 65 did not survive (even though the age is not known for a percentage of the data)
> - first class passengers were given, or automatically got, priority over the other classes
> - marital status of a passenger has little correlation with survival; in addition, we could not guess/judge/know the marital status of the male passengers (apart from those not of legal age to marry)
> - fare is a misleading feature and cannot easily be correlated with survival, as it is linked to a few other features like Passenger class, Cabin and Ticket
> - Passenger class, Cabin, Ticket and Fare combined give a better picture of survival
> - we know a number of passengers across all three passenger classes have overpaid for their tickets and we can easily point them out
> - a strange fact: those who boarded at Southampton account for most of the death toll
> - another strange observation: there are passengers who travelled with a ticket priced £0.00 across all three classes
> - **to sum it up, there were more female survivors, and first class travellers stood a better chance of surviving**
> - **those who travelled with company, i.e. family or friends, had a higher chance of survival than those who travelled by themselves, although those with less able dependents suffered to a visible degree**
> - **it appears that most men were helping women, children, the elderly and other less able passengers to get off the ship and in turn minimised their own chances of survival, leading to 81% casualties among men**

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='import-libraries'></a>

----------

# Import libraries/packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Import library and dataset
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", font_scale=1.75)

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import math

# prettify plots
plt.rcParams['figure.figsize'] = [20.0, 5.0]

%matplotlib inline

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id='training-dataset'></a>

----------


# Training Dataset

In [None]:
training_dataset = pd.read_csv('/kaggle/input/titanic/train.csv')
print("Column count:", len(training_dataset.columns))
training_dataset.dtypes

In [None]:
sensible_columns = ['Survived', 'Died', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin']

In [None]:
training_dataset

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

### Missing data

In [None]:
!pip install -U missingno

In [None]:
import missingno as msno

In [None]:
msno.bar(training_dataset)

In [None]:
sns.heatmap(training_dataset.isnull(), yticklabels=False, cbar=False,cmap='viridis')

<i><p style="font-size:16px; background-color: #66cdde; border: 2px dotted black; margin: 20px; padding: 20px;">From the above two graphs we can see that **Age (177)**, **Cabin (687)** and **Embarked (2: marginal)** have missing values. We know that there are only a finite number of **Cabins** on board, so the rest of the passengers are not travelling in a **Cabin**; if there are missing values among those who were, they are negligible, as there are only a handful of cabins across the ship. Two passengers' **Embarked** status is not known. So the only field that is unknown and could be something to think about (or investigate or impute) would be **Age**. For example, there is a tendency to impute missing numeric fields with mean or median values, and this can be tricky at times. In addition, changing any data would affect how faithfully the dataset reflects reality and could also add or remove existing bias. For the moment it is best to note these details and take action when necessary. In any case, if we used this dataset to train tree-based models we would come out fine, as they handle *missing/null/bad data* pretty well. Also, generating features (summarised classes) out of all these fields would help to get a good picture of the class balance across these particular fields. During the process we could consider the missing values to be of class "Unknown".</p></i>
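
The counts behind the two graphs, and the idea of treating missing values as their own "Unknown" class, can be sketched on a tiny made-up frame. The sample values below are invented for illustration and are not taken from the Titanic data; only the column names match.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample, NOT the Titanic data
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Number of missing values per column (what the bar chart visualises)
missing_counts = sample.isnull().sum()
print(missing_counts)

# Keep missing categorical values as a class of their own instead of imputing
sample["Embarked"] = sample["Embarked"].fillna("Unknown")
sample["Cabin"] = sample["Cabin"].fillna("Unknown")
print(sample["Embarked"].value_counts())
```

Age is left untouched here on purpose: whether to impute a numeric field is the judgment call discussed above.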

### Feature engineering

In [None]:
def expand_embark_acronym(embarked):
    result = []
    mapping = {
            "C": "Cherbourg",
            "S": "Southampton",
            "Q": "Queenstown"
    }    
    for each in embarked.values:
        if len(str(each)) > 1:
            result.append(each)
        else:        
            if each in mapping:
                result.append(mapping[each])
            else:
                result.append("Unknown")
    return result

def expand_pclass_acronym(pclass):
    result = []
    mapping = {
            1: "1st class",
            2: "2nd class",
            3: "3rd class"
    }    
    for each in pclass.values:
        if len(str(each)) > 1:
            result.append(each)
        else:
            if each in mapping:
                result.append(mapping[each])
            else:
                result.append("Unknown")
    return result

def is_a_minor(age):
    if math.isnan(age):
        return "Unknown"
    
    if age < 18:
        return "Under 18 (minor)"
    
    return "Adult"

# See https://help.healthycities.org/hc/en-us/articles/219556208-How-are-the-different-age-groups-defined-
def apply_age_groups(age):
    result = []
    # Keys are exclusive upper bounds, checked in ascending order
    mapping = {
            1: "Infant",      # Infants: <1
           13: "Child",       # Children: 1-12
           18: "Teen",        # Teens: 13-17 (Teens, who are not Adults)
           66: "Adult",       # Adults: 18-65 (includes adult teens: 18+)
           123: "Elderly"     # Elderly: 66+ (122 is the oldest verified age to date)
    }    
    for each_age in age.values:
        if type(each_age) == str:
            # already categorised: keep the existing label
            result.append(each_age)
        else:
            category = "Unknown"
            # `each_age != np.nan` is always True, so test for NaN explicitly
            if not pd.isna(each_age):
                for each_age_range in mapping:
                    if  each_age < each_age_range:
                        category = mapping[each_age_range]
                        break
            result.append(category)
    return result

def apply_age_ranges(age):
    result = []
    mapping = {
            6: "00-05 years",
           12: "06-11 years",     
           19: "12-18 years",
           31: "19-30 years",
           41: "31-40 years",
           51: "41-50 years",
           61: "51-60 years",
           71: "61-70 years",
           81: "71-80 years",
           91: "81-90 years",
           124: "91+ years",  # (122 is the oldest verified age to date)
    }

    for each_age in age.values:
        if type(each_age) == str:
            # already categorised: keep the existing label
            result.append(each_age)
        else:
            category = "Unknown"
            # `each_age != np.nan` is always True, so test for NaN explicitly
            if not pd.isna(each_age):
                for each_age_range in mapping:
                    if  each_age < each_age_range:
                        category = mapping[each_age_range]
                        break
            result.append(category)
    return result

def is_married_of_single(names, ages, sexes):
    result = []
    for name, age, sex in zip(names.values, ages.values, sexes.values):
        if age < 18:
            result.append("Not of legal age")
        else:
            if ('Mrs.' in name) or ('Mme.' in name):
                result.append("Married")
            elif ('Miss.' in name) or ('Ms.' in name) or ('Lady' in name) or ('Mlle.' in name):
                result.append("Single")
            else:
                result.append("Unknown")
    
    return result

def apply_travel_companions(siblings_spouse, parent_children):
    result = []
    for siblings_spouse_count, parent_children_count in zip(siblings_spouse.values, parent_children.values):
        if (siblings_spouse_count > 0) and (parent_children_count > 0):
            result.append("Parent/Children & Sibling/Spouse")
        else:
            if (siblings_spouse_count > 0):
                result.append("Sibling/Spouse")
            elif (parent_children_count > 0):
                result.append("Parent/Children")
            else:
                result.append("Alone")
    
    return result

def apply_fare_ranges(fare):
    result = []
    mapping = {
           11: "£000 - 010",
           21: "£011 - 020",     
           41: "£021 - 040",
           81: "£041 - 080",
          101: "£081 - 100",
          201: "£101 - 200",
          301: "£201 - 300",
          401: "£301 - 400",
          515: "£401 & above"  # in this case the max fare is around £512
    }    
    for each_fare in fare.values:
        if type(each_fare) == str:
            # already categorised: keep the existing label
            result.append(each_fare)
        else:
            category = "Unknown"
            # `each_fare != np.nan` is always True, so test for NaN explicitly
            if not pd.isna(each_fare):
                for each_fare_range in mapping:
                    if  each_fare < each_fare_range:
                        category = mapping[each_fare_range]
                        break
            result.append(category)

    return result

def were_in_a_cabin_or_not(row):
    if type(row) is str:
        return "In a Cabin"
    return "Not in a Cabin"
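
The range-based helpers above walk a mapping of upper bounds by hand. As an alternative sketch (not the kernel's method), `pd.cut` can do the same binning in one call. The sample ages are made up; the bin edges and labels are copied from the mapping in `apply_age_ranges`.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration, including one missing value
ages = pd.Series([0.42, 5.0, 6.0, 18.0, 30.5, 80.0, np.nan])

# Same boundaries as the keys of the mapping in apply_age_ranges
bin_edges = [0, 6, 12, 19, 31, 41, 51, 61, 71, 81, 91, 124]
labels = ["00-05 years", "06-11 years", "12-18 years", "19-30 years",
          "31-40 years", "41-50 years", "51-60 years", "61-70 years",
          "71-80 years", "81-90 years", "91+ years"]

# right=False makes each bin [lower, upper), matching the "age < upper" test
age_ranges = pd.cut(ages, bins=bin_edges, labels=labels, right=False)

# Missing ages fall outside every bin; give them an explicit "Unknown" class
age_ranges = age_ranges.cat.add_categories("Unknown").fillna("Unknown")
print(age_ranges.tolist())
```

One difference from the hand-written loop: values outside the outer edges (for example a negative age) become "Unknown" here, whereas the loop would place a negative age in the first range.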

In [None]:
## Loading the table again to regenerate the feature engineered columns from scratch
training_dataset = pd.read_csv('/kaggle/input/titanic/train.csv')

## Survived (or Died)
training_dataset['Died'] = abs(1 - training_dataset['Survived'])

## Embarked: Place of embarkation
training_dataset['Embarked'] = expand_embark_acronym(training_dataset['Embarked'])

# Pclass: Passenger Class
training_dataset['Pclass'] = expand_pclass_acronym(training_dataset['Pclass'])

# Age
training_dataset['Adult_or_minor'] = training_dataset['Age'].apply(is_a_minor)

females_filter = training_dataset['Sex'] == 'female'
adult_filter = training_dataset['Adult_or_minor'] == 'Adult'  # must match the label returned by is_a_minor

training_dataset['Marital_status'] = is_married_of_single(training_dataset['Name'], training_dataset['Age'], training_dataset['Sex']) 
training_dataset['Age_group'] = apply_age_groups(training_dataset['Age'])
training_dataset['Age_ranges'] = apply_age_ranges(training_dataset['Age'])

# SibSp and Parch: Sibling/Spouse counts, Parent/Children counts
training_dataset['Travel_companion'] = apply_travel_companions(training_dataset['SibSp'], training_dataset['Parch'])

# Fare: ticket fare across the different classes
training_dataset['Fare_range'] = apply_fare_ranges(training_dataset['Fare'])

# Cabin: ticket holder has a cabin or not
training_dataset['In_Cabin'] = training_dataset['Cabin'].apply(were_in_a_cabin_or_not)
training_dataset['Cabin'] = training_dataset['Cabin'].fillna('No cabin')

In [None]:
training_dataset

In [None]:
training_dataset[sensible_columns].describe()

In [None]:
### A handy use of the percentiles param of the describe function: deciles (0% to 90%) instead of the default quartiles
training_dataset[sensible_columns].describe(percentiles=np.arange(10)/10.0)
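
To see what the `percentiles` argument changes, here is a small sketch on a made-up series (the numbers 1 to 100, not Titanic data). `np.arange(10) / 10.0` produces 0.0, 0.1, ..., 0.9, so the summary reports deciles rather than the default 25%/50%/75% rows.

```python
import numpy as np
import pandas as pd

percentiles = np.arange(10) / 10.0   # 0.0, 0.1, ..., 0.9
print(percentiles)

# Made-up data: the numbers 1..100
values = pd.Series(range(1, 101), dtype=float)

summary = values.describe(percentiles=percentiles)
print(summary)
```

Reading down the decile rows makes it easier to spot where a skewed column such as Fare starts to stretch out.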

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Summing up the number of parents/children and siblings/spouses across passengers

In [None]:
def print_count_of_passengers(dataset):
    total_ticket_holders = dataset.shape[0]
    siblings_count = dataset['SibSp'].sum()
    parents_children_count = dataset['Parch'].sum()

    print("siblings_count:", siblings_count)
    print("parents_children_count:", parents_children_count)
    print("total_ticket_holders:", total_ticket_holders)
    print("total (siblings, parents and children count):", siblings_count + parents_children_count)

    grand_total = total_ticket_holders + siblings_count + parents_children_count
    print("grand total (ticket holders, siblings, parents, children count):", grand_total)
    
    return grand_total

training_dataset_passengers_count = print_count_of_passengers(training_dataset)
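
One caveat when reading the grand total: SibSp and Parch count relations from each passenger's own point of view, so the same people are counted more than once. The sketch below uses a hypothetical family of three (invented for illustration, not taken from the dataset) to show the effect.

```python
# Hypothetical family of three: two parents and one child.
# Each tuple is (SibSp, Parch) as that passenger's own row would record it.
family = [
    (1, 1),  # father: 1 spouse, 1 child
    (1, 1),  # mother: 1 spouse, 1 child
    (0, 2),  # child: no siblings, 2 parents
]

rows = len(family)
sibsp_total = sum(sibsp for sibsp, _ in family)
parch_total = sum(parch for _, parch in family)
grand_total = rows + sibsp_total + parch_total

print("actual people:", rows)
print("SibSp sum:", sibsp_total, "Parch sum:", parch_total)
print("grand total computed the same way as above:", grand_total)
```

Three people produce a grand total of nine, so the figure printed by `print_count_of_passengers` is best read as an upper bound rather than a head count.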

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='analysis-general'></a>

----------

# Analysis: General (correlated to survival)

## Survived or died

In [None]:
g = sns.countplot(x=training_dataset['Survived'])
# no legend call here: a count plot without hue has no labelled artists to list
g.set(xlabel="Survival", xticklabels=["Died", "Survived"]) # "0=Died", "1=Survived"

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Gender

In [None]:
training_dataset.pivot_table(values=['Survived', 'Died'], index=['Sex'], aggfunc='mean')

In [None]:
gender_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index=['Sex'], aggfunc='sum')
gender_pivot_table

In [None]:
gender_pivot_table.plot(kind='barh')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We can conclude more females survived than males, as it's likely many of the men were engaged in trying to save the lives of others (especially the old, women and children) and risked their own lives in the process.
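
Why two pivot tables? Because `Survived` and `Died` are 0/1 columns, their mean per group is a rate and their sum is a count. A minimal sketch on a made-up sample (not the Titanic data), using the string names `'mean'` and `'sum'` for the aggregations:

```python
import pandas as pd

# Made-up sample, NOT the Titanic data
sample = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})
sample["Died"] = 1 - sample["Survived"]

# Mean of a 0/1 column = proportion of ones in each group
rates = sample.pivot_table(values=["Survived", "Died"], index="Sex", aggfunc="mean")
# Sum of a 0/1 column = number of ones in each group
counts = sample.pivot_table(values=["Survived", "Died"], index="Sex", aggfunc="sum")

print(rates)
print(counts)
```

Rates are the fairer comparison when the groups differ in size, which is the case for male and female passengers.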

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Passenger Class (PClass)

In [None]:
training_dataset.pivot_table(values=['Survived', 'Died'], index=['Pclass'], aggfunc='mean')

In [None]:
passenger_class_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index=['Pclass'], aggfunc='sum')
passenger_class_pivot_table

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We could conclude that **first class** passengers were given preference over the other classes, given the era in which the event took place.</p></i>

In [None]:
passenger_class_pivot_table.plot(kind='barh')
plt.ylabel('Passenger Class')

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Combining Gender and Passenger Class (PClass)

In [None]:
training_dataset.pivot_table(values=['Survived', 'Died'], index=['Sex', 'Pclass'], aggfunc='mean')

In [None]:
gender_passenger_class_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index=['Sex', 'Pclass'], aggfunc='sum')
gender_passenger_class_pivot_table

In [None]:
g = sns.catplot(x="Survived", hue="Pclass", col='Sex', data=training_dataset.sort_values(by='Pclass'), kind='count')
g.set(xticklabels=['Died', 'Survived'], xlabel="Survival")

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Combining the two, we could conclude further that first class passengers of both genders were given preference over the other classes, and that men were lost saving others (mainly women and children) in the process, given the era in which the event took place.</p></i>

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Fare

In [None]:
g = sns.catplot(x="Survived", y="Fare", data=training_dataset, kind="bar");
g.set(xticklabels=['Died', 'Survived'], xlabel="Survival", title="Mean fare and Survival")  # bar height is the mean fare, not the sum

In [None]:
g = sns.catplot(x="Survived", y="Fare", hue="Pclass", data=training_dataset.sort_values(by='Pclass'), kind="bar");
g.set(xticklabels=['Died', 'Survived'], xlabel="Survival", title="Mean fare across the three Passenger Classes and Survival")  # bar height is the mean fare, not the sum

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Fare does not conclusively say if ones who paid more were more likely to survive, although there are outliers that need to also be considered. Maybe looking at a slightly higher-level class could help.
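
A note on reading the two bar charts above: seaborn's bar plots aggregate with the mean by default, so each bar shows the average fare of a group rather than the total collected. The sketch below, on made-up fares (not Titanic data), shows how far the two can differ.

```python
import pandas as pd

# Made-up fares, NOT taken from the Titanic data
sample = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1],
    "Fare": [10.0, 20.0, 30.0, 40.0, 80.0],
})

fare_by_survival = sample.groupby("Survived")["Fare"].agg(["mean", "sum", "count"])
print(fare_by_survival)
```

Looking at the mean, sum and count side by side helps separate "this group paid more per ticket" from "this group simply has more passengers".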

In [None]:
g = sns.catplot(y="Fare_range", hue="Survived", data=training_dataset.sort_values(by='Fare'), kind="count")
g.set(ylabel="Fare range", title="Fare ranges and Survival")
new_labels = ['Died', 'Survived']
for t, l in zip(g._legend.texts, new_labels): 
    t.set_text(l)

g.fig.set_figwidth(30)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">As we look closer it appears that there are more survivors around the higher fare price range as opposed to the lower ones. But this we already know from the other features.

In [None]:
def passenger_class_filtered_dataset(passenger_class):
    dataset = training_dataset.copy()
    class_filter = dataset['Pclass'] == passenger_class
    return dataset[class_filter]

def draw_passenger_class_chart(passenger_class, title):
    dataset = passenger_class_filtered_dataset(passenger_class)
    g = sns.catplot(y="Fare_range", hue="Survived", data=dataset.sort_values(by='Pclass'), kind="count")
    g.set(ylabel="Fare range", title=title)
    new_labels = ['Died', 'Survived']
    for t, l in zip(g._legend.texts, new_labels): 
        t.set_text(l)

    g.fig.set_figwidth(30)

In [None]:
draw_passenger_class_chart('1st class', "First class passengers and Fare ranges")

In [None]:
draw_passenger_class_chart('2nd class', "Second class passengers and Fare ranges")

In [None]:
draw_passenger_class_chart('3rd class', "Third class passengers and Fare ranges")

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Again, if we take a closer look, we can see a spread in the number of survivors across the different Passenger classes and, within them, across the different Fare ranges. We see that low to high Fare ranges exist across all three Passenger classes. Hence it is not easy to conclude from the Fare range information alone whether survival depended on who paid more for a ticket, but together with the Passenger class information we can say that those in the lower classes suffered more than those in the higher classes, sometimes irrespective of how much they paid for their tickets.</p></i>

### Ticket holders with a zero fare

In [None]:
zero_fare_filter = training_dataset['Fare'] == 0.0
training_dataset[zero_fare_filter].pivot_table(index=['Pclass', 'Cabin', 'Ticket'])

In [None]:
zero_filter_columns = ['Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Survived', 'Died', 'Pclass', 'Cabin', 'Ticket', 'Marital_status', 'Age_group', 'Age_ranges']
training_dataset[zero_fare_filter][zero_filter_columns]

In [None]:
training_dataset[zero_fare_filter][zero_filter_columns].describe()

In [None]:
training_dataset[zero_fare_filter].pivot_table(values=['Survived', 'Died'], index=['Pclass', 'Cabin'], aggfunc='sum')

In [None]:
zero_fare_pivot_table = training_dataset[zero_fare_filter].pivot_table(values=['Survived', 'Died'], index=['Pclass'], aggfunc='sum')
zero_fare_pivot_table

In [None]:
zero_fare_pivot_table.plot(kind='barh')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Taking a closer look at the Fare feature, there is evidence that a handful of ticket holders spanning all three passenger classes (15 passengers) did not have a Fare assigned to their tickets (**zero-priced tickets**). Either this is missing data, or maybe they won tickets in a lottery that might have been held, or got free tickets from the Titanic company. Arriving at a definitive conclusion would need this to be looked into. Taking this new information into account, we can't put much weight on the correlation between Fare and Survival; let's discuss this further when we touch on Fare again.</p></i>
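
If the zero fares turn out to be missing data rather than genuine free tickets (an open question, as noted above), one option is to mask them before computing statistics. A minimal sketch on made-up fares; only the `zero_fare_filter` name is borrowed from the cells above.

```python
import pandas as pd

# Made-up fares, NOT taken from the Titanic data
fares = pd.Series([0.0, 7.25, 0.0, 71.28])

zero_fare_filter = fares == 0.0

# Series.mask replaces the values where the condition is True with NaN
fares_with_missing = fares.mask(zero_fare_filter)

print("zero fares:", int(zero_fare_filter.sum()))
print("mean including zeros:", fares.mean())
print("mean treating zeros as missing:", fares_with_missing.mean())
```

The mean rises once the zeros are excluded, which shows how much this decision can shift fare-based statistics.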

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Passenger Class (PClass) (to compare with Fare above)

In [None]:
passenger_class_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index=['Pclass'], aggfunc='sum')
passenger_class_pivot_table

In [None]:
passenger_class_pivot_table.plot(kind='barh')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The Passenger class does give clues that those who travelled first class stood a higher chance of survival, as there were a lot more deaths among those who travelled in the two lower classes.</p></i>

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Combining Fare and Pclass

In [None]:
g = sns.catplot(x="Pclass", y="Fare", hue="Survived", data=training_dataset.sort_values(by='Pclass'))
g.set(xlabel="Passenger Class")

new_labels = ['Died', 'Survived']
for t, l in zip(g._legend.texts, new_labels): 
    t.set_text(l)

g.fig.set_figwidth(16)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Passenger class and fare (which are collinear) combined give more clues that those who travelled first class stood a higher chance of survival than those who travelled the two lower-classes.

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Age

In [None]:
g = sns.catplot(y="Age", x="Sex", hue="Survived", data=training_dataset)
new_labels = ['Died', 'Survived']
for t, l in zip(g._legend.texts, new_labels): 
    t.set_text(l)
g.fig.set_figwidth(16)

In [None]:
g = sns.catplot(x="Age_group", hue="Survived", data=training_dataset.sort_values(by='Age'), kind='count')
new_labels = ['Died', 'Survived']
for t, l in zip(g._legend.texts, new_labels): 
    t.set_text(l)
g.fig.set_figwidth(16)
g.set(xlabel="Age groups")

<i><p style="font-size:16px; background-color: #66cdde; border: 2px dotted black; margin: 20px; padding: 20px;">The above graph is provided for statistical purposes only; it doesn't tell us anything that we don't already know.</p></i>

In [None]:
g = sns.catplot(col="Sex", x="Survived", hue="Age_group", data=training_dataset.sort_values(by='Age'), kind='count')
g.set(xlabel="Survival", xticklabels=['Died', 'Survived'])
g.fig.set_figwidth(16)

<i><p style="font-size:16px; background-color: #66cdde; border: 2px dotted black; margin: 20px; padding: 20px;">The above graph is provided for statistical purposes only; it doesn't tell us anything that we don't already know.</p></i>

In [None]:
g = sns.catplot(x="Survived", hue="Age_ranges", data=training_dataset.sort_values(by='Age'), kind='count')
g.set(xlabel="Survival", xticklabels=['Died', 'Survived'])
g.fig.set_figwidth(16)

<i><p style="font-size:16px; background-color: #66cdde; border: 2px dotted black; margin: 20px; padding: 20px;">The above graph is provided for statistical purposes only; it doesn't tell us anything that we don't already know.</p></i>

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Age does not give as much of a clue and also has outliers, both of which indicate that it may need to be investigated further. Among single or married women there is no clear indicator of whether their age led to their survival or not.</p></i>

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">But among men it's clear -- most adult males (18 and above) didn't make it (did not survive, as they were helping and fighting for the survival of the others: women, children and elderly).</p></i>

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='analysis-circumstantial'></a>

----------

# Analysis: Circumstantial/chance (only stats)

## SibSp (Siblings and spouse count: brother, sister, husband, wife, etc...)

In [None]:
sibling_spouse_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index='SibSp', aggfunc='sum')
sibling_spouse_pivot_table

In [None]:
sibling_spouse_pivot_table.plot(kind='barh')
plt.ylabel('Sibling/spouse count')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">It appears that those who travelled with spouses and/or siblings had a better chance of survival than those travelling by themselves. In slight contradiction, though, those who travelled with more than one sibling/spouse had lower chances of survival as the number of dependents to take care of grew.</p></i>

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Parch (Parents and Children count)

In [None]:
parent_children_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index='Parch', aggfunc='sum')
parent_children_pivot_table

In [None]:
parent_children_pivot_table.plot(kind='barh')
plt.ylabel('Parent/children count')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">It appears that those who travelled with parents, grandparents and/or children had a better chance of survival than those travelling by themselves. In slight contradiction, though, those who travelled with 1 or more parents/children had a near 50/50 chance of survival, even when the number of dependents to take care of grew past 1.</p></i>

## SibSp and Parch combined

In [None]:
travel_companion_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index='Travel_companion', aggfunc=np.sum)
travel_companion_pivot_table

In [None]:
travel_companion_pivot_table.plot(kind='barh')
plt.ylabel('Travel companion')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">It appears that those who travelled alone suffered a bigger toll than those who travelled with a sibling/spouse or parent/children. Looking closely, those who travelled with their whole family (i.e. sibling/spouse and parent/children) didn't do that well either. The number of deaths is much higher for those travelling alone but, at the same time, their number of survivors is also higher than in each of the other categories (and nearly the same as all of them combined).</p></i>
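
The `Travel_companion` column used above is built elsewhere in this notebook. As a rough illustration only, one way such a column could be derived from `SibSp` and `Parch` is sketched below; the toy rows and the category labels are assumptions and may differ from the labels actually used here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the category labels below are illustrative only.
toy = pd.DataFrame({"SibSp": [0, 1, 0, 2], "Parch": [0, 0, 2, 1]})

has_sibsp = toy["SibSp"] > 0
has_parch = toy["Parch"] > 0

# np.select picks the label of the first condition that holds for each row.
toy["Travel_companion"] = np.select(
    [has_sibsp & has_parch, has_sibsp, has_parch],
    ["Sibling/spouse and parent/children", "Sibling/spouse only", "Parent/children only"],
    default="Alone",
)
print(toy)
```

The combined condition is listed first so that whole-family travellers are not swallowed by the single-relation categories.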

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Minors/children (age < 18)

<i><p style="font-size:16px; background-color: #FFFFD7; solid black; margin: 20px; padding: 20px;">We are assuming that, in those times, 18 was the age at which one was legally recognised as an adult.</p></i>
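
To make the 18-year assumption concrete, here is a minimal sketch of how an `Adult_or_minor` style column could be derived from `Age`. The toy ages and the `ADULT_AGE` name are assumptions for illustration; how this notebook treats passengers with a missing age may differ (here they simply stay missing).

```python
import numpy as np
import pandas as pd

ADULT_AGE = 18  # assumed threshold, as stated above

# Hypothetical toy ages; NaN mimics the missing Age values in the real data.
toy = pd.DataFrame({"Age": [4.0, 17.0, 18.0, 40.0, np.nan]})

# right=False makes the bins [0, 18) and [18, inf), so an 18-year-old counts as an adult.
toy["Adult_or_minor"] = pd.cut(
    toy["Age"], bins=[0, ADULT_AGE, np.inf], right=False, labels=["Minor", "Adult"]
)
print(toy)
```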

In [None]:
training_dataset.pivot_table(values=['Survived', 'Died'], index='Adult_or_minor', aggfunc=np.sum)

In [None]:
adult_or_minor_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index=['Adult_or_minor', 'Sex'], aggfunc=np.sum)
adult_or_minor_pivot_table

In [None]:
g = sns.catplot(x="Adult_or_minor", col='Sex', hue="Survived", kind="count", data=training_dataset.sort_values(by='Age'));
new_labels = ['Died', 'Survived']
for t, l in zip(g._legend.texts, new_labels): 
    t.set_text(l)
g.fig.set_figwidth(16)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We know that most adults were trying to help women, children and the elderly get off the ship to save their lives, and in return may have forfeited their own. Among the adults were elderly passengers who may not have survived for reasons unknown to us. Fewer than half of the minors did not survive, again for reasons unknown to us. Maybe they were separated from their parents or elders, could not swim, or were affected by the harsh weather and could not stay warm; the same could apply to the elderly as well.</p></i>

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Single, or Married female passengers (age >= 18)

<i><p style="font-size:16px; background-color: #FFFFD7; solid black; margin: 20px; padding: 20px;">Adult single or married female passengers were easier to filter using their names and age, but it is not possible to tell from the names of the adult males whether they were married or not (male children under 18 are the exception). Again, we are assuming that the legal age for marriage was 18 in those days.</p></i>
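
The filtering by name and age described above can be sketched as follows. The `Title` column and the `title_to_status` mapping (with its "Married" and "Single" labels) are hypothetical and only illustrate the idea; the names follow the dataset's "Surname, Title. Given names" pattern.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
    ],
    "Age": [22.0, 38.0, 26.0, 2.0],
})

# Capture the text between the comma and the first full stop, e.g. "Mrs".
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical mapping: only the female titles say anything about marital status.
title_to_status = {"Mrs": "Married", "Miss": "Single"}
toy["Marital_status"] = toy["Title"].map(title_to_status).fillna("Unknown")

# Anyone under the assumed legal age of 18 is labelled separately.
toy.loc[toy["Age"] < 18, "Marital_status"] = "Not of legal age"
print(toy[["Title", "Age", "Marital_status"]])
```

"Mr" does not distinguish married from single men, which is why adult males end up as "Unknown".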

### Marital status could not be determined from the name alone

In [None]:
unknown_marital_status_filter = training_dataset['Marital_status'] == 'Unknown'
training_dataset[females_filter & unknown_marital_status_filter]

### Marital status does not apply for those who are not of legal age

In [None]:
not_legal_marital_status_filter = training_dataset['Marital_status'] == "Not of legal age"
training_dataset[not_legal_marital_status_filter]

In [None]:
female_marital_status_pivot_table = training_dataset[females_filter].pivot_table(
    values=['Survived', 'Died'], index='Marital_status', aggfunc=np.sum
)
female_marital_status_pivot_table

In [None]:
female_marital_status_pivot_table.plot(kind='barh')
plt.ylabel('Marital status')
plt.title('Adult females (Single or married)')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">From the above, we can hardly conclude whether the survival rate is related to marital status. This is especially so because the marital status of the males is not known (it has not been provided in any way).</p></i>

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Marital status
<i><p style="font-size:16px; background-color: #FFFFD7; solid black; margin: 20px; padding: 20px;">The legal age for marriage is assumed to be 18 (we are not sure this was the case in those days). Marital status has been applied where known; in other cases it has been set to unknown.</p></i>

In [None]:
marital_status_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index='Marital_status', aggfunc=np.sum)
marital_status_pivot_table

In [None]:
marital_status_pivot_table.plot(kind='barh')
plt.ylabel('Marital status')
plt.title('All ages')

<i><p style="font-size:16px; background-color: #66cdde; border: 2px dotted black; margin: 20px; padding: 20px;">The above graph is provided for statistical purposes only; it doesn't tell us anything we don't already know.</p></i>

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Marital status may not have been a strong factor here (or a factor at all), unless passengers travelled with their partner/spouse, which is covered by other features (factors) in the dataset. For a large number of records there is no way to find out the marital status, because the name carries no indicator, especially for adult males.</p></i>

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Embarked

In [None]:
embarked_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index='Embarked', aggfunc=np.sum)
embarked_pivot_table

In [None]:
embarked_pivot_table.plot(kind='barh')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">It appears that those who embarked at Southampton suffered the biggest toll, followed by Cherbourg and then Queenstown. This might need further investigation. Does it have to do with the class they travelled in, or whether they were in a cabin or not?</p></i>

In [None]:
embarked_passenger_class_pivot_table = training_dataset.pivot_table(
    values=['Survived', 'Died'], index=['Embarked', 'Pclass'], aggfunc=np.sum
)
embarked_passenger_class_pivot_table

In [None]:
g = sns.catplot(col="Embarked", x='Pclass', hue="Survived", kind="count", data=training_dataset.sort_values(by='Pclass'));
new_labels = ['Died', 'Survived']
for t, l in zip(g._legend.texts, new_labels): 
    t.set_text(l)
g.fig.set_figwidth(16)
g.set(xlabel="Passenger Class")

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Once passenger class is taken into consideration along with port of embarkation, it's clear: it's the lower-class travellers who were affected. This could be due to the position or location of their cabins/tiers, or it could be that they were the ones left to help the other, more privileged travellers off the ship first and in the process lost their own chance of survival. None of this will ever be known to us for certain.</p></i>
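
Because the ports differ a lot in passenger numbers, comparing shares rather than counts may make the class effect easier to see. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up rows (not real passenger records), and assumes `Survived` is coded as 0/1.

```python
import pandas as pd

# Hypothetical toy rows (not real passenger records).
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Pclass":   [3, 3, 3, 1, 1, 1, 3, 3],
    "Survived": [0, 0, 1, 1, 1, 0, 0, 1],
})

# normalize="index" turns each (Embarked, Pclass) row of counts into shares that sum to 1.
rates = pd.crosstab([toy["Embarked"], toy["Pclass"]], toy["Survived"], normalize="index")
rates.columns = ["Died", "Survived"]  # columns were 0 and 1, in that order
print(rates)
```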

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Cabin

In [None]:
in_cabin_pivot_table = training_dataset.pivot_table(values=['Survived', 'Died'], index='In_Cabin', aggfunc=np.sum)
in_cabin_pivot_table

In [None]:
in_cabin_pivot_table.plot(kind='barh')
plt.ylabel('Cabin')

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">It appears that those not travelling in a cabin took most of the death toll. The above shows that being in a cabin may have helped.</p></i>
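
The `In_Cabin` column is created earlier in this notebook. A minimal sketch of one possible derivation is shown below, assuming that a non-missing `Cabin` value means the passenger had a cabin; the "Yes"/"No" labels and the toy rows are hypothetical. Note that a missing `Cabin` value may simply mean the cabin was not recorded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; NaN mimics passengers without a recorded cabin.
toy = pd.DataFrame({
    "Cabin":    ["C85", np.nan, "E46", np.nan, np.nan, "B28"],
    "Survived": [1, 0, 1, 0, 1, 1],
})

# Assumed encoding: "Yes" when a cabin is recorded, "No" otherwise.
toy["In_Cabin"] = np.where(toy["Cabin"].notna(), "Yes", "No")

# Share of survivors within each group.
cabin_rates = toy.groupby("In_Cabin")["Survived"].mean()
print(cabin_rates)
```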

In [None]:
training_dataset.pivot_table(values=['Survived', 'Died'], index=['In_Cabin', 'Pclass'], aggfunc=np.sum)

In [None]:
g = sns.catplot(col="In_Cabin", x='Pclass', hue="Survived", kind="count", data=training_dataset.sort_values(by='Pclass'));
new_labels = ['Died', 'Survived']
for t, l in zip(g._legend.texts, new_labels): 
    t.set_text(l)
g.fig.set_figwidth(16)
g.set(xlabel="Passenger Class")

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">So it appears that being in a cabin helped quite a bit, and travelling in an upper class helped further. There are more survivors (or fewer deaths) when these two factors are combined.</p></i>

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

## Ticket
<i><p style="font-size:16px; background-color: #FFFFD7; solid black; margin: 20px; padding: 20px;">There is some interesting information about tickets, and when the passenger class (Pclass), Cabin and Fare information is combined we get some nice insights. This is purely statistical, though, and there is nothing conclusive about survival in this information (we have already covered other factors that are linked to the ticket). So Ticket as a feature is affected by multicollinearity.</p></i>

In [None]:
training_dataset['Ticket'].describe()

In [None]:
training_dataset['Ticket'].value_counts()

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">The above shows that in many cases a single ticket covers multiple persons, shared across a family or group of passengers, while many other tickets are held by a single passenger.</p></i>
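
To pick out the shared tickets, the group size can be attached to every row and then filtered on. This is only a sketch on made-up ticket numbers; the `Ticket_group_size` column name is an assumption and does not exist in the dataset.

```python
import pandas as pd

# Made-up ticket numbers for illustration.
toy = pd.DataFrame({"Ticket": ["A1", "A1", "B2", "C3", "C3", "C3"]})

# Number of passengers travelling on each ticket, repeated on every row of that ticket.
toy["Ticket_group_size"] = toy.groupby("Ticket")["Ticket"].transform("count")

# Rows whose ticket is held by more than one passenger.
shared = toy[toy["Ticket_group_size"] > 1]
print(shared)
```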

### First class

In [None]:
sorted_training_dataset = training_dataset.sort_values(by=['Pclass', 'Ticket', 'Cabin', 'Fare'], ascending=True)
first_class_filter = sorted_training_dataset['Pclass'] == '1st class'
second_class_filter = sorted_training_dataset['Pclass'] == '2nd class'
third_class_filter = sorted_training_dataset['Pclass'] == '3rd class'

In [None]:
first_class_sorted = sorted_training_dataset[first_class_filter]
first_class_pivot_table = first_class_sorted.pivot_table(values=['Fare'], index=['Cabin', 'Ticket'], aggfunc=np.mean)
print("Tickets count:", first_class_sorted.shape[0])
first_class_pivot_table

In [None]:
first_class_sorted['Fare'].describe(percentiles=np.arange(10)/10.0)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Fares in <b>first class</b> range from £26.00 all the way to £1052.00. A number of tickets were sold at a much higher price (5x to 9x) than the average price of approximately £108. Could they have been sold at a higher price closer to the sail date?</p></i>

In [None]:
plt.figure(figsize=(20,4))
first_class_sorted['Fare'].hist(bins=40)

### Second class

In [None]:
second_class_sorted = sorted_training_dataset[second_class_filter]
second_class_pivot_table = second_class_sorted.pivot_table(values=['Fare'], index=['Cabin', 'Ticket'], aggfunc=np.mean)
print("Tickets count:", second_class_sorted.shape[0])
second_class_pivot_table

In [None]:
second_class_sorted['Fare'].describe(percentiles=np.arange(10)/10.0)

In [None]:
plt.figure(figsize=(20,4))
second_class_sorted['Fare'].hist(bins=40)

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Fares in <b>second class</b> range from £10.50 all the way to £73.50. A number of tickets were sold at a much higher price (nearly 7x the normal price). Could they have been sold at a higher price closer to the sail date?</p></i>

### Third class

In [None]:
third_class_sorted = sorted_training_dataset[third_class_filter]
third_class_pivot_table = third_class_sorted.pivot_table(values=['Fare'], index=['Cabin', 'Ticket'], aggfunc=np.mean)
print("Tickets count:", third_class_sorted.shape[0])
third_class_pivot_table

In [None]:
third_class_sorted['Fare'].describe(percentiles=np.arange(10)/10.0)

In [None]:
plt.figure(figsize=(20,4))
third_class_sorted['Fare'].hist(bins=20)


<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Fares in <b>third class</b> range from £7.65 all the way to £69.55. A number of tickets have a higher price than the regular tickets (almost 8x to 9x the normal price). Could they have been sold at a higher price closer to the sail date?</p></i>

<i><p style="font-size:16px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">As we go lower in the passenger classes (1st to 2nd to 3rd) we see that tickets are shared across multiple passengers, and cabins are assigned to or shared by 2nd and 3rd class travellers holding the same ticket. So tickets are assigned fixed cabins in many instances. Where no Cabin number is assigned, the tickets may have been held by passengers seated in an open area without an enclosure.</p></i>

<hr>
<p style="font-size:16px"><b>So we can conclude these things about Ticket (and related fields: <b>Pclass</b>, <b>Cabin</b>, and <b>Fare</b>):</b></p>
<ul>
  <li><b>Pclass</b>, <b>Cabin</b>, and <b>Ticket</b> are related; in fact <b>Tickets</b> are assigned to <b>Cabins</b> wherever applicable</li>
    <li>From <b>Fare</b> we cannot always guess which <b>Pclass</b> it may be representing (for various reasons)</li>
  <li>Each <b>Pclass</b> has its own Fare range, and sometimes tickets from a lower <b>Pclass</b> can be more expensive than those from a higher Pclass (and vice versa)</li>
  <li>Some tickets were sold at multiple times the average Ticket price for a particular Pclass, and we can more or less spot those tickets and the passengers holding them (this is not highly accurate and there is no way to verify it)</li>
  <li>2nd and 3rd Class passengers might be assigned the same ticket, so multiple passengers may be travelling on the same <b>Ticket</b> (we are not sure how this was accounted for or managed in those times). These may not be family members but just groups assigned to the same ticket; further analysis can be done.</li>
  <li>With the above information, it's possible to know who was seated where, and if we had the map (or layout) of the ship for all three <b>Pclass</b> classes we could estimate some sort of risk factor</li>
  <li>From all the above we can say <b>Fare</b> isn't a reliable indicator of Survival: it's hit or miss, but when combined with other factors it does help improve accuracy. We can also say this because under the <b>Fare</b> section above we have seen various analyses, one of them being a list of passengers who didn't have a fare price on their tickets, and we also have some who hold tickets that were sold at multiple times the Passenger Class average value. These passengers come from all three Passenger classes.</li>
</ul>
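
Spotting tickets sold at a multiple of the usual price, as mentioned in the list above, can be sketched roughly as follows. Note the swap: the text compares against the class average, while this sketch uses the class median as the "normal" price so that a few very high fares do not inflate the baseline. The toy fares, the `Fare_ratio` column and the `THRESHOLD` value are assumptions for illustration.

```python
import pandas as pd

# Made-up fares; Pclass labels follow the '1st class' / '3rd class' style used above.
toy = pd.DataFrame({
    "Pclass": ["1st class"] * 4 + ["3rd class"] * 4,
    "Fare":   [30.0, 30.0, 30.0, 270.0, 8.0, 8.0, 8.0, 56.0],
})

# Median fare of each passenger's own class, repeated on every row.
class_median = toy.groupby("Pclass")["Fare"].transform("median")
toy["Fare_ratio"] = toy["Fare"] / class_median

THRESHOLD = 5  # hypothetical cut-off: at least 5x the class median
unusually_expensive = toy[toy["Fare_ratio"] >= THRESHOLD]
print(unusually_expensive)
```

Whatever baseline is chosen, the flagged rows are only candidates for a closer look, since a shared ticket may carry the fare for a whole group.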


<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='test-dataset'></a>

----------

# Test dataset

In [None]:
test_dataset = pd.read_csv('/kaggle/input/titanic/test.csv')
test_dataset

## Summing up number of parents/children and siblings across passengers 

In [None]:
test_dataset_passengers_count = print_count_of_passengers(test_dataset)


<a id='dataset-summary'></a>

----------

# Dataset summary

In [None]:
print("Training dataset passengers (count):", training_dataset_passengers_count)
print("Test dataset passengers (count):", test_dataset_passengers_count)
total_passengers = training_dataset_passengers_count + test_dataset_passengers_count
print("Total passengers on board:", total_passengers)
print("Total passengers on board (as per description):", 2224)
print("")
print("~~~ Discrepancy between the above two figures:", abs(total_passengers - 2224), "extra people on board or miscalculation ~~~")

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>

<a id='resources'></a>

----------

# Resources

- Tutorials
  - [Alexis Cook’s Titanic Tutorial](https://www.kaggle.com/alexisbcook/titanic-tutorial)
- Discussions:
  - https://www.kaggle.com/getting-started/170570
- Tools & Resources:
  - https://github.com/ResidentMario/missingno
  - https://otexts.com/fpp2/causality.html (confounding, correlated and multicollinear fields)
  - https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping
- Competition/Datasets:
  - https://www.kaggle.com/c/titanic
- Kernels
   - https://www.kaggle.com/headsortails/pytanic
   - https://github.com/ashleysmart/mlgym/blob/master/tuts/Onboarding_Fundementals.ipynb (good usage of Pivot table and describe, and other EDA features)
   - https://www.kaggle.com/lavanyashukla01/how-i-made-top-0-3-on-a-kaggle-competition (Housing price)
   - Visualisations
     - https://datavizcatalogue.com/index.html
     - https://matplotlib.org/3.2.2/gallery/index.html
     
- Slack
  - [ML Tokyo](https://join.slack.com/t/machinelearningtokyo/shared_invite/zt-asg9oyte-K2p_9_RJo1~mMGi7thYm9Q)
  - [ChaiEDA channel](https://machinelearningtokyo.slack.com/archives/C018042A38W)

<a id='credits'></a>

----------

# Credits

Thanks to @AshleySmart, the creator of https://github.com/ashleysmart/mlgym/blob/master/tuts/Onboarding_Fundementals.ipynb. I borrowed a lot of ideas from it and added my own narration to many of the EDA components.

Thanks to Sanyam and all the participants of ChaiEDA (see links above) for inspiring me to come up with this notebook/kernel.

## Prequels/sequels

- **ChaiEDA sessions: Titanic: initial (manual) EDA with explanations**
- [ChaiEDA sessions: Titanic - Using tools](https://www.kaggle.com/neomatrix369/chaieda-sessions-titanic-using-tools/): analysis using ready-to-use analysis and profiling tools

<a href='#ToC'><span class="label label-info" style="font-size: 125%">Back to Table of Contents</span></a>