# Introduction

This simple EDA looks at the makeup of the Synthanic dataset for this month's competition. 

## Credits

* A large portion of the graphical analysis was taken and modified from the excellent EDA of the Titanic competition at [Top 4% Approach with RandomForest](https://www.kaggle.com/aipi12/top-4-approach-with-randomforest) by @aipi12

# 1 - General Overview

## 1.1 - Column Descriptions

Below are some generalized properties that can be observed from the datasets.

* `train` dataset contains 100,000 rows, 12 columns
* `test` dataset contains 100,000 rows, 11 columns
* Target column:
  * `Survived` -> whether the passenger survived -> type `int64` -> values `[0, 1]`
* Categorical columns:
  * `Pclass` - passenger class -> type `int64` -> values `[1, 2, 3]`
  * `Name` - passenger name -> type `str` -> format `lastname, firstname`
  * `Sex` - passenger sex -> type `str` -> values `[male, female]`
  * `SibSp` - # siblings / spouses on ship -> type `int64` -> values `[0, 1, 2, 3, 4, 5, 8]`
  * `Parch` - # parents / children on ship -> type `int64` -> values `[0, 1, 2, 3, 4, 5, 6, 9]`
  * `Ticket` - passenger ticket number -> type `object` -> format `[a-zA-Z\.0-9]*`
      * `train` has 4,623 missing values
      * `test` has 5,181 missing values
  * `Cabin` - passenger cabin -> type `object` -> format `[A-G][0-9]+`
      * `train` has 67,866 missing values
      * `test` has 70,831 missing values
  * `Embarked` - the port the passenger left from -> type `object` -> values `[S, C, Q]`
      * `train` has 250 missing values
      * `test` has 277 missing values
* Continuous columns:
  * `Age` - passenger age -> type `float64`
      * `train` has `0.08` min, `87.0` max (3,292 missing)
      * `test` has `0.08` min, `81.0` max (3,487 missing)
  * `Fare` - passenger fare amount -> type `float64`
      * `train` has `0.68` min, `744.66` max (134 missing)
      * `test` has `0.05` min, `680.70` max (133 missing)

In [None]:
import pandas as pd
import numpy as np

train = pd.read_csv("../input/tabular-playground-series-apr-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-apr-2021/test.csv")

def cat_column_info(column):
    num_categories = train[column].nunique()
    print("------> {} <------".format(column))
    print("--: train - type {}".format(train[column].dtype))
    print("--: test  - type {}".format(test[column].dtype))
    print("--: train - # categories {}".format(train[column].nunique()))
    print("--: test  - # categories {}".format(test[column].nunique()))
    if num_categories < 10:
        if train[column].dtype == "int64":
            print("--: train - values {}".format(np.sort(train[column].unique())))
            print("--: test  - values {}".format(np.sort(test[column].unique())))
        else:
            print("--: train - values {}".format(train[column].unique()))
            print("--: test  - values {}".format(test[column].unique()))
    print("--: train - NaN count {}".format(train[column].isnull().values.sum()))
    print("--: test  - NaN count {}".format(test[column].isnull().values.sum()))
    print("")

def cont_column_info(column):
    print("------> {} <------".format(column))
    print("--: train - type {}".format(train[column].dtype))
    print("--: test  - type {}".format(test[column].dtype))
    print("--: train - min {}".format(train[column].min()))
    print("--: test  - min {}".format(test[column].min()))
    print("--: train - max {}".format(train[column].max()))
    print("--: test  - max {}".format(test[column].max()))    
    print("--: train - NaN count {}".format(train[column].isnull().values.sum()))
    print("--: test  - NaN count {}".format(test[column].isnull().values.sum()))
    print("")
    
print(": Train shape {}".format(train.shape))
print(": Test shape {}".format(test.shape))
print("")
cat_column_info("Pclass")
cat_column_info("Name")
cat_column_info("Sex")
cat_column_info("SibSp")
cat_column_info("Parch")
cat_column_info("Ticket")
cat_column_info("Cabin")
cat_column_info("Embarked")
cont_column_info("Age")
cont_column_info("Fare")
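
The missing-value counts listed in 1.1 are easier to compare as percentages in a single table. Below is a minimal sketch of that idea; the `missing_summary` helper and the toy frames are hypothetical stand-ins for illustration and are not part of the original notebook or the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the competition frames (hypothetical values, not real data).
toy_train = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})
toy_test = pd.DataFrame({
    "Age": [30.0, 41.0, np.nan, 18.0],
    "Cabin": [np.nan, np.nan, np.nan, "B20"],
    "Fare": [np.nan, 9.5, 26.0, 7.9],
})

def missing_summary(train_df, test_df):
    """Missing-value counts and percentages for the columns both frames share."""
    columns = [c for c in train_df.columns if c in test_df.columns]
    summary = pd.DataFrame({
        "train_missing": train_df[columns].isnull().sum(),
        "test_missing": test_df[columns].isnull().sum(),
    })
    summary["train_pct"] = 100 * summary["train_missing"] / len(train_df)
    summary["test_pct"] = 100 * summary["test_missing"] / len(test_df)
    return summary

summary = missing_summary(toy_train, toy_test)
print(summary)
```

On the real data, calling `missing_summary(train, test)` should give the same kind of table for every shared column.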

## 1.2 - Class Balance

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
sns_params = {"palette": "BuPu_r"}

# Work with the Series directly: newer pandas versions name the value_counts() column "count", not "Survived".
counts = train["Survived"].value_counts().rename(index={0: "Died (0)", 1: "Survived (1)"})
ax = sns.barplot(x=counts.index, y=counts.values, **sns_params)
for p in ax.patches:
    height = p.get_height()
    ax.text(
        x=p.get_x()+(p.get_width()/2),
        y=height,
        s=round(height),
        ha="center"
    )
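
Alongside the bar chart, the balance can also be read as shares of the total. A minimal sketch, assuming a hypothetical `toy_survived` Series in place of `train["Survived"]`:

```python
import pandas as pd

# Hypothetical labels standing in for train["Survived"] (not real data).
toy_survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="Survived")

# Raw counts and normalized shares, aligned on the class label.
counts = toy_survived.value_counts().sort_index()
shares = toy_survived.value_counts(normalize=True).sort_index()
balance = pd.DataFrame({"count": counts, "share": shares})
print(balance)
```

A share close to 0.5 for each class would suggest that plain accuracy is a reasonable first metric; a strongly skewed share would argue for stratified splits.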

## 1.3 - Feature Deep Dive

In [None]:
from math import ceil

def plot_grid(df, columns, ptype, target=None, addl_params=None):
    num_rows = int(ceil(len(columns) / 3))
    grid = (num_rows, 3)
    fig = plt.figure(figsize=(16, num_rows * 6))
    # Copy so the caller's dict (or the shared sns_params) is never mutated below.
    addl_params = dict(addl_params) if addl_params else sns_params.copy()
    for index, column in enumerate(columns):
        fig.add_subplot(grid[0], grid[1], index+1)
        if ptype == "hist":
            addl_params.setdefault("stat", "count")
            _ = sns.histplot(df[column], kde=True, **addl_params)
        elif ptype == "box":
            _ = sns.boxplot(x=df[column], **addl_params)
        elif ptype == "count":
            plot = sns.countplot(x=df[column], hue=df[target], **addl_params)
            plot.legend(loc="upper right", title=df[target].name)
    plt.tight_layout()
    
def countplot(df, columns, target, addl_params=None):
    plot_grid(df, columns, ptype="count", target=target, addl_params=addl_params)
    
def histplot(df, columns, addl_params=None):
    plot_grid(df, columns, ptype="hist", addl_params=addl_params)
    
def boxplot(df, columns, addl_params=None):
    plot_grid(df, columns, ptype="box", addl_params=addl_params)
    
def displot(df, x, y, target, addl_params=None):
    addl_params = sns_params.copy() if not addl_params else addl_params
    plot = sns.displot(df, x=x, y=y, hue=target, kind="kde", **addl_params)

### 1.3.1 Age, Fare -> Train vs Test

In [None]:
histplot(train, ["Age", "Fare"])

In [None]:
histplot(test, ["Age", "Fare"], addl_params={"color": "purple"})

* The `Age` distributions of `train` and `test` differ, which presents a challenge:
    * The `train` dataset has large spikes in several age groups.
    * The `test` dataset has few or no passengers in several age groups.
* Take care not to overfit on `Age` in `train`, as such a fit may not generalize well to `test`.
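
One way to put numbers on the visual difference is to bin both `Age` columns on the same edges and compare the share of passengers in each bin. The sketch below uses hypothetical toy ages and a hypothetical `binned_shares` helper; it illustrates the comparison and is not a result from the competition data.

```python
import numpy as np

def binned_shares(values, bin_edges):
    """Share of non-missing values that fall into each bin."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing ages
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts / counts.sum()

# Hypothetical ages standing in for train["Age"] and test["Age"].
toy_train_age = [5, 22, 24, 25, 31, 38, 45, 52, np.nan, 61]
toy_test_age = [3, 8, 21, 33, 35, 47, 58, 63, 66, np.nan]

bin_edges = np.arange(0, 100, 10)  # 0, 10, ..., 90
train_share = binned_shares(toy_train_age, bin_edges)
test_share = binned_shares(toy_test_age, bin_edges)
gap = test_share - train_share  # positive: bin is more common in test

for left, diff in zip(bin_edges[:-1], gap):
    print(f"{left:>2}-{left + 10:<3} {diff:+.3f}")
```

Bins with a large positive or negative gap are the age ranges where a model fitted on `train` sees a different mix than it will meet in `test`.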

### 1.3.2 SibSp, Pclass, Parch -> Train vs Test

In [None]:
histplot(train, ["SibSp", "Pclass", "Parch"])

In [None]:
histplot(test, ["SibSp", "Pclass", "Parch"], addl_params={"color": "purple"})

* The `Pclass` distribution differs between `train` and `test`:
    * `test` has a higher proportion of `Pclass` 3 passengers than `train`.
    * `test` has a lower proportion of `Pclass` 2 passengers than `train`.
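
For a categorical column such as `Pclass`, the same comparison can be made directly on normalized value counts. A minimal sketch with hypothetical toy values (the two Series below are stand-ins, not the real columns):

```python
import pandas as pd

# Hypothetical class labels standing in for train["Pclass"] and test["Pclass"].
toy_train_pclass = pd.Series([1, 1, 2, 2, 2, 3, 3, 3, 3, 3], name="Pclass")
toy_test_pclass = pd.Series([1, 1, 2, 3, 3, 3, 3, 3, 3, 3], name="Pclass")

# Building the frame from two Series aligns the shares on the class label.
pclass_share = pd.DataFrame({
    "train": toy_train_pclass.value_counts(normalize=True),
    "test": toy_test_pclass.value_counts(normalize=True),
}).sort_index()
pclass_share["test_minus_train"] = pclass_share["test"] - pclass_share["train"]
print(pclass_share)
```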

### 1.3.3 Pclass, SibSp, Parch -> Relation to Survived

In [None]:
countplot(train, ["Pclass", "SibSp", "Parch"], "Survived")

* People in 1st and 2nd class are more likely to survive (`Pclass` = `1` or `2`)
* People with 1 parent or child are more likely to survive (`Parch` = `1`)
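
Count plots show absolute numbers, so large groups can visually dominate. A per-group survival rate makes groups of different sizes comparable. The sketch below assumes a hypothetical toy frame; on the real data the same `groupby` would run on `train`, and it applies equally to `SibSp` or `Parch`.

```python
import pandas as pd

# Hypothetical rows standing in for train[["Pclass", "Survived"]].
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 target is the survival rate; count gives the group size.
rate_by_class = (
    toy.groupby("Pclass")["Survived"]
    .agg(passengers="count", survival_rate="mean")
)
print(rate_by_class)
```

Keeping the group size next to the rate helps flag rates that rest on very few passengers.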

### 1.3.4 Sex, Age -> Relation to Survived

In [None]:
# Bin edges run 0, 10, ..., 90 so ages above 80 (the train max is 87.0) get a bin instead of NaN.
train["BinnedAge"] = pd.cut(train["Age"], list(range(0, 100, 10)))
countplot(train, ["Sex", "BinnedAge"], "Survived")

* Females were much more likely to survive than males (possibly due to the maxim **women and children first**).

### 1.3.5 Male, Age -> Relation to Survived

In [None]:
countplot(train[(train["Sex"] == "male")], ["BinnedAge"], "Survived")

* Males of any age category have a high likelihood of dying.

In [None]:
displot(train, "Age", "Fare", "Survived")

### 1.3.6 Female, Age -> Relation to Survived

In [None]:
countplot(train[(train["Sex"] == "female")], ["BinnedAge"], "Survived")

* Survival is high for every female age group except ages 0-10.
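
The per-sex plots in 1.3.5 and 1.3.6 can be condensed into one table of survival rates by age bin and sex. This is a minimal sketch on a hypothetical toy frame with coarser, made-up bin edges; the values are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Hypothetical passengers standing in for train[["Sex", "Age", "Survived"]].
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female", "female", "male"],
    "Age": [4, 35, 8, 29, 62, 41, 33, 27],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})
# Coarse bins chosen only to keep the toy table small.
toy["BinnedAge"] = pd.cut(toy["Age"], bins=[0, 10, 30, 50, 90])

# Mean of the 0/1 target per (age bin, sex), pivoted so each sex is a column.
rates = (
    toy.groupby(["BinnedAge", "Sex"], observed=True)["Survived"]
    .mean()
    .unstack("Sex")
)
print(rates)
```

Cells that come out as `NaN` mark combinations with no passengers, which is worth checking before reading anything into a neighbouring rate.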

### More to come...