# Introduction
This notebook works with the Titanic dataset from Kaggle's competition. We aim to analyze the data and predict which Titanic passengers survived.

### Imports
Import libraries and establish settings.

In [50]:
# Data manipulation
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Data analysis
import pingouin as pg

# set matplotlib parameters


# Analysis/Modeling

## Import data

In [65]:
# path to the file
path = "train.csv"

# import csv
train = pd.read_csv(path)
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


## Missing values

Let's explore what kind of data the dataset holds.

We have numeric and categorical data. Most columns contain no null values, but Age, Cabin and Embarked have missing values: 177 NaN values in Age, 687 in Cabin and 2 in Embarked.


To handle these, we will adopt the following strategies.

* First, we will explore whether "Age" differs between "Pclass" groups. If the differences are statistically significant, we will impute the NaN values in "Age" with the mean age of each "Pclass" group.


* Second, it seems that NaN values in "Cabin" might correspond to third-class passengers who were not assigned a cabin. To check this, we will first see whether "3" is the only "Pclass" value among the rows with a missing "Cabin".


* Third, we will try to find information in the dataset to impute the missing values in "Embarked". If we do not succeed, we will search for it on the internet.
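
The first strategy can be sketched on a small example. This is only an illustration under stated assumptions: the `toy` DataFrame and its values are made up, and `Age_imputed` is a hypothetical column name used to keep the original column visible.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# mean Age of each Pclass, broadcast back to every row of that class
# (NaN values are skipped when the mean is computed)
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# fill only the missing ages; observed ages are left untouched
toy["Age_imputed"] = toy["Age"].fillna(class_mean_age)
print(toy)
```

Using `transform("mean")` keeps the result aligned with the original rows, so no explicit loop over the classes is needed.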

In [66]:
# DataFrame information
train.info()

# missing values in Age, Cabin and Embarked
pd.DataFrame(data=train[["Age", "Cabin", "Embarked"]].isnull().sum(), columns=["Missing values"])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


| | Missing values |
|---|---|
| Age | 177 |
| Cabin | 687 |
| Embarked | 2 |


Let's explore descriptive statistics for the numerical data. The minimum Age is 0.42, which should correspond to an infant. Pclass only takes the values 1, 2 and 3, so there are no unexpected values in that column.

In [67]:
# descriptive statistics for numeric variables
train[["Age", "Fare", "Pclass"]].describe()

# unique values in Pclass
train["Pclass"].unique()

array([3, 1, 2], dtype=int64)

Now we will test whether there are statistically significant differences in "Age" between "Pclass" groups. The ANOVA shows a statistically significant difference (F(2, 711) = 57.44, p < 0.001), so we should replace NaN values for each "Pclass" independently.

In [68]:
# descriptive statistics by Pclass
display(train.groupby(["Pclass"])["Age"].describe().T)

# ANOVA
pg.anova(data=train[["Pclass", "Age"]], dv="Age", between="Pclass", detailed=False)

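| Pclass | 1 | 2 | 3 |
|---|---|---|---|
| count | 186.0 | 173.0 | 355.0 |
| mean | 38.233441 | 29.87763 | 25.14062 |
| std | 14.802856 | 14.001077 | 12.495398 |
| min | 0.92 | 0.67 | 0.42 |
| 25% | 27.0 | 23.0 | 18.0 |
| 50% | 37.0 | 29.0 | 24.0 |
| 75% | 49.0 | 36.0 | 32.0 |
| max | 80.0 | 70.0 | 74.0 |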


| | Source | ddof1 | ddof2 | F | p-unc | np2 |
|---|---|---|---|---|---|---|
| 0 | Pclass | 2 | 711 | 57.443484 | 7.487984e-24 | 0.139107 |
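
To make the ANOVA table easier to read, here is a minimal sketch of the textbook one-way ANOVA F statistic computed by hand with pandas. The `toy` data is made up for illustration. With the real data, the degrees of freedom in the table (2 and 711) are consistent with 3 classes and the 714 non-null ages.

```python
import pandas as pd

# toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Age": [38.0, 40.0, 42.0, 28.0, 30.0, 32.0, 23.0, 25.0, 27.0],
})

groups = toy.groupby("Pclass")["Age"]
grand_mean = toy["Age"].mean()
n_total = len(toy)
n_groups = groups.ngroups

# between-group sum of squares: distance of each class mean from the grand mean
ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()

# within-group sum of squares: spread of ages around their own class mean
ss_within = ((toy["Age"] - groups.transform("mean")) ** 2).sum()

# degrees of freedom, named as in the ANOVA table
ddof1 = n_groups - 1
ddof2 = n_total - n_groups

# F is the ratio of the between-group to the within-group mean square
f_stat = (ss_between / ddof1) / (ss_within / ddof2)
print(ddof1, ddof2, f_stat)
```

A large F means the class means differ by much more than the spread within each class would explain. Rows with a missing Age would need to be dropped before applying this to the real data.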


Exploring the data, we discovered that we cannot make an automatic replacement for "Cabin", because NaN values appear not only for third class but for first and second class too. Given this, we will replace NaN values in "Cabin" with "T" for third class only, and keep the NaN values for the other "Pclass" groups.

In [69]:
# unique values of Pclass for missing values in Cabin
train[train["Cabin"].isnull()]["Pclass"].unique()

array([3, 2, 1], dtype=int64)
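
The check above only shows which classes have missing cabins. A small extension, sketched here on made-up data, counts how many missing values each class has, which helps judge how much of the gap the third-class replacement would cover. The `toy` DataFrame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Cabin": ["C85", np.nan, np.nan, np.nan, "G6", np.nan],
})

# Pclass of every row with a missing Cabin, counted per class
missing_by_class = (
    toy.loc[toy["Cabin"].isnull(), "Pclass"].value_counts().sort_index()
)
print(missing_by_class)
```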

We couldn't gather any additional information from "Ticket" or "Cabin" to replace the missing values in "Embarked". However, searching on the internet we found that Miss Amelie Icard traveled on the Titanic as maid to Mrs. Martha Evelyn Stone, and that they embarked at Southampton, so we will replace these missing values with "S" for Southampton.

In [70]:
# rows for missing values in Embarked
display(train[train["Embarked"].isnull()])

# replace Embarked missing values (both rows share Cabin B28)
train.loc[train["Embarked"].isnull(), "Embarked"] = "S"

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | |
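
One way to look for embarkation information inside the dataset is to check whether other passengers share a ticket with the rows that lack "Embarked". The sketch below shows the idea on made-up data. In the rows shown above both passengers hold the same ticket and both lack a port, so this kind of lookup would not help for them.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["111", "111", "222", "222"],
    "Embarked": ["S", np.nan, np.nan, np.nan],
})

# tickets held by passengers whose Embarked value is missing
missing_tickets = toy.loc[toy["Embarked"].isnull(), "Ticket"].unique()

# passengers on those tickets who do have a port recorded
companions = toy[toy["Ticket"].isin(missing_tickets) & toy["Embarked"].notnull()]
print(companions)
```

Here passenger 2 could borrow the port of passenger 1, while ticket "222" has no companion with a recorded port.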


## Feature engineering

The "Ticket" variable contains a lot of potentially useful information. We will try to extract more useful features by splitting the ticket on the first space into two variables: one with the leading prefix, which seems to encode something more specific, and one with the rest of the ticket. Note that tickets without a prefix contain no space, so their number ends up in "Ticket_L" and "Ticket_N" is left empty, as the output below shows.

In [72]:
# split each ticket on the first space into two columns
ticket_parts = train["Ticket"].str.split(pat=" ", n=1, expand=True)

# rename the columns
ticket_parts = ticket_parts.rename(columns={0: "Ticket_L", 1: "Ticket_N"})

# join the split columns to the train DataFrame on the index
train = train.join(ticket_parts)
train

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Ticket_L | Ticket_N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S | A/5 | 21171 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | PC | 17599 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S | STON/O2. | 3101282 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 113803 | |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S | 373450 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S | 211536 | |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | 112053 | |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S | W./C. | 6607 |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | 111369 | |
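
In the output above, tickets that have no prefix (for example 113803) end up with their number in "Ticket_L". A possible alternative, sketched here and not the approach used in this notebook, is a regular expression that treats the prefix as optional and always puts the trailing digits in "Ticket_N". The pattern is an assumption about the ticket format. Tickets that do not end in digits would come out as NaN in both columns, and a prefix containing spaces would be kept whole.

```python
import pandas as pd

# ticket strings copied from the rows shown above
tickets = pd.Series(
    ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"]
)

# optional prefix (everything before the last space), then a trailing run of digits
pattern = r"^(?:(?P<Ticket_L>.*)\s)?(?P<Ticket_N>\d+)$"

# named groups become the column names of the resulting DataFrame
parts = tickets.str.extract(pattern)
print(parts)
```

With this pattern, a ticket without a prefix gets NaN in "Ticket_L" and its number in "Ticket_N", so the two columns keep a consistent meaning.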


To make the imputation reusable, let's create a custom transformer that can be applied to the test dataset in a Pipeline. It imputes missing "Age" values with the mean of each "Pclass" and also replaces NaN values in "Cabin" for third-class passengers with a separate category, "T", so that those NaN values do not remain.

In [73]:
class CustomImputer(BaseEstimator, TransformerMixin):
    """Impute Age with the per-Pclass mean and flag missing third-class cabins."""

    def fit(self, X, y=None):
        # learn the mean Age of each Pclass from the data passed to fit
        self.mean_age_by_class_ = X.groupby("Pclass")["Age"].mean()

        return self

    def transform(self, X, y=None):
        # work on a copy so the caller's DataFrame is not modified
        X = X.copy()

        # fill missing Age with the mean of the passenger's Pclass
        for pclass, mean_age in self.mean_age_by_class_.items():
            X.loc[X["Age"].isnull() & (X["Pclass"] == pclass), "Age"] = mean_age

        # third class Cabin null values are replaced with "T"
        X.loc[X["Cabin"].isnull() & (X["Pclass"] == 3), "Cabin"] = "T"

        return X
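
As a rough illustration of how such a transformer fits into a Pipeline, the sketch below uses a simplified, hypothetical stand-in (`ClassMeanAgeImputer`) and made-up train and test frames. The point it shows is that the class means are learned from the training data only and then reused on unseen data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ClassMeanAgeImputer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: fills missing Age with the per-Pclass mean."""

    def fit(self, X, y=None):
        # class means are learned here and stored for later use
        self.mean_age_by_class_ = X.groupby("Pclass")["Age"].mean()
        return self

    def transform(self, X):
        # copy so the input DataFrame is left unchanged
        X = X.copy()
        X["Age"] = X["Age"].fillna(X["Pclass"].map(self.mean_age_by_class_))
        return X


# toy frames (hypothetical values standing in for the train and test sets)
toy_train = pd.DataFrame({"Pclass": [1, 1, 3, 3], "Age": [40.0, 50.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"Pclass": [1, 3], "Age": [np.nan, np.nan]})

pipeline = Pipeline([("impute", ClassMeanAgeImputer())])

# fit learns the class means from the training data only
pipeline.fit(toy_train)

# transform reuses those means on the unseen data
test_imputed = pipeline.transform(toy_test)
print(test_imputed)
```

Fitting on the training set and only transforming the test set avoids leaking test information into the imputed values.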

# Results
Show graphs and stats here

# Conclusions and Next Steps
Summarize findings here