### EDA of the April 2021 Kaggle Tabular Playground Series (TPS)

This notebook is an initial exploration of the training data: the distribution and completeness of each column, and its relationship with the target. The aim is to understand the data well enough to make some baseline predictions.

[Data dictionary](https://www.kaggle.com/c/tabular-playground-series-apr-2021/data?select=train.csv)

In [None]:
# imports for EDA
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# imports for inference
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
df = pd.read_csv("../data/raw/train.csv")

In [None]:
df.head()

The first few rows give some idea of what to expect. Next, look at the distribution of each column more closely, and think about how to handle missing data and how to encode sex, class, port of embarkation, etc.

In [None]:
df.info()

In [None]:
df.describe()

With a 43% survival rate, the bar to beat is the 57% accuracy from predicting everyone as lost. Age has a lot of missing values; Fare has far fewer. Assuming the passenger IDs are all OK, on to survival and lots of plots.
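
The null counts above are read off `df.info()` one column at a time. A minimal sketch of a single summary table follows, assuming a toy frame (`toy`, with made-up values) in place of the real `df`; the name `missing` is only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame; in the notebook this would be df.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# Count and percentage of nulls per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

Swapping `toy` for `df` should give the same view for every column at once, which makes it easier to decide which columns need attention first.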

In [None]:
sns.displot(data=df, x="Survived")

In [None]:
sns.displot(data=df, x="Pclass", hue="Survived", multiple="stack")

In [None]:
df.groupby(['Sex', 'Survived'])['Survived'].count()

In [None]:
df['Sex'].isnull().sum()

No missing values here.

In [None]:
sns.displot(data=df, x="Age", hue="Survived", kind="kde")

In [None]:
df['Age'].isnull().sum()
# 3.3% of values for Age are missing - may be able to impute from other cols

In [None]:
sns.displot(data=df, x="SibSp", hue="Survived", multiple="stack")
# SibSp and Parch will be returned to after a baseline established

In [None]:
sns.displot(data=df, x="Parch", hue="Survived", multiple="stack")

In [None]:
print(df['Ticket'].nunique())
print(df['Ticket'].isnull().sum())
# After baseline established, should be possible to link families using ticket, cabin

In [None]:
print(df['Fare'].isnull().sum())
sns.displot(data=df, x="Fare", hue="Survived", kind="kde")

In [None]:
print(df['Cabin'].nunique())
print(df['Cabin'].isnull().sum())

In [None]:
df.groupby(['Embarked', 'Survived'])['Survived'].count()

In [None]:
df['Embarked'].isnull().sum()
# A few missing here, to investigate after the first iteration is complete

There is still work to do on the best way to impute missing data in Age, SibSp, Parch, Ticket, Cabin, and Embarked, but first I want to compare a quick MVP against a dummy benchmark to set a baseline. For this, after a train/test split, the categorical columns need to be encoded and the nulls filled with temporary placeholder values.
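
One option for the later imputation work is a group-wise median, for example filling Age from the median of the passenger's class. This is only a sketch on toy data: the frames `train` and `test`, the helper `fill_age`, and the choice of `Pclass` as the grouping column are assumptions for illustration, not something tested in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / X_test; column names follow the notebook.
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})
test = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Age": [np.nan, np.nan, np.nan],
})

# Learn one median per class from the training rows only, plus a global fallback.
age_by_class = train.groupby("Pclass")["Age"].median()
global_age = train["Age"].median()

def fill_age(frame):
    """Return a copy with missing Age filled by the training median for that Pclass."""
    out = frame.copy()
    per_row = out["Pclass"].map(age_by_class)  # NaN for classes unseen in training
    out["Age"] = out["Age"].fillna(per_row).fillna(global_age)
    return out

train_filled = fill_age(train)
test_filled = fill_age(test)
print(test_filled)
```

The medians come from the training rows only, so nothing from the held-out rows leaks into the fill values. The global median covers any class that never appears in training.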

In [None]:
target = df[['Survived']]
features = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

In [None]:
categorical = ['Pclass', 'Sex', 'Embarked']
numeric = ['Age', 'SibSp', 'Parch', 'Fare']

In [None]:
X_train[numeric].describe()
# df['value'] = df['value'].fillna(df.groupby('category')['value'].transform('mean'))
# df['value'] = df['value'].fillna(df['value'].mean())

In [None]:
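# copy first so the fillna assignments don't trigger SettingWithCopyWarning
X_train = X_train.copy()
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].median())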

In [None]:
X_train['Fare'] = X_train['Fare'].fillna(X_train['Fare'].median())

With numerical blanks filled with median values (remember, this is an MVP) and a train/test split in place, numeric columns are scaled and categorical columns are one-hot encoded. The initial estimator is logistic regression, a simple choice for a binary classification task.
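
An alternative to filling nulls by hand is to put the imputation inside the pipeline, so the medians are learned at `fit` time and reused at `predict` time. The sketch below assumes toy data and uses scikit-learn's `SimpleImputer`; the most-frequent fill for categoricals and `handle_unknown="ignore"` are my assumptions, not what the cells in this notebook do.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data shaped like a subset of the notebook's features.
X = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0, 40.0],
    "Fare": [7.25, 71.28, np.nan, 51.86, 8.05, 21.08, 11.13, 30.0],
    "Sex": ["male", "female", "female", "male", "male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S", "Q", "S", "C", "S"],
})
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])

numeric_cols = ["Age", "Fare"]
categorical_cols = ["Sex", "Embarked"]

# Each column group gets its own impute-then-transform sub-pipeline.
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric_cols),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")), categorical_cols),
)

clf = make_pipeline(preprocess, LogisticRegression())
clf.fit(X, y)
preds = clf.predict(X)
print(preds)
```

With this layout the same fitted object can be applied to the hold-out split and the Kaggle test file without repeating the `fillna` calls.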

In [None]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric),
    (OneHotEncoder(drop='if_binary'), categorical), 
    remainder='passthrough')

model = make_pipeline(
    preprocessor,
    LogisticRegression())

_ = model.fit(X_train, np.ravel(y_train))

To set a floor with a baseline, let's see the classification report if we predict the ship going down with no survivors.
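
The same floor can also be produced with scikit-learn's `DummyClassifier`, which keeps the baseline in the same fit/predict shape as the real model. A minimal sketch on toy labels (`X_toy` and `y_toy` are made up, with roughly the 57/43 split seen above):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels with roughly the 57/43 split seen in the training data.
y_toy = np.array([0, 0, 0, 0, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict class 0, i.e. "no survivors".
dummy = DummyClassifier(strategy="constant", constant=0)
dummy.fit(X_toy, y_toy)
dummy_pred = dummy.predict(X_toy)

dummy_accuracy = accuracy_score(y_toy, dummy_pred)
print(f"dummy accuracy: {dummy_accuracy:.3f}")
```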

In [None]:
# y_pred is not defined until the next cell, so size the dummy from y_train
y_dummy = np.full(shape=len(y_train), fill_value=0)
print(classification_report(y_train, y_dummy))

In [None]:
y_pred = model.predict(X_train)
print(classification_report(y_train, y_pred))

Performance on the training data is OK: better than guessing that no one makes it, at least.

In [None]:
# copy first so the fillna assignments don't trigger SettingWithCopyWarning
X_test = X_test.copy()
X_test['Age'] = X_test['Age'].fillna(X_train['Age'].median())
X_test['Fare'] = X_test['Fare'].fillna(X_train['Fare'].median())
y_unseen = model.predict(X_test)
print(classification_report(y_test, y_unseen))

To finish, the same steps are carried out on the test set and the predictions submitted. With more time I'd want to do EDA on the test set to look for drift in each category, but for right now I'm interested in how this model performs without any additional intervention.
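
A rough sketch of what that drift check could look like for one categorical column is below. It compares category shares between train and test. The toy frames and the helper `category_shift` are hypothetical, and what counts as a worrying difference is left open.

```python
import pandas as pd

# Toy stand-ins for the train and test frames.
train_toy = pd.DataFrame({"Embarked": ["S", "S", "S", "C", "Q", "S", "C", "S"]})
test_toy = pd.DataFrame({"Embarked": ["S", "C", "C", "C", "Q", "S", "C", "Q"]})

def category_shift(train_col, test_col):
    """Side-by-side category shares for train and test, with the difference."""
    shares = pd.concat(
        [train_col.value_counts(normalize=True).rename("train"),
         test_col.value_counts(normalize=True).rename("test")],
        axis=1,
    ).fillna(0.0)  # a category missing from one side gets a share of 0
    shares["diff"] = shares["test"] - shares["train"]
    return shares

shift = category_shift(train_toy["Embarked"], test_toy["Embarked"])
print(shift)
```

Running this per categorical column on the real `df` and `test_df` would show whether, say, the port mix differs between the two files.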

In [None]:
# convert to code and run if running inference on test set for kaggle submission
file_out = "../data/inference/basic_eda_gb.csv"
test_df = pd.read_csv("../data/raw/test.csv")
test_df.head()
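# copy so the fillna assignments below don't trigger SettingWithCopyWarning
test_features = test_df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].copy()
test_features['Age'] = test_features['Age'].fillna(X_train['Age'].median())
test_features['Fare'] = test_features['Fare'].fillna(X_train['Fare'].median())
# model_gb is fitted in a later cell, so run that cell first;
# swap in `model` to reproduce the logistic regression submission
test_unseen = model_gb.predict(test_features)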
test_df['Survived'] = test_unseen.tolist()
test_submission = test_df[['PassengerId', 'Survived']]
test_submission.to_csv(file_out, index=False)

This initial submission achieved a leaderboard score of 0.77000. 

Next steps are:
- use an ensemble algorithm on the same data as the logistic regression above and measure the uplift
- look at more intelligent ways of filling nulls, and make better use of tickets and cabins to see if that affects survival.
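
For the first of these, one way to measure uplift is to score both estimators on the same cross-validation folds rather than a single split. The sketch below uses synthetic data from `make_classification` as a stand-in for the preprocessed features. The fold count, the seeds and the `candidates` dict are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary data as a stand-in for the preprocessed features.
X_syn, y_syn = make_classification(n_samples=300, n_features=7, random_state=42)

# Fixed folds so both models are scored on identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean accuracy across the folds for each model.
cv_scores = {
    name: cross_val_score(est, X_syn, y_syn, cv=cv, scoring="accuracy").mean()
    for name, est in candidates.items()
}
uplift = cv_scores["gradient_boosting"] - cv_scores["logistic_regression"]
print(cv_scores, f"uplift: {uplift:+.3f}")
```

In the notebook the two pipelines (`model` and `model_gb`) could be passed to `cross_val_score` in place of the bare estimators, with `features` and `target` as the data.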

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
# yes, there needs to be a proper CV grid search eventually
# submitting this as is to Kaggle yielded a leaderboard score of 0.78642
model_gb = make_pipeline(
    preprocessor,
    GradientBoostingClassifier())

_ = model_gb.fit(X_train, np.ravel(y_train))

In [None]:
y_pred_gb = model_gb.predict(X_train)
print(classification_report(y_train, y_pred_gb))

In [None]:
y_unseen_gb = model_gb.predict(X_test)
print(classification_report(y_test, y_unseen_gb))