# Titanic Data Challenge

## Introduction
**Kaggle Description**: 
> On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

> While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

> In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

**Goal**: Predict whether a passenger survived the sinking of the Titanic, based on the provided information.

Let's begin! First, let's do some boilerplate setup.

### Imports

In [None]:
%reload_ext autoreload
%autoreload 2

# custom helpers
from helpers.helper import get_splits
# data handling
import numpy as np
import pandas as pd
# output
from termcolor import cprint
import matplotlib.pyplot as plt
import seaborn as sns

cprint('All Modules Imported!', 'green')

### Data Import

In [None]:
import os
os.listdir('./data/')

In [None]:
train_data = pd.read_csv('./data/train.csv', index_col='PassengerId')
test_data = pd.read_csv('./data/test.csv', index_col='PassengerId')

cprint('Data Imported!', 'green')
cprint('Training Data Example:', 'cyan')
display(train_data)

## Process
1. Figure out which features we can safely drop/keep.
2. Encode features that need encoding (label encoding, categorical encoding).
3. Engineer some new columns so the model has a wider set of features to predict from.
4. Do feature selection to determine which features are not needed and find the best combination of features to use.
5. Research and test what models would be best for our situation and train/test different models.
6. Train and predict on the train/test sets.
7. Finally, output everything to a new CSV.

## Tools
Our current model options are: LightGBM, RandomForestClassifier, ExtraTreesClassifier (the classifier variants, since `Survived` is a binary target).

I also want to use a pipeline to keep everything organized into various steps.


## Getting Started
### Feature Engineering
So the columns we have are:

| Variable | Definition | Key |
| ----- | --- | --- |
| survival | Survived or not | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | Num of siblings / spouses aboard | |
| parch | Num of parents / children aboard | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation |   C = Cherbourg, Q = Queenstown, S = Southampton |

Looking at these descriptions, we can probably disregard:
- Name
- Ticket number

Ticket number is a maybe, since we aren't entirely sure how ticket numbers were handed out.
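One quick way to probe whether ticket numbers carry any signal is to count how many passengers share each ticket and compare survival across those group sizes. This is only a sketch on a small made-up frame standing in for `train_data`; the column name `TicketGroupSize` is hypothetical and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for train_data; these rows are made up for illustration.
toy = pd.DataFrame(
    {
        "Ticket": ["A/5 21171", "PC 17599", "113803", "113803", "PC 17599", "PC 17599"],
        "Survived": [0, 1, 1, 0, 1, 1],
    }
)

# Number of passengers travelling on the same ticket (hypothetical feature name).
toy["TicketGroupSize"] = toy.groupby("Ticket")["Ticket"].transform("count")

# Mean survival per group size; on the real data this hints at whether
# the ticket column is worth keeping in some derived form.
rate_by_group = toy.groupby("TicketGroupSize")["Survived"].mean()
print(rate_by_group)
```

If the survival rate barely moves across group sizes on the real data, dropping the column is probably safe.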

Let's get to engineering.
 

## Model Creation Steps
1. [x] Choose feature cols based on feature table and relevant data
2. [x] Split into train/valid/test sets
3. [ ] Generate features
    1. Interactions
    2. ...
4. [ ] Setup pipeline
    1. [x] Imputation to fill in N/A values
    2. [x] Categorical encoding (CatBoost)
    3. [x] Standardize values
    4. [ ] Feature Selection
5. [ ] Train
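The feature selection step is still unchecked in the list above. As a rough idea of how it could slot into a pipeline, here is a minimal sketch using scikit-learn's `SelectKBest` on synthetic data. It uses a plain `Pipeline` and `LogisticRegression` rather than the notebook's `PipelineFS` and LightGBM, and `k=5` is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic binary-classification data standing in for the Titanic features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Keep the 5 features with the highest ANOVA F-score, then fit a simple model.
selector_pipeline = Pipeline(
    steps=[
        ("select", SelectKBest(score_func=f_classif, k=5)),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)
selector_pipeline.fit(X_demo, y_demo)

# Boolean mask over the original columns: True means the feature was kept.
kept_mask = selector_pipeline.named_steps["select"].get_support()
print(kept_mask)
```

Putting the selector inside the pipeline means it is refit on each cross-validation fold, so the selection never sees the held-out rows.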

### Feature Generation / Engineering

In [None]:
# 1. Choose our feature cols based on feature table above
numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_cols = ['Pclass', 'Sex', 'Cabin', 'Embarked']
target_col = 'Survived'
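# Let's make some features: pairwise interactions between the categorical columns
from itertools import combinations

interactions = pd.DataFrame(index=train_data.index)
# Iterate over a snapshot so appending the new names below cannot affect the loop
for col_a, col_b in combinations(tuple(categorical_cols), 2):
    new_feat = col_a + "_" + col_b
    # Note: astype(str) turns missing values into the literal string 'nan'
    interactions[new_feat] = train_data[col_a].astype(str) + "_" + train_data[col_b].astype(str)
    categorical_cols.append(new_feat)
train_data = train_data.join(interactions)
display(train_data)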
# 2. Split sets
train, valid, _ = get_splits(train_data)
X_train = train.drop([target_col], axis=1)
y_train = train[target_col]
X_valid = valid.drop([target_col], axis=1)
y_valid = valid[target_col]
display(X_train)
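`get_splits` is a custom helper whose code isn't shown in this notebook. As a rough sketch of what a three-way split like that could look like, here is one built on scikit-learn's `train_test_split`. The name `split_three_ways`, the fractions, and the seed are all assumptions, not the helper's actual behavior.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_ways(df, valid_fraction=0.1, test_fraction=0.1, seed=0):
    """Hypothetical stand-in for get_splits: returns (train, valid, test) frames."""
    holdout_fraction = valid_fraction + test_fraction
    # First carve off the combined validation + test portion.
    train, holdout = train_test_split(df, test_size=holdout_fraction, random_state=seed)
    # Then divide that holdout between validation and test.
    valid, test = train_test_split(
        holdout, test_size=test_fraction / holdout_fraction, random_state=seed
    )
    return train, valid, test


# Toy frame standing in for train_data.
demo = pd.DataFrame({"Survived": [0, 1] * 50, "Fare": range(100)})
train_demo, valid_demo, test_demo = split_three_ways(demo)
print(len(train_demo), len(valid_demo), len(test_demo))
```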


### Pipeline

In [None]:
# machine learning
from sklearn import feature_selection
from sklearn import preprocessing
# Pipeline code originally copied from 3_intermediate_training_summary
from helpers.helper import PipelineFS
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
# conda install -c conda-forge category_encoders
from category_encoders import CatBoostEncoder
# conda install -c conda-forge lightgbm
import lightgbm as lgb
from sklearn.model_selection import cross_val_score


In [None]:
def get_lgb_pipeline_score(X, y, params=None):
    """
    Run the LightGBM pipeline with the provided parameters.
    Returns the mean score from 5-fold cross-validation.

    params: dict with the following keys
        n_estimators: number of boosting rounds (trees), Default: 10
        num_leaves: num_leaves in the lgb model, Default: 64
        rate: learning rate, Default: 0.1
        early_stopping_rounds: rounds without improvement before stopping, Default: 10
            (accepted but not currently used by this function)
    """
    # Build the defaults per call to avoid a shared mutable default argument
    if params is None:
        params = {'n_estimators': 10, 'num_leaves': 64, 'rate': 0.1, 'early_stopping_rounds': 10}
    # Preprocessing for numerical data (fill in NA)
    numerical_transformer = PipelineFS(
        steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]
    )

    # Preprocessing for categorical data
    categorical_transformer = PipelineFS(
        steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('catboost', CatBoostEncoder())
        ]
    )

    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ]
    )

    model = lgb.LGBMClassifier(n_estimators=params['n_estimators'], num_leaves=params['num_leaves'], learning_rate=params['rate'])

    # Bundle preprocessing and modeling code in a pipeline
    my_pipeline = PipelineFS(
        steps=[
            ('preprocessor', preprocessor),
            ('model', model)
        ],
        verbose=False
    )
    # A direct fit on the training split is kept here for reference only
    # my_pipeline.fit(X_train, y_train)
    # cprint('Fit!', 'green')
    # Cross-validate the whole pipeline (default scoring for a classifier is accuracy)
    scores = cross_val_score(my_pipeline, X, y, cv=5)
    return scores.mean()

In [None]:
# get_lgb_pipeline_score(X_train, y_train)

results = {}
params={'n_estimators':10,'num_leaves':64,'rate':0.1,'early_stopping_rounds':10}
for i in range(50, 1001, 50):
    params['n_estimators'] = i
    results[i] = get_lgb_pipeline_score(X_train, y_train, params=params)
print(results)
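Printing the raw dict makes it hard to spot the winner. A small sketch of picking the best setting out of a dict shaped like `results` follows. The scores here are made-up placeholders, not output from the loop above.

```python
# Made-up scores standing in for the `results` dict (n_estimators -> mean CV score).
toy_results = {50: 0.801, 100: 0.812, 150: 0.809, 200: 0.805}

# Key with the highest cross-validation score.
best_n = max(toy_results, key=toy_results.get)
print(f"best n_estimators: {best_n} (score {toy_results[best_n]:.3f})")
```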

## Notes from Example Kernels
So far so good, but compared to other submissions, not great. Below are some notes from looking at some example kernels.
- Combine the train and test data sets into a single dataframe to make feature transformation easier.
    - Get the ids (indexes) of both the test and train sets so that they can be extracted again later.
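The combine-then-split idea from the notes above can be sketched like this. The frames are tiny made-up stand-ins for `train_data` and `test_data`, and `IsFemale` is a hypothetical feature used only to show a transformation applied once to both sets.

```python
import pandas as pd

# Tiny made-up stand-ins for train_data and test_data, indexed by PassengerId.
toy_train = pd.DataFrame(
    {"Survived": [0, 1], "Sex": ["male", "female"]},
    index=pd.Index([1, 2], name="PassengerId"),
)
toy_test = pd.DataFrame(
    {"Sex": ["female", "male"]},
    index=pd.Index([892, 893], name="PassengerId"),
)

# Remember which ids belong to which set so they can be separated again later.
train_ids = toy_train.index
test_ids = toy_test.index

# Stack the feature columns of both sets (the target exists only in train).
combined = pd.concat([toy_train.drop(columns="Survived"), toy_test])

# Apply a transformation once to the combined frame (hypothetical feature).
combined["IsFemale"] = (combined["Sex"] == "female").astype(int)

# Recover the original sets by id.
train_features = combined.loc[train_ids]
test_features = combined.loc[test_ids]
print(train_features)
print(test_features)
```

This only works cleanly when the ids are unique across both files. Any step that learns from the data, such as target encoding or imputing with a mean, should still be fit on the training rows only.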