# Titanic w/ Pipelines
## Goal
Take what we did in V1 and expand upon it with more info/tutorials on pipelining.

## Relevant Material
[Deep dive in sklearn pipelines](https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines)  
[Simple pipeline example with scikit learn](https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976)

## Titanic Data
| Variable | Definition | Key |
| ----- | --- | --- |
| survival | Survived or not | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | Num of siblings / spouses aboard | |
| parch | Num of parents / children aboard | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

## Base setup

In [9]:
%reload_ext autoreload
%autoreload 2

# custom helpers
from helpers.helper import get_splits, process_titanic
# data handling
import numpy as np
import pandas as pd
# output
from termcolor import cprint
import matplotlib.pyplot as plt
import seaborn as sns

cprint('All Modules Imported!', 'green')

All Modules Imported!


## Data Import

In [11]:
train_data = pd.read_csv('./data/train.csv', index_col='PassengerId')
test_data = pd.read_csv('./data/test.csv', index_col='PassengerId')

cprint('Data Imported!', 'green')
cprint('Training Data Example:', 'cyan')
display(train_data)

Data Imported!
Training Data Example:


| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


## Pipelining
### Reading Notes
- For very basic feature engineering, it's good to encapsulate processing logic into functions so it can be reproduced easily.
- For optimizing and testing models, it's a good idea to create feature selectors that can be used inside a pipeline to apply transformations to single columns. Check out `TextSelector` and `NumericSelector` in `helper.py`.
  - The selectors can be used as follows: `('selector', TextSelector(key='processed'))`
- Pipelines are built from pipelines. For each processing step, we can create a mini-pipeline that carries out the task/engineering we need.
  - If doing engineering on individual columns, it's important to join the engineered columns back into the dataset. The `FeatureUnion` class in `sklearn.pipeline` can be great for this.
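
The selector and `FeatureUnion` notes above can be sketched roughly as follows. This is a minimal illustration, not the code in `helper.py`: `ColumnSelector` is a hypothetical stand-in for `TextSelector`/`NumericSelector`, and the toy frame only borrows a few values from the training data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the helper.py selectors: keeps one column."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; the selector is stateless.
        return self

    def transform(self, X):
        # Double brackets keep the result 2-D, which downstream transformers expect.
        return X[[self.key]]


toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# One mini-pipeline per column: select it, then transform it.
age_pipe = Pipeline([
    ("selector", ColumnSelector(key="Age")),
    ("scale", StandardScaler()),
])
sex_pipe = Pipeline([
    ("selector", ColumnSelector(key="Sex")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# FeatureUnion runs both mini-pipelines and stacks their outputs side by side.
features = FeatureUnion([("age", age_pipe), ("sex", sex_pipe)])
combined = features.fit_transform(toy)
print(combined.shape)
```

The result has one scaled `Age` column plus one column per `Sex` category, so the engineered columns end up back in a single feature matrix.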

### Basic Engineering

In [12]:
numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_cols = ['Pclass', 'Sex', 'Cabin', 'Embarked']
target = 'Survived'

from sklearn.model_selection import train_test_split

# Preprocessing fills NA values and adds interactions
processed_data = process_titanic(train_data, categorical_cols, numerical_cols)
# display(processed_data)

features = [c for c in train_data.columns.values if c not in ['PassengerId','Survived','Name','Ticket']]

X_train, X_test, y_train, y_test = train_test_split(train_data[features], train_data[target], test_size=0.33, random_state=42)
X_train.head()

| PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Pclass_Sex | Pclass_Cabin | Pclass_Embarked | Sex_Cabin | Sex_Embarked | Cabin_Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | 1 | male | 54.0 | 0 | 0 | 51.8625 | E46 | S | 1_male | 1_E46 | 1_S | male_E46 | male_S | E46_S |
| 719 | 3 | male | -1.0 | 0 | 0 | 15.5 | Not Specified | Q | 3_male | 3_Not Specified | 3_Q | male_Not Specified | male_Q | Not Specified_Q |
| 686 | 2 | male | 25.0 | 1 | 2 | 41.5792 | Not Specified | C | 2_male | 2_Not Specified | 2_C | male_Not Specified | male_C | Not Specified_C |
| 74 | 3 | male | 26.0 | 1 | 0 | 14.4542 | Not Specified | C | 3_male | 3_Not Specified | 3_C | male_Not Specified | male_C | Not Specified_C |
| 883 | 3 | female | 22.0 | 0 | 0 | 10.5167 | Not Specified | S | 3_female | 3_Not Specified | 3_S | female_Not Specified | female_S | Not Specified_S |


As stated above, we can use the custom Selector classes to apply transforms to specific columns, for example standardization.
### 2/4/20 notes
After doing some work, I don't think the selector is really necessary, at least for our case. We can scale everything once in the pipeline by specifying the columns we want to scale.
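
One way to do this without selectors is scikit-learn's `ColumnTransformer`. The sketch below is an assumption about how it could look here, not something the notebook has run: it scales only `numerical_cols` and passes the remaining columns through untouched, using a toy frame with a few values from the training data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

numerical_cols = ["Age", "SibSp", "Parch", "Fare"]

toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Scale the listed columns; everything else is appended unchanged.
scaler = ColumnTransformer(
    transformers=[("num", StandardScaler(), numerical_cols)],
    remainder="passthrough",
)
scaled = scaler.fit_transform(toy)
print(scaled.shape)
```

Note that the output puts the transformed columns first and the passed-through columns last, so `Pclass` ends up in the final column.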

Also, our helper preprocessor currently fills in NAs. In the future, we can decide whether to do that in the pipeline instead, which is probably better. But then how do we build interactions in a pipeline?
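
Filling NAs inside the pipeline could look something like the sketch below. It assumes the fill values seen in the processed output earlier (`-1` for `Age`, `Not Specified` for `Cabin`) and uses scikit-learn's `SimpleImputer`; the helper's actual logic may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": pd.Series(["C85", np.nan, np.nan], dtype=object),
})

# One constant-fill imputer per column type (fill values are assumptions
# taken from the processed output shown earlier).
imputer = ColumnTransformer([
    ("age", SimpleImputer(strategy="constant", fill_value=-1.0), ["Age"]),
    ("cabin", SimpleImputer(strategy="constant", fill_value="Not Specified"), ["Cabin"]),
])
filled = imputer.fit_transform(toy)
print(filled)
```

Because the imputer is fitted as part of the pipeline, the same fill logic is applied to the training and test splits without a separate preprocessing call.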

#### *Hey*!
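We can use scikit-learn's [PolynomialFeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) to create interactions automatically, which means we can do it in the pipeline (I think). Note that it multiplies numeric columns together, so categorical columns would need to be encoded before this step.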

In [None]:
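from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Standardize numeric columns to zero mean and unit variance
numerical_transformer = StandardScaler()
# One-hot encode categoricals; categories unseen during fit are encoded as all zeros
categorical_transformer = OneHotEncoder(handle_unknown='ignore')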



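# Pair plot with regression fits, adapted from the example at
# http://scipy-lectures.org/packages/statistics/index.html
# (that example plots WAGE, AGE and EDUCATION columns)
sns.pairplot(train_data.dropna(subset=numerical_cols),
             vars=numerical_cols, kind='reg')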
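
To tie the pieces together, the scaler and one-hot encoder from this cell could be wired into a single pipeline along the lines below. This is only a sketch under assumptions: `LogisticRegression` is a placeholder model the notebook has not chosen, and the toy frame holds the first five training rows rather than the real split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Sex": ["male", "female", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})
feature_cols = ["Age", "Fare", "Sex"]

# Route numeric and categorical columns to their own transformer.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Placeholder classifier; any scikit-learn estimator could go in this slot.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

model.fit(toy[feature_cols], toy["Survived"])
preds = model.predict(toy[feature_cols])
print(preds)
```

Keeping preprocessing and the model in one `Pipeline` means a single `fit` call learns the scaling, the encoding, and the classifier from the training split only.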