# Under Construction Feb 15, 2018
![alt text](under-2891888_640.jpg "Under Construction")

# Scikit Learn Workflow with Pandas

This is the first in a series of notebooks which demonstrate a proper machine learning workflow using Scikit Learn and Pandas.

### Common Theme in Notebooks
These notebooks focus on the Pandas and Scikit Learn **workflow**.  *As a beginner, it is more important to understand proper workflow than it is to write one-off code which attempts to optimize the predictive accuracy of one dataset.*

These notebooks use the Titanic dataset from Kaggle, but emphasize *repeatable process* over predictive accuracy.

Once the proper workflow is understood, it then becomes worthwhile to optimize predictive accuracy on a dataset-by-dataset basis.

Some useful resources for getting started with Pandas and Scikit Learn include:

1. Udemy
2. O'Reilly
3. Datacamp

### Machine Learning Task
Make a prediction for survived / not-survived using the Titanic dataset from Kaggle.  This is a supervised learning problem that will make use of labeled data only.

### Who These Notebooks are For
Anyone who has written some code in Python using Pandas and Scikit Learn.  Someone who knows how to learn an API from documentation, but is looking for the big picture of how to put it all together.

### Notebooks
1. Scikit Learn and Pandas Basic Workflow
2. Scikit Learn Pipelines
3. Scikit Learn Pipelines with Pandas Feature Engineering
4. Scikit Learn Pipelines with Pandas Feature Engineering and Hyperparameter Optimization

### Software Versions
This notebook was created in a development environment using the Anaconda distribution.  The following versions of software were used:

* Python  3.6
* Numpy  1.13
* Pandas 0.22
* Scikit Learn 0.19.1

## Acquire the Data

Download "train.csv" from: https://www.kaggle.com/c/titanic/data

### Common Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

### Read data into Pandas DataFrame and Examine First Records

In [2]:
train = pd.read_csv('./data/train.csv')
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Exploratory Data Analysis (EDA)

This usually consists of:
1. Determining how many null values there are per column
2. Determining which features to keep
3. Correcting the datatypes (read_csv() infers datatypes, but it's better to be specific)
4. Visual analysis, perhaps with a package such as seaborn
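
As a rough illustration of step 3, the sketch below declares datatypes at read time instead of correcting them afterwards. The inline CSV text and the `dtypes` mapping are assumptions made up for this example; they are not the Kaggle file.

```python
import io

import pandas as pd

# A tiny inline sample (illustrative values only, not the real Kaggle data).
csv_text = """PassengerId,Survived,Sex,Age,Embarked
1,0,male,22.0,S
2,1,female,,C
3,1,female,26.0,
"""

# Declare the dtype of each column up front rather than relying on inference.
dtypes = {
    'PassengerId': 'int64',
    'Survived': 'int64',
    'Sex': 'category',
    'Age': 'float64',
    'Embarked': 'category',
}
sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Steps 1 and 3 at a glance: datatypes and null counts per column.
print(sample.dtypes)
print(sample.isnull().sum())
```

Declaring the types up front makes a surprise in the raw file (for example, text in a numeric column) fail loudly at load time rather than silently producing an object column.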

EDA Null Value Analysis

In [3]:
# Find the fraction of missing values per column (multiply by 100 for a percentage)
nrows, ncols = train.shape
train.isnull().sum() / nrows

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

In [4]:
# Discard Cabin as there are too many missing values
train.drop('Cabin', axis=1, inplace=True)
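
The same decision can be made repeatable by dropping every column whose missing fraction exceeds a cutoff. The following is only a sketch on made-up data; the `threshold` value of 0.5 is a hypothetical choice, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 'Cabin' is mostly missing, 'Age' is partly missing.
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

# isnull().mean() gives the fraction of missing values per column.
missing_fraction = df.isnull().mean()

# Hypothetical cutoff: drop any column that is more than half empty.
threshold = 0.5
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
df_reduced = df.drop(to_drop, axis=1)

print(missing_fraction)
print('Dropped columns:', to_drop)
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison.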

EDA Drop Feature Analysis

In [5]:
# As this notebook is focusing on process, not predictive accuracy,
# let's avoid feature engineering for the Name and Ticket fields.
train.drop(['Name', 'Ticket'], axis=1, inplace=True)

In [6]:
# Examine the datatypes of each remaining column
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Sex             object
Age            float64
SibSp            int64
Parch            int64
Fare           float64
Embarked        object
dtype: object

In [7]:
# In most cases, 'object' represents a string in Pandas
# Let's check the value_counts for Sex and Embarked
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [8]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [9]:
# Clearly Sex and Embarked are categorical datatypes.  Let's correct the datatype.
train['Sex'] = train['Sex'].astype('category')
train['Embarked'] = train['Embarked'].astype('category')
train.dtypes

PassengerId       int64
Survived          int64
Pclass            int64
Sex            category
Age             float64
SibSp             int64
Parch             int64
Fare            float64
Embarked       category
dtype: object

In [10]:
# PassengerId is a unique id and therefore cannot contribute information towards predicting survival
train.drop("PassengerId", axis=1, inplace=True)

EDA Correct Datatype Analysis

In [11]:
# Pclass is represented as an integer, but integers have an ordering and a well defined distance
# For example, 3-2 = 2-1
# However for Pclass we cannot say that 3rd class - 2nd class = 2nd class - 1st class
# Pclass is better represented as a category, not an integer
train['Pclass'] = train['Pclass'].astype('category')
train.dtypes

Survived       int64
Pclass      category
Sex         category
Age          float64
SibSp          int64
Parch          int64
Fare         float64
Embarked    category
dtype: object
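
To make the point about categories versus integers more concrete, here is a small sketch on an illustrative Series (not the Titanic data) showing what Pandas stores for a categorical column and that integer arithmetic is no longer available on it.

```python
import pandas as pd

# Illustrative passenger classes, converted the same way as in the cell above.
pclass = pd.Series([3, 1, 3, 2, 1]).astype('category')

# The distinct labels are stored once...
print(pclass.cat.categories)
# ...and each row holds a small integer code pointing into them.
print(pclass.cat.codes.tolist())

# Arithmetic is rejected, which guards against treating the labels as distances.
try:
    pclass + 1
except TypeError as exc:
    print('TypeError:', exc)
```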

In [12]:
# Let's check null values per column again
train.isnull().sum(axis=0)

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

EDA Complete Cases

In [13]:
# Let's remove the 2 records with null for Embarked
# When learning, it's a good idea to check the data types

# 1st step, create a boolean series with True for each record where Embarked is null
boolean_series = train['Embarked'].isnull()
print('Result Type: ', type(boolean_series))
print('Series Type: ', boolean_series.dtype)

Result Type:  <class 'pandas.core.series.Series'>
Series Type:  bool


In [14]:
train[train['Embarked'].isnull()]  # boolean-mask selection: show the rows where Embarked is null

| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 61 | 1 | 1 | female | 38.0 | 0 | 0 | 80.0 | NaN |
| 829 | 1 | 1 | female | 62.0 | 0 | 0 | 80.0 | NaN |


In [None]:
# 2nd step, determine which indexes (i.e. row labels) these correspond to
indexes = train.loc[boolean_series, 'Embarked'].index
print('Result Type: ', type(indexes))
print(indexes)

In [None]:
# 3rd step, drop these records
train.drop(indexes, inplace=True)
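
The three steps above are spelled out for learning purposes. Once they are understood, the same result can usually be had in one call to `dropna` with the `subset` argument. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Illustrative frame with nulls in both columns.
df = pd.DataFrame({
    'Age': [38.0, np.nan, 62.0, 35.0],
    'Embarked': ['S', 'C', np.nan, np.nan],
})

# Keep only rows where Embarked is present; nulls in other columns (Age here) are left alone.
complete = df.dropna(subset=['Embarked'])

print(complete)
```

Restricting `subset` to Embarked matters here: a plain `dropna()` would also discard every row with a missing Age, which this notebook imputes instead.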

In [None]:
train.columns

In [None]:
# break up the dataframe into X and y
# X is a 2 dimensional "spreadsheet" of values used for prediction
# y is a 1 dimensional vector of target (aka response) values
X = train.drop('Survived', axis=1)
y = train['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

### Train Test Split and Impute Age
1. Split the dataset into 70% for training and 30% for test.
2. Impute missing Age values for the training set and use *that imputation* on the test set.

It is very important not to "peek" at the test data.  The Imputer() that is created must be created on the training data and then applied to the test data.  This subtle point is missed in many beginning tutorials and yet it may be the most important point of this entire notebook.
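
The fit-on-train, transform-on-test pattern can be sketched as follows. Note the assumptions: this example uses `SimpleImputer` from `sklearn.impute`, which replaced `Imputer` in scikit-learn 0.20 and later, so it is not the class available in the 0.19 environment listed above. The data is also made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # assumption: scikit-learn 0.20 or newer
from sklearn.model_selection import train_test_split

# Illustrative data only; the real notebook uses the Titanic training file.
X = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0, 14.0, np.nan]})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train[['Age']])                    # learn the mean from the training rows only
age_train = imputer.transform(X_train[['Age']])  # fill gaps in the training rows
age_test = imputer.transform(X_test[['Age']])    # reuse the training mean on the test rows

print('Mean learned from training data:', imputer.statistics_[0])
print('Mean of the raw training ages:  ', X_train['Age'].mean())
```

The test rows never influence the fill value, so any score computed on them remains an honest estimate of performance on unseen data.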

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

In [None]:
from sklearn.preprocessing import Imputer

In [None]:
age_imputer = Imputer(strategy='mean')

In [None]:
# fit the age_imputer on the training data only
age_imputer.fit(X_train['Age'].values.reshape(-1,1))

In [None]:
# Let's look behind the scenes to see what value will be used for imputation
# Looking at "dunder" is for pedagogical reasons only
age_imputer.__getstate__()
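
Outside of a teaching context, the learned fill value is normally read from the public `statistics_` attribute rather than from a dunder method. The sketch below assumes scikit-learn 0.20 or newer, where the class is named `SimpleImputer`; the ages are invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # assumption: scikit-learn 0.20 or newer

# One column of ages with a single missing value (illustrative data).
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

imputer = SimpleImputer(strategy='mean').fit(ages)

# statistics_ holds one fill value per column, learned during fit.
print(imputer.statistics_)  # [30.]
```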

In [None]:
# We see a value of 29.45299 will be used
# This is the same as the mean of the training data
X_train['Age'].mean()

In [None]:
# This is not the mean of the test data
X_test['Age'].mean()

In [None]:
# Let's apply the Imputer to the test data and take the mean of Age in the test data
age_imputer.transform(X_test['Age'].values.reshape(-1,1)).mean()

In [None]:
# Compare this to the following, which produces a different result!
# Here the test set's own mean fills the gaps, so information from the test data leaks in.
# See the video for Stanford Professor about the heart experiment
test_mean = X_test['Age'].mean()
X_test.loc[X_test['Age'].isnull(), 'Age'] = test_mean

In [None]:
X_test['Age'].mean()
