# Under Construction Feb 26, 2018 -- Major Edits Required

# Machine Learning Workflow 1

## First Part in a 3 Part Series of Jupyter Notebooks and Tutorials from <a href="https://sdiehl28.netlify.com/" target="_blank">Software Nirvana</a>

### Start Simple!

This series of notebooks presents a ***step-by-step*** approach to best practices in Machine Learning Workflow.

As in software engineering, it is often a good idea to begin with a [skeleton][1] of your project and then fill in the details. We want to quickly get something working end-to-end and then [Kaizen](https://en.wikipedia.org/wiki/Kaizen) the model to improve its accuracy.

Having something that works end-to-end will alert you to what types of problems are inherent in your particular dataset and will provide a rough sense of the accuracy that can be expected.

[1]: https://en.wikipedia.org/wiki/Skeleton_(computer_programming)

### Machine Learning Task
Predict survived / not-survived for each passenger using the Titanic dataset from Kaggle. This is a supervised learning problem.

This notebook will set the stage for more advanced machine learning techniques to be presented later.

### Acquire the Data

Download "train.csv" from: https://www.kaggle.com/c/titanic/data and place it in a `data` directory one level above this notebook, since the code below reads `../data/train.csv`.

### Common Imports and Notebook Setup

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

### Check Software Versions

In [23]:
import sys
print('python:     ', sys.version)
print('numpy:      ', np.__version__)
print('pandas:     ', pd.__version__)
import matplotlib
print('matplotlib: ', matplotlib.__version__)
print('seaborn:    ', sns.__version__)
print('sklearn:    ', sk.__version__)

python:      3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0]
numpy:       1.13.3
pandas:      0.22.0
matplotlib:  2.1.2
seaborn:     0.8.1
sklearn:     0.19.1


### Read Data

In [18]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')
all_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Split the Data into Train and Test
Do this prior to Exploratory Data Analysis and other Model Building Steps.
See future blog post for details as to why this step is done first: [Overfitting](https://sdiehl28.netlify.com/posts/overfitting/)

TODO: Check above link

In [24]:
# break up the dataframe into X and y
# X is a 2 dimensional "spreadsheet" of values used for prediction
# y is a 1 dimensional vector of target (aka response) values
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

X Shape:  (891, 11)
y Shape:  (891,)


In [108]:
# create the train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=111)
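The effect of `random_state` can be checked on a small example. The following is a minimal sketch on hypothetical toy data (`X_demo` and `y_demo` are stand-ins, not the Titanic data): calling `train_test_split` twice with the same seed should select the same rows, and the targets stay aligned with the features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Titanic features and target
X_demo = pd.DataFrame({'Age': np.arange(10, dtype=float),
                       'Fare': np.arange(10) * 2.0})
y_demo = pd.Series([0, 1] * 5, name='Survived')

# Same random_state => the same rows land in train and test on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.30, random_state=111)
split_b = train_test_split(X_demo, y_demo, test_size=0.30, random_state=111)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)                # 7 training rows, 3 test rows
print(X_tr.index.equals(split_b[0].index))   # row labels match across calls
print(y_tr.index.equals(X_tr.index))         # targets stay aligned with features
```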

### Exploratory Data Analysis -- Null Values

In [109]:
# Find the fraction of missing values per column
nrows, ncols = X_train.shape
X_train.isnull().sum() / nrows

PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.187801
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.776886
Embarked       0.003210
dtype: float64
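The same per-column fraction can also be obtained by taking the mean of the boolean null mask, since `True` counts as 1. A minimal sketch on a hypothetical toy frame (`demo` is not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the notebook itself works on X_train
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Fraction missing per column, computed two equivalent ways
frac_by_sum = demo.isnull().sum() / len(demo)
frac_by_mean = demo.isnull().mean()

# Sorting puts the columns with the most missing data first
print(frac_by_mean.sort_values(ascending=False))
```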

### Null Value Analysis
The following is a reasonable judgement call as to how to proceed.
1. The Age attribute has some missing values => impute missing values
2. Most of the Cabin attribute is missing => remove it
3. Very few Embarked records are missing => remove records with a missing Embarked value

### Impute Age Value
For instructional purposes, this will be performed in two different ways.

In [110]:
# Manually impute age column based on mean value of *train* data
mean_age_train = X_train['Age'].mean()
print(mean_age_train)

29.78788537549407


In [111]:
# check that we have the right expression for selecting null Age values:
# every selected value should be null
null_train_age_values = X_train['Age'][X_train['Age'].isnull()]
null_train_age_values.isnull().all()

True

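In [112]:
# replace the null values with the mean, writing the result back into X_train
# (assigning to null_train_age_values would only rebind that name)
X_train.loc[X_train['Age'].isnull(), 'Age'] = mean_age_train

In [113]:
# Verify that we no longer have any null age values
X_train['Age'].isnull().any()

False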

In [114]:
# check that we have the right expression for selecting null Age values:
# every selected value should be null
null_test_age_values = X_test['Age'][X_test['Age'].isnull()]
null_test_age_values.isnull().all()

True

In [115]:
# replace null values in test data with mean from *train* data (see blog for why)
X_test.loc[X_test['Age'].isnull(), 'Age'] = mean_age_train

In [116]:
# verify that we no longer have any null age values
X_test['Age'].isnull().any()

False

In [117]:
# Compute the mean of the imputed test data to verify that using an Imputer produces the same result
round(X_test['Age'].mean(), 4)

29.5514
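For manual imputation, the fill value has to be written back into the DataFrame; assigning it to a variable that holds a selection only rebinds that variable. A minimal sketch on hypothetical toy data (`train_demo` and `test_demo` are stand-ins for `X_train` and `X_test`), showing two ways to write the value back:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train and X_test
train_demo = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
test_demo = pd.DataFrame({'Age': [50.0, np.nan]})

# The fill value comes from the training data only
train_mean = train_demo['Age'].mean()  # mean() skips NaN by default

# Option 1: a boolean mask with .loc writes the value into the frame
train_demo.loc[train_demo['Age'].isnull(), 'Age'] = train_mean

# Option 2: fillna returns a new Series, which is assigned back
test_demo['Age'] = test_demo['Age'].fillna(train_mean)

print(train_demo['Age'].tolist())
print(test_demo['Age'].tolist())
```

Both options leave no null values behind, and the test set is filled with the training mean rather than its own.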

In [118]:
# Use scikit-learn's Imputer to impute Age and show that it gives the same result

# Rebuild X and y and resplit the data, so we are starting from scratch
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']

# note use of random_state for repeatability
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=111)

In [119]:
from sklearn.preprocessing import Imputer

age_imputer = Imputer(strategy='mean')

# apply the age_imputer to the training data only
age_imputer.fit(X_train['Age'].values.reshape(-1,1))

# Let's look behind the scenes to see what value will be used for imputation
# Looking at "dunder" is for pedagogical reasons only
age_imputer.__getstate__()

{'_sklearn_version': '0.19.1',
 'axis': 0,
 'copy': True,
 'missing_values': 'NaN',
 'statistics_': array([ 29.78788538]),
 'strategy': 'mean',
 'verbose': 0}

In [120]:
# We see a value of 29.7878 will be used
# This is the same as the mean of the training data
print(X_train['Age'].mean())

# This is not the mean of the test data!
print(X_test['Age'].mean())

29.78788537549407
29.483173076923077


In [121]:
# Let's apply the Imputer to the test data and take the mean of Age in the test data
age_imputer.transform(X_test['Age'].values.reshape(-1,1)).mean()

29.551392248244941

### Both methods produce the same result!
The Imputer took the mean of the training data and used that to fill in the missing values in the test data.

When we did this manually, it was obvious. But it may not have been obvious that this was happening when using the Imputer.

The key point is that scikit-learn *correctly* imputes missing test data by using the values from the *train* data, not by looking at the test data, provided the Imputer is fit on the train data only.
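The `Imputer` class used above was deprecated in scikit-learn 0.20 and later removed; `SimpleImputer` in `sklearn.impute` is its replacement. The following is a minimal sketch of the same fit-on-train, transform-on-test pattern, assuming scikit-learn 0.20 or later and hypothetical toy data rather than the Titanic split:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # assumes scikit-learn 0.20 or later

# Hypothetical toy data standing in for X_train and X_test
train_demo = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
test_demo = pd.DataFrame({'Age': [50.0, np.nan]})

imputer = SimpleImputer(strategy='mean')

# Fit on the training data only; double brackets keep the input 2-dimensional
imputer.fit(train_demo[['Age']])

# The learned fill value is the training mean
print(imputer.statistics_)

# The missing test value is filled with the training mean, not the test mean
test_filled = imputer.transform(test_demo[['Age']])
print(np.asarray(test_filled).ravel())
```

The public `statistics_` attribute exposes the learned fill value, so there is no need to inspect `__getstate__()` in this version.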

In [59]:
# 2. Discard Cabin column
X_train = X_train.drop('Cabin', axis=1)
X_test = X_test.drop('Cabin', axis=1)

In [122]:
# 3. Remove records with a null Embarked value
# (subset= restricts the check to Embarked, so rows with a null Age are not dropped)
X_train = X_train.dropna(axis=0, subset=['Embarked'])
X_test = X_test.dropna(axis=0, subset=['Embarked'])
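Dropping rows from the features does not drop the matching entries from the targets, so `y_train` and `y_test` would need to be realigned before fitting a model. A minimal sketch of one way to do this, on hypothetical toy data (`X_demo` and `y_demo` are stand-ins), is to select the targets by the surviving row labels:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the index plays the role of the original row labels
X_demo = pd.DataFrame(
    {'Age': [22.0, 38.0, 26.0], 'Embarked': ['S', np.nan, 'C']},
    index=[10, 11, 12],
)
y_demo = pd.Series([0, 1, 1], index=[10, 11, 12], name='Survived')

# Drop feature rows that contain a null value
X_clean = X_demo.dropna(axis=0, how='any')

# Keep only the targets whose row labels survived the drop
y_clean = y_demo.loc[X_clean.index]

print(X_clean.shape, y_clean.shape)
```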

### Initial Draft Ends Here
Following is material to be edited for more of the initial draft

### Exploratory Data Analysis (EDA)

This usually consists of:
1. Determining how many null values there are per column
2. Determining which features to keep
3. Correcting the datatypes (`pd.read_csv()` infers datatypes but it's better to be specific)
4. Visual analysis, perhaps with a package such as seaborn
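As an illustration of step 3, datatypes can be declared when the file is read through the `dtype` argument of `pd.read_csv`. This is a minimal sketch that reads a small hypothetical CSV from memory (`csv_text` is made up for the example) instead of the Kaggle file:

```python
import io
import pandas as pd

# Hypothetical CSV text with a few Titanic-like columns
csv_text = """Pclass,Sex,Age,Embarked
3,male,22,S
1,female,38,C
3,female,,S
"""

# Let pandas infer the datatypes
inferred = pd.read_csv(io.StringIO(csv_text))

# Declare the datatypes explicitly instead
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'Sex': 'category', 'Embarked': 'category', 'Age': 'float64'},
)

print(inferred.dtypes)
print(explicit.dtypes)
```

With explicit types, `Sex` and `Embarked` arrive as categories directly, so no later `astype` call is needed for them.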

EDA Null Value Analysis

In [3]:
# Find the fraction of missing values per column
nrows, ncols = train.shape
train.isnull().sum() / nrows

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

In [4]:
# Discard Cabin as there are too many missing values
train.drop('Cabin', axis=1, inplace=True)

EDA Drop Feature Analysis

In [5]:
# As this notebook is focusing on process, not predictive accuracy,
# let's avoid feature engineering for the Name and Ticket fields.
train.drop(['Name', 'Ticket'], axis=1, inplace=True)

In [6]:
# Examine the datatypes of each remaining column
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Sex             object
Age            float64
SibSp            int64
Parch            int64
Fare           float64
Embarked        object
dtype: object

In [7]:
# In most cases, 'object' represents a string in Pandas
# Let's check the value_counts for Sex and Embarked
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [8]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [9]:
# Clearly Sex and Embarked are categorical datatypes.  Let's correct the datatype.
train['Sex'] = train['Sex'].astype('category')
train['Embarked'] = train['Embarked'].astype('category')
train.dtypes

PassengerId       int64
Survived          int64
Pclass            int64
Sex            category
Age             float64
SibSp             int64
Parch             int64
Fare            float64
Embarked       category
dtype: object
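A category column stores each distinct label once and keeps an integer code per row. A minimal sketch on a hypothetical toy Series (`sex` is not the Titanic column) shows what the conversion produces:

```python
import pandas as pd

# Hypothetical toy column standing in for train['Sex']
sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')
sex_cat = sex.astype('category')

# The distinct labels, stored once and sorted
print(sex_cat.cat.categories.tolist())

# One integer code per row, pointing into the list of categories
print(sex_cat.cat.codes.tolist())
```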

In [10]:
# PassengerId is a unique id and therefore cannot contribute information towards predicting survival
train.drop("PassengerId", axis=1, inplace=True)

EDA Correct Datatype Analysis

In [11]:
# Pclass is represented as an integer, but integers have an ordering and a well defined distance
# For example, 3-2 = 2-1
# However for Pclass we cannot say that 3rd class - 2nd class = 2nd class - 1st class
# Pclass is better represented as a category, not an integer
train['Pclass'] = train['Pclass'].astype('category')
train.dtypes

Survived       int64
Pclass      category
Sex         category
Age          float64
SibSp          int64
Parch          int64
Fare         float64
Embarked    category
dtype: object
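The point about distance can be seen directly in pandas. The sketch below is an illustration on hypothetical toy data, not part of the original workflow: it uses an ordered categorical (one possible alternative to the unordered category above), which keeps the ranking of the classes but does not support subtraction.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy column standing in for train['Pclass']
pclass = pd.Series([3, 1, 2, 3], name='Pclass')

# An ordered categorical keeps the ranking 1 < 2 < 3 but no notion of distance
ordered_type = CategoricalDtype(categories=[1, 2, 3], ordered=True)
pclass_ordered = pclass.astype(ordered_type)

# Comparisons are defined for ordered categories
print((pclass_ordered > 1).tolist())

# Arithmetic is not: there is no "3rd class minus 2nd class"
try:
    pclass_ordered - pclass_ordered
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False
print('arithmetic allowed:', arithmetic_allowed)
```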

In [12]:
# Let's check null values per column again
train.isnull().sum(axis=0)

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

EDA Complete Cases

In [13]:
# Let's remove the 2 records with null for Embarked
# When learning, it's a good idea to check the data types

# 1st step, create a boolean series with True for each record where Embarked is null
boolean_series = train['Embarked'].isnull()
print('Result Type: ', type(boolean_series))
print('Series Type: ', boolean_series.dtype)

Result Type:  <class 'pandas.core.series.Series'>
Series Type:  bool


In [14]:
train[train['Embarked'].isnull()] # maybe this cannot be used Also see Pep 8

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
61,1,1,female,38.0,0,0,80.0,
829,1,1,female,62.0,0,0,80.0,


In [None]:
# 2nd step, determine which indexes (i.e. row labels) these correspond to
indexes = train.loc[boolean_series, 'Embarked'].index
print('Result Type: ', type(indexes))
print(indexes)

In [None]:
# 3rd step, drop these records
train.drop(indexes, inplace=True)
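The three steps above are spelled out for learning purposes. As a sketch of a shorter equivalent, `dropna` with `subset` removes rows that are null in the named column only. The toy frame below is hypothetical (`demo` is not the Titanic data) and compares both approaches:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows lack Embarked, one row lacks Age
demo = pd.DataFrame(
    {'Age': [38.0, np.nan, 62.0], 'Embarked': [np.nan, 'S', np.nan]},
    index=[61, 100, 829],
)

# Step-by-step version: boolean mask -> row labels -> drop
mask = demo['Embarked'].isnull()
labels = demo.loc[mask].index
by_labels = demo.drop(labels)

# One-call version: only Embarked is checked for nulls
by_subset = demo.dropna(subset=['Embarked'])

# The row with a null Age is kept, because Age is not in subset
print(by_subset)
```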

In [None]:
train.columns

In [None]:
# break up the dataframe into X and y
# X is a 2 dimensional "spreadsheet" of values used for prediction
# y is a 1 dimensional vector of target (aka response) values
X = train.drop('Survived', axis=1)
y = train['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

### Train Test Split and Impute Age
1. Split the dataset into 70% for training and 30% for test.
2. Impute missing Age values for the training set and use *that imputation* on the test set.

It is very important not to "peek" at the test data. The Imputer() must be fit on the training data and then applied to the test data. This subtle point is missed in many beginning tutorials and yet it may be the most important point of this entire notebook.

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=111)

### Exploratory Data Analysis

In [None]:
from sklearn.preprocessing import Imputer

age_imputer = Imputer(strategy='mean')

# apply the age_imputer to the training data only
age_imputer.fit(X_train['Age'].values.reshape(-1,1))

# Let's look behind the scenes to see what value will be used for imputation
# Looking at "dunder" is for pedagogical reasons only
age_imputer.__getstate__()

# We see a value of 29.45299 will be used
# This is the same as the mean of the training data
X_train['Age'].mean()

# This is not the mean of the test data
X_test['Age'].mean()

# Let's apply the Imputer to the test data and take the mean of Age in the test data
age_imputer.transform(X_test['Age'].values.reshape(-1,1)).mean()

# Compare this to the following, which produces a different result!
# See the video by the Stanford professor about the heart experiment
# (here the fill value comes from the test data, which is the mistake being illustrated)
test_mean = X_test['Age'].mean()
X_test.loc[X_test['Age'].isnull(), 'Age'] = test_mean

In [None]:
X_test.mean()
