# Machine Learning Workflow Iteration 2

Content for my website: <a href="https://sdiehl28.netlify.com/" target="_blank">Software Nirvana</a>

SD TODO: Adjust link to the 2nd iteration of the workflow on the blog

### Where We Are
In the first iteration, we created a simple model and showed that its accuracy was better than that of the null model.  The null model is the model that predicts the predominant class in all cases.

### What's Next
Refine any or all of the model building steps and measure the accuracy of the new model.

Start by copying the previous notebook and renaming it.  Then work through each reminder from the previous iteration about what to try next.  In addition, more Exploratory Data Analysis would likely be helpful.

<a name="outline"></a>
### Outline
1. [Previous Iteration](#previous)
2. [Exploratory Data Analysis](#eda)
3. [Preprocessing](#preprocess)
4. [Model Building](#model)
5. [Model Evaluation](#eval)
6. [Summary](#summary)

### Common Imports and Notebook Setup

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

<a name="previous"></a>
### Previous Iteration

In [23]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')

In [24]:
# break up the dataframe into X and y
# X is a 2 dimensional "spreadsheet" of values used for prediction
# y is a 1 dimensional vector of target (aka response) values
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

X Shape:  (891, 11)
y Shape:  (891,)


In [25]:
# create the train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=111)

### This Section Deals with a Subtle Train/Test Split Bug
The train/test split should create *copies* of the dataframe rather than views into it.  However, in the version I am using, a view is returned for X.

In [42]:
print(X_train.is_copy)
print(X_test.is_copy)
print(y_train.is_copy)
print(y_test.is_copy)

<weakref at 0x7fc8846b4958; to 'DataFrame' at 0x7fc8846b04a8>
<weakref at 0x7fc8846b4958; to 'DataFrame' at 0x7fc8846b04a8>
None
None


In the above we see a weakref, which means we have a view.  That will cause problems with chained assignment later.  See: [Setting With Copy Warning](https://www.dataquest.io/blog/settingwithcopywarning/)

In [27]:
# Shouldn't have to do this, but it ensures we have a copy of the dataframe
X_train = X_train.copy()
X_test = X_test.copy()
print(X_train.is_copy)
print(X_test.is_copy)
print(y_train.is_copy)
print(y_test.is_copy)

None
None
None
None


This is correct.  If DataFrame.is_copy is None, then we are not using a view into another DataFrame.
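
The same defensive pattern can be sketched on a toy frame (the values below are made up for illustration).  As far as I know, `is_copy` is deprecated in later pandas releases, so this sketch does not rely on it; it only shows that assigning into an explicit `.copy()` leaves the parent frame untouched.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
parent = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                       'Fare': [7.25, 71.28, 8.05, 53.10]})

# Slice two rows, then take an explicit copy before modifying the slice
subset = parent.iloc[[1, 2]].copy()
subset.loc[subset['Age'].isnull(), 'Age'] = 30.0

# The copy was changed; the parent still holds both of its original NaN values
print(subset['Age'].tolist())
print(parent['Age'].isnull().sum())
```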

<a name="eda"></a>
### Exploratory Data Analysis

In [28]:
# Find the fraction of missing values per column (multiply by 100 for a percentage)
nrows, ncols = X_train.shape
X_train.isnull().sum() / nrows

PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.187801
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.776886
Embarked       0.003210
dtype: float64
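
Because the mean of a boolean column is the fraction of `True` values, the same per-column figure can also be computed with `isnull().mean()`.  A small sketch on made-up data (the frame `toy` is hypothetical, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration only
toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                    'Cabin': [np.nan, 'C85', np.nan, np.nan],
                    'Fare': [7.25, 71.28, 8.05, 53.10]})

# isnull() gives booleans; their mean is the fraction of missing values
missing_fraction = toy.isnull().mean()

# Sort so the columns with the most missing data come first
print(missing_fraction.sort_values(ascending=False))
```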

### Null Value Analysis
The following is a reasonable judgement call as to how to proceed based on the observed percentages of null values.
1. The Age attribute has some missing values => impute missing values
2. Most of the Cabin attribute is missing => remove it
3. Very few Embarked records are missing => remove records with missing Embarked value
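
Steps 2 and 3 of this plan might look like the following sketch.  The frame `X_toy` and target `y_toy` are made up for illustration; the point to notice is that when rows are removed from X, the same rows must be removed from y so the two stay aligned.

```python
import numpy as np
import pandas as pd

# Toy features and target, made up for illustration only
X_toy = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0],
                      'Cabin': [np.nan, 'C85', np.nan, 'C123'],
                      'Embarked': ['S', 'C', np.nan, 'S']},
                     index=[10, 11, 12, 13])
y_toy = pd.Series([0, 1, 1, 1], index=X_toy.index, name='Survived')

# Step 2: remove the mostly-empty Cabin column
X_toy = X_toy.drop('Cabin', axis=1)

# Step 3: remove records with a missing Embarked value ...
X_toy = X_toy.dropna(subset=['Embarked'])
# ... and keep only the matching target values
y_toy = y_toy.loc[X_toy.index]

print(X_toy.index.tolist())
print(y_toy.index.tolist())
```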

### Impute Age Value
For instructional purposes, to emphasize not looking at the test data, this will be performed both manually *and* using a Scikit Learn Imputer.

#### Manually Impute Age Value

In [33]:
# Get a boolean Series the same length as X_train
train_age_null = X_train['Age'].isnull()
print(type(train_age_null))
print(train_age_null.dtype)
print(train_age_null.shape)
X_train.loc[train_age_null, 'Age'].head()

<class 'pandas.core.series.Series'>
bool
(623,)


573   NaN
697   NaN
601   NaN
709   NaN
783   NaN
Name: Age, dtype: float64

In [37]:
# Set the null age values to the mean age value
X_train.loc[train_age_null, 'Age'] = X_train['Age'].mean()
print(X_train['Age'].mean())

29.787885375494064


In [38]:
# double check that the values that were Null are now the mean
X_train.loc[train_age_null, 'Age'].head()

573    29.787885
697    29.787885
601    29.787885
709    29.787885
783    29.787885
Name: Age, dtype: float64

### Replace Test Set Null values with Mean from *Train* Set
This step is key to understanding how to avoid "test set data leakage".  If we look at the data in the test set, in any way, it no longer acts as a test set.

We must replace null values in the test set with the mean from the *train* set without looking at any of the values in the *test* set.

In [39]:
# Get a boolean Series the same length as X_test
test_age_null = X_test['Age'].isnull()
X_test.loc[test_age_null, 'Age'] = X_train['Age'].mean()

In [40]:
# double check that the values that were Null are now the mean of the *train* data
X_test.loc[test_age_null, 'Age'].head()

584    29.787885
411    29.787885
826    29.787885
384    29.787885
692    29.787885
Name: Age, dtype: float64
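
The manual steps above can be collected into one small helper.  This is only a sketch: the function name `impute_with_train_mean` and the toy frames are my own invention for illustration.  The key property is that the mean is computed from the train frame only and then reused for the test frame.

```python
import numpy as np
import pandas as pd

def impute_with_train_mean(train, test, column):
    """Fill nulls in `column` of both frames with the mean of the *train* column."""
    train_mean = train[column].mean()  # NaN values are skipped by default
    train = train.copy()               # work on copies, leave the inputs untouched
    test = test.copy()
    train[column] = train[column].fillna(train_mean)
    test[column] = test[column].fillna(train_mean)
    return train, test, train_mean

# Toy data, made up for illustration only
train_toy = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})
test_toy = pd.DataFrame({'Age': [np.nan, 90.0]})

train_filled, test_filled, train_mean = impute_with_train_mean(train_toy, test_toy, 'Age')
print(train_mean)                    # mean of 20 and 40
print(test_filled['Age'].tolist())   # the 90 in test did not influence the fill value
```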

#### Scikit Learn Imputer for Age Value

In [41]:
# Use Scikit Learn Imputer for Age

# Resplit the original data, so we are starting from scratch
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']

# note use of random_state for repeatability
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=111)

In [43]:
# Shouldn't have to do this, but it ensures we have a copy of the dataframe
X_train = X_train.copy()
X_test = X_test.copy()
print(X_train.is_copy)
print(X_test.is_copy)
print(y_train.is_copy)
print(y_test.is_copy)

None
None
None
None


In [44]:
from sklearn.preprocessing import Imputer

age_imputer = Imputer(strategy='mean')

# use the age imputer to compute the mean of the *train* set
age_imputer.fit(X_train['Age'].values.reshape(-1,1))

# Let's look behind the scenes to see what value will be used for imputation
# Looking at "dunder getstate" is for instructional purposes only
age_imputer.__getstate__()

{'_sklearn_version': '0.19.1',
 'axis': 0,
 'copy': True,
 'missing_values': 'NaN',
 'statistics_': array([29.78788538]),
 'strategy': 'mean',
 'verbose': 0}

We see that the mean of the train set is stored in the sklearn imputer object.

In [48]:
# Apply the imputer to the Age column
X_test['Age'] = age_imputer.transform(X_test['Age'].values.reshape(-1,1))

In [49]:
# double check that the values that were Null in the test data
# now have the mean of the *train* data
X_test.loc[test_age_null, 'Age'].head()

584    29.787885
411    29.787885
826    29.787885
384    29.787885
692    29.787885
Name: Age, dtype: float64

### Summary of Imputation
It is critical to understand that we replaced null values in the test set with values computed from the train set.  We never looked at the data in the test set.

Scikit Learn's imputer does exactly this.
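
The `Imputer` class used above comes from scikit-learn 0.19.  In newer releases the equivalent class is `sklearn.impute.SimpleImputer`.  A minimal sketch of the same fit-on-train, transform-on-test pattern, assuming a newer scikit-learn and using made-up toy frames:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, made up for illustration only
train_toy = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})
test_toy = pd.DataFrame({'Age': [np.nan, 90.0]})

imputer = SimpleImputer(strategy='mean')

# fit() learns the mean from the *train* set only
imputer.fit(train_toy[['Age']])

# transform() returns an (n, 1) array; ravel() flattens it back to one column
test_toy['Age'] = imputer.transform(test_toy[['Age']]).ravel()

print(imputer.statistics_)
print(test_toy['Age'].tolist())
```

Passing `train_toy[['Age']]` (double brackets) gives the imputer the 2 dimensional input it expects, which avoids the `reshape(-1,1)` call.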

In [45]:
# The mean stored in the imputer matches the mean computed directly from the train set.
mean_age_train = X_train['Age'].mean()
print(mean_age_train, age_imputer.statistics_[0])

### Select the Null Age Values in Test to Observe How the Imputer Works

In [None]:
# get (a view of) the null values in test
idx = X_test['Age'][X_test['Age'].isnull()].index
print(X_test['Age'].loc[idx].head())

In [None]:
# select rows and column in a single .loc call to avoid chained indexing
X_test.loc[X_test['Age'].isnull(), 'Age'].head()

In [None]:
new_age_series = age_imputer.transform(X_test['Age'].values.reshape(-1,1))

In [None]:
# flatten the (n, 1) array returned by the imputer before adding it as a column
X_test.assign(AgeImputed=new_age_series.flatten()).head()

### Apply the Imputer to the Test Set

In [None]:
new_age_series.flatten().shape

In [None]:
# NaN never compares equal to itself, so == cannot be used to find null values
print(np.nan == np.nan)
(X_test.loc[:,'Age'] == X_test['Age']).head()

In [None]:
X_test.loc[:, 'Age']

In [None]:
new_age_series = age_imputer.transform(X_test['Age'].values.reshape(-1,1))
# assign() takes keyword arguments, so the new column name is not quoted
X_test.assign(AgeImputed=new_age_series.flatten())
# print(type(new_age_series))
# print(new_age_series.shape)
# X_test.loc[:,'Age'] = pd.Series(new_age_series.flatten())

In [None]:
X_test['Age'][X_test['Age'].isnull()]

In [None]:
# Examine the values that were null, using the boolean mask created earlier
print(X_test.loc[test_age_null, 'Age'])

In [None]:
# Discard Cabin column
X_train = X_train.drop('Cabin', axis=1)
X_test = X_test.drop('Cabin', axis=1)

### Examine Datatypes
Often this involves converting text or integers to categorical variables.

**TODO: convert the following variables to categorical**
* Pclass
* Sex
* Embarked
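
One possible way to do this conversion is sketched below on made-up data.  `astype('category')` and `pd.get_dummies` are standard pandas calls, but choosing one-hot encoding here is my assumption about what a later iteration might do, not something this notebook has done yet.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({'Pclass': [3, 1, 3],
                    'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'S']})

# Mark the columns as categorical
for col in ['Pclass', 'Sex', 'Embarked']:
    toy[col] = toy[col].astype('category')

# One-hot encode them so a model such as Logistic Regression can use them
encoded = pd.get_dummies(toy, columns=['Pclass', 'Sex', 'Embarked'])

print(toy.dtypes)
print(encoded.columns.tolist())
```

With a real train/test split, the set of categories should be taken from the train set so that both sets end up with the same columns.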


In [None]:
# For 1st Iteration, ignore all text and categorical variables
X_train = X_train.drop('Pclass', axis=1)
X_test = X_test.drop('Pclass', axis=1)
X_train = X_train.drop('Name', axis=1)
X_test = X_test.drop('Name', axis=1)
X_train = X_train.drop('Sex', axis=1)
X_test = X_test.drop('Sex', axis=1)
X_train = X_train.drop('Ticket', axis=1)
X_test = X_test.drop('Ticket', axis=1)
X_train = X_train.drop('Embarked', axis=1)
X_test = X_test.drop('Embarked', axis=1)

In [None]:
# Examine the datatypes of each remaining column
X_train.dtypes

<a name="preprocess"></a>
### Preprocessing
[Back to Outline](#outline)

Preprocessing was done "inline" with the Exploratory Data Analysis above.

<a name="model"></a>
### Model Building
[Back to Outline](#outline)

Perhaps the simplest model to try for classification is Logistic Regression.

Special techniques are required if one class is much rarer than another.  Let's check for that.

In [None]:
y.value_counts()

That's close enough to "even".  Logistic Regression may work well.
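
Proportions are often easier to judge than raw counts.  A small sketch using `value_counts(normalize=True)` on a made-up target (`y_toy` is hypothetical):

```python
import pandas as pd

# Toy target, made up for illustration only
y_toy = pd.Series([0, 0, 0, 1, 1], name='Survived')

# normalize=True returns the share of each class instead of the count
class_share = y_toy.value_counts(normalize=True)
print(class_share)
```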

In [None]:
# Build Model
from sklearn.linear_model import LogisticRegression
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

<a name="eval"></a>
### Model Evaluation
[Back to Outline](#outline)

The simplest measure of accuracy is to look at the percent of correct predictions.

In [None]:
predictions = base_model.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predictions
confusion_matrix(y_test, predictions)

In [None]:
# Compute accuracy: correct predictions (the confusion matrix diagonal) over all predictions
base_accuracy = (154 + 28) / (154+69+17+28)
print(base_accuracy)
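
The accuracy above is typed in by hand from the confusion matrix.  The following sketch shows two ways to compute it directly, using made-up labels rather than the notebook's predictions: from the diagonal of the confusion matrix, and with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, made up for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

# Correct predictions sit on the diagonal of the confusion matrix
cm = confusion_matrix(y_true, y_pred)
acc_from_cm = np.trace(cm) / cm.sum()

# accuracy_score computes the same fraction in one call
acc_direct = accuracy_score(y_true, y_pred)

print(acc_from_cm, acc_direct)
```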

In [None]:
# Compare with the simplest possible model, sometimes called the null model
# The null model predicts the predominant class every time
y_test.value_counts()

In [None]:
# Null Model Accuracy
null_accuracy = 171 / (171 + 97)
print(null_accuracy)
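
scikit-learn also ships a null model, `DummyClassifier`.  A sketch on made-up data follows; note that it learns the predominant class from the *train* labels, whereas the cell above reads the counts from `y_test`, so the two can differ if the splits have different majority classes.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data, made up for illustration only; the features are ignored by the null model
X_train_toy = np.zeros((5, 1))
y_train_toy = np.array([0, 0, 0, 1, 1])
X_test_toy = np.zeros((4, 1))
y_test_toy = np.array([0, 1, 0, 0])

# 'most_frequent' always predicts the predominant class seen during fit()
null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_train_toy, y_train_toy)

# score() returns the accuracy on the given data
null_acc = null_model.score(X_test_toy, y_test_toy)
print(null_acc)
```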

<a name="summary"></a>
### Summary
[Back to Outline](#outline)

The simplest model had a prediction accuracy of about 68%.  The null model, which just predicts the most common class in all cases, was accurate about 64% of the time.

In this first iteration:
* we quickly created a model
* noted a few things to try next
* established a baseline accuracy of 68%
* showed that this accuracy is better than the null model accuracy of 64%