# Machine Learning Workflow Iteration 2

Content for my website: <a href="https://sdiehl28.netlify.com/" target="_blank">Software Nirvana</a>

SD TODO: Adjust link to 1st iteration of workflow on blog

### Where We Are
In the first iteration, we created a simple model and showed that the accuracy was better than the null model.  The null model simply predicts the predominant class in all cases.

### What's Next
Copy the previous notebook and rename it.  Then work through each reminder we left ourselves about what to try next.  More Exploratory Data Analysis would also likely be helpful.

<a name="outline"></a>
### Outline
1. [Previous Iteration](#previous)
2. [Exploratory Data Analysis](#eda)
3. [Preprocessing](#preprocess)
4. [Model Building](#model)
5. [Model Evaluation](#eval)
6. [Conclusion](#summary)

### Common Imports and Notebook Setup

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

<a name="previous"></a>
### Previous Iteration

In [9]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')

In [10]:
# break up the dataframe into X and y
# X is a 2 dimensional "spreadsheet" of values used for prediction
# y is a 1 dimensional vector of target (aka response) values
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

X Shape:  (891, 11)
y Shape:  (891,)


In [11]:
# create the train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=111)

<a name="eda"></a>
### Exploratory Data Analysis

In [21]:
# Find the fraction of missing values per column
nrows, ncols = X_train.shape
X_train.isnull().sum() / nrows

PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.187801
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.776886
Embarked       0.003210
dtype: float64
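
The same per-column fraction can also be obtained without dividing by the row count by hand, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives a boolean mask; its column-wise mean is the
# fraction of missing entries in each column
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```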

### Null Value Analysis
Based on the observed fractions of null values, the following is a reasonable judgment call on how to proceed:
1. The Age attribute has some missing values (about 19%) => impute missing values
2. Most of the Cabin attribute is missing (about 78%) => remove it
3. Very few Embarked records are missing (about 0.3%) => remove records with a missing Embarked value
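
Decisions 2 and 3 above can be sketched as follows. This is only an illustration on made-up data (`X_toy` and `y_toy` are hypothetical names); the point is that dropping rows from the features also requires dropping the matching rows from the target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features and target that share the same index
X_toy = pd.DataFrame(
    {'Embarked': ['S', np.nan, 'C', 'Q'],
     'Cabin': [np.nan, 'C85', np.nan, np.nan],
     'Fare': [7.25, 71.28, 8.05, 53.10]},
    index=[10, 11, 12, 13])
y_toy = pd.Series([0, 1, 1, 0], index=X_toy.index)

# Decision 2: remove the mostly empty Cabin column
X_toy = X_toy.drop('Cabin', axis=1)

# Decision 3: remove records with a missing Embarked value ...
X_toy = X_toy.dropna(subset=['Embarked'])
# ... and keep only the matching target rows so X and y stay aligned
y_toy = y_toy.loc[X_toy.index]

print(X_toy.shape, y_toy.shape)
```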

### Impute Age Value
For instructional purposes, to emphasize not looking at the test data, this will be performed twice: once manually and once using a Scikit-Learn imputer.

#### Manually Impute Age Value

In [23]:
# Manually impute age column based on mean value of *train* data
mean_age_train = X_train['Age'].mean()
print(mean_age_train)

29.78788537549407


In [24]:
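# verify that we have the right expression for selecting
# all null Age values in the train set
null_train_age_values = X_train.loc[X_train['Age'].isnull(), 'Age']
# isnull().all() is True only if every selected value is null
# (a bare .all() would also return True, because NaN is truthy)
null_train_age_values.isnull().all()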

True

In [25]:
# null_train_age_values is a separate Series, not a view into X_train, and
# this assignment only rebinds the name to a float; X_train is unchanged
null_train_age_values = mean_age_train

In [26]:
# Check whether any null age values remain (True means some are still null)
X_train['Age'].isnull().any()

True
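
The check above still returns `True`, which indicates that the nulls were not replaced: assigning to a name only rebinds that name. A minimal sketch on a toy frame (hypothetical data, not the Titanic set) contrasting that with a label-based assignment through `.loc`, which does write into the frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train
toy_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0, np.nan]})
mean_age = toy_train['Age'].mean()   # NaN is skipped: (20 + 40) / 2

# Rebinding a name: 'selected' now refers to a float, the frame is untouched
selected = toy_train['Age'][toy_train['Age'].isnull()]
selected = mean_age
still_null = toy_train['Age'].isnull().any()

# Assignment through .loc with a boolean mask modifies the frame itself
toy_train.loc[toy_train['Age'].isnull(), 'Age'] = mean_age
null_after_loc = toy_train['Age'].isnull().any()

print(still_null, null_after_loc)
```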

In [27]:
# check that we have the right expression for selecting
# all null Age values in the test set
null_test_age_values = X_test.loc[X_test['Age'].isnull(), 'Age']
# isnull().all() is True only if every selected value is null
null_test_age_values.isnull().all()

True

### Replace Test Set Null values with Mean from *Train* Set
This step is key to understanding how to avoid "test set leakage".  If we look at the data in the test set, in any way, it no longer acts as a test set.

We must replace null values in the test set with the mean from the *train* set.

In [19]:
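# intended: replace null values in test data with mean from *train* data
# as written, this only rebinds the name; X_test itself is unchanged
null_test_age_values = mean_age_train

In [20]:
# Check whether any null age values remain (True means some are still null)
X_test['Age'].isnull().any()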

True
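
The rule of filling the test set with a statistic learned from the train set can be sketched with `fillna`. The two toy splits below are assumptions for illustration; note that the test mean is never computed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits standing in for X_train and X_test
toy_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 70.0]})

# The statistic is computed from the train split only
train_mean = toy_train['Age'].mean()

# The same train-derived value fills the gaps in both splits
toy_train['Age'] = toy_train['Age'].fillna(train_mean)
toy_test['Age'] = toy_test['Age'].fillna(train_mean)

print(toy_test['Age'].tolist())
```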

In [28]:
# for later comparison, check the mean age of the test set (mean() skips NaN values)
X_test['Age'].mean()

29.483173076923077

#### Scikit Learn Imputer for Age Value

In [30]:
# Use a Scikit Learn imputer for Age

# Re-split the data from the unmodified all_data, so we are starting from scratch
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']

# note use of random_state for repeatability
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=111)

In [None]:
# SimpleImputer (scikit-learn >= 0.20) supersedes the older preprocessing.Imputer
from sklearn.impute import SimpleImputer

age_imputer = SimpleImputer(strategy='mean')

# use the age imputer to compute the mean of the *train* set
age_imputer.fit(X_train['Age'].values.reshape(-1,1))

# Let's look behind the scenes to see what value will be used for imputation
# statistics_ holds the fitted fill value for each column
age_imputer.statistics_

### Select the Null Age Values in Test to Observe How they Change

In [59]:
# get the index labels of the null Age values in test, then print those rows
idx = X_test['Age'][X_test['Age'].isnull()].index
print(X_test['Age'].loc[idx])

IndexError: positional indexers are out-of-bounds

### Apply the Imputer to the Test Set

In [None]:
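# transform expects the same single-column 2-D shape that was used in fit;
# it returns a new array and leaves X_test itself unchanged
age_imputer.transform(X_test['Age'].values.reshape(-1,1))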
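
Because `transform` returns a new array, the result has to be assigned back for the frame to change. A minimal sketch, assuming `SimpleImputer` from `sklearn.impute` (scikit-learn 0.20 or later) and two made-up splits:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits standing in for X_train and X_test
toy_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 70.0]})

# Fit on the train split only; the imputer expects 2-D input,
# hence the double brackets that select a one-column frame
imputer = SimpleImputer(strategy='mean')
imputer.fit(toy_train[['Age']])

# transform returns a 2-D array; flatten it and write it back
toy_test['Age'] = imputer.transform(toy_test[['Age']]).ravel()

print(imputer.statistics_)
print(toy_test['Age'].tolist())
```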

In [50]:
# Examine the values that were null
print(null_test_age_values.head())

584   NaN
411   NaN
826   NaN
384   NaN
692   NaN
Name: Age, dtype: float64


In [None]:
# Discard Cabin column
X_train = X_train.drop('Cabin', axis=1)
X_test = X_test.drop('Cabin', axis=1)

### Examine Datatypes
Often this involves converting text or integers to categorical variables.

**TODO: convert the following variables to categorical**
* Pclass
* Sex
* Embarked
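
One possible way to handle this TODO is one-hot encoding with `pd.get_dummies`, taking the set of columns from the train split so that the test split cannot introduce new ones. This is a sketch on made-up rows, not necessarily the approach a later iteration will take:

```python
import pandas as pd

# Hypothetical toy splits containing the three variables listed above
toy_train = pd.DataFrame({'Pclass': [1, 3, 2],
                          'Sex': ['male', 'female', 'female'],
                          'Embarked': ['S', 'C', 'S']})
toy_test = pd.DataFrame({'Pclass': [3, 1],
                         'Sex': ['male', 'male'],
                         'Embarked': ['Q', 'S']})

cat_cols = ['Pclass', 'Sex', 'Embarked']

# One indicator column per category value
train_encoded = pd.get_dummies(toy_train, columns=cat_cols, dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=cat_cols, dtype=int)

# Align test to the columns seen in train: categories that appear only
# in test are dropped, categories missing from test are filled with 0
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
```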


In [47]:
# For 1st Iteration, ignore all text and categorical variables
drop_cols = ['Pclass', 'Name', 'Sex', 'Ticket', 'Embarked']
X_train = X_train.drop(drop_cols, axis=1)
X_test = X_test.drop(drop_cols, axis=1)

In [48]:
# Examine the datatypes of each remaining column
X_train.dtypes

PassengerId      int64
SibSp            int64
Parch            int64
Fare           float64
dtype: object

<a name="preprocess"></a>
### Preprocessing
[Back to Outline](#outline)

Preprocessing was done "inline" with the Exploratory Data Analysis above.

<a name="model"></a>
### Model Building
[Back to Outline](#outline)

Perhaps the simplest model to try for classification is Logistic Regression.

Special techniques are required if one class is much rarer than another.  Let's check for that.

In [56]:
y.value_counts()

0    549
1    342
Name: Survived, dtype: int64

That's close enough to "even".  Logistic Regression may work well.
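
To put a number on "close enough", `value_counts(normalize=True)` reports class proportions instead of counts. A small sketch that rebuilds a target with the same class counts as above (the name `y_demo` is hypothetical):

```python
import pandas as pd

# Hypothetical target with the class counts shown above: 549 zeros, 342 ones
y_demo = pd.Series([0] * 549 + [1] * 342, name='Survived')

# normalize=True divides each count by the total number of rows
proportions = y_demo.value_counts(normalize=True)
print(proportions.round(3))
```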

In [50]:
# Build Model
from sklearn.linear_model import LogisticRegression
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

<a name="eval"></a>
### Model Evaluation
[Back to Outline](#outline)

The simplest measure of accuracy is to look at the percent of correct predictions.

In [51]:
predictions = base_model.predict(X_test)

In [52]:
from sklearn.metrics import confusion_matrix
# argument order is (y_true, y_pred): rows are actual classes, columns are predictions
confusion_matrix(y_test, predictions)

array([[154,  17],
       [ 69,  28]])

In [53]:
# Compute Accuracy
base_accuracy = (154 + 28) / (154+69+17+28)
print(base_accuracy)

0.6791044776119403
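
The same accuracy can be computed without typing the counts by hand. The sketch below rebuilds label arrays that reproduce the four counts reported above, arranged so that 171 passengers are actual class 0 and 97 are actual class 1, matching the test set's class counts. The row ordering is made up; only the counts come from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical label arrays: first the 171 actual 0s, then the 97 actual 1s
y_true = np.array([0] * 171 + [1] * 97)
# Of the actual 0s, 154 are predicted 0 and 17 predicted 1;
# of the actual 1s, 69 are predicted 0 and 28 predicted 1
y_pred = np.array([0] * 154 + [1] * 17 + [0] * 69 + [1] * 28)

# scikit-learn expects the true labels first, then the predictions;
# with that order, rows are actual classes and columns are predictions
cm = confusion_matrix(y_true, y_pred)

# Accuracy is the share of predictions on the diagonal
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(accuracy)
```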


In [54]:
# Compare with the simplest possible model, sometimes called the null model
# The null model predicts the predominant class every time
y_test.value_counts()

0    171
1     97
Name: Survived, dtype: int64

In [55]:
# Null Model Accuracy
null_accuracy = 171 / (171 + 97)
print(null_accuracy)

0.6380597014925373
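
scikit-learn ships a null model as `DummyClassifier`, which avoids the hand calculation. A minimal sketch, assuming targets with the class counts seen in this notebook (378 and 245 in train, derived by subtracting the test counts from the totals, and 171 and 97 in test); the feature arrays are placeholders because the dummy model ignores them:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical targets with the class counts from this notebook
y_train_demo = np.array([0] * 378 + [1] * 245)
y_test_demo = np.array([0] * 171 + [1] * 97)

# Placeholder features: DummyClassifier never looks at them
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# 'most_frequent' always predicts the predominant class of the train targets
null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_train_demo, y_train_demo)

# score() returns the accuracy on the given data
null_score = null_model.score(X_test_demo, y_test_demo)
print(null_score)
```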


<a name="summary"></a>
### Conclusion
[Back to Outline](#outline)

The simplest model had a prediction accuracy of about 68%.  The null model, which just predicts the most common class in all cases, was accurate about 64% of the time.

In this first iteration:
* we quickly created a model
* noted a few things to try next
* established a baseline accuracy of 68%
* showed that this accuracy is better than the null model accuracy of 64%