# Iteration 1: Establish a Baseline Model

Jupyter Notebook referenced from my website: <a href="https://sdiehl28.netlify.com/projects/titanic/titanic01/" target="_blank">Software Nirvana: Titanic01</a>

The material on my website and notebooks is intended to assist a beginner in Applied Machine Learning with Python, but it is not a course in machine learning.

The intent of the series of notebooks is to illustrate general iterative model development techniques that apply to a wide range of datasets.

The intent of this series of notebooks is not to optimize model creation specific to the Titanic dataset.  The Titanic dataset was chosen to provide a concrete example.

Topics such as how to use Pandas will not be discussed.  However, links to my Jupyter Notebooks that demonstrate the use of common Pandas methods are listed below.
    
* [github repo](https://github.com/sdiehl28/tutorial-jupyter-notebooks)  
* [Pandas: Series](http://nbviewer.jupyter.org/github/sdiehl28/tutorial-jupyter-notebooks/blob/master/pandas/Series.ipynb)  
* [Pandas: Axis Specification](http://nbviewer.jupyter.org/github/sdiehl28/tutorial-jupyter-notebooks/blob/master/pandas/AxisSpecification.ipynb)  
* [Pandas: DataFrame](http://nbviewer.jupyter.org/github/sdiehl28/tutorial-jupyter-notebooks/blob/master/pandas/Dataframe.ipynb)  

### Machine Learning Example
Make a prediction for Survived / Not-Survived using the Titanic dataset from Kaggle.  This is a supervised learning problem, specifically binary classification.

Several notebooks will follow this one, each iteratively improving:
* the model's accuracy
* the workflow used to create the model

<a name="outline"></a>
### Outline
1. [Acquire and Read Data](#readdata)
2. [Identify Target Variable](#target)
3. [Tentative Assumptions For 1st Iteration](#assumptions)
4. [Model Building & Evaluation](#model)
5. [Summary](#summary)

### Common Imports and Notebook Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

<a name="readdata"></a>
### Acquire the Data
[Back to Outline](#outline)

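Download "train.csv" from: https://www.kaggle.com/c/titanic/data and place it in a `data` directory that sits one level above this notebook, so that the relative path `../data/train.csv` used below resolves.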

This link also has the data dictionary (sometimes called the codebook).

### Read in Data

In [2]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')
all_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


<a name="target"></a>
### Target Variable: Survived
[Back to Outline](#outline)

Create two variables from the data we read in.  
X: Pandas DataFrame that represents the features (aka attributes)  
y: Pandas Series that represents the target (aka response)  

In [3]:
# X: drop target variable
# y: keep only the target
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

X Shape:  (891, 11)
y Shape:  (891,)


In [4]:
# ndim, as in numpy, reports the number of dimensions (e.g. 1D, 2D)
print('X dimensions: ',X.ndim)
print('y dimensions: ',y.ndim)

X dimensions:  2
y dimensions:  1


In [5]:
# aside for later
# y is a 1D object
# sometimes we need its 2D equivalent: a 2D object having 1 column
# using .values picks out the values as a numpy array
# using reshape(-1,1) converts it to a 2D object having 1 column
y_2D = y.values.reshape(-1,1)
print('y_2D type: ', type(y_2D))
print('y_2D shape: ', y_2D.shape)
print('y_2D dimensions: ', y_2D.ndim)

y_2D type:  <class 'numpy.ndarray'>
y_2D shape:  (891, 1)
y_2D dimensions:  2
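
As a small illustration of the aside above, the following sketch uses a made-up five-element Series (`y_toy`, not the Titanic target) to show the round trip between 1D and 2D shapes. `to_numpy()` is used here as an equivalent of `.values`, and `to_frame()` is shown as one pandas-native alternative; both choices are assumptions of this sketch rather than something the notebook relies on.

```python
import pandas as pd

# hypothetical toy target, standing in for y
y_toy = pd.Series([0, 1, 1, 0, 1], name='Survived')

# 1D -> 2D: -1 lets numpy infer the number of rows, 1 fixes one column
y_toy_2D = y_toy.to_numpy().reshape(-1, 1)
print(y_toy_2D.shape)

# 2D -> 1D: ravel() flattens back to a 1D array
y_toy_1D = y_toy_2D.ravel()
print(y_toy_1D.shape)

# to_frame() returns a DataFrame, i.e. a 2D object with 1 column
print(y_toy.to_frame().shape)
```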


<a name="assumptions"></a>
### 1st Iteration: Establish a Baseline Model
[Back to Outline](#outline)  

In order to quickly get something up and running, let's arbitrarily decide on the following for this iteration:
* use LogisticRegression (as is common for classification problems)
* drop ID field
* drop all non-numeric features
* drop any column having a null value
* model evaluation metric: accuracy (AUC is another good choice)
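
The rules above can also be applied programmatically rather than by listing column names by hand. The sketch below does this on a small made-up DataFrame (`toy`, not the Titanic data), so the column values are purely illustrative. Note that it would not drop an integer-coded categorical such as Pclass; that still needs a manual decision.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with the same kinds of problems as the Titanic data
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Sex': ['male', 'female', 'female', 'male'],   # non-numeric
    'Age': [22.0, np.nan, 26.0, 35.0],             # has a null
    'SibSp': [1, 1, 0, 0],
    'Fare': [7.25, 71.28, 7.93, 8.05],
})

numeric = toy.select_dtypes(include='number')      # drop non-numeric features
complete = numeric.dropna(axis=1)                  # drop columns with any null
baseline = complete.drop(columns=['PassengerId'])  # drop the ID field
print(list(baseline.columns))
```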

In [6]:
X.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [7]:
# Find the fraction of missing values per column (multiply by 100 for a percentage)
nrows, ncols = X.shape
X.isnull().sum() / nrows

PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

In [8]:
# Note Pclass is encoded as an integer, but it is actually an ordered categorical variable
# Let's remove it as well
# Age is numeric, but it is dropped because it contains null values
drop_cols = ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'Ticket', 'Cabin', 'Embarked']
X.drop(drop_cols, axis=1, inplace = True)
X.dtypes

SibSp      int64
Parch      int64
Fare     float64
dtype: object

Important Note: the decision to drop these columns did not depend on the relationship between the features and the target, so it is "okay" to do this prior to the train/test split.

Had we decided to only keep columns that were correlated with the target, for example, this would require examining the data and would have to be performed after the train/test split.

Link to stat prof anecdotal story.
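
To make the point about data-dependent feature selection concrete, here is a minimal sketch on synthetic data (generated with `make_classification`, not the Titanic dataset). It uses `SelectKBest` with `f_classif` as a stand-in for "keep columns correlated with the target"; that choice, and all variable names, are assumptions of this sketch. The selector is fit on the training rows only and then merely applied to the test rows.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in data: 200 rows, 10 features, 3 of them informative
X_syn, y_syn = make_classification(n_samples=200, n_features=10,
                                   n_informative=3, random_state=10)

# split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, stratify=y_syn, random_state=10)

# ... then choose features using the training rows only
selector = SelectKBest(f_classif, k=3).fit(X_tr, y_tr)

model = LogisticRegression().fit(selector.transform(X_tr), y_tr)
test_accuracy = model.score(selector.transform(X_te), y_te)
print(test_accuracy)
```

Fitting the selector on all rows before splitting would let the test labels influence which features are kept, which tends to make the test accuracy look better than it should.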

<a name="model"></a>
### Model Building & Evaluation
[Back to Outline](#outline)

A train/test split will be used to demonstrate how to build and evaluate a model.

`random_state` will be set so that these results are repeatable.

Cross validation will be used to get a more reliable estimate of accuracy than a single train/test split can provide.
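
The effect of `random_state` and `stratify` can be checked on a small made-up dataset. In this sketch `X_toy` and `y_toy` are hypothetical (100 rows, 40% positive class) and only serve to show that the same seed gives the same split and that stratification preserves the class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 100 rows, 40% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

split_a = train_test_split(X_toy, y_toy, test_size=0.30,
                           stratify=y_toy, random_state=10)
split_b = train_test_split(X_toy, y_toy, test_size=0.30,
                           stratify=y_toy, random_state=10)

# same random_state: the two test sets contain identical rows
same_rows = np.array_equal(split_a[1], split_b[1])
print(same_rows)

# stratify=y_toy: the test set keeps the overall share of positives
test_positive_share = split_a[3].mean()
print(test_positive_share)
```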

In [9]:
# create the train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, stratify=y, random_state=10)

In [10]:
# Build Model
from sklearn.linear_model import LogisticRegression
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [11]:
# score() computes the accuracy on the given data
base_model.score(X_test, y_test)

0.6417910447761194

In [12]:
# Compute the accuracy manually just to be sure we understand what score() is doing
predictions = base_model.predict(X_test)

In [13]:
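from sklearn.metrics import confusion_matrix
# signature is confusion_matrix(y_true, y_pred):
# rows are the actual classes, columns are the predicted classes
confusion = confusion_matrix(y_test, predictions)
confusion

array([[143,  22],
       [ 74,  29]])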

In [14]:
# Compute Accuracy
base_model_accuracy = (143 + 29) / (143+74+22+29)
print(base_model_accuracy)
print((predictions == y_test).sum() / len(predictions))

0.6417910447761194
0.6417910447761194


This is the same as above for base_model.score(), so the score is computing the accuracy.  

Accuracy is sometimes written as:
(TP + TN) / (TP + FP + TN + FN)

Where "Positive" is arbitrarily defined to be "Survived" (class 1)
<pre>
TP = True  Positive = predicted Survived,     actually Survived     =  29
FP = False Positive = predicted Survived,     actually Not-Survived =  22
TN = True  Negative = predicted Not-Survived, actually Not-Survived = 143
FN = False Negative = predicted Not-Survived, actually Survived     =  74
</pre>
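
The four counts can be read off a confusion matrix without indexing by hand. The sketch below uses two short made-up label lists (`y_true`, `y_pred`), not the notebook's predictions. It relies on scikit-learn's documented layout for binary problems: with `confusion_matrix(y_true, y_pred)`, rows are actual classes and columns are predicted classes, so `ravel()` yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical labels: 1 = Survived (positive), 0 = Not-Survived (negative)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# accuracy is the share of predictions on the diagonal
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tn, fp, fn, tp, accuracy)

# accuracy_score gives the same number directly
print(accuracy_score(y_true, y_pred))
```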

The problem with the above approach is that we took just one random train/test split, so the accuracy depends on which rows happened to land in the test set.  To get a more reliable estimate we can perform multiple train/test splits using cross validation.

In [15]:
from sklearn.model_selection import cross_val_score, KFold

In [16]:
crossvalidation = KFold(n_splits=5, shuffle=True, random_state=10)
scores = cross_val_score(base_model, X, y, cv=crossvalidation,
                         n_jobs=1)
print('median: {:5.3f}'.format(np.median(scores)))

median: 0.663
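
A single summary number hides how much the folds disagree. The following sketch, on synthetic data from `make_classification` (not the Titanic dataset), prints the per-fold scores together with their mean and standard deviation. It also swaps in `StratifiedKFold`, which keeps the class ratio the same in every fold; this is a variation on the `KFold` used above, not what the notebook itself does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=300, n_features=5,
                                   random_state=10)

# stratified folds: each fold has roughly the same class balance
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
fold_scores = cross_val_score(LogisticRegression(), X_syn, y_syn,
                              cv=stratified_cv)

print('per fold:', np.round(fold_scores, 3))
print('mean: {:5.3f}  std: {:5.3f}'.format(fold_scores.mean(),
                                           fold_scores.std()))
```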


In [17]:
# Compare with the simplest possible model, sometimes called the null model
# The null model predicts the predominant class every time
y_test.value_counts()

0    165
1    103
Name: Survived, dtype: int64

In [18]:
# Null Model Accuracy
null_accuracy = 165 / (165 + 103)
print('Null Model Accuracy: {:5.3f}'.format(null_accuracy))

Null Model Accuracy: 0.616
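
scikit-learn ships a null model as `DummyClassifier`. The sketch below rebuilds labels with the same class counts as `y_test` above (165 zeros, 103 ones) and scores a most-frequent-class predictor on them. The arrays `y_counts` and `X_ignored` are made up for this illustration. In a real workflow the dummy model would be fit on the training labels and scored on the test labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# labels with the same class counts as y_test above: 165 zeros, 103 ones
y_counts = np.array([0] * 165 + [1] * 103)
# DummyClassifier ignores the feature values, so a column of zeros is enough
X_ignored = np.zeros((len(y_counts), 1))

null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_ignored, y_counts)

# score() is the accuracy of always predicting the most frequent class
null_score = null_model.score(X_ignored, y_counts)
print('Null Model Accuracy: {:5.3f}'.format(null_score))
```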


The cross-validated accuracy of 66.3% is better than the null model accuracy of 61.6%.  This is likely to be statistically significant.  A hypothesis test could be performed to see if it is, but that will not be done here.
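
For readers who want to try such a test, one rough option is a one-sided exact binomial test. The sketch below applies it to the single train/test split only, using the counts printed earlier (172 of 268 test rows correct, null model accuracy 165/268). It treats the null accuracy as a fixed, known rate, which is a simplifying assumption, and it says nothing about the cross-validated figure, which pools more rows. It assumes SciPy 1.7 or later, where `scipy.stats.binomtest` is available.

```python
from scipy.stats import binomtest

n_test = 143 + 74 + 22 + 29     # test rows, from the confusion matrix
n_correct = 143 + 29            # correct predictions on the test split
null_rate = 165 / (165 + 103)   # null model accuracy on the same split

# H0: the model's true accuracy equals the null rate
# H1: the model's true accuracy is greater than the null rate
result = binomtest(n_correct, n_test, p=null_rate, alternative='greater')
print('p-value: {:5.3f}'.format(result.pvalue))
```

A small p-value would suggest the model beats the null model on this split by more than chance; a large one would mean this single split alone cannot settle the question.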

Here we will simply say that the Logistic Regression model is worthwhile to continue using as we iteratively improve the model-building process, in the spirit of Kaizen.

<a name="summary"></a>
### Summary
[Back to Outline](#outline)

The simple Logistic Regression model had a prediction accuracy of about 66% (64.2% on the single train/test split and a median of 66.3% under cross validation).  The null model, which just predicts the most common class in all cases, was accurate about 62% of the time.

In this first iteration:
* we quickly created a model
* noted a few things to try to improve the model
* established a baseline accuracy of 66.3%
* showed that this accuracy is better than the null model accuracy of 61.6%