# Iteration 1: Establish a Baseline Model

Jupyter Notebook referenced from my website: <a href="https://sdiehl28.netlify.com/projects/titanic/titanic01/" target="_blank">Software Nirvana: Titanic01</a>

The material on my website and notebooks is intended to supplement a course in machine learning, rather than be a course in machine learning.

This series of notebooks demonstrates iterative model development, so topics such as how to use Pandas will not be discussed. However, links to my Jupyter Notebooks that do discuss such topics are presented below.
    
* [github repo](https://github.com/sdiehl28/tutorial-jupyter-notebooks)  
* [Pandas: Series](http://nbviewer.jupyter.org/github/sdiehl28/tutorial-jupyter-notebooks/blob/master/pandas/Series.ipynb)  
* [Pandas: Axis Specification](http://nbviewer.jupyter.org/github/sdiehl28/tutorial-jupyter-notebooks/blob/master/pandas/AxisSpecification.ipynb)  
* [Pandas: DataFrame](http://nbviewer.jupyter.org/github/sdiehl28/tutorial-jupyter-notebooks/blob/master/pandas/Dataframe.ipynb)  

### Machine Learning Example
Make a prediction for Survived / Not-Survived using the Titanic dataset from Kaggle.  This is a supervised learning problem.

Several notebooks will follow this one, each iteratively improving the model and measuring its accuracy.

<a name="outline"></a>
### Outline
1. [Acquire and Read Data](#readdata)
2. [Identify Target Variable](#target)
3. [Train / Test Split](#traintest)
4. [Exploratory Data Analysis](#eda)
5. [Preprocessing](#preprocess)
6. [Model Building](#model)
7. [Model Evaluation](#eval)
8. [Summary](#summary)

### Common Imports and Notebook Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
%matplotlib inline
sns.set() # enable seaborn style

### Check Software Versions

In [2]:
import sys
print('python:     ', sys.version)
print('numpy:      ', np.__version__)
print('pandas:     ', pd.__version__)
import matplotlib
print('matplotlib: ', matplotlib.__version__)
print('seaborn:    ', sns.__version__)
print('sklearn:    ', sk.__version__)

python:      3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0]
numpy:       1.14.1
pandas:      0.22.0
matplotlib:  2.1.2
seaborn:     0.8.1
sklearn:     0.19.1


<a name="readdata"></a>
### Acquire the Data
[Back to Outline](#outline)

Download "train.csv" from https://www.kaggle.com/c/titanic/data and place it in a `data` directory one level above this notebook, since the notebook reads the file from `../data/train.csv`.

This link also has the data dictionary (sometimes called the codebook).

### Read in Data

In [3]:
# read in all the labeled data
all_data = pd.read_csv('../data/train.csv')
all_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


<a name="target"></a>
### Target Variable: Survived
[Back to Outline](#outline)

In [4]:
# break up the dataframe into X and y
# X is a 2 dimensional "spreadsheet" of values used for prediction
# y is a 1 dimensional vector of target (aka response) values
X = all_data.drop('Survived', axis=1)
y = all_data['Survived']
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)

X Shape:  (891, 11)
y Shape:  (891,)


<a name="traintest"></a>
### Train and Test Split
[Back to Outline](#outline)

Do this prior to Exploratory Data Analysis and other Model Building Steps to avoid looking at the test data.

Train/Test split will later be refined to use cross validation.

Performing the same operation on the train and test datasets will later be refined to use a pipeline.

In [5]:
# create the train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=111)
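
The cross-validation refinement mentioned above could look roughly like the following. This is only a sketch: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the names `X_demo`, `y_demo`, and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the Titanic data):
# 200 rows, 4 numeric features, 2 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=111)

# 5-fold cross validation: each fold is held out once while the
# model is fit on the remaining four folds
cv_scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo, cv=5, scoring='accuracy')

print('fold accuracies:', cv_scores)
print('mean accuracy:  ', cv_scores.mean())
```

Averaging over several folds gives a less noisy accuracy estimate than a single train/test split.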

<a name="eda"></a>
### Exploratory Data Analysis
[Back to Outline](#outline)

One of the first things to check for is **null values**.  In this *first iteration* of creating a machine learning model, this will be the only EDA performed.

In [6]:
# Find the fraction of missing values per column (0.0 = none, 1.0 = all)
nrows, ncols = X_train.shape
X_train.isnull().sum() / nrows

PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.187801
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.776886
Embarked       0.003210
dtype: float64
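
As a small aside, the same per-column fraction can be computed with `isnull().mean()`, because the mean of a boolean column is the share of `True` values. The sketch below uses a made-up four-row frame (`toy` is a hypothetical name) rather than the Titanic data.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: Age is missing in 2 of 4 rows, Fare is complete
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# The mean of the boolean mask equals the fraction of missing values
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```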

### Null Value Analysis
The following is a reasonable judgement call as to how to proceed based on the observed percentages of null values.
1. The Age attribute has some missing values => impute missing values
2. Most of the Cabin attribute is missing => remove it
3. Very few Embarked values are missing => remove records with a missing Embarked value

Age imputation is likely to be helpful. However, in this first iteration the goal is to quickly create a model for baseline purposes, so the Age column is simply dropped for now, along with Cabin.

Write a note to ourselves and others about what to try next.  In a business environment, this would be maintained in an issue tracking system.  

**Next Iteration:**
- Try Age Imputation
- Remove records with missing Embarked value
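
One possible shape for the Age imputation noted above is a median fill. The sketch below is an assumption about how it might be done, not the approach used in the later notebooks. It runs on made-up frames (`train_demo` and `test_demo` are hypothetical names), and the median is computed from the training rows only so that nothing is learned from the test rows.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the train and test frames
train_demo = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0]})
test_demo = pd.DataFrame({'Age': [np.nan, 40.0]})

# Learn the fill value from the training data only
age_median = train_demo['Age'].median()

# Apply the same fill value to both sets
train_imputed = train_demo.fillna({'Age': age_median})
test_imputed = test_demo.fillna({'Age': age_median})

print(age_median)
print(test_imputed)
```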

In [7]:
# Discard Age column (for now)
X_train = X_train.drop('Age', axis=1)
X_test = X_test.drop('Age', axis=1)

# Discard Cabin column
X_train = X_train.drop('Cabin', axis=1)
X_test = X_test.drop('Cabin', axis=1)

### Examine Datatypes
Often this involves converting text or integers to categorical variables.

Based on a review of the data dictionary at [titanic](https://www.kaggle.com/c/titanic/data), and an examination of the values of each column, the following variables need to be converted to categorical:


**Next Iteration: convert the following variables to categorical**
- Pclass
- Sex
- Embarked
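
For reference, one common way to make such variables usable by a linear model is one-hot encoding. The following is a minimal sketch using `pd.get_dummies` on a made-up three-row frame (`toy` is a hypothetical name); it is not necessarily the encoding chosen in the next iteration.

```python
import pandas as pd

# Made-up rows using the three columns listed above
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

# One indicator column is created per observed category value
encoded = pd.get_dummies(toy, columns=['Pclass', 'Sex', 'Embarked'])
print(encoded.columns.tolist())
```

Note that Pclass is stored as an integer, so it is only treated as categorical because it is named explicitly in `columns`.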

In [8]:
# For 1st Iteration only, ignore all text and categorical variables
text_and_categorical = ['Pclass', 'Name', 'Sex', 'Ticket', 'Embarked']
X_train = X_train.drop(text_and_categorical, axis=1)
X_test = X_test.drop(text_and_categorical, axis=1)

A natural question is whether it would have been easier to drop these columns before creating the train/test split, so that the same operation (drop column) would not have to be applied to both sets. The answer is "yes", but the proper way to do this while ensuring no "test data leakage" is by way of pipelines, which will be discussed in a subsequent notebook.
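
To give a rough idea of what the pipeline approach looks like, here is a minimal sketch under some assumptions: it uses `ColumnTransformer`, which requires scikit-learn 0.20 or later (newer than the 0.19.1 used in this notebook), and it runs on made-up rows. The names `train_demo`, `test_demo`, `y_demo`, `keep_numeric`, and `pipe` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up rows that mix numeric columns with a text column
train_demo = pd.DataFrame({
    'SibSp': [1, 1, 0, 1, 0, 0],
    'Parch': [0, 0, 0, 0, 0, 2],
    'Fare': [7.25, 71.28, 7.93, 53.10, 8.05, 21.08],
    'Name': ['a', 'b', 'c', 'd', 'e', 'f'],
})
y_demo = pd.Series([0, 1, 1, 1, 0, 0])
test_demo = pd.DataFrame({
    'SibSp': [0, 1],
    'Parch': [0, 1],
    'Fare': [8.46, 51.86],
    'Name': ['g', 'h'],
})

# Keep the listed numeric columns unchanged and drop everything else
keep_numeric = ColumnTransformer(
    [('keep', 'passthrough', ['SibSp', 'Parch', 'Fare'])],
    remainder='drop')

# The column selection is part of the model, so it is applied
# identically during fit (train) and predict (test)
pipe = Pipeline([('select', keep_numeric),
                 ('model', LogisticRegression())])
pipe.fit(train_demo, y_demo)
preds = pipe.predict(test_demo)
print(preds)
```

Because the column handling lives inside the pipeline, there is no need to repeat each operation by hand on both sets.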

In [9]:
# Examine the datatypes of each remaining column
X_train.dtypes

PassengerId      int64
SibSp            int64
Parch            int64
Fare           float64
dtype: object

<a name="preprocess"></a>
### Preprocessing
[Back to Outline](#outline)

Preprocessing was done "inline" with the Exploratory Data Analysis above.

<a name="model"></a>
### Model Building
[Back to Outline](#outline)

Perhaps the simplest model to try for classification of two classes (Survived, Not-Survived) is Logistic Regression.

Special techniques are required if one class is much rarer than the other.  Let's check for that.

In [10]:
y.value_counts()

0    549
1    342
Name: Survived, dtype: int64

That's close enough to "even" for this first iteration.  Logistic Regression may work fine.
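
The balance is easier to judge as proportions. The sketch below rebuilds a label vector from the two counts shown above (`y_demo` is a hypothetical stand-in for `y`) and uses `value_counts(normalize=True)`.

```python
import pandas as pd

# Stand-in label vector rebuilt from the counts above: 549 zeros, 342 ones
y_demo = pd.Series([0] * 549 + [1] * 342, name='Survived')

# normalize=True returns class shares instead of raw counts
class_share = y_demo.value_counts(normalize=True)
print(class_share.round(3))
```

Roughly 62% of the passengers did not survive and 38% did, which is a mild imbalance.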

In [11]:
# Build Model
from sklearn.linear_model import LogisticRegression
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

<a name="eval"></a>
### Model Evaluation
[Back to Outline](#outline)

The simplest measure of "model value" (to the business or end user) is the percent of correct predictions.  Assuming that the cost of a false positive is equal to the cost of a false negative, "accuracy" is a good measure of model value.

In [12]:
predictions = base_model.predict(X_test)

In [13]:
from sklearn.metrics import confusion_matrix
# scikit-learn expects (y_true, y_pred): rows are actual classes, columns are predicted classes
confusion_matrix(y_test, predictions)

array([[154,  17],
       [ 69,  28]])

In [14]:
# Compute Accuracy
base_model_accuracy = (154 + 28) / (154+69+17+28)
print(base_model_accuracy)

0.6791044776119403
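
Rather than typing the counts by hand, accuracy can be computed directly. The sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical) to show that `accuracy_score` matches the diagonal of the confusion matrix divided by the total count.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted labels for eight passengers
y_true_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Correct predictions sit on the diagonal
acc_from_cm = cm.trace() / cm.sum()
acc_direct = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(acc_from_cm, acc_direct)
```

In this notebook the equivalent call would be `accuracy_score(y_test, predictions)`.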


In [15]:
# Compare with the simplest possible model, sometimes called the Null Model
# The Null Model predicts the predominant class every time
y_test.value_counts()

0    171
1     97
Name: Survived, dtype: int64

In [16]:
# Null Model Accuracy
null_accuracy = 171 / (171 + 97)
print(null_accuracy)

0.6380597014925373
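
scikit-learn also provides a null model as `DummyClassifier`. The sketch below rebuilds test labels from the counts above (171 zeros, 97 ones); the training labels are made up, and all `*_demo` names are hypothetical. One difference from the hand calculation is that `DummyClassifier` learns the predominant class from the training labels rather than from the test labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up training labels in which class 0 is the majority
y_train_demo = np.array([0] * 6 + [1] * 4)
# Test labels rebuilt from the counts above: 171 zeros, 97 ones
y_test_demo = np.array([0] * 171 + [1] * 97)

# The features are ignored by this strategy, so placeholders suffice
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class seen during fit
null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_train_demo, y_train_demo)
null_score = null_model.score(X_test_demo, y_test_demo)
print(null_score)
```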


<a name="summary"></a>
### Summary
[Back to Outline](#outline)

The simple Logistic Regression model had a prediction accuracy of about 68%.  The null model, which predicts the most common class in every case, was accurate about 64% of the time.

In this first iteration:
* we quickly created a model
* noted a few things to try to improve the model
* established a baseline accuracy of 67.9%
* showed that this accuracy is better than the null model accuracy of 63.8%