# Introduction to Keras and Deep Learning

This notebook introduces the Keras framework, using TensorFlow as the backend engine, and the basics of Deep Learning.

## Setup

If you want to follow along with the code examples, you will need to set up a Python 3.6.x environment.

**NOTE** Python 3.7.x is not yet supported by tensorflow and some of the other libraries.

### pip install

```bash
pip install --upgrade pip
pip install pandas
pip install numpy
pip install tensorflow
pip install keras
pip install matplotlib
pip install jupyter
```

On macOS, run `source venv/bin/activate` to activate the Python virtual environment.

### Jupyter Notebook Extensions
If you would like to include jupyter notebook extensions you can execute the following:
```bash
pip install jupyter_nbextensions_configurator jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextensions_configurator enable --user
```


## DeepLearning vs ShallowLearning

The most significant difference between the two styles of machine learning, in this author's humble opinion, is that deep learning accounts for the **interaction** between the features in the data. The algorithm, built on neural networks, figures out the important features by adjusting weights.

In shallow learning, the machine learning engineer has to go through a feature engineering step on the data, which is sometimes lengthy.  The shallow learning model is then parameterized and tested by the machine learning engineer.  In deep learning, the algorithm typically determines the features of interest, and the interactions between them, through a repeated cycle: forward propagation of the data, measurement of the error with a loss function, backpropagation of the error, and adjustment of the model weights.  Rinse, repeat.

This is not to say that with deep learning there is no feature engineering, or no model parameterization - but in deep learning the 'neural network' figures this out for the most part.

### Feature interaction between Shallow and Deep Learning
In shallow learning, such as linear regression, each feature is considered individually and is assigned a weight, also known as a coefficient.

The general form of a linear equation used for predictions with multiple features is:

**<font size="3">General form of shallow learning linear regression</font>**

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)

For example, with three features:

$y = \beta_0 + \beta_1 \times \text{TV Sales} + \beta_2 \times \text{Radio Sales} + \beta_3 \times \text{Newspaper Sales}$

The $\beta$ values are called the **model coefficients**. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions!
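
To make the "least squares" fitting step concrete, here is a minimal sketch using NumPy. The data is made up for illustration (generated from $y = 1 + 2x_1 + 3x_2$ with no noise), so the fit should recover those coefficients.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient plays the role of beta_0
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares fit of the coefficients
betas, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(betas, 3))  # intercept, then one coefficient per feature
```

With real, noisy data the recovered coefficients would only approximate the underlying relationship.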

Shallow learning is implemented in machine learning frameworks such as [Scikit-learn](https://scikit-learn.org/stable/).


In **DeepLearning**, all of the features are combined in a series of *hidden* nodes with weights, similar to shallow learning. However, the weights of each node are adjusted based on the error to find an optimal combination of features.

As an oversimplification, it might look like:

$y = (\beta_4x_1+\beta_5x_2+\beta_6x_3) + (\beta_7x_1+\beta_8x_2+\beta_9x_3) + \dots$

Deep Learning is implemented in machine learning frameworks such as [Keras](https://keras.io).

Both types of Machine Learning have their place.  One is not necessarily superior to the other.  They are both valid tools to solve a problem, and when applied correctly can achieve very good results.

It seems that in 'the wild' DeepLearning is considered a better choice, because anything you can solve with ShallowLearning you can do with DeepLearning.  

I have read that always applying DeepLearning to a problem is like using a sledgehammer as your only hammer.  Need to tack down upholstery - sledgehammer.  Need to nail a finishing nail into trim - sledgehammer.  Need to hang a picture on a wall - sledgehammer.  Need to break up concrete, or drive a concrete nail into a wall - NOW... sledgehammer.  While you can use a sledgehammer for all of these problems, it might not always be the best tool.

### Shallow Learning - Linear Regression

$\text{Bank Transactions} = \beta_0 + \beta_1 \times \text{Age} + \beta_2 \times \text{Bank Balance} + \beta_3 \times \text{401k Balance} + \beta_4 \times \text{Retire Status}$



![slide2](./docs/slide_images/slide_images.001.jpeg)

### Deep Learning Neural Network

![slide3](./docs/slide_images/slide_images.002.jpeg)

### Dot Product

What is:

$(\beta_4x_1+\beta_5x_2+\beta_6x_3)$

That is the dot product of two vectors:

$$\begin{bmatrix} \beta_4 & \beta_5 & \beta_6 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2\\ x_3 \end{bmatrix}$$

The value of each hidden node is the dot product of the node values going into the hidden node and the weights of all edges connecting them.  This dot product, along with the *activation function*, makes up the *forward propagation* algorithm that determines the value of each node.

Deep Learning neural networks and linear regression problems rely heavily on linear algebra and matrix mathematics. Fortunately, the machine learning libraries handle all of this math, but it is still very helpful to understand what is happening.




In [17]:
# Using Python to calculate dot products
import numpy as np

m1 = np.array([2,3,4])
m2 = np.array([5,6,7])

# m1 dot m2 = 2*5 + 3*6 + 4*7 = 56

v1 = m1@m2
v2 = np.dot(m1,m2)
v3 = (m1*m2).sum()
print(v1,v2,v3)

56 56 56


## Forward Propagation

Neural networks use a forward propagation algorithm to make initial predictions based on the data.

Let's look at an example where we want to predict the number of transactions a person will make at a bank given just the number of children and the number of accounts.

Forward propagation starts at the input layer, moves through the hidden layers, and then finally to the output layer.

Forward propagation proceeds for each data point.

**NOTE** the diagram below is focused on the node value calculations *without* an activation function.  We will see how activation functions are used in the next section.

![slide3](./docs/slide_images/slide_images.003.jpeg)

The input layer feature vector can be represented by:

$$\begin{bmatrix} 2 & 3 \end{bmatrix}$$

The weights vector for Node 0:

$$\begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

The weights vector for Node 1:

$$\begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

The weights vector for the Output layer:

$$\begin{bmatrix} 1 \\ -1 \end{bmatrix}$$


In [18]:
# input_data vector: [2,3]
# weight vector: [2,1]
node_0 = np.array([2,3])@np.array([2,1])
print(node_0)

7


In [19]:
node_1 = np.array([2,3])@np.array([1,-1])
print(node_1)

-1


In [20]:
target = np.array([node_0,node_1])@np.array([1,-1])
print(target)

8
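
The three cells above can also be written in matrix form, which is closer to how the libraries do it. This is a sketch under one assumption: the weights for the hidden nodes are stacked as columns of a matrix (the name `hidden_weights` is chosen here for illustration).

```python
import numpy as np

input_data = np.array([2, 3])

# Each column holds the weights feeding one hidden node:
# column 0 -> Node 0 ([2, 1]), column 1 -> Node 1 ([1, -1])
hidden_weights = np.array([[2, 1],
                           [1, -1]])
output_weights = np.array([1, -1])

# One matrix product gives the values of both hidden nodes at once
hidden_values = input_data @ hidden_weights
prediction = hidden_values @ output_weights

print(hidden_values, prediction)  # hidden values 7 and -1, prediction 8
```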


## Activation Functions

The dot product calculation between the inputs and the weights is only part of what happens at each hidden node.

Activation functions are run in the hidden layers and capture non-linearities in the input features. Activation functions are applied to values coming into a node.

Two things together give deep learning its power: combining all of the features with various weights, so that the algorithm can account for feature interactions, and capturing non-linearities with activation functions.

The value of the node can be thought of as:

<font size="4">node_value = activation_function(input_values dot weights )</font>

A good overview of activation functions can be found in this [Towards Data Science](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6) blog.

### ReLU Rectified Linear Unit

There are many activation functions, such as *tanh*, but ReLU is a popular choice for neural networks.

![ReLU](./docs/relu.png)

If the value into the ReLU activation function is negative, then the output is set to zero.  Otherwise the value is returned.  

It is a very simple function, which is also very effective and powerful.

```python
def relu(input):
    # Calculate the value for the output of the relu function: output
    output = max(0, input)

    # Return the value just calculated
    return (output)
```

Let's apply the ReLU activation to the example above:


![ReLU Activation Example](./docs/slide_images/slide_images.004.jpeg)

In [21]:
def relu(input):
    # Calculate the value for the output of the relu function: output
    output = max(0, input)

    # Return the value just calculated
    return (output)

In [22]:
node_0 = relu(np.array([2,3])@np.array([2,1]))
print(node_0)

7


In [23]:
node_1 = relu(np.array([2,3])@np.array([1,-1]))
print(node_1)

0


Notice now, for Node 1 - the value was previously -1, however as a result of the ReLU activation function the value of Node 1 is now 0 (zero).

**Note** The output layer does not necessarily use the same activation function as the hidden layers.  

For the output layer, a regression problem typically uses a linear activation function, and a classification problem typically uses one called 'softmax'.

In [24]:
target = np.array([node_0,node_1])@np.array([1,-1])
print(target)

7
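
For reference, the softmax activation mentioned in the note above can be sketched in a few lines of NumPy. The raw output values are made up for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    # Subtracting the maximum does not change the result but avoids overflow
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw output-layer values for a 3-class problem
raw_outputs = np.array([2.0, 1.0, 0.1])
probabilities = softmax(raw_outputs)

# The outputs are all positive and sum to one, so they read as probabilities
print(probabilities, probabilities.sum())
```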


Using Forward Propagation with the ReLU activation function, let's calculate the output layer values for different input values.

In [1]:
# Define predict_with_network()
# this 'network' has one hidden layer with 2 nodes.
def predict_with_network(input_data_row, weights):
    # Calculate node 0 value
    node_0_input = input_data_row @ weights['node_0']
    node_0_output = relu(node_0_input)

    # Calculate node 1 value
    node_1_input = input_data_row @ weights['node_1']
    node_1_output = relu(node_1_input)

    # Put node values into array: hidden_layer_outputs
    hidden_layer_outputs = np.array([node_0_output, node_1_output])

    # Calculate model output
    input_to_final_layer = hidden_layer_outputs @ weights['output']
    
    # note in this case we are using a relu because you cannot have negative transactions
    model_output = relu(input_to_final_layer)

    # Return model output
    return (model_output)



In [26]:
# Setup input_data/features/input layer
input_data = [np.array([3, 5]), np.array([ 1, -1]), np.array([0, 0]), np.array([8, 4])]

# Setup network weights
weights = {'node_0': np.array([2, 1]), 'node_1': np.array([ 1, -1]), 'output': np.array([1, -1])}

# Create empty list to store prediction results
results = []

# For each set of feature values, calculate a prediction.
for input_data_row in input_data:
    # Append prediction to results
    model_output = predict_with_network(input_data_row, weights)
    results.append(model_output)

# Print results
print(results)


[11, 0, 0, 16]
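
The loop above handles one row at a time. As a sketch, the same calculation can be done for all rows at once by stacking the rows into a matrix and using `np.maximum` as an element-wise ReLU. The variable names here are chosen for illustration.

```python
import numpy as np

# Same four rows and weights as the loop above
input_rows = np.array([[3, 5], [1, -1], [0, 0], [8, 4]])
hidden_weights = np.column_stack([[2, 1], [1, -1]])  # one column per hidden node
output_weights = np.array([1, -1])

hidden = np.maximum(0, input_rows @ hidden_weights)   # element-wise ReLU
predictions = np.maximum(0, hidden @ output_weights)  # ReLU on the output as well
print(predictions)
```

This produces the same four predictions as the row-by-row version.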


## Deep Neural Networks

The real power of a neural network comes from the ability to have multiple hidden layers, where each layer can have a different number of nodes.

Even though you might have multiple layers with a different number of nodes, the forward propagation algorithm still works the same way.

- deep networks internally build representations of patterns in the data
- partially replace the need for feature engineering
- subsequent layers build increasingly sophisticated representations of the raw data because each prior layer provides more information on the feature interactions.

Note again, that we do not program the increased representation of the interactions, the network determines this.  This is where something called *back propagation* comes in.

### Example with 2 hidden layers

![2 Hidden Layers ](./docs/slide_images/slide_images.005.jpeg)

#### Solution

![2 Hidden Layers Solution](./docs/slide_images/slide_images.006.jpeg)

In [27]:
input_data = np.array([3, 4])
weights = {'node_0_0': np.array([2, 1]), 'node_0_1': np.array([ 1, 1]), 'node_1_0': np.array([1,2]), 'node_1_1': np.array([-2,-1]), 'output': np.array([1, -1])}

node_0 = relu(input_data @ weights['node_0_0'])
node_1 = relu(input_data @ weights['node_0_1'])

hidden_layer_0 = np.array([node_0, node_1])

node2 = relu(hidden_layer_0 @ weights['node_1_0'])
node3 = relu(hidden_layer_0 @ weights['node_1_1'])

hidden_layer_1 = np.array([node2, node3])

target = hidden_layer_1 @ weights['output']

print(f'Target prediction of # transactions: {target}')


Target prediction of # transactions: 24
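
Because forward propagation works the same way for every layer, the calculation above generalizes to a short loop. The following is a sketch, assuming each layer's weights are stored as a matrix with one column per node. The helper name `forward` is hypothetical.

```python
import numpy as np

def forward(input_row, layer_weights, output_weights):
    """Propagate one row through any number of ReLU hidden layers."""
    values = input_row
    for weights in layer_weights:
        # One column of `weights` per node in the layer
        values = np.maximum(0, values @ weights)
    # Linear output layer, matching the cell above
    return values @ output_weights

layer_weights = [
    np.column_stack([[2, 1], [1, 1]]),    # node_0_0, node_0_1
    np.column_stack([[1, 2], [-2, -1]]),  # node_1_0, node_1_1
]
prediction = forward(np.array([3, 4]), layer_weights, np.array([1, -1]))
print(prediction)
```

Adding a third hidden layer would only require appending another matrix to `layer_weights`.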


## Determining the network weights

Assume the simple network below:

![Weights](./docs/slide_images/slide_images.007.jpeg)

Let's also assume that we know that for 3 children and 4 accounts the expected number of transactions is 10.

What weights do we need to predict 10 so that the prediction error ( predicted value - target value ) is zero?

In [28]:
input_data = np.array([3,4])
weights = {'node_0': np.array([1,1]), 'node_1': np.array([1,1]), 'output': np.array([1,1])}

prediction = predict_with_network(input_data, weights)
target = 10           
print(f'Prediction: {prediction}, Error: {prediction - target}')

Prediction: 14, Error: 4


In [29]:
# brute force?
from random import randint
squared_error_values = []
for i in range(0,1000):
    weights['node_0'] = np.array([randint(-10,10),randint(-10,10)])
    weights['node_1'] = np.array([randint(-10,10),randint(-10,10)])
    weights['output'] = np.array([randint(-10,10),randint(-10,10)])
    
    prediction = predict_with_network(input_data, weights)
#     print(f'Prediction: {prediction}, Error: {prediction - target}')
    error = prediction - target
    if error == 0:
        print(f"************ SOLUTION found in {i} attempts ****************")
        print(weights)
        break
    squared_error_values.append(np.square(error))
else:
    print("Could not find solution")

************ SOLUTION found in 278 attempts ****************
{'node_0': array([4, 7]), 'node_1': array([-2,  9]), 'output': array([-5,  7])}


Making accurate predictions gets much more difficult as the number of hidden layers and nodes increases.

Every set of weights can have a different error value, and many values of the weights can also solve the problem.

Solution:  Use a **loss function** to aggregate all of the errors into a single score.

The loss function is a measure (or score) of the model's predictive performance.

**For regression, mean squared error is a typical loss function.**



The lower the loss function value, the better the model.  Therefore, the lower the *mean squared error* the better the prediction.
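
As a small illustration, mean squared error can be computed as follows. The targets and the two sets of predictions are made up. They stand in for the outputs of two candidate sets of weights.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    errors = np.asarray(predictions) - np.asarray(targets)
    return np.mean(errors ** 2)

# Hypothetical targets and the predictions of two candidate sets of weights
targets = [10, 4, 7]
predictions_a = [14, 5, 7]   # errors: 4, 1, 0
predictions_b = [11, 3, 8]   # errors: 1, -1, 1

loss_a = mean_squared_error(predictions_a, targets)
loss_b = mean_squared_error(predictions_b, targets)
print(loss_a, loss_b)
```

The second set has the lower loss, so by this score it is the better model, even though neither set predicts every row exactly.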

The goal is to find a set of weights that gives the lowest value for the loss function.

This is done using **gradient descent**.

### Gradient Descent of the loss function

Gradient descent is the process of finding the lowest value of a curve.  

A very good article on gradient descent can be found on [Towards Data Science](https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3)

![GD](./docs/gradient_descent.png)

Calculate the slope of the tangent line of the loss_function for the given data points.  If that slope is positive, you decrease the value of 'w'.  If the slope is negative, you increase the value of 'w'.  

Recall that the slope is the 1st derivative of the loss function.

The amount you increase or decrease is governed by the **learning rate**.  

The learning rate is a multiplier of the slope, and their product is the step size that is subtracted from the current weight:


<center>
<br>
$new\ w = w - (learning\ rate * slope)$
</center>


Learning rate values are typically in the range of 0.001 to 0.1.

The smaller the value of the learning rate the longer a model will take to train.  Too large a value for learning rate may cause the training to never converge because you will continually step over the minimum value.
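
The effect of the learning rate can be seen on a toy example. This sketch assumes a made-up, bowl-shaped loss $(w - 3)^2$ whose minimum is at $w = 3$. It is not the loss of any model in this notebook.

```python
def loss(w):
    # Hypothetical bowl-shaped loss with its minimum at w = 3
    return (w - 3) ** 2

def slope(w):
    # Derivative of (w - 3)^2
    return 2 * (w - 3)

def gradient_descent(w, learning_rate, steps):
    for _ in range(steps):
        # Step against the slope, scaled by the learning rate
        w = w - learning_rate * slope(w)
    return w

w_good = gradient_descent(0.0, learning_rate=0.1, steps=50)
w_slow = gradient_descent(0.0, learning_rate=0.001, steps=50)
w_diverged = gradient_descent(0.0, learning_rate=1.1, steps=50)

print(w_good, loss(w_good))          # close to the minimum
print(w_slow, loss(w_slow))          # still far away after 50 steps
print(w_diverged, loss(w_diverged))  # steps over the minimum and gets worse
```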

#### Slope of / Derivative of Mean Squared Error Loss Function

Slope of the mean-squared error loss function with respect to a particular prediction or node value is:

$slope = 2 * ( predicted\ value - actual\ value)$

or

$slope = 2 * error\ value$

<font size="1">Recall that the derivative of X^2 is 2X</font>
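
A quick way to convince yourself of the formula is to compare it against a numerical estimate of the slope. This is a sketch for a single prediction, using the values 6 (predicted) and 10 (actual) as an example.

```python
def squared_error(predicted, actual):
    return (predicted - actual) ** 2

def analytic_slope(predicted, actual):
    # The formula from the text: 2 * error
    return 2 * (predicted - actual)

def numeric_slope(predicted, actual, h=1e-6):
    # Central difference approximation of the derivative
    return (squared_error(predicted + h, actual)
            - squared_error(predicted - h, actual)) / (2 * h)

predicted, actual = 6.0, 10.0
print(analytic_slope(predicted, actual), numeric_slope(predicted, actual))
```

Both values should agree closely, up to floating point rounding.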
    

#### Slope of / Derivative of ReLU Activation Function
See [James McCaffrey Blog](https://jamesmccaffrey.wordpress.com/2017/06/23/two-ways-to-deal-with-the-derivative-of-the-relu-function/) for details.

"If x is greater than 0 the derivative is 1 and if x is less than zero the derivative is 0. But when x = 0, the derivative does not exist.

There are two ways to deal with this. First, you can just arbitrarily assign a value for the derivative of y = ReLU(x) when x = 0. Common arbitrary values are 0, 0.5, and 1. Easy!"

For the not so easy - see reference above
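
The "easy" approach from the quote can be sketched as a small function. The parameter `value_at_zero` is a name chosen here for illustration, and its default of 0 is just one of the arbitrary conventions the quote lists.

```python
def relu_derivative(x, value_at_zero=0.0):
    # value_at_zero is a convention, since the derivative is undefined at x = 0
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return value_at_zero

print(relu_derivative(3), relu_derivative(-2), relu_derivative(0))
```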


#### Updated weights calculation Example between 2 nodes

Keep in mind that when we reference the value of the 'slope of the loss function' at a particular node, that is the same as saying the value of the 'derivative of the loss function' at a particular node.

![SlopeCalc](./docs/slide_images/slide_images.008.jpeg)

##### Slope of loss function with respect to value at target node

Recall that the derivative of the mean squared error loss function is '2 * error'.

2 * (predicted value-target value) = 2 * (target node value-target value ) = 2 * (6-10) = -8

##### Value of Source Node

3

##### Slope of activation function with respect to value at source node

1 (Since the value is positive)

##### Assume a learning rate of 0.01

The adjustment that is subtracted from the current weight is:

derivative_of_relu(source node) * source node * derivative_of_mse(error) * learning rate

In [30]:
#new_weight = w - (1 * source_node_value * (2 * (predicted-target) * LR)  )
new_weight = 2 - 1 * 3 * -8 * 0.01
print(new_weight)

2.24


Repeat this for all of the nodes to update the weights and then perform forward propagation and measure error, perform back-propagation, rinse and repeat.

### 2 Node Example
![2SlopeCalc](./docs/slide_images/slide_images.009.jpeg)

In [2]:
import numpy as np

source_data = np.array([3,4])

weights = np.array([1,2])

target = 10

learning_rate = 0.01

preds = source_data @ weights

print(f"Prediction: {preds}")

error = preds - target
print(f'Error: {error}')

# gradient is the matrix form of the 3 steps above
gradient = 2 * error * source_data
print(f'Gradient: {gradient}')

# new weights
weights_updated = weights - (learning_rate * gradient)
print(f'Updated Weights: {weights_updated}')

pred_updated = source_data @ weights_updated
print(f'Updated Prediction: {pred_updated}')

error_updated = pred_updated - target
print(f'Updated Error: {error_updated}')

Prediction: 11
Error: 1
Gradient: [6 8]
Updated Weights: [0.94 1.92]
Updated Prediction: 10.5
Updated Error: 0.5
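
The cell above performs a single update. As a sketch, repeating the same update in a loop shows the "rinse and repeat" part: with these particular inputs and learning rate, each step roughly halves the error.

```python
import numpy as np

source_data = np.array([3, 4])
weights = np.array([1.0, 2.0])
target = 10
learning_rate = 0.01

errors = []
for step in range(10):
    # Forward pass and error for the current weights
    error = source_data @ weights - target
    errors.append(error)
    # Same gradient and update rule as the cell above
    gradient = 2 * error * source_data
    weights = weights - learning_rate * gradient

print(np.round(errors, 4))
print(weights, source_data @ weights)
```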


### Back-Propagation

Back-propagation allows gradient descent to update all weights in a neural network by calculating the gradients for all weights.

It relies on the chain rule from calculus, but this notebook will focus on understanding the general process.

A good article on Back-Propagation: [Towards Data Science](https://medium.com/@14prakash/back-propagation-is-very-simple-who-made-it-complicated-97b794c97e5c).
It is actually a bit more complex than this notebook will get into.

A model always performs forward propagation to get a prediction and calculate the error.  Then the model performs back-propagation to calculate new weights, as we discussed previously.

Back-propagation goes back one *layer* at a time.  For each layer, we calculate the gradient as we did above.

The slope for a node value is the sum of the slopes for all weights that come out of it.

* Start with random values for weights

* Use forward propagation to make a prediction and calculate the error

* Use backward propagation to calculate the slope of the loss function with respect to each weight

* Multiply that slope by the learning rate and subtract the value from the current weights.
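
The steps above can be sketched for a tiny network with one hidden layer of two ReLU nodes and a linear output. This is an illustration under several assumptions, not how Keras implements it: a single data row, a made-up target of 10, a learning rate of 0.001, and fixed starting weights instead of random ones so that the run is reproducible.

```python
import numpy as np

x = np.array([2.0, 3.0])   # one input row
target = 10.0              # hypothetical target value
learning_rate = 0.001

# Fixed starting weights (instead of random ones) so the run is reproducible
hidden_weights = np.array([[2.0, 1.0],
                           [1.0, -1.0]])   # one column per hidden node
output_weights = np.array([1.0, -1.0])

for step in range(200):
    # Forward propagation
    hidden_in = x @ hidden_weights
    hidden_out = np.maximum(0, hidden_in)
    prediction = hidden_out @ output_weights
    error = prediction - target

    # Backward propagation (chain rule, one layer at a time)
    slope_prediction = 2 * error                          # d(loss)/d(prediction)
    grad_output = slope_prediction * hidden_out           # d(loss)/d(output_weights)
    slope_hidden_out = slope_prediction * output_weights  # d(loss)/d(hidden_out)
    slope_hidden_in = slope_hidden_out * (hidden_in > 0)  # apply the ReLU derivative
    grad_hidden = np.outer(x, slope_hidden_in)            # d(loss)/d(hidden_weights)

    # Gradient descent update
    output_weights = output_weights - learning_rate * grad_output
    hidden_weights = hidden_weights - learning_rate * grad_hidden

final_prediction = np.maximum(0, x @ hidden_weights) @ output_weights
print(final_prediction)
```

Note that the second hidden node starts with a negative input, so its ReLU derivative is 0 and its weights never change in this run.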

![BackProp](./docs/slide_images/slide_images.010.jpeg)

In [32]:
#new_weight = w - (1 * source_node_value * (2 * (predicted-target) * LR)  )
d_relu = 1
slope_error = 2*(7-4)
LR = 0.01
node_value = 3
existing_weight = 2
new_weight = existing_weight - d_relu * node_value * slope_error * LR
print(new_weight)

1.82


![BackProp](./docs/slide_images/slide_images.011.jpeg)

![BackProp](./docs/slide_images/slide_images.012.jpeg)

## Keras Model Building

You can find out all of the details on Keras from the [website](http://keras.io).

The steps:

- Specify the model architecture

This step will create the model and define the layers, from input layer, to any number of hidden layers, to an output layer. 

This architecture is:
- single input layer

- 2 hidden layers.  One with 50 nodes, the other with 32 nodes

- single output layer with a single node

```python
from keras.models import Sequential
from keras.layers import Dense

# Set up the model: model
model = Sequential()

# Add the first layer
# notice n_cols.  This is the number of features or columns in a dataframe.
model.add(Dense(50, activation='relu', input_shape=(n_cols,)))

# Add the second layer
model.add(Dense(32, activation='relu'))

# Add the output layer
# single node for the predictions of the model.
model.add(Dense(1))
```

- compile

Sets up the model to get ready for the fit method.

```python
model.compile(optimizer='adam', loss='mean_squared_error')
```

Keras has many different optimizers, which essentially handle the gradient descent and back-propagation.  The 'adam' optimizer dynamically adjusts the learning rate. 

For regression, the loss function is generally *mean_squared_error*.  For classification problems we would use *categorical_crossentropy*.

- fit

Apply backpropagation and gradient descent with data to update the weights.

The fit method takes an `epochs` parameter, which tells Keras how many times to pass over the full data set during training.

```python
model.fit(X,y, epochs=10)
```


- predict

This method will make predictions given the features.

```python
y_pred = model.predict(pred_data)
```
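
To get a feel for the size of the architecture specified in the first step above, here is a sketch that counts its trainable weights. It assumes `n_cols = 9`, a value picked only for illustration. It also counts one bias per node, since Keras `Dense` layers include a bias term by default, which the hand-worked examples earlier left out.

```python
def dense_parameter_count(n_inputs, n_nodes):
    # One weight per input-to-node edge, plus one bias per node
    return n_inputs * n_nodes + n_nodes

n_cols = 9                 # hypothetical number of feature columns
layer_sizes = [50, 32, 1]  # two hidden layers and the output layer

total = 0
n_inputs = n_cols
for n_nodes in layer_sizes:
    total += dense_parameter_count(n_inputs, n_nodes)
    n_inputs = n_nodes     # this layer's nodes feed the next layer

print(total)
```

Every one of these values is adjusted by gradient descent during `fit`.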


### Saving and Loading a model

Because training can take a long time, once a model is trained you will want to save the model and reload it later to make predictions.  The models in this notebook are small - but even with small datasets this training can take some time.

#### Save
```python
# Save model for later
model.save('keras_titanic_model.h5')
```

#### Load
```python
from keras.models import load_model

model = load_model('keras_titanic_model.h5')
```

### Keras Regression Model

Use the hourly_wages dataset.

In [33]:
import pandas as pd
from keras.models import Sequential, load_model
from keras.layers import Dense
import numpy as np
import os

Using TensorFlow backend.


In [34]:
df = pd.read_csv('./data/hourly_wages.csv')
df.head()

Unnamed: 0,wage_per_hour,union,education_yrs,experience_yrs,age,female,marr,south,manufacturing,construction
0,8.75,0,12,9,27,0,0,0,0,0
1,11.35,1,12,17,35,0,1,0,0,0
2,11.5,1,12,19,37,0,0,0,1,0
3,6.5,0,8,27,41,0,1,1,0,0
4,6.25,1,9,30,45,0,0,1,0,0


**Note** we will see how to handle the categorical columns (union, construction, manufacturing) soon.

In [36]:
KERAS_WAGE_MODEL_H_ = 'keras_wage_model_1.h5'

df = pd.read_csv('./data/hourly_wages.csv')
df_test = pd.read_csv('./data/hourly_wages_test.csv')

print(df.head())
print(df.shape)

X = df.drop(columns=['wage_per_hour'])
y = df['wage_per_hour']

X_test = df_test.drop(columns=['wage_per_hour'])
y_test = df_test['wage_per_hour']

# number of features or number of inputs to the keras model
n_cols = X.shape[1]

if os.path.exists(KERAS_WAGE_MODEL_H_):
    model = load_model(KERAS_WAGE_MODEL_H_)
else:
    # Set up the model: model
    model = Sequential()

    # Add the first layer
    model.add(Dense(50, activation='relu', input_shape=(n_cols,)))

    # Add the second layer
    model.add(Dense(32, activation='relu'))

    # Add the output layer
    model.add(Dense(1))

    # compile the model
    model.compile(optimizer='adam', loss='mean_squared_error')

    # scaling data before fitting can ease optimization

    # fit will perform the backpropagation and gradient descent
    # fit(X_train, y_train)
    model.fit(X, y, epochs=100)

    model.save(KERAS_WAGE_MODEL_H_)

predictions = model.predict(X_test)
print(f'Actual vs Predictions:')
print(*list(zip(y_test, predictions.flatten())), sep="\n")
error = predictions.flatten() - y_test
print(f'RMSE: {np.sqrt(np.mean(np.square(error)))}')


   wage_per_hour  union  education_yrs  experience_yrs  age  female  marr  \
0           8.75      0             12               9   27       0     0   
1          11.35      1             12              17   35       0     1   
2          11.50      1             12              19   37       0     0   
3           6.50      0              8              27   41       0     1   
4           6.25      1              9              30   45       0     0   

   south  manufacturing  construction  
0      0              0             0  
1      0              0             0  
2      0              1             0  
3      1              0             0  
4      1              0             0  
(525, 10)
Actual vs Predictions:
(5.1, 4.514681)
(4.95, 5.8179517)
(6.67, 6.6517305)
(4.0, 7.4844313)
(7.5, 9.474591)
(13.07, 10.2209425)
(4.45, 7.370039)
(19.47, 8.56487)
(13.28, 12.070466)
RMSE: 4.13877512626986


### Keras Classification Model

For classification we need to change a couple of things.

- Set loss='categorical_crossentropy'
Similar to regression loss, the lower the number the better.

- Add metrics=['accuracy'] to compile options
This will provide familiar accuracy scores when fitting.

- Output layer has a separate node for each possible outcome

- Add activation='softmax' for the Output layer
Ensures the output sums to one so it can be interpreted as a probability.
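
To show why a lower categorical cross-entropy is better, here is a sketch of the formula in NumPy. It is an illustration of the idea rather than the Keras implementation, and the predicted probabilities are made up.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    # Mean over rows of -sum(true * log(predicted)); epsilon avoids log(0)
    eps = 1e-12
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# One-hot targets for three rows of a 2-class problem
y_true = np.array([[1, 0], [0, 1], [0, 1]])

# Hypothetical softmax outputs from a confident and a less confident model
confident = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]])
unsure = np.array([[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Both models pick the right class for most rows, but the one that assigns higher probability to the correct class gets the lower loss.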

#### OneHotEncode Target values


It is common to have a single column as the target or outcome.  For example in the *titanic* data set, the *survived* target column has a 1 for survived and a 0 for not.  For Keras, we would want to one hot encode these values such that there is a Survived_Yes and a Survived_No column with a 1 in the appropriate column.

Now our single-column target has become a multi-column target.

In the output layer, we would create a Dense object and specify the number of possible outputs, and in this example that would be 2.

Keras provides a function to perform the one-hot encoding called *to_categorical(y)*.

In [38]:
df = pd.DataFrame({'target': [0,1,2,2,1,0,0,0,0,1,2,1,2,2,2,1,1,1,0,0]})
df.head()

Unnamed: 0,target
0,0
1,1
2,2
3,2
4,1


In [43]:
from keras.utils import to_categorical

y_cat = to_categorical(df)
df_cat = pd.DataFrame(y_cat)
df_cat

Unnamed: 0,0,1,2
0,1.0,0.0,0.0
1,0.0,1.0,0.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,1.0,0.0
5,1.0,0.0,0.0
6,1.0,0.0,0.0
7,1.0,0.0,0.0
8,1.0,0.0,0.0
9,0.0,1.0,0.0
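
To see what the encoding does under the hood, here is a NumPy sketch that produces the same 0/1 layout for integer labels. It is an illustration and not the Keras implementation. It also shows how to get back from encoded rows (or predicted probabilities) to class labels with `argmax`.

```python
import numpy as np

labels = np.array([0, 1, 2, 2, 1])

n_classes = labels.max() + 1
# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[labels]
print(one_hot)

# Recover the original labels from the encoded rows
recovered = one_hot.argmax(axis=1)
print(recovered)
```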


#### Keras on the Titanic Dataset

Using the local titanic dataset, which has been cleaned up, use a Keras Deep Learning model to predict who will survive.

The structure of this model will be a single hidden layer with 100 nodes.



In [44]:
# Load the data and have a quick look at the results.
df = pd.read_csv('./data/titanic.csv')

print(df.head())

X = df.drop(columns=['survived'])
y = df['survived']

print(X.shape)
print(y.shape)
print(df.isnull().sum())


   survived  pclass   age  sibsp  parch     fare  male  age_was_missing  \
0         0       3  22.0      1      0   7.2500     1            False   
1         1       1  38.0      1      0  71.2833     0            False   
2         1       3  26.0      0      0   7.9250     0            False   
3         1       1  35.0      1      0  53.1000     0            False   
4         0       3  35.0      0      0   8.0500     1            False   

   embarked_from_cherbourg  embarked_from_queenstown  \
0                        0                         0   
1                        1                         0   
2                        0                         0   
3                        0                         0   
4                        0                         0   

   embarked_from_southampton  
0                          1  
1                          0  
2                          1  
3                          1  
4                          1  
(891, 10)
(891,)
survived 

In [49]:
import pandas as pd
from keras.layers import Dense
from keras.models import Sequential, load_model
from keras.utils import to_categorical
import numpy as np
import os

KERAS_TITANIC_MODEL_H_ = 'keras_titanic_model.h5'

df = pd.read_csv('./data/titanic.csv')

print(df.head())

X = df.drop(columns=['survived'])
y = df['survived']

print(X.shape)
print(y.shape)
print(df.isnull().sum())

n_input_features = X.shape[1]

# create a one-hot encoded version of the target output
y_categorical = to_categorical(y)
print(y_categorical)

if os.path.exists(KERAS_TITANIC_MODEL_H_):
    model = load_model(KERAS_TITANIC_MODEL_H_)
else:

    model = Sequential()

    # Layer 0 for Inputs, and 100 node hidden layer
    model.add(Dense(100, activation='relu', input_shape=(n_input_features,)))

    # Add the output layer.  2 because 2 possible outcomes
    model.add(Dense(2, activation='softmax'))

    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    # Fit model
    model.fit(X, y_categorical, epochs=200)

    # Save model for later
    model.save(KERAS_TITANIC_MODEL_H_)

"pclass,age,sibsp,parch,fare,male,age_was_missing,embarked_from_cherbourg,embarked_from_queenstown,embarked_from_southampton"
pred_data = np.array([[2, 34.0, 0, 0, 13.0, 1, False, 0, 0, 1],
                      [2, 31.0, 1, 1, 26.25, 0, False, 0, 0, 1],
                      [1, 11.0, 1, 2, 120.0, 1, False, 0, 0, 1],
                      [3, 0.42, 0, 1, 8.5167, 1, False, 1, 0, 0],
                      [3, 27.0, 0, 0, 6.975, 1, False, 0, 0, 1],
                      [3, 31.0, 0, 0, 7.775, 1, False, 0, 0, 1],
                      [1, 39.0, 0, 0, 0.0, 1, False, 0, 0, 1],
                      [3, 18.0, 0, 0, 7.775, 0, False, 0, 0, 1],
                      [2, 39.0, 0, 0, 13.0, 1, False, 0, 0, 1],
                      [1, 33.0, 1, 0, 53.1, 0, False, 0, 0, 1],
                      [3, 26.0, 0, 0, 7.8875, 1, False, 0, 0, 1],
                      [3, 39.0, 0, 0, 24.15, 1, False, 0, 0, 1],
                      [2, 35.0, 0, 0, 10.5, 1, False, 0, 0, 1],
                      [3, 6.0, 4, 2, 31.275, 0, False, 0, 0, 1],
                      [3, 30.5, 0, 0, 8.05, 1, False, 0, 0, 1],
                      [1, 29.69911764705882, 0, 0, 0.0, 1, True, 0, 0, 1],
                      [3, 23.0, 0, 0, 7.925, 0, False, 0, 0, 1],
                      [2, 31.0, 1, 1, 37.0042, 1, False, 1, 0, 0],
                      [3, 43.0, 0, 0, 6.45, 1, False, 0, 0, 1],
                      [3, 10.0, 3, 2, 27.9, 1, False, 0, 0, 1],
                      [1, 52.0, 1, 1, 93.5, 0, False, 0, 0, 1],
                      [3, 27.0, 0, 0, 8.6625, 1, False, 0, 0, 1],
                      [1, 38.0, 0, 0, 0.0, 1, False, 0, 0, 1],
                      [3, 27.0, 0, 1, 12.475, 0, False, 0, 0, 1],
                      [3, 2.0, 4, 1, 39.6875, 1, False, 0, 0, 1],
                      [3, 29.69911764705882, 0, 0, 6.95, 1, True, 0, 1, 0],
                      [3, 29.69911764705882, 0, 0, 56.4958, 1, True, 0, 0, 1],
                      [2, 1.0, 0, 2, 37.0042, 1, False, 1, 0, 0],
                      [3, 29.69911764705882, 0, 0, 7.75, 1, True, 0, 1, 0],
                      [1, 62.0, 0, 0, 80.0, 0, False, 0, 0, 0],
                      [3, 15.0, 1, 0, 14.4542, 0, False, 1, 0, 0],
                      [2, 0.83, 1, 1, 18.75, 1, False, 0, 0, 1],
                      [3, 29.69911764705882, 0, 0, 7.2292, 1, True, 1, 0, 0],
                      [3, 23.0, 0, 0, 7.8542, 1, False, 0, 0, 1],
                      [3, 18.0, 0, 0, 8.3, 1, False, 0, 0, 1],
                      [1, 39.0, 1, 1, 83.1583, 0, False, 1, 0, 0],
                      [3, 21.0, 0, 0, 8.6625, 1, False, 0, 0, 1],
                      [3, 29.69911764705882, 0, 0, 8.05, 1, True, 0, 0, 1],
                      [3, 32.0, 0, 0, 56.4958, 1, False, 0, 0, 1],
                      [1, 29.69911764705882, 0, 0, 29.7, 1, True, 1, 0, 0],
                      [3, 20.0, 0, 0, 7.925, 1, False, 0, 0, 1],
                      [2, 16.0, 0, 0, 10.5, 1, False, 0, 0, 1],
                      [1, 30.0, 0, 0, 31.0, 0, False, 1, 0, 0],
                      [3, 34.5, 0, 0, 6.4375, 1, False, 1, 0, 0],
                      [3, 17.0, 0, 0, 8.6625, 1, False, 0, 0, 1],
                      [3, 42.0, 0, 0, 7.55, 1, False, 0, 0, 1],
                      [3, 29.69911764705882, 8, 2, 69.55, 1, True, 0, 0, 1],
                      [3, 35.0, 0, 0, 7.8958, 1, False, 1, 0, 0],
                      [2, 28.0, 0, 1, 33.0, 1, False, 0, 0, 1],
                      [1, 29.69911764705882, 1, 0, 89.1042, 0, True, 1, 0, 0],
                      [3, 4.0, 4, 2, 31.275, 1, False, 0, 0, 1],
                      [3, 74.0, 0, 0, 7.775, 1, False, 0, 0, 1],
                      [3, 9.0, 1, 1, 15.2458, 0, False, 1, 0, 0],
                      [1, 16.0, 0, 1, 39.4, 0, False, 0, 0, 1],
                      [2, 44.0, 1, 0, 26.0, 0, False, 0, 0, 1],
                      [3, 18.0, 0, 1, 9.35, 0, False, 0, 0, 1],
                      [1, 45.0, 1, 1, 164.8667, 0, False, 0, 0, 1],
                      [1, 51.0, 0, 0, 26.55, 1, False, 0, 0, 1],
                      [3, 24.0, 0, 3, 19.2583, 0, False, 1, 0, 0],
                      [3, 29.69911764705882, 0, 0, 7.2292, 1, True, 1, 0, 0],
                      [3, 41.0, 2, 0, 14.1083, 1, False, 0, 0, 1],
                      [2, 21.0, 1, 0, 11.5, 1, False, 0, 0, 1],
                      [1, 48.0, 0, 0, 25.9292, 0, False, 0, 0, 1],
                      [3, 29.69911764705882, 8, 2, 69.55, 0, True, 0, 0, 1],
                      [2, 24.0, 0, 0, 13.0, 1, False, 0, 0, 1],
                      [2, 42.0, 0, 0, 13.0, 0, False, 0, 0, 1],
                      [2, 27.0, 1, 0, 13.8583, 0, False, 1, 0, 0],
                      [1, 31.0, 0, 0, 50.4958, 1, False, 0, 0, 1],
                      [3, 29.69911764705882, 0, 0, 9.5, 1, True, 0, 0, 1],
                      [3, 4.0, 1, 1, 11.1333, 1, False, 0, 0, 1],
                      [3, 26.0, 0, 0, 7.8958, 1, False, 0, 0, 1],
                      [1, 47.0, 1, 1, 52.5542, 0, False, 0, 0, 1],
                      [1, 33.0, 0, 0, 5.0, 1, False, 0, 0, 1],
                      [3, 47.0, 0, 0, 9.0, 1, False, 0, 0, 1],
                      [2, 28.0, 1, 0, 24.0, 0, False, 1, 0, 0],
                      [3, 15.0, 0, 0, 7.225, 0, False, 1, 0, 0],
                      [3, 20.0, 0, 0, 9.8458, 1, False, 0, 0, 1],
                      [3, 19.0, 0, 0, 7.8958, 1, False, 0, 0, 1],
                      [3, 29.69911764705882, 0, 0, 7.8958, 1, True, 0, 0, 1],
                      [1, 56.0, 0, 1, 83.1583, 0, False, 1, 0, 0],
                      [2, 25.0, 0, 1, 26.0, 0, False, 0, 0, 1],
                      [3, 33.0, 0, 0, 7.8958, 1, False, 0, 0, 1],
                      [3, 22.0, 0, 0, 10.5167, 0, False, 0, 0, 1],
                      [2, 28.0, 0, 0, 10.5, 1, False, 0, 0, 1],
                      [3, 25.0, 0, 0, 7.05, 1, False, 0, 0, 1],
                      [3, 39.0, 0, 5, 29.125, 0, False, 0, 1, 0],
                      [2, 27.0, 0, 0, 13.0, 1, False, 0, 0, 1],
                      [1, 19.0, 0, 0, 30.0, 0, False, 0, 0, 1],
                      [3, 29.69911764705882, 1, 2, 23.45, 0, True, 0, 0, 1],
                      [1, 26.0, 0, 0, 30.0, 1, False, 1, 0, 0],
                      [3, 32.0, 0, 0, 7.75, 1, False, 0, 1, 0]])

# Predictions are returned as probabilities, which is the most common way
# for data scientists to communicate their predictions to colleagues.
y_pred = model.predict(pred_data)
# Column 1 holds the predicted probability that the passenger survived.
probability_true = y_pred[:, 1]
for p in probability_true:
    survived = 'No'
    if p > 0.5:
        survived = 'Yes'
    print(f'{p:.2f} - {survived}')


   survived  pclass   age  sibsp  parch     fare  male  age_was_missing  \
0         0       3  22.0      1      0   7.2500     1            False   
1         1       1  38.0      1      0  71.2833     0            False   
2         1       3  26.0      0      0   7.9250     0            False   
3         1       1  35.0      1      0  53.1000     0            False   
4         0       3  35.0      0      0   8.0500     1            False   

   embarked_from_cherbourg  embarked_from_queenstown  \
0                        0                         0   
1                        1                         0   
2                        0                         0   
3                        0                         0   
4                        0                         0   

   embarked_from_southampton  
0                          1  
1                          0  
2                          1  
3                          1  
4                          1  
(891, 10)
(891,)
survived 

Epoch 154/200
Epoch 155/200
Epoch 156/200
Epoch 157/200
Epoch 158/200
Epoch 159/200
Epoch 160/200
Epoch 161/200
Epoch 162/200
Epoch 163/200
Epoch 164/200
Epoch 165/200
Epoch 166/200
Epoch 167/200
Epoch 168/200
Epoch 169/200
Epoch 170/200
Epoch 171/200
Epoch 172/200
Epoch 173/200
Epoch 174/200
Epoch 175/200
Epoch 176/200
Epoch 177/200
Epoch 178/200
Epoch 179/200
Epoch 180/200
Epoch 181/200
Epoch 182/200
Epoch 183/200
Epoch 184/200
Epoch 185/200
Epoch 186/200
Epoch 187/200
Epoch 188/200
Epoch 189/200
Epoch 190/200
Epoch 191/200
Epoch 192/200
Epoch 193/200
Epoch 194/200
Epoch 195/200
Epoch 196/200
Epoch 197/200
Epoch 198/200
Epoch 199/200
Epoch 200/200
0.11 - No
0.74 - Yes
0.77 - Yes
1.00 - Yes
0.11 - No
0.09 - No
0.05 - No
0.48 - No
0.12 - No
0.96 - Yes
0.10 - No
0.04 - No
0.12 - No
0.13 - No
0.09 - No
0.04 - No
0.53 - Yes
0.10 - No
0.04 - No
0.03 - No
0.93 - Yes
0.09 - No
0.05 - No
0.50 - No
0.10 - No
0.04 - No
0.20 - No
0.97 - Yes
0.06 - No
0.96 - Yes
0.67 - Yes
0.95 - Yes
0.08 - No
0.

## Model Optimization

Model optimization can be difficult because the model is simultaneously optimizing thousands of weights and parameters in a complex network of relationships.

It is possible, even very likely, that some updates will not measurably improve the model and therefore new weights and parameters will have to be tried.

If the learning rate is too small, these updates can take too long to reach a good solution. If the learning rate is too high, the updates can overshoot and never converge on an optimal solution.
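
The effect of the learning rate can be seen on a toy problem. The sketch below is not Keras code: it assumes a single weight `w` and a made-up loss `w**2`, and the function name `gradient_descent` is only for illustration. It runs plain gradient descent with three different learning rates.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Minimize the toy loss w**2 and return the weight after `steps` updates."""
    w = start
    for _ in range(steps):
        gradient = 2 * w                   # derivative of w**2
        w = w - learning_rate * gradient   # move against the gradient
    return w


# The minimum of the toy loss is at w = 0.
for lr in (0.001, 0.1, 1.1):
    print(f'learning rate {lr}: w after 20 steps = {gradient_descent(lr):.4f}')
```

With the smallest rate the weight has barely moved away from its starting value, with the middle rate it ends close to zero, and with the largest rate each step overshoots the minimum so the weight grows instead of shrinking.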



### Learning Rates and Stochastic Gradient Descent

You can specify the learning rate when you create an SGD optimizer.

```python
from keras.optimizers import SGD

learning_rate = 0.01

model.compile(optimizer=SGD(lr=learning_rate), loss='categorical_crossentropy', metrics=['accuracy'])
```

By specifying the optimizer this way, we can adjust the learning rate.

### Dying Neuron Problem

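This happens with the ReLU activation function when a node starts to receive negative inputs. ReLU outputs zero for any negative input and its gradient there is also zero, so the weights feeding the node stop being updated. The node may continue to receive negative inputs and its output will always be zero.

There are different activation functions that avoid this, but for now just be aware this can happen.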
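
A minimal sketch of why the node gets stuck, written in plain Python rather than Keras. Leaky ReLU is used here only as one example of a different activation function, it is not something the notebook uses elsewhere, and the slope `alpha=0.01` is an assumed value.

```python
def relu(x):
    """Standard ReLU: zero for negative inputs."""
    return max(0.0, x)


def relu_gradient(x):
    """Slope of ReLU: zero for negative inputs, so no weight update happens."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keeps a small slope (alpha, assumed value) for negative inputs."""
    return x if x > 0 else alpha * x


def leaky_relu_gradient(x, alpha=0.01):
    """Slope of leaky ReLU: small but non-zero for negative inputs."""
    return 1.0 if x > 0 else alpha


for node_input in (-3.0, -0.5, 2.0):
    print(f'input {node_input}: '
          f'relu={relu(node_input)}, relu gradient={relu_gradient(node_input)}, '
          f'leaky relu gradient={leaky_relu_gradient(node_input)}')
```

For the negative inputs the ReLU gradient is zero, which is what freezes the node, while the leaky variant still passes a small gradient back.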

### Model Validation

It is common to create training and testing datasets.  In many cases we use cross-validation to go through a dataset with a new hold-out set each time, to test how the model setup generalizes to unseen data.

However, deep learning is typically run on large datasets, so cross-validation is often not feasible because the computational expense is too large.

Keras exposes a *validation_split* parameter to handle this scenario without the overhead.

The parameter is passed to the *fit* method:
```python
model_1_training = model_1.fit(predictors, target, epochs=20, validation_split=0.4)
```

When you add *validation_split* you will see validation metrics in the output:

```python
0s 41us/step - loss: 0.4321 - acc: 0.8122 - val_loss: 0.4278 - val_acc: 0.8060
Epoch 56/2000
```


**acc** is the accuracy on the training data and **val_acc** is the accuracy on the held-out validation data, which the model does not train on.
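
To make the split concrete, here is a plain-Python sketch of how a fraction of the data can be held out. It assumes the behaviour described in the Keras documentation, where the validation samples are taken from the end of the data before any shuffling. The helper name `split_for_validation` and the toy `rows` list are hypothetical.

```python
def split_for_validation(samples, validation_split):
    """Hold out the last `validation_split` fraction of the samples."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


rows = list(range(10))  # stand-in for 10 rows of predictors
train_rows, val_rows = split_for_validation(rows, validation_split=0.4)
print('training rows:  ', train_rows)
print('validation rows:', val_rows)
```

Because the held-out rows come from the end of the data in this sketch, data that is sorted in some meaningful way would be worth shuffling before it is passed in.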


### Early Stopping

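When the validation score shows that the model is no longer improving, stop the training.  This allows us to specify a complex model architecture and a large number of epochs, knowing that once the model stops improving the training will stop.  The *patience* argument is the number of epochs without improvement to wait before stopping.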

```python
from keras.callbacks import EarlyStopping

early_stopping_monitor = EarlyStopping(patience=3)
model.fit(X, y_categorical, validation_split=0.3, callbacks=[early_stopping_monitor], epochs=2000)
```
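
The patience rule can be sketched in plain Python. This is an illustration of the idea and not the Keras implementation: the function name `epochs_run_with_patience` and the list of validation losses are made up for the example.

```python
def epochs_run_with_patience(val_losses, patience):
    """Return the epoch at which training stops under a patience rule."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never ran out of patience: all epochs were used.
    return len(val_losses)


# Hypothetical validation loss per epoch.
val_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.45, 0.48, 0.44]
stopped_at = epochs_run_with_patience(val_losses, patience=3)
print(f'training stopped after epoch {stopped_at}')
```

Note the trade-off: a small patience stops quickly but can miss a later improvement, such as the final value in the made-up list above, while a larger patience trains longer before giving up.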


## Conclusion

The main takeaways are:

- Why does deep learning work?

Captures the non-linear interactions between features from the data, minimizing the need for exhaustive feature engineering.

- Understanding forward propagation.

The application of an activation function to the dot product of the input values and weights into a node.

- Understanding back propagation

How the system updates the weights of the network by pushing the error back through the network.


- Understanding how to use Keras with a tensorflow backend

How to set up a network, save and load the model, compile the model and fit the network.
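
As a recap of the forward propagation takeaway, here is a minimal sketch of one forward pass through a tiny network with two inputs, two hidden nodes and one output. The weights and inputs are made-up values, ReLU is assumed as the hidden activation, and the output node is left without an activation to keep the arithmetic easy to follow.

```python
def relu(x):
    """ReLU activation: zero for negative values."""
    return max(0.0, x)


def node_output(inputs, weights):
    """Apply the activation to the dot product of inputs and weights."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return relu(weighted_sum)


inputs = [2.0, 3.0]
hidden_weights = [[1.0, 1.0], [-1.0, 1.0]]  # one weight list per hidden node
output_weights = [2.0, -1.0]

# Hidden layer: each node sees all of the inputs.
hidden = [node_output(inputs, weights) for weights in hidden_weights]

# Output node: dot product of the hidden values and the output weights.
prediction = sum(h * w for h, w in zip(hidden, output_weights))
print('hidden layer values:', hidden)
print('prediction:', prediction)
```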
