## Training the Network
This network does not always converge to an optimal solution. Depending on the random initialization of the weights and the value selected for the learning rate (alpha), the network may produce strange results. Many variations of the aspects described above produced accurate networks. Ultimately, a learning rate of 1 was used with a random seed of 4, 60,000 iterations, and a stopping criterion of 1e-15. This configuration consistently provides good results, as well as weights and activations that illustrate the underlying properties of the network in a relatively straightforward manner. Training the network takes less than 20 seconds, depending on the computer used, and the cross-entropy cost after the final iteration is 0.016892. Note that the weights and activation values shown are rounded to three decimal places so that they are easier to read. The tables below are based on the setup described; variations to the model would produce different results.
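
As a minimal sketch of the stopping criterion, the check inside the training loop can be pulled out into a small helper. The name `should_stop` and the parameter name `cost_ceiling` are hypothetical and exist only for this illustration. The 0.04 threshold is the one hard-coded in the `model` function further down.

```python
def should_stop(cost_prev, cost, epsilon=1e-15, cost_ceiling=0.04):
    """Return True when the cost has stalled and is already small.

    Hypothetical helper mirroring the check in the training loop:
    the change in cost must fall below `epsilon` AND the cost itself
    must be below `cost_ceiling`.
    """
    return abs(cost_prev - cost) < epsilon and cost < cost_ceiling


# A stalled cost that is still large does not stop training.
stalled_but_large = should_stop(0.5, 0.5)
# A small cost that is still changing does not stop training either.
small_but_moving = should_stop(0.016864, 0.016892)
# Only a small cost that has stopped changing triggers the early exit.
small_and_stalled = should_stop(0.016892, 0.016892)

print(stalled_but_large, small_but_moving, small_and_stalled)
```

With a tolerance as tight as 1e-15, the early exit is only reached when consecutive costs are essentially identical, so in practice the iteration limit is the more likely stopping condition.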


## Weights

The first striking aspect of this neural network is that, despite the simple inputs and outputs, the scale of the weights is quite high. The first layer contains weights with values in the 1000s, while the second layer has much smaller weights, in the 10s. The magnitude of the first-layer weights appears to be related to the accuracy of the final activation: the larger the weight, the larger the separation between different inputs and their corresponding outputs. In the first layer, the weight for a particular neuron is only used when it is multiplied by the binary input vector that has a 1 in the same position. In the second layer, however, every weight contributes to the output for every input instance, so extreme values there would cause large swings in the activation of the final layer.
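
To make the first-layer arithmetic concrete, here is a small sketch for input 0, assuming the rounded values from the "Weights Layer 1" and "Bias Layer 1" output at the end of this notebook. Because the input is one-hot, the product `W1 @ x` simply selects column 0 of `W1`, and the large weights are almost cancelled by equally large biases. The variable names are chosen for this illustration only.

```python
import numpy as np


def sigmoid(z):
    # For very negative z, exp(-z) overflows to inf and the result is exactly 0.0.
    with np.errstate(over="ignore"):
        return 1.0 / (1.0 + np.exp(-z))


# Column 0 of W1 and the bias vector b1, rounded as in the output tables.
w1_col0 = np.array([1091.841, -2000.182, 1383.171])
b1 = np.array([-1092.084, -1192.779, -1383.5])

# For the one-hot input with a 1 in position 0, Z1 = W1[:, 0] + b1.
z1 = w1_col0 + b1
a1 = sigmoid(z1)

print(np.round(z1, 3))
print(np.round(a1, 3))
```

The two positive weights land within a fraction of a unit of their biases, which gives the mid-range activations in the tables, while the large negative weight drives the remaining unit to zero. The sketch uses rounded inputs, so it can differ from the tables in the last digit.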


## Activation Values

Activation values for Layer 1 (rows are the three hidden units, columns are the eight input examples):

|   |0	    |1	    |2	    |3   	|4   	|5   	|6   	|7    |
|---|-------|-------|-------|-------|-------|-------|-------|-----|
|0	|0.440	|0.000	|0.000	|0.439	|0.466	|0.000	|0.000	|0.394|
|1	|0.000	|0.537	|0.340	|0.474	|0.405	|0.000	|0.116	|0.000|
|2	|0.418	|0.095	|0.532	|0.449	|0.000	|0.334	|0.000	|0.000|


The activation values for each layer are far more interesting. Because the desired outputs are the inputs themselves, this network is an autoencoder: it acts as the identity function for a set of inputs. More importantly, it can be used to find a compressed representation of the data that preserves only the necessary information. The layer 1 activations above show this at work. At first the values appear to be all over the place, but once the columns are ordered (6, 5, 1, 2, 7, 0, 4, 3) a pattern emerges. If the near-zero values in these vectors are set to 0 and the remaining values to 1, the binary numbers 0 through 7 emerge (see the table below, taking near-zero to mean < 0.12). This sequence carries the same information as our input (a single 1 shifted among 8 positions) in the fewest bits possible, since 3 bits are exactly enough to distinguish 8 cases. The 8 x 8 input can therefore be represented using only 3 x 8 values while keeping the desired information. This neural network acts as a mapping between the initial representation and the compressed form.


|   |6	    |5	    |1	    |2   	|7   	|0   	|4   	|3    |
|---|-------|-------|-------|-------|-------|-------|-------|-----|
|0	|0.000	|0.000	|0.000	|0.000	|0.394	|0.440	|0.466	|0.439|
|1	|0.116	|0.000	|0.537	|0.340	|0.000	|0.000	|0.405	|0.474|
|2	|0.000	|0.334	|0.095	|0.532	|0.000	|0.418	|0.000	|0.449|
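
The thresholding and reordering described above can be checked with a short sketch. It assumes the rounded layer 1 activations from the table, the 0.12 cut-off mentioned in the text, and that row 0 is the most significant bit. The variable names are illustrative only.

```python
import numpy as np

# Layer 1 activations (rows: hidden units, columns: input examples 0..7).
A1 = np.array([
    [0.440, 0.000, 0.000, 0.439, 0.466, 0.000, 0.000, 0.394],
    [0.000, 0.537, 0.340, 0.474, 0.405, 0.000, 0.116, 0.000],
    [0.418, 0.095, 0.532, 0.449, 0.000, 0.334, 0.000, 0.000],
])

threshold = 0.12  # values below this are treated as zero

# Binarize: 1 where the unit is active, 0 where it is near zero.
bits = (A1 >= threshold).astype(int)

# Read each column as a 3-bit number, with row 0 as the most significant bit.
codes = bits[0] * 4 + bits[1] * 2 + bits[2]

# Column order that sorts the codes from 0 to 7.
order = np.argsort(codes)

print(codes)
print(order)
print(bits[:, order])
```

Every input example receives a distinct 3-bit code, and sorting the columns by that code recovers the ordering used in the table.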

The final output shows just how close the network comes to reproducing the input. It is also clear, however, that many values are not exactly 0 or exactly 1. This indicates that this particular network has not reached the global optimum and could be trained further and improved. Such an improvement should also reinforce the properties described for the first activation layer.

Activation values for the final layer:

|   |0	    |1	    |2	    |3   	|4   	|5   	|6   	|7    |
|---|-------|-------|-------|-------|-------|-------|-------|-----|
|0	|0.994	|0.000	|0.000	|0.002	|0.000	|0.001	|0.000	|0.002|
|1	|0.000	|0.994	|0.002	|0.000	|0.002	|0.000	|0.002	|0.000|
|2	|0.000	|0.002	|0.992	|0.002	|0.000	|0.004	|0.000	|0.000|
|3	|0.002	|0.000	|0.003	|0.992	|0.003	|0.000	|0.000	|0.000|
|4	|0.000	|0.002	|0.000	|0.003	|0.991	|0.000	|0.000	|0.004|
|5	|0.002	|0.000	|0.004	|0.000	|0.000	|0.988	|0.006	|0.000|
|6	|0.000	|0.004	|0.000	|0.000	|0.000	|0.002	|0.990	|0.004|
|7	|0.001	|0.000	|0.000	|0.000	|0.003	|0.000	|0.004	|0.992|
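
As a rough check of how close the output is to the identity, the sketch below takes the rounded final activations from the table, picks the most active output for each input, and measures the largest deviation from the target matrix. The variable names are assumptions made for this illustration.

```python
import numpy as np

# Final-layer activations (rows: output units, columns: input examples).
A2 = np.array([
    [0.994, 0.000, 0.000, 0.002, 0.000, 0.001, 0.000, 0.002],
    [0.000, 0.994, 0.002, 0.000, 0.002, 0.000, 0.002, 0.000],
    [0.000, 0.002, 0.992, 0.002, 0.000, 0.004, 0.000, 0.000],
    [0.002, 0.000, 0.003, 0.992, 0.003, 0.000, 0.000, 0.000],
    [0.000, 0.002, 0.000, 0.003, 0.991, 0.000, 0.000, 0.004],
    [0.002, 0.000, 0.004, 0.000, 0.000, 0.988, 0.006, 0.000],
    [0.000, 0.004, 0.000, 0.000, 0.000, 0.002, 0.990, 0.004],
    [0.001, 0.000, 0.000, 0.000, 0.003, 0.000, 0.004, 0.992],
])

target = np.eye(8)  # the autoencoder should reproduce its one-hot input

# Index of the strongest output unit for each input example.
predicted = np.argmax(A2, axis=0)

# Largest absolute deviation from the ideal 0/1 target.
max_error = np.max(np.abs(A2 - target))

print(predicted)
print(round(float(max_error), 3))
```

Each input maps back to its own position, and the largest gap comes from the weakest diagonal entry (0.988).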


## Detailed Code Below:
 - Run each cell in order
 - The final cell trains the network and displays results

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, HTML

%matplotlib inline

In [2]:
def layer_sizes(X, Y, hidden_layer_size = 3):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    
    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    n_x = np.size(X,axis=0) # size of input layer
    n_h = hidden_layer_size
    n_y = np.size(Y, axis=0) # size of output layer
    
    return (n_x, n_h, n_y)


In [3]:
def sigmoid(z):
    """
    Compute the sigmoid of z

    Arguments:
    z -- A scalar or numpy array of any size.

    Return:
    -- sigmoid(z)
    """
    return np.power( 1 + np.exp( -z ), -1)

def softmax(z):
    """
    Compute the softmax values for the output layer
    
    Arguments:
    z -- A numpy array of any size
    
    Return:
    -- softmax(z)
    """
    
    ex = np.exp( z - np.max(z))
    return ex / ex.sum( axis = 0)

In [4]:
def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    
    Returns:
    params -- python dictionary containing parameters:
        W1 -- weight matrix of shape (n_h, n_x)
        b1 -- bias vector of shape (n_h, 1)
        W2 -- weight matrix of shape (n_y, n_h)
        b2 -- bias vector of shape (n_y, 1)
    """
    
    # INITIALIZE WEIGHTS AND BIAS FOR EACH LAYER
    W1 = np.random.randn( n_h * n_x).reshape( n_h, n_x) * 0.01  # small random weights
    b1 = np.zeros( (n_h, 1))                                    # biases start at zero
    W2 = np.random.randn( n_y * n_h).reshape( n_y, n_h) * 0.01  # small random weights
    b2 = np.zeros( (n_y, 1))                                    # biases start at zero
    
    
    # ASSERT DIMENSIONS ARE CORRECT
    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters

In [5]:
def forward_propagation(X, parameters, activation = 'sigmoid', output_type = 'binary'):
    """
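    Arguments:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing parameters 
    (output of initialization function)
    activation -- hidden-layer activation, 'sigmoid' (default) or 'tanh'
    output_type -- 'multi' applies softmax to the output layer; any other
                   value applies the chosen activation to the output layer
    
    Returns:
    A2 -- The output-layer activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"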
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    # Implement Forward Propagation to calculate A2 (probabilities)
    
    
    # LAYER ONE
    Z1 = np.dot( W1, X) + b1
    
    if activation == 'tanh':
        A1 = np.tanh( Z1)
    else:
        A1 = sigmoid( Z1)
    
    # LAYER TWO
    Z2 = np.dot( W2, A1) + b2
    
    if output_type == 'multi':
        A2 = softmax( Z2)
    else:
        if activation == 'tanh':
            A2 = np.tanh( Z2)
        else:
            A2 = sigmoid( Z2)
    
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    
    return A2, cache

In [6]:

def backward_propagation(parameters, cache, X, Y):
    """
    Arguments:
    parameters -- python dictionary containing parameters
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (n_x, number of examples)
    Y -- "true" labels of shape (n_y, number of examples)
    
    Returns:
    grads -- python dictionary containing gradients with respect to different parameters
    """
    m = X.shape[1]
    
    # First, retrieve W1 and W2 from the dictionary "parameters".
    W1 = parameters['W1']
    W2 = parameters['W2']
        
    # Retrieve also A1 and A2 from dictionary "cache".
    A1 = cache['A1']
    A2 = cache['A2']
    
    # Backward propagation: calculate dW1, db1, dW2, db2. 
    dZ2 = (A2 - Y)
    dW2 = ( 1 / m) * np.dot( dZ2, A1.T)
    db2 = ( 1 / m) * np.sum( dZ2, axis=1, keepdims=True)
    
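    # NOTE: (1 - A1**2) is the derivative of tanh. With the default sigmoid
    # hidden activation the exact factor would be A1 * (1 - A1). The line is
    # kept as written so that the results reported in this notebook stay reproducible.
    dZ1 = np.dot( W2.T, dZ2) * ( 1 - np.power( A1, 2))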
    dW1 = ( 1 / m) * np.dot( dZ1, X.T)
    db1 = ( 1 / m) * np.sum( dZ1, axis =1, keepdims=True)
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads

def backward_propagation_regularization(parameters, cache, X, Y, lmbda = 1.0):
    """
    Arguments:
    parameters -- python dictionary containing parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (n_x, number of examples)
    Y -- "true" labels of shape (n_y, number of examples)
    lmbda -- hyperparameter to give weight to the regularization term
    
    Returns:
    grads -- python dictionary containing gradients with respect to different parameters
    """
    m = X.shape[1]
    
    # First, retrieve W1 and W2 from the dictionary "parameters".
    W1 = parameters['W1']
    W2 = parameters['W2']
        
    # Retrieve also A1 and A2 from dictionary "cache".
    A1 = cache['A1']
    A2 = cache['A2']
    
    # Backward propagation: calculate dW1, db1, dW2, db2. 
    dZ2 = A2 - Y
    dW2 = ( 1 / m) * np.dot( dZ2, A1.T) + (lmbda * W2) / m
    db2 = ( 1 / m) * np.sum( dZ2, axis=1, keepdims=True)
    
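    # NOTE: (1 - A1**2) is the tanh derivative; for a sigmoid hidden layer
    # the exact factor would be A1 * (1 - A1).
    dZ1 = np.dot( W2.T, dZ2) * ( 1 - np.power( A1, 2))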
    dW1 = ( 1 / m) * np.dot( dZ1, X.T)  + (lmbda * W1) / m
    db1 = ( 1 / m) * np.sum( dZ1, axis =1, keepdims=True)
    
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return grads
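
The hidden-layer gradient above multiplies by `1 - A1**2`. As a small sanity check, independent of the network code, the sketch below compares finite-difference derivatives of sigmoid and tanh against the two closed forms. It assumes nothing beyond NumPy, and the helper and variable names are illustrative only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


z = np.linspace(-3.0, 3.0, 13)
h = 1e-6  # step for the central finite difference

# Numerical derivatives of both activation functions.
num_sigmoid = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
num_tanh = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)

a_sig = sigmoid(z)
a_tanh = np.tanh(z)

# Closed forms: sigmoid' = a * (1 - a), tanh' = 1 - a**2.
err_sigmoid = np.max(np.abs(num_sigmoid - a_sig * (1 - a_sig)))
err_tanh = np.max(np.abs(num_tanh - (1 - a_tanh ** 2)))

# Applying the tanh formula to sigmoid activations does not match.
mismatch = np.max(np.abs(num_sigmoid - (1 - a_sig ** 2)))

print(err_sigmoid, err_tanh, mismatch)
```

The first two errors are at the level of numerical noise, while the third is large. When the hidden layer uses sigmoid, the factor `1 - A1**2` is therefore not the exact gradient, which may be one reason training is sensitive to the seed and learning rate.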

In [7]:
def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Updates parameters using the gradient descent update rule
    
    Arguments:
    parameters -- python dictionary containing parameters 
    grads -- python dictionary containing gradients 
    
    Returns:
    parameters -- python dictionary containing updated parameters 
    """
    # Retrieve each parameter from the dictionary "parameters"
    
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    # Retrieve each gradient from the dictionary "grads"
    dW1 = grads['dW1']
    db1 = grads['db1']
    dW2 = grads['dW2']
    db2 = grads['db2']
    
    # Update rule for each parameter
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters

In [8]:
def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)
    
    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (n_y, number of examples)
    Y -- "true" labels of shape (n_y, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2
    
    Returns:
    cost -- cross-entropy cost given equation (13)
    """
    
    m = Y.shape[1] # number of example

    # Compute the cross-entropy cost
    logprobs = np.multiply( np.log( A2), Y) + np.multiply( np.log(1-A2), (1-Y))
    
    cost = -( 1 / m) * np.sum( logprobs)
    
    # CONFIRM DIMENSIONS
    cost = np.squeeze(cost)     # makes sure cost is the dimension we expect. 
                                # E.g., turns [[17]] into 17 
    assert(isinstance(cost, float))
    
    return cost

def compute_cost_regularization(A2, Y, parameters, lmbda = 1.0):
    """
    Computes the cross-entropy cost given in equation (13)
    
    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (n_y, number of examples)
    Y -- "true" labels of shape (n_y, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2
    lmbda -- hyperparameter to give weight to the regularization term

    Returns:
    cost -- cross-entropy cost given equation (13)
    """
    
    m = Y.shape[1] # number of example
    
    W1 = parameters['W1']
    W2 = parameters['W2']

    # Compute the cross-entropy cost
    logprobs = np.multiply( np.log( A2), Y) + np.multiply( np.log(1-A2), (1-Y))
    reg_cost = (np.sum( np.square(W1)) + np.sum( np.square(W2)))
    cost = -( 1 / m) * np.sum( logprobs) + (lmbda / (2* m)) * reg_cost
    
    # CONFIRM DIMENSIONS
    cost = np.squeeze(cost)     # makes sure cost is the dimension we expect. 
                                # E.g., turns [[17]] into 17 
    assert(isinstance(cost, float))
    
    return cost

In [9]:
def model(X, Y, n_h, num_iter = 10000, print_cost = False, output_type = 'binary', learning_rate = 1.2, epsilon = 1e-7, reg = False, lmbda = 1.0):
    """
    Arguments:
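    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (n_y, number of examples)
    n_h -- size of the hidden layer
    num_iter -- number of iterations in the gradient descent loop
    print_cost -- if True, print the cost every 5000 iterations
    output_type -- 'binary' or 'multi', passed on to forward_propagation
    learning_rate -- step size for the gradient descent update
    epsilon -- stop early once the cost changes by less than this between
               iterations (and the cost is below 0.04)
    reg -- if True, use the L2-regularized cost and gradients
    lmbda -- regularization strength, only used when reg is True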
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]
    
    # Initialize parameters, then retrieve W1, b1, W2, b2. 
    # Inputs: "n_x, n_h, n_y". 
    # Outputs = "W1, b1, W2, b2, parameters".
    
    parameters = initialize_parameters(n_x, n_h, n_y) 
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    
    cost_prev = 10000
   
    # Loop (gradient descent)
    for i in range(0, num_iter):
         
        # Forward propagation. 
        # Inputs: "X, parameters". 
        # Outputs: "A2, cache".
       
        A2, cache = forward_propagation(X, parameters, output_type = output_type)
        
        # Cost function. 
        # Inputs: "A2, Y, parameters". 
        # Outputs: "cost".
        if reg == True:
            cost = compute_cost_regularization(A2,Y,parameters, lmbda)
        else:
            cost = compute_cost(A2,Y,parameters)
     
        # Backpropagation. 
        # Inputs: "parameters, cache, X, Y". 
        # Outputs: "grads".
        if reg == True:
            grads = backward_propagation_regularization(parameters, cache, X, Y, lmbda)
        else: 
            grads = backward_propagation(parameters, cache, X, Y)
 
        # Gradient descent parameter update. 
        # Inputs: "parameters, grads". 
        # Outputs: "parameters".
        parameters = update_parameters(parameters, grads, learning_rate)
          
        # Print the cost every 5000 iterations
        if print_cost and i % 5000 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
        # Stop early once the cost has stalled and is already small
        if abs(cost_prev - cost) < epsilon and cost < 0.04:
            print ("Final cost after iteration %i: %f" %(i, cost))
            parameters['A1'] = cache['A1']
            parameters['A2'] = cache['A2']
            return parameters
        else:
            cost_prev = cost
    
    parameters['A1'] = cache['A1']
    parameters['A2'] = cache['A2']
    print ("Final cost after iteration %i: %f" %(num_iter, cost))
    return parameters

In [10]:
def predict(parameters, X, output_type = 'binary'):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters -- python dictionary containing parameters 
    X -- input data of size (n_x, m)
    
    Returns
    predictions -- vector of predictions of our model
    """
    
    # Computes probabilities using forward propagation
    A2, cache = forward_propagation( X, parameters, output_type = output_type)
    predictions = (A2 > 0.5)
    
    return predictions

In [None]:
# Regularized Model
'''
np.random.seed(7)
print("Regularized Model")
parameters_reg = model(
    train_set_x, 
    train_set_x, 
    n_h = 3, 
    num_iter = 40000, 
    print_cost = True,
    learning_rate = 1.0,
    epsilon = 1e-10,
    reg = True,
    lmbda = 0.001
)
display( pd.DataFrame( predict( parameters_reg, train_set_x) * 1) )

# Analysis of Weights for regularized model
W1 = pd.DataFrame( parameters_reg['W1'])
A1 = pd.DataFrame( parameters_reg['A1'])
b1 = pd.DataFrame( parameters_reg['b1'])
W2 = pd.DataFrame( parameters_reg['W2'])
A2 = pd.DataFrame( parameters_reg['A2'])
b2 = pd.DataFrame( parameters_reg['b2'])

print("Reg. Weights Layer 1:")
display(np.round(W1,3))
print("Reg. Activations Layer 1:")
display(np.round(A1,3))
print("Reg. Bias Layer 1:")
display(np.round(b1,3))

print("Reg. Weights Layer 2:")
display(np.round(W2,3))
print("Reg. Activations Layer 2:")
display(np.round(A2,3))
print("Reg. Bias Layer 2:")
display(np.round(b2,3))
'''

In [11]:
# Test
train_set_x = np.array(
    [[1,0,0,0,0,0,0,0],
    [0,1,0,0,0,0,0,0],
    [0,0,1,0,0,0,0,0],
    [0,0,0,1,0,0,0,0],
    [0,0,0,0,1,0,0,0],
    [0,0,0,0,0,1,0,0],
    [0,0,0,0,0,0,1,0],
    [0,0,0,0,0,0,0,1]]
)


np.random.seed(4)
parameters = model(
    train_set_x, 
    train_set_x, 
    n_h = 3, 
    num_iter = 60000, 
    print_cost = True,
    learning_rate = 1.0,
    epsilon = 1e-15,
    reg = False
)
display( pd.DataFrame( predict( parameters, train_set_x) * 1) )

# Analysis of Weights
W1 = pd.DataFrame( parameters['W1'])
A1 = pd.DataFrame( parameters['A1'])
b1 = pd.DataFrame( parameters['b1'])
W2 = pd.DataFrame( parameters['W2'])
A2 = pd.DataFrame( parameters['A2'])
b2 = pd.DataFrame( parameters['b2'])


# Values are rounded, for easy readability
print("Weights Layer 1:")
display(np.round(W1,3))
print("Activations Layer 1:")
display(np.round(A1,3))
print("Bias Layer 1:")
display(np.round(b1,3))

print("Weights Layer 2:")
display(np.round(W2,3))
print("Activations Layer 2:")
display(np.round(A2,3))
print("Bias Layer 2:")
display(np.round(b2,3))


Cost after iteration 0: 5.552610
Cost after iteration 5000: 0.142670
Cost after iteration 10000: 0.075613
Cost after iteration 15000: 0.280704
Cost after iteration 20000: 0.035790
Cost after iteration 25000: 0.028745
Cost after iteration 30000: 0.031650
Cost after iteration 35000: 0.012350
Cost after iteration 40000: 0.013594
Cost after iteration 45000: 0.015203
Cost after iteration 50000: 0.016312
Cost after iteration 55000: 0.016864
Final cost after iteration 60000: 0.016892


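|   |0 |1 |2 |3 |4 |5 |6 |7 |
|---|--|--|--|--|--|--|--|--|
|0  |1 |0 |0 |0 |0 |0 |0 |0 |
|1  |0 |1 |0 |0 |0 |0 |0 |0 |
|2  |0 |0 |1 |0 |0 |0 |0 |0 |
|3  |0 |0 |0 |1 |0 |0 |0 |0 |
|4  |0 |0 |0 |0 |1 |0 |0 |0 |
|5  |0 |0 |0 |0 |0 |1 |0 |0 |
|6  |0 |0 |0 |0 |0 |0 |1 |0 |
|7  |0 |0 |0 |0 |0 |0 |0 |1 |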


Weights Layer 1:


|   |0         |1         |2        |3        |4        |5         |6         |7         |
|---|----------|----------|---------|---------|---------|----------|----------|----------|
|0  |1091.841  |-1414.086 |-209.158 |1091.839 |1091.947 |-3234.671 |-601.468  |1091.653  |
|1  |-2000.182 |1192.926  |1192.115 |1192.674 |1192.394 |-3767.683 |1190.747  |-1385.775 |
|2  |1383.171  |1381.246  |1383.628 |1383.295 |-1185.74 |1382.81   |-4240.325 |-2871.586 |


Activations Layer 1:


|   |0     |1     |2     |3     |4     |5     |6     |7     |
|---|------|------|------|------|------|------|------|------|
|0  |0.44  |0.0   |0.0   |0.439 |0.466 |0.0   |0.0   |0.394 |
|1  |0.0   |0.537 |0.34  |0.474 |0.405 |0.0   |0.116 |0.0   |
|2  |0.418 |0.095 |0.532 |0.449 |0.0   |0.334 |0.0   |0.0   |


Bias Layer 1:


|   |0         |
|---|----------|
|0  |-1092.084 |
|1  |-1192.779 |
|2  |-1383.5   |


Weights Layer 2:


|   |0       |1       |2       |
|---|--------|--------|--------|
|0  |21.86   |-24.872 |24.072  |
|1  |-18.852 |29.302  |-13.033 |
|2  |-22.668 |12.602  |31.251  |
|3  |21.236  |21.644  |21.388  |
|4  |23.949  |21.203  |-25.695 |
|5  |-27.739 |-38.055 |15.408  |
|6  |-29.963 |-15.186 |-37.975 |
|7  |17.46   |-29.539 |-30.326 |


Activations Layer 2:


|   |0     |1     |2     |3     |4     |5     |6     |7     |
|---|------|------|------|------|------|------|------|------|
|0  |0.994 |0.0   |0.0   |0.002 |0.0   |0.001 |0.0   |0.002 |
|1  |0.0   |0.994 |0.002 |0.0   |0.002 |0.0   |0.002 |0.0   |
|2  |0.0   |0.002 |0.992 |0.002 |0.0   |0.004 |0.0   |0.0   |
|3  |0.002 |0.0   |0.003 |0.992 |0.003 |0.0   |0.0   |0.0   |
|4  |0.0   |0.002 |0.0   |0.003 |0.991 |0.0   |0.0   |0.004 |
|5  |0.002 |0.0   |0.004 |0.0   |0.0   |0.988 |0.006 |0.0   |
|6  |0.0   |0.004 |0.0   |0.0   |0.0   |0.002 |0.99  |0.004 |
|7  |0.001 |0.0   |0.0   |0.0   |0.003 |0.0   |0.004 |0.992 |


Bias Layer 2:


|   |0       |
|---|--------|
|0  |-14.626 |
|1  |-9.393  |
|2  |-16.06  |
|3  |-24.415 |
|4  |-14.999 |
|5  |-0.707  |
|6  |6.329   |
|7  |-2.029  |
