Jordan Totten 

1-15-2019

# Building a Neural Network in TensorFlow
### Solving a simple XOR problem

In [0]:
import tensorflow as tf
import time

## Model 1

* This model is set up step by step and serves as the baseline for Models 2 - 4

### Set up placeholders

*   Placeholders are the inputs of the network; TensorFlow fills them with the data we supply when we run the network
*   In this XOR problem, we have four training examples, and each example has two features
*   There are also four expected outputs, each with just one value (either 0 or 1)



In [0]:
x_ = tf.placeholder(tf.float32, shape=[4,2], name = 'x-input')
y_ = tf.placeholder(tf.float32, shape=[4,1], name = 'y-input')
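
As a quick sanity check on the shapes above, the sketch below builds the XOR truth table in plain Python (no TensorFlow needed) and confirms that it matches the placeholder shapes `[4,2]` and `[4,1]`. The helper name `shape_of` is made up for illustration and is not part of the notebook.

```python
# The four XOR training examples and their expected outputs.
XOR_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
XOR_Y = [[a ^ b] for a, b in XOR_X]  # ^ is Python's bitwise XOR

def shape_of(matrix):
    """Return (rows, columns) of a rectangular list of lists (hypothetical helper)."""
    return (len(matrix), len(matrix[0]))

print(shape_of(XOR_X))  # (4, 2)
print(shape_of(XOR_Y))  # (4, 1)
print(XOR_Y)            # [[0], [1], [1], [0]]
```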

### Set up parameters for the network

* These are called variables in TensorFlow
* Variables are modified by TensorFlow during the training steps
* We want the theta (weight) matrices initialized to random values between -1 and +1, so we use the built-in `random_uniform` function

In [0]:
Theta1 = tf.Variable(tf.random_uniform([2,2], -1, 1), name = "Theta1")
Theta2 = tf.Variable(tf.random_uniform([2,1], -1, 1), name = "Theta2")
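
To make the initialization concrete, here is a plain-Python sketch of what a uniform initializer produces for the two weight matrices. The function `random_uniform` below is a stand-in written for this example, not TensorFlow's implementation, and the fixed seed is only there to make the sketch repeatable. Random starting values matter because if both hidden units started with identical weights, they would receive identical gradients and could never learn different features.

```python
import random

def random_uniform(rows, cols, low=-1.0, high=1.0):
    """Plain-Python stand-in for a uniform initializer (hypothetical helper)."""
    return [[random.uniform(low, high) for _ in range(cols)] for _ in range(rows)]

random.seed(0)  # fixed seed only so the sketch is repeatable

theta1 = random_uniform(2, 2)  # weights from the 2 inputs to the 2 hidden units
theta2 = random_uniform(2, 1)  # weights from the 2 hidden units to the output

# Every weight should fall inside the requested range.
for row in theta1 + theta2:
    assert all(-1.0 <= w <= 1.0 for w in row)

print(theta1)
print(theta2)
```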

### Set up bias nodes

*  Bias nodes are set up separately, but still as Variables
*  This lets the training algorithm modify the values of the bias nodes
* This is mathematically equivalent to having a signal value of 1 and initial weights of 0 on the links from the bias nodes

In [0]:
Bias1 = tf.Variable(tf.zeros([2]), name = "Bias1")
Bias2 = tf.Variable(tf.zeros([1]), name = "Bias2")
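
The equivalence mentioned above can be checked numerically. The sketch below computes one layer's pre-activation two ways in plain Python: with a separate bias vector, and with the bias folded in as an extra input fixed at 1. The function names and the weight values are arbitrary choices for illustration.

```python
def affine(x, weights, bias):
    """Compute x @ weights + bias for one example x (a list of inputs)."""
    return [sum(xi * w[j] for xi, w in zip(x, weights)) + bias[j]
            for j in range(len(bias))]

def affine_with_bias_node(x, weights, bias):
    """Same computation, with the bias folded in as an extra input fixed at 1."""
    x_aug = x + [1.0]         # append the constant bias signal
    w_aug = weights + [bias]  # the bias values become the last row of weights
    return [sum(xi * w[j] for xi, w in zip(x_aug, w_aug))
            for j in range(len(bias))]

weights = [[0.5, -0.25], [1.5, 2.0]]  # arbitrary illustrative values
bias = [0.0, 0.0]                     # zeros, matching the initial Bias1
x = [1.0, 0.0]

print(affine(x, weights, bias))                 # [0.5, -0.25]
print(affine_with_bias_node(x, weights, bias))  # [0.5, -0.25]
```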

### Tensorflow Model

TensorFlow runs a model inside a *session*, which holds the state of the variables while the network is executed

*  Matmul is TensorFlow's matrix multiplication function
*  Sigmoid is the sigmoid activation function
*  The cost function is the cross-entropy, averaged over all the training examples
*  The training algorithm used is gradient descent with a learning rate of 0.01
*  The objective of the training algorithm is to minimize the cost function

In [0]:
# Hidden layer: 2 inputs -> 2 sigmoid units
with tf.name_scope("layer2") as scope:
	A2 = tf.sigmoid(tf.matmul(x_, Theta1) + Bias1)

# Output layer: 2 hidden units -> 1 sigmoid output
with tf.name_scope("layer3") as scope:
	Hypothesis = tf.sigmoid(tf.matmul(A2, Theta2) + Bias2)

# Mean binary cross-entropy over the four training examples
with tf.name_scope("cost") as scope:
	cost = tf.reduce_mean(( (y_ * tf.log(Hypothesis)) +
		((1 - y_) * tf.log(1.0 - Hypothesis)) ) * -1)

# Gradient descent with a learning rate of 0.01
with tf.name_scope("train") as scope:
	train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
  
XOR_X = [[0,0],[0,1],[1,0],[1,1]]
XOR_Y = [[0],[1],[1],[0]]

init = tf.global_variables_initializer()
sess = tf.Session()

#writer = tf.summary.FileWriter("./logs/xor_logs", sess.graph)

sess.run(init)
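
To get a feel for the cost values this model prints, the following plain-Python sketch evaluates the same cross-entropy formula by hand. The function name `cross_entropy` and the sample predictions are illustrative. One useful reference point: a network that outputs 0.5 for every example has a cost of ln(2), about 0.693, so a printed cost stuck near 0.693 suggests the network has not yet learned to separate the classes.

```python
import math

def cross_entropy(targets, predictions):
    """Mean binary cross-entropy, the same formula as the cost node above."""
    total = 0.0
    for y, h in zip(targets, predictions):
        total += -(y * math.log(h) + (1 - y) * math.log(1.0 - h))
    return total / len(targets)

targets = [0, 1, 1, 0]  # XOR outputs

# An undecided network that outputs 0.5 for every example.
print(cross_entropy(targets, [0.5, 0.5, 0.5, 0.5]))      # ~0.6931, i.e. ln(2)

# A confident network that is right on every example.
print(cross_entropy(targets, [0.01, 0.99, 0.99, 0.01]))  # ~0.01005
```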

### Training steps

*   Each time the training step is executed, the values in the dictionary "feed_dict" are loaded into the placeholders 
*   As the XOR problem is simple, each epoch will contain the entire training set
*  To see what's going on inside the loop, just print the values of the Variables
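
For readers who want to see what one training step does without TensorFlow, here is a minimal plain-Python sketch of full-batch gradient descent on the same 2-2-1 sigmoid network. It is a simplification under stated assumptions: the gradient is estimated by finite differences (TensorFlow computes exact gradients by automatic differentiation), the flat parameter layout and the starting weights are arbitrary choices for this example, and only 500 steps are run.

```python
import math

XOR_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
XOR_Y = [0, 1, 1, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(p):
    """Mean cross-entropy of a 2-2-1 sigmoid network over the whole training set.

    p is a flat list of 9 parameters: 4 hidden weights, 2 hidden biases,
    2 output weights, 1 output bias (a layout chosen for this sketch).
    """
    total = 0.0
    for (x1, x2), y in zip(XOR_X, XOR_Y):
        a1 = sigmoid(x1 * p[0] + x2 * p[2] + p[4])  # hidden unit 1
        a2 = sigmoid(x1 * p[1] + x2 * p[3] + p[5])  # hidden unit 2
        h = sigmoid(a1 * p[6] + a2 * p[7] + p[8])   # network output
        total += -(y * math.log(h) + (1 - y) * math.log(1.0 - h))
    return total / len(XOR_Y)

def numerical_gradient(f, p, eps=1e-6):
    """Central-difference estimate of the gradient (stand-in for autodiff)."""
    grad = []
    for i in range(len(p)):
        up = p[:i] + [p[i] + eps] + p[i + 1:]
        down = p[:i] + [p[i] - eps] + p[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Fixed starting weights so the sketch is repeatable; biases start at zero.
params = [0.5, -0.5, 0.3, 0.8, 0.0, 0.0, 0.7, -0.4, 0.0]
learning_rate = 0.01
initial_cost = cost(params)

for step in range(500):
    # Each step uses all four examples, so one step is one epoch.
    grad = numerical_gradient(cost, params)
    params = [p - learning_rate * g for p, g in zip(params, grad)]
    if step % 100 == 0:
        print('Epoch ', step, ' cost ', cost(params))

print('initial cost ', initial_cost, ' final cost ', cost(params))
```

With a learning rate this small the cost falls slowly, which is consistent with the gradual progress in the gradient descent output.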

In [19]:
display_step = 1000
t_start = time.perf_counter()  # time.clock() was removed in Python 3.8
for i in range(100000):
	sess.run(train_step, feed_dict={x_: XOR_X, y_: XOR_Y})
	if i % display_step == 0:
		print('Epoch ', i)
		print('Hypothesis ', sess.run(Hypothesis, feed_dict={x_: XOR_X, y_: XOR_Y}))
		print('Theta1 ', sess.run(Theta1))
		print('Bias1 ', sess.run(Bias1))
		print('Theta2 ', sess.run(Theta2))
		print('Bias2 ', sess.run(Bias2))
		print('cost ', sess.run(cost, feed_dict={x_: XOR_X, y_: XOR_Y}))
t_end = time.perf_counter()
print('Elapsed time ', t_end - t_start)

sess.close()

Epoch  0
Hypothesis  [[0.47025615]
 [0.47802955]
 [0.47609538]
 [0.4835917 ]]
Theta1  [[ 0.00044801 -0.21601436]
 [-0.14339478 -0.35495144]]
Bias1  [ 1.1300673e-05 -2.9248464e-05]
Theta2  [[ 0.19633265]
 [-0.43503365]]
Bias2  [0.0002309]
cost  0.69410974
Epoch  1000
Hypothesis  [[0.4930162 ]
 [0.5001736 ]
 [0.49881768]
 [0.5056674 ]]
Theta1  [[ 0.00164458 -0.22675171]
 [-0.14274408 -0.36269248]]
Bias1  [ 0.00327497 -0.0111783 ]
Theta2  [[ 0.22686814]
 [-0.40976214]]
Bias2  [0.06217931]
cost  0.6930344
Epoch  2000
Hypothesis  [[0.4934718]
 [0.5007502]
 [0.4994833]
 [0.5064262]]
Theta1  [[ 0.00120533 -0.2347917 ]
 [-0.14371558 -0.3678093 ]]
Bias1  [ 0.00334089 -0.01626994]
Theta2  [[ 0.22807842]
 [-0.41080216]]
Bias2  [0.06338622]
cost  0.6930219
Epoch  3000
Hypothesis  [[0.49330065]
 [0.50073564]
 [0.49954444]
 [0.50661296]]
Theta1  [[ 0.00069582 -0.24318355]
 [-0.14475238 -0.37325302]]
Bias1  [ 0.00331546 -0.02143289]
Theta2  [[ 0.22850539]
 [-0.41277915]]
Bias2  [0.06293681]
cost  0.6

## Model 2

Substitute the hyperbolic tangent (tanh) activation function for the sigmoid activation in layer 2

In [20]:
x_ = tf.placeholder(tf.float32, shape=[4,2], name = 'x-input')
y_ = tf.placeholder(tf.float32, shape=[4,1], name = 'y-input')

Theta1 = tf.Variable(tf.random_uniform([2,2], -1, 1), name = "Theta1")
Theta2 = tf.Variable(tf.random_uniform([2,1], -1, 1), name = "Theta2")

Bias1 = tf.Variable(tf.zeros([2]), name = "Bias1")
Bias2 = tf.Variable(tf.zeros([1]), name = "Bias2")

with tf.name_scope("layer2") as scope:
	A2 = tf.tanh(tf.matmul(x_, Theta1) + Bias1)

with tf.name_scope("layer3") as scope:
	Hypothesis = tf.sigmoid(tf.matmul(A2, Theta2) + Bias2)

with tf.name_scope("cost") as scope:
	cost = tf.reduce_mean(( (y_ * tf.log(Hypothesis)) + 
		((1 - y_) * tf.log(1.0 - Hypothesis)) ) * -1)

with tf.name_scope("train") as scope:
	train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
  
XOR_X = [[0,0],[0,1],[1,0],[1,1]]
XOR_Y = [[0],[1],[1],[0]]

init = tf.global_variables_initializer()
sess = tf.Session()

#writer = tf.summary.FileWriter("./logs/xor_logs", sess.graph)

sess.run(init)
display_step = 1000

t_start = time.perf_counter()  # time.clock() was removed in Python 3.8
for i in range(100000):
	sess.run(train_step, feed_dict={x_: XOR_X, y_: XOR_Y})
	if i % display_step == 0:
		print('Epoch ', i)
		print('Hypothesis ', sess.run(Hypothesis, feed_dict={x_: XOR_X, y_: XOR_Y}))
		print('Theta1 ', sess.run(Theta1))
		print('Bias1 ', sess.run(Bias1))
		print('Theta2 ', sess.run(Theta2))
		print('Bias2 ', sess.run(Bias2))
		print('cost ', sess.run(cost, feed_dict={x_: XOR_X, y_: XOR_Y}))
t_end = time.perf_counter()
print('Elapsed time ', t_end - t_start)

sess.close()

Epoch  0
Hypothesis  [[0.49980992]
 [0.41589496]
 [0.6382813 ]
 [0.58805686]]
Theta1  [[ 0.94302815  0.3937673 ]
 [-0.3073131   0.36079615]]
Bias1  [-4.5324466e-04  1.3191513e-05]
Theta2  [[ 0.8831058]
 [-0.2190313]]
Bias2  [-0.0003572]
cost  0.72648394
Epoch  1000
Hypothesis  [[0.47225457]
 [0.4371503 ]
 [0.5478319 ]
 [0.51673466]]
Theta1  [[ 0.66299474  0.38065934]
 [-0.15968814  0.26013136]]
Bias1  [-0.18491289 -0.04502728]
Theta2  [[ 0.6019857 ]
 [-0.20283751]]
Bias2  [-0.01015957]
cost  0.6988989
Epoch  2000
Hypothesis  [[0.48339903]
 [0.45600292]
 [0.5335261 ]
 [0.50557655]]
Theta1  [[ 0.5468912   0.36666772]
 [-0.16837347  0.18909024]]
Bias1  [-0.26688185 -0.08920848]
Theta2  [[ 0.502036  ]
 [-0.18682891]]
Bias2  [0.04784033]
cost  0.6945877
Epoch  3000
Hypothesis  [[0.49143887]
 [0.46089914]
 [0.53046244]
 [0.49418417]]
Theta1  [[ 0.4888986   0.35863787]
 [-0.26455644  0.13525271]]
Bias1  [-0.3563244  -0.12337602]
Theta2  [[ 0.46642473]
 [-0.18297783]]
Bias2  [0.10279547]
cost 

## Model 3

*   Substitute the Adam Optimizer for the Gradient Descent Optimizer training algorithm
*   Keep the tanh activation for layer 2
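
Unlike plain gradient descent, Adam scales each parameter's step by running averages of its gradient and squared gradient. The sketch below writes out the textbook Adam update for a single scalar in plain Python, using the learning rate 0.01 from this notebook and the commonly used defaults beta1 = 0.9, beta2 = 0.999 and eps = 1e-8. The function name `adam_step` and the gradient values are made up for illustration, and TensorFlow's internal formulation may differ in detail. The sketch also suggests why the bias values printed at epoch 0 of the Adam runs sit close to 0.01 in magnitude: on the first step Adam moves each parameter by roughly the learning rate, whatever the size of the gradient.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m and v are running averages of the gradient and the squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # undo the bias toward zero at early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from zero for two very different gradient sizes (made-up values).
for grad in (-0.3, -0.0003):
    param, m, v = adam_step(0.0, grad, m=0.0, v=0.0, t=1)
    print(grad, param)  # both parameters move by about 0.01, the learning rate
```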

In [22]:
x_ = tf.placeholder(tf.float32, shape=[4,2], name = 'x-input')
y_ = tf.placeholder(tf.float32, shape=[4,1], name = 'y-input')

Theta1 = tf.Variable(tf.random_uniform([2,2], -1, 1), name = "Theta1")
Theta2 = tf.Variable(tf.random_uniform([2,1], -1, 1), name = "Theta2")

Bias1 = tf.Variable(tf.zeros([2]), name = "Bias1")
Bias2 = tf.Variable(tf.zeros([1]), name = "Bias2")

with tf.name_scope("layer2") as scope:
	A2 = tf.tanh(tf.matmul(x_, Theta1) + Bias1)

with tf.name_scope("layer3") as scope:
	Hypothesis = tf.sigmoid(tf.matmul(A2, Theta2) + Bias2)

with tf.name_scope("cost") as scope:
	cost = tf.reduce_mean(( (y_ * tf.log(Hypothesis)) + 
		((1 - y_) * tf.log(1.0 - Hypothesis)) ) * -1)

with tf.name_scope("train") as scope:
	train_step = tf.train.AdamOptimizer(0.01).minimize(cost)
  
XOR_X = [[0,0],[0,1],[1,0],[1,1]]
XOR_Y = [[0],[1],[1],[0]]

init = tf.global_variables_initializer()
sess = tf.Session()

#writer = tf.summary.FileWriter("./logs/xor_logs", sess.graph)

sess.run(init)

display_step = 1000

t_start = time.perf_counter()  # time.clock() was removed in Python 3.8
for i in range(100000):
	sess.run(train_step, feed_dict={x_: XOR_X, y_: XOR_Y})
	if i % display_step == 0:
		print('Epoch ', i)
		print('Hypothesis ', sess.run(Hypothesis, feed_dict={x_: XOR_X, y_: XOR_Y}))
		print('Theta1 ', sess.run(Theta1))
		print('Bias1 ', sess.run(Bias1))
		print('Theta2 ', sess.run(Theta2))
		print('Bias2 ', sess.run(Bias2))
		print('cost ', sess.run(cost, feed_dict={x_: XOR_X, y_: XOR_Y}))
t_end = time.perf_counter()
print('Elapsed time ', t_end - t_start)

sess.close()

Epoch  0
Hypothesis  [[0.50373226]
 [0.45887563]
 [0.48469004]
 [0.44167027]]
Theta1  [[ 0.18327537 -0.238877  ]
 [-0.8056455  -0.2988128 ]]
Bias1  [0.00998853 0.00999981]
Theta2  [[0.0953771]
 [0.3976919]]
Bias2  [0.00999991]
cost  0.6966668
Epoch  1000
Hypothesis  [[0.00682797]
 [0.99053276]
 [0.9888236 ]
 [0.00582612]]
Theta1  [[ 3.7825935 -3.6295285]
 [-4.3004665  3.6410449]]
Bias1  [-1.8229561 -1.8795941]
Theta2  [[5.077916 ]
 [5.2123914]]
Bias2  [4.814736]
cost  0.008361549
Epoch  2000
Hypothesis  [[0.00196473]
 [0.99723107]
 [0.9967134 ]
 [0.0017097 ]]
Theta1  [[ 4.1762304 -4.0484457]
 [-4.695686   4.0510354]]
Bias1  [-2.0155435 -2.0797272]
Theta2  [[6.262021 ]
 [6.3874755]]
Bias2  [6.0041466]
cost  0.0024356577
Epoch  3000
Hypothesis  [[8.6080591e-04]
 [9.9877709e-01]
 [9.9854529e-01]
 [7.5722224e-04]]
Theta1  [[ 4.4033737 -4.2869887]
 [-4.9247766  4.2841187]]
Bias1  [-2.1271012 -2.1938312]
Theta2  [[7.0545974]
 [7.176002 ]]
Bias2  [6.800011]
cost  0.0010745281
Epoch  4000
Hypo

## Model 4

* Return to using the sigmoid activation function for layer 2
*  Continue using the Adam Optimizer learning algorithm

In [23]:
x_ = tf.placeholder(tf.float32, shape=[4,2], name = 'x-input')
y_ = tf.placeholder(tf.float32, shape=[4,1], name = 'y-input')

Theta1 = tf.Variable(tf.random_uniform([2,2], -1, 1), name = "Theta1")
Theta2 = tf.Variable(tf.random_uniform([2,1], -1, 1), name = "Theta2")

Bias1 = tf.Variable(tf.zeros([2]), name = "Bias1")
Bias2 = tf.Variable(tf.zeros([1]), name = "Bias2")

with tf.name_scope("layer2") as scope:
	A2 = tf.sigmoid(tf.matmul(x_, Theta1) + Bias1)

with tf.name_scope("layer3") as scope:
	Hypothesis = tf.sigmoid(tf.matmul(A2, Theta2) + Bias2)

with tf.name_scope("cost") as scope:
	cost = tf.reduce_mean(( (y_ * tf.log(Hypothesis)) + 
		((1 - y_) * tf.log(1.0 - Hypothesis)) ) * -1)

with tf.name_scope("train") as scope:
	train_step = tf.train.AdamOptimizer(0.01).minimize(cost)
  
XOR_X = [[0,0],[0,1],[1,0],[1,1]]
XOR_Y = [[0],[1],[1],[0]]

init = tf.global_variables_initializer()
sess = tf.Session()

#writer = tf.summary.FileWriter("./logs/xor_logs", sess.graph)

sess.run(init)

display_step = 1000

t_start = time.perf_counter()  # time.clock() was removed in Python 3.8
for i in range(100000):
	sess.run(train_step, feed_dict={x_: XOR_X, y_: XOR_Y})
	if i % display_step == 0:
		print('Epoch ', i)
		print('Hypothesis ', sess.run(Hypothesis, feed_dict={x_: XOR_X, y_: XOR_Y}))
		print('Theta1 ', sess.run(Theta1))
		print('Bias1 ', sess.run(Bias1))
		print('Theta2 ', sess.run(Theta2))
		print('Bias2 ', sess.run(Bias2))
		print('cost ', sess.run(cost, feed_dict={x_: XOR_X, y_: XOR_Y}))
t_end = time.perf_counter()
print('Elapsed time ', t_end - t_start)

sess.close()

Epoch  0
Hypothesis  [[0.69957685]
 [0.7607996 ]
 [0.67321956]
 [0.7357846 ]]
Theta1  [[-0.804798    0.25924182]
 [ 0.7657949   0.7502915 ]]
Bias1  [-0.00999994 -0.00999992]
Theta2  [[0.91473985]
 [0.8044231 ]]
Bias2  [-0.00999999]
cost  0.8006557
Epoch  1000
Hypothesis  [[0.05215491]
 [0.9653619 ]
 [0.95054054]
 [0.0505263 ]]
Theta1  [[-7.2156253  7.017381 ]
 [ 6.745837  -6.8389673]]
Bias1  [-3.7363427 -3.8511417]
Theta2  [[6.841599]
 [6.407635]]
Bias2  [-3.1926484]
cost  0.04784709
Epoch  2000
Hypothesis  [[0.01322027]
 [0.991104  ]
 [0.98755556]
 [0.01286287]]
Theta1  [[-8.401024   8.407022 ]
 [ 7.9277945 -8.218552 ]]
Bias1  [-4.3046594 -4.547114 ]
Theta2  [[9.494708]
 [9.096239]]
Bias2  [-4.534606]
cost  0.011928259
Epoch  3000
Hypothesis  [[0.00566371]
 [0.99618644]
 [0.9946631 ]
 [0.00552484]]
Theta1  [[-9.006233  9.077481]
 [ 8.530447 -8.888406]]
Bias1  [-4.601094  -4.8805394]
Theta2  [[11.139842]
 [10.746735]]
Bias2  [-5.359721]
cost  0.005097988
Epoch  4000
Hypothesis  [[0.002

## Summary
*  Model 2 proves to be the most accurate network, with a cost of ~0.003
*  Using tanh for the layer 2 activation function is an advantage because the layer 2 nodes are not restricted to an output range between 0 and 1
*  With tanh, the layer can better handle weight and bias inputs with negative values
*  The output from a layer using tanh will be between -1 and 1. The sigmoid function left in layer 3 works well with values in this range when producing a binary output.
*  Tanh is often a better activation than sigmoid for the hidden layers of a neural network.
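
The range difference between the two activations is easy to see numerically. The short plain-Python sketch below prints both functions at a few sample inputs (chosen arbitrarily) and checks the identity tanh(z) = 2 * sigmoid(2z) - 1, which shows that tanh is a rescaled sigmoid centered on zero: sigmoid stays between 0 and 1 and equals 0.5 at z = 0, while tanh stays between -1 and 1 and equals 0 at z = 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    s = sigmoid(z)
    t = math.tanh(z)
    # tanh is a shifted, rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
    assert abs(t - (2.0 * sigmoid(2.0 * z) - 1.0)) < 1e-12
    print(f"z={z:+.1f}  sigmoid={s:.4f}  tanh={t:+.4f}")
```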