# XOR as a Neural Network (from Ph. Koehn, Neural MT, part 5.7)

This notebook contains the first hands-on exercise from part 5.7 of Ph. Koehn's Neural MT book. It illustrates the non-linearity of neural networks by simulating the XOR operator, which cannot be simulated with a linear system.
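
To make the claim about linear systems concrete, here is a minimal sketch (not part of the book's exercise) that fits the best possible linear model to the four XOR cases with least squares. The names `X_xor`, `targets` and `w_linear` are introduced here for illustration only.

```python
import numpy as np

# The four XOR input pairs and their expected outputs.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 0], dtype=float)

# Append a constant column so the linear model can also learn a bias.
X_aug = np.hstack([X_xor, np.ones((4, 1))])

# Least-squares fit: minimizes ||X_aug @ w - targets||^2.
w_linear, *_ = np.linalg.lstsq(X_aug, targets, rcond=None)
linear_predictions = X_aug @ w_linear

print(linear_predictions)
```

The best linear fit predicts 0.5 for every input, so it cannot separate the cases where XOR is 1 from those where it is 0.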

We first import the required modules and define our small neural network: 2 inputs, one hidden layer of 2 units, and 1 output, with one bias feeding the hidden layer and one bias feeding the output layer:

In [2]:
import math
import numpy as np

# Input -> hidden layer weights and biases
W = np.array([[3,4],[2,3]])
b = np.array([-2, -4])
# Hidden layer -> output layer weights and biases
W2 = np.array([5, -5])
b2 = np.array([-2])


Then we define the sigmoid activation function and its derivative. Because math.exp only accepts scalars, we wrap both functions with np.vectorize so that they operate element-wise on arrays.

In [3]:
@np.vectorize
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

@np.vectorize
def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))
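
As an alternative sketch, the same two functions can be written with `np.exp`, which already works element-wise, so `np.vectorize` is not needed. The names `sigmoid_np` and `sigmoid_derivative_np` are hypothetical and chosen so they do not overwrite the notebook's own functions. A central finite difference is used to check the derivative formula numerically.

```python
import numpy as np

def sigmoid_np(x):
    # np.exp applies to every element of an array, no np.vectorize required.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative_np(x):
    sig = sigmoid_np(x)  # evaluate the sigmoid once and reuse it
    return sig * (1.0 - sig)

# Compare the analytic derivative with a central finite difference.
points = np.array([-2.0, 0.0, 1.0])
eps = 1e-6
numeric = (sigmoid_np(points + eps) - sigmoid_np(points - eps)) / (2 * eps)
analytic = sigmoid_derivative_np(points)

print(np.max(np.abs(numeric - analytic)))
```

The printed difference should be tiny, which supports the identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) used above.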

Then we define the input vector and the target output (for the input (1, 0), the output of XOR is 1).

In [4]:
x = np.array([1,0])
t = np.array([1])

## Forward computation

Forward computation is pretty straight...forward...

In [7]:
# Computing the hidden layer
s = W.dot(x) + b 
h = sigmoid(s) 

# Computing the output layer
z = W2.dot(h) + b2
y = sigmoid(z)

Let's check the value of the output layer. Not so bad (we expect 1):

In [9]:
y

array([0.7425526])
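
The same forward computation can be run on all four XOR inputs at once. The following is a sketch under the assumption that the weights are exactly those defined at the top of the notebook (they are repeated here so the block runs on its own). `sigmoid_batch`, `X_all`, `H_all` and `Y_all` are names introduced for this example.

```python
import numpy as np

def sigmoid_batch(v):
    # Element-wise logistic function built on np.exp.
    return 1.0 / (1.0 + np.exp(-v))

# Same weights and biases as defined at the top of the notebook.
W = np.array([[3, 4], [2, 3]])
b = np.array([-2, -4])
W2 = np.array([5, -5])
b2 = np.array([-2])

# One XOR input pair per row.
X_all = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

H_all = sigmoid_batch(X_all @ W.T + b)  # hidden activations, shape (4, 2)
Y_all = sigmoid_batch(H_all @ W2 + b2)  # network outputs, shape (4,)

for inputs, output in zip(X_all, Y_all):
    print(inputs, round(float(output), 4))
```

The outputs fall below 0.5 for (0, 0) and (1, 1) and above 0.5 for (0, 1) and (1, 0), so thresholding at 0.5 reproduces XOR even before any training step.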

## Backward computation

Backward computation calculates the gradients used to adjust the weights so that they give better results. The computations are more tedious. We first compute the squared error between the target t and the output y, and set a learning rate (µ).

In [10]:
error = 1/2 * (t - y) ** 2
mu = 1 # usually mu has a smaller value (like .001)

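We then compute the error term at the output layer (delta_2) and, from it, the updates for the weights and bias between the hidden layer and the output layer. Because delta_2 is built from (t - y), it is the negative of the error gradient with respect to z, so the resulting updates are meant to be added to the weights.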

In [18]:
delta_2 = (t - y) * sigmoid_derivative(z)
delta_W2 = mu * delta_2 * h
delta_b2 = mu * delta_2

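Second, we propagate the error term back to the hidden layer (delta_1) and compute the updates for the weights and bias between the input layer and the hidden layer.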

In [13]:
delta_1 = delta_2 * W2 * sigmoid_derivative(s)
delta_W = mu * np.array([delta_1]).T * x
delta_b = mu * delta_1
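
A common way to gain confidence in hand-derived gradients is to compare them with finite differences of the error. The sketch below is an addition to the exercise: it recomputes the analytic updates for mu = 1 and checks them against numerical derivatives. All names ending in `_chk`, as well as `sigmoid_fn` and `forward_error`, are hypothetical helpers introduced here.

```python
import numpy as np

def sigmoid_fn(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_error(W, b, W2, b2, x, t):
    # Squared error 1/2 (t - y)^2 of the network for one example.
    h = sigmoid_fn(W.dot(x) + b)
    y = sigmoid_fn(W2.dot(h) + b2)
    return (0.5 * (t - y) ** 2).item()

# Notebook values, stored as floats so they can be perturbed by eps.
W_chk = np.array([[3.0, 4.0], [2.0, 3.0]])
b_chk = np.array([-2.0, -4.0])
W2_chk = np.array([5.0, -5.0])
b2_chk = np.array([-2.0])
x_chk = np.array([1.0, 0.0])
t_chk = np.array([1.0])

# Analytic updates, following the same formulas as the cells above (mu = 1).
h_chk = sigmoid_fn(W_chk.dot(x_chk) + b_chk)
y_chk = sigmoid_fn(W2_chk.dot(h_chk) + b2_chk)
d2_chk = (t_chk - y_chk) * y_chk * (1 - y_chk)
d1_chk = d2_chk * W2_chk * h_chk * (1 - h_chk)
analytic_W2 = d2_chk * h_chk
analytic_W = np.outer(d1_chk, x_chk)

# Numerical gradients of the error by central differences.
eps = 1e-6
numeric_W2 = np.zeros_like(W2_chk)
for i in range(W2_chk.size):
    step = np.zeros_like(W2_chk)
    step[i] = eps
    numeric_W2[i] = (forward_error(W_chk, b_chk, W2_chk + step, b2_chk, x_chk, t_chk)
                     - forward_error(W_chk, b_chk, W2_chk - step, b2_chk, x_chk, t_chk)) / (2 * eps)

numeric_W = np.zeros_like(W_chk)
for i in range(W_chk.shape[0]):
    for j in range(W_chk.shape[1]):
        step = np.zeros_like(W_chk)
        step[i, j] = eps
        numeric_W[i, j] = (forward_error(W_chk + step, b_chk, W2_chk, b2_chk, x_chk, t_chk)
                           - forward_error(W_chk - step, b_chk, W2_chk, b2_chk, x_chk, t_chk)) / (2 * eps)

# The updates are the negative gradient, hence the sign flip.
print(np.allclose(analytic_W2, -numeric_W2, atol=1e-7))
print(np.allclose(analytic_W, -numeric_W, atol=1e-7))
```

Both comparisons should print True. The sign flip reflects that the deltas point in the direction that reduces the error, which is the opposite of the gradient.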

Now is a good time to observe the gradients we computed.

In [20]:
delta_2

array([0.04921577])

In [22]:
delta_W2

array([0.03597961, 0.00586666])

In [19]:
delta_b2

array([0.04921577])

In [24]:
display(delta_1)
display(delta_W)
delta_b

array([ 0.04838203, -0.0258367 ])

array([[ 0.04838203,  0.        ],
       [-0.0258367 , -0.        ]])

array([ 0.04838203, -0.0258367 ])
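
The notebook stops at the computed updates. As a possible next step, the sketch below adds them to the weights and reruns the forward computation to see whether the output moves toward the target. It is an illustration under the same values as above, not the book's code. `forward_pass`, `sigmoid_upd` and the `_before` / `_after` names are introduced here.

```python
import numpy as np

def sigmoid_upd(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward_pass(W, b, W2, b2, x):
    # Returns the hidden activations and the network output.
    h = sigmoid_upd(W.dot(x) + b)
    y = sigmoid_upd(W2.dot(h) + b2)
    return h, y

# Notebook values, stored as floats so the weights can take fractional values.
W_before = np.array([[3.0, 4.0], [2.0, 3.0]])
b_before = np.array([-2.0, -4.0])
W2_before = np.array([5.0, -5.0])
b2_before = np.array([-2.0])
x_in = np.array([1.0, 0.0])
t_out = np.array([1.0])
rate = 1.0  # same learning rate as mu in the notebook

# Forward computation and error terms before the update.
h_before, y_before = forward_pass(W_before, b_before, W2_before, b2_before, x_in)
d2 = (t_out - y_before) * y_before * (1 - y_before)
d1 = d2 * W2_before * h_before * (1 - h_before)

# Add the updates to the weights and biases.
W_after = W_before + rate * np.outer(d1, x_in)
b_after = b_before + rate * d1
W2_after = W2_before + rate * d2 * h_before
b2_after = b2_before + rate * d2

# Forward computation after the update.
_, y_after = forward_pass(W_after, b_after, W2_after, b2_after, x_in)
error_before = 0.5 * (t_out - y_before) ** 2
error_after = 0.5 * (t_out - y_after) ** 2

print(y_before, y_after)
print(error_before, error_after)
```

After one step the output for the input (1, 0) is closer to the target 1 and the error is smaller. A full training loop would repeat this over all four XOR cases.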