# XOR Analysis

### XOR analysis using dense, fully connected neural networks

ENGG 192 - Dartmouth College <br>
Winter 2019 <br>
Spencer Bertsch

Many of the ideas for this code were taken directly from the sources listed below, chiefly "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. The point here is not to solve a difficult, existing problem, but rather to develop an intuition for deep feedforward neural networks by solving an intentionally trivial problem that requires only a small amount of training data. In "Deep Learning", the authors discuss the importance of deploying a very simple feedforward network, beginning with one layer and eventually growing to two total layers (one hidden layer). By performing this exercise, engineers can understand how a single function such as logistic regression, which can be thought of as a single neuron, can be built into a system of neurons that learns representations of more complex functions.

**Sources** <br> 
The main source for this notebook is the textbook "Deep Learning" by Ian Goodfellow and Yoshua Bengio and Aaron Courville. 

@book{Goodfellow-et-al-2016, <br>
    title={Deep Learning}, <br>
    author={Ian Goodfellow and Yoshua Bengio and Aaron Courville}, <br>
    publisher={MIT Press}, <br>
    note={\url{http://www.deeplearningbook.org}}, <br>
    year={2016} <br>
}

Other sources include: <br> 
https://gist.github.com/RichardKelley/17ef5f2291c273de11540c33dc1bfbf2 <br> 
https://github.com/hunkim/PyTorchZeroToAll <br> 
https://github.com/hunkim/PyTorchZeroToAll/blob/master/06_logistic_regression.py <br> 

The three sources above were extremely helpful for familiarizing myself with PyTorch and the XOR problem. 

https://pytorch.org/ <br> 
https://medium.com/dair-ai/a-simple-neural-network-from-scratch-with-pytorch-and-google-colab-c7f3830618e0 <br>
https://www.youtube.com/watch?v=GAKTBQn7yKo&t=516s <br>
https://www.youtube.com/watch?v=113b7O3mabY <br> 


## Imports

In [25]:
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable #<-- deprecated since PyTorch 0.4: Variable(t) now simply returns a Tensor
import torch.nn.functional as F
import torch.optim as optim

# XOR Function

Source - "Deep Learning", Goodfellow, Bengio and Courville - Chapter 6.1

The XOR (exclusive or) function takes two binary values (x1 and x2) as its inputs and provides a binary output. The function itself is very simple: if exactly one of the inputs (x1 or x2, but not both) is equal to 1, then the XOR function returns a one; otherwise it returns a zero. 
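
As a quick check of the definition, the truth table can be enumerated in plain Python. This is a minimal sketch; the helper name `xor` is introduced here for illustration and is not part of the original notebook.

```python
def xor(x1: int, x2: int) -> int:
    """Return 1 when exactly one of the two binary inputs is 1, else 0."""
    return int(x1 != x2)

# Enumerate all four input combinations.
truth_table = [(x1, x2, xor(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
for x1, x2, y in truth_table:
    print(f"f*({x1}, {x2}) = {y}")
```

These four rows are the full domain of f*(x), which is why the training set later in the notebook has exactly four examples.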

Our goal here will be to develop a feedforward neural network with the function y = f(x;θ) and train the network on examples so that it can learn the parameters in θ. The end goal will be to find the correct values in θ so that our function y = f(x;θ) can closely approximate y = f*(x), the true XOR function. 
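
To see why a hidden layer is needed at all, it helps to check what a purely linear model can do on this data. Section 6.1 of "Deep Learning" discusses this case: fitting f(x; w, b) = w1·x1 + w2·x2 + b with mean squared error leads to zero weights and a bias of 0.5. The sketch below is not from the notebook (the names `linear` and `mse_and_grad` are assumptions made for illustration). It verifies that the gradient vanishes at that point, which for this convex loss means no linear model does better.

```python
# The four XOR examples: inputs and targets.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [0.0, 1.0, 1.0, 0.0]

def linear(x, w1, w2, b):
    """A purely linear model f(x; w, b) = w1*x1 + w2*x2 + b."""
    return w1 * x[0] + w2 * x[1] + b

def mse_and_grad(w1, w2, b):
    """Mean squared error over the four examples, and its gradient."""
    n = len(X)
    residuals = [linear(x, w1, w2, b) - y for x, y in zip(X, Y)]
    loss = sum(r * r for r in residuals) / n
    grad_w1 = 2.0 / n * sum(r * x[0] for r, x in zip(residuals, X))
    grad_w2 = 2.0 / n * sum(r * x[1] for r, x in zip(residuals, X))
    grad_b = 2.0 / n * sum(residuals)
    return loss, (grad_w1, grad_w2, grad_b)

# Candidate optimum: zero weights and a bias of 0.5.
loss, grad = mse_and_grad(0.0, 0.0, 0.5)
outputs = [linear(x, 0.0, 0.0, 0.5) for x in X]
print("loss:", loss)
print("gradient:", grad)
print("outputs:", outputs)
```

The best linear fit outputs 0.5 for every input, so it carries no information about XOR. The nonlinearity in the hidden layer is what lets the network escape this.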

### Define the neural network dedicated to learning the XOR function f*(x)


In [26]:
class XorNet(nn.Module):
    def __init__(self):
        """
        Construct the neural network model.

        super().__init__() calls the nn.Module constructor; "super can be used
        to refer to parent classes without naming them explicitly"
        (docs.python.org/2/library/functions.html#super).
        """
        super().__init__()
        
        # hidden layer: 2 inputs -> 3 neurons
        self.fc1 = nn.Linear(2,3)
        # output layer: 3 hidden activations -> 1 output
        self.fc2 = nn.Linear(3,1) 
        
    def forward(self, x):
        """
        Forward pass: apply the hidden layer with a ReLU activation, then the
        linear output layer (no activation is applied to the output).
        """
        x = F.relu(self.fc1(x)) #<-- ReLU from torch.nn.functional, imported above as F
        y_pred = self.fc2(x)
        return y_pred
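
A network of this shape can represent XOR exactly. The sketch below evaluates the hand-picked solution given in section 6.1 of "Deep Learning" in plain Python. Note the assumptions: it uses two hidden units rather than the three in `XorNet`, the parameter names `W`, `c`, `w`, `b` follow the book rather than the notebook, and the values are chosen by hand, not learned.

```python
def relu(z):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in z]

# Hand-picked parameters (not learned).
W = [[1.0, 1.0],
     [1.0, 1.0]]   # hidden-layer weights, indexed [input][hidden unit]
c = [0.0, -1.0]    # hidden-layer biases
w = [1.0, -2.0]    # output-layer weights
b = 0.0            # output-layer bias

def forward(x):
    """Compute f(x) = w . relu(W^T x + c) + b for one input pair."""
    pre = [sum(x[i] * W[i][j] for i in range(2)) + c[j] for j in range(2)]
    h = relu(pre)
    return sum(w[j] * h[j] for j in range(2)) + b

predictions = {x: forward(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
for x, y in predictions.items():
    print(x, "->", y)
```

The trained `XorNet` will generally settle on different weights, since many parameter settings reproduce the same four outputs.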

Now we can define our model and several hyperparameters, including the loss function, optimizer, number of epochs, and minibatch size. Next week we will go over regularization techniques and how to tune these choices to create the model which best approximates the function f*(x). 

In [27]:
model = XorNet() #<-- Use the model created above 
#We use mean squared error as our loss function because the problem at hand is very simple 
loss_fn = nn.MSELoss()
#Adam is a very strong generic optimizer, so we can use it here
optimizer = optim.Adam(model.parameters(), lr=1e-3)

training_epochs = 3000
minibatch_size = 32 #<-- not used by the training loop, which feeds all four examples at once

We can now generate our training data, the four input pairs {(0,0), (0,1), (1,0), (1,1)} together with their XOR labels, and stack them to create the X_train and y_train arrays. 

In [28]:
# input-output pairs
pairs = [(np.asarray([0.0,0.0]), [0.0]),
         (np.asarray([0.0,1.0]), [1.0]),
         (np.asarray([1.0,0.0]), [1.0]),
         (np.asarray([1.0,1.0]), [0.0])]

#np.vstack stacks vectors row-wise, similar to vertcat in MATLAB 
X_train = np.vstack([x[0] for x in pairs]) 
y_train = np.vstack([x[1] for x in pairs])

And now we can finally train our model. We first run the inputs through the model to generate y_pred, then we calculate the loss (or error) between y_pred and the true labels. Because this is a simple problem, we can just use the mean squared error function for the loss. 

Remember that the loss function is simply the way the network measures the error of its output. Backpropagation turns that error into gradients with respect to the weights and biases (W, b), which make up θ and are the key learned parameters in the network, and the optimizer then uses those gradients to update them. 


$J(\theta) = \frac{1}{4}\sum_{x}\left(f^*(x) - f(x;\theta)\right)^2$

Here $J(\theta)$ is the error averaged over the four training inputs. Another popular loss function used for calculating the error between the predicted and the true value inside the network is the mean absolute error function. 

$J(θ) = (1/4)SUM|f*(x) - f(x;θ)|$

This function would also work well for this application; the only difference is that the gap between the true function $f^*(x)$ and the approximated function $f(x;\theta)$ enters as an absolute value instead of being squared. 
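
The two losses can be computed by hand to see how they weigh errors differently. This is a minimal sketch in plain Python; the prediction values are hypothetical, chosen only for illustration, and the helper names `mse` and `mae` are not part of the notebook.

```python
def mse(targets, predictions):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def mae(targets, predictions):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)

targets = [0.0, 1.0, 1.0, 0.0]      # f*(x) for (0,0), (0,1), (1,0), (1,1)
predictions = [0.1, 0.9, 0.8, 0.2]  # hypothetical outputs f(x; theta)

mse_value = mse(targets, predictions)
mae_value = mae(targets, predictions)
print("MSE:", mse_value)
print("MAE:", mae_value)
```

With these numbers the MSE comes out near 0.025 and the MAE near 0.15. Squaring shrinks errors smaller than 1, so MSE penalizes small misses less than MAE does and large misses more.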


There are many other loss functions which can be used to achieve specific goals. Binary cross-entropy, for example, is often used for binary classification problems. 
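
For comparison, binary cross-entropy can be sketched in a few lines of plain Python. It assumes the model outputs a probability strictly between 0 and 1. That is not true of `XorNet` as defined above, whose output layer is linear, so using this loss would require an added sigmoid (or, in PyTorch, `nn.BCEWithLogitsLoss`). The probability values below are hypothetical.

```python
import math

def binary_cross_entropy(targets, probabilities):
    """Mean of -[y*log(p) + (1 - y)*log(1 - p)]; requires 0 < p < 1."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(targets)

targets = [0.0, 1.0, 1.0, 0.0]
uninformed = [0.5, 0.5, 0.5, 0.5]     # a model that always answers "50/50"
confident = [0.01, 0.99, 0.99, 0.01]  # a model that is nearly certain and correct

loss_uninformed = binary_cross_entropy(targets, uninformed)
loss_confident = binary_cross_entropy(targets, confident)
print("uninformed:", loss_uninformed)
print("confident: ", loss_confident)
```

The uninformed model scores log(2), about 0.693, and the loss approaches zero as the predicted probabilities approach the true labels.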

In [29]:
for i in range(training_epochs):
        
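    # each pass uses all four examples, so every epoch makes four full-batch updates
    for batch_ind in range(4):
        # wrap the data in tensors (Variable is a no-op wrapper in PyTorch >= 0.4)
        minibatch_state_var = Variable(torch.Tensor(X_train))
        minibatch_label_var = Variable(torch.Tensor(y_train))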
                
        # forward pass
        y_pred = model(minibatch_state_var)
        
        # compute the MSE loss (loss_fn, defined above) between the predictions and the true labels
        loss = loss_fn(y_pred, minibatch_label_var)

        # clear the gradients left over from the previous update before backpropagating
        optimizer.zero_grad()
        
        # backwards pass
        loss.backward()
        
        # step the optimizer - update the weights
        optimizer.step()
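
The reset, backward, step cycle in the loop above can be illustrated without PyTorch on a one-parameter model. This is a sketch under stated assumptions: it uses plain gradient descent rather than Adam, a toy dataset generated by y = 2x rather than XOR, and a hand-derived gradient in place of autograd. The learning rate and step count are arbitrary choices.

```python
# Toy data generated by y = 2x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # the single learned parameter
learning_rate = 0.05  # assumed value, chosen so the updates converge
n = len(xs)

for step in range(200):
    grad = 0.0  # reset the gradient (the role of optimizer.zero_grad())
    loss = 0.0
    for x, y in zip(xs, ys):
        pred = w * x                       # forward pass
        loss += (pred - y) ** 2 / n        # mean squared error
        grad += 2.0 * (pred - y) * x / n   # d(loss)/dw, accumulated (the role of loss.backward())
    w -= learning_rate * grad              # update the weight (the role of optimizer.step())

print("learned w:", w)
print("final loss:", loss)
```

If the gradient were not reset at the top of each step, contributions from earlier steps would accumulate into the update. PyTorch accumulates gradients in the same way, which is why the training loop calls `optimizer.zero_grad()` on every iteration.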

We can now test the model by providing each of the four possible function inputs and observing the model predictions. 

In [30]:
#Test the model with each of our four inputs: 
input_1 = Variable(torch.Tensor([0,0])) # Variable(...) simply returns the tensor it wraps
input_2 = Variable(torch.Tensor([1,0]))
input_3 = Variable(torch.Tensor([0,1]))
input_4 = Variable(torch.Tensor([1,1]))
        
#Make predictions on the inputs 
pred_1 = model(input_1)
pred_2 = model(input_2)
pred_3 = model(input_3)
pred_4 = model(input_4)

#And finally print the results of prediction
print("Prediction for the 1st inputs f(x1=0, x2=0)", pred_1)
print("Prediction for the 2nd inputs f(x1=1, x2=0)", pred_2)
print("Prediction for the 3rd inputs f(x1=0, x2=1)", pred_3)
print("Prediction for the 4th inputs f(x1=1, x2=1)", pred_4)

Prediction for the 1st inputs f(x1=0, x2=0) tensor([2.9917e-30], grad_fn=<AddBackward0>)
Prediction for the 2nd inputs f(x1=1, x2=0) tensor([1.], grad_fn=<AddBackward0>)
Prediction for the 3rd inputs f(x1=0, x2=1) tensor([1.], grad_fn=<AddBackward0>)
Prediction for the 4th inputs f(x1=1, x2=1) tensor([2.9917e-30], grad_fn=<AddBackward0>)


As expected, the model performed very well. Although our network has only one hidden layer with three neurons, it was able to learn values of θ for which f(x;θ) closely approximates f*(x). Note that the model predicted ~2.99 × 10^-30, which is very close to zero, for f(0,0) and f(1,1). 
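
One way to make "very close" concrete is to compare the printed predictions against the XOR targets with a tolerance, and to threshold them into binary labels. The sketch below copies the values from the output above into plain Python floats. The tolerance of 1e-6 and the threshold of 0.5 are assumptions chosen for illustration.

```python
import math

# Predictions copied from the printed output above, keyed by (x1, x2).
predicted = {
    (0, 0): 2.9917e-30,
    (1, 0): 1.0,
    (0, 1): 1.0,
    (1, 1): 2.9917e-30,
}

# True XOR targets for the same inputs.
targets = {x: float(x[0] != x[1]) for x in predicted}

# Check every prediction against its target within an absolute tolerance.
all_close = all(
    math.isclose(predicted[x], targets[x], abs_tol=1e-6) for x in predicted
)

# Threshold the real-valued outputs into binary labels.
labels = {x: int(p >= 0.5) for x, p in predicted.items()}

print("all within tolerance:", all_close)
print("thresholded labels:", labels)
```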