# How to build a neural network

Many NLP tasks rely on sequence prediction from trained neural networks.  Understanding the fundamentals of their operation and implementation can help us navigate higher-level frameworks like Keras and guide the selection of the meta-parameters used to build the network.

This example works with a hand-coded, bare neural network implementation in Python to expose how the operations in the network are implemented.

Like all learning activities, we start with a search of available resources: [build a neural network from scratch](https://www.google.com/search?client=ubuntu&channel=fs&q=build+a+neural+network+from+scratch&ie=utf-8&oe=utf-8).

Since we are interested in applying our networks to sequence prediction, let's see if we can train a network with one of the most basic sequences: counting.  That is, can we train a network to count?  Given input 1, our network should return 2.  Given 2, return 3.  Given 3, return 4, and so on.  We want our network to learn to add 1 to the input number.

## NN First Pass

We start by following a basic example of building a neural network from scratch:

https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6

In [None]:
import numpy as np

In [None]:
class NeuralNetwork:
    
    def sigmoid(x):
        return 1/(1+np.exp(-x))
    
    def __init__(self, x, y):
        self.input      = x
        self.weights1   = np.random.rand(self.input.shape[1],4) 
        self.weights2   = np.random.rand(4,1)                 
        self.y          = y
        self.output = np.zeros(y.shape)
        
    def feedforward(self):
        self.layer1 = self.sigmoid(np.dot(self.input, self.weights1))
        self.output = self.sigmoid(np.dot(self.layer1, self.weights2))
        
    def backprop(self):
        # application of the chain rule to find derivative of the loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T,  (np.dot(2*(self.y - self.output) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1)))

        # update the weights with the derivative (slope) of the loss function
        self.weights1 += d_weights1
        self.weights2 += d_weights2

The article does a nice job of laying out the motivations and logic behind the above implementation, but it doesn't provide any code for testing and training.

It just shows an input data set with the desired output, which looks like an xor function, but doesn't give any insight into how to train for 1500 iterations.

In [None]:
nn = NeuralNetwork(np.array([[0, 1, 2]]), np.array([[1, 2, 3]]))

It doesn't even provide an implementation for the referenced `sigmoid()` function.  This leaves the [sigmoid()](https://gist.github.com/jovianlin/805189d4b19332f8b8a79bbd07e3f598) from the web erroring out on the format of the input parameters.

In [None]:
nn.feedforward()
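
One plausible reading of this failure, assuming the class is used exactly as written above: `sigmoid` is defined inside the class without a `self` parameter, so calling it as `self.sigmoid(...)` hands two arguments to a one-argument function. The sketch below reproduces that with two stand-in classes (`WithoutSelf` and `WithStatic` are hypothetical names, not from the article) and shows that marking the function as a `@staticmethod` avoids the error.

```python
import numpy as np


class WithoutSelf:
    # Same pattern as the class above: no `self` in the signature.
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def run(self, values):
        # Python passes the instance as the first argument here,
        # so sigmoid() receives two arguments instead of one.
        return self.sigmoid(values)


class WithStatic:
    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def run(self, values):
        return self.sigmoid(values)


sample = np.array([[0.0]])

try:
    WithoutSelf().run(sample)
    broken_error = None
except TypeError as exc:
    broken_error = exc

fixed_out = WithStatic().run(sample)

print(type(broken_error).__name__)
print(fixed_out)
```

Note that `backprop()` above also refers to a `sigmoid_derivative()` function that is never defined, so it would need its own definition before training could run.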

## Begin understanding numpy arrays

It's not entirely surprising that our random grab of `sigmoid()` from the web should fail with a different input in a new context.  Some portability is expected, however, given we are using numpy.  We kinda expect it to magically take care of applying the correct mathematical operations for a given input.  When it doesn't, like above, it's a good sign we aren't fully understanding our data structure.

Numpy lets us create vectors (numpy arrays) with some pretty simple syntax.

In [None]:
y = np.array([1,2,3])

We can inspect their structure with the shape attribute; in this case y has three elements.

In [None]:
y.shape

Here's a 3x4 matrix of random values.

In [None]:
np.random.rand(3, 4)

Some contrived vectors and their shape attributes.

In [None]:
x=np.array([[1, 2]])
y=np.array([[2, 3]])

In [None]:
x.shape[1]

In [None]:
y.shape

In [None]:
y

And another random matrix that relies on one dimension being defined by the shape of another variable.

In [None]:
np.random.rand(x.shape[1],4)

To multiply two 1x2 vectors (matrices) we need our inner dimensions to line up: a 1x2 and a 2x1. Simply take the transpose() of one of them (also available as the ".T" attribute).

In [None]:
np.dot(x,y.transpose())

A straightforward attempt at a 1x2 dot 1x2 won't work.

In [None]:
np.dot(x,y)

In [None]:
x.transpose()
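
To make the inner-dimension rule concrete, here is a small sketch using the same 1x2 values as above. The names `row_a` and `row_b` are only for illustration. It checks the shapes of the products that work and captures the error message from the one that doesn't.

```python
import numpy as np

row_a = np.array([[1, 2]])   # shape (1, 2)
row_b = np.array([[2, 3]])   # shape (1, 2)

# (1, 2) dot (2, 1): inner dimensions match, result is 1x1
inner = np.dot(row_a, row_b.T)
print(inner.shape, inner)

# (2, 1) dot (1, 2): also valid, result is the 2x2 outer product
outer = np.dot(row_a.T, row_b)
print(outer.shape)

# (1, 2) dot (1, 2): inner dimensions are 2 and 1, so numpy refuses
try:
    np.dot(row_a, row_b)
    mismatch = None
except ValueError as exc:
    mismatch = str(exc)
print(mismatch)
```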

## NN Second Pass

The http://medium.com post didn't provide a complete implementation, leaving off details on the training step and the use of the created NN object.

Revisiting the search [how to build neuralnetwork in python](https://www.google.com/search?client=ubuntu&channel=fs&q=how+to+build+neuralnetwork+in+python&ie=utf-8&oe=utf-8) and looking past the first recommendation above leads to a more comprehensive example [Build a Neural Network](https://enlight.nyc/projects/neural-network/) on the http://enlight.nyc site. 

The example trains a network on a classic learning example: the expected test score given the number of hours studied and slept as input.  That is, a 2d input (hours studied and slept) leads to a 1d output (test score).

Our initial pass at the implementation is slightly simplified from the given example to accommodate our expected use case: a 1d input (starting whole number) leads to a 1d output (next whole number).

In [None]:


class Neural_Network(object):
    def __init__(self):
        #parameters
        self.inputSize = 1
        self.outputSize = 1
        self.hiddenSize = 3



        #weights
        self.W1 = np.random.randn(self.inputSize, self.hiddenSize) # (inputSize x hiddenSize, here 1x3) weight matrix from input to hidden layer
        self.W2 = np.random.randn(self.hiddenSize, self.outputSize) # (hiddenSize x outputSize, here 3x1) weight matrix from hidden to output layer
        return
        
    def forward(self, X):
    #forward propagation through our network
      self.z = np.dot(X, self.W1) # dot product of X (input) and first set of weights (inputSize x hiddenSize)
      self.z2 = self.sigmoid(self.z) # activation function
      self.z3 = np.dot(self.z2, self.W2) # dot product of hidden layer (z2) and second set of weights (hiddenSize x outputSize)
      o = self.sigmoid(self.z3) # final activation function
      return o
    
    
    

    def sigmoid(self, s):
      # activation function
      return 1/(1+np.exp(-s))



    def sigmoidPrime(self, s):
      # derivative of sigmoid, written in terms of s = sigmoid(input), i.e. an already-activated value
      return s * (1 - s)



    def backward(self, X, y, o):
        # print statements introduced during debug of np.array() shape errors during weight calculation
        #print("X: {0}".format(X.shape))
        #print("y: {0}".format(y.shape))
        #print("o: {0}".format(o.shape))
        
        # backward propagate through the network
        self.o_error = y - o # error in output
        #print("o_error: {0}".format(self.o_error.shape))
        self.o_delta = self.o_error*self.sigmoidPrime(o) # applying derivative of sigmoid to error
        #print("o_delta: {0}".format(self.o_delta.shape))

        self.z2_error = self.o_delta.dot(self.W2.T) # z2 error: how much our hidden layer weights contributed to output error
        #print("z2_error: {0}".format(self.z2_error.shape))
        self.z2_delta = self.z2_error*self.sigmoidPrime(self.z2) # applying derivative of sigmoid to z2 error
        #print("z2_delta: {0}, {1}".format(self.z2_delta.shape, self.z2_delta))
        
        self.W1 += X.T.dot(self.z2_delta) # adjusting first set (input --> hidden) weights
        #tmp = X.T.dot(self.z2_delta) # adjusting first set (input --> hidden) weights
        #print("tmp: {0}".format(tmp.shape))
        #print("W1: {0}".format(self.W1.shape))
        self.W2 += self.z2.T.dot(self.o_delta) # adjusting second set (hidden --> output) weights
        #print("W2: {0}".format(self.W2.shape))
   

    def train (self, X, y):
      o = self.forward(X)
      self.backward(X, y, o)

    def saveWeights(self):
        np.savetxt("w1.txt", self.W1, fmt="%s")
        np.savetxt("w2.txt", self.W2, fmt="%s")

    def predict(self, x):
        print("Predicted data based on trained weights: ")
        print("Input: \n" + str(x))
        print("Output: \n" + str(self.forward(x)))


Let's run our network on a first set of parameters.  Given the number 1, predict the number 2.

We'll create our input and output values as vectors.

In [None]:
X= np.array([1])

y= np.array([2])

In [None]:
NN = Neural_Network()

#defining our output
o = NN.forward(X)

print("Predicted Output: \n" + str(o))
print("Actual Output: \n" + str(y))

In [None]:

for i in range(1000): # trains the NN 1,000 times
  print ("Input: \n" + str(X))
  print ("Actual Output: \n" + str(y))
  print ("Predicted Output: \n" + str(NN.forward(X)))
  print ("Loss: \n" + str(np.mean(np.square(y - NN.forward(X))))) # mean sum squared loss
  print ("\n")
  NN.train(X, y)

Ugh, and another error caused by the shape of our variables.

Working through this code by hand with a simple picture of a NN with one input neuron, 3 hidden neurons, and 1 output neuron suggests the code should work.

On the forward pass we multiply our 1x1 input X by W1, a 1x3 weight matrix (the inner dimensions match).  Then we multiply our 1x3 hidden layer matrix by a 3x1 weight matrix W2 to get a 1x1 output, as expected.

So, why are we seeing a dimension mismatch error on our backpropagation?

### Really understand numpy arrays

Clearly we can create a numpy array, print its value, and compute its dot product with itself using the .T transpose.

In [None]:
a=np.array([[1, 2, 3]])
print("{0}".format(a))

In [None]:
np.dot(a, a.T)

We can create randomly initialized weight matrices, like in the `__init__()` method, that match our expected dimensions for the matrix multiplies.

In [None]:
W1 = np.random.randn(1, 3)
W1

In [None]:
W2 = np.random.randn(3,1)
W2

In [None]:
np.random.randn(1, 3)

All of these seem like legitimate variables that will match the expected matrix operations during forward and back propagation.

What's going on with our input?

(Lots of time spent debugging the code with print statements to observe the shapes of the variables used -- the shapes appear to match.)

Let's look at how to create vectors in numpy.

The [scipy docs provide a concise overview of numpy array creation](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.creation.html) but don't really speak in terms of vectors and matrix operations.

A little more searching on [numpy vector syntax](https://www.google.com/search?q=numpy+vector+syntax) leads to a decent tutorial from IBM that compares numpy arrays and their use in matrix operations with Octave and Matlab.  Both of those environments provide very intuitive and expressive formats for specifying and manipulating vectors and matrices.  Because they provide an even higher level abstraction, they gloss over something that remains explicit in numpy: the difference between a vector and a matrix.

In numpy a vector is created with a numpy array built from a single, flat list:

In [None]:
X= np.array([1, 2, 3])
X

Vectors are 1-dimensional objects in numpy, and the shape hints at that.  Here we have a 3d vector: it has 3 entries, as shown by the .shape attribute.

In [None]:
X.shape

We can create a two dimensional matrix by explicitly defining the second dimension

In [None]:
X = X[None, :]
X

Now we have a 1x3 "matrix".  And, as you can see comparing the last result with our initial definition of X, double nested brackets distinguish a matrix versus defining the original vector. A 1x3 matrix contains one 3d vector.

The shape now also returns values for both dimensions.

In [None]:
X.shape
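
The `X[None, :]` indexing trick is only one way to promote a vector to a matrix. As a side-by-side sketch (the variable names are just for illustration), these standard numpy spellings all give a 1x3 row, and a similar index gives a 3x1 column:

```python
import numpy as np

v = np.array([1, 2, 3])          # 1-d vector, shape (3,)

row_a = v[None, :]               # the indexing trick used above
row_b = v[np.newaxis, :]         # np.newaxis is an alias for None
row_c = v.reshape(1, -1)         # -1 lets numpy infer the column count
row_d = np.atleast_2d(v)         # promotes 1-d input to a single row
col = v[:, None]                 # a 3x1 column instead of a 1x3 row

for name, arr in [("row_a", row_a), ("row_b", row_b), ("row_c", row_c),
                  ("row_d", row_d), ("col", col)]:
    print(name, arr.shape)
```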

The problem with our initial run using the Neural_Network object is that we passed it numpy vectors as input and not the expected numpy matrix format, which is needed to properly compute the back propagation operations.
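
To see where the vector input trips things up, here is a rough sketch that replays the same arithmetic as `forward()` and `backward()` outside the class. The function `w1_update_shape` and the fixed random seed are assumptions made for this illustration only. With a 1x1 matrix input the W1 update has the expected 1x3 shape; with a 1-element vector the final `X.T.dot(...)` has nothing to line up.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((1, 3))   # input -> hidden weights
W2 = rng.standard_normal((3, 1))   # hidden -> output weights


def sigmoid(s):
    return 1 / (1 + np.exp(-s))


def w1_update_shape(X, y):
    """Replay forward/backward and return the shape of the W1 update."""
    z2 = sigmoid(np.dot(X, W1))
    o = sigmoid(np.dot(z2, W2))
    o_delta = (y - o) * o * (1 - o)
    z2_delta = o_delta.dot(W2.T) * z2 * (1 - z2)
    return X.T.dot(z2_delta).shape


# 2-d (matrix) input: every product lines up
matrix_shape = w1_update_shape(np.array([[1]]), np.array([[2]]))
print(matrix_shape)

# 1-d (vector) input: transposing a vector changes nothing,
# so a length-1 vector meets a length-3 vector in the last dot product
try:
    w1_update_shape(np.array([1]), np.array([2]))
    vector_error = None
except ValueError as exc:
    vector_error = str(exc)
print(vector_error)
```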

A [summary of differences between Matlab/Octave matrix implementations](https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html) and syntax helps round out our understanding and further highlights the subtleties of working with the lower-level representation of structure in numpy arrays.

### Returning to our neural network training

Now that we know our code wants to work with matrix input and outputs, let's create our data using the double-bracket notation to explicitly create the 1x1 input and output matrices.

In [None]:
X= np.array([[1]])

In [None]:
y= np.array([[2]])

In [None]:
np.dot(X, W1)

In [None]:
np.dot(X, a)

In [None]:
NN = Neural_Network()

#defining our output
o = NN.forward(X)

print("Predicted Output: \n" + str(o))
print("Actual Output: \n" + str(y))

In [None]:

for i in range(1000): # trains the NN 1,000 times
  print ("Input: \n" + str(X))
  print ("Actual Output: \n" + str(y))
  print ("Predicted Output: \n" + str(NN.forward(X)))
  print ("Loss: \n" + str(np.mean(np.square(y - NN.forward(X))))) # mean sum squared loss
  print ("\n")
  NN.train(X, y)

Now we have a neural network that takes a number as its input and learns the weights to predict an output.

Notice that our network loss approaches 1 and never drops below that value. This actually indicates a flaw with our network.

Let's see what happens if we try to predict the next value.

In [None]:
NN.predict(X)

In [None]:
NN.predict([[2]])

In [None]:
NN.predict([[3]])

It looks like our trained network always predicts the value 1.  Even though we told it a thousand times over that the next value after 1 is 2 it couldn't learn that, much less how to add one to a number to get to the next value.

Clearly, our magical neural network is not very smart.

Looks like it's time to explore the construction of the network to see if there's anything we could change to make it smarter: training data, dimension of input or output, or size of the hidden layer.

## Refine Neural_Network for flexibility

In order to see if we can train a network to predict the next number in the sequence, we need more flexibility in our network.  Let's subclass the original class for more flexibility in the network structure.  

The main focus of this subclass will be to accept input parameters that allow us to change the size of the input, hidden and output layer.


In [None]:
class NN2(Neural_Network):
    
    # redefine __init__ and don't call the superclass version because we really want
    # the flexibility to set the size of each layer
    def __init__(self, insize = 1, outsize = 1, hiddensize = 3):
        
        #parameters
        self.inputSize = insize
        self.outputSize = outsize
        self.hiddenSize = hiddensize



        #weights
        self.W1 = np.random.randn(self.inputSize, self.hiddenSize) # (inputSize x hiddenSize) weight matrix from input to hidden layer
        self.W2 = np.random.randn(self.hiddenSize, self.outputSize) # (hiddenSize x outputSize) weight matrix from hidden to output layer
        return

Let's try the network with an additional input value and see if that allows us to learn to count.  We'll treat all our inputs as having a second value of 1.  Input for "1" will be [1, 1] and for 2 would be [2, 1].  This has the effect of introducing a bias parameter to the first layer computations.  The network will learn a bias weight on the first layer that allows the network to adjust the computations based on the inputs.

In [None]:
nn2 = NN2(insize=2)

X = np.array([[1, 1]])

#defining our output
o = nn2.forward(X)

print("Predicted Output: \n" + str(o))
print("Actual Output: \n" + str(y))
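
The bias trick described above can be sketched on its own. The helper `add_bias_column` and the hand-picked weights are hypothetical, chosen only to make the arithmetic easy to follow: the second row of the weight matrix gets added to every hidden unit no matter what the input value is, which is exactly what a bias does.

```python
import numpy as np


def add_bias_column(values):
    """Turn a list of numbers into rows of [value, 1]."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    ones = np.ones((values.shape[0], 1))
    return np.hstack([values, ones])


X_bias = add_bias_column([1, 2, 3])

# Hand-picked 2x3 weights, for illustration only
W1 = np.array([[0.5, -1.0, 2.0],    # weights applied to the input value
               [0.1,  0.2, 0.3]])   # weights applied to the constant 1 (the bias)

hidden_in = np.dot(X_bias, W1)
print(X_bias)
print(hidden_in)
```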

In [None]:

for i in range(1000): # trains the NN 1,000 times
  print ("Input: \n" + str(X))
  print ("Actual Output: \n" + str(y))
  print ("Predicted Output: \n" + str(nn2.forward(X)))
  print ("Loss: \n" + str(np.mean(np.square(y - nn2.forward(X))))) # mean sum squared loss
  print ("\n")
  nn2.train(X, y)

Again the best our network learns is a loss of 1.

In [None]:
nn2.predict([[2, 1]])

Our network doesn't appear able to learn how to predict 2 for an input of 1, even with a bias introduced.  It always wants to predict 1 for an input of 1.

What if we increase the hidden size? Does that give our network more discriminative power?

In [None]:
nn2 = NN2(insize=2, hiddensize=5)

X = np.array([[1, 1]])

#defining our output
o = nn2.forward(X)

print("Predicted Output: \n" + str(o))
print("Actual Output: \n" + str(y))

In [None]:

for i in range(1000): # trains the NN 1,000 times
  print ("Input: \n" + str(X))
  print ("Actual Output: \n" + str(y))
  print ("Predicted Output: \n" + str(nn2.forward(X)))
  print ("Loss: \n" + str(np.mean(np.square(y - nn2.forward(X))))) # mean sum squared loss
  print ("\n")
  nn2.train(X, y)

In [None]:
nn2.predict([[2, 1]])

Even with an arguably more powerful network, a bias on the input layer and 5 neurons on the hidden layer, we still can't learn to simply add one to the input value.

Maybe we could run better test iterations with more varied input. After all, we are just trying to hammer home that the next number after 1 is 2.  That may be a bit limited.

After considering these test iterations and noticing we always predict a number near 1, there is something to realize about our network. 

The reason the predicted result is always near 1, and the loss never drops below (target - 1) squared, which is 1 for our target of 2, is that we built our network using sigmoid activation functions. The output of this activation function approaches a maximum value of 1.  It is completely unable to go above 1.
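
A quick numeric check of that ceiling, as a sketch: the range of pre-activation values below is arbitrary, and `target = 2` matches the training example used so far.

```python
import numpy as np


def sigmoid(s):
    return 1 / (1 + np.exp(-s))


# A spread of pre-activation values, from very negative to very positive
pre_activations = np.linspace(-10, 10, 9)
outputs = sigmoid(pre_activations)

target = 2
losses = np.square(target - outputs)

print(outputs.max())   # the largest output still falls short of 1
print(losses.min())    # so the squared error against 2 stays above 1
```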

In order to predict a literal value greater than 1, we will need a different activation function on the output.  What if we use a ReLU() function?  It can produce any non-negative value on the output.  It would also require changing the back propagation to compute the derivative of this piecewise-linear function. That should be easy since the slope of the activation function is 1 for all positive inputs and 0 for negative inputs.
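
For reference, a minimal sketch of what such a ReLU pair could look like. The names `relu` and `relu_prime` are assumptions and are not part of the class above; the slope at exactly 0 is set to 0 here by convention.

```python
import numpy as np


def relu(s):
    # pass positive values through, clamp everything else to 0
    return np.maximum(0, s)


def relu_prime(s):
    # slope is 1 where s is positive and 0 elsewhere
    return (s > 0).astype(float)


s = np.array([[-2.0, 0.0, 0.5, 3.0]])
activated = relu(s)
slopes = relu_prime(s)

print(activated)
print(slopes)
```

Since a ReLU output is positive exactly when its input is positive, `relu_prime` gives the same answer whether it is handed the pre-activation or the activated value, which matters because `sigmoidPrime()` above is called with activated values.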

Let's save this refactor for later and instead explore what we can do with a sigmoid. Since we are limited to output values between 0 and 1, we can re-define our "teach it to add" problem to a "what comes next from these N values".  That is we can classify what the next number should be.  We can describe that problem as "can we teach our network to count to 10".  If we give it an input number we should be able to teach it to return the correct next number class by "activating" the correct class designator.

This approach requires that we have 10 output values, and the value that is "high", or closest to one, is the value of our output.  Say we give the number 1 as input and the number 2 as output; we would expect the output value to be something like [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].  In other words, we one-hot encode our numbers as classes.

To simplify training this network we should encode our training data as 10 element vectors. A toolkit with convenient vector encoding functions would be desirable here.

We could create a vector of numbers between 0-9 then create a +1 result vector and encode both.  This would give us a training set and allow us to predict output.
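
Before reaching for a toolkit, it may help to see that the encoding itself is small. This is a plain numpy sketch; `one_hot` and `decode` are hypothetical helper names, and the sample values are arbitrary.

```python
import numpy as np


def one_hot(values, num_classes=10):
    """Pick rows of the identity matrix: row n has a 1 in position n."""
    return np.eye(num_classes)[np.asarray(values)]


def decode(encoded):
    """Recover the class number as the position of the largest entry."""
    return np.argmax(encoded, axis=-1)


given = np.array([0, 1, 8])
expected_next = given + 1

x_encoded = one_hot(given)
y_encoded = one_hot(expected_next)

print(y_encoded[1])          # the encoding of 2
print(decode(y_encoded))
```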

## Build "next number" or "add one" classifier

In order to train our classifier we need our training data.  This is a simple list of numbers between 0 and 9 as the input and their "plus 1", or next in sequence, as the output.

### Create training data

Numpy has a good random number generator.

In [None]:
np.random.seed(42)
np.random.rand()

We could "manually" generate our random numbers and add them to a list.

In [None]:
for i in range(10):
    print(int(np.random.rand() * 10))

Numpy has a [randint() convenience function to return an array of random integers](https://machinelearningmastery.com/how-to-generate-random-numbers-in-python) which is just what we need.

Let's build a 20 element example to see what the input and output training vectors will look like.

In [None]:
givenints = np.random.randint(0, 10, 20)
givenints

Numpy makes it easy to add one to each element to get our next number in sequence

In [None]:
nextints = givenints+1
nextints

Since we are dealing with sequence prediction, which is really picking one of ten possible classes, we need to wrap our counter at 9 back to zero.  We simply select all the elements that reached "10" and wrap them around.  Again, numpy's array syntax makes this easy.

In [None]:
nextints[nextints > 9] = 0
nextints

We can throw this into a function to generate as much training data as desired.

In [None]:
def trainints(count):
    x = np.random.randint(0, 10, count)
    y = x + 1
    y[y > 9] = 0
    return x, y

In [None]:
given, result = trainints(10)
print(given)
print(result)
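
As an aside, the wrap-around can also be written with the modulo operator instead of a boolean mask. The variant below is a sketch, not a replacement for `trainints()`: the name `trainints_mod`, the `classes` parameter, and the use of numpy's newer `default_rng()` generator are all choices made here for illustration.

```python
import numpy as np


def trainints_mod(count, classes=10, rng=None):
    """Return random integers in [0, classes) and their wrapped successors."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(0, classes, count)
    y = (x + 1) % classes      # 9 + 1 wraps around to 0
    return x, y


sample_x, sample_y = trainints_mod(1000, rng=np.random.default_rng(42))

# Same rule as the mask version: add one, then send 10 back to 0
masked = sample_x + 1
masked[masked > 9] = 0

print(sample_x[:10])
print(sample_y[:10])
```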

### Convert training to one-hot encoding

We are still dealing with an activation function that is constrained to values between 0-1, so we need to convert our input and output to a one-of-ten classes selector.  We'll train our network to return the next number by indicating what class it's in, i.e. the "1" class, "2" class, "3" class, etc.

Rather than manually encode our data we'll use the [one-hot utilities from Keras](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/), since that's our target machine learning toolkit.

In [None]:
import keras as k

In [None]:
x = k.utils.to_categorical(given, num_classes=10) # fix the width at 10 even if 9 is missing from the sample
x

In [None]:
for num in x:
    print(np.argmax(num))

### Build and train classifier

Our classifier will take a one-hot encoded integer, identifying the input class, and predict the output class.   So the neural network will have an input and output size of 10 units (classes) and we can play around with the hidden layer to see if there is any need for greater discriminative power.  We'll start with 4 units as the hidden layer, assuming this mimics a binary encoding and 4 bits can represent 10 numbers.
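
The arithmetic behind the 4-unit guess, as a small sketch: ten classes need at least ceil(log2(10)) = 4 binary digits. Whether the sigmoid hidden layer actually settles on anything like a binary code is an assumption, not something this check demonstrates.

```python
import math

import numpy as np

classes = 10
bits_needed = math.ceil(math.log2(classes))

# Write each number 0-9 as a row of bits, most significant bit first
codes = np.array([[(n >> b) & 1 for b in range(bits_needed - 1, -1, -1)]
                  for n in range(classes)])

print(bits_needed)
print(codes)
```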

In [None]:
nextnum = NN2(10, 10, 4)

In [None]:
x, y = trainints(100)
x = k.utils.to_categorical(x, num_classes=10)
y = k.utils.to_categorical(y, num_classes=10)

In [None]:
for i in range(100): # trains the NN 100 times
    print ("Input: {}\n".format(np.argmax(x[i])))
    print ("Actual Output: {}\n".format(np.argmax(y[i])))
    print ("Predicted Output: {}\n".format(np.argmax(nextnum.forward(x[i]))))
    print ("Loss: {}\n".format(np.mean(np.square(np.argmax(y[i]) - np.argmax(nextnum.forward(x[i])))))) # mean sum squared loss
    print ("\n")
    nextnum.train(x, y)

In [None]:
for i in range(10):
    print("Test: {}".format(np.argmax(x[i])))
    print("Truth: {}".format(np.argmax(y[i])))
    answer = nextnum.forward(x[i])
    #print(answer)
    print("Next Num: {}".format(np.argmax(answer)))
    print("\n")
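
Eyeballing ten printed predictions is a noisy way to judge a classifier. A single accuracy number over the whole set is easier to compare between runs. The helper below is a sketch: `accuracy` is a hypothetical name, and the two stand-in "networks" (`perfect` and `identity`) exist only to show the two extremes.

```python
import numpy as np


def accuracy(forward, inputs, targets):
    """Fraction of rows where the predicted class matches the true class."""
    predicted = np.argmax(forward(inputs), axis=1)
    expected = np.argmax(targets, axis=1)
    return np.mean(predicted == expected)


numbers = np.arange(10)
inputs = np.eye(10)[numbers]
targets = np.eye(10)[(numbers + 1) % 10]


def perfect(x):
    # shifts each one-hot row one place to the right, wrapping 9 to 0
    return np.roll(x, 1, axis=1)


def identity(x):
    # always answers with the input class, which is never the next number
    return x


perfect_score = accuracy(perfect, inputs, targets)
identity_score = accuracy(identity, inputs, targets)
print(perfect_score, identity_score)
```

With the objects defined above, the same helper could be called as `accuracy(nextnum.forward, x, y)`.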

Our first classifier predicts correctly sometimes, but doesn't seem very accurate. This is either due to limited training data or poor assumptions about what our hidden layer can discriminate.  We can see that during training our loss never settles near zero.

Let's try another network with 10x more training data.

In [None]:
nextnum2 = NN2(10, 10, 4)

In [None]:
x2, y2 = trainints(1000)
x2 = k.utils.to_categorical(x2, num_classes=10)
y2 = k.utils.to_categorical(y2, num_classes=10)

In [None]:
for i in range(1000): # trains the NN 1,000 times
    print ("Input: {}\n".format(np.argmax(x2[i])))
    print ("Actual Output: {}\n".format(np.argmax(y2[i])))
    print ("Predicted Output: {}\n".format(np.argmax(nextnum2.forward(x2[i]))))
    print ("Loss: {}\n".format(np.mean(np.square(np.argmax(y2[i]) - np.argmax(nextnum2.forward(x2[i])))))) # mean sum squared loss
    print ("\n")
    nextnum2.train(x2, y2) # train the new network, not the first one

In [None]:
for i in range(10):
    print("Test: {}".format(np.argmax(x2[i])))
    print("Truth: {}".format(np.argmax(y2[i])))
    answer = nextnum2.forward(x2[i])
    #print(answer)
    print("Next Num: {}".format(np.argmax(answer)))
    print("\n")

Hmm, ok.  Looks like more training data isn't the problem.  We just taught our network to predict 5 for all input.  Reviewing the training run, we see that it indeed settled on 5 as the answer for all problems.  Seems it's just guessing the average number between 0-9.  Not an unjustified decision but clearly not very smart.

We obviously need a different approach.  It's a good guess that we should give our network more power in the hidden layer; that and the training data are all that we can control for this problem, after all.


In [None]:
nextnum3 = NN2(10, 10, 8)

Reuse our existing 1,000 training examples.

In [None]:
for i in range(1000): # trains the NN 1,000 times
    print ("Input: {}\n".format(np.argmax(x2[i])))
    print ("Actual Output: {}\n".format(np.argmax(y2[i])))
    print ("Predicted Output: {}\n".format(np.argmax(nextnum3.forward(x2[i]))))
    print ("Loss: {}\n".format(np.mean(np.square(np.argmax(y2[i]) - np.argmax(nextnum3.forward(x2[i])))))) # mean sum squared loss
    print ("\n")
    nextnum3.train(x2, y2)

In [None]:
for i in range(10):
    print("Test: {}".format(np.argmax(x2[i])))
    print("Truth: {}".format(np.argmax(y2[i])))
    answer = nextnum3.forward(x2[i])
    #print(answer)
    print("Next Num: {}".format(np.argmax(answer)))
    print("\n")

Well, that's disappointing.  Our network is still not predicting the next number well.  All we gained by doubling our hidden layer is another midpoint prediction "3".  

Two questions come to mind.  Are we training our network correctly?  This is an example implementation from the Internet. It could be there's an error in the backpropagation. 
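
One way to probe the first question is a numerical gradient check. The sketch below assumes the updates in `backward()` are meant to be plain gradient descent steps (with a step size of 1) on half the summed squared error, and compares them against finite-difference gradients. The helpers `loss`, `analytic_steps`, and `numeric_grad` are hypothetical; they re-implement the same math outside the class on a fixed random seed.

```python
import numpy as np


def sigmoid(s):
    return 1 / (1 + np.exp(-s))


def loss(W1, W2, X, y):
    o = sigmoid(sigmoid(X @ W1) @ W2)
    return 0.5 * np.sum((y - o) ** 2)


def analytic_steps(W1, W2, X, y):
    """The same quantities backward() adds to W1 and W2."""
    z2 = sigmoid(X @ W1)
    o = sigmoid(z2 @ W2)
    o_delta = (y - o) * o * (1 - o)
    z2_delta = (o_delta @ W2.T) * z2 * (1 - z2)
    return X.T @ z2_delta, z2.T @ o_delta


def numeric_grad(f, W, eps=1e-6):
    """Central finite differences of f with respect to each entry of W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        up = f()
        W[idx] = original - eps
        down = f()
        W[idx] = original
        grad[idx] = (up - down) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
X = np.eye(10)[[3, 7]]          # two one-hot inputs
y = np.eye(10)[[4, 8]]          # their "next number" targets
W1 = rng.standard_normal((10, 4))
W2 = rng.standard_normal((4, 10))

step_W1, step_W2 = analytic_steps(W1, W2, X, y)
num_W1 = numeric_grad(lambda: loss(W1, W2, X, y), W1)
num_W2 = numeric_grad(lambda: loss(W1, W2, X, y), W2)

# A descent step is the negative gradient, so step + gradient should be ~0
max_diff = max(np.abs(step_W1 + num_W1).max(), np.abs(step_W2 + num_W2).max())
print(max_diff)
```

If `max_diff` comes out tiny, the update formulas agree with the gradient of that loss, which would point toward other suspects such as the step size or how `train()` is being called.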

The second question is, do we need to increase our hidden layer significantly to get additional "average" points?  Our two tests show we got one average for every 4 neurons in the hidden layer.  Would 4*10 = 40 units in the hidden layer lead to successful predictions?

In [None]:
nextnum4 = NN2(10, 10, 40)

Reuse our existing 1,000 training examples.

In [None]:
for i in range(1000): # trains the NN 1,000 times
    print ("Input: {}\n".format(np.argmax(x2[i])))
    print ("Actual Output: {}\n".format(np.argmax(y2[i])))
    print ("Predicted Output: {}\n".format(np.argmax(nextnum4.forward(x2[i]))))
    print ("Loss: {}\n".format(np.mean(np.square(np.argmax(y2[i]) - np.argmax(nextnum4.forward(x2[i])))))) # mean sum squared loss
    print ("\n")
    nextnum4.train(x2, y2)

In [None]:
for i in range(10):
    print("Test: {}".format(np.argmax(x2[i])))
    print("Truth: {}".format(np.argmax(y2[i])))
    answer = nextnum4.forward(x2[i])
    #print(answer)
    print("Next Num: {}".format(np.argmax(answer)))
    print("\n")

Hmm, we aren't doing any better, but we did notice a logic error.  In the call to train() we aren't giving it the input and output for the current iteration.  Instead we are passing the entire training set on every iteration.

Let's try again with the original network and this logic error fixed.

In [None]:
nextnum5 = NN2(10, 10, 4)

Reuse our existing 1,000 training examples.

In [None]:
for i in range(1000): # trains the NN on one example per iteration, 1,000 iterations
    print ("Input: {}\n".format(np.argmax(x2[i])))
    print ("Actual Output: {}\n".format(np.argmax(y2[i])))
    print ("Predicted Output: {}\n".format(np.argmax(nextnum5.forward(x2[i]))))
    print ("Loss: {}\n".format(np.mean(np.square(np.argmax(y2[i]) - np.argmax(nextnum5.forward(x2[i])))))) # mean sum squared loss
    print ("\n")
    nextnum5.train(x2[i:i+1], y2[i:i+1]) # slices keep the 1x10 matrix shape; a plain list has no .T for backward()
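
A note on passing one example at a time: `backward()` calls `X.T`, so whatever goes into `train()` needs to be a 2-d numpy array. The sketch below compares a few ways of pulling one row out of a one-hot matrix; the variable names and the three sample rows are just for illustration.

```python
import numpy as np

data = np.eye(10)[[3, 7, 1]]     # three one-hot rows, shape (3, 10)

row_1d = data[0]                 # shape (10,): a vector, not a matrix
row_2d = data[0:1]               # shape (1, 10): slicing keeps the row dimension
as_list = [data[0]]              # a plain Python list wrapping the vector
as_array = np.array(as_list)     # shape (1, 10) once converted to an array

print(row_1d.shape, row_1d.T.shape)    # transposing a vector changes nothing
print(row_2d.shape, row_2d.T.shape)    # transposing a 1x10 matrix gives 10x1
print(hasattr(as_list, "T"))           # lists have no transpose attribute
print(as_array.shape)
```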

In [None]:
for i in range(10):
    print("Test: {}".format(np.argmax(x2[i])))
    print("Truth: {}".format(np.argmax(y2[i])))
    answer = nextnum5.forward(x2[i])
    #print(answer)
    print("Next Num: {}".format(np.argmax(answer)))
    print("\n")