# Recurrent Neural Networks and LSTM

## Introduction

Recurrent Neural Networks are one of the most popular Neural Network architecture types due to their importance in processing sequential and time series data. While time-series data usually refers to real valued data with a temporal dimension such as stock market prices and sensor outputs, sequential data is a much broader data type which also includes audio, text, and video.

We begin with an overview of basic RNNs with a simple example. After that we look at LSTMs, perhaps the most popular example of a traditional recurrent neural network. Later in the lecture we start looking at the topic of generative machine learning by seeing how recurrent neural networks can be used to create a type of language model.

Note that some of the notes below are based on the tutorial here:
https://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html

Transformer Architectures and the idea of attention are also really important in processing sequential data. We won't cover them this week, but we will have a look at these in detail in the next lecture.

## Recurrent Neural Networks - Overview

A recurrent neural network is a neural network in which the hidden state computed for input $i$ is fed back into the hidden layer alongside the $(i+1)^{th}$ input. That way, the output at time step $k$ depends not only on the input at time step $k$, but also on all the inputs seen up to step $k$.

We can illustrate the most basic RNN instance below with a hidden neuron which is connected back to itself.

<!-- rnn.png -->
<img width="300" src="https://drive.google.com/uc?id=1dZpj6awgBNmpLZSnCIYxU5kciLcp-nP0"/>

We can 'unfold' such a design in time to get a better sense of what is happening. In the figure below we see the RNN with three inputs applied in sequence. The input t=0 will result in a hidden state which is passed to the output at time t=0 but which is also passed into the hidden state at t=1. Thus at t=1 the output of the hidden neuron is dependent not only on the input, but also on the hidden state variable from time t=0. Similarly at t=2 the output of the hidden neuron is dependent on the input at t=2 but also on the output of the hidden state from t=1 which in turn was dependent on the output of the hidden state at t=0. This shows the power of the model, i.e., the hidden state output at t=2 is not only dependent on the input variable at t=2, but on all the inputs that have been seen to date.

<!-- rnn2.png -->
<img width="300" src="https://drive.google.com/uc?id=1OeVajqjv27Z44wV3e6immRGxeK-jp6wb"/>

The example above has just one input and output neuron, thus we only have one target variable and one feature. We can trivially extend this approach to multiple target variables and input features by connecting each hidden neuron to its corresponding neuron in subsequent instances of the network as it is unfolded over time. This is illustrated below. Note that for visual purposes we mark hidden-to-hidden links with a separate color, but that this has no bearing on the implementation.

<!-- rnn5.png -->
<img width="300" src="https://drive.google.com/uc?id=1gljfKYL5kduubxU5rv_w_qzT9CNf8qrK"/>

This is a vanilla form of recurrent neural network. In practice the architecture adopted is much more complex than this and allows us a greater degree of control in determining what parts of the state variable are saved or passed on to future time steps. But for the moment let's see how this type of model can be put to work.

## Recurrent Neural Networks - Model Description

For our Vanilla RNN, let us consider the most basic form of RNN possible, i.e., an RNN which has a link to a single neuron in an input layer, a single neuron in the output layer, and a state variable that is one neuron wide. Let's also assume that the input to this network, X, is a stream of values $\in \mathbb{R}$ and that the output from the network, Y, is a stream of values $\in \{0,1\}$. We will refer to the output of the hidden RNN unit as its state, S, and this state is passed to the logit calculations of the output layer and also to the logit calculations of the hidden layer itself. The architecture for this network is illustrated below:

<!-- rnn_vanilla.png -->
<img width="500" src="https://drive.google.com/uc?id=1Nej2Ng92K-hjWK0Uz64GIMtIgrdXf0ce"/>

Some points to note:
 * Our input layer is clamped to input values $x \in X$ in the normal way.
 * Our output layer is in this case a hyperbolic tangent function (softmax is not necessary as we only have one output binary class).
 * Our hidden unit is recursive in that its output value is fed back and concatenated with the content from the input layer.
 * Weights and biases are assumed for the hidden and output layers in the normal way. We will return to the dimensionality of these variables later.

Operationally the hidden or recursive unit behaves like a typical hidden layer in that a logit is first computed and from this an activation is calculated. We will assume that the activation value is a hyperbolic tangent function, but this need not be the case. The activation function for our vanilla RNN could be a logistic function or other suitable activation function.

Our visualization above simplifies the temporal nature of the RNN as it shows the output of the RNN unit being fed directly back into the same unit. For clarity it is useful to think about the network as having to be unrolled out over time to deal with sequences of inputs. We visualize this below by extending the network out to a sequence of 3 input units at T=0, T=1, and T=2.

<!-- rnn_vanilla_unrolled.png -->
<img width="500" src="https://drive.google.com/uc?id=1my7wmbnwIs_AT7h5DOX7Vs-NMIS87GUS"/>

Referring to the above we can see now how the output of the hidden layer at time T is passed as input to the hidden layer at T+1. Specifically, $S_{T}$ is concatenated with the actual input $X_{T+1}$. At time T=0 we don't have a true hidden state from a previous step to concatenate with our input $X_{0}$. In this case we typically use an initial value for S which we set to 0.

As noted above, the output of the hidden RNN unit is commonly referred to as the hidden state, $S$. The value of S for a given time point $t$ is simply:

$$S_{t} = \tanh\left( W (X_{t} \oplus S_{t-1}) + b \right)$$

where $\oplus$ denotes vector concatenation. Thus we see that, for a given point in time, the calculation of the hidden layer is typical of any feedforward layer that we have dealt with to date. The difference is that this value is passed back to the network at time $t+1$ as well as being passed to the logit calculation for the output layer.
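The state update above can be sketched in a few lines of NumPy. This is only an illustration under assumed sizes: the names `rnn_step`, `input_size` and `state_size`, the random weights, and the three input values are all made up for the example and are not part of the lecture's model.

```python
import numpy as np

def rnn_step(x_t, s_prev, W, b):
    """One application of the state update: tanh(W (x_t concat s_prev) + b)."""
    concat = np.concatenate([x_t, s_prev])   # input joined with the previous state
    return np.tanh(W @ concat + b)

input_size, state_size = 1, 4                # illustrative sizes (assumption)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(state_size, input_size + state_size))
b = np.zeros(state_size)

s = np.zeros(state_size)                     # initial state set to 0
states = []
for x in [0.5, -1.0, 2.0]:                   # three inputs applied in sequence
    s = rnn_step(np.array([x]), s, W, b)     # the new state feeds the next step
    states.append(s)

print(np.stack(states).shape)                # one state vector per time step
```

Note that the same `W` and `b` are reused at every time step; only the state changes as the sequence is consumed.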

### Hidden State Size
In our figures above we assumed our hidden layer had just one hyperbolic tangent unit. This corresponds to having a hidden state size of 1. Naturally however having only one channel to carry information across time will limit the complexity of temporal features that our model will be able to learn. Therefore, in practice we will want a hidden state that is much wider.

Our equation for calculating $S$ above has no such assumptions about the size of the hidden state, and indeed we will see later our implementations of RNNs are built to accommodate much larger hidden states. The approach taken to this is very simple. Rather than having a single non-linear unit, we make use of a so-called RNN cell which can have an arbitrary state size. Internally the cell is implemented as a collection of neurons, i.e., logit and non-linearity calculations, but the specifics of these internal operations are abstracted for us.

We can illustrate this approach by looking again at the case where T=1 and assume instead a hidden state size of 4. In this illustration $S_{0}$ is the output state from the $0^{th}$ application of the model, and $S_{1}$ will be fed to the following application.

<!-- rnn_vanilla_states.png -->
<img width="500" src="https://drive.google.com/uc?id=1ZQlHUkDtnhdBp0FdQBK3zefPOebzAZbz"/>

By convention we won't illustrate each of the neurons within the RNN cell and will instead only illustrate the RNN cell itself. For our purposes here we will use a square to capture such RNN cells as opposed to the ovals we have used to date. As above we split the cell in half to illustrate the point that a logit is first calculated before a non-linearity.

<!-- rnn_vanilla_unrolled_2.png -->
<img width="500" src="https://drive.google.com/uc?id=1BmlEzu11ckYmMBws3uE6vFzflbz9RgiT"/>

### Aside: Unrolling the Vanilla RNN for Training

Above we unrolled the Vanilla RNN over a number of time steps in order to make its operation more transparent. It turns out however that unrolling the RNN is in fact essential to its use.

Let's consider first the case of using our Vanilla RNN model where the model parameters (i.e., the weights and bias for the hidden and output layers) have already been set. In this case it should be possible for us to feed an input sequence one unit at a time through an unrolled network and produce a set of valid outputs. We have to be careful to copy our output from time point $t$ to time point $t+1$ for concatenation with our actual input, but otherwise there should not be a problem.

However for training, things are a little bit more complex. At the end of an input sequence we should be able to calculate a total training loss on the sequence. That loss is not just from the error on the final output symbol, but should instead be based on the true versus predicted value for the output Y at each time step. We need to propagate errors back from the last time step through multiple steps.

One question is how many steps should we propagate errors backwards for? There are two extreme conditions. In the first extreme condition we don't propagate errors backwards in time at all. In this case we will not be able to learn temporal dependencies in our data, which defeats the whole purpose of what we are trying to achieve. In the other extreme we allow errors on the final symbol to propagate all the way back, so that in theory corrections to weights at the first time step could be influenced by an error in the final symbol. While this might be appealing in that it would allow arbitrarily long dependencies to be identified, in practice it does not work for two reasons. The first is a purely resource based concern. The further back we allow error propagation, the more complex our model and the more resources (time and memory) that are consumed. The second issue however is more problematic and is an instantiation of the vanishing (or exploding) gradient problem. Essentially, over many time steps the repeated multiplication of error gradients will result in the gradient approaching either 0 (vanishing) or $\infty$ (exploding).
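The effect of repeated multiplication can be seen with a toy calculation. The sketch below is a deliberate simplification: it assumes the gradient is scaled by the same scalar factor at every time step, which is not how a real network behaves, and the function name `repeated_factor` and the factors 0.5 and 1.5 are chosen purely for illustration.

```python
def repeated_factor(factor, steps):
    """Multiply a starting gradient of 1.0 by the same factor once per time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

shrinking = repeated_factor(0.5, 50)   # factor below 1: the gradient heads towards 0
growing = repeated_factor(1.5, 50)     # factor above 1: the gradient blows up
print(shrinking, growing)
```

Even over 50 steps the two results differ by many orders of magnitude, which is why limiting how far errors are propagated is attractive in practice.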

The solution then in practice is to limit the propagation of errors to a fixed number of time steps. This is referred to as truncated backpropagation.

### Multiple RNN Layers

In the example above we used a single layer of RNN units. In practice though we can use multiple layers where the output of one layer feeds into the next layer. This can be visualized as follows:

<!-- rnn_multi_layer.png -->
<img width="600" src="https://drive.google.com/uc?id=12CQWepvinVskswapE_GohcZ8apEwslp4"/>

When we are dealing with the output of an RNN, please keep in mind that it can be used in two different ways. In one circumstance we may only be interested in the final output of the RNN, i.e., the output produced for the final chunk of input. This might be relevant if we want a simple classifier based almost entirely on the RNN's own internal processing. In another circumstance we might want to see the output of the RNN for each step. Here we have an output for each individual time step. This is often useful to further classifier steps as they can make a decision not just based on the final output of the RNN, but also on the intermediary decisions that were made by it. Typical implementations, such as the one provided by Tensorflow, allow you to parameterise whether an RNN layer outputs only its final result, or its result unfolded in time.
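The two ways of using an RNN's output can be sketched with a hand-rolled NumPy loop. The function `run_rnn` and its random weights are illustrative assumptions; the flag is named `return_sequences` only to mirror the Keras argument used later in these notes.

```python
import numpy as np

def run_rnn(xs, W, b, return_sequences=False):
    """Run a tanh RNN over xs (shape: steps x input_size), starting from a zero state."""
    state = np.zeros(b.shape[0])
    outputs = []
    for x_t in xs:
        state = np.tanh(W @ np.concatenate([x_t, state]) + b)
        outputs.append(state)
    # Either every intermediate state, or just the last one
    return np.stack(outputs) if return_sequences else state

steps, input_size, state_size = 20, 1, 8     # illustrative sizes (assumption)
rng = np.random.default_rng(1)
W = rng.normal(scale=0.3, size=(state_size, input_size + state_size))
b = np.zeros(state_size)
xs = rng.normal(size=(steps, input_size))

all_steps = run_rnn(xs, W, b, return_sequences=True)
final_only = run_rnn(xs, W, b)
print(all_steps.shape, final_only.shape)
```

The final-only output is simply the last row of the per-step output, so the choice is about what later layers get to see rather than about any change in the recurrent computation.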

## Usage Scenario 1: Alarm Tripping

To illustrate the model above, we will create a Trip Alarm for Negative Inputs. This alarm will produce as output the value 0 so long as the input is positive. However if ever the input value turns negative the output of the alarm will switch to output 1 and remain at 1 regardless of the inputs.
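Before training a network to do this, it may help to see the target behaviour written as ordinary code. The following is a minimal reference sketch; the function name `trip_alarm` and the sample input are made up for illustration.

```python
def trip_alarm(inputs):
    """Return 0 for each step until a negative input is seen, then 1 from that step onward."""
    tripped = False
    outputs = []
    for value in inputs:
        if value < 0:
            tripped = True          # once tripped, the alarm never resets
        outputs.append(1 if tripped else 0)
    return outputs

print(trip_alarm([4, 18, -1, 5, 16]))   # [0, 0, 1, 1, 1]
```

The single boolean `tripped` plays the role that the hidden state plays in the RNN: it is the only information carried from one step to the next.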

### Step 1 - Data Creation
We can easily generate some sample data that corresponds to our usage scenario. For training we will limit all our examples to being constant length (currently 20 units below in the code). We will also set the input data to be integers in the range -1 to 19 (the upper bound of 20 passed to `np.random.randint` is exclusive). We also generate the Target data straightforwardly by iterating over our dataset and applying a function that checks for the occurrence of negative values, and if found sets the output to 1 from the point at which the negative value is found, until the end of the sequence. We will create a large training data set and a smaller test data set with the same methods.

In [1]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# set this below to "./" if you just want to use the current directory to save temporary model files
# else set it to wherever makes sense for you
my_temp_folder = "./"

# Set up some variables used in data collection and in training
num_training_examples = 10000
num_test_examples = 100
example_length = 20
lower_bound = -1
upper_bound = 20

# Create Training and Test input variables
X_train = np.random.randint(low=lower_bound,high=upper_bound,
                      size=(num_training_examples,example_length)).astype(float)
X_test = np.random.randint(low=lower_bound,high=upper_bound,
                      size=(num_test_examples,example_length)).astype(float)

# Create Training and Test target variable
def scan_example(x):
    """Function Used to Create Individual Target Vectors by Scanning Input Vectors"""
    y = np.zeros(shape=np.shape(x))
    for index,val in np.ndenumerate(x):
        if val < 0:
            y[index[0]:] = 1
            break
    return y
Y_train = np.apply_along_axis(scan_example,1,X_train)
Y_test = np.apply_along_axis(scan_example,1,X_test)

# For processing with Tensorflow / Keras, we expand the dimensionality of our data by one.
# In other words our inputs will be sets of sets of individual values -- rather than sets of arrays
X_train = np.expand_dims(X_train, axis=2)
X_test = np.expand_dims(X_test, axis=2)
Y_train = np.expand_dims(Y_train, axis=2)
Y_test = np.expand_dims(Y_test, axis=2)

# Print examples from the test data (we add the T to transpose -- this just means the printing is neater. Remove it if you want to see what happens without it.)
print(X_test[0].T)
print(Y_test[0].T)

[[ 4. 18. 16.  5. 16. 17. 13.  4. 17.  7.  9. 17. 17. 13. 19.  8.  4.  0.
   6.  8.]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


### Step 2: Model Creation

We will build our recurrent model for this 'alarm system' using a Keras wrapper for Tensorflow.

In [2]:
import tensorflow as tf
from tensorflow.keras import layers

Next, let's just define our batch size etc. We will use a state_size of 8. This is already pretty small, but you can still get results if you reduce it to 4. It just might take longer to train.

In [3]:
batch_size = 100
state_size = 8
num_classes = 1

Next, let's define our model. We will be able to build our model based around the 'Sequential' style constructor that Keras uses. Keep in mind that not every model that people build is sequential. For now though, it is perfect.

In [4]:
model = tf.keras.Sequential()

Next, let's define our RNN layer. We need to specify the hidden state size. We also ask Tensorflow to output values for every step in the sequence, not just the final step. Finally, to help guide Tensorflow we give it some guidance on the input data dimensions it should expect; here 20 is the length of each example and 1 is the dimensionality of each input element, i.e., just a single number. We don't need to specify the batch size; Keras assumes we will have data in batches, even if they are just batches of length 1.

In [5]:
# Input will be passed to a basic RNN layer - sequences will be generated
model.add(layers.SimpleRNN(state_size,return_sequences=True,input_shape=(20,1)))

Next, we are going to let the full output of the recurrent network be passed to a single output unit. The output unit will be logistic since we are dealing with a single simple classification task.

In [6]:
# add an output layer consisting of 1 single logistic classifier
model.add(tf.keras.layers.Dense(num_classes, activation=tf.nn.sigmoid))

Next, we need to compile our model. We use cross entropy since our targets are 0s and 1s and our final layer neuron is well suited to predicting them. We will use the adam optimiser, and ask for an accuracy metric to be calculated.

In [7]:
model.compile(loss="binary_crossentropy",
                   optimizer='adam',
                   metrics=['accuracy'])

Let's have a look at the built, but not yet trained model.

In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 20, 8)             80        
                                                                 
 dense (Dense)               (None, 20, 1)             9         
                                                                 
Total params: 89 (356.00 Byte)
Trainable params: 89 (356.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Note that our output is a vector of 20 elements, i.e., it is a sequence. Our loss function will be calculating a total loss across all 20 elements. Loss will be calculated on an item by item basis. If we have made the correct prediction at some point, we do not apportion some of the loss / blame to the unit at that point, i.e., we don't average the losses back out over the 20 elements.
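The item by item nature of the loss can be illustrated with a small NumPy sketch of binary cross entropy. The function name `per_step_bce`, the clipping constant, and the example targets and predictions are assumptions made for illustration; how a framework reduces the per-step values to a single reported number is a separate choice and is not shown here.

```python
import numpy as np

def per_step_bce(y_true, y_pred, eps=1e-7):
    """Binary cross entropy computed separately for every time step."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical targets and predictions for a 4-step sequence
y_true = np.array([0.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.1, 0.6, 0.9, 0.4])

losses = per_step_bce(y_true, y_pred)
print(losses)
```

The steps where the prediction sits on the wrong side of 0.5 (the second and fourth) receive the larger loss values, while the well-predicted steps contribute very little.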

We can see that in the great scheme of things this is a very small model. We could in practice make it much smaller by reducing the size of the hidden state. If we did this our model would likely still converge to a good model, but it would not do so as frequently. We might have to rerun training to get a good model.  
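The parameter counts in the summary above can be reproduced by hand, which also settles the dimensionality of the weights and biases. The helper names below are made up for this sketch; the formulas assume a weight for every (input or state) to (state unit) connection plus one bias per unit.

```python
def simple_rnn_params(input_dim, state_size):
    # Weights on the concatenated [input, previous state] vector, plus one bias per unit
    return state_size * (input_dim + state_size) + state_size

def dense_params(input_dim, units):
    # One weight per input per unit, plus one bias per unit
    return units * input_dim + units

print(simple_rnn_params(1, 8))    # 80
print(dense_params(8, 1))         # 9
```

With an input dimension of 1 and a state size of 8 this gives 80 + 9 = 89 parameters, matching the total reported by `model.summary()`.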

Next, time to train the model. We provide input and output data, and specify a batch size for training as well as the number of epochs. If the model didn't train for you, increase the number of epochs.

In [9]:
hist = model.fit(X_train, Y_train,
           batch_size=batch_size,
           epochs=100, verbose=0)

Note that I had verbosity turned off to save on some printing space, but feel free to set verbose=1 to see the actual progress of training.

Next, let's predict based on some test data and see what our outputs look like

In [10]:
res = model.predict(X_test,batch_size=batch_size)
print(X_test[0].T)
print(res[0].T)

[[ 4. 18. 16.  5. 16. 17. 13.  4. 17.  7.  9. 17. 17. 13. 19.  8.  4.  0.
   6.  8.]]
[[2.4406708e-04 1.8382258e-05 2.9999010e-05 6.5783228e-05 3.8903388e-05
  3.6048408e-05 4.0281215e-05 9.3944291e-05 3.9485414e-05 5.4756376e-05
  5.0010152e-05 3.6650163e-05 3.5826914e-05 4.0282059e-05 3.4810513e-05
  5.0123886e-05 9.7296994e-05 1.6356021e-03 1.5494174e-04 6.9037451e-05]]


The test example printed above contains no -1, so every output stays well below 0.5. For an input that does contain a -1, the output should switch from values less than 0.5 to values greater than 0.5 at that point. In the ideal case this would be a clean separation between 0s and 1s, but typically we need a more powerful model to pull that off.
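To turn the network's real valued outputs into alarm decisions we can apply a 0.5 threshold. The sketch below uses a made-up vector of probabilities rather than real model output, and the helper name `to_alarm` is an assumption for illustration.

```python
import numpy as np

def to_alarm(probabilities, threshold=0.5):
    """Convert per-step probabilities into 0/1 alarm decisions."""
    return (np.asarray(probabilities) > threshold).astype(int)

# Hypothetical per-step outputs for a 5-step sequence (not real model output)
probs = np.array([0.01, 0.02, 0.97, 0.88, 0.93])
decisions = to_alarm(probs)
print(decisions)
```

The same function could be applied to the `res` array produced by `model.predict` above, for example `to_alarm(res[0].T)`, and the result compared against the corresponding row of `Y_test`.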



## Aside: Recurrent vs Recursive Networks  
A Recurrent Neural Network is one in which the hidden state of the network is reapplied as a parameter into the network when the next input comes along. The short hand RNN is often used for Recurrent Neural Networks. A Recursive Neural Network meanwhile is a distinct architectural approach in neural networks where we attempt to apply a model in a recursive way over an analysis.

This distinction is made clear with an example. Consider the following example of a Recurrent Neural Network. A single network layer produces an output for a given input. While the hidden state is reapplied into the network for the analysis of each individual input token, there is no recursive structure.

<!-- recurrent.png
Figure sourced without permission from http://www.slideshare.net/jiessiecao/parsing-natural-scenes-and-natural-language-with-recursive-neural-networks  -->
<img width="500" src="https://drive.google.com/uc?id=1gyvO2IToNQ2J7OXcVLxIyvLi1NnXCxCZ"/>

On the other hand consider the example below of a Recursive Neural Network. Here a model is applied to the total input to produce an output for the total input. That total output is then applied as input to a new higher-order tier of analysis. The output of that analysis can in turn be applied as the input to another higher-order tier of analysis and so forth. Thus the Recursive Neural Network recursively analyses the total input and can be thought of conceptually as being similar to a parser.  

<!-- recursive.png
Figure sourced without permission from http://www.slideshare.net/jiessiecao/parsing-natural-scenes-and-natural-language-with-recursive-neural-networks
 -->
<img width="500" src="https://drive.google.com/uc?id=1HLhImrJMSUKXHqdgZJYvweBlM6CNrxUv"/>
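One common way to realise the recursive idea is to apply the same small network at every node of a tree, combining the vectors of two children into a vector for their parent. The sketch below assumes that formulation; the function `compose`, the toy embeddings, the random weights and the example tree are all hypothetical and are not taken from the figure.

```python
import numpy as np

def compose(node, embeddings, W, b):
    """Recursively build a vector for a nested (left, right) tuple tree of tokens."""
    if isinstance(node, str):                       # leaf: look up the token vector
        return embeddings[node]
    left, right = node
    children = np.concatenate([compose(left, embeddings, W, b),
                               compose(right, embeddings, W, b)])
    return np.tanh(W @ children + b)                # same weights reused at every tier

dim = 3                                             # illustrative vector size
rng = np.random.default_rng(0)
embeddings = {tok: rng.normal(size=dim) for tok in ["the", "cat", "sat"]}
W = rng.normal(scale=0.5, size=(dim, 2 * dim))
b = np.zeros(dim)

tree = (("the", "cat"), "sat")                      # hypothetical structure: ((the cat) sat)
root = compose(tree, embeddings, W, b)
print(root.shape)
```

In contrast to the recurrent case, the order in which vectors are combined is dictated by the tree structure rather than by position in the sequence.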

It should be noted that it is possible to construct Recursive Recurrent Neural Networks. Here a Recurrent Network is used to generate an output for one tier of the analysis, but that output is then fed into the network again to produce a higher order tier. For our purposes here we will stay focused on recurrent neural network architectures.

## Long Short Term Memory

The vanilla RNN that we have looked at to this point is very useful at explaining the basics of how RNNs work, but in practice the vanilla RNN is not strong enough for us to perform complex sequential tasks.


### The Challenge: Travelling Back in Time!

Let's say for example we have a case where a change in the output at T=6 should be dependent on the input at T=1. In an ideal case backpropagation should allow us to learn this sort of dependency, but in practice it doesn't. The reason for this is essentially that the learning signal tends to be dominated by errors coming from nearby time points rather than from distant ones. We can understand this better by looking at the figure below. The dark coloring indicates the strength of signal between inputs and outputs in either the forward or backward propagation stages. We see that there is a strong relationship between X and Y at the first time step, but because extra layers are present as we move from X at T=1 down to Y at T=2, 3, 4, etc., it is more difficult to have a strong signal between X at T=1 and Ys that are further away in time.

<!-- bit-1.png -->
<img width="500" src="https://drive.google.com/uc?id=1fQINGBbB7r9CpFMDsqnPQse0F2MQnIca"/>

This means that at T=6, for example, the cost function might be able to identify that something should be changed and start passing back the error derivative to achieve that change, but by the time the error derivative makes it back to the first hidden layer at T=1, that signal is in practice too small relative to the error derivative signals from nearer time steps to be paid attention to.

What we need is some way to have more direct control over what error signal passes through the network. Our error signal will still get smaller as it backpropagates through time, but we will learn not to corrupt it with error derivatives that just aren't as relevant to us.

This is illustrated in the figure below. Circles indicate that the unit is accepting signal for that neuron, while dashes indicate that the unit is closed to input in that direction.

<!-- bit-2.png -->
<img width="500" src="https://drive.google.com/uc?id=1nMYf9IyWn1bDy8QKlqtLmmcEXbGP5TfB"/>

The hand-waving idea here is that there is less leakage or corruption of the error derivative as we move through time. In practice it is a bit more complicated than that -- but not that much.



### Long Short Term Memory Overview

The best known example of an RNN that overcomes this challenge is called the Long Short Term Memory (LSTM) cell. The LSTM cell design is over 20 years old but has proven to be extremely beneficial. It would no longer be considered state-of-the-art for sequence processing tasks -- particularly if you have a lot of computing power to spare -- but for many common sequence processing tasks it is still extensively used.

There are many resources available online that provide excellent introductions to the model. The blog post by Christopher Olah is particularly excellent and well known. Rather than repeating that material here, please review Christopher's blog post on LSTM in detail.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

In the lecture I step through the model, but I refer you to this blog post as a source of information on the model.
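As a companion to the blog post, a single step of the commonly presented LSTM formulation can be sketched in NumPy. The function names, the packing of all four gates into one weight matrix, and the ordering of the gates within it are choices made for this sketch; they are not claimed to match the internal layout of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * state_size, input_size + state_size)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0:n])            # forget gate: how much of the old cell state to keep
    i = sigmoid(z[n:2 * n])        # input gate: how much of the candidate to write
    o = sigmoid(z[2 * n:3 * n])    # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * n:4 * n])    # candidate cell values
    c_t = f * c_prev + i * g       # additive update of the cell state
    h_t = o * np.tanh(c_t)         # hidden state passed to the next step and layer
    return h_t, c_t

input_size, state_size = 16, 16    # illustrative sizes (assumption)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * state_size, input_size + state_size))
b = np.zeros(4 * state_size)

h, c = np.zeros(state_size), np.zeros(state_size)
for x_t in rng.normal(size=(5, input_size)):   # five hypothetical input vectors
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape, W.size + b.size)
```

The gates are the mechanism for opening and closing the unit to signal, and because there are four sets of weights the parameter count is four times that of a vanilla RNN cell with the same input and state sizes.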

### Putting LSTM to work: Classification

Despite the fact that the LSTM model is considerably more complex than the basic RNN cell, the actual usage of the model within Keras / Tensorflow is practically just as easy as the basic RNN case. Let's see how our text classification task from the last session can be converted over to use RNNs first, and then LSTMs instead.


In [11]:
import numpy as np
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Embedding, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D, LSTM, SimpleRNN
from tensorflow.keras.datasets import imdb
import tensorflow as tf

Let's set up some parameters and load up the dataset.

In [12]:
max_features = 5000
maxlen = 400
embedding_dims = 16
epochs = 5

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('Train shape:', x_train.shape)
print('Test shape:', x_test.shape)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Train shape: (25000, 400)
Test shape: (25000, 400)


Let's define the same basic model we used the last time, and then set up the training on it.

In [13]:
model = Sequential()

model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Flatten())
model.add(Dense(2, activation='tanh'))
model.add(Dense(1,activation="sigmoid"))
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 400, 16)           80000     
                                                                 
 flatten (Flatten)           (None, 6400)              0         
                                                                 
 dense_1 (Dense)             (None, 2)                 12802     
                                                                 
 dense_2 (Dense)             (None, 1)                 3         
                                                                 
Total params: 92805 (362.52 KB)
Trainable params: 92805 (362.52 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [14]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=epochs,
          validation_data=(x_test, y_test))


Now, let's set up a model that uses the basic RNN.

In [None]:
model = Sequential()
model.add(Embedding(max_features, embedding_dims))
model.add(SimpleRNN(embedding_dims))
model.add(Dense(2,activation="tanh"))
model.add(Dense(1,activation="sigmoid"))
model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=epochs,
          validation_data=(x_test, y_test))

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 16)          80000     
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 16)                528       
                                                                 
 dense_3 (Dense)             (None, 2)                 34        
                                                                 
 dense_4 (Dense)             (None, 1)                 3         
                                                                 
Total params: 80565 (314.71 KB)
Trainable params: 80565 (314.71 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

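Finally, let's swap the SimpleRNN layer for an LSTM layer. Only one line of the model definition changes.

In [None]: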
model = Sequential()
model.add(Embedding(max_features, embedding_dims))
model.add(LSTM(embedding_dims))
model.add(Dense(2,activation="tanh"))
model.add(Dense(1,activation="sigmoid"))
model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=epochs,
          validation_data=(x_test, y_test))

Training is now complete, and in principle we could directly use the model now to pass through a complete batch of input data and get the outputs. In practice though it can be handy to rebuild the model assuming a batch input size of 1 only. This makes it easier for us to run one example at a time through the network.

Therefore we first recreate our model -- but this time fixing the input layer batch size to one:

In [None]:
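# Note: vocab, embedding_dim and rnn_units are not defined in the cells above;
# they must be defined before this cell is run.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(len(vocab), embedding_dim,
                          batch_input_shape=[1, None]))
model.add(layers.SimpleRNN(rnn_units,return_sequences=True))
model.add(tf.keras.layers.Dense(len(vocab)))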

## Suggested Tasks

 1. Extend the example above to use a combination of filters of length 2, 3 and 4 as suggested on the source article http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
 2. Use the pre-trained embeddings layers in place of an on-the-fly embeddings layer