<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-in-action/blob/9-text-with-long-short-term-memory-networks/1_improving_retention_with_long_short_term_memory_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Improving retention with long short-term memory networks

For all the benefits recurrent neural nets provide for modeling relationships, and therefore possibly causal relationships, in sequence data, they suffer from one main deficiency: a token’s effect is almost completely lost by the time two tokens have passed.

This short memory follows from the basic structure of the net, but it keeps the net from handling a common case in human language: tokens that are deeply interrelated even when they’re far apart in a sentence.

Your challenge is to build a network that can pick up on the same core thought in both sentences. What you need is a way to remember the past across the entire input sequence. A long short-term memory (**LSTM**) is just what you need.

A closely related unit, the gated recurrent unit (**GRU**), is a simplified variant of the **LSTM** cell. Both use gates to maintain long- and short-term memory efficiently, which lets the network process a long sentence or document more accurately.

In fact, **LSTMs** work so well they have replaced plain recurrent neural networks in almost all applications involving time series, discrete sequences, and **NLP**.


First, you load the dataset and shuffle the examples. Then you
tokenize it and vectorize it again using the Google Word2vec model. Next, you grab the labels. And finally you split it 80/20 into the training and test sets.

## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, LSTM, SimpleRNN
from tensorflow.keras.preprocessing import sequence

import os
import tarfile
import re
import tqdm

import glob
from random import shuffle
from nltk.tokenize import TreebankWordTokenizer

import requests

TensorFlow 2.x selected.


In [0]:
! pip install pugnlp

In [0]:
from pugnlp.futil import path_status, find_files

## Data Preparation

Each data point is prelabeled with a 0 (negative sentiment) or a 1 (positive sentiment). You’re going to swap out the prepackaged example IMDB movie review dataset
for one in raw text, so you can get your hands dirty with the preprocessing of the text as well. And then you’ll see if you can use this trained network to classify text it has never seen before.

### Downloading data

In [0]:
BIG_URLS = {
    'w2v': ('https://www.dropbox.com/s/965dir4dje0hfi4/GoogleNews-vectors-negative300.bin.gz?dl=1', 1647046227),
    'slang': ('https://www.dropbox.com/s/43c22018fbfzypd/slang.csv.gz?dl=1', 117633024),
    'tweets': ('https://www.dropbox.com/s/5gpb43c494mc8p0/tweets.csv.gz?dl=1', 311725313),
    'lsa_tweets': ('https://www.dropbox.com/s/rpjt0d060t4n1mr/lsa_tweets_5589798_2003588x200.tar.gz?dl=1', 3112841563),  # 3112841312
    'imdb': ('https://www.dropbox.com/s/yviic64qv84x73j/aclImdb_v1.tar.gz?dl=1', 84125825),  # matches the remote size reported when downloading
}

In [0]:
# These functions are part of the nlpia package which can be pip installed and run from there.
def dropbox_basename(url):
    filename = os.path.basename(url)
    match = re.findall(r'\?dl=[0-9]$', filename)
    if match:
        return filename[:-len(match[0])]
    return filename

def download_file(url, data_path='.', filename=None, size=None, chunk_size=4096, verbose=True):
    """Uses stream=True and a reasonable chunk size to be able to download large (GB) files over https"""
    if filename is None:
        filename = dropbox_basename(url)
    file_path = os.path.join(data_path, filename)
    if url.endswith('?dl=0'):
        url = url[:-1] + '1'  # noninteractive download
    if verbose:
        print('requesting URL: {}'.format(url))
    r = requests.get(url, stream=True, allow_redirects=True)
    size = r.headers.get('Content-Length', None) if size is None else size
    # Content-Length arrives as a string, so cast it before comparing with the local file size
    size = int(size) if size is not None else None
    print('remote size: {}'.format(size))

    stat = path_status(file_path)
    print('local size: {}'.format(stat.get('size', None)))
    # Skip the download when a local file of the same size already exists
    if stat['type'] == 'file' and stat['size'] == size:  # TODO: check md5 or get the right size of remote file
        r.close()
        return file_path

    print('Downloading to {}'.format(file_path))

    with open(file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=chunk_size):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)

    r.close()
    return file_path

def untar(fname):
    if fname.endswith("tar.gz"):
        # Use a name other than tf so the handle is not confused with the tensorflow alias
        with tarfile.open(fname) as tar:
            tar.extractall()
    else:
        print("Not a tar.gz file: {}".format(fname))

In [41]:
download_file(BIG_URLS['w2v'][0])

requesting URL: https://www.dropbox.com/s/965dir4dje0hfi4/GoogleNews-vectors-negative300.bin.gz?dl=1
remote size: 1647046227
local size: 1647046227
Downloading to ./GoogleNews-vectors-negative300.bin.gz


'./GoogleNews-vectors-negative300.bin.gz'

In [42]:
untar(download_file(BIG_URLS['imdb'][0]))

requesting URL: https://www.dropbox.com/s/yviic64qv84x73j/aclImdb_v1.tar.gz?dl=1
remote size: 84125825
local size: 84125825
Downloading to ./aclImdb_v1.tar.gz


### Preprocessing the loaded documents

The reviews in the train folder are broken up into text files in either the pos or neg folders. You’ll first need to read those in Python with their appropriate label and then shuffle the deck so the samples aren’t all positive and then all negative. Training with the sorted labels will skew training toward whatever comes last, especially when you use certain hyperparameters, such as momentum.

In [0]:
import glob
from random import shuffle

def pre_process_data(filepath):
  '''
  This is dependent on your training data source but we will try to generalize it as best as possible.
  '''
  positive_path = os.path.join(filepath, 'pos')
  negative_path = os.path.join(filepath, 'neg')

  pos_label = 1
  neg_label = 0

  dataset = []

  for filename in glob.glob(os.path.join(positive_path, '*.txt')):
    with open(filename, 'r') as f:
      dataset.append((pos_label, f.read()))

  for filename in glob.glob(os.path.join(negative_path, '*.txt')):
    with open(filename, 'r') as f:
      dataset.append((neg_label, f.read()))

  shuffle(dataset)

  return dataset

### Data tokenization and vectorization

The next step is to tokenize and vectorize the data. You’ll use the Google News pretrained Word2vec vectors, so download those directly from Google.

You’ll use gensim to unpack the vectors. You can
experiment with the limit argument to the load_word2vec_format method; a
higher number will get you more vectors to play with, but memory quickly becomes an issue, and the return on investment drops quickly for really high values of limit.

In [5]:
from nltk.tokenize import TreebankWordTokenizer
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)

def tokenize_and_vectorize(dataset):
  tokenizer = TreebankWordTokenizer()
  vectorized_data = []
  expected = []

  for sample in dataset:
    tokens = tokenizer.tokenize(sample[1])
    sample_vecs = []
    for token in tokens:
      try:
        sample_vecs.append(word_vectors[token])
      except KeyError:
        pass    # No matching token in the Google w2v vocab

    vectorized_data.append(sample_vecs)

  return vectorized_data



You also need to collect the target values—0 for a negative review, 1 for a positive review—in the same order as the training samples.

In [0]:
def collect_expected(dataset):
  '''Peel off the target values from the dataset'''
  expected = []
  for sample in dataset:
    expected.append(sample[0])
  
  return expected

### Padding and truncating token sequences (sequences of vectors)

Keras has a preprocessing helper method, pad_sequences, that in theory could be
used to pad your input data, but unfortunately it works only with sequences of scalars, and you have sequences of vectors. 

Let’s write a helper function of your own to pad your input data.

In [0]:
def pad_trunc(data, maxlen):
  '''For a given dataset pad with zero vectors or truncate to maxlen'''
  new_data = []

  # Create a vector of 0's the length of our word vectors
  zero_vector = [0.0 for _ in range(len(data[0][0]))]

  for sample in data:
    if len(sample) > maxlen:
        temp = sample[:maxlen]
    elif len(sample) < maxlen:
        # Copy the sample so the padding does not modify the caller's list
        temp = list(sample)
        additional_elems = maxlen - len(sample)
        for _ in range(additional_elems):
            temp.append(zero_vector)
    else:
        temp = sample
    new_data.append(temp)
  
  return new_data

### Train/Test splitting

Next you’ll split the prepared data into a training set and a test set. You’re just going to split your imported dataset 80/20, but this ignores the folder of test data.

In [10]:
# gather the data and prep it.
# The archive unpacks into ./aclImdb, not ./aclImdb_v1.
dataset = pre_process_data('./aclImdb/train')
print(len(dataset))  # a length of 0 means the path above is wrong


In [74]:
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)

# split the data into training and testing sets.
split_point = int(len(vectorized_data) * .8)

x_train = vectorized_data[:split_point]
y_train = expected[:split_point]
x_test = vectorized_data[split_point:]
y_test = expected[split_point:]

### Hyper-parameters

The next cell sets most of the hyperparameters for the net.

In [0]:
maxlen = 400          # holds the maximum review length
batch_size = 32       # How many samples to show the net before backpropagating the error and updating the weights
embedding_dims = 300  # Length of the token vectors you’ll create for passing into the LSTM
epochs = 2            # Number of times we will pass the entire training dataset through the network

Then you need to pass your train and test data into the padder/truncator. After that you can convert it to numpy arrays to make Keras happy. This is a tensor with the shape (number of samples, sequence length, word vector length) that you need for your LSTM.

In [69]:
# further prep the data by making each point of uniform length.
x_train = pad_trunc(x_train, maxlen)
x_test = pad_trunc(x_test, maxlen)

In [0]:
# reshape into a numpy data structure.
x_train = np.reshape(x_train, (len(x_train), maxlen, embedding_dims))
y_train = np.array(y_train)

x_test = np.reshape(x_test, (len(x_test), maxlen, embedding_dims))
y_test = np.array(y_test)

Phew; finally you’re ready to build a neural network.

## Long short-term memory (LSTM) network architecture

**LSTMs** introduce the concept of a state for each layer in the recurrent network. The state acts as its memory. You can think of it as adding attributes to a class in object-oriented programming. The memory state’s attributes are updated with each training example.

In LSTMs, the rules that govern the information stored in the state (memory) are trained neural nets themselves—therein lies the magic. They can be trained to learn what to remember, while at the same time the rest of the recurrent net learns to predict
the target label! With the introduction of a memory and state, you can begin to learn dependencies that stretch not just one or two tokens away, but across the entirety of each data sample. With those long-term dependencies in hand, you can start to see beyond the words themselves and into something deeper about language.

With LSTMs, patterns that humans take for granted and process on a subconscious level begin to be available to your model. And with those patterns, you can not only more accurately predict sample classifications, but you can start to generate novel text using those patterns. The state of the art in this field is still far from perfect, but the results you’ll see, even in your toy examples, are striking.

<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-1.png?raw=1' width='800'/>

The memory state is affected by the input and also affects the layer output just as in a normal recurrent net. But that memory state persists across all the time steps of the time series (your sentence or document). So each input can have an effect on the memory state as well as an effect on the hidden layer output. The magic of the memory state is that it learns what to remember at the same time that it learns to reproduce the output, using standard backpropagation!

<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-2.png?raw=1' width='800'/>

It looks similar to a normal recurrent neural net. However, in addition to the activation output feeding into the next time-step version of the layer, you add a memory state that also passes through time steps of the network. At each time-step iteration, the hidden recurrent unit has access to the memory unit. The addition of this memory unit, and the mechanisms that interact with it, make this quite a bit different from a traditional neural network layer. However, you may like to know that it’s possible to design a set of traditional recurrent neural network layers (a computational graph) that accomplishes all the computations that exist within an LSTM layer. An LSTM layer is just a highly specialized recurrent neural network.

<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-3.png?raw=1' width='800'/>

So let’s take a closer look at one of these cells. Instead of being a series of weights on the input and an activation function on those weights, each cell is now somewhat more complicated. As before, the input to the layer (or cell) is a combination of the input sample and output from the previous time step. As information flows into the cell instead of a vector of weights, it’s now greeted by three gates: 
* a forget gate, 
* an input/candidate gate, and 
* an output gate

Each of these gates is a feed forward network layer composed of a series of weights that the network will learn, plus an activation function. Technically one of the gates is composed of two feed forward paths, so there will be four sets of weights to learn in this layer. The weights and activations will aim to allow information to flow through the cell in different amounts, all the way through to the cell’s state (or memory).

### LSTM layer in Keras

In [0]:
num_neurons = 50

model = Sequential()
model.add(LSTM(num_neurons, return_sequences=True, input_shape=(maxlen, embedding_dims)))
model.add(Dropout(0.2))

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

One import and one line of Keras code changed. But a great deal more is going on under the surface. From the summary, you can see you have many more parameters to train than you did in the SimpleRNN from last chapter for the same number of neurons
(50). Recall the simple RNN had the following weights:

* 300 (one for each element of the input vector)
* 1 (one for the bias term)
* 50 (one for each neuron’s output from the previous time step)

For a total of 351 per neuron.

351 * 50 = 17,550 for the layer

The cells have three gates (a total of four neurons):

17,550 * 4 = 70,200
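
As a quick sanity check, the arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch: the helper names `simple_rnn_params` and `lstm_params` are made up for illustration and are not part of Keras.

```python
def simple_rnn_params(input_dim, num_neurons):
    """Weights per neuron: one per input element, one per recurrent input, one bias."""
    return (input_dim + num_neurons + 1) * num_neurons


def lstm_params(input_dim, num_neurons):
    """An LSTM layer holds four such weight sets (forget, input mask, candidate values, output)."""
    return 4 * simple_rnn_params(input_dim, num_neurons)


print(simple_rnn_params(300, 50))  # 17550
print(lstm_params(300, 50))        # 70200
```

The second number should match the parameter count that `model.summary()` reports for the LSTM layer.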

But what is the memory? The memory is going to be represented by a vector that is the same number of elements as neurons in the cell. Your example has a relatively simple 50 neurons, so the memory unit will be a vector of floats that is 50 elements long.

<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-4.png?raw=1' width='800'/>

You take the first token from the first sample and pass its 300-element vector representation into the first LSTM cell. On the way into the cell, the vector representation of the data is concatenated with the vector output from the previous time step (which is a 0 vector in the first time step). In this example, you’ll have a vector that is 300 + 50 elements long. Sometimes you’ll see a 1 appended to the vector—this corresponds to the bias term. Because the bias term always multiplies its associated weight by a value of one before passing to the activation function, that input is occasionally omitted from the input vector representation, to keep the diagrams more digestible.

<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-5.png?raw=1' width='800'/>

At the first fork in the road, you hand off a copy of the combined input vector to the ominous sounding forget gate. The forget gate’s goal is to learn, based on a given input, how much of the cell’s memory you want to erase.

The idea behind wanting to forget is as important as wanting to remember. As a human reader, when you pick up certain bits of information from text—say whether the noun is singular or plural—you want to retain that information so that later in the
sentence you can recognize the right verb conjugation or adjective form to match with it. 

In Romance languages, you’d have to recognize a noun’s gender, too, and use that later in a sentence as well. But an input sequence can easily switch from one noun to another, because an input sequence can be composed of multiple phrases, sentences, or even documents. As new thoughts are expressed in later statements, the fact that the noun is plural may not be at all relevant to later unrelated text.

---

*A thinker sees his own actions as experiments and questions—as attempts to find out something. Success and failure are for him answers above all.*

---


In this quote, the verb “see” is conjugated to fit with the noun “thinker.” The next active verb you come across is “to be” in the second sentence. At that point “be” is conjugated into “are” to agree with “Success and failure.” If you were to conjugate it to match the first noun you came across, “thinker,” you would use the wrong verb form, “is” instead. 

So an LSTM must model not only long-term dependencies within a
sequence, but just as crucially, also forget long-term dependencies as new ones arise. This is what forget gates are for: making room for the relevant memories in your memory cells.

The network isn’t working with these kinds of explicit representations. Your network is trying to find a set of weights to multiply by the inputs from the sequence of tokens so that the memory cell and the output are both updated in a way that minimizes the error. It’s amazing that they work at all. And they work very well indeed. But enough marveling: back to forgetting.

<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-6.png?raw=1' width='800'/>

The forget gate itself is just a feed forward network. It consists of n neurons each with m + n + 1 weights. So your example forget gate has 50 neurons each with 351 (300 + 50 + 1) weights. The activation function for a forget gate is the sigmoid function, because you want the output for each neuron in the gate to be between 0 and 1.

<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-7.png?raw=1' width='800'/>

The output vector of the forget gate is then a mask of sorts, albeit a porous one, that erases elements of the memory vector. As the forget gate outputs values closer to 1, more of the memory’s knowledge in the associated element is retained for that time step; the closer it is to 0 the more of that memory value is erased.
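
The forget gate described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the Keras implementation: the weights `W_f` and `b_f` are random stand-ins for values a real network would learn, and the token vector and memory are random placeholders.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
m, n = 300, 50                      # word vector length, number of neurons

x_t = rng.normal(size=m)            # token vector at time step t (random stand-in)
h_prev = np.zeros(n)                # output from time step t-1 (zeros at the first step)
memory = rng.normal(size=n)         # current memory state (random stand-in)

W_f = rng.normal(scale=0.1, size=(n, m + n))  # hypothetical forget-gate weights
b_f = np.zeros(n)                             # hypothetical forget-gate biases

combined = np.concatenate([x_t, h_prev])      # 300 + 50 = 350 elements
forget_mask = sigmoid(W_f @ combined + b_f)   # one value between 0 and 1 per memory slot
memory_after = forget_mask * memory           # values near 0 erase, values near 1 retain

print(combined.shape, forget_mask.shape)
```

Each of the 50 neurons has 350 weights plus 1 bias here, which is the 351 weights per neuron counted earlier.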

Actively forgetting things, check. You better learn how to remember something new, or this is going to go south pretty quickly. Just like in the forget gate, you use a small network to learn how much to augment the memory based on two things: the
input so far and the output from the last time step. This is what happens in the next gate you branch into: **the candidate gate**.

The candidate gate has two separate neurons inside it that do two things:
* Decide which input vector elements are worth remembering (similar to the mask in the forget gate)
* Route the remembered input elements to the right memory “slot”

The first part of a candidate gate is a neuron with a sigmoid activation function whose goal is to learn which input values of the memory vector to update. This neuron closely resembles the mask in the forget gate.


<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-8.png?raw=1' width='800'/>

The second part of this gate determines what values you’re going to update the memory with. This second part has a tanh activation function that forces the output value to range between -1 and 1. The output vectors of these two parts are multiplied
together elementwise. The resulting vector from this multiplication is then added, again elementwise, to the memory register, thus remembering the new details.

This gate is learning simultaneously which values to extract and the magnitude of those particular values. The mask and magnitude become what’s added to the memory state. As in the forget gate, the candidate gate is learning to mask off the inappropriate information before adding it to the cell’s memory.
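
The two parts of the candidate gate can be sketched the same way. Again this is an illustration only: the weight names (`W_i`, `W_c`) and all values are random, hypothetical stand-ins rather than anything taken from Keras.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(1)
m, n = 300, 50

combined = rng.normal(size=m + n)   # stand-in for the concatenated input and previous output
memory_before = np.zeros(n)         # memory state after the forget gate (zeros for simplicity)

W_i = rng.normal(scale=0.1, size=(n, m + n))  # hypothetical weights: which slots to update
b_i = np.zeros(n)
W_c = rng.normal(scale=0.1, size=(n, m + n))  # hypothetical weights: what values to write
b_c = np.zeros(n)

update_mask = sigmoid(W_i @ combined + b_i)       # between 0 and 1
candidate_values = np.tanh(W_c @ combined + b_c)  # between -1 and 1

# Elementwise product of mask and magnitude, added elementwise to the memory
memory_after = memory_before + update_mask * candidate_values

print(memory_after.shape)
```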

So old, hopefully irrelevant things are forgotten, and new things are remembered. Then you arrive at the last gate of the cell: **the output gate**.

Up until this point in the journey through the cell, you’ve only written to the cell’s memory. Now it’s finally time to get some use out of this structure. The output gate takes the same input as the other gates: the concatenation of the input at time step t and the output of the cell at time step t-1.

The concatenated input is passed into the weights of the n neurons, and then you apply a sigmoid activation function to output an n-dimensional vector of floats, just like the output of a SimpleRNN. But instead of handing that information out through the cell wall, you pause.

The memory structure you’ve built up is now primed, and it gets to weigh in on what you should output. This judgment is achieved by using the memory to create one last mask. This mask is a kind of gate as well, but you refrain from using that term because this mask doesn’t have any learned parameters, which differentiates it from the three previous gates described.


<img src='https://github.com/rahiakela/img-repo/blob/master/lstm-network-9.png?raw=1' width='800'/>

The mask created from the memory is the memory state with a tanh function applied elementwise, which gives an n-dimensional vector of floats between -1 and 1.

That mask vector is then multiplied elementwise with the raw vector computed in the output gate’s first step. The resulting n-dimensional vector is finally passed out of the cell as the cell’s official output at time step t.

Thereby the memory of the cell gets the last word on what’s important to output at time step t, given what was input at time step t and output at t-1, and all the details it has gleaned up to this point in the input sequence.
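
The output step can be sketched in NumPy as well. As before, this is a toy illustration with random, hypothetical weights (`W_o`, `b_o`) and a random memory state, not the Keras internals.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(2)
m, n = 300, 50

combined = rng.normal(size=m + n)   # stand-in for the concatenated input and previous output
memory = rng.normal(size=n)         # stand-in for the memory after the forget and candidate updates

W_o = rng.normal(scale=0.1, size=(n, m + n))  # hypothetical output-gate weights
b_o = np.zeros(n)

raw_output = sigmoid(W_o @ combined + b_o)  # learned part of the gate, between 0 and 1
memory_mask = np.tanh(memory)               # mask with no learned parameters, between -1 and 1
h_t = raw_output * memory_mask              # the cell's output at time step t

print(h_t.shape)
```

Note that `memory_mask` is computed from the memory alone, which is why it has no weights of its own to train.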







### Backpropagation through time

The network learns through backpropagation, as with any other neural net. For a moment, let’s step back and look at the problem you’re trying to solve with this new complexity. A vanilla RNN is susceptible to a vanishing gradient because the derivative at any given time step is a factor of the weights themselves, so as you step back in time coalescing the various deltas, after a few iterations, the weights may shrink the gradient away to 0.

The updates to the weights at the end of the backpropagation
(which would equate to the beginning of the sequence) are either
minuscule or effectively 0. A similar problem occurs when the weights are somewhat large: the gradient explodes and grows out of proportion to the rest of the network.
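
A toy calculation makes the effect concrete. The sketch below deliberately simplifies: it treats the recurrent weight as a single scalar and ignores the activation function's derivative, so it only illustrates the repeated multiplication, not a real backward pass. The function name `backprop_factor` is hypothetical.

```python
def backprop_factor(weight, steps):
    """Multiply a unit gradient by the same weight once per time step."""
    gradient = 1.0
    for _ in range(steps):
        gradient *= weight
    return gradient


for weight in (0.5, 1.5):
    print(weight, [backprop_factor(weight, steps) for steps in (1, 10, 50)])
```

With a weight of 0.5 the factor collapses toward 0 within a few dozen steps, and with 1.5 it grows by many orders of magnitude.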

An LSTM avoids this problem via the memory state itself. The neurons in each of the gates are updated via derivatives of the functions they fed, namely those that update the memory state on the forward pass. 

So at any given time step, as the normal chain rule is applied backwards to the forward propagation, the updates to the neurons
are dependent on only the memory state at that time step and the previous one.

This way, the error of the entire function is kept “nearer” to the neurons for each time step. This is known as the error carousel.

So you can just swap out the Keras SimpleRNN layer for the Keras
LSTM layer, and all the other pieces of your classifier will stay the same.

### Training and saving the model

OK, now it’s time to actually train that recurrent network that we so carefully assembled
in the previous section. As with your other Keras models, you need to give the
.fit() method your data and tell it how long you want to run training (epochs).

In [0]:
# train the model
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))

You would like to save the model state after training.
Because you aren’t going to hold the model in memory for now, you can grab its
structure in a JSON file and save the trained weights in another file for later reinstantiation.

In [0]:
model_structure = model.to_json()   # Note that this doesn’t save the weights of the network, only the structure.

# Save your trained model before you lose it!
with open('lstm_model1.json', 'w') as json_file:
  json_file.write(model_structure)
model.save_weights('lstm_weights1.h5')

That is an enormous leap in the validation accuracy compared to the simple RNN. You can start to see how large a gain you can achieve by providing the model with a memory when the relationship of tokens is so important. The beauty of the algorithm is that it learns the relationships of the tokens it sees. The network is now able to model those relationships, specifically in the context of the cost function you provide.

Now your trained model will be persisted on disk; should it converge, you won’t have to train it again.

## Prediction

Let’s make up a sentence with an obvious negative sentiment and see what the network has to say about it.

In [0]:
# loading model
from tensorflow.keras.models import model_from_json

with open('lstm_model1.json', 'r') as json_file:
  json_string = json_file.read()
model = model_from_json(json_string)

model.load_weights('lstm_weights1.h5')

In [0]:
sample_1 = """
I hate that the dismal weather had me down for so long, when will it break! Ugh, when does happiness return?
The sun is blinding and the puffy clouds are too thin.  I can't wait for the weekend.
"""

With the model pretrained, testing a new sample is quick. There are still thousands and
thousands of calculations to do, but for each sample you only need one forward pass
and no backpropagation to get a result.

In [0]:
# You pass a dummy value in the first element of the tuple just because
# your helper expects it from the way you processed the initial data.
# That value won’t ever see the network, so it can be anything.
vec_list = tokenize_and_vectorize([(1, sample_1)])

# Tokenize returns a list of the data (length 1 here)
test_vec_list = pad_trunc(vec_list, maxlen)

test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))
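# predict_classes is not available in newer TensorFlow releases, so threshold the sigmoid output instead
print("Sample's sentiment, 1 - pos, 0 - neg : {}".format((model.predict(test_vec) > 0.5).astype('int32')))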
print("Raw output of sigmoid function: {}".format(model.predict(test_vec)))

Going through this process of examining the probabilities and input data associated with incorrect predictions helps build your machine learning intuition so you can build better NLP pipelines in the future. This is backpropagation through the
human brain for the problem of model tuning.

## Dirty data

This more powerful model still has a great number of hyperparameters to toy with. But now is a good time to pause and look back to the beginning, to your data. You’ve been using the same data, processed in exactly the same way since you started with convolutional neural nets, specifically so you could see the variations in the types of models and their performance on a given dataset.

Padding or truncating each sample to 400 tokens was important for convolutional nets so that the filters could “scan” a vector with a consistent length. And convolutional nets output a consistent vector as well. It’s important for the output to be a consistent dimensionality, because the output goes into a fully connected feed forward layer at the end of the chain, which needs a fixed length vector as input.

Similarly, your implementations of recurrent neural nets, both simple and LSTM, are striving toward a fixed length thought vector you can pass into a feed forward layer for classification. A fixed length vector representation of an object, such as a thought vector, is also called an embedding. So that the thought vector is of consistent size, you have to unroll the net to a consistent number of time steps (tokens). 

Let’s look at the choice of 400 as the number of time steps to unroll the net.
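
One way to judge that choice is to look at how long the samples actually are before padding. The helper below is a hypothetical sketch (`summarize_lengths` is not part of this notebook or of Keras), shown here on made-up toy samples; in this notebook you could call it on `vectorized_data` before it is passed to `pad_trunc`.

```python
import numpy as np


def summarize_lengths(samples, maxlen):
    """Report token-count statistics and how many samples maxlen would pad or truncate."""
    lengths = np.array([len(sample) for sample in samples])
    return {
        'mean': float(lengths.mean()),
        'median': float(np.median(lengths)),
        'max': int(lengths.max()),
        'padded': int((lengths < maxlen).sum()),
        'truncated': int((lengths > maxlen).sum()),
    }


# Toy samples of 50, 120, 400, and 900 tokens (the token values do not matter here)
toy_samples = [[0] * 50, [0] * 120, [0] * 400, [0] * 900]
stats = summarize_lengths(toy_samples, maxlen=400)
print(stats)
```

If most samples are far shorter than `maxlen`, much of each input is padding, which is a hint that a smaller value may serve just as well.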

### Optimize the thought vector size

In [0]:
num_neurons = 100

model1 = Sequential()
model1.add(SimpleRNN(num_neurons, return_sequences=True, input_shape=(maxlen, embedding_dims)))
model1.add(Dropout(0.2))

model1.add(Flatten())
model1.add(Dense(1, activation='sigmoid'))

model1.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model1.summary()

Train your larger network

In [0]:
model1.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))

The validation accuracy of 78.24% is only 0.04% better after we doubled the complexity of our model in one of the layers. This negligible improvement should lead you to think the model (for this network layer) is too complex for the data.

In [0]:
model_structure = model1.to_json()   # Note that this doesn’t save the weights of the network, only the structure.

# Save your trained model before you lose it!
with open('simple_rnn_model2.json', 'w') as json_file:
  json_file.write(model_structure)
model1.save_weights('simple_rnn_weights2.h5')

In [0]:
# loading model
from tensorflow.keras.models import model_from_json

with open('simple_rnn_model2.json', 'r') as json_file:
  json_string = json_file.read()
model = model_from_json(json_string)

model.load_weights('simple_rnn_weights2.h5')

In [0]:
sample_1 = """
I hate that the dismal weather had me down for so long, when will it break! Ugh, when does happiness return?
The sun is blinding and the puffy clouds are too thin.  I can't wait for the weekend.
"""

In [0]:
vec_list = tokenize_and_vectorize([(1, sample_1)])

# Tokenize returns a list of the data (length 1 here)
test_vec_list = pad_trunc(vec_list, maxlen)

test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))
model.predict(test_vec)

In [0]:
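# predict_classes is not available in newer TensorFlow releases, so threshold the sigmoid output instead
(model.predict(test_vec) > 0.5).astype('int32')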

If you feel the model is overfitting the training data but you can’t find a way to make
your model simpler, you can always try increasing the Dropout(percentage). This is
a sledgehammer (actually a shotgun) that can mitigate the risk of overfitting while
allowing your model to have as much complexity as it needs to match the data. If you
set the dropout percentage much above 50%, the model starts to have a difficult time
learning. Your learning will slow and validation error will bounce around a lot. But
20% to 50% is a pretty safe range for a lot of NLP problems for recurrent networks.
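
To see what the dropout percentage does, here is a small NumPy sketch of inverted dropout, the scheme in which surviving values are scaled up so the expected activation stays the same. It is an illustration under simplifying assumptions, not the Keras `Dropout` layer itself, and the function name `dropout` is just a local helper.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the values and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(42)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())  # fraction of values zeroed, close to the rate
print(dropped.mean())         # mean stays close to the original mean of 1.0
```

Dropout is only applied during training; at prediction time the layer passes its input through unchanged.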

In [0]:
from tensorflow.keras.layers import Bidirectional

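num_neurons = 10
maxlen = 100
embedding_dims = 300

# The arrays were padded to 400 time steps earlier, so trim them to the new maxlen.
x_train_short = x_train[:, :maxlen, :]
x_test_short = x_test[:, :maxlen, :]

model2 = Sequential()
# input_shape goes on the Bidirectional wrapper, because it is the first layer of the model.
model2.add(Bidirectional(SimpleRNN(num_neurons, return_sequences=True), input_shape=(maxlen, embedding_dims)))
model2.add(Dropout(0.2))

model2.add(Flatten())
model2.add(Dense(1, activation='sigmoid'))

model2.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model2.fit(x_train_short, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test_short, y_test))

model2.summary()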

In [0]:
model_structure = model2.to_json()   # Note that this doesn’t save the weights of the network, only the structure.

# Save your trained model before you lose it!
with open('simple_rnn_model3.json', 'w') as json_file:
  json_file.write(model_structure)
model2.save_weights('simple_rnn_weights3.h5')

In [0]:
# loading model
from tensorflow.keras.models import model_from_json

with open('simple_rnn_model3.json', 'r') as json_file:
  json_string = json_file.read()
model = model_from_json(json_string)

model.load_weights('simple_rnn_weights3.h5')

In [0]:
sample_1 = """
I hate that the dismal weather had me down for so long, when will it break! Ugh, when does happiness return?
The sun is blinding and the puffy clouds are too thin.  I can't wait for the weekend.
"""

In [0]:
vec_list = tokenize_and_vectorize([(1, sample_1)])

# Tokenize returns a list of the data (length 1 here)
test_vec_list = pad_trunc(vec_list, maxlen)

test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))
model.predict(test_vec)

In [0]:
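# predict_classes is not available in newer TensorFlow releases, so threshold the sigmoid output instead
(model.predict(test_vec) > 0.5).astype('int32')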

With these tools you’re well on your way to not just predicting and classifying text, but
actually modeling language itself and how it’s used. And with that deeper algorithmic
understanding, instead of just parroting text your model has seen before, you can
generate completely new statements!