# CS-6570 Lecture 26 - NLP with Keras
**Dylan Zwick**

*Weber State University*

Today's lecture builds on our previous one. We'll explore in greater depth techniques for representing and building models from sets of words, and then talk briefly about building more powerful models from sequences of words. We'll also see how we can use Keras and TensorFlow to construct these models.

How a machine learning model should represent *individual words* is relatively straightforward: they're categorical features. The more difficult question is how to encode the way words are woven into sentences. In other words, *word order*.

The problem of order in natural language is an interesting one: unlike the steps of a time series, words in a sentence don't have a natural, canonical order. Different languages order words in very different ways, and even within a single language the same words can often be reordered with essentially no change in meaning. Order is clearly important, but its relationship to meaning isn't straightforward.

Historically, most early applications of machine learning to NLP used models that didn't take word order into account. These are known as bag-of-words models. Interest in models that incorporate word order (sequence models) only started rising in 2015 with the rebirth of recurrent neural networks (RNNs), and then again in 2017 with the advent of transformers. Today, transformers are essentially the only game in town.

But first, let's import our favorite libraries:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

We'll also want the following text vectorization method:

In [3]:
from tensorflow.keras.layers import TextVectorization

**Bag-of-Words**

We'll start with some bag-of-words models. In particular, we'll explore some bag-of-words models using the IMDB review dataset you may recognize from Assignment 6. The modeling goal here is to classify reviews as either positive or negative. The reviews are very obviously one or the other.

This dataset is available from Keras, and can be loaded as in Assignment 6. However, we'll want a rawer form of the data for this lecture, so we'll download it directly from the source. The code below should download it for you, and you should only need to run it once. For that reason it's disabled in this notebook by wrapping it in triple quotes. If it's your first time through, remove the triple-quote lines at the beginning and end and then run the cell.

In [4]:
"""
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

import os
import pathlib
import shutil
import random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
""";

Now we'll read in the data to create a training dataset, a validation dataset, and a test dataset. We'll use the Keras utility *text_dataset_from_directory* to do so. Don't worry about the specifics of this utility - that's not important for our discussion. Just know we're creating a training, validation, and test dataset.

In [5]:
batch_size = 32
train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=batch_size)

We can check out what our training data looks like:

In [16]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'I have never seen a movie as bad as this. It is meant to be a "fun" movie, but the only joke is at the start, and it is NOT funny. If you like this sort of movie, then you may just be able to give it a vote of 2. If it had the necessary votes, it would truly belong on the bottom 100.<br /><br />', shape=(), dtype=string)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


We see that each batch holds 32 reviews and 32 matching targets, as set by our batch size. The inputs are strings, and the targets are numbers (the numbers 0 and 1, to be specific, for negative and positive reviews).

With a bag-of-words model you can represent an entire text as a single vector. Each entry in the vector indicates the presence of a given word within the text. This provides a single vector with $0$s almost everywhere and some $1$s for those words that are present.
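To make the encoding concrete, here is a minimal sketch in plain NumPy. The five-word vocabulary and the `multi_hot` helper are hypothetical names made up for illustration; they are not part of Keras, and a real vocabulary would hold thousands of words.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
vocabulary = ["bad", "funny", "great", "movie", "plot"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def multi_hot(text, word_to_index):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    vector = np.zeros(len(word_to_index), dtype="float32")
    for word in text.lower().split():
        index = word_to_index.get(word)
        if index is not None:  # words outside the vocabulary are skipped here
            vector[index] = 1.0
    return vector

review_vector = multi_hot("Great movie great plot", word_to_index)
print(review_vector)  # [0. 0. 1. 1. 1.]
```

Note that "great" appears twice in the text but still contributes a single $1$: this encoding records presence, not frequency.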

First, we'll prepare our dataset so it only yields raw text inputs (no labels).

In [20]:
text_only_train_ds = train_ds.map(lambda x, y: x)

Next, we'll limit the vocabulary to only the 20,000 most common words. Then we'll use the *adapt* method to index the vocabulary. Then, we'll create our processed training, validation, and test datasets.

In [22]:
text_vectorization = TextVectorization(max_tokens = 20000, output_mode="multi_hot")
text_vectorization.adapt(text_only_train_ds)  # Scans the training text to build the 20,000-token vocabulary and the word-to-index mapping

binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
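As a rough picture of what the indexing step does, the sketch below builds a capped vocabulary in plain Python. The `build_vocabulary` helper and the toy corpus are hypothetical, and reserving slot 0 for an unknown-word token is an assumption modeled on how the Keras layer appears to behave in this output mode. The real layer also strips punctuation by default, which this sketch does not.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words, reserving index 0 for unknown words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # One slot is taken by the unknown-word token, so keep max_tokens - 1 words.
    most_common = [word for word, _ in counts.most_common(max_tokens - 1)]
    return ["[UNK]"] + most_common

# Hypothetical toy corpus standing in for the training reviews.
toy_corpus = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]
toy_vocabulary = build_vocabulary(toy_corpus, max_tokens=4)
print(toy_vocabulary)
```

With the cap set to 4, rare words such as "bad" and "plot" fall outside the vocabulary, which is the same trade-off we make when keeping only the 20,000 most common words.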

We can now take a look at what our transformed data looks like:

In [24]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


Each review is now a 20,000-dimension vector of $1$s and $0$s. Mostly $0$s.

Now, we'll create a simple dense neural network with a single hidden layer for making our predictions.

In [27]:
def get_model(max_tokens = 20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = keras.layers.Dense(hidden_dim, activation="relu")(inputs)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model
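For intuition about what this model computes, here is a NumPy sketch of its forward pass. The `dense_forward` helper is hypothetical, and the random weights are stand-ins for values that training would learn, so the probabilities it prints are meaningless. Only the shapes and the sequence of operations mirror the model above.

```python
import numpy as np

def dense_forward(x, w1, b1, w2, b2):
    """Forward pass: ReLU hidden layer followed by a sigmoid output unit."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU activation
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid maps to (0, 1)

rng = np.random.default_rng(0)
max_tokens, hidden_dim, batch = 20000, 16, 4   # a batch of 4 is arbitrary

# Sparse 0/1 rows standing in for multi-hot encoded reviews.
x = (rng.random((batch, max_tokens)) < 0.01).astype("float32")
w1 = rng.normal(scale=0.01, size=(max_tokens, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.01, size=(hidden_dim, 1))
b2 = np.zeros(1)

probabilities = dense_forward(x, w1, b1, w2, b2)
print(probabilities.shape)  # (4, 1)
```

Each row of the output is one number between 0 and 1, which the model interprets as the probability that the review is positive.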

Let's run our model on our data, and see how it does.

In [29]:
model = get_model()
model.summary()

model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),  # cache() keeps the data in memory, so the preprocessing only runs once, during the first epoch.
          epochs=10)
#model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.859


Not bad!
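The parameter counts in the model summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below does the arithmetic; `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_params = dense_param_count(20000, 16)  # 20000 * 16 + 16
output_params = dense_param_count(16, 1)      # 16 * 1 + 1
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)  # 320016 17 320033
```

Almost all of the parameters sit in the first layer, because its input is as wide as the vocabulary.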

Of course, discarding word order is very reductive, because even atomic concepts can be expressed via multiple words. For example, the term "United States" conveys a concept that is quite distinct from the meaning of the words "states" or "united" taken separately. For this reason, we can probably do a little better if we have some local order information. We do this using "N-grams", which are sequential combinations of $N$ words. Frequently $N = 2$, and these are called "bigrams".
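A small pure-Python sketch shows both what N-grams are and why they recover some order information. The `ngrams` helper here is hypothetical and only splits on whitespace. The two example sentences contain exactly the same words, so a bag of single words cannot tell them apart, but their bigrams differ.

```python
def ngrams(text, n):
    """Return the list of n-word sequences in text, in order."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

first = "dog bites man"
second = "man bites dog"

# Compare the two sentences as sets of single words and as sets of bigrams.
same_unigrams = set(ngrams(first, 1)) == set(ngrams(second, 1))
same_bigrams = set(ngrams(first, 2)) == set(ngrams(second, 2))

print(ngrams(first, 2))             # ['dog bites', 'bites man']
print(same_unigrams, same_bigrams)  # True False
```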

We can use N-grams just as easily as we used individual words. We just pass the *ngrams* argument to the *TextVectorization* constructor. With *ngrams=2*, the vocabulary contains both single words and bigrams.

In [6]:
text_vectorization = TextVectorization(ngrams=2, max_tokens=20000, output_mode="multi_hot")
text_vectorization.adapt(text_only_train_ds)

binary_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
binary_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

Do bigrams do better than individual words? Let's find out:

In [34]:
model = get_model()
model.summary()

model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10)
#model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.878


A bit better than the test accuracy we saw for single words. I'll take it!

You can also add a bit more information to a bag-of-words representation by counting how many times each word or N-gram occurs. You can do this by passing *output_mode="count"*.
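The difference between presence and counts is easy to see on a toy example. This is a sketch in plain Python and NumPy; the four-word vocabulary and the `count_vector` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter
import numpy as np

vocabulary = ["bad", "great", "movie", "plot"]  # hypothetical toy vocabulary

def count_vector(text, vocabulary):
    """Return how many times each vocabulary word occurs in text."""
    word_counts = Counter(text.lower().split())
    return np.array([word_counts[word] for word in vocabulary], dtype="float32")

review_counts = count_vector("great movie great plot great cast", vocabulary)
review_presence = (review_counts > 0).astype("float32")  # the multi-hot view

print(review_counts)    # [0. 3. 1. 1.]
print(review_presence)  # [0. 1. 1. 1.]
```

The count vector keeps the fact that "great" was said three times, which the presence vector throws away.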

In [37]:
text_vectorization = TextVectorization(ngrams = 2, max_tokens=20000, output_mode="count")
text_vectorization.adapt(text_only_train_ds)

count_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
count_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
count_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [38]:
model = get_model()
model.summary()

model.fit(count_2gram_train_ds.cache(),
          validation_data=count_2gram_val_ds.cache(),
          epochs=10)
#model = keras.models.load_model("count_2gram.keras")
print(f"Test acc: {model.evaluate(count_2gram_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.879


Not much improvement. However, on a larger corpus of text we'd likely see more of a difference.

Now, of course, some words are bound to occur more often than others no matter what the text is about. Words like "the", "and", "is", "it", and the like will almost always dominate your word-count histograms. However, they're pretty useless features in a classification context. How can this be addressed?

Well, we can normalize the frequency using the tf-idf measure we introduced in the last lecture. In fact, it's one of the options for "output_mode".
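As a reminder of how the measure behaves, here is a sketch of one common textbook variant: the raw term count multiplied by $\log(N / \text{df})$, where $N$ is the number of documents and $\text{df}$ is the number of documents containing the word. This is an assumption made for illustration. The exact weighting and smoothing used by Keras, or in the previous lecture, may differ. The toy corpus and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus of three short "reviews".
toy_corpus = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]
tokenized = [doc.split() for doc in toy_corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def tf_idf(doc_words):
    """Raw term count times log(N / document frequency)."""
    term_counts = Counter(doc_words)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in term_counts.items()}

scores = tf_idf(tokenized[1])  # scores for "the movie was bad"
print(scores)
```

Words that appear in every document, like "the" and "was", score exactly zero, while the rare word "bad" gets the highest weight.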

In [40]:
text_vectorization = TextVectorization(ngrams = 2, max_tokens = 20000, output_mode = "tf_idf")
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [44]:
model = get_model()
model.summary()

model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10)
#model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_8 (Dense)             (None, 16)                320016    
                                                                 
 dense_9 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320033 (1.22 MB)
Trainable params: 320033 (1.22 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.873


No improvement here. In fact, the test accuracy dips slightly compared with the count-based model. Still, we'd probably see some gain on a larger corpus of text.

**Sequential Models**

As you might guess, there's more to be said about how we can take advantage of sequences of words. In fact, there's *much* more to be said. Too much to do justice to in this lecture. So, I'll just mention a few things, and encourage you to take an NLP or deep learning class, where they'll probably get into these in more depth.

* Recurrent neural networks (RNNs) are the type of neural network that you use for time series. Or, more broadly, for sequential data. As you might imagine, they can be useful for NLP. In fact, around 2016-2017 bidirectional RNNs (in particular, bidirectional LSTMs) were considered state of the art.

* In 2017 one of the most influential papers in the history of machine learning, titled "Attention Is All You Need", was published, and it introduced "transformer" models. One of the great applications of transformer models was NLP, and today NLP is almost universally done with transformers. It's a pretty cool subject.

