<a href="https://colab.research.google.com/github/utd-hltri/nlp/blob/main/hw1/neural_language_modeling_glove.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling with newswire text


This notebook constructs, trains, and evaluates a simple feed-forward neural language model.

We'll use the [Reuters newswire dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/reuters) that contains the text of 11,228 newswires from Reuters. These are split into 8,982 newswires for training and 2,246 newswires for testing.

This notebook uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow. For a more advanced text classification tutorial using `tf.keras`, see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).

In [None]:
import tensorflow as tf
from tensorflow import keras

import numpy as np

print(tf.__version__)

N = 5
vocab_size=10000

print('Creating {}-gram LM with vocab size={}'.format(N+1, vocab_size))

## Download the Reuters dataset

The Reuters dataset comes packaged with TensorFlow. It has already been preprocessed such that the newswires (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary.

The following code downloads the Reuters dataset to your machine (or uses a cached copy if you've already downloaded it):

In [None]:
from tensorflow.keras.datasets import reuters

(train_data, _), (test_data, _) = reuters.load_data(num_words=vocab_size, seed=1337, test_split=0.2)

The argument `num_words=vocab_size` keeps only the `vocab_size` (10,000) most frequently occurring words in the training data. Rarer words are replaced with a shared out-of-vocabulary index to keep the size of the data manageable. Increasing this limit results in a larger model that may be more accurate, but that takes longer to train and is more prone to overfitting.
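
To get a feel for what this cutoff does, here is a minimal sketch on toy data. It assumes the Keras convention that word indices at or above `num_words` are mapped to index 2; the helper names `truncate_vocab` and `oov_rate` and the toy sequences are made up for illustration and are not part of the dataset API.

```python
import numpy as np

OOV_INDEX = 2  # assumption: the default out-of-vocabulary index used by load_data

def truncate_vocab(sequences, num_words, oov_index=OOV_INDEX):
    # Replace every index outside the vocabulary with the shared OOV index.
    return [[w if w < num_words else oov_index for w in seq] for seq in sequences]

def oov_rate(sequences, oov_index=OOV_INDEX):
    # Fraction of all tokens that ended up as the OOV index.
    tokens = np.concatenate([np.asarray(seq) for seq in sequences])
    return float(np.mean(tokens == oov_index))

# Two made-up "newswires" with a tiny vocabulary limit of 100 words.
toy = [[1, 5, 120, 7], [1, 9000, 15, 42, 8]]
truncated = truncate_vocab(toy, num_words=100)
print(truncated)
print("OOV rate: {:.3f}".format(oov_rate(truncated)))
```

A smaller `num_words` pushes this rate up, which shrinks the model but hides more of the original text from it.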



## Explore the data 

Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the words of the newswire.

In [None]:
print("Training entries: {}, Testing entries: {}".format(len(train_data), len(test_data)))

The text of each newswire has been converted to integers, where each integer represents a specific word in a dictionary. Here's what the first newswire looks like:

In [None]:
print(train_data[0])

Newswires may be different lengths. The code below shows the number of words in the first and second newswires. Since inputs to a neural network must be the same length, we'll need to resolve this later.

In [None]:
len(train_data[0]), len(train_data[1])

### Convert the integers back to words

It may be useful to know how to convert integers back to text. Here, we'll create a helper function to query a dictionary object that contains the integer to string mapping:

In [None]:
# A dictionary mapping words to an integer index
word_index = reuters.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_newswire(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

Now we can use the `decode_newswire` function to display the text for the first newswire:

In [None]:
decode_newswire(train_data[0])

## Prepare the data

The newswires—the variable-length arrays of integers—must be converted to n-gram tensors before being fed into the neural network.

The job of a language model is to predict the next word in a sequence given the previous N words. Therefore, the training data for our neural language model will be `(N-gram, label)` tuples where the *label* is the word following the N-gram in some newswire in our corpus.

To create this training data, we extract each N-gram from each newswire and save the following word as the label for that N-gram.

In [None]:
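def convert_sequences_to_ngrams_with_labels(sequences):
  """Slide a window of N words over each sequence; the next word is the label."""
  ngrams = []
  labels = []
  for seq in sequences:
    # Visit every start index whose N-gram is followed by at least one word,
    # so the final word of each newswire is also used as a label.
    for idx in range(len(seq) - N):
      ngrams.append(seq[idx:idx+N])
      labels.append(seq[idx+N])
  return np.asarray(ngrams), np.asarray(labels)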

train_data, train_labels = convert_sequences_to_ngrams_with_labels(train_data)
test_data, test_labels = convert_sequences_to_ngrams_with_labels(test_data)

Let's look at some of the examples now:

In [None]:
# Show the first five (N-gram -> next word) training examples
for ngram, label in zip(train_data[:5], train_labels[:5]):
  print("{} -> {}".format(decode_newswire(ngram), reverse_word_index[label]))
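
To make the sliding-window idea concrete, here is a small self-contained sketch on a toy sequence. It is an alternative, vectorized formulation using NumPy's `sliding_window_view` (available in NumPy 1.20 and later), not the function used above; the helper name `sliding_ngrams` and the toy indices are made up for illustration.

```python
import numpy as np

def sliding_ngrams(seq, n):
    # Each window holds n context words followed by the word to predict.
    seq = np.asarray(seq)
    if len(seq) <= n:
        # Too short to yield even one (n-gram, label) pair.
        return np.empty((0, n), dtype=seq.dtype), np.empty((0,), dtype=seq.dtype)
    windows = np.lib.stride_tricks.sliding_window_view(seq, n + 1)
    return windows[:, :n], windows[:, n]

toy = [1, 10, 11, 12, 13, 14]  # a made-up newswire of six word indices
contexts, targets = sliding_ngrams(toy, n=3)
for context, target in zip(contexts, targets):
    print(context, "->", target)
```

With a context of three words, this six-word sequence yields three pairs, one for each position that still has a following word.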

### Download the Word Embeddings


In [None]:
import os

if not os.path.isfile('glove.6B.50d.txt'):
  !wget https://nlp.stanford.edu/data/glove.6B.zip
  !unzip glove.6B.zip

Unpack the word embeddings into a dict of word -> embedding vector.

In [None]:
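word2vec = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
  for line in f:
    fields = line.split()
    # fields[0] is the word; the remaining fields are its vector components
    word2vec[fields[0]] = np.asarray(fields[1:], dtype='float32')
print('Loaded %d GloVe embeddings' % len(word2vec))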

Finally, create the embedding matrix using only those embeddings which represent one of the 10,000 words in our dataset.

In [None]:
embedding_matrix = np.zeros((vocab_size, 50), dtype=float)
for word, i in word_index.items():
  # Only the vocab_size most frequent words have a row in the matrix.
  if i < vocab_size and word in word2vec:
    embedding_matrix[i] = word2vec[word]
  # Words without a GloVe vector keep their all-zero row.
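
It can be useful to check how much of the vocabulary the pre-trained vectors actually cover. The sketch below does this on toy data; `build_embedding_matrix`, the toy word index, and the 2-dimensional toy vectors are all made up for illustration, and leaving uncovered words as zero rows is just one possible choice.

```python
import numpy as np

def build_embedding_matrix(word_to_index, word_to_vector, vocab_size, dim):
    # Rows default to zero; only in-vocabulary words with a vector are filled.
    matrix = np.zeros((vocab_size, dim), dtype=np.float32)
    found = 0
    for word, i in word_to_index.items():
        if i < vocab_size and word in word_to_vector:
            matrix[i] = word_to_vector[word]
            found += 1
    return matrix, found

toy_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2,
             "oil": 3, "prices": 4, "zzyzx": 5, "rare": 6}
toy_vectors = {
    "oil": np.array([0.1, 0.2], dtype=np.float32),
    "prices": np.array([0.3, 0.4], dtype=np.float32),
    "rare": np.array([0.5, 0.6], dtype=np.float32),  # index 6 is outside the vocabulary
}

matrix, found = build_embedding_matrix(toy_index, toy_vectors, vocab_size=6, dim=2)
print("Covered {} of {} rows".format(found, matrix.shape[0]))
```

Rows that stay at zero are still trainable once the matrix is loaded into an `Embedding` layer, so low coverage slows learning rather than blocking it.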

## Build the model

The neural network is created by stacking layers—this requires two main architectural decisions:

* How many layers to use in the model?
* How many *hidden units* to use for each layer?

In this example, the input data consists of an array of word-indices. The labels to predict are single word-indices. Let's build a model for this problem:

In [None]:
# The Embedding layer maps each of the vocab_size (10,000) word indices to a 50-d vector
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 50, weights=[embedding_matrix]))
model.add(keras.layers.Reshape([50*N]))
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(vocab_size, activation=tf.nn.softmax))

model.summary()

The layers are stacked sequentially to build the classifier:

1. The first layer is an `Embedding` layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are initialized from the GloVe embedding matrix and are fine-tuned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: `(batch, N, embedding)`.
2. Next, a `Reshape` layer flattens each (N, 50) dimension N-gram matrix into a (N\*50) dimension vector. This allows the model to process the entire N-gram in the same fully-connected layer, preserving sequential information.
3. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.
4. The last layer is densely connected with 10,000 output nodes, one for each word in the dictionary. Using the `softmax` activation function, this produces a probability distribution over the *next* word following the input N-gram.
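
The four steps above can be traced with plain NumPy to see how the shapes change. This is only a sketch of the forward pass with random, untrained weights; the variable names and the batch size of 4 are made up for illustration, and it does not reproduce Keras's initialization or training.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n, emb_dim, hidden, vocab = 4, 5, 50, 16, 10000

# Random stand-ins for the learned parameters of each layer.
E = rng.normal(size=(vocab, emb_dim))                 # embedding table
W1 = rng.normal(size=(n * emb_dim, hidden)) * 0.01    # hidden Dense weights
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, vocab)) * 0.01          # output Dense weights
b2 = np.zeros(vocab)

ngrams = rng.integers(0, vocab, size=(batch, n))      # (batch, N) word indices

embedded = E[ngrams]                                  # 1. lookup  -> (batch, N, 50)
flat = embedded.reshape(batch, n * emb_dim)           # 2. reshape -> (batch, N*50)
h = np.maximum(flat @ W1 + b1, 0.0)                   # 3. ReLU    -> (batch, 16)
logits = h @ W2 + b2                                  # 4. scores  -> (batch, vocab)
logits -= logits.max(axis=1, keepdims=True)           # stabilize the softmax
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

for name, arr in [("embedded", embedded), ("flat", flat), ("h", h), ("probs", probs)]:
    print(name, arr.shape)
```

Each row of `probs` sums to 1 and assigns one probability to every word in the vocabulary.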

### Hidden units

The above model has two intermediate or "hidden" layers between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space for the layer; in other words, it is the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, it makes the network more computationally expensive and may lead to learning unwanted patterns—patterns that improve performance on training data but not on the test data. This is called *overfitting*.
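
To see what "more computationally expensive" means here, we can count the trainable parameters by hand. The sketch below assumes the layer sizes used in this notebook (N=5, 50-dimensional embeddings, 16 hidden units, 10,000 words); the helper names are made up for illustration.

```python
def dense_params(inputs, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return inputs * units + units

def model_params(n=5, vocab=10000, emb_dim=50, hidden=16):
    embedding = vocab * emb_dim                        # one vector per word
    hidden_layer = dense_params(n * emb_dim, hidden)   # flattened N-gram -> hidden
    output_layer = dense_params(hidden, vocab)         # hidden -> one score per word
    return embedding + hidden_layer + output_layer

print(model_params())           # 674016
print(model_params(hidden=32))  # 838032
```

Doubling the hidden units adds about 164,000 parameters, almost all of them in the output layer, because every hidden unit connects to all 10,000 output nodes.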

### Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a categorical classification problem and the model outputs a probability distribution (a vector of values in [0,1] that sum to 1), we'll use the categorical cross-entropy loss function. Specifically, we use the `sparse_categorical_crossentropy` loss function, which takes integer labels directly and so avoids instantiating a 10,000-dimension one-hot vector for each label.

This isn't the only choice for a loss function; you could, for instance, choose `mean_squared_error`. But, generally, cross-entropy is better for dealing with probabilities: it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.

We can evaluate our model using [Perplexity](https://en.wikipedia.org/wiki/Perplexity) and top-5 accuracy. Perplexity measures how likely our model thought the actual next word was. Lower values are better.
Top-5 accuracy is the percentage of N-grams for which the next word is among the top-5 most likely according to the model.
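
Both metrics can be worked through by hand on a tiny example. The sketch below uses plain NumPy rather than Keras, with made-up probabilities over a 4-word vocabulary. It takes the cross-entropy with the natural log, so perplexity is its exponential, and it reports a single corpus-level perplexity (the exponential of the mean cross-entropy).

```python
import numpy as np

def cross_entropy(probs, labels):
    # Negative log-probability the model assigned to each true next word.
    return -np.log(probs[np.arange(len(labels)), labels])

def perplexity(probs, labels):
    # Exponential of the average cross-entropy (natural log, so base e).
    return float(np.exp(cross_entropy(probs, labels).mean()))

def top_k_accuracy(probs, labels, k=5):
    # Indices of the k most probable words for each example.
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, top_k)]))

probs = np.array([[0.5, 0.3, 0.1, 0.1],
                  [0.1, 0.2, 0.3, 0.4]])
labels = np.array([1, 0])  # true next words

print("perplexity:", perplexity(probs, labels))
print("top-2 accuracy:", top_k_accuracy(probs, labels, k=2))

# Sanity check: a uniform guess over V words has perplexity V.
uniform = np.full((3, 8), 1.0 / 8)
labels_u = np.array([0, 3, 7])
print("uniform perplexity:", perplexity(uniform, labels_u))
```

The uniform case gives a useful reference point: a model that knows nothing about a 10,000-word vocabulary would have a perplexity of 10,000.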

Now, configure the model to use an optimizer and a loss function:

In [None]:
from tensorflow.keras import backend as K

def perplexity(y_true, y_pred):
  # Keras computes cross-entropy with the natural log,
  # so perplexity is exp(cross-entropy).
  cross_entropy = K.sparse_categorical_crossentropy(y_true, y_pred)
  return K.exp(cross_entropy)


model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=[perplexity, 'sparse_top_k_categorical_accuracy'])

## Create a validation set

When training, we want to check the quality of the model on data it hasn't seen before. Create a *validation set* by setting apart 10,000 examples from the original training data. (Why not use the testing set now? Our goal is to develop and tune our model using only the training data, then use the test data just once to evaluate our model).

In [None]:
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
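
Because consecutive N-grams come from the same newswire, the first 10,000 rows cover only a small number of newswires. One alternative, sketched below on toy arrays, is to shuffle before splitting. The helper `shuffled_split` is made up for illustration and is not what this notebook uses; note also that with shuffling, overlapping N-grams from the same newswire can land on both sides of the split.

```python
import numpy as np

def shuffled_split(x, y, num_val, seed=0):
    # Draw one random ordering and apply it to inputs and labels alike,
    # so each N-gram stays paired with its label.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    val_idx, train_idx = order[:num_val], order[num_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])

# Toy data: row i of toy_x is [2*i, 2*i + 1] and its label is i.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

(train_x, train_y), (val_x, val_y) = shuffled_split(toy_x, toy_y, num_val=3)
print(train_x.shape, val_x.shape)
```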

## Train the model

Train the model for 40 epochs in mini-batches of 512 samples. This is 40 iterations over all samples in the `partial_x_train` and `partial_y_train` tensors. While training, monitor the model's loss and perplexity on the 10,000 samples from the validation set:

In [None]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

## Evaluate the model

Let's see how the model performs. Three values will be returned: the loss (a number representing our error; lower values are better), the perplexity, and accuracy@5.

In [None]:
results = model.evaluate(test_data, test_labels)

print('\n'.join('%s: %.4f'%t for t in zip(['loss', 'perplexity', 'acc@5'], results)))

This fairly naive approach achieves an accuracy@5 of about 39%. More advanced approaches should do considerably better.


## Create a graph of accuracy and loss over time

`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training:

In [None]:
history_dict = history.history
history_dict.keys()

There are six entries: one for each monitored metric during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy:

In [None]:
import matplotlib.pyplot as plt

perp = history_dict['perplexity']
val_perp = history_dict['val_perplexity']
top5acc = history_dict['sparse_top_k_categorical_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
val_top5acc = history_dict['val_sparse_top_k_categorical_accuracy']

epochs = range(1, len(perp) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [None]:
plt.clf()   # clear figure

plt.plot(epochs, perp, 'bo', label='Training perplexity')
plt.plot(epochs, val_perp, 'b', label='Validation perplexity')
plt.title('Training and validation perplexity')
plt.xlabel('Epochs')
plt.ylabel('Perplexity')
plt.legend()

plt.show()

In [None]:
plt.clf()   # clear figure

plt.plot(epochs, top5acc, 'bo', label='Training top5acc')
plt.plot(epochs, val_top5acc, 'b', label='Validation top5acc')
plt.title('Training and validation top5acc')
plt.xlabel('Epochs')
plt.ylabel('top5acc')
plt.legend()

plt.show()


In these plots, the dots represent the training loss and metrics, and the solid lines are the validation loss and metrics.

Notice the training loss *decreases* with each epoch and the training accuracy *increases* with each epoch. This is expected when using a gradient descent optimizer, which should reduce the loss on every iteration.

This isn't the case for the validation loss and accuracy, which seem to stop improving after about twenty and seven epochs, respectively. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations *specific* to the training data that do not *generalize* to test data.

For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs.
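
Rather than picking the stopping epoch by eye, the decision can be automated with a "patience" rule: stop once the validation loss has not improved for a set number of epochs. The sketch below implements that rule in plain Python on made-up loss values; `early_stopping_epoch` is a hypothetical helper. Keras offers a built-in version of this idea as the `keras.callbacks.EarlyStopping` callback, which can be passed to `model.fit`.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both counted from 1."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and keep training.
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses that improve and then drift upward.
toy_val_loss = [6.0, 5.5, 5.2, 5.1, 5.15, 5.2, 5.3, 5.4]
stop, best = early_stopping_epoch(toy_val_loss, patience=3)
print("stop after epoch {}, best epoch was {}".format(stop, best))
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra epochs of training.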

Interestingly, while the loss does decrease initially for the validation set, the perplexity continuously increases. Can you think of a reason why this might be?