# Material
* Primarily from
    * Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. Aurélien Géron.
    * Book 2nd Edition, Code 3rd Edition
    * https://github.com/ageron/handson-ml3.git
* Some examples from Andrew Ng and Laurence Moroney via Coursera/DeepLearning.AI.

# Session's Content
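* Character-level RNN trained on Shakespeare's text to predict the next character and generate new text
* Sentiment analysis of IMDb movie reviews with an RNN
* Pre-trained word embeddings vs pre-trained models
* Very brief introduction to Transformers and sentence embeddings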

# Common imports and directory setup

Make sure a recent enough version of Python is installed.

In [1]:
import sys

assert sys.version_info >= (3, 7)

Import TensorFlow and make sure the version is recent enough.

In [2]:
from packaging import version
import tensorflow as tf
from tensorflow import keras
import numpy as np

assert version.parse(tf.__version__) >= version.parse("2.8.0")


Import and configure matplotlib

In [3]:
import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

Set up the directory where plots are saved.

In [4]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "nlp"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

Issue a warning if no GPU is available, since training RNNs is very computationally expensive.

In [5]:
if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. Neural nets can be very slow without a GPU.")
    if "google.colab" in sys.modules:
        print("Go to Runtime > Change runtime and select a GPU hardware "
              "accelerator.")
    if "kaggle_secrets" in sys.modules:
        print("Go to Settings > Accelerator and select GPU.")

No GPU was detected. Neural nets can be very slow without a GPU.




# Read Dataset

We use Shakespeare's works as the dataset: the cell below downloads the text file from the book's shortcut URL and reads it into a single string. The goal is to train a character-level RNN that predicts the next character, so the raw text is all we need.

In [6]:
shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [7]:
# extra code – shows a short text sample
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


In [8]:
# extra code – shows all 39 distinct characters (after converting to lower case)
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [9]:
#vocab = [x for x in " abcdefghijklmnopqrstuvwxyz0123456789?:;#,-"]
#text_vec_layer = tf.keras.layers.TextVectorization(split="whitespace", standardize="lower")
text_vec_layer = tf.keras.layers.TextVectorization(split="character", standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]



In [10]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

In [11]:
n_tokens

39

In [12]:
dataset_size

1115394

### Utility function to convert a long sequence of character IDs into a dataset of input/target window pairs
* It takes a sequence as input (i.e., the encoded text), and creates a dataset containing all the windows of the desired length.

* It increases the length by one, since we need the next character for the target.

* Then, it shuffles the windows (optionally), batches them, splits them into input/output pairs, and activates prefetching.
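
As a rough illustration of the steps listed above, the following pure-Python sketch builds the same kind of input/target pairs without `tf.data`. The helper name `make_window_pairs` is hypothetical and exists only for this example; shuffling, batching and prefetching are left out.

```python
def make_window_pairs(sequence, length):
    """Return (input, target) pairs from all windows of `length + 1` items.

    Hypothetical helper for illustration only; it mirrors the windowing idea,
    not the tf.data implementation.
    """
    pairs = []
    # Slide a window of length + 1 over the sequence, one step at a time.
    # Incomplete windows at the end are never produced.
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        # The input is everything but the last item; the target is the same
        # window shifted one position to the right.
        pairs.append((window[:-1], window[1:]))
    return pairs


ids = [4, 5, 2, 23, 3]  # the IDs this notebook's text_vec_layer gives "To be"
for inputs, targets in make_window_pairs(ids, length=4):
    print(inputs, "->", targets)  # [4, 5, 2, 23] -> [5, 2, 23, 3]
```

Each target is simply the input shifted by one character, which is why the window length is increased by one.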

## Data Pre-processing
<img src="images/dse/ch16_fig16_1_data_prep.png"
     width="50%"
     alt="Shakespeare Data Preprocessing"
     style="float: left; margin-right: 10px;" />

In [13]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

In [14]:
text_vec_layer(["to be or not to"])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=array([[ 4,  5,  2, 23,  3,  2,  5, 10,  2, 11,  5,  4,  2,  4,  5]])>

In [15]:
# extra code – a simple example using to_dataset()
# There's just one sample in this dataset: the input represents "to b" and the
# output represents "o be"
list(to_dataset(text_vec_layer(["To be"])[0], length=4))

[(<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 4,  5,  2, 23]])>,
  <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 5,  2, 23,  3]])>)]

In [16]:
length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)

In [17]:
"""
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])
model.summary()

model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10,
                    callbacks=[model_ckpt])
"""


In [18]:
#shakespeare_model = tf.keras.Sequential([
#    text_vec_layer,
#    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
#    model
#])

Training this model takes a long time, so the training cell above is disabled (it is wrapped in a string). Instead, the following code downloads a model pretrained by the book's author and loads it as `shakespeare_model`.

In [19]:
# extra code – downloads a pretrained model
url = "https://github.com/ageron/data/raw/main/shakespeare_model.tgz"
path = tf.keras.utils.get_file("shakespeare_model.tgz", url, extract=True)
model_path = Path(path).with_name("shakespeare_model")
shakespeare_model = tf.keras.models.load_model(model_path)

In [20]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

# Generating Fake Shakespearean Text

In [21]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 1]])>

In [22]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]
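
To make the effect of `temperature` concrete, here is a small pure-Python sketch (not part of the original notebook) that rescales log-probabilities the same way `next_char()` does and then renormalizes them with a softmax. The helper name `apply_temperature` is hypothetical. `tf.random.categorical` treats its input as unnormalized log-probabilities, so it applies this normalization implicitly when sampling.

```python
import math


def apply_temperature(probas, temperature):
    """Rescale a probability distribution by a temperature (illustrative helper)."""
    # Divide the log-probabilities by the temperature, as next_char() does.
    logits = [math.log(p) / temperature for p in probas]
    # Softmax (with max subtraction for numerical stability) turns the
    # rescaled logits back into probabilities.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probas = [0.5, 0.4, 0.1]
for temperature in (0.01, 1, 100):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature of 1 leaves the distribution unchanged. A very low temperature puts almost all the probability on the most likely character, and a very high temperature makes all characters nearly equally likely.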

In [23]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [24]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU

In [25]:
print(extend_text("To be or not to be", n_chars=100, temperature=0.01))

To be or not to be the duke
as it is a proper strange death,
and then the sea to the death, and the duke and the death


In [None]:
print(extend_text("To be or not to be", temperature=1))

In [None]:
print(extend_text("To be or not to be", temperature=100))

# Sentiment Analysis

Generating text can be fun and instructive, but in real-life projects, one of the most common applications of NLP is text classification - especially sentiment analysis. The IMDb dataset consists of 50,000 movie reviews in English (25,000 for training, 25,000 for testing) extracted from the famous Internet Movie Database, along with a simple binary target for each review indicating whether it is negative (0) or positive (1).

Let’s load the IMDb dataset using the TensorFlow Datasets library. We’ll use the first 90% of the training set for training, and the remaining 10% for validation:

In [26]:
import tensorflow_datasets as tfds

raw_train_set, raw_valid_set, raw_test_set = tfds.load(
    name="imdb_reviews",
    split=["train[:90%]", "train[90%:]", "test"],
    as_supervised=True
)
tf.random.set_seed(42)
train_set = raw_train_set.shuffle(5000, seed=42).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)
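
As a quick sanity check on the split above, the arithmetic below works out how many reviews and batches each subset contains. It assumes the 25,000 training and 25,000 test reviews mentioned earlier and the batch size of 32 used in the cell; the variable names are illustrative.

```python
import math

n_train_full = 25_000   # size of the IMDb "train" split
n_test = 25_000         # size of the IMDb "test" split
batch_size = 32

n_train = n_train_full * 90 // 100   # "train[:90%]"
n_valid = n_train_full - n_train     # "train[90%:]"

# The final batch may be smaller than batch_size, hence the ceiling.
train_batches = math.ceil(n_train / batch_size)
valid_batches = math.ceil(n_valid / batch_size)
test_batches = math.ceil(n_test / batch_size)

print(n_train, n_valid)                              # 22500 2500
print(train_batches, valid_batches, test_batches)    # 704 79 782
```

The 704 training batches match the number of steps per epoch that Keras reports during training.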



Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to ~/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


View a few of the movie reviews (can you predict the sentiment?).

In [27]:
for review, label in raw_train_set.take(4):
    print(review.numpy().decode("utf-8")[:200], "...")
    print("Label:", label.numpy())

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Moun ...
Label: 0
This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful perf ...
Label: 1




Since the reviews are in English, using spaces as token boundaries should be good enough. So let’s create a TextVectorization layer and adapt it to the training set. We will limit the vocabulary to 1,000 tokens: the 998 most frequent words plus a padding token and a token for unknown words. Very rare words are unlikely to matter for this task, and limiting the vocabulary size reduces the number of parameters the model needs to learn.

In [37]:
vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))
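
To illustrate what limiting the vocabulary means, here is a simplified pure-Python sketch of the idea (it is not how `TextVectorization` is implemented, and it only lowercases the text rather than also stripping punctuation). It keeps the most frequent words, reserves ID 0 for padding and ID 1 for unknown words, and maps every other word to the unknown ID. The function names and the tiny corpus are made up for the example.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Map the most frequent words to IDs; 0 = padding, 1 = unknown."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Two slots are reserved, so only max_tokens - 2 real words are kept.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}


def encode(text, vocab):
    """Encode a text as word IDs, using 1 for out-of-vocabulary words."""
    return [vocab.get(word, 1) for word in text.lower().split()]


corpus = ["great movie great cast", "great plot", "terrible movie"]
vocab = build_vocab(corpus, max_tokens=4)       # keeps only 2 real words
print(vocab)                                    # {'great': 2, 'movie': 3}
print(encode("great movie terrible", vocab))    # [2, 3, 1]
```

With `max_tokens=1000`, the same logic keeps 998 real words, which is where the figure in the text comes from.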

Now train the model.

In [38]:
embed_size = 128
tf.random.set_seed(42)
model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Embedding(vocab_size, embed_size),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=2)


Epoch 1/2
Epoch 2/2


In [39]:
model.save("imdb_sentiment_modelA", save_format="tf")  # "tf" = SavedModel; "json" is not a valid save format
model.summary()
#model = tf.keras.models.load_model("imdb_sentiment_modelA")

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_2 (TextV  (None, None)             0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, None, 128)         128000    
                                                                 
 gru_1 (GRU)                 (None, 128)               99072     
                                                                 
 dense_3 (Dense)             (None, 1)                 129       
                                                                 
Total params: 227,201
Trainable params: 227,201
Non-trainable params: 0
_________________________________________________________________
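
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default of `reset_after=True`, which gives each of the three weight blocks (update gate, reset gate, candidate state) two bias vectors. The variable names mirror those used in the model cell.

```python
vocab_size = 1000
embed_size = 128
units = 128

# Embedding: one vector of size embed_size per token ID.
embedding_params = vocab_size * embed_size

# GRU: 3 weight blocks, each with input weights, recurrent weights and
# (assuming reset_after=True) two bias vectors.
gru_params = 3 * (embed_size * units + units * units + 2 * units)

# Dense(1): one weight per GRU unit plus one bias.
dense_params = units * 1 + 1

print(embedding_params, gru_params, dense_params)      # 128000 99072 129
print(embedding_params + gru_params + dense_params)    # 227201
```

More than half of the parameters sit in the embedding layer, which is one reason limiting the vocabulary size keeps the model small.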


Let's use the model to predict the sentiment of some test set instances. We could do better by using masking: the reviews in a batch are padded so that all inputs have the same length, and masking lets the model ignore the padding tokens.

In [40]:
for review, label in raw_test_set.take(10):
    instance = review.numpy().decode("utf-8")[:200]
    print(instance, "...")
    y_pred = model.predict( [instance] )
    print("Label:", label.numpy())
    print("Predict:", y_pred)

There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING ...
Label: 1
Predict: [[0.5198082]]
A blackly comic tale of a down-trodden priest, Nazarin showcases the economy that Luis Bunuel was able to achieve in being able to tell a deeply humanist fable with a minimum of fuss. As an output fro ...
Label: 1
Predict: [[0.5354872]]
Scary Movie 1-4, Epic Movie, Date Movie, Meet the Spartans, Not another Teen Movie and Another Gay Movie. Making "Superhero Movie" the eleventh in a series that single handily ruined the parody genre. ...
Label: 0
Predict: [[0.49090126]]
Poor Shirley MacLaine tries hard to lend some gravitas to this mawkish, gag-inducing "feel-good" movie, but she's trampled by the run-away sentimentality of a film that's not the least bit grounded in ...
Label: 0
Predict: [[0.530309]]
As a former Erasmus student I enjoyed this film very
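
The padding and masking mentioned above can be illustrated without TensorFlow. In the sketch below (hypothetical helper names, made-up token IDs), shorter sequences are padded with 0 so that the batch is rectangular, and a boolean mask records which positions hold real tokens. In Keras, passing `mask_zero=True` to the `Embedding` layer asks downstream recurrent layers to skip the padded positions; the model above does not use that option.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def padding_mask(padded, pad_id=0):
    """True where a real token is present, False where there is padding."""
    return [[token != pad_id for token in seq] for seq in padded]


batch = [[12, 7, 3], [5], [9, 4]]   # made-up token IDs of three short reviews
padded = pad_batch(batch)
mask = padding_mask(padded)
print(padded)  # [[12, 7, 3], [5, 0, 0], [9, 4, 0]]
print(mask)    # [[True, True, True], [True, False, False], [True, True, False]]
```

Without a mask, the GRU also processes the trailing zeros, so for short reviews its final state is dominated by padding rather than by the review's words.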



## Pre-trained Word Embeddings vs Pre-trained Models

Imagine how good the embeddings would be if we had billions of reviews - perhaps we can reuse word embeddings trained on some other (very) large text corpus, even if it is not composed of movie reviews? After all, the word “amazing” generally has the same meaning whether you use it to talk about movies or anything else. Moreover, perhaps embeddings would be useful for sentiment analysis even if they were trained on another task: since words like “awesome” and “amazing” have a similar meaning, they will likely cluster in the embedding space even for tasks such as predicting the next word in a sentence. If all positive words and all negative words form clusters, then this will be helpful for sentiment analysis. So, instead of training word embeddings, we could just download and use pretrained embeddings, such as Google’s Word2vec embeddings, Stanford’s GloVe embeddings, or Facebook’s FastText embeddings.

Using pretrained word embeddings was popular for several years, but this approach has its limits. In particular, a word has a single representation, no matter the context. For example, the word “right” is encoded the same way in “left and right” and “right and wrong”, even though it means two very different things. To address this limitation, a 2018 paper by Matthew Peters introduced Embeddings from Language Models (ELMo): these are contextualized word embeddings learned from the internal states of a deep bidirectional language model. Instead of just using pretrained embeddings in your model, you reuse part of a pretrained language model.

For example, let’s build a classifier based on the Universal Sentence Encoder, a model architecture introduced in a 2018 paper by a team of Google researchers. This model is based on the transformer architecture, which we will look at later in this session. Conveniently, the model is available on TensorFlow Hub.

This model is quite large—close to 1 GB in size, so it may take a while to download. By default, TensorFlow Hub modules are saved to a temporary directory, and they get downloaded again and again every time you run your program. To avoid that, you must set the TFHUB_CACHE_DIR environment variable to a directory of your choice: the modules will then be saved there, and only downloaded once.

In [36]:
import os
import tensorflow_hub as hub

os.environ["TFHUB_CACHE_DIR"] = "my_tfhub_cache"
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU
model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                   trainable=True, dtype=tf.string, input_shape=[]),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit(train_set, validation_data=valid_set, epochs=10)





Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
 82/704 [==>...........................] - ETA: 17:56 - loss: 0.0026 - accuracy: 0.9996

KeyboardInterrupt: 

Fine-tuning a pre-trained model on our IMDb dataset produces impressive results: in the interrupted run above, training accuracy had already reached 0.9996 partway through epoch 4.

In [None]:
for review, label in raw_test_set.take(10):
    instance = review.numpy().decode("utf-8")[:200]
    print(instance, "...")
    y_pred = model.predict( [instance] )
    print("Label:", label.numpy())
    print("Predict:", y_pred)

# Transformers

* "Attention is all you need"
    * Vectors are dense, provide contextual information
    * No RNNs!!!  Transformers much faster to train than RNNs
* Sentence Embeddings
    * Take word embeddings and average ("mean pool")
* Large amounts of text and computation required
    * Reuse weights!  Hugging Face
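
The "mean pool" bullet can be made concrete with a tiny example. The sketch below averages per-word vectors into one sentence vector. The 3-dimensional vectors are invented for illustration; real models use hundreds of dimensions and contextual token embeddings.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    n_words = len(word_vectors)
    # zip(*...) groups the values of each dimension across all words.
    return [sum(dim_values) / n_words for dim_values in zip(*word_vectors)]


# Invented 3-dimensional embeddings for the words of a two-word sentence.
word_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
print(mean_pool(word_vectors))  # [2.0, 2.0, 1.0]
```

The result has the same dimensionality as a single word vector, whatever the sentence length, which is what makes sentences of different lengths directly comparable.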

In [42]:
!pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util
sentences = ["I'm sad", "I'm full of happiness"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Compute the embedding of each sentence
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)

util.pytorch_cos_sim(embedding_1, embedding_2)


tensor([[0.3571]])
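
The score printed above is a cosine similarity: the dot product of the two embeddings divided by the product of their norms, so it ranges from -1 to 1. A minimal pure-Python sketch, using short made-up vectors instead of the real sentence embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

Because the measure depends only on direction, not on vector length, it is a common way to compare sentence embeddings.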