# Language Modelling (Natural Language Processing)

Natural Language Processing (NLP) is a vast subject with many specializations. Here we discuss language modelling, the topic that gave rise to groundbreaking models like BERT, which changed the NLP landscape dramatically. Language modelling is an unsupervised training method in which you ask a model to predict the next character/word/sentence given the previous characters/words/sentences.


<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch10-NLP-with-TF2-Language-Modelling/10.1.Language_modelling.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>



In [1]:
import tensorflow as tf
import requests
import zipfile
import os
import time
import pandas as pd
import random
import shutil
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models
from tensorflow.keras.losses import CategoricalCrossentropy
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import EarlyStopping, CSVLogger
import numpy as np
from PIL import Image
import pickle
from tensorflow.keras.models import load_model, Model
from PIL.PngImagePlugin import PngImageFile
import matplotlib.pyplot as plt
import glob
from functools import partial
import nltk

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
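    except (RuntimeError, ValueError) as e:
        # Memory growth must be set before the GPUs have been initialized
        print("Couldn't set memory_growth: {}".format(e))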
    
    
def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")

# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)

print("TensorFlow version: {}".format(tf.__version__))

TensorFlow version: 2.9.2


## Download the bAbI children's story dataset

For this task, we'll be using a popular children's story dataset (the Children's Book Test, CBT) from the [bAbI project](https://research.fb.com/downloads/babi/) of Facebook.

The code below works in three steps:

1. If the tgz file containing the data has not been downloaded, download the data.
2. Write the downloaded data to the disk.
3. If the tgz file is available but has not been extracted, extract it to the given directory.

In [5]:
# Section 10.1

# Code listing 10.1

# Downloading the data
# http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz

import os
import tarfile

import requests

data_dir = os.path.join('data', 'lm')
tgz_path = os.path.join(data_dir, 'CBTest.tgz')

# Retrieve the data
if not os.path.exists(tgz_path):
    url = "http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz"
    # Get the file from web
    r = requests.get(url)
    # Fail early on an HTTP error instead of saving an error page
    r.raise_for_status()

    os.makedirs(data_dir, exist_ok=True)
    
    # Write to a file
    with open(tgz_path, 'wb') as f:
        f.write(r.content)
          
else:
    print("The tar file already exists.")
    
if not os.path.exists(os.path.join(data_dir, 'CBTest')):
    # Extract the archive (the context manager closes the file afterwards)
    with tarfile.open(tgz_path) as tarf:
        tarf.extractall(data_dir)
else:
    print("The extracted data already exists")





## Read the data

After downloading the data, let's read it into memory. There are three sets: training, validation and test.

In [3]:
# Code listing 10.2

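def read_data(path):
    """ Read a CBT file and return a list with one lower-cased string per story """
    stories = []

    with open(path, 'r') as f:    
        s = [] 
        for row in f:
            
            # A _BOOK_TITLE_ row marks the start of a new story,
            # so store the story collected so far
            if row.startswith("_BOOK_TITLE_"):
                if len(s)>0:
                    stories.append(' '.join(s).lower())            
                s = []           

            s.append(row)
            
    # Store the last story
    if len(s)>0:
        stories.append(' '.join(s).lower())  
    
    return stories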

stories = read_data(os.path.join('data','lm','CBTest','data','cbt_train.txt'))
val_stories = read_data(os.path.join('data','lm','CBTest','data','cbt_valid.txt'))
test_stories = read_data(os.path.join('data','lm','CBTest','data','cbt_test.txt'))

In [4]:
print("Collected {} stories (train)".format(len(stories)))
print("Collected {} stories (valid)".format(len(val_stories)))
print("Collected {} stories (test)".format(len(test_stories)))
print(stories[0][:100])
print('\n', stories[10][:100])


Collected 98 stories (train)
Collected 5 stories (valid)
Collected 5 stories (test)
_book_title_ : andrew_lang___prince_prigio.txt.out
 chapter i. -lcb- chapter heading picture : p1.jp

 _book_title_ : andrew_lang___the_violet_fairy_book.txt.out
 a tale of the tontlawald long , long ago


In [5]:
for i in range(10):
    print(len(stories[i]))


99464
99832
136758
761257
524783
522998
528840
531058
527598
674648


## Quick drive to vocabulary-ville

In [6]:
# Section 10.1

from collections import Counter
# Create a large list which contains all the words in all the stories
data_list = [w for doc in stories for w in doc.split(' ')]

# Create a Counter object from that list
# Counter returns a dictionary, where key is a word and the value is the frequency
cnt = Counter(data_list)

# Convert the result to a pd.Series 
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)

# Print most common words
print(freq_df.head(n=10))

# Count of words that appear at least n times
n=10
print("\nVocabulary size (>={} frequent): {}".format(n, (freq_df>=n).sum()))

,      348650
the    242890
.\n    192549
and    179205
to     120821
a      101990
of      96748
i       79780
he      78129
was     66593
dtype: int64

Vocabulary size (>=10 frequent): 14473


## Convert strings to n-grams

For our language modelling task, we're going to split strings into character bigrams, that is, non-overlapping pairs of consecutive characters. For example, the string

`i went to the office` is converted to

`["i ", "we", "nt", " t", "o ", "th", "e ", "of", "fi", "ce"]`

We will also look at the most common bigrams and some summary statistics.

In [7]:
from itertools import chain
from collections import Counter

def get_ngrams(text, n):
    """ This function takes a given string and splits it into non-overlapping n-grams of the desired size """
    return [text[i:i+n] for i in range(0,len(text),n)]

# Test the ngrams function with a variety of ngrams
test_string = "I like chocolates"
print("Original: {}".format(test_string))
for i in list(range(3)):
    print("\t{}-grams: {}".format(i+1, get_ngrams(test_string, i+1)))

# Create a counter with the bi-grams
ngrams = 2

text = chain(*[get_ngrams(s, ngrams) for s in stories])
cnt = Counter(text)

# Create a pandas series with the counter results
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)
print("\nSample of most-common bigrams")
print(freq_df.head(n=10))
print("\nMedian: {}\n".format(freq_df.median()))
# Get summary statistics
print(freq_df.describe(percentiles=[0.25,0.5,0.75,0.9]))

Original: I like chocolates
	1-grams: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'c', 'h', 'o', 'c', 'o', 'l', 'a', 't', 'e', 's']
	2-grams: ['I ', 'li', 'ke', ' c', 'ho', 'co', 'la', 'te', 's']
	3-grams: ['I l', 'ike', ' ch', 'oco', 'lat', 'es']

Sample of most-common bigrams
e     455626
 t    344361
he    310227
d     309291
th    284237
 a    268358
t     257890
s     228249
 h    192591
 s    183193
dtype: int64

Median: 143.5

count      1070.000000
mean      12152.204673
std       36425.625033
min           1.000000
25%           5.000000
50%         143.500000
75%        6465.000000
90%       34195.200000
max      455626.000000
dtype: float64


## Get the size of the vocabulary

We will set the vocabulary size to the number of bigrams that appear at least 10 times in the data.

In [8]:
n_vocab = (freq_df>=10).sum()
print("Size of vocabulary: {}".format(n_vocab))

Size of vocabulary: 734


## Bi-grams to IDs: Defining a Keras tokenizer

Here, we're going to fit a tokenizer on the train data in order to convert bi-grams to IDs. The tokenizer will assign a specific ID to each unique bigram.
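To make the ID assignment concrete, here is a minimal pure-Python sketch of frequency-ranked IDs with an out-of-vocabulary token. It only loosely mirrors how the Keras `Tokenizer` is understood to behave (ID 1 reserved for the OOV token, more frequent tokens getting smaller IDs, IDs at or above `num_words` mapped to OOV). It is not the library's implementation, and the helper names `build_vocab` and `to_ids` are hypothetical.

```python
from collections import Counter


def build_vocab(token_lists, oov_token="unk"):
    """Assign IDs by descending frequency; ID 1 is reserved for the OOV token."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    word_index = {oov_token: 1}
    # most_common() orders by count; ties keep their first-seen order
    for tok, _ in counts.most_common():
        if tok != oov_token:
            word_index[tok] = len(word_index) + 1
    return word_index


def to_ids(tokens, word_index, num_words, oov_token="unk"):
    """Map tokens to IDs; unknown or too-rare tokens fall back to the OOV ID."""
    oov_id = word_index[oov_token]
    ids = []
    for tok in tokens:
        idx = word_index.get(tok)
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids


# Toy bigram lists (illustrative data, not taken from the stories)
token_lists = [["th", "e ", "ca", "t "], ["th", "e ", "do", "g "], ["th", "at"]]
word_index = build_vocab(token_lists)
example_ids = to_ids(["th", "e ", "zz", "g "], word_index, num_words=4)
print(word_index)
print(example_ids)
```

With `num_words=4` only the two most frequent bigrams keep their own IDs; the unseen bigram `"zz"` and the rare bigram `"g "` both collapse to the OOV ID.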

In [9]:
# Section 10.1

from tensorflow.keras.preprocessing.text import Tokenizer

# Define a tokenizer for the determined vocabulary size
tokenizer = Tokenizer(num_words=n_vocab, oov_token='unk', lower=False)

# Get ngrams in the training data
train_ngram_stories = [get_ngrams(s,ngrams) for s in stories]
# Fit the tokenizer
tokenizer.fit_on_texts(train_ngram_stories)

# Get the ID sequence for training data
train_data_seq = tokenizer.texts_to_sequences(train_ngram_stories)

# Get the ID sequence for validation data
val_ngram_stories = [get_ngrams(s,ngrams) for s in val_stories]
val_data_seq = tokenizer.texts_to_sequences(val_ngram_stories)

# Get the ID sequence for testing data
test_ngram_stories = [get_ngrams(s,ngrams) for s in test_stories]
test_data_seq = tokenizer.texts_to_sequences(test_ngram_stories)

## Let's look at some word ID sequences

In [10]:
for s, tokens, seq in zip(test_stories[:5], test_ngram_stories[:5], test_data_seq[:5]):
    print("Original: {}".format(s[:50]))
    print("n-grams: {}".format(tokens[:25]))
    print("Word ID sequence: {}".format(seq[:25]))
    print("\n")

Original: _book_title_ : andrew_lang___the_yellow_fairy_book
n-grams: ['_b', 'oo', 'k_', 'ti', 'tl', 'e_', ' :', ' a', 'nd', 're', 'w_', 'la', 'ng', '__', '_t', 'he', '_y', 'el', 'lo', 'w_', 'fa', 'ir', 'y_', 'bo', 'ok']
Word ID sequence: [549, 97, 554, 100, 175, 537, 325, 7, 22, 25, 707, 112, 37, 522, 594, 4, 1, 81, 107, 707, 170, 133, 586, 161, 194]


Original: _book_title_ : lewis_carroll___alice's_adventures_
n-grams: ['_b', 'oo', 'k_', 'ti', 'tl', 'e_', ' :', ' l', 'ew', 'is', '_c', 'ar', 'ro', 'll', '__', '_a', 'li', 'ce', "'s", '_a', 'dv', 'en', 'tu', 're', 's_']
Word ID sequence: [549, 97, 554, 100, 175, 537, 325, 53, 226, 49, 717, 52, 75, 54, 522, 1, 74, 120, 181, 1, 419, 39, 221, 25, 609]


Original: _book_title_ : lucy_maud_montgomery___lucy_maud_mo
n-grams: ['_b', 'oo', 'k_', 'ti', 'tl', 'e_', ' :', ' l', 'uc', 'y_', 'ma', 'ud', '_m', 'on', 'tg', 'om', 'er', 'y_', '__', 'lu', 'cy', '_m', 'au', 'd_', 'mo']
Word ID sequence: [549, 97, 554, 100, 175, 537, 325, 53, 227, 586, 10

## Defining the TensorFlow `tf.data` pipeline

Here we will define a `tf.data` pipeline to generate data for the model. In language modelling, data is generated as follows. Say you want to provide an `n`-element sequence as the input to the model. You take an `n+1` element sequence `text` and split it into two parts: `text[:-1]` is the input and `text[1:]` is the target, so the target at each position is the element that follows the corresponding input element. At any step of the implementation, you can check the specification of the dataset with `print(tf.data.DatasetSpec.from_value(ds))`.
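The windowing and the input/target split can be sketched in plain Python before involving `tf.data`. This is only an illustration of the idea: the helper `make_windows` is hypothetical, and it keeps full windows only, which is the effect `drop_remainder=True` has in the pipeline.

```python
def make_windows(ids, n_seq, shift=1):
    """Return (input, target) pairs built from full windows of n_seq + 1 IDs."""
    size = n_seq + 1
    pairs = []
    # Window starts are `shift` apart; incomplete windows at the end are skipped
    for start in range(0, len(ids) - size + 1, shift):
        window = ids[start:start + size]
        pairs.append((window[:-1], window[1:]))
    return pairs


# Toy ID sequence standing in for one tokenized story
ids = [10, 11, 12, 13, 14, 15, 16]
pairs = make_windows(ids, n_seq=3, shift=2)
for inp, tgt in pairs:
    print(inp, "->", tgt)
```

Each target is the input advanced by one position, so the model learns to predict the next ID at every time step.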

In [11]:
# Section 10.1

# Code listing 10.3
def get_tf_pipeline(data_seq, n_seq, batch_size=64, shift=1, shuffle=True):
    """ Define a tf.data pipeline that takes a set of sequences of text and 
    converts them to fixed-length sequences for the model """
    
    # Define a tf.data.Dataset from a ragged tensor created from data_seq
    # (ragged because the stories have different lengths)
    text_ds = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(data_seq))
    
    # If shuffle is set, shuffle the data (shuffle story order)
    if shuffle:
        text_ds = text_ds.shuffle(buffer_size=len(data_seq)//2)
    
    # For each story, create windows of n_seq+1 IDs that start `shift` IDs apart.
    # window() returns a nested dataset (each window is itself a dataset), so the
    # inner flat_map + batch turns every window back into a single tensor and
    # drops the incomplete windows at the end of a story.
    text_ds = text_ds.flat_map(
        lambda x: tf.data.Dataset.from_tensor_slices(
            x
        ).window(
            n_seq+1, shift=shift
        ).flat_map(
            lambda window: window.batch(n_seq+1, drop_remainder=True)
        )
    ) 
    
    # Shuffle the data (shuffle the order of n_seq+1 long sequences)
    if shuffle:
        text_ds = text_ds.shuffle(buffer_size=10*batch_size)
    
    # Batch the data
    text_ds = text_ds.batch(batch_size)
    
    # Split each sequence to an input and a target
    text_ds = text_ds.map(
        lambda x: (x[:, :-1], x[:, 1:])
    ).prefetch(buffer_size=tf.data.AUTOTUNE)
    
    return text_ds


## Look at some data

Here you can see that `a` is a tuple with two tensors: an input tensor and a target tensor. Each row of the target is the corresponding input row moved one step forward in the text: it drops the first ID of the input and appends the ID that follows the input.

In [12]:
ds = get_tf_pipeline(train_data_seq, 5, batch_size=6)

for a in ds.take(2):

    print(a)



(<tf.Tensor: shape=(6, 5), dtype=int32, numpy=
array([[ 15, 151,  85,  84,  30],
       [325,  53, 227, 586, 105],
       [175, 537, 325,  53, 227],
       [ 87,   6,   2,  72,  76],
       [  8,  49, 103,  22,  31],
       [ 31,   7,  22,  11, 280]], dtype=int32)>, <tf.Tensor: shape=(6, 5), dtype=int32, numpy=
array([[151,  85,  84,  30, 350],
       [ 53, 227, 586, 105, 298],
       [537, 325,  53, 227, 586],
       [  6,   2,  72,  76,  86],
       [ 49, 103,  22,  31,   7],
       [  7,  22,  11, 280,  63]], dtype=int32)>)
(<tf.Tensor: shape=(6, 5), dtype=int32, numpy=
array([[ 77,  23,  93, 205,  56],
       [586, 105, 298, 640,  43],
       [  2,  13, 122, 276, 488],
       [  6, 537,  49, 112,  22],
       [ 22,  31,   7,  22,  11],
       [100, 175, 537, 325,  53]], dtype=int32)>, <tf.Tensor: shape=(6, 5), dtype=int32, numpy=
array([[ 23,  93, 205,  56,  19],
       [105, 298, 640,  43, 573],
       [ 13, 122, 276, 488,  56],
       [537,  49, 112,  22, 584],
       [ 31,   7, 

## Print and save hyperparameters so far

In [13]:
print("n_grams uses n={}".format(ngrams))
print("Vocabulary size: {}".format(n_vocab))

n_seq=100
print("Sequence length for model: {}".format(n_seq))

# Make sure the output directory exists before saving
os.makedirs('models', exist_ok=True)
with open(os.path.join('models', 'text_hyperparams.pkl'), 'wb') as f:
    pickle.dump({'n_vocab': n_vocab, 'ngrams':ngrams, 'n_seq': n_seq}, f)

n_grams uses n=2
Vocabulary size: 734
Sequence length for model: 100


## Defining the model

Here we're going to define an embedding layer, a single GRU layer and two dense layers.

More on regularizing LSTM models: https://arxiv.org/pdf/1708.02182.pdf

In [14]:
# Section 10.2

# Code listing 10.4
K.clear_session()

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=n_vocab+1, output_dim=512, input_shape=(None,)),
    # Defining a GRU layer
    tf.keras.layers.GRU(1024, return_state=False, return_sequences=True),
    
    # Defining a Dense layer
    tf.keras.layers.Dense(512, activation='relu'),
    
    # Defining a final Dense layer and softmax activation
    tf.keras.layers.Dense(n_vocab, name='final_out'),
    tf.keras.layers.Activation(activation='softmax')
])

## Defining the Perplexity Metric

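Perplexity measures how surprised (or perplexed) the model was to see the $n^{th}$ word, given the sequence of the previous $n-1$ words. It is computed as the exponential of the average cross-entropy over the sequence, so a lower perplexity means the model assigned a higher probability to the words that actually came next.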
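For a sequence of $N$ targets, the per-sequence perplexity used in this notebook can be written as $\exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln p_i\right)$, where $p_i$ is the probability the model assigned to the true target at step $i$. The following is a small hand computation with the standard library only, using the same toy targets and predictions as the metric's test cell. The helper name `sequence_perplexity` is hypothetical.

```python
import math


def sequence_perplexity(true_ids, probs):
    """exp of the mean negative log-probability assigned to the true IDs."""
    nll = [-math.log(step_probs[t]) for t, step_probs in zip(true_ids, probs)]
    return math.exp(sum(nll) / len(nll))


true_ids = [0, 1, 2]
probs = [[0.9, 0.1, 0.0], [0.3, 0.7, 0.0], [0.0, 0.1, 0.9]]

# The true targets received probabilities 0.9, 0.7 and 0.9
ppl = sequence_perplexity(true_ids, probs)
print(round(ppl, 4))  # about 1.2082, in line with the metric's test output
```

A model that put probability 1 on every true target would score a perplexity of exactly 1, which is the lower bound.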

In [15]:
# Section 10.3

# Code listing 10.5
import tensorflow.keras.backend as K

# Inspired by https://gist.github.com/Gregorgeous/dbad1ec22efc250c76354d949a13cec3
class PerplexityMetric(tf.keras.metrics.Mean):
    
    def __init__(self, name='perplexity', **kwargs):
        super().__init__(name=name, **kwargs)
        # reduction='none' keeps one loss value per time step
        self.cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False, reduction='none')

    def _calculate_perplexity(self, real, pred):
        # Cross-entropy for every time step: shape (batch, time)
        # Note that no padding mask is applied here
        loss_ = self.cross_entropy(real, pred)
        
        # Average over the time steps of each sequence, then exponentiate
        # to get one perplexity value per sequence
        mean_loss = K.mean(loss_, axis=-1)
        perplexity = K.exp(mean_loss)
    
        return perplexity 

    def update_state(self, y_true, y_pred, sample_weight=None):            
        perplexity = self._calculate_perplexity(y_true, y_pred)
        # The parent Mean metric keeps a running average of the per-sequence perplexities
        super().update_state(perplexity)

## Test the Perplexity calculations

In [16]:
p = PerplexityMetric()
# Define a set of true targets
true = [[0, 1,2],[0, 1,2]]
# Define a set of predictions
pred = [[[0.9, 0.1, 0.0], [0.3, 0.7, 0.0], [0.0, 0.1, 0.9]],[[0.9, 0.1, 0.0], [0.3, 0.7, 0.0], [0.0, 0.1, 0.9]]]

# Compute perplexity
p.update_state(true, pred)
print(p.result())

tf.Tensor(1.2082006, shape=(), dtype=float32)


## Compiling the model

We will compile the model with the `sparse_categorical_crossentropy` loss, the `adam` optimizer, and the `accuracy` and `perplexity` metrics.

In [17]:
# Compile the model
model.compile(
    loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy', PerplexityMetric()]
)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 512)         376320    
                                                                 
 gru (GRU)                   (None, None, 1024)        4724736   
                                                                 
 dense (Dense)               (None, None, 512)         524800    
                                                                 
 final_out (Dense)           (None, None, 734)         376542    
                                                                 
 activation (Activation)     (None, None, 734)         0         
                                                                 
Total params: 6,002,398
Trainable params: 6,002,398
Non-trainable params: 0
_________________________________________________________________


## Training the model

Here we're going to train the model. To keep the training short, we will only use 50 of the 98 stories in the training set. We will also generate sequences at a shift of 25.
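The shift controls how many training windows each story yields, and so how long an epoch takes. As a rough sketch, the number of full windows for one story can be counted as below. The helper `count_full_windows` and the story length of 50,000 IDs are assumptions made for illustration only.

```python
def count_full_windows(seq_len, n_seq, shift):
    """Number of complete windows of n_seq + 1 IDs whose starts are `shift` apart."""
    size = n_seq + 1
    if seq_len < size:
        return 0
    return (seq_len - size) // shift + 1


story_len = 50_000  # hypothetical number of bigram IDs in one story
window_counts = {
    shift: count_full_windows(story_len, n_seq=100, shift=shift)
    for shift in (1, 25, 100)
}
print(window_counts)
```

A shift of 25 sits between the two extremes: consecutive windows still overlap, but the number of windows per story is roughly 25 times smaller than with a shift of 1.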

In [18]:
# Section 10.4

train_ds = get_tf_pipeline(train_data_seq[:50], n_seq, shift=25, batch_size=128)
valid_ds = get_tf_pipeline(val_data_seq, n_seq, shift=n_seq, batch_size=128)

os.makedirs('eval', exist_ok=True)

# Logging the performance metrics to a CSV file
csv_logger = tf.keras.callbacks.CSVLogger(os.path.join('eval','1_language_modelling.log'))

monitor_metric = 'val_perplexity'
mode = 'min' 
print("Using metric={} and mode={} for EarlyStopping".format(monitor_metric, mode))

# Alternative learning rate schedule (unused; see the commented-out callback below)
# This function keeps the initial learning rate for the first epoch
# and multiplies it by 0.1 in every epoch after that.
def scheduler(epoch, lr):  
    if epoch==0:
        return lr
    else:
        return lr * 0.1

#lr_callback = tf.keras.callbacks.LearningRateScheduler(scheduler)


lr_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor=monitor_metric, factor=0.1, patience=2, mode=mode, min_lr=1e-8
)

# EarlyStopping itself increases the memory requirement
# restore_best_weights will increase the memory req for large models
es_callback = tf.keras.callbacks.EarlyStopping(
    monitor=monitor_metric, patience=5, mode=mode, restore_best_weights=False
)

t1 = time.time()

model.fit(train_ds, epochs=50, 
          validation_data = valid_ds,
          callbacks=[es_callback, lr_callback, csv_logger])
t2 = time.time()

print("It took {} seconds to complete the training".format(t2-t1))

Using metric=val_perplexity and mode=min for EarlyStopping
Epoch 1/50




Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
It took 5223.01819396019 seconds to complete the training


## Evaluating the model (test)

In [19]:
batch_size = 128
test_ds = get_tf_pipeline(test_data_seq, n_seq, shift=n_seq, batch_size=batch_size)
model.evaluate(test_ds)



[2.2948660850524902, 0.4552602171897888, 11.214558601379395]

## Save the model

In [20]:
os.makedirs('models', exist_ok=True)
tf.keras.models.save_model(model, os.path.join('models', '2_gram_lm.h5'))

## Load model

In [21]:
with open(os.path.join('models', 'text_hyperparams.pkl'), 'rb') as f:
    hparams = pickle.load(f)

ngrams = hparams['ngrams']
n_vocab = hparams["n_vocab"]
n_seq = hparams["n_seq"]

model = tf.keras.models.load_model(os.path.join('models', '2_gram_lm.h5'), compile=False)
model.compile(
    loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy', PerplexityMetric()]
)

## Defining the inference model (Functional API)

Here, we're going to define an inference model. We need to define a new model that has the same weights as the original but different inputs and outputs. Essentially, we will define a model that takes an initial state (the hidden state of the GRU) as an extra input and outputs the new hidden state along with the final prediction.

This way we can recursively call our model on new predictions to generate a story for any number of steps.

In [22]:
# Section 10.5

# Code listing 10.6

# Define inputs to the model
inp = tf.keras.layers.Input(shape=(None,))
inp_state = tf.keras.layers.Input(shape=(1024,))

# Define embedding layer and output
emb_layer = tf.keras.layers.Embedding(input_dim=n_vocab+1, output_dim=512, input_shape=(None,))
emb_out = emb_layer(inp)

# Defining a GRU layer and output
gru_layer = tf.keras.layers.GRU(1024, return_state=True, return_sequences=True)
gru_out, gru_state = gru_layer(emb_out, initial_state=inp_state)

# Defining a Dense layer and output
dense_layer = tf.keras.layers.Dense(512, activation='relu')
dense_out = dense_layer(gru_out)

# Defining the final Dense layer and output
final_layer = tf.keras.layers.Dense(n_vocab, name='final_out')
final_out = final_layer(dense_out)
softmax_out = tf.keras.layers.Activation(activation='softmax')(final_out)

# Define final model
infer_model = tf.keras.models.Model(inputs=[inp, inp_state], outputs=[softmax_out, gru_state])

# Copy the weights from the original model
emb_layer.set_weights(model.get_layer('embedding').get_weights())
gru_layer.set_weights(model.get_layer('gru').get_weights())
dense_layer.set_weights(model.get_layer('dense').get_weights())
final_layer.set_weights(model.get_layer('final_out').get_weights())

# Summary
infer_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_1 (Embedding)        (None, None, 512)    376320      ['input_1[0][0]']                
                                                                                                  
 input_2 (InputLayer)           [(None, 1024)]       0           []                               
                                                                                                  
 gru_1 (GRU)                    [(None, None, 1024)  4724736     ['embedding_1[0][0]',            
                                , (None, 1024)]                   'input_2[0][0]']            

## Generating text with greedy decoding

Here we will generate text with the simplest approach we can think of. At time $t=1$, we start with a predefined sequence, and feed that to `infer_model`. At the end of the sequence we get $w_1$ (the prediction at $t=1$). $w_1$ will be the input to the model at $t=2$ and the model will generate $w_2$ and so on.
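The feed-the-prediction-back-in loop can be illustrated without a trained model. The sketch below replaces the network with a hypothetical lookup table of next-bigram probabilities (`NEXT_PROBS`, invented for this example) and always takes the most likely next bigram. The real model additionally carries the GRU state from step to step, which this toy does not model.

```python
# Hypothetical toy "model": next-bigram probabilities keyed by the current bigram
NEXT_PROBS = {
    "th": {"e ": 0.7, "at": 0.3},
    "e ": {"ca": 0.6, "th": 0.4},
    "ca": {"t ": 0.9, "th": 0.1},
    "t ": {"th": 0.8, "e ": 0.2},
    "at": {"th": 1.0},
}


def greedy_decode(start, steps):
    """Repeatedly append the most probable next token given the last one."""
    tokens = [start]
    for _ in range(steps):
        probs = NEXT_PROBS[tokens[-1]]
        # Greedy choice: the single highest-probability candidate
        tokens.append(max(probs, key=probs.get))
    return tokens


decoded = greedy_decode("th", steps=5)
print("".join(decoded))
```

Because the choice is deterministic, greedy decoding tends to fall into loops like the one above, which is one reason the generation code mixes in some randomness.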

In [23]:
# Section 10.5

text = get_ngrams(
    "CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank ,".lower(), 
    ngrams
)

seq = tokenizer.texts_to_sequences([text])

# build up model state using the given string
print("Making predictions from a {} element long input".format(len(seq[0])))

# Reset the state of the model initially
model.reset_states()
# Defining the initial state as all zeros
state = np.zeros(shape=(1,1024))
# Recursively update the model by assigning the new state to state
for c in seq[0]:    
    out, state = infer_model.predict([np.array([[c]]), state], verbose=0)

# Get final prediction after feeding the input string
wid = int(np.argmax(out[0],axis=-1).ravel()[0])
word = tokenizer.index_word[wid]
text.append(word)

# Define first input to generate text recursively from
x = np.array([[wid]])

# Code listing 10.7
for _ in range(500):
    
    # Get the next output and state
    out, state = infer_model.predict([x, state], verbose=0)
    
    # Get the word id and the word from out
    out_argsort = np.argsort(out[0], axis=-1).ravel()        
    wid = int(out_argsort[-1])
    word = tokenizer.index_word[wid]
    
    # If the word ends with space, we introduce a bit of randomness
    # Occasionally (about 31% of the time) pick one of the top 3 outputs
    # for that timestep in proportion to their predicted probabilities
    if word.endswith(' '):
        if np.random.normal()>0.5:
            width = 3
            top_ids = out_argsort[-width:]
            # Use the predicted probabilities (not the IDs) as sampling weights
            top_probs = out[0].ravel()[top_ids].astype('float64')
            wid = int(np.random.choice(top_ids, p=top_probs/top_probs.sum()))
            word = tokenizer.index_word[wid]
            
    # Append the prediction
    text.append(word)
    
    # Recursively make the current prediction the next input
    x = np.array([[wid]])
    
# Print the final output    
print('\n')
print('='*60)
print("Final text: ")
print(''.join(text))

Making predictions from a 54 element long input


Final text: 
chapter i. down the rabbit-hole alice was beginning to get very tired of sitting by her sister on the bank , and then they went to the station .
 `` it 's all right , '' said mrs jo to herself , `` and you 'llitte , and i 'll be able to see your father 's .
 i 'm not goin ' to the house , anyhow , '' said the story girls .
 `` i 'm not going to be something in the world , '' said mrs. march , who wanted to be as much as her own .
 `` i 'm sorry to be able to say that , '' said the princess , `` any more than yours , and i 'lld thee the most beautiful sea ... and i 'm going to bring you a little while , and i 'll be all right .
 i 'm sure i shall be able to say that , if you will be ablemant to be done .
 i 'm sure i shall be some way to the sea .
 i 'm surprised they were all there .
 i 'm sure i shall be able-tone , '' said mrs. march , who was so sorry to be ablazed at the door of the way .
 `` what are you doing there ? 

## Beam search decoding

Beam search is a more sophisticated decoding technique that usually produces better text. In beam search we predict several timesteps into the future and pick the sequence that gives the best joint probability. Remember that in greedy decoding we only predicted one step into the future.
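A toy example shows why looking ahead can beat the greedy choice. The sketch below runs a standard pruned beam search over a hypothetical table of next-token probabilities (`NEXT_PROBS`, invented for this example). Note that it prunes to `beam_width` candidates after every step, whereas the notebook's `beam_search` function expands the top `beam_width` candidates at every level down to `beam_depth` and ranks all resulting sequences.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the current token
NEXT_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
}


def beam_search_toy(start, depth, beam_width):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, joint log-probability)
    for _ in range(depth):
        candidates = []
        for tokens, log_p in beams:
            for tok, p in NEXT_PROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], log_p + math.log(p)))
        # Rank by joint log-probability and prune to the beam width
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


best_tokens, best_log_p = beam_search_toy("<s>", depth=2, beam_width=2)[0]
print(best_tokens, round(math.exp(best_log_p), 2))
```

Greedy decoding would start with `a` (probability 0.6) and end with a joint probability of 0.3, while the beam keeps `b` alive and finds the sequence with joint probability 0.36.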

### Defining the beam search logic

In [24]:
# Section 10.6

# Code listing 10.8

def beam_one_step(model, input_, state):    
    """ Perform the model update and output for one step"""
    output, new_state = model.predict([input_, state], verbose=0)
    return output, new_state


def beam_search(model, input_, state, beam_depth=5, beam_width=3, ignore_blank=True):
    """ Defines an outer wrapper for the computational function of beam search.
    Note: ignore_blank is currently unused. """
    
    def recursive_fn(input_, state, sequence, log_prob, i):
        """ This function performs the actual recursive computation of the long string """
        
        if i == beam_depth:
            # Base case: terminate the beam search and record the candidate
            results.append((list(sequence), state, np.exp(log_prob)))            
            return
        
        # Recursive case: keep computing the output using the previous outputs
        output, new_state = beam_one_step(model, input_, state)
        
        # Get the top beam_width candidates for the given depth
        top_probs, top_ids = tf.nn.top_k(output, k=beam_width)
        top_probs, top_ids = top_probs.numpy().ravel(), top_ids.numpy().ravel()
        
        # For each candidate compute the next prediction
        for p, wid in zip(top_probs, top_ids):                
            new_log_prob = log_prob + np.log(p)
            
            # Penalize the joint probability whenever the same symbol repeats
            if len(sequence)>0 and wid == sequence[-1]:
                new_log_prob = new_log_prob + np.log(1e-1)
                
            sequence.append(wid)                
            recursive_fn(np.array([[wid]]), new_state, sequence, new_log_prob, i+1)                                         
            sequence.pop()
    
    results = []
    sequence = []
    log_prob = 0.0
    recursive_fn(input_, state, sequence, log_prob, 0)    

    # Sort the candidates by joint probability, most likely first
    results = sorted(results, key=lambda x: x[2], reverse=True)

    return results

## Generating the actual text

In [25]:
# Section 10.6

# Code listing 10.9

text = get_ngrams(
    "CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank ,".lower(),     
    ngrams
)

seq = tokenizer.texts_to_sequences([text])

# build up model state using the given string
print("Making {} predictions from input".format(len(seq[0])))

#model.reset_states()
state = np.zeros(shape=(1,1024))
for c in seq[0]:    
    out, state = infer_model.predict([np.array([[c]]), state], verbose=0)

# Get final prediction after feeding the input string
wid = int(np.argmax(out[0],axis=-1).ravel()[0])
word = tokenizer.index_word[wid]
text.append(word)

x = np.array([[wid]])

# Predict for 100 time steps
for i in range(100):    
    
    # Get the results from beam search
    result = beam_search(infer_model, x, state, 7, 2)
    
    # Get one of the top 10 results based on their likelihood
    n_probs = np.array([p for _,_,p in result[:10]])
    p_j = np.random.choice(list(range(n_probs.size)), p=n_probs/n_probs.sum())                    
    best_beam_ids, state, _ = result[p_j]
    x = np.array([[best_beam_ids[-1]]])
            
    text.extend([tokenizer.index_word[w] for w in best_beam_ids])    

print('\n')
print('='*60)
print("Final text: ")
print(''.join(text))

Making 54 predictions from input


Final text: 
chapter i. down the rabbit-hole alice was beginning to get very tired of sitting by her sister on the bank , and they were sitting on the grass , and then there was a great deal of trouble .
 the princess was sitting on the grass , and her eyes were standing at the door of the window , and then they went on again , and then he went into the house and there was nothing to do with them .
 `` what is it ? ''
 said mrs. jo , smiling .
 `` it 's all right , and i 'm going to tell you that you will never be afraid of that , '' said the king .
 `` i 'll tell you what i want .
 there is nothing else to do , '' answered mrs. jo , smiling at the entrance of the world .
 `` i do n't want any more , and i 'm going to tell you that you were not going to be there , '' said mrs. jo .
 `` i do n't know what to do .
 i 'm going to teach you to see that you will never be the best thing to do with you . ''
 `` i 'm not going to tell you , '' she said , with