In this example, we are going to generate Shakespearean text using a Character RNN. We will build and train an RNN to predict the next character in a sentence.

#### Importing Libraries

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os

#### Getting Data

In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt','https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


#### Printing a few characters and getting the length of the corpus

In [3]:
text = open(path_to_file, 'rb').read()
text = text.decode(encoding='utf-8')
print ('Total number of characters in the corpus is:', len(text))
print('The first 100 characters of the corpus are as follows:\n', text[:100])

Total number of characters in the corpus is: 1115394
The first 100 characters of the corpus are as follows:
 First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


#### Viewing all the unique characters

In [4]:
"".join(sorted(set(text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

#### Tokenization
In this step, we encode every character as an integer. We set char_level=True to get character-level encoding rather than the default word-level encoding. Tokenizer converts the text to lowercase by default.

In [5]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(text)

Now the tokenizer can encode a sentence (or a list of sentences) to a list of character IDs and back, and it tells us how many distinct characters there are and the total number of characters in the text.

In [6]:
tokenizer.texts_to_sequences(["First"])

[[20, 6, 9, 8, 3]]

In [7]:
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']
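
To make the idea of character-level encoding concrete, here is a minimal pure-Python sketch of what such a tokenizer does: it lowercases the text, counts the characters, and assigns IDs starting at 1 with the most frequent character first. The helper `build_char_index` and the sample string are hypothetical, and ties are broken alphabetically for simplicity, so the IDs it produces will not match the ones shown above.

```python
from collections import Counter

def build_char_index(text):
    """Map each lowercase character to an integer ID, most frequent first."""
    counts = Counter(text.lower())
    # IDs start at 1; breaking ties alphabetically is a simplification of this sketch
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}

sample = "First Citizen: speak, speak."
char_to_id = build_char_index(sample)
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode a word to IDs and decode it back
encoded_sample = [char_to_id[ch] for ch in "first"]
decoded_sample = "".join(id_to_char[i] for i in encoded_sample)
print(encoded_sample, decoded_sample)
```

Note that `sequences_to_texts()` joins the decoded tokens with spaces, which is why the output above reads 'f i r s t' rather than 'first'.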

In [8]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters
print("number of distinct characters:", max_id,'\n',"total number of characters:",dataset_size)

number of distinct characters: 39 
 total number of characters: 1115394


Let’s encode the full text so each character is represented by its ID (we subtract 1 to get IDs from 0 to 38, rather than from 1 to 39):

In [9]:
[encoded] = np.array(tokenizer.texts_to_sequences([text])) - 1


#### Splitting the Sequential Dataset
When splitting a sequential dataset, it is important to avoid any overlap between the training set, the validation set, and the test set. For example, we can take the first 90% of the text for the training set, then the next 5% for the validation set, and the final 5% for the test set.

Here we create a tf.data.Dataset that returns the characters of the training set one by one.

In [10]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
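
The notebook only builds the training set, but the 90% / 5% / 5% split described above can be sketched with plain integer arithmetic. The helper `split_sizes` and its parameter names are hypothetical; the corpus length is the one printed earlier.

```python
def split_sizes(n_chars, train_pct=90, valid_pct=5):
    """Return the end indices of the training and validation slices."""
    train_end = n_chars * train_pct // 100
    valid_end = n_chars * (train_pct + valid_pct) // 100
    return train_end, valid_end

n_chars = 1_115_394  # corpus length printed earlier
train_end, valid_end = split_sizes(n_chars)

# The three slices are contiguous and never overlap:
#   training:   [0, train_end)
#   validation: [train_end, valid_end)
#   test:       [valid_end, n_chars)
train_len = train_end
valid_len = valid_end - train_end
test_len = n_chars - valid_end
print(train_len, valid_len, test_len)
```

With these indices, the validation and test sets could be taken as `encoded[train_end:valid_end]` and `encoded[valid_end:]`.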

#### Chopping the Sequential Dataset into Multiple Windows

By default, the window() method creates non-overlapping windows, but to get the largest possible training set we use shift=1 so that the first window contains characters 0 to 100, the second contains characters 1 to 101, and so on. To ensure that all windows are exactly 101 characters long (which will allow us to create batches without having to do any padding), we set drop_remainder=True (otherwise the last 100 windows will contain 100 characters, 99 characters, and so on down to 1 character).

The window() method creates a dataset that contains windows, each of which is also represented as a dataset. It’s a nested dataset, analogous to a list of lists. This is useful when you want to transform each window by calling its dataset methods (e.g., to shuffle them or batch them).

In [11]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)
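
The effect of shift=1 and drop_remainder can be illustrated on a tiny sequence with plain Python lists. This is only a sketch of the windowing logic, not of how tf.data implements it, and the helper `sliding_windows` is hypothetical.

```python
def sliding_windows(seq, length, shift=1, drop_remainder=True):
    """Return overlapping windows of `seq`, each starting `shift` items after the previous one."""
    windows = []
    for start in range(0, len(seq), shift):
        window = seq[start:start + length]
        if drop_remainder and len(window) < length:
            break  # every later window would be short as well
        windows.append(window)
    return windows

toy = list(range(6))
full = sliding_windows(toy, length=3)
ragged = sliding_windows(toy, length=3, drop_remainder=False)

print(full)    # only complete windows of length 3
print(ragged)  # also includes the short windows at the end
```

With a window length of 3, dropping the remainder removes the last 2 short windows, just as it removes the last 100 short windows when the length is 101.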

We cannot use a nested dataset directly for training, as our model will expect tensors as input, not datasets. So, we must call the
flat_map() method: it converts a nested dataset into a flat dataset (one that does not contain datasets). For example, suppose {1, 2, 3} represents a dataset containing the sequence of tensors 1, 2, and 3. If you flatten the nested dataset {{1, 2}, {3, 4, 5, 6}}, you get back the flat dataset {1, 2, 3, 4, 5, 6}. Moreover, the flat_map() method takes a function as an argument, which allows you to transform each dataset in a nested dataset before flattening. For example, if you pass the function lambda ds: ds.batch(2) to flat_map(), then it will transform the nested dataset {{1, 2}, {3, 4, 5, 6}} into the flat dataset {[1, 2], [3, 4], [5, 6]}: it’s a dataset of tensors of size 2. With that in mind, we are ready to flatten our dataset.

In [12]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))
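
The flattening example from the text can be reproduced with nested Python lists. The helpers `batch` and `flat_map` below are hypothetical stand-ins that mimic the behaviour described above; they are not the tf.data methods themselves.

```python
def batch(items, size):
    """Group consecutive items into lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def flat_map(nested, fn):
    """Apply `fn` to each inner list, then concatenate the results."""
    flat = []
    for inner in nested:
        flat.extend(fn(inner))
    return flat

nested = [[1, 2], [3, 4, 5, 6]]

flat_plain = flat_map(nested, lambda ds: ds)
flat_batched = flat_map(nested, lambda ds: batch(ds, 2))

print(flat_plain)    # [1, 2, 3, 4, 5, 6]
print(flat_batched)  # [[1, 2], [3, 4], [5, 6]]
```

In our pipeline every window holds exactly window_length characters, so batching each window with that size yields a single tensor per window.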

Gradient Descent works best when the instances in the training set are independent and identically distributed, so we need to shuffle the windows. Then we can batch the windows and separate the inputs (the first 100 characters of each window) from the targets (the last 100 characters, i.e., the same sequence shifted one character ahead).

In [13]:
np.random.seed(42)
tf.random.set_seed(42)
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
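
The slicing in the last line can be checked on a small NumPy array. This sketch uses two toy windows of length 5 instead of 32 windows of length 101; the array values are made up for illustration.

```python
import numpy as np

# Two toy windows of length 5 (a "batch" of shape [2, 5])
windows = np.array([[0, 1, 2, 3, 4],
                    [5, 6, 7, 8, 9]])

X = windows[:, :-1]  # every character except the last one
Y = windows[:, 1:]   # every character except the first one

print(X)
print(Y)
```

At each time step the target is the character that follows the corresponding input character, so inputs and targets have the same shape.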

The figure below summarizes the dataset preparation steps discussed so far (showing windows of length 11 rather than 101, and a batch size of 3 instead of 32).

![Img](https://cdn.iisc.talentsprint.com/CDS/Images/Prep_shuffled_windows.PNG)

$\text{Figure: Preparing a dataset of shuffled windows}$ [Image Source: Ref. 1]

The categorical input features should generally be encoded, usually as one-hot vectors or as embeddings. Here, we will encode each character using a one-hot vector because there are fairly few distinct characters (only 39):


In [14]:
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
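
As a rough illustration of what one-hot encoding does to a batch of character IDs, here is a NumPy sketch. The helper `one_hot` is hypothetical and uses a depth of 4 instead of 39 to keep the output readable.

```python
import numpy as np

def one_hot(ids, depth):
    """Replace each integer ID with a vector of length `depth` holding a single 1."""
    return np.eye(depth, dtype=np.float32)[ids]

ids = np.array([[0, 2, 1]])          # shape [batch=1, steps=3]
one_hot_batch = one_hot(ids, depth=4)

print(one_hot_batch.shape)           # one extra axis of size `depth`
print(one_hot_batch[0])
```

The encoding adds a trailing axis whose size equals the number of distinct characters, which is why the inputs in our dataset end up with a last dimension of 39.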

Finally, we just need to add prefetching:


In [15]:
dataset = dataset.prefetch(1)

In [16]:
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)


#### Creating and Training the Model

Run this section on a GPU; training may take 1 to 2 hours.

The GRU layer only uses the cuDNN-optimized GPU implementation when the default values are kept for the following arguments: activation, recurrent_activation, recurrent_dropout, unroll, use_bias and reset_after.

In [None]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id], dropout=0.2),
    keras.layers.GRU(128, return_sequences=True, dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, epochs=10)


Epoch 1/10
   1061/Unknown 89s 76ms/step - loss: 2.4727
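
To get a feel for the size of this model, we can estimate its number of trainable parameters by hand. This is a back-of-the-envelope sketch that assumes the Keras default reset_after=True, in which case each GRU layer has three weight blocks (update gate, reset gate, candidate state), each with an input kernel, a recurrent kernel, and two bias vectors. The helper names are hypothetical; calling model.summary() is the authoritative way to check the numbers.

```python
def gru_params(input_dim, units):
    """Parameter count of a GRU layer, assuming reset_after=True (two bias vectors per block)."""
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    """Parameter count of a Dense layer: weights plus biases."""
    return input_dim * units + units

max_id = 39   # number of distinct characters
units = 128   # GRU units per layer, as in the model above

first_gru = gru_params(max_id, units)
second_gru = gru_params(units, units)
output_layer = dense_params(units, max_id)
total = first_gru + second_gru + output_layer

print(first_gru, second_gru, output_layer, total)
```

The TimeDistributed wrapper applies the same Dense layer at every time step, so it does not add any parameters of its own.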

#### Using the Model to Generate Text
Now we have a model that can predict the next character in text written by Shakespeare. To feed it some text, we first need to preprocess it like we did earlier, so let’s create a little function for this:


In [None]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

Now let’s use the model to predict the next letter in some text:


In [None]:
X_new = preprocess(["How are yo"])
Y_pred = np.argmax(model(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence, last char

The model guessed right. Now let’s use this model to generate new text.


#### Generating Fake Shakespearean Text

To generate new text using the Char-RNN model, we could feed it some text, make
the model predict the most likely next letter, add it at the end of the text, then give the
extended text to the model to guess the next letter, and so on. But in practice this
often leads to the same words being repeated over and over again.

Instead, we can pick the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s `tf.random.categorical()` function. This will generate more diverse and interesting text. The `categorical()` function samples random class indices, given the class log probabilities (logits). To have more control over the diversity of the generated text, we can divide the logits by a number called the temperature, which we can tweak as we wish: a temperature close to 0 will favor the high-probability characters, while a very high temperature will give all characters an equal probability. The cells below first try `categorical()` on a toy distribution, then define a `next_char()` function that uses this approach to pick the next character to add to the input text:


In [None]:
tf.random.set_seed(42)
tf.random.categorical([[np.log(0.5), np.log(0.4), np.log(0.1)]], num_samples=40).numpy()

In [None]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]
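
To see how the temperature reshapes the distribution before sampling, we can rescale a toy probability vector with NumPy. This is only a sketch of the arithmetic (log, divide by the temperature, softmax); the helper `apply_temperature` is hypothetical, and the probabilities are the same toy values used with categorical() above.

```python
import numpy as np

def apply_temperature(probas, temperature):
    """Return the distribution obtained by dividing the log probabilities by `temperature`."""
    logits = np.log(probas) / temperature
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

probas = np.array([0.5, 0.4, 0.1])

cold = apply_temperature(probas, 0.2)     # sharper: favors the most likely class
neutral = apply_temperature(probas, 1.0)  # unchanged
hot = apply_temperature(probas, 5.0)      # flatter: closer to uniform

print(cold.round(3), neutral.round(3), hot.round(3))
```

A low temperature concentrates the probability mass on the most likely character, while a high temperature pushes the distribution toward uniform, which explains the difference between the generated samples at temperatures 0.2, 1, and 2.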

In [None]:
tf.random.set_seed(42)

next_char("How are yo", temperature=1)

In [None]:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [None]:
tf.random.set_seed(42)

print(complete_text("t", temperature=0.2))

In [None]:
print(complete_text("t", temperature=1))

In [None]:
print(complete_text("t", temperature=2))