<a href="https://colab.research.google.com/github/splAcharya/RNN_Text_Generation/blob/main/Generating_Text_Using_RNNS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## References
 
The following tutorials were used to learn about text generation using RNNs.
Most of the code was taken from the TensorFlow tutorial, which was reworked here to study it in depth.

1. Text Generation Using RNN: https://www.tensorflow.org/text/tutorials/text_generation
2. https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/



## Generating Text Using Recurrent Neural Networks

This notebook attempts to generate story text for Alice in Wonderland using Gated Recurrent Units (GRUs).

## Import Python libraries

In [1]:
import tensorflow as tf
import numpy as np
from matplotlib import pyplot as plt
import os
import time

## Load the Alice in Wonderland Dataset


In [8]:
!curl https://www.gutenberg.org/files/11/11-0.txt -o alice.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  170k  100  170k    0     0   355k      0 --:--:-- --:--:-- --:--:--  354k


In [9]:
# Read, then decode for py2 compat.
text = open("alice.txt", 'rb').read().decode(encoding='utf-8')

# length of text is the number of characters in it
print(f'Length of text: {len(text)} characters')

Length of text: 167808 characters


In [10]:
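# Keep only the story body: the slice starts just before "CHAPTER I." and drops
# the front matter before offset 1485 and the trailing text from offset 148860 on.
text = text[1485 : 148860]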

In [12]:
len(text)

147375

In [13]:
# The unique characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

74 unique characters


## Preprocess Data

### Vectorize the text

In [14]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

In [15]:
#Now create the tf.keras.layers.StringLookup layer:
ids_from_chars = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)

In [16]:
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[43, 44, 45, 46, 47, 48, 49], [66, 67, 68]]>

In [17]:
chars_from_ids = tf.keras.layers.StringLookup(vocabulary=ids_from_chars.get_vocabulary(),
                                              invert=True,
                                              mask_token=None)

In [18]:
chars = chars_from_ids(ids)
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

In [19]:
tf.strings.reduce_join(chars, axis=-1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)
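
The IDs above start at 43 for `a` rather than at a zero-based position because, with `mask_token=None`, the lookup layer reserves index 0 for the out-of-vocabulary token `[UNK]` and numbers the sorted vocabulary from 1. The following is a minimal pure-Python sketch of that mapping, not the layer's actual implementation; `make_lookups`, `encode`, `decode`, and `toy_vocab` are hypothetical names used only for illustration.

```python
def make_lookups(vocab):
    """Build char->id and id->char tables with id 0 reserved for unknowns."""
    id_to_char = ["[UNK]"] + list(vocab)
    char_to_id = {ch: i for i, ch in enumerate(id_to_char)}
    return char_to_id, id_to_char


def encode(s, char_to_id):
    # Characters missing from the vocabulary fall back to id 0 ([UNK]).
    return [char_to_id.get(ch, 0) for ch in s]


def decode(ids, id_to_char):
    # Inverse lookup, analogous to a lookup layer built with invert=True.
    return "".join(id_to_char[i] for i in ids)


toy_vocab = sorted(set("abcxyz"))          # stand-in for the notebook's vocab
char_to_id, id_to_char = make_lookups(toy_vocab)

ids = encode("cab?", char_to_id)           # '?' is not in toy_vocab
print(ids)
print(decode(ids, id_to_char))
```

This also explains why the vocabulary reported by `get_vocabulary()` has one more entry than the number of unique characters in the text.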

In [20]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1).numpy()

In [21]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(147375,), dtype=int64, numpy=array([2, 1, 2, ..., 1, 2, 1])>

In [22]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [23]:
for ids in ids_dataset.take(1000):
  print(chars_from_ids(ids).numpy().decode('UTF-8'), end="")



CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure of
making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so _very_ remarkable in that; nor did Alice think it
so _very_ much out of the way to hear the Rabbit say to itself, “Oh
dear! Oh dear! I shall be late!” (when she thought it over afterwards,
it occurred to her that she ought to have wondered at this, but at the
time it all seemed quite natural); but when the Rabbit actually _t

In [24]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length + 1)

In [25]:
sequences = ids_dataset.batch(seq_length + 1, drop_remainder= True)
for seq in sequences.take(1):
  print(seq)
  print()
  print(chars_from_ids(seq))

tf.Tensor(
[ 2  1  2  1 16 21 14 29 33 18 31  3 22 10  2  1 17 57 65 56  3 62 50 47
  3 31 43 44 44 51 62  9 21 57 54 47  2  1  2  1  2  1 14 54 51 45 47  3
 65 43 61  3 44 47 49 51 56 56 51 56 49  3 62 57  3 49 47 62  3 64 47 60
 67  3 62 51 60 47 46  3 57 48  3 61 51 62 62 51 56 49  3 44 67  3 50 47
 60  3 61 51 61], shape=(101,), dtype=int64)

tf.Tensor(
[b'\r' b'\n' b'\r' b'\n' b'C' b'H' b'A' b'P' b'T' b'E' b'R' b' ' b'I' b'.'
 b'\r' b'\n' b'D' b'o' b'w' b'n' b' ' b't' b'h' b'e' b' ' b'R' b'a' b'b'
 b'b' b'i' b't' b'-' b'H' b'o' b'l' b'e' b'\r' b'\n' b'\r' b'\n' b'\r'
 b'\n' b'A' b'l' b'i' b'c' b'e' b' ' b'w' b'a' b's' b' ' b'b' b'e' b'g'
 b'i' b'n' b'n' b'i' b'n' b'g' b' ' b't' b'o' b' ' b'g' b'e' b't' b' '
 b'v' b'e' b'r' b'y' b' ' b't' b'i' b'r' b'e' b'd' b' ' b'o' b'f' b' '
 b's' b'i' b't' b't' b'i' b'n' b'g' b' ' b'b' b'y' b' ' b'h' b'e' b'r'
 b' ' b's' b'i' b's'], shape=(101,), dtype=string)


In [26]:
for seq in sequences.take(5):
  print(text_from_ids(seq))

b'\r\n\r\nCHAPTER I.\r\nDown the Rabbit-Hole\r\n\r\n\r\nAlice was beginning to get very tired of sitting by her sis'
b'ter on the\r\nbank, and of having nothing to do: once or twice she had peeped into\r\nthe book her sister'
b' was reading, but it had no pictures or\r\nconversations in it, \xe2\x80\x9cand what is the use of a book,\xe2\x80\x9d though'
b't Alice\r\n\xe2\x80\x9cwithout pictures or conversations?\xe2\x80\x9d\r\n\r\nSo she was considering in her own mind (as well as s'
b'he could, for the\r\nhot day made her feel very sleepy and stupid), whether the pleasure of\r\nmaking a d'


<p> For training, each input needs a label. Here, the label for every input character is the character that follows it. </p>

In [27]:
# Training needs (input, label) pairs: shift each sequence by one character.
def split_input_target(sequence):
  input_text = sequence[:-1] #skip last character
  target_text = sequence[1 : ] #skip first character
  return input_text, target_text

In [28]:
split_input_target("Tensorflow")

('Tensorflo', 'ensorflow')

In [29]:
dataset = sequences.map(split_input_target)

In [30]:
for in_eg, tar_eg in dataset.take(1):
  print(f"Input: {text_from_ids(in_eg)}")
  print(f"Target: {text_from_ids(tar_eg)}")

Input: b'\r\n\r\nCHAPTER I.\r\nDown the Rabbit-Hole\r\n\r\n\r\nAlice was beginning to get very tired of sitting by her si'
Target: b'\n\r\nCHAPTER I.\r\nDown the Rabbit-Hole\r\n\r\n\r\nAlice was beginning to get very tired of sitting by her sis'


In [31]:
# Create the training set
BATCH_SIZE = 64  # number of sequences per training batch

# Buffer size used to shuffle the dataset
BUFFER_SIZE = 20000
dataset= (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
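
As a quick sanity check on the pipeline above, the number of sequences and batches per epoch can be worked out by hand from values already shown in this notebook (the trimmed text length, `seq_length`, and `BATCH_SIZE`). This is only arithmetic, assuming both `batch(..., drop_remainder=True)` calls discard their incomplete final chunk.

```python
text_length = 147375   # len(text) after trimming, from the cell above
seq_length = 100
BATCH_SIZE = 64
BUFFER_SIZE = 20000

# Each training sequence holds seq_length + 1 characters (input plus shifted target).
n_sequences = text_length // (seq_length + 1)
leftover_chars = text_length % (seq_length + 1)

# Sequences are then grouped into batches; the incomplete last batch is dropped.
steps_per_epoch = n_sequences // BATCH_SIZE
leftover_sequences = n_sequences % BATCH_SIZE

print(n_sequences, leftover_chars)          # sequences, characters dropped
print(steps_per_epoch, leftover_sequences)  # batches, sequences dropped per epoch
print(BUFFER_SIZE >= n_sequences)           # buffer covers the whole dataset
```

Since the shuffle buffer is larger than the number of sequences, every sequence is held in the buffer at once, so each epoch sees a full shuffle of the data.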

## Build RNN Model

In [32]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [33]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [34]:
model = MyModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units = rnn_units)

## Try the Model

In [35]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 75) # (batch_size, sequence_length, vocab_size)


In [36]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  19200     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  76875     
                                                                 
Total params: 4,034,379
Trainable params: 4,034,379
Non-trainable params: 0
_________________________________________________________________
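
The parameter counts in the summary can be reproduced by hand from the hyperparameters. The sketch below assumes the GRU uses the Keras default `reset_after=True`, which keeps two bias vectors per gate; under that assumption the numbers match the summary exactly.

```python
# Hyperparameters taken from the notebook cells above.
vocab_size = 75        # 74 unique characters plus the [UNK] slot
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets (update gate, reset gate, candidate state), each with
# an input kernel, a recurrent kernel, and (assuming reset_after=True) two biases.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus bias, producing one logit per vocabulary entry.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```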


## Sample Output 
*Other ways to sample output:*
(https://medium.com/deep-learning-with-keras/sampling-in-text-generation-b2f4825e1dad)

In [37]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

In [38]:
sampled_indices

array([24, 26,  1, 39, 42, 67, 48,  6, 33, 72, 35, 68, 57, 21, 27, 25,  0,
       49,  0, 60, 53, 53, 55, 24, 73, 47, 38, 43, 46, 48, 30, 10, 15,  4,
       30,  2, 49,  1, 62, 18, 45,  5, 34, 34, 40, 70, 40, 41, 20,  2,  9,
       58, 67, 32, 58, 20, 68,  3, 60, 62, 30, 62, 35, 37, 50, 68,  9, 28,
       21, 44, 68, 45, 49, 55, 49, 32, 56, 66, 35, 59, 66, 14, 28, 14, 42,
       45, 45, 24, 71, 60, 55, 44, 33, 18, 24, 73, 24, 34, 25, 25])

In [39]:
print("Input:\n", text_from_ids(input_example_batch[0].numpy()))
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices))

Input:
 b' it as well as she could.\r\n\r\n\xe2\x80\x9cThe game\xe2\x80\x99s going on rather better now,\xe2\x80\x9d she said, by way of keeping up'

Next Char Predictions:
 b'KM\nZ_yf)T\xe2\x80\x99VzoHNL[UNK]g[UNK]rkkmK\xe2\x80\x9ceYadfQ.B!Q\rg\ntEc(UU[\xe2\x80\x94[]G\r-pySpGz rtQtVXhz-OHbzcgmgSnxVqxAOA_ccK\xe2\x80\x98rmbTEK\xe2\x80\x9cKULL'
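
The predictions above are gibberish because the model is still untrained: its logits are close to uniform, so drawing from them is close to picking characters at random. The sketch below illustrates, in plain Python, the difference between sampling from the logits (the idea behind `tf.random.categorical`) and always taking the largest one. The helper names and the toy logits are hypothetical and only stand in for one timestep of model output.

```python
import math
import random


def sample_from_logits(logits, rng):
    """Draw one index with probability proportional to exp(logit)."""
    m = max(logits)                                  # subtract max for stability
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def greedy(logits):
    """Always pick the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


rng = random.Random(0)
logits = [0.1, 0.0, -0.1, 0.05]   # nearly uniform, like an untrained model

draws = [sample_from_logits(logits, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(logits))]
print(counts)          # every index is drawn a substantial number of times
print(greedy(logits))  # greedy choice is always the same index
```

Sampling keeps the output varied, whereas a greedy choice returns the same character every time it sees the same logits, which can trap generation in a repeating loop.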


## Train the Model 

### Attach an optimizer and a loss function

In [40]:
target_example_batch.shape

TensorShape([64, 100])

In [41]:
example_batch_predictions.shape

TensorShape([64, 100, 75])

In [42]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [43]:
example_batch_loss = loss(target_example_batch, example_batch_predictions)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)

Prediction shape:  (64, 100, 75)  # (batch_size, sequence_length, vocab_size)
Mean loss:         4.317149


In [44]:
tf.exp(mean_loss).numpy()

74.97458
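
The exponential of the mean loss is close to the vocabulary size because a freshly initialized model spreads its probability almost evenly over all 75 outputs. As a short worked check, the cross-entropy of an exactly uniform prediction over 75 classes is ln(75), and its exponential is 75. The variable names here are illustrative only.

```python
import math

vocab_size = 75
uniform_probs = [1.0 / vocab_size] * vocab_size

# Cross-entropy for a single target is -log(probability assigned to the target).
target_id = 10                     # any target gives the same loss under uniform
uniform_loss = -math.log(uniform_probs[target_id])

observed_loss = 4.317149           # mean loss reported in the cell above

print(uniform_loss)                # ln(75)
print(math.exp(uniform_loss))      # recovers the vocabulary size
print(abs(observed_loss - uniform_loss))
```

A value well above the vocabulary size at initialization would suggest the model starts out confidently wrong rather than merely uncertain.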

In [45]:
model.compile(optimizer='adam', loss=loss)

### Configure Checkpoints

In [46]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
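
The `{epoch}` placeholder in `checkpoint_prefix` is a format field. The sketch below shows which prefixes result, assuming the callback fills the field with the 1-based epoch number; it only formats strings and does not touch the file system. `EPOCHS` is set to the value used for training in this notebook.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Assumption: the epoch number substituted into the template starts at 1.
EPOCHS = 50
prefixes = [checkpoint_prefix.format(epoch=e) for e in range(1, EPOCHS + 1)]

print(prefixes[0])
print(prefixes[-1])
print(len(prefixes))
```

Because every epoch gets its own prefix, all 50 sets of weights are kept on disk rather than overwritten.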

### Execute the training

In [47]:
EPOCHS = 50

In [48]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


### Generate Text

In [49]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
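
Two steps in `generate_one_step` shape the sampling distribution: dividing the logits by `temperature` and adding the `-inf` mask for `[UNK]`. The following pure-Python sketch shows their effect on the resulting probabilities. It is an illustration under simplified assumptions (a three-entry vocabulary with `[UNK]` at index 0), and `softmax` and `next_char_distribution` are hypothetical helpers, not part of the notebook.

```python
import math


def softmax(logits):
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(x - m) for x in logits]      # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def next_char_distribution(logits, temperature=1.0, masked_ids=(0,)):
    """Scale logits by temperature, mask some ids, and return probabilities."""
    scaled = [x / temperature for x in logits]
    for i in masked_ids:
        scaled[i] = float("-inf")                 # same effect as adding -inf
    return softmax(scaled)


logits = [5.0, 1.0, 0.0]                          # index 0 plays the role of [UNK]

p_default = next_char_distribution(logits, temperature=1.0)
p_sharp = next_char_distribution(logits, temperature=0.5)

print(p_default)   # index 0 has probability 0 despite the largest logit
print(p_sharp)     # lower temperature concentrates mass on the best allowed id
```

A temperature below 1 makes the output more predictable, and a temperature above 1 makes it more varied.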

In [50]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

In [51]:
start = time.time()
states = None
next_char = tf.constant(['CHAPTER:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

CHAPTER: Afiting its even give in having head to find my weak
in the world! Oh, my dook, I’ve sorntemed to make it scrowl of the sly.

“In that case, escope!” said Alice; “as you pleased a
moment to be Number of comifir.”

“It must be a very pretty jame, the Duchess only getting on the sont,” she said to herself, for she had
drunk half the bottle, she found her head pressing against the door, and the other was sitting on the thing,
but it is all the things get is: but she knew the White Rabbit was still in sight, hurriedly went on.

“I do,” Alice has veid nothing but out-of-the-way
things to happen, that it seemed quite nabused impatiently until it caws befween into a comfort, one way—never to be a book of
little cartwheels, and the sound of a good many voices all
dogithout trying, that she let the Lozying of the hall in her lef on the game
neck again, statid “The fartages and question, Ubblictly as she chose, for some wine of gleas togething about it is all she cradded, 