<a href="https://colab.research.google.com/github/s34836/WUM/blob/main/Lab_13_RNN_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN Text Generation

## Tasks

The `shakespeare.txt` file contains a sample of Shakespeare's works. Use the data to train an RNN model to generate similar text by predicting the next character.

The code below encodes the text and splits it into sequences of 100 characters (the first 99 characters are the input and the last character is the target).
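As a toy illustration of that input/target split (separate from the notebook's own `tf.data` pipeline), the sketch below cuts a short integer array into non-overlapping chunks and separates the last element of each chunk as the target. The names `toy_encoded` and `toy_sequence_length` are made up for this example, and the chunk length of 5 stands in for the real value of 100.

```python
import numpy as np

# Hypothetical stand-in for the encoded corpus: 10 integer "character" ids.
toy_encoded = np.arange(10)
toy_sequence_length = 5  # the notebook itself uses 100

# Drop any incomplete tail, then cut into non-overlapping chunks
# (the same effect as batch(..., drop_remainder=True)).
n_chunks = len(toy_encoded) // toy_sequence_length
chunks = toy_encoded[: n_chunks * toy_sequence_length].reshape(
    n_chunks, toy_sequence_length
)

inputs = chunks[:, :-1]  # all but the last id of each chunk
targets = chunks[:, -1]  # the last id of each chunk

print(inputs)   # [[0 1 2 3]
                #  [5 6 7 8]]
print(targets)  # [4 9]
```

Because the chunks do not overlap, each sequence of 100 characters yields exactly one training example.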

1. Create a network with an Embedding layer, a GRU layer, and a Dense output layer, and train it on the prepared dataset.
2. Use the model to generate new text from a seed (you can use the one provided below). Encode the seed sequence, pass it to the model, and predict the next character. Append that character to the sequence and feed the sequence back to the model. Repeat the process to generate the desired number of characters, then decode the generated sequence.

Hint: The model should output a probability distribution generated by the softmax function. You will get better results if you sample from the distribution instead of always predicting the maximum-probability character.
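To make the hint concrete, here is a minimal sketch of sampling versus greedy selection on a made-up three-way distribution. The helper `sample_with_temperature` and the array `toy_probs` are assumptions for illustration only. Temperature rescaling is one common way to control how random the samples are; the task itself only asks for sampling.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw one index from `probs` after temperature rescaling.

    temperature < 1 sharpens the distribution (closer to argmax),
    temperature > 1 flattens it (more random).
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # Recover log-probabilities, rescale, and re-normalize with a softmax.
    logits = np.log(probs + 1e-9) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# Hypothetical softmax output over a 3-character vocabulary.
toy_probs = np.array([0.7, 0.2, 0.1])
rng = np.random.default_rng(0)

greedy = int(np.argmax(toy_probs))  # always picks index 0
samples = [sample_with_temperature(toy_probs, 1.0, rng) for _ in range(1000)]

print("greedy choice:", greedy)
print("sampled counts per index:", np.bincount(samples, minlength=3))
```

Greedy decoding returns the same index every time, which tends to make generated text fall into repetitive loops. Sampling occasionally picks the less likely characters as well.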

In [1]:
import numpy as np
import tensorflow as tf

with open('shakespeare.txt', 'r', encoding='utf-8') as f:
    shakespeare_text = f.read()

# Sort the characters so the index mapping is the same on every run
vocab = sorted(set(shakespeare_text))
vocab_size = len(vocab)
char_to_index = {char: idx for idx, char in enumerate(vocab)}

def encode(text):
    return np.array([char_to_index[char] for char in text if char in char_to_index])

def decode(indices):
    return ''.join(vocab[i] for i in indices if i < vocab_size)

encoded_text = encode(shakespeare_text)
encoded_text_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)

sequence_length = 100
sequences = encoded_text_dataset.batch(sequence_length, drop_remainder=True)

dataset = sequences.map(lambda seq: (seq[:-1], seq[-1]))

batch_size = 64
dataset = dataset.shuffle(buffer_size=10000).batch(batch_size)

In [5]:
for input_seq, target in dataset.unbatch().take(1):  # avoid shadowing the built-in input()
    print("Input sequence:")
    print(decode(input_seq.numpy()))
    print("\nTarget:", decode([target.numpy()]))

Input sequence:
put on their cloaks;
When great leaves fall, the winter is at hand;
When the sun sets, who doth not

Target:  


In [6]:
seed = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
"""

In [7]:

embedding_dim = 64
rnn_units = 256

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(sequence_length - 1,), dtype=tf.int32),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    tf.keras.layers.GRU(rnn_units),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.summary()

In [8]:
history = model.fit(dataset, epochs=5)

Epoch 1/5
175/175 ━━━━━━━━━━━━━━━━━━━━ 78s 423ms/step - accuracy: 0.1533 - loss: 3.4147
Epoch 2/5
175/175 ━━━━━━━━━━━━━━━━━━━━ 89s 467ms/step - accuracy: 0.2828 - loss: 2.5658
Epoch 3/5
175/175 ━━━━━━━━━━━━━━━━━━━━ 73s 407ms/step - accuracy: 0.3105 - loss: 2.3878
Epoch 4/5
175/175 ━━━━━━━━━━━━━━━━━━━━ 75s 415ms/step - accuracy: 0.3296 - loss: 2.3206
Epoch 5/5
175/175 ━━━━━━━━━━━━━━━━━━━━ 75s 419ms/step - accuracy: 0.3456 - loss: 2.2292


In [15]:


def generate_text(model, seed_text, num_chars, temperature):
    # Encode the seed (characters outside the vocabulary are ignored)
    seed_idx = encode(seed_text)
    if len(seed_idx) == 0:
        raise ValueError("Seed is empty after encoding (it contains only out-of-vocabulary characters).")

    generated = list(seed_idx)

    for _ in range(num_chars):
        # Take the last 99 characters as input; if the sequence is shorter, left-pad it with zeros
        x = np.array(generated[-(sequence_length - 1):], dtype=np.int32)
        if len(x) < (sequence_length - 1):
            x = np.pad(x, (sequence_length - 1 - len(x), 0), mode="constant", constant_values=0)

        x = x.reshape(1, -1)

        probs = model.predict(x, verbose=0)[0]  # (vocab_size,)

        # Temperature: take logits = log(p), divide by the temperature, then re-apply softmax
        probs = np.asarray(probs).astype(np.float64)
        probs = np.log(probs + 1e-9) / temperature
        probs = np.exp(probs)
        probs = probs / np.sum(probs)

        next_idx = np.random.choice(vocab_size, p=probs)
        #next_idx = np.argmax(probs)
        generated.append(next_idx)

    return decode(generated)

print(generate_text(model, seed, num_chars=300, temperature=1.2))

To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To dieto sleep,
No more; and by a sleep to say we end
Cpaml sheliu thous, sacl, nost, bomy teektel!
Bey tale. I'ccod-oked. Cely an I wormarecr.

KmeSI Gror:
tfive fel. WI Romk, ceres
Be livy ond fnowline. -ut; dounlpd pridednar sextisct, and you.

MTIOU:
O tanD on int beqwit thy ve dtislle
On,
Keole mist
mndes selle'n! the rure, g ofcliwhenc:
And in be
