# Text Generation with Simple Recurrent Neural Networks (RNN)

**Introduction:**
    
In this activity, we'll explore basic text generation using a simple RNN on a small dataset containing the phrase "hello world". 

RNNs are suitable for sequence data, making them useful for tasks like text generation.

**Objective:**
We aim to train the RNN to understand the structure of the phrase so that it can generate text resembling it, given an initial character.

**Steps:**

1. Prepare the data: Tokenize the text into characters and organize it into input-output pairs.
2. Define the RNN model: Utilize TensorFlow to create a simple RNN model.
3. Train the model: Train the RNN on the data to learn the text structure.
4. Generate text: Use the trained model to generate new text based on a given seed character.
5. Discuss the key challenges: Review the limitations encountered in this simple example.

This activity will provide a hands-on understanding of basic text generation concepts using RNNs and TensorFlow.

### Install the necessary library using pip

First, install TensorFlow if you haven't already:

In [1]:
#pip install tensorflow

## Importing Necessary Libraries

In [2]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences

import numpy as np

The necessary libraries and modules from TensorFlow are imported for processing the data and building the RNN model.

## 1. Data Preparation

In [3]:
# Our tiny dataset
data = "hello world"

In [4]:
len(data)

11

A minimal dataset "hello world" is defined. This dataset is what the model will learn from.

## 2. Tokenization

The text data is converted into a numerical format using a character-level tokenizer. Each unique character gets mapped to a unique number.

In [5]:
tokenizer = Tokenizer(char_level=True)  # char_level=True for character tokenization
tokenizer.fit_on_texts([data])
sequences = tokenizer.texts_to_sequences([data])[0]

In this block:

- A `Tokenizer` object is created with `char_level=True` to tokenize the text at the character level.
- The `fit_on_texts()` method is called to build the character vocabulary from the data.
- The `texts_to_sequences()` method converts the text to a sequence of integers, where each integer represents a unique character.
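
The integers are not assigned arbitrarily: in this notebook, more frequent characters receive smaller indices, and index 0 is left unused so that it can serve as the padding value later. The following is a minimal pure-Python sketch of such a mapping, not the library's implementation. The name `char_to_index` is hypothetical, and breaking ties by first appearance is an assumption that happens to reproduce the sequence printed in the next cell.

```python
from collections import Counter

data = "hello world"

# Count characters; Counter keeps first-appearance order for equal counts.
counts = Counter(data)

# Most frequent character first; sorted() is stable, so ties keep their order.
ranked = sorted(counts, key=lambda ch: -counts[ch])

# Indices start at 1 because 0 is reserved (assumed here to be the padding value).
char_to_index = {ch: i for i, ch in enumerate(ranked, start=1)}

sequence = [char_to_index[ch] for ch in data]
print(char_to_index)
print(sequence)
```

Under these assumptions, "l" (three occurrences) maps to 1 and "o" (two occurrences) maps to 2, while the remaining characters are numbered in the order they first appear.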

In [6]:
print(tokenizer)
print(sequences)

<keras.preprocessing.text.Tokenizer object at 0x0000026A5C8FEBB0>
[3, 4, 1, 1, 2, 5, 6, 2, 7, 1, 8]


## 3. Preparing Input and Output Data

Sequences of characters are prepared as input for the model, with corresponding next characters as output. This teaches the model the sequence in which characters appear.

In [7]:
input_sequences = []
output_sequences = []

for i in range(1, len(sequences)):
    input_sequences.append(sequences[:i])
    output_sequences.append(sequences[i])

In [8]:
print('input_sequences:', input_sequences)
print('output_sequences:', output_sequences)

input_sequences: [[3], [3, 4], [3, 4, 1], [3, 4, 1, 1], [3, 4, 1, 1, 2], [3, 4, 1, 1, 2, 5], [3, 4, 1, 1, 2, 5, 6], [3, 4, 1, 1, 2, 5, 6, 2], [3, 4, 1, 1, 2, 5, 6, 2, 7], [3, 4, 1, 1, 2, 5, 6, 2, 7, 1]]
output_sequences: [4, 1, 1, 2, 5, 6, 2, 7, 1, 8]


In [9]:
len(output_sequences)

10

Here:

- Two empty lists, `input_sequences` and `output_sequences`, are created to hold the input and output data for the model.
- A loop iterates through the sequence of integers. For each position `i`, the prefix `sequences[:i]` (everything before position `i`) is added to `input_sequences`, and the character at position `i`, which immediately follows that prefix, is added to `output_sequences`.
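
To make the pairs easier to read, they can be decoded back into characters. The sketch below is only an illustration: `index_to_char` is a hypothetical lookup table reconstructed by hand from the integer sequence printed earlier, not an object taken from the tokenizer.

```python
# Hypothetical reverse mapping, reconstructed from the printed sequence.
index_to_char = {1: "l", 2: "o", 3: "h", 4: "e", 5: " ", 6: "w", 7: "r", 8: "d"}
sequences = [3, 4, 1, 1, 2, 5, 6, 2, 7, 1, 8]  # "hello world"

pairs = []
for i in range(1, len(sequences)):
    # Everything before position i is the input ...
    prefix = "".join(index_to_char[t] for t in sequences[:i])
    # ... and the character at position i is the expected output.
    target = index_to_char[sequences[i]]
    pairs.append((prefix, target))

for prefix, target in pairs:
    print(repr(prefix), "->", repr(target))
```

The first pair is `'h' -> 'e'` and the last is `'hello worl' -> 'd'`, so each training example asks the model to predict the single character that follows a prefix of the phrase.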

## 4. Padding Sequences

All input sequences are padded to have the same length to ensure consistent input shape for the model.

In [10]:
input_sequences = pad_sequences(input_sequences)
output_sequences = tf.keras.utils.to_categorical(output_sequences, num_classes=len(tokenizer.word_index) + 1)

- `pad_sequences()` is used to ensure that all input sequences have the same length.
- Shorter sequences in `input_sequences` are padded with zeros on the left until they match the longest one.
- `output_sequences` are one-hot encoded using `to_categorical()` to prepare them for training the model.
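
The two transformations can be sketched in plain Python to show what the arrays contain. This is a simplified illustration, not the Keras implementation: `pad_left` and `one_hot` are hypothetical helpers, and only the first three training pairs are used to keep the output short.

```python
def pad_left(seq, maxlen, value=0):
    """Left-pad seq with `value` up to maxlen (hypothetical helper)."""
    return [value] * (maxlen - len(seq)) + list(seq)


def one_hot(index, num_classes):
    """Return a list with 1.0 at `index` and 0.0 elsewhere (hypothetical helper)."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


raw_inputs = [[3], [3, 4], [3, 4, 1]]  # first three input prefixes
raw_targets = [4, 1, 1]                # their next characters

maxlen = max(len(s) for s in raw_inputs)
padded = [pad_left(s, maxlen) for s in raw_inputs]

# 9 classes: the 8 distinct characters plus the reserved index 0.
targets = [one_hot(t, num_classes=9) for t in raw_targets]

print(padded)
print(targets[0])
```

Each padded row has the same length, and each target row contains a single 1.0 at the position of the expected character index.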

In [11]:
print('input_sequences:', input_sequences)
#print('output_sequences:', output_sequences)

input_sequences: [[0 0 0 0 0 0 0 0 0 3]
 [0 0 0 0 0 0 0 0 3 4]
 [0 0 0 0 0 0 0 3 4 1]
 [0 0 0 0 0 0 3 4 1 1]
 [0 0 0 0 0 3 4 1 1 2]
 [0 0 0 0 3 4 1 1 2 5]
 [0 0 0 3 4 1 1 2 5 6]
 [0 0 3 4 1 1 2 5 6 2]
 [0 3 4 1 1 2 5 6 2 7]
 [3 4 1 1 2 5 6 2 7 1]]


## 5. Split Data into Features (X) and Target (y)

Data is organized into features (X) and targets (y), where X is the input to the model and y is the expected output.

In [12]:
X, y = input_sequences, output_sequences

The input and output data are assigned to X and y respectively.

## 6. Defining the RNN Model:

A simple RNN model is defined using:
- an embedding layer (to handle the input data), 
- a SimpleRNN layer (the recurrent part of the network), 
- and a dense layer (for final output predictions).

In [13]:
model = tf.keras.models.Sequential([
    # input layer
    tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=8, input_length=input_sequences.shape[1]),
    # RNN layer
    tf.keras.layers.SimpleRNN(8),
    # Output layer
    tf.keras.layers.Dense(len(tokenizer.word_index) + 1, activation='softmax')
])

In this block:

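- A sequential model is defined using the `Sequential()` class.
- An `Embedding` layer maps each character index to a trainable 8-dimensional vector (`output_dim=8`).
- A `SimpleRNN` layer with 8 units reads the embedded sequence and returns its final hidden state.
- A `Dense` layer with a softmax activation outputs a probability for each of the 9 possible indices (the 8 characters plus the padding index 0), which makes this a multi-class classification.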
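
To get a feel for how small this model is, the number of trainable parameters can be worked out by hand. The sketch below assumes the usual parameterization of these layers (an input kernel, a recurrent kernel, and a bias for the recurrent layer, and a kernel plus bias for the dense layer). The figures are computed here, not taken from the notebook's output, so they are worth checking against `model.summary()`.

```python
vocab_size = 9     # 8 distinct characters + the reserved index 0
embedding_dim = 8  # output_dim of the Embedding layer
rnn_units = 8      # units in the SimpleRNN layer

# One embedding vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Input kernel + recurrent kernel + bias, per unit.
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense kernel + bias.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
```

Under these assumptions the model has fewer than 300 parameters, which is ample for memorizing a single 11-character phrase.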

## 7. Compiling and Training the Model

In [14]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, verbose=0)

<keras.callbacks.History at 0x26a5cb5bbe0>

- The model is compiled with the categorical cross-entropy loss, the Adam optimizer, and accuracy as the metric monitored during training. These are common choices for a multi-class classification task.

- The model is trained for 500 epochs (`epochs=500`) so that it learns the patterns in the input text and predicts the next character.
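
The loss used here compares the softmax probabilities with the one-hot target. A minimal sketch of that computation for a single example follows. The logits are made-up numbers for illustration, the two function names are hypothetical, and the library version includes numerical safeguards that are omitted here.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)


logits = [0.0, 2.0, 0.5]  # made-up scores for a 3-class toy problem
target = [0.0, 1.0, 0.0]  # the true class is index 1

probs = softmax(logits)
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is what training drives the model toward.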

## 8. Defining a Function to Generate Text

A function to generate text is defined. Given a seed text, it predicts the next characters iteratively using the trained model.

In [15]:
def generate_text(seed_text, next_chars=10):
    for _ in range(next_chars):
        sequences = tokenizer.texts_to_sequences([seed_text])[0]
        sequences = pad_sequences([sequences], maxlen=input_sequences.shape[1])
        predicted = model.predict(sequences, verbose=0)
        predicted_index = np.argmax(predicted, axis=-1)[0]
        # Index 0 is the padding value and has no character, so fall back to "".
        seed_text += tokenizer.index_word.get(predicted_index, "")
    return seed_text

In this block:

- A function `generate_text` is defined to generate text given a seed text.
- The function iterates for the specified number of characters (`next_chars`). In each step it tokenizes the current text, pads it to the training input length, predicts a probability for every character, picks the most probable one with `np.argmax`, and appends that character to the seed text.
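
One detail worth noting is what the model actually receives once the text grows longer than the training input length of 10. With its default settings, `pad_sequences` pads on the left and, when a sequence is too long, drops tokens from the front. The sketch below mimics that behavior in plain Python; `model_window` is a hypothetical helper, not part of Keras.

```python
def model_window(token_ids, maxlen=10, pad_value=0):
    """Mimic default pre-truncation and pre-padding (hypothetical helper)."""
    kept = token_ids[-maxlen:]  # keep only the most recent tokens
    return [pad_value] * (maxlen - len(kept)) + kept


full = [3, 4, 1, 1, 2, 5, 6, 2, 7, 1, 8]  # "hello world"

print(model_window([3]))   # seed "h": left-padded with zeros
print(model_window(full))  # 11 tokens: the leading "h" is dropped
```

Once the whole phrase has been generated, the window no longer starts with "h" or with padding, so the model is asked about an input it never saw during training.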

## 9. Generating and Printing Text

Finally, the `generate_text` function is used to generate text starting with the character "h", and the result is printed.

In [16]:
generated_text = generate_text("h", next_chars=20)
print(generated_text)

hello worldrdeellorll


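- The `generate_text` function is called with a seed text of "h" and a request to generate 20 additional characters.
- The generated text is printed to the console. The first 11 characters reproduce "hello world" exactly. The remaining characters ("rdeellorll") are unreliable, because the model was never trained on what follows the end of the phrase.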

This structure illustrates a simplified approach to text generation with an RNN, from data preparation to model training and text generation.

## 10. Key challenges encountered in the simple example

The simple example provided serves as an introductory exercise to text generation using RNNs, but it has several limitations and challenges, including:

**Small Dataset:**

- The dataset used is extremely small ("hello world"), which does not allow the model to learn a rich representation of language.
- A small dataset may lead to overfitting where the model memorizes the data rather than learning the underlying patterns.

**Simplicity of Model:**

- A `SimpleRNN` layer is used, which is the most basic recurrent layer. It tends to struggle with long-range dependencies, so it may perform poorly on complex text generation tasks.

**Character-Level Tokenization:**

- The model operates on individual characters, which can be limiting compared to understanding whole words or phrases.

**Lack of Generalization:**

- The model has seen only a single 11-character phrase, so it cannot be expected to perform well on different or more complex text data.

**Text Generation Quality:**

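- The quality of the generated text drops quickly outside the memorized phrase: in the run above, everything after "hello world" is a jumble of repeated fragments. Results may degrade further with different seed text or on a larger scale.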

These challenges highlight the need for a more sophisticated approach, larger datasets, and potentially more advanced models to achieve better text generation results.
