# Recurrent Neural Networks (RNNs) and Language Modeling for Text Generation

## Introduction

In this notebook, we will:

- **Introduce** the concept of Recurrent Neural Networks (RNNs) and why they are useful for sequence modeling.
- **Explore** one of the key applications of RNNs: language modeling and text generation.
- **Implement** a simple RNN-based language model using PyTorch.
- **Generate** new text sequences from the trained model.

**Resources for Further Reading:**

- [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy
- [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah

**Prerequisites:**

- Familiarity with Python and basic machine learning concepts.
- A basic understanding of feedforward neural networks.

**Note:** RNNs have largely been supplanted in many NLP tasks by architectures like Transformers (e.g., BERT, GPT), but RNNs are still crucial to understand for foundational knowledge and certain specialized sequence tasks.

## What Are Recurrent Neural Networks?

Feedforward neural networks treat each input independently of the others. That assumption breaks down for sequential data such as text, time series, or anything else where order matters.

Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data. They maintain a **hidden state** that acts as a kind of memory of what has been processed so far. At each timestep, the RNN takes in:

- The current input vector
- The previous hidden state

It then outputs a new hidden state. This recurrence allows RNNs to "remember" previous inputs in a sequence.

Formally, for a sequence of inputs \((x_1, x_2, \ldots, x_T)\), an RNN computes:

$$
h_t = f(W_{hh} h_{t-1} + W_{xh} x_t)
$$

and often we also produce an output \( y_t \):

$$
y_t = W_{hy}h_t
$$

where \(h_t\) is the hidden state at time \(t\); \(W_{hh}\), \(W_{xh}\), and \(W_{hy}\) are learned weight matrices; and \(f\) is a nonlinearity such as \(\tanh\).
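
The two equations above can be sketched directly in PyTorch. This is a minimal illustration rather than the model trained later in the notebook: the dimensions, the random weights, and the helper name `rnn_step` are all assumptions made for the example.

```python
import torch

torch.manual_seed(0)

input_dim, hidden_dim, output_dim = 4, 8, 3  # illustrative sizes (assumed)

# Weight matrices named after the symbols in the equations
W_xh = torch.randn(hidden_dim, input_dim) * 0.1
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
W_hy = torch.randn(output_dim, hidden_dim) * 0.1

def rnn_step(x_t, h_prev):
    """One timestep: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t."""
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

xs = torch.randn(5, input_dim)   # a sequence of T = 5 input vectors
h = torch.zeros(hidden_dim)      # h_0: the initial hidden state
step_outputs = []
for x_t in xs:
    # The same weights are reused at every timestep; only h changes
    h, y = rnn_step(x_t, h)
    step_outputs.append(y)

step_outputs = torch.stack(step_outputs)
print(step_outputs.shape)  # one output vector per timestep
```

The loop makes the recurrence explicit: the hidden state returned by one step is the input to the next, which is how information from earlier inputs reaches later outputs.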

**Problems with Vanilla RNNs:**  
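Vanilla RNNs struggle with long-term dependencies. Backpropagation through time multiplies the gradient by the recurrent Jacobian at every timestep, so over long sequences the gradient tends to either vanish or explode. This led to the development of gated variants such as LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units).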
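
The effect can be seen in a deliberately simplified setting. The sketch below is an illustration under stated assumptions, not a description of real training: there are no inputs, the hidden state starts at zero (where the slope of \(\tanh\) is exactly 1), and the recurrent matrix is a scaled orthogonal matrix so that its effect on the gradient is easy to predict. The function name `grad_norm_through_time` is hypothetical.

```python
import torch

def grad_norm_through_time(scale, steps, hidden_dim=8, seed=0):
    """Norm of d(sum of h_T)/d(h_0) after `steps` applications of h = tanh(W_hh h)."""
    torch.manual_seed(seed)
    # Scaled orthogonal recurrent matrix: every singular value equals `scale`
    q, _ = torch.linalg.qr(torch.randn(hidden_dim, hidden_dim))
    W_hh = scale * q

    h0 = torch.zeros(hidden_dim, requires_grad=True)
    h = h0
    for _ in range(steps):
        h = torch.tanh(W_hh @ h)

    # Backpropagate through all timesteps to the initial state
    h.sum().backward()
    return h0.grad.norm().item()

for scale in (0.5, 1.5):
    for steps in (5, 50):
        norm = grad_norm_through_time(scale, steps)
        print(f"scale={scale}, steps={steps}: gradient norm = {norm:.3e}")
```

With a scale below 1 the gradient shrinks geometrically with the number of steps, and with a scale above 1 it grows geometrically. In a trained network the picture is messier, but the same multiplication by one Jacobian per timestep is what makes long-range credit assignment hard.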

## Language Modeling and Text Generation

A language model assigns probabilities to sequences of tokens, where a token can be a word or a single character. By the chain rule of probability, the probability of a sequence factorizes as:

$$
P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})
$$

An RNN-based language model uses the hidden state to encode the history of tokens seen so far. The model is trained to predict the next token given the previous ones.
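
As a small worked example of the factorization, the sketch below multiplies a few conditional probabilities into a sequence probability and shows the equivalent sum of log-probabilities. The three probability values are made up for illustration.

```python
import math

# Hypothetical values of P(w_t | w_1, ..., w_{t-1}) for a three-token sequence
cond_probs = [0.5, 0.25, 0.1]

# Chain rule: the sequence probability is the product of the conditionals
joint_prob = math.prod(cond_probs)

# In practice we sum log-probabilities, which avoids underflow on long sequences
log_prob = sum(math.log(p) for p in cond_probs)

# Average negative log-likelihood per token, and the matching perplexity
avg_nll = -log_prob / len(cond_probs)
perplexity = math.exp(avg_nll)

print(f"joint probability: {joint_prob}")
print(f"log probability:   {log_prob:.4f}")
print(f"perplexity:        {perplexity:.4f}")
```

The average negative log-likelihood is the quantity a cross-entropy loss minimizes during training, so a lower loss means the model assigns higher probability to the training text.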

Once we train such a model on a corpus of text, we can use it for:

- **Text generation:** Start with a seed text and sample from the model to generate new sentences.
- **Other NLP tasks:** Language modeling is a fundamental building block for many downstream tasks.

In this tutorial, we'll focus on a simple text generation task. We'll:

1. **Load** a text corpus.
2. **Preprocess** it into a suitable form (convert words or characters into integers).
3. **Train** an RNN-based model to predict the next token.
4. **Generate** new text from the trained model.

## Setup

We'll use PyTorch for building and training the RNN model. Make sure you have PyTorch installed.

**Installation:**


```bash
pip install torch numpy requests
```

We'll also use NumPy for array handling and `requests` to download the corpus.


In [40]:
import numpy as np
import requests
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Enable GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')


## Data Preparation

For simplicity, we'll use a public-domain text: the Tiny Shakespeare corpus from Andrej Karpathy's char-rnn repository.


In [41]:
# Download the corpus (if you are offline, assign any string to `text` instead)
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail on an HTTP error instead of training on an error page
text = response.text

print("Length of text:", len(text))
print("Sample text:\n", text[:1000])

Length of text: 1115394
Sample text:
 First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger f


The corpus is about 1.1 million characters long. To keep training time short in this demonstration, we truncate it to the first 50,000 characters.

In [42]:
# Let's shorten the text to make training faster for demonstration
# In a real scenario, you'd keep the full text.
text = text[:50000]

# Let's consider character-level modeling for simplicity.
# We'll map each unique character to an integer.

chars = sorted(set(text))
vocab_size = len(chars)
print("Unique characters:", chars)
print("Vocab size:", vocab_size)

# Create mappings
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

Unique characters: ['\n', ' ', '!', "'", ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Vocab size: 59
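
The two mappings are inverses of each other, which a round trip on a toy string makes concrete. The sketch below builds its own small vocabulary so that it does not overwrite the notebook's `chars`, `char_to_idx`, or `idx_to_char`; the `toy_` names and the helper functions are assumptions for illustration.

```python
toy_sample = "hello world"

# Same construction as above, applied to a tiny string
toy_chars = sorted(set(toy_sample))
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_chars)}
toy_idx_to_char = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map a string to a list of integer indices."""
    return [toy_char_to_idx[ch] for ch in s]

def toy_decode(indices):
    """Map a list of integer indices back to a string."""
    return "".join(toy_idx_to_char[i] for i in indices)

encoded = toy_encode("hello")
print(toy_chars)
print(encoded)
print(toy_decode(encoded))
```

Note that encoding raises a `KeyError` for any character that is not in the vocabulary, so a seed string used for generation must only contain characters that occur in the training text.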


We'll split the text into overlapping windows of `seq_length` characters. Each window is one training input, and the character that immediately follows it is the target the model learns to predict.


In [43]:
seq_length = 100  # length of the input sequence
step_size = 1     # how far to step through the text each time

def create_dataset(text, seq_length, step_size):
    inputs = []
    targets = []
    for i in range(0, len(text)-seq_length, step_size):
        seq_in = text[i:i+seq_length]
        seq_out = text[i+seq_length]
        inputs.append([char_to_idx[ch] for ch in seq_in])
        targets.append(char_to_idx[seq_out])
    return np.array(inputs), np.array(targets)

X, Y = create_dataset(text, seq_length, step_size)
print("Dataset size:", X.shape, Y.shape)

Dataset size: (49900, 100) (49900,)
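
To see what the windows look like, the same sliding-window logic can be run on a five-character string. This is a toy sketch that keeps the characters as text instead of converting them to indices; `toy_windows` is a hypothetical helper, not part of the notebook.

```python
def toy_windows(text, seq_length, step_size=1):
    """Return (input window, next character) pairs, mirroring create_dataset."""
    pairs = []
    for i in range(0, len(text) - seq_length, step_size):
        seq_in = text[i:i + seq_length]    # the input window
        seq_out = text[i + seq_length]     # the character right after it
        pairs.append((seq_in, seq_out))
    return pairs

toy_pairs = toy_windows("hello", seq_length=3)
for seq_in, seq_out in toy_pairs:
    print(repr(seq_in), "->", repr(seq_out))
```

With a step size of 1 the number of samples is `len(text) - seq_length`, which is why 50,000 characters and a window of 100 give the 49,900 samples reported above.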


We'll now create a PyTorch `Dataset` and `DataLoader` to handle batching.


In [44]:
class TextDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets
        
    def __len__(self):
        return len(self.inputs)
    
    def __getitem__(self, idx):
        return torch.tensor(self.inputs[idx], dtype=torch.long), torch.tensor(self.targets[idx], dtype=torch.long)

dataset = TextDataset(X, Y)
batch_size = 64
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
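
A quick way to check what the loader yields is to pull one batch and print its shapes. The sketch below uses random integers in place of the real dataset and a `TensorDataset` in place of `TextDataset`, so the names and sizes are stand-ins.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 1,000 windows of 100 character indices, each with one target
toy_X = torch.randint(0, 59, (1000, 100))
toy_Y = torch.randint(0, 59, (1000,))

toy_loader = DataLoader(TensorDataset(toy_X, toy_Y),
                        batch_size=64, shuffle=True, drop_last=True)

xb, yb = next(iter(toy_loader))
print(xb.shape)         # (batch_size, seq_length)
print(yb.shape)         # (batch_size,)
print(len(toy_loader))  # number of full batches per epoch
```

Because `drop_last=True` discards the final partial batch, the number of batches per epoch is the dataset size divided by the batch size, rounded down. For the real dataset that is 49,900 // 64 = 779, which matches the step count in the training log.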


## Defining the RNN Model

We will build the model around an `nn.LSTM` layer (`nn.RNN` or `nn.GRU` would slot in the same way). The model will:

- Take a sequence of character indices as input.
- Embed them into a vector space.
- Feed the embeddings into the LSTM.
- Project the output to vocabulary size to predict the next character.

We choose `nn.LSTM` over a vanilla `nn.RNN` because it handles long-term dependencies better.

In [45]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, x, hidden=None):
        x = self.embedding(x)
        out, hidden = self.lstm(x, hidden)
        out = self.fc(out)
        return out, hidden
    
    def init_hidden(self, batch_size):
        # Zero-initialised (hidden state, cell state) pair, each of shape
        # (num_layers, batch_size, hidden_dim), created on the model's own device
        param_device = next(self.parameters()).device
        shape = (self.num_layers, batch_size, self.hidden_dim)
        return (torch.zeros(shape, device=param_device),
                torch.zeros(shape, device=param_device))

In [46]:
# Instantiate the model, loss function, and optimizer
model = CharRNN(vocab_size, embed_dim=128, hidden_dim=256, num_layers=2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.002)
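
It can help to trace tensor shapes through the three layers before training. The sketch below rebuilds the same layer stack outside the class with the notebook's sizes, on a made-up batch of 4 sequences of length 10; the `toy_` variable names are assumptions.

```python
import torch
import torch.nn as nn

toy_vocab, toy_embed, toy_hidden, toy_layers = 59, 128, 256, 2

toy_embedding = nn.Embedding(toy_vocab, toy_embed)
toy_lstm = nn.LSTM(toy_embed, toy_hidden, num_layers=toy_layers, batch_first=True)
toy_fc = nn.Linear(toy_hidden, toy_vocab)

tokens = torch.randint(0, toy_vocab, (4, 10))   # (batch, seq_len) of character indices
embedded = toy_embedding(tokens)                # (batch, seq_len, embed_dim)
lstm_out, (h_n, c_n) = toy_lstm(embedded)       # no state passed, so it starts at zeros
logits = toy_fc(lstm_out)                       # (batch, seq_len, vocab_size)

print(embedded.shape)
print(lstm_out.shape)
print(h_n.shape, c_n.shape)                     # (num_layers, batch, hidden_dim) each
print(logits.shape)
```

The model emits one vector of logits per timestep. Since the dataset supplies a single target per window, training only needs the logits at the last timestep, `logits[:, -1, :]`.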


## Training the Model

We'll train the model for 21 epochs, which is enough to illustrate the process. Training language models is time-consuming, and good results usually need more epochs or a larger model.


In [47]:
epochs = 21  # Feel free to increase if you want better results
model.train()

for epoch in range(epochs):
    total_loss = 0
    for i, (inp, target) in enumerate(dataloader):
        inp, target = inp.to(device), target.to(device)
        
        optimizer.zero_grad()
        
        # Each batch holds independent, shuffled windows, so every sequence starts
        # from a fresh zero state (the default when no hidden state is passed)
        # rather than inheriting the state left over from an unrelated batch.
        out, _ = model(inp)
        
        # out has shape (batch_size, seq_length, vocab_size): one prediction per
        # timestep. The dataset provides a single target per window (the character
        # that follows it), so only the last timestep's prediction is scored.
        out = out[:, -1, :]
        loss = criterion(out, target)
        loss.backward()
        
        optimizer.step()
        
        total_loss += loss.item()
        
        if (i+1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(dataloader)}], Loss: {total_loss/(i+1):.4f}")

Epoch [1/21], Step [100/779], Loss: 2.9117
Epoch [1/21], Step [200/779], Loss: 2.6223
Epoch [1/21], Step [300/779], Loss: 2.4781
Epoch [1/21], Step [400/779], Loss: 2.3798
Epoch [1/21], Step [500/779], Loss: 2.3109
Epoch [1/21], Step [600/779], Loss: 2.2480
Epoch [1/21], Step [700/779], Loss: 2.1952
Epoch [2/21], Step [100/779], Loss: 1.7705
Epoch [2/21], Step [200/779], Loss: 1.7581
Epoch [2/21], Step [300/779], Loss: 1.7480
Epoch [2/21], Step [400/779], Loss: 1.7420
Epoch [2/21], Step [500/779], Loss: 1.7346
Epoch [2/21], Step [600/779], Loss: 1.7235
Epoch [2/21], Step [700/779], Loss: 1.7191
Epoch [3/21], Step [100/779], Loss: 1.5600
Epoch [3/21], Step [200/779], Loss: 1.5582
Epoch [3/21], Step [300/779], Loss: 1.5568
Epoch [3/21], Step [400/779], Loss: 1.5612
Epoch [3/21], Step [500/779], Loss: 1.5655
Epoch [3/21], Step [600/779], Loss: 1.5727
Epoch [3/21], Step [700/779], Loss: 1.5717
Epoch [4/21], Step [100/779], Loss: 1.4111
Epoch [4/21], Step [200/779], Loss: 1.4347
Epoch [4/21
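
One way to put these loss values in context is to compare them with the loss of a model that knows nothing. The sketch below is an illustration with made-up targets: it feeds all-zero logits (a uniform distribution over 59 characters) to the same loss function, and then converts one of the logged losses to a perplexity.

```python
import math
import torch
import torch.nn as nn

toy_vocab_size = 59  # vocabulary size reported earlier in the notebook

# All-zero logits correspond to a uniform distribution over the vocabulary
uniform_logits = torch.zeros(8, toy_vocab_size)
toy_targets = torch.randint(0, toy_vocab_size, (8,))

baseline_loss = nn.CrossEntropyLoss()(uniform_logits, toy_targets).item()
print(f"uniform-guess loss: {baseline_loss:.4f}")
print(f"ln(vocab_size):     {math.log(toy_vocab_size):.4f}")

# Perplexity is exp(loss): the effective number of equally likely next characters
logged_loss = 1.4111  # taken from the epoch 4 log line above
print(f"perplexity at loss {logged_loss}: {math.exp(logged_loss):.2f}")
```

A uniform guess costs `ln(59)`, a little over 4, so the first logged value of 2.9 already reflects learning, and the later values correspond to the model narrowing its guess to a handful of plausible characters.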

## Generating Text

Now that the model is trained, we can use it to generate text. The process is:

1. **Start** with a seed string (prompt).
2. **Feed** it into the model and sample from the output distribution to select the next character.
3. **Append** the sampled character to the seed string and use the last `seq_length` characters as input for the next step.
4. **Repeat** for as many characters as you want to generate.

**Note:** With such a short training time and a small model, the generated text will likely not be very coherent. But it should reflect some patterns from Shakespeare's text.
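
Sampling is controlled by a temperature that divides the logits before the softmax. The sketch below shows the effect on three made-up logits; `temperature_probs` is a hypothetical helper used only for this illustration.

```python
import torch

def temperature_probs(logits, temperature):
    """Softmax over logits that have been divided by the temperature."""
    return torch.softmax(logits / temperature, dim=0)

toy_logits = torch.tensor([2.0, 1.0, 0.1])  # made-up scores for three characters

for temperature in (0.4, 1.0, 2.0):
    probs = temperature_probs(toy_logits, temperature)
    rounded = [round(p, 3) for p in probs.tolist()]
    print(f"temperature={temperature}: {rounded}")
```

A temperature below 1 sharpens the distribution, so the most likely character is picked more often and the text becomes more conservative. A temperature above 1 flattens it, which adds variety at the cost of more spelling errors.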

In [48]:
def generate_text(model, start_str='ROMEO:', length=500, temperature=1.0):
    model.eval()
    chars = list(start_str)
    
    # Encode the seed as a (1, len(start_str)) tensor of character indices
    input_seq = torch.tensor([char_to_idx[ch] for ch in chars], dtype=torch.long).unsqueeze(0).to(device)
    
    for _ in range(length):
        # Use at most the last seq_length characters as context; a shorter seed is used as-is
        inp = input_seq[:, -seq_length:]
        
        # Re-encode the whole window from a fresh state on every step. Passing the
        # previous hidden state as well would feed the same characters to the LSTM twice.
        with torch.no_grad():
            out, _ = model(inp)
        
        # Scale the last timestep's logits by the temperature, then sample
        out = out[:, -1, :] / temperature
        probs = torch.softmax(out, dim=1).squeeze()
        idx = torch.multinomial(probs, 1).item()
        
        chars.append(idx_to_char[idx])
        # Append the new character index to input_seq
        input_seq = torch.cat([input_seq, torch.tensor([[idx]], device=device)], dim=1)
    
    return ''.join(chars)

In [49]:
generated_text = generate_text(model, start_str="ROMEO:", length=500, temperature=0.8)
print(generated_text)

ROMEO:
We have pluck Rome.

BRUTUS:
Come, come, that were sent to see me honours,
That they passing tree, with a man were an-hungry in his former
To the icher sound more: whereof, if they were afoot.

Second Citizen:
Care for us thou get him and tent thouge heaven, I make a grave.

COMINIUS:
If I should tell you.

BRUTUS:
In am our Cominius.

BRUTUS:
So it must speaks! restands!

SICINIUS:
Besisat lamb the cormorn.

BRUTUS:
I am constant of these shrease, and they true,
engn entre their like upon him.


In [52]:
generated_text = generate_text(model, start_str="Caius Marcius", length=100, temperature=0.4)
print(generated_text)

Caius Marcius:
Leave your commissians to visit the only sons,
We prove that they would have their love or no.

ME


## Analysis of Results

You will likely see somewhat "Shakespeare-like" text, with speaker labels, line breaks, and mostly English words. Because the model is small and was trained on only 50,000 characters, don't expect the text to be coherent.

With longer training, larger hidden sizes, and more training data, the generated text becomes more coherent. The technique demonstrated here is a foundational approach. Larger LSTM or GRU models, or Transformer models, trained for longer on more data can produce very fluent text.

## Further Steps

- **Increase** the number of epochs.
- **Increase** model complexity (hidden_dim, num_layers).
- **Try** a word-level model instead of a character-level model.
- **Experiment** with different temperatures during generation to control randomness.
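
For the word-level variant, only the tokenization step changes; the rest of the pipeline can stay as it is. The sketch below is one possible tokenizer, assuming a simple regular expression that splits words from punctuation. It is a starting point under that assumption, not a recommended scheme, and the names are hypothetical.

```python
import re

toy_text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Words become one token each; every punctuation mark becomes its own token.
# Whitespace (including line breaks) is dropped by this simple pattern.
word_tokens = re.findall(r"\w+|[^\w\s]", toy_text)

word_vocab = sorted(set(word_tokens))
word_to_idx = {w: i for i, w in enumerate(word_vocab)}
idx_to_word = {i: w for i, w in enumerate(word_vocab)}

word_ids = [word_to_idx[w] for w in word_tokens]
print(word_tokens)
print(word_ids)
```

A word-level vocabulary is far larger than the 59 characters used here, so the embedding and output layers grow accordingly, and words that are absent from the training text need separate handling.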

**Remember:** RNN-based text generation was a breakthrough approach when it was introduced, but it has been largely surpassed by Transformer-based models (like GPT). Nevertheless, understanding RNNs and LSTMs remains a key foundation for sequence modeling.
