# Simple RNN for Character-Level Text Generation

**Inspired by A. Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks"**

**Goal:** To build and understand a *minimal* character-level Recurrent Neural Network (RNN) using PyTorch, similar to the basic examples discussed by Andrej Karpathy. We aim to predict the next character in a sequence and generate some text.

**Focus:** Understanding the core RNN mechanism (hidden state as memory), character embeddings, and the generation process via sampling.

**Disclaimer:** We are intentionally using a *simple `nn.RNN`* (not LSTM/GRU), a very small dataset, and a very short training run (about 10 seconds on CPU in the run shown below) for illustrative purposes. As Karpathy notes, simple RNNs struggle with long-term dependencies. **Do not expect high-quality generated text.** Expect repetition and nonsensical sequences. The goal is to see the *potential* and understand the mechanics.

## 1. Setup: Imports and Configuration

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import numpy as np
import time
import math
import random

# Configuration
SEQ_LENGTH = 40      # How many steps to unroll the RNN for backpropagation
EMBEDDING_DIM = 64   # Dimension of character embeddings
HIDDEN_DIM = 256     # Size of the RNN's hidden state 'memory'
NUM_LAYERS = 2       # Number of stacked RNN layers (1 or 2 is typical for simple examples)
BATCH_SIZE = 64      # Number of sequences per batch
LEARNING_RATE = 0.003 # Learning rate
EPOCHS = 15          # Number of training epochs (keep low for speed)
GENERATION_LENGTH = 200 # How many characters to generate
GRAD_CLIP = 1.0      # Gradient clipping value to prevent exploding gradients

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# For reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"Using simple RNN with {NUM_LAYERS} layers, HIDDEN_DIM={HIDDEN_DIM}")

Using device: cpu
Using simple RNN with 2 layers, HIDDEN_DIM=256


## 2. Data Preparation

We need to convert our text into numbers that the RNN can process.

1.  **Corpus:** A small piece of text.
2.  **Vocabulary:** The set of unique characters in the text.
3.  **Mappings:** Dictionaries to convert characters to integers (`char_to_int`) and back (`int_to_char`).
4.  **Sequences:** Create input sequences of length `SEQ_LENGTH` and corresponding target characters (the character immediately following each sequence).
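
As a small illustration of steps 2 to 4, the sketch below applies the same sliding-window idea to a toy string, with a window of 3 standing in for `SEQ_LENGTH`. The names `toy_text`, `toy_len`, `toy_c2i`, and `pairs` are made up for this example and are not used elsewhere in the notebook.

```python
# Toy illustration of the vocabulary, mappings, and sliding-window pairs.
toy_text = "hello"
toy_len = 3  # stands in for SEQ_LENGTH

toy_chars = sorted(set(toy_text))                     # ['e', 'h', 'l', 'o']
toy_c2i = {ch: i for i, ch in enumerate(toy_chars)}   # char -> int mapping

pairs = []
for i in range(len(toy_text) - toy_len):
    window = toy_text[i:i + toy_len]   # input sequence
    target = toy_text[i + toy_len]     # character right after the window
    pairs.append(([toy_c2i[ch] for ch in window], toy_c2i[target]))

for window_ids, target_id in pairs:
    print(window_ids, "->", target_id)
# [1, 0, 2] -> 2   ("hel" -> "l")
# [0, 2, 2] -> 3   ("ell" -> "o")
```

A string of length `n` yields `n - toy_len` pairs, which is why the 798-character corpus with `SEQ_LENGTH = 40` produces 758 sequences.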

In [2]:
# 1. Define Text Corpus (Small snippet for speed)
text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
The goal is a computer capable of understanding the contents of documents, including the contextual nuances of the language within them.
The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
Recurrent neural networks (RNNs) were once commonly used for such tasks.
""".lower() # Use lowercase

# 2. Create Vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)

# 3. Create Mappings
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"Corpus length: {len(text)} characters")
print(f"Vocabulary size: {vocab_size} unique characters")
print(f"Vocabulary: {''.join(chars)}")

# 4. Generate Sequences and Targets
input_seqs_int = []
target_chars_int = []
for i in range(len(text) - SEQ_LENGTH):
    seq_in = text[i:i + SEQ_LENGTH]
    seq_out = text[i + SEQ_LENGTH]
    input_seqs_int.append([char_to_int[ch] for ch in seq_in])
    target_chars_int.append(char_to_int[seq_out])

num_sequences = len(input_seqs_int)
print(f"\nNumber of sequences created: {num_sequences}")

# Display a sample
sample_idx = 50
print(f"Sample Input : '{''.join([int_to_char[i] for i in input_seqs_int[sample_idx]])}'")
print(f"Sample Target: '{int_to_char[target_chars_int[sample_idx]]}'")

# 5. Create PyTorch Dataset and DataLoader
X = torch.tensor(input_seqs_int, dtype=torch.long)
y = torch.tensor(target_chars_int, dtype=torch.long)

class CharDataset(Dataset):
    def __init__(self, sequences, targets):
        self.sequences = sequences
        self.targets = targets
    def __len__(self):
        return len(self.sequences)
    def __getitem__(self, idx):
        return self.sequences[idx], self.targets[idx]

dataset = CharDataset(X, y)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
print(f"\nCreated DataLoader with {len(dataloader)} batches.")

Corpus length: 798 characters
Vocabulary size: 31 unique characters
Vocabulary: 
 (),.abcdefghiklmnopqrstuvwxyz

Number of sequences created: 758
Sample Input : 'f linguistics, computer science, and art'
Sample Target: 'i'

Created DataLoader with 11 batches.


## 3. Define the Simple RNN Model

This follows the standard structure:
1.  `nn.Embedding`: Learns a vector for each character.
2.  `nn.RNN`: The core recurrent layer. Takes the current character's embedding and the previous hidden state, outputs a new hidden state and an output vector.
3.  `nn.Linear`: Maps the RNN's output for the last character in the sequence to scores for *every* character in the vocabulary (predicting the next one).
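
To make step 2 concrete: with its default `tanh` nonlinearity, a single `nn.RNN` layer computes `h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)` at every time step. The sketch below re-computes that recurrence by hand for a tiny, arbitrarily sized layer and compares it with PyTorch's result. The names `tiny_rnn` and `manual_h` and all sizes are chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny single-layer RNN; sizes are arbitrary and for illustration only.
tiny_rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=1, batch_first=True)

x = torch.randn(1, 5, 4)    # (batch=1, seq_length=5, input_size=4)
h0 = torch.zeros(1, 1, 3)   # (num_layers=1, batch=1, hidden_size=3)

with torch.no_grad():
    out, h_n = tiny_rnn(x, h0)

    # Re-compute the recurrence by hand, one time step at a time:
    # h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
    W_ih, W_hh = tiny_rnn.weight_ih_l0, tiny_rnn.weight_hh_l0
    b_ih, b_hh = tiny_rnn.bias_ih_l0, tiny_rnn.bias_hh_l0
    manual_h = h0[0]  # (batch, hidden_size)
    for t in range(x.size(1)):
        manual_h = torch.tanh(x[:, t] @ W_ih.T + b_ih + manual_h @ W_hh.T + b_hh)

# The hand-rolled final state should agree with PyTorch's final hidden state,
# and for a single layer the last output step is that same hidden state.
matches = torch.allclose(manual_h, h_n[0], atol=1e-5)
last_out_is_hidden = torch.allclose(out[:, -1], h_n[0])
print(matches, last_out_is_hidden)
```

The hidden state is the only thing carried from one step to the next, which is why it is often described as the network's memory.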

In [3]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Use nn.RNN here!
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h):
        # x shape: (batch_size, seq_length)
        # h shape: (num_layers, batch_size, hidden_dim)
        embedded = self.embedding(x) # -> (batch_size, seq_length, embedding_dim)
        # Pass embedded sequence and hidden state through RNN
        out, h_out = self.rnn(embedded, h)
        # out shape: (batch_size, seq_length, hidden_dim) - Output from last RNN layer for each time step
        # h_out shape: (num_layers, batch_size, hidden_dim) - Final hidden state for all layers

        # We take the RNN output from the *very last* time step
        last_time_step_out = out[:, -1, :] # -> (batch_size, hidden_dim)

        # Pass this through the fully connected layer to get scores for next char
        scores = self.fc(last_time_step_out) # -> (batch_size, vocab_size)
        return scores, h_out

    def init_hidden(self, batch_size):
        # Initial hidden state (usually zeros), created on the same device as the model's weights
        return torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=self.fc.weight.device)

# Instantiate the model
model = CharRNN(vocab_size, EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS).to(device)
print("Model Definition (Simple RNN):")
print(model)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal trainable parameters: {total_params:,}")

Model Definition (Simple RNN):
CharRNN(
  (embedding): Embedding(31, 64)
  (rnn): RNN(64, 256, num_layers=2, batch_first=True)
  (fc): Linear(in_features=256, out_features=31, bias=True)
)

Total trainable parameters: 223,967
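
The parameter total reported above can be checked by hand. The sketch below adds up the weight and bias sizes of each layer, using the sizes printed in the model definition; the variable names are illustrative only.

```python
# Parameter count for the CharRNN above, computed by hand.
vocab_size, embedding_dim, hidden_dim = 31, 64, 256

embedding = vocab_size * embedding_dim  # 1,984

# Each RNN layer has W_ih, W_hh, and two bias vectors (b_ih, b_hh).
rnn_layer_1 = hidden_dim * embedding_dim + hidden_dim * hidden_dim + 2 * hidden_dim  # 82,432
rnn_layer_2 = hidden_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim     # 131,584

fc = hidden_dim * vocab_size + vocab_size  # 7,967 (weights + bias)

total = embedding + rnn_layer_1 + rnn_layer_2 + fc
print(f"{total:,}")  # 223,967
```

Most of the parameters sit in the two recurrent layers; the embedding and output layers are small because the vocabulary has only 31 characters.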


## 4. Training the RNN

The training loop involves iterating through the data, feeding sequences to the model, calculating the loss (how wrong the predictions are), and updating the model's weights.

**Key RNN aspects:**
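*   **Hidden State:** The hidden state `h` is initialized to zeros at the start of each epoch and passed through the RNN along with the input. The RNN outputs an updated hidden state, which is then used for the *next* batch (after detaching). Note that because the `DataLoader` shuffles the sequences, the carried-over state does not correspond to the text that actually precedes the next batch. Here it mainly illustrates the carry-and-detach pattern; re-initializing `h` to zeros for every batch would be an equally reasonable choice.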
*   **Detaching:** `h = h.detach()` prevents gradients from flowing back endlessly through the entire training history, which is computationally infeasible and usually not helpful. We only backpropagate through the current sequence (`SEQ_LENGTH`).
*   **Gradient Clipping:** `clip_grad_norm_` helps prevent the exploding gradient problem, where gradients become excessively large and destabilize training.
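
Detaching and clipping can both be seen in isolation. The sketch below uses a throwaway linear layer (the hypothetical `toy`) with a deliberately inflated loss so that the gradients are large, then shows that `clip_grad_norm_` returns the norm measured before clipping and rescales the gradients to at most `max_norm`. It also shows that `detach()` keeps a tensor's values while cutting its link to the computation graph.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy = nn.Linear(8, 1)  # throwaway model, for illustration only

# Scale the output up so the gradients are much larger than 1.0
loss = (toy(torch.randn(16, 8)) * 100.0).pow(2).mean()
loss.backward()

# clip_grad_norm_ returns the total gradient norm measured *before* clipping
norm_before = torch.nn.utils.clip_grad_norm_(toy.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in toy.parameters()))
print(f"before: {norm_before.item():.2f}, after: {norm_after.item():.4f}")

# detach() returns a tensor with the same values but no gradient history
h = toy.weight * 2.0
h_detached = h.detach()
print(h.requires_grad, h_detached.requires_grad)  # True False
```

Clipping by norm rescales all gradients by the same factor, so the update direction is preserved and only its size is limited.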

In [4]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

print("--- Starting Training (Simple RNN) ---")
start_train_time = time.time()
model.train()

for epoch in range(EPOCHS):
    epoch_start_time = time.time()
    epoch_loss = 0
    # Initialize hidden state for the start of the epoch (will be detached per batch)
    h = model.init_hidden(BATCH_SIZE)

    for i, (inputs, targets) in enumerate(dataloader):
        inputs, targets = inputs.to(device), targets.to(device)

        # Detach hidden state from previous batch history
        h = h.detach()

        optimizer.zero_grad()
        outputs, h = model(inputs, h)
        loss = criterion(outputs, targets)
        loss.backward()
        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        optimizer.step()

        epoch_loss += loss.item()

    avg_epoch_loss = epoch_loss / len(dataloader)
    epoch_end_time = time.time()
    print(f"Epoch {epoch+1}/{EPOCHS} | Avg Loss: {avg_epoch_loss:.4f} | Time: {epoch_end_time - epoch_start_time:.2f}s")

end_train_time = time.time()
print(f"--- Training Finished in {end_train_time - start_train_time:.2f} seconds ---")

--- Starting Training (Simple RNN) ---
Epoch 1/15 | Avg Loss: 2.9817 | Time: 0.95s
Epoch 2/15 | Avg Loss: 2.3214 | Time: 0.64s
Epoch 3/15 | Avg Loss: 2.0089 | Time: 0.66s
Epoch 4/15 | Avg Loss: 1.7189 | Time: 0.64s
Epoch 5/15 | Avg Loss: 1.4153 | Time: 0.64s
Epoch 6/15 | Avg Loss: 1.1765 | Time: 0.62s
Epoch 7/15 | Avg Loss: 0.9544 | Time: 0.63s
Epoch 8/15 | Avg Loss: 0.7394 | Time: 0.63s
Epoch 9/15 | Avg Loss: 0.5727 | Time: 0.63s
Epoch 10/15 | Avg Loss: 0.3940 | Time: 0.65s
Epoch 11/15 | Avg Loss: 0.3020 | Time: 0.64s
Epoch 12/15 | Avg Loss: 0.2283 | Time: 0.64s
Epoch 13/15 | Avg Loss: 0.1880 | Time: 0.65s
Epoch 14/15 | Avg Loss: 0.1434 | Time: 0.64s
Epoch 15/15 | Avg Loss: 0.1143 | Time: 0.64s
--- Training Finished in 9.90 seconds ---
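
As a reading aid for the loss values above: `nn.CrossEntropyLoss` reports the average negative log-likelihood in nats, so a model that guesses uniformly over the 31 characters would score `ln(31)`, and `exp(loss)` gives the perplexity (roughly, the effective number of equally likely choices per character). The variable names in this sketch are illustrative.

```python
import math

vocab_size = 31
uniform_loss = math.log(vocab_size)  # loss of a uniform random guess
final_loss = 0.1143                  # average loss reported for epoch 15 above
final_perplexity = math.exp(final_loss)

print(f"uniform-guess loss: {uniform_loss:.3f}")     # 3.434
print(f"final perplexity:   {final_perplexity:.2f}") # 1.12
```

A perplexity this close to 1 is measured on the training data only and is consistent with the model memorizing a 798-character corpus. No held-out text was evaluated in this notebook, so the number says nothing about generalization.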


## 5. Generating Text (Sampling)

This is where the "magic" happens (or doesn't, given our simple model!). We feed the model a starting character (or sequence) and ask it to predict the next character. We then take that prediction, feed it back in, and repeat the process.

*   **Priming:** We first feed the `start_phrase` to the model to get the hidden state into a reasonable context.
*   **Sampling:** Instead of always picking the *most likely* next character (which leads to boring, repetitive text), we *sample* from the probability distribution output by the model. The `temperature` parameter controls the randomness: lower temperature makes it more conservative, higher temperature makes it more adventurous (and often nonsensical).
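
The effect of `temperature` is easiest to see on a small example. The sketch below divides a made-up score vector by several temperatures before applying softmax; the scores and the names `logits` and `probs_by_temp` are hypothetical and not produced by the model.

```python
import torch

torch.manual_seed(0)

# Hypothetical scores for a 4-character vocabulary, for illustration only.
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

probs_by_temp = {}
for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=0)
    probs_by_temp[temperature] = probs
    print(temperature, [round(p, 3) for p in probs.tolist()])

# Greedy decoding always returns the highest-scoring index (0 here),
# while sampling can return any index with non-zero probability.
greedy_idx = torch.argmax(logits).item()
sampled_idx = torch.multinomial(probs_by_temp[1.0], 1).item()
print(greedy_idx, sampled_idx)
```

Lower temperatures concentrate probability on the top-scoring character, while higher temperatures flatten the distribution toward uniform.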

In [5]:
def generate_text(model, start_phrase, length, temperature=0.8):
    """Generates text using the trained simple RNN model."""
    model.eval()
    start_phrase = start_phrase.lower()
    generated_text = start_phrase

    # Initialize hidden state for generation (batch size 1)
    h = model.init_hidden(1)

    # Prime the model with the start_phrase
    print("Priming...")
    for char in start_phrase[:-1]:
        try:
            char_idx = char_to_int[char]
        except KeyError:
            continue # Skip chars not in vocab
        input_tensor = torch.tensor([[char_idx]], dtype=torch.long).to(device)
        with torch.no_grad():
            _, h = model(input_tensor, h)

    # Set the first input for generation to the last char of the phrase
    if not start_phrase or start_phrase[-1] not in char_to_int:
        print("Error: the start phrase must end with a character from the vocabulary.")
        return start_phrase
    last_char_idx = char_to_int[start_phrase[-1]]
    current_input = torch.tensor([[last_char_idx]], dtype=torch.long).to(device)

    print("Generating...")
    # Generation loop
    for _ in range(length):
        with torch.no_grad():
            output, h = model(current_input, h)

            # Apply temperature and convert scores to probabilities
            probs = torch.softmax(output.view(-1) / temperature, dim=0)
            # Sample next character index from that distribution
            next_char_idx = torch.multinomial(probs, 1).item()

            # Convert index to character
            predicted_char = int_to_char[next_char_idx]
            generated_text += predicted_char

            # Update input for next step
            current_input = torch.tensor([[next_char_idx]], dtype=torch.long).to(device)

    return generated_text

# --- Example Generation ---
start_phrase = "natural language"
print(f"\n--- Generating text starting with: '{start_phrase}' --- (Temp=0.8)")
generated_output = generate_text(model, start_phrase, length=GENERATION_LENGTH, temperature=0.8)
print("\nGenerated Text:")
print(generated_output)

start_phrase_2 = "the computer"
print(f"\n--- Generating text starting with: '{start_phrase_2}' --- (Temp=0.6)")
generated_output_2 = generate_text(model, start_phrase_2, length=GENERATION_LENGTH, temperature=0.6)
print("\nGenerated Text:")
print(generated_output_2)


--- Generating text starting with: 'natural language' --- (Temp=0.8)
Priming...
Generating...

Generated Text:
natural language onderstanding the contextual nuances of the language data. 
the tocumockgence concerned with the interactions between computer coprect information and insights tonta language understanding the conten

--- Generating text starting with: 'the computer' --- (Temp=0.6)
Priming...
Generating...

Generated Text:
the computers to process and analyze large amounts of natural language within them. 
the technology can then ance commonly used for such tasks.

the goal istacd the language within them. bthe technology can then 


## 6. Conclusion & Karpathy Context

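We successfully built and trained a simple character-level RNN. In the samples above, the generated text forms word-like units and uses spaces correctly, but much of it consists of fragments copied from the training corpus, interrupted by garbled words such as "onderstanding" and "tocumockgence" and by abrupt jumps between sentences. With a corpus of only 798 characters and a final training loss near 0.11, the model has largely memorized its training text rather than learned general language structure.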

This aligns with Karpathy's observations:
*   **Effectiveness:** Even this simple model *learns something* about language structure from raw text, which is remarkable.
*   **Limitations:** Simple `nn.RNN` units suffer from vanishing gradients and cannot easily capture long-range dependencies. This is why the generated text struggles with coherence over longer stretches.
*   **Next Steps (as per Karpathy):** To get truly "unreasonably effective" results, one would typically use gated units such as LSTMs or GRUs, together with larger datasets and more extensive training. Gated units are specifically designed to mitigate the vanishing gradient problem and to manage the hidden state (memory) more effectively.
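
As a rough sketch of that next step, the model above could be adapted to use `nn.LSTM`. The class `CharLSTM` below is hypothetical and not part of the original notebook. The main difference is that an LSTM carries a tuple of two tensors, a hidden state and a cell state, so both must be initialized and detached.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Hypothetical LSTM variant of CharRNN, for illustration only."""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state):
        # state is a tuple (h, c): hidden state and cell state
        out, state = self.lstm(self.embedding(x), state)
        return self.fc(out[:, -1, :]), state

    def init_hidden(self, batch_size):
        shape = (self.num_layers, batch_size, self.hidden_dim)
        device = self.fc.weight.device
        return torch.zeros(shape, device=device), torch.zeros(shape, device=device)

# Smoke test with the same sizes as the notebook's configuration
lstm_model = CharLSTM(vocab_size=31, embedding_dim=64, hidden_dim=256, num_layers=2)
state = lstm_model.init_hidden(batch_size=4)
scores, state = lstm_model(torch.randint(0, 31, (4, 40)), state)
state = tuple(s.detach() for s in state)  # detach both tensors between batches
print(scores.shape)  # torch.Size([4, 31])
```

In the training loop, the single `h = h.detach()` line would become the tuple version shown above. Whether this improves the samples on such a small corpus is untested here.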