# Summer of Code - Artificial Intelligence

## Week 10: Deep Learning

### Day 03: Text Generation

In this notebook, we will explore text generation, a **Sequence to Sequence** task, using an RNN built from GRU layers.


# Shakespeare's Works

In [1]:
from urllib.request import urlretrieve

shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
print("Downloading Shakespeare's works...")

urlretrieve(shakespeare_url, "shakespeare.txt")
print("Download complete!")

Downloading Shakespeare's works...
Download complete!


In [2]:
# Load the text file
with open("shakespeare.txt", "r", encoding="utf-8") as f:
    shakespeare_text = f.read()

In [3]:
len(shakespeare_text)

1115394

## Vocabulary Creation

For a character-level language model like this, our vocabulary consists of all unique characters present in the text. This includes letters, punctuation marks, spaces, and special characters. We lowercase the text before collecting characters, which keeps the vocabulary small.

In [4]:
text = "Hello world!"
text_tokens = sorted(set(text.lower()))
text_tokens

[' ', '!', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

In [5]:
text_to_id = {ch: i for i, ch in enumerate(text_tokens)}
text_to_id['l']

5

In [6]:
id_to_text = {i: ch for ch, i in text_to_id.items()}
id_to_text[3]

'e'

In [7]:
chars = sorted(list(set(shakespeare_text.lower())))
n_tokens = len(chars)

print(f"Number of unique characters: {n_tokens}")

Number of unique characters: 39


In [8]:
# Create character-to-index and index-to-character mappings
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for char, idx in char_to_idx.items()}

## Encoding Sequences
We will convert the entire text into a sequence of character IDs.

In [9]:
encoded = [char_to_idx[char] for char in shakespeare_text.lower()]
print(shakespeare_text.lower()[:10])
encoded[:10]

first citi


[18, 21, 30, 31, 32, 1, 15, 21, 32, 21]
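
Decoding is the reverse lookup. The sketch below uses a toy string and its own small mappings (`toy_text`, `toy_char_to_idx`, and `toy_idx_to_char` are illustrative names, not part of the notebook) to check that encoding followed by decoding returns the original text:

```python
# Toy round trip: characters -> IDs -> characters.
toy_text = "hello world!"
toy_chars = sorted(set(toy_text))

# Same mapping construction as above, applied to the toy vocabulary.
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_chars)}
toy_idx_to_char = {i: ch for ch, i in toy_char_to_idx.items()}

toy_encoded = [toy_char_to_idx[ch] for ch in toy_text]
toy_decoded = "".join(toy_idx_to_char[i] for i in toy_encoded)

print(toy_encoded)
print(toy_decoded)
```

The same `"".join(...)` pattern with `idx_to_char` should turn any list of IDs from the Shakespeare encoding back into text.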

# Text Generation Dataset

To train our model, we need to create input-target pairs.
- The input is a sequence of characters.
- The target is the same sequence shifted one position to the right, so each target character is the one that follows the corresponding input character.

**For example:**\
Text: "Shakespeare"\
Sequence Length (Context Window): 5

Input: "Shake"\
Target: "hakes"
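
The shift can be checked with plain string slicing. This is a minimal sketch; `demo_word` and `demo_window_length` are illustrative names chosen here:

```python
# Illustrative check of the input/target shift on a short string.
demo_word = "Shakespeare"
demo_window_length = 5

# Take one extra character so the last input character also has a target.
demo_window = demo_word[: demo_window_length + 1]  # "Shakes"

demo_input = demo_window[:-1]   # every character except the last
demo_target = demo_window[1:]   # every character except the first

print(demo_input, "->", demo_target)  # Shake -> hakes
```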


In [10]:
import torch
from torch.utils.data import Dataset


class CharDataset(Dataset):
    def __init__(self, encoded_text, window_length):
        self.encoded_text = encoded_text
        self.window_length = window_length

        # Create all possible windows
        self.windows = []
        for i in range(len(encoded_text) - window_length):
            window = encoded_text[i : i + window_length + 1]  # +1 for target
            self.windows.append(window)

    def __len__(self):
        return len(self.windows)

    def __getitem__(self, idx):
        window = self.windows[idx]
        # Input: all characters except the last
        # Target: all characters except the first (shifted by 1)
        input_seq = window[:-1]
        target_seq = window[1:]

        return input_seq, target_seq

In [11]:
encoded[:10]

[18, 21, 30, 31, 32, 1, 15, 21, 32, 21]

In [12]:
dummy_dataset = CharDataset(encoded, window_length=3)
dummy_dataset.windows[:5]

[[18, 21, 30, 31],
 [21, 30, 31, 32],
 [30, 31, 32, 1],
 [31, 32, 1, 15],
 [32, 1, 15, 21]]

In [13]:
dummy_dataset[0]

([18, 21, 30], [21, 30, 31])

## Data Splitting

In [14]:
# Set window length
window_length = 200
train_size = 1_000_000
valid_size = 60_000

encoded_tensor = torch.tensor(encoded, dtype=torch.long)
train_encoded = encoded_tensor[:train_size]
valid_encoded = encoded_tensor[train_size : train_size + valid_size]
test_encoded = encoded_tensor[train_size + valid_size :]

print(f"Training set size: {len(train_encoded):,} characters")
print(f"Validation set size: {len(valid_encoded):,} characters")
print(f"Test set size: {len(test_encoded):,} characters")

Training set size: 1,000,000 characters
Validation set size: 60,000 characters
Test set size: 55,394 characters


In [15]:
# Create datasets
train_dataset = CharDataset(train_encoded, window_length)
valid_dataset = CharDataset(valid_encoded, window_length)
test_dataset = CharDataset(test_encoded, window_length)

print(f"\nTraining windows: {len(train_dataset):,}")
print(f"Validation windows: {len(valid_dataset):,}")
print(f"Test windows: {len(test_dataset):,}")


Training windows: 999,800
Validation windows: 59,800
Test windows: 55,194


## DataLoaders

In [23]:
from torch.utils.data import DataLoader


batch_size = 128

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=0,  # Set to 0 for Windows compatibility
    pin_memory=torch.cuda.is_available(),
)

valid_loader = DataLoader(
    valid_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    pin_memory=torch.cuda.is_available(),
)

test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    pin_memory=torch.cuda.is_available(),
)
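
As a sanity check on the loaders, the number of batches per epoch should equal the number of windows divided by the batch size, rounded up, because `drop_last` is left at its default of `False`. A small sketch using the window counts printed above; `expected_batches` is a hypothetical helper, not part of PyTorch:

```python
import math


def expected_batches(num_windows: int, batch_size: int) -> int:
    """Batches per epoch when the final partial batch is kept."""
    return math.ceil(num_windows / batch_size)


print(expected_batches(999_800, 128))  # 7811
print(expected_batches(59_800, 128))   # 468
```

These values agree with the 7811 training and 468 validation iterations reported by the progress bars in the training log.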

# Char-RNN Model

In [24]:
from torch import nn


class CharRNN(nn.Module):

    def __init__(
        self, vocab_size, embed_dim=256, hidden_dim=512, num_layers=3, dropout=0.3
    ):
        super(CharRNN, self).__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embed_dropout = nn.Dropout(0.2)
        self.gru = nn.GRU(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        
        self._init_weights()

    def _init_weights(self):
        """Initialize the embedding uniformly in [-0.1, 0.1] and the output layer with Xavier-uniform weights and zero bias."""
        nn.init.uniform_(self.embedding.weight, -0.1, 0.1)
        nn.init.xavier_uniform_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):  # x shape: [batch_size, sequence_length]
        embedded = self.embedding(x)
        embedded = self.embed_dropout(embedded)
        gru_out, hidden = self.gru(embedded)
        gru_out = self.dropout(gru_out)
        output = self.fc(gru_out)
        return output, hidden


model = CharRNN(
    vocab_size=n_tokens,
    embed_dim=128,
    hidden_dim=256,
    num_layers=2,
    dropout=0.3,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [25]:
from torchinfo import summary

input_sample = torch.zeros((1, window_length), dtype=torch.long).to(device)
summary(model, input_data=input_sample)

Layer (type:depth-idx)                   Output Shape              Param #
CharRNN                                  [1, 200, 39]              --
├─Embedding: 1-1                         [1, 200, 128]             4,992
├─Dropout: 1-2                           [1, 200, 128]             --
├─GRU: 1-3                               [1, 200, 256]             691,200
├─Dropout: 1-4                           [1, 200, 256]             --
├─Linear: 1-5                            [1, 200, 39]              10,023
Total params: 706,215
Trainable params: 706,215
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 138.26
Input size (MB): 0.00
Forward/backward pass size (MB): 0.68
Params size (MB): 2.82
Estimated Total Size (MB): 3.50
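
The parameter counts in the summary can be reproduced by hand. The sketch below assumes PyTorch's GRU layout of three gates per layer, each with an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors; `gru_layer_params` is a hypothetical helper written for this check:

```python
# Sizes used when the model was instantiated above.
p_vocab, p_embed, p_hidden, p_layers = 39, 128, 256, 2


def gru_layer_params(input_dim: int, hidden_dim: int) -> int:
    # 3 gates x (input weights + hidden weights + two bias vectors)
    return 3 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)


embedding_params = p_vocab * p_embed
# The first GRU layer reads embeddings; later layers read the previous layer's output.
gru_params = gru_layer_params(p_embed, p_hidden) + (p_layers - 1) * gru_layer_params(
    p_hidden, p_hidden
)
linear_params = p_hidden * p_vocab + p_vocab  # weights + bias
total_params = embedding_params + gru_params + linear_params

print(embedding_params, gru_params, linear_params, total_params)
# 4992 691200 10023 706215
```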

## Loss Function and Optimizer

In [None]:
from torch import optim


criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2, min_lr=1e-6
)

# Training and Evaluation

In [27]:
import tqdm


def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    progress_bar = tqdm.tqdm(dataloader, desc="Training")
    for batch_idx, (inputs, targets) in enumerate(progress_bar):
        # Move data to device
        inputs = inputs.to(device)
        targets = targets.to(device)

        outputs, _ = model(inputs)
        outputs = outputs.reshape(-1, outputs.size(-1))
        targets = targets.reshape(-1)

        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()

        # Calculate accuracy
        predicted = outputs.argmax(dim=1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()
        total_loss += loss.item()

        # Update progress bar
        progress_bar.set_postfix({
            'loss': f'{loss.item():.4f}',
            'acc': f'{100 * correct / total:.2f}%'
        })

    avg_loss = total_loss / len(dataloader)
    accuracy = 100 * correct / total

    return avg_loss, accuracy

## Validation Function

The validation function evaluates the model without updating weights. This helps us monitor overfitting and select the best model.


In [28]:
def validate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    progress_bar = tqdm.tqdm(dataloader, desc="Validating")
    with torch.no_grad():
        for inputs, targets in progress_bar:
            # Move data to device
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs, _ = model(inputs)
            outputs = outputs.reshape(-1, outputs.size(-1))
            targets = targets.reshape(-1)

            loss = criterion(outputs, targets)

            # Calculate accuracy
            predicted = outputs.argmax(dim=1)
            total += targets.size(0)
            correct += (predicted == targets).sum().item()
            total_loss += loss.item()

            # Update progress bar with current accuracy
            progress_bar.set_postfix({
                'loss': f'{loss.item():.4f}',
                'acc': f'{100 * correct / total:.2f}%'
            })

    avg_loss = total_loss / len(dataloader)
    accuracy = 100 * correct / total

    return avg_loss, accuracy

In [29]:
num_epochs = 20
best_val_acc = 0.0

# Track training history
train_losses = []
train_accs = []
val_losses = []
val_accs = []

print("Starting training...")
print(f"Device: {device}")
print(f"Number of epochs: {num_epochs}\n")

for epoch in range(num_epochs):
    print(f"  Epoch {epoch + 1}/{num_epochs}")
    # Train
    train_loss, train_acc = train_epoch(
        model, train_loader, criterion, optimizer, device
    )

    # Validate
    val_loss, val_acc = validate(model, valid_loader, criterion, device)

    scheduler.step(val_loss)

    # Track history
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    val_losses.append(val_loss)
    val_accs.append(val_acc)

    # Print progress
    print(f"Epoch [{epoch+1}/{num_epochs}]")
    print(f"  Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%")
    print(f"  Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}%")

    # Save best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "best_char_rnn_model.pth")
        print(f"  ✓ Saved best model (Val Acc: {val_acc:.2f}%)")
    print()

print("Training complete!")
print(f"Best validation accuracy: {best_val_acc:.2f}%")

Starting training...
Device: cuda
Number of epochs: 20

  Epoch 1/20


Training: 100%|██████████| 7811/7811 [13:32<00:00,  9.61it/s, loss=1.3049, acc=55.89%]
Validating: 100%|██████████| 468/468 [00:21<00:00, 22.11it/s, loss=1.3411, acc=57.23%]


Epoch [1/20]
  Train Loss: 1.4460 | Train Acc: 55.89%
  Val Loss: 1.4052 | Val Acc: 57.23%
  ✓ Saved best model (Val Acc: 57.23%)

  Epoch 2/20


Training: 100%|██████████| 7811/7811 [14:32<00:00,  8.95it/s, loss=1.2902, acc=59.98%]
Validating: 100%|██████████| 468/468 [00:24<00:00, 18.87it/s, loss=1.3543, acc=57.33%]


Epoch [2/20]
  Train Loss: 1.2910 | Train Acc: 59.98%
  Val Loss: 1.4116 | Val Acc: 57.33%
  ✓ Saved best model (Val Acc: 57.33%)

  Epoch 3/20


Training: 100%|██████████| 7811/7811 [13:39<00:00,  9.53it/s, loss=1.2606, acc=60.78%]
Validating: 100%|██████████| 468/468 [00:19<00:00, 24.28it/s, loss=1.3860, acc=57.21%]


Epoch [3/20]
  Train Loss: 1.2612 | Train Acc: 60.78%
  Val Loss: 1.4235 | Val Acc: 57.21%

  Epoch 4/20


Training: 100%|██████████| 7811/7811 [13:29<00:00,  9.65it/s, loss=1.2576, acc=61.21%]
Validating: 100%|██████████| 468/468 [00:21<00:00, 21.78it/s, loss=1.3404, acc=57.26%]


Epoch [4/20]
  Train Loss: 1.2456 | Train Acc: 61.21%
  Val Loss: 1.4307 | Val Acc: 57.26%

  Epoch 5/20


Training: 100%|██████████| 7811/7811 [13:06<00:00,  9.93it/s, loss=1.2114, acc=61.89%]
Validating: 100%|██████████| 468/468 [00:23<00:00, 20.06it/s, loss=1.3586, acc=57.13%]


Epoch [5/20]
  Train Loss: 1.2210 | Train Acc: 61.89%
  Val Loss: 1.4382 | Val Acc: 57.13%

  Epoch 6/20


Training: 100%|██████████| 7811/7811 [12:33<00:00, 10.37it/s, loss=1.1973, acc=62.06%]
Validating: 100%|██████████| 468/468 [00:18<00:00, 24.65it/s, loss=1.3580, acc=56.93%]


Epoch [6/20]
  Train Loss: 1.2146 | Train Acc: 62.06%
  Val Loss: 1.4429 | Val Acc: 56.93%

  Epoch 7/20


Training: 100%|██████████| 7811/7811 [12:36<00:00, 10.32it/s, loss=1.2136, acc=62.18%]
Validating: 100%|██████████| 468/468 [00:17<00:00, 26.41it/s, loss=1.3527, acc=57.03%]


Epoch [7/20]
  Train Loss: 1.2104 | Train Acc: 62.18%
  Val Loss: 1.4423 | Val Acc: 57.03%

  Epoch 8/20


Training: 100%|██████████| 7811/7811 [12:42<00:00, 10.25it/s, loss=1.1997, acc=62.53%]
Validating: 100%|██████████| 468/468 [00:19<00:00, 24.20it/s, loss=1.3601, acc=56.96%]


Epoch [8/20]
  Train Loss: 1.1983 | Train Acc: 62.53%
  Val Loss: 1.4474 | Val Acc: 56.96%

  Epoch 9/20


Training:   2%|▏         | 161/7811 [00:17<13:51,  9.21it/s, loss=1.1908, acc=62.63%]


KeyboardInterrupt: 

## Load Best Model


In [31]:
model.load_state_dict(
    torch.load("best_char_rnn_model.pth", weights_only=True, map_location=device)
)
print("Best model loaded!")

Best model loaded!


## Next Character Prediction
We can predict the next character given a seed text. This involves feeding the seed text into the model and sampling from the output probabilities to get the next character.

### Temperature Sampling
Instead of always picking the most likely character, we can use **temperature sampling** to add diversity:

- **Temperature = 1**: Use probabilities as-is
- **Temperature < 1**: Favor high-probability characters (more conservative)
- **Temperature > 1**: Flatten probabilities (more creative/random)
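
The effect of temperature can be seen on a tiny example. The following is a minimal pure-Python sketch, not the notebook's sampling code: `softmax_with_temperature` is a hypothetical helper and the three logits are made-up scores.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


demo_logits = [2.0, 1.0, 0.0]  # made-up scores for three characters

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(demo_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The probability of the top-scoring character shrinks as the temperature rises, which is why high temperatures produce more varied but less coherent text. Note that a temperature of exactly 0 would divide by zero, so very small positive values such as 0.01 are used to approximate greedy decoding.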

In [45]:
def predict_next_char(
    model, text, char_to_idx, idx_to_char, temperature=1.0, device="cpu"
):
    model.eval()

    # Lowercase to match the training vocabulary; characters outside it fall back to index 0
    text_lower = text.lower()
    input_tensor = (
        torch.tensor(
            [char_to_idx.get(char, 0) for char in text_lower], dtype=torch.long
        )
        .unsqueeze(0)
        .to(device)
    )

    with torch.no_grad():
        output, _ = model(input_tensor)

        # Get logits for the last character
        logits = output[0, -1, :]
        scaled_logits = logits / temperature
        probs = torch.softmax(scaled_logits, dim=0)
        char_idx = torch.multinomial(probs, num_samples=1).item()

    return idx_to_char[char_idx]


# Test the function
test_text = "to be or not to b"
next_char = predict_next_char(
    model, test_text, char_to_idx, idx_to_char, temperature=1.0, device=device
)
print(f"Input: '{test_text}'")
print(f"Predicted next character: '{next_char}'")
print(f"Result: '{test_text + next_char}'")

Input: 'to be or not to b'
Predicted next character: 'e'
Result: 'to be or not to be'


## Text Generation
After training the model, we can use it to generate new text sequences.
1. First, we give our model a seed text.
2. Then, we predict the next character and append it to the sequence.
3. We repeat the process until the generated text reaches the desired length.

In [47]:
def generate_text(
    model,
    seed_text,
    char_to_idx,
    idx_to_char,
    num_chars=100,
    temperature=1.0,
    device="cuda",
):
    generated_text = seed_text

    # Generate characters one at a time
    for _ in range(num_chars):
        # Predict next character
        next_char = predict_next_char(
            model,
            generated_text,
            char_to_idx,
            idx_to_char,
            temperature=temperature,
            device=device,
        )
        generated_text += next_char

    return generated_text


# Generate text with different temperatures
seed = "to be or not to be"

print("=" * 70)
print("Text Generation with Different Temperatures")
print("=" * 70)

print(f"\nSeed text: '{seed}'")
print(f"\n{'='*70}")
print("Temperature = 0.01 (Very Conservative)")
print(f"{'='*70}")
text_low = generate_text(
    model,
    seed,
    char_to_idx,
    idx_to_char,
    num_chars=100,
    temperature=0.01,
    device=device,
)
print(text_low)

print(f"\n{'='*70}")
print("Temperature = 1.0 (Balanced)")
print(f"{'='*70}")
text_med = generate_text(
    model, seed, char_to_idx, idx_to_char, num_chars=100, temperature=1.0, device=device
)
print(text_med)

print(f"\n{'='*70}")
print("Temperature = 2.0 (More Creative)")
print(f"{'='*70}")
text_high = generate_text(
    model, seed, char_to_idx, idx_to_char, num_chars=100, temperature=2.0, device=device
)
print(text_high)

Text Generation with Different Temperatures

Seed text: 'to be or not to be'

Temperature = 0.01 (Very Conservative)
to be or not to be a proportion
and see the strength of the provost in the state,
which we have seen the strong and se

Temperature = 1.0 (Balanced)
to be or not to be
maided off, seening, sigrioal purses
in them, this proud man that bear him to his
with day will wea

Temperature = 2.0 (More Creative)
to be or not to be spice yet,
isesadiey, and thy eandve thts'em aclondoble;
wanioughfou sword to-fapet. let her prese



## Testing the Text Generation


In [48]:
# Try different seed texts
custom_seeds = [
    "romeo and juliet",
    "once upon a time",
    "the king said",
    "in fair verona",
]

print("Custom Text Generation\n" + "=" * 70)

for seed in custom_seeds:
    torch.manual_seed(42)  # Reset seed for consistency
    generated = generate_text(
        model,
        seed,
        char_to_idx,
        idx_to_char,
        num_chars=200,
        temperature=1.0,
        device=device,
    )
    print(f"\nSeed: '{seed}'")
    print(f"Generated: {generated}")
    print("-" * 70)

Custom Text Generation

Seed: 'romeo and juliet'
Generated: romeo and juliet,
or for this peers perfection for his sport:
your highness--as follows me.

eithoraa:
i pray thee, teech,
go mourn perform'd and loves these towers.

first murderer:
what is cared my lord, is out in 
----------------------------------------------------------------------

Seed: 'once upon a time'
Generated: once upon a time,
on there i do cries to your crown thus yours.

gloucester:
as followers to tite, and repoant my gitter;
and most upench, but for my head spare here,
the tribunes, throw of my benefit to-morrow.
now 
----------------------------------------------------------------------

Seed: 'the king said'
Generated: the king said,
on there is pecricious touchery of my point:
applain the onay follow'd cloiding soul.

prance:
well eas, kites, nor it keeped to mistrust.

coriolanus:
move the king that he shall fetter them
enough
----------------------------------------------------------------------

Seed: '