Character-Level RNN: Text Generation from Asimov's *Foundation*

🧠 Goal:
Build a character-level RNN from scratch using NumPy to model language patterns in Isaac Asimov's *Foundation*.
The trained model will generate Asimov-style text one character at a time.

🔁 Plan:

1. 📖 Load and preprocess text
   - Read the *Foundation* text
   - Create character-to-index (char_to_ix) and index-to-character (ix_to_char) mappings
   - Encode the text as an integer sequence

2. 🧱 Initialize RNN model
   - Parameters: Wxh, Whh, Why, bh, by
   - Hidden state size: e.g., 100

3. 🔄 Forward and Backward Pass
   - Implement one time-step forward: h_t = tanh(Wxh·x_t + Whh·h_{t-1} + bh)
   - Predict next char logits: y_t = Why·h_t + by
   - Use softmax + cross-entropy loss
   - Backpropagate gradients through time (BPTT)
   - Apply gradient clipping to guard against exploding gradients

4. 🏋️‍♂️ Training Loop
   - Slide a window over text with fixed sequence length (e.g., 25 chars)
   - Compute loss and gradients
   - Update parameters via SGD

5. 🧪 Sampling Function
   - Start from a seed character
   - Sample next character from softmax distribution
   - Repeat for N characters

6. 📉 Monitoring
   - Print the loss every N iterations
   - Print sample text every 1000 iterations

7. 🚀 [Optional] Try with different corpora (e.g., Sanskrit, Shakespeare)

This notebook is a stepping stone toward building a Sanskrit name generator using character-level RNNs.
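
As a rough preview of plan items 2, 3, and 5, here is a minimal sketch of the model core: parameter initialization, one forward/backward pass with BPTT and gradient clipping, and a sampling function. It is a sketch under stated assumptions, not the notebook's final implementation. The names `loss_and_grads`, `sample`, and `softmax`, the 0.01 weight scale, the clipping range of [-5, 5], and the fixed random seed are all illustrative choices; `hidden_size = 100` comes from the plan and `vocab_size = 105` from the corpus statistics printed in Step 1.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed: an assumption, for repeatability
vocab_size, hidden_size = 105, 100      # 105 unique chars in the corpus; 100 from the plan

# Small random weights and zero biases (the 0.01 scale is an assumption)
Wxh = rng.standard_normal((hidden_size, vocab_size)) * 0.01
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
Why = rng.standard_normal((vocab_size, hidden_size)) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

def softmax(y):
    e = np.exp(y - y.max())             # subtract the max for numerical stability
    return e / e.sum()

def loss_and_grads(inputs, targets, h_prev, clip=5.0):
    """Forward pass, cross-entropy loss, and BPTT over one sequence of indices."""
    xs, hs, ps = {}, {-1: h_prev}, {}
    loss = 0.0
    for t, (ix, tgt) in enumerate(zip(inputs, targets)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][ix] = 1.0                                        # one-hot input
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)    # h_t
        ps[t] = softmax(Why @ hs[t] + by)                      # next-char probabilities
        loss += -np.log(ps[t][tgt, 0])
    grads = [np.zeros_like(p) for p in (Wxh, Whh, Why, bh, by)]
    dWxh, dWhh, dWhy, dbh, dby = grads
    dh_next = np.zeros_like(h_prev)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy()
        dy[targets[t]] -= 1.0                  # gradient of softmax + cross-entropy
        dWhy += dy @ hs[t].T
        dby += dy
        dh = Why.T @ dy + dh_next              # from the output and from step t+1
        draw = (1.0 - hs[t] ** 2) * dh         # backprop through tanh
        dbh += draw
        dWxh += draw @ xs[t].T
        dWhh += draw @ hs[t - 1].T
        dh_next = Whh.T @ draw
    for g in grads:
        np.clip(g, -clip, clip, out=g)         # element-wise gradient clipping
    return loss, grads, hs[len(inputs) - 1]

def sample(h, seed_ix, n):
    """Generate n character indices, starting from seed_ix and hidden state h."""
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1.0
    out = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        p = softmax(Why @ h + by)
        ix = int(rng.choice(vocab_size, p=p.ravel()))
        x = np.zeros((vocab_size, 1))
        x[ix] = 1.0
        out.append(ix)
    return out

# Toy usage with made-up indices (not real corpus data)
h0 = np.zeros((hidden_size, 1))
loss, grads, h_last = loss_and_grads([1, 2, 3], [2, 3, 4], h0)
generated = sample(h0, seed_ix=1, n=10)
```

With untrained weights the per-character loss should sit close to ln(vocab_size), since the softmax output is nearly uniform; that makes a convenient sanity check before training starts. An SGD step would then subtract `learning_rate * g` from each parameter for the matching `g` in `grads`.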


In [1]:
# Step 1: Load and preprocess text
import numpy as np

# Load cleaned Asimov corpus
with open("/kaggle/input/asimov/asimov_cleaned.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Get all unique characters in the text
chars = sorted(set(data))
vocab_size = len(chars)

# Create character-to-index and index-to-character mappings
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for ch, i in char_to_ix.items()}

# Encode the entire text as a list of character indices
data_ix = [char_to_ix[ch] for ch in data]

# Print some basic stats
print(f"Total characters: {len(data)}")
print(f"Unique characters: {vocab_size}")
print(f"Sample char_to_ix: {list(char_to_ix.items())[:10]}")
print(f"Encoded text (first 20 indices): {data_ix[:20]}")

Total characters: 10925424
Unique characters: 105
Sample char_to_ix: [(' ', 0), ('!', 1), ('"', 2), ('#', 3), ('$', 4), ('%', 5), ('&', 6), ("'", 7), ('(', 8), (')', 9)]
Encoded text (first 20 indices): [31, 72, 80, 68, 75, 81, 67, 68, 0, 39, 0, 68, 61, 82, 65, 0, 83, 78, 69, 80]
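
The encoded sequence above is what the training loop will later slice into fixed-length windows (plan item 4). The sketch below shows one way to do that, assuming non-overlapping windows whose targets are the inputs shifted by one character. The helper names `iter_windows` and `one_hot` are hypothetical, and the toy index list stands in for `data_ix` so the example runs on its own.

```python
import numpy as np

def iter_windows(data_ix, seq_length=25):
    """Yield (inputs, targets) index lists; targets are inputs shifted by one."""
    # Stride equals seq_length (an assumption): windows do not overlap.
    for p in range(0, len(data_ix) - seq_length, seq_length):
        yield data_ix[p:p + seq_length], data_ix[p + 1:p + seq_length + 1]

def one_hot(ix, vocab_size):
    """Column vector of zeros with a single 1 at position ix."""
    x = np.zeros((vocab_size, 1))
    x[ix] = 1.0
    return x

# Toy stand-in for data_ix, not real corpus data
toy_ix = list(range(10))
windows = list(iter_windows(toy_ix, seq_length=4))
first_inputs, first_targets = windows[0]
```

Each target list holds the "next character" for the corresponding input position, which is exactly what the cross-entropy loss compares the softmax output against.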
