# Data Description
The dataset used in this project is the text of *“Alice’s Adventures in Wonderland”* by Lewis Carroll, obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/28885).

- **Format:** Plain text (.txt)  
- **Content:** A children's fantasy novel.  
- **Purpose:** Used to demonstrate text preprocessing (extraction and cleaning) and data preparation for character-level and word-level language modeling.



# Imports 

In [12]:
import math
import re

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# STEP 1: Load and clean text


## 1. Text Cleaning

**Purpose:** Extract relevant portions of the text, clean it, and save the cleaned version.

**Steps:**

1. **Read the file:** Open `Book.txt` with UTF-8 encoding to correctly read special characters.  
2. **Extract between markers:**  
   - Start marker: `*** START`  
   - End marker: `*** END`  
   - Only text between these markers is kept.  
3. **Basic cleaning:**  
   - Convert text to lowercase.  
   - Replace every character other than letters, digits, whitespace, and basic punctuation (`. , ; : ! ? ' " -`) with a space.  
   - Collapse each run of whitespace (spaces, tabs, newlines) into a single space.  
4. **Save cleaned text:** Save the processed text to `dataclean_book.txt`.  
5. **Confirmation:** Print a message that the text has been cleaned and saved.
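
The cleaning rules in step 3 can be tried on a short string before running them on the whole book. The sketch below is illustrative only: the `clean` helper and the sample sentence are assumptions made for this example and are not part of the notebook. It applies the same lowercase, replace, and collapse sequence.

```python
import re

def clean(raw: str) -> str:
    """Lowercase, replace disallowed characters with spaces, collapse whitespace."""
    text = raw.lower()
    # Anything outside letters, digits, whitespace, and . , ; : ! ? ' " - becomes a space
    text = re.sub(r'[^a-z0-9\s.,;:!?\'"-]', ' ', text)
    # Runs of spaces, tabs, and newlines become one space
    return re.sub(r'\s+', ' ', text)

# Hypothetical sample with curly quotes, parentheses, a long dash, and a newline
sample = "\u201cOh dear!\u201d  said Alice,\n(to herself) \u2014 it\u2019s late."
cleaned = clean(sample)
print(repr(cleaned))
```

Typographic quotes and apostrophes are outside the allowed set, so `it’s` comes out as `it s`, while a straight apostrophe `'` would be kept. The result can also start with a space, because a replaced leading character is not stripped.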

---

In [2]:
with open("Book.txt", "r", encoding="utf-8") as f:
    text = f.read()

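# Extract the body between the Project Gutenberg start and end markers (if present)
start_token = "*** START"
end_token = "*** END"
start_idx = text.find(start_token)
end_idx = text.find(end_token)
if start_idx != -1 and end_idx != -1:
    # Skip the rest of the start-marker line so the marker itself is not kept
    body_start = text.find("\n", start_idx) + 1
    text = text[body_start:end_idx]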

# Basic cleaning
clean_text = text.lower()
clean_text = re.sub(r'[^a-zA-Z0-9\s.,;:!?\'"-]', ' ', clean_text)
clean_text = re.sub(r'\s+', ' ', clean_text)

with open("dataclean_book.txt", "w", encoding="utf-8") as f:
    f.write(clean_text)

print("Text cleaned and saved to dataclean_book.txt")


Text cleaned and saved to dataclean_book.txt


# STEP 2: Dataset (Char + Word)

## 2.1 Char-level Dataset 


### Character-level Data Preparation

**Purpose:** Convert cleaned text into numerical form at the character level and prepare PyTorch datasets and dataloaders for model training.

**Steps:**

1. **Read the cleaned text:** Load `dataclean_book.txt` as a string.  
2. **Create vocabulary:** Extract all unique characters and count them.  
3. **Encoding and decoding dictionaries:** Map characters to integers and vice versa.  
4. **Helper functions:** Encode a string into a list of integers and decode a list of integers back into a string.  
5. **Convert text to tensor:** Transform the entire text into a PyTorch tensor of integers.  
6. **Split data:** Use the first 80% for training and the remaining 20% for validation.  
7. **Custom dataset:** Create a PyTorch Dataset that returns an input sequence of fixed block size (64 characters by default) and a target sequence shifted one character ahead.  
8. **DataLoaders:** Prepare dataloaders (batch size 32) to feed batches of character sequences into the model.
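
To make steps 2 to 7 concrete, here is a minimal sketch on a toy string. It uses plain Python lists instead of tensors, and the names `stoi`, `itos`, `encode`, `decode`, and the block size of 4 are chosen for this example only. It shows the round trip through the vocabulary and how one training pair is cut from the encoded text.

```python
# Toy text standing in for the cleaned book
text = "hello world"

# Vocabulary: every distinct character, in sorted order
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map each character to its integer id."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)

# One (input, target) pair: the target is the input shifted one position ahead
block_size = 4
x = ids[0:block_size]
y = ids[1:block_size + 1]
print(decode(x), "->", decode(y))
```

Each position in `y` is the character that follows the same position in `x`, so a single window supplies `block_size` next-character predictions.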

In [3]:

with open("dataclean_book.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(list(set(text)))
vocab_size_char = len(chars)

stoi_char = {ch: i for i, ch in enumerate(chars)}
itos_char = {i: ch for i, ch in enumerate(chars)}

def encode_char(s): return [stoi_char[c] for c in s]
def decode_char(l): return ''.join([itos_char[i] for i in l])

data_char = torch.tensor(encode_char(text), dtype=torch.long)

n = int(0.8 * len(data_char))
train_data_char = data_char[:n]
val_data_char = data_char[n:]

class CharDataset(Dataset):
    def __init__(self, data, block_size=64):
        self.data = data
        self.block_size = block_size
    def __len__(self): return len(self.data) - self.block_size
    def __getitem__(self, idx):
        x = self.data[idx:idx+self.block_size]
        y = self.data[idx+1:idx+1+self.block_size]
        return x, y

train_loader_char = DataLoader(CharDataset(train_data_char), batch_size=32, shuffle=True)
val_loader_char = DataLoader(CharDataset(val_data_char), batch_size=32)


## 2.2 Word-level Dataset



### Word-level Data Preparation

### 1. Purpose
Convert cleaned text into numerical form at the **word level** and prepare PyTorch datasets and dataloaders for training and validation.

---

### 2. Steps

1. **Tokenization:**  
   - Split the cleaned text into word tokens with the regular expression `\b\w+\b`, which drops punctuation.  
   - The cleaned text is already lowercase, so tokens are case-consistent.

2. **Vocabulary creation:**  
   - Extract all unique words from the tokens.  
   - Count the total number of unique words (`vocab_size_word`).

3. **Encoding and decoding dictionaries:**  
   - `stoi_word`: Maps each word to a unique integer.  
   - `itos_word`: Maps integers back to their corresponding words.

4. **Helper functions:**  
   - `encode_word`: Converts a list of word tokens to a list of integer IDs.  
   - `decode_word`: Converts a list of integer IDs back to a string of words.

5. **Convert text to tensor:**  
   - Transform the entire list of encoded words into a PyTorch tensor for model input.

6. **Split data into training and validation sets:**  
   - Use 80% of the data for training and 20% for validation.

7. **Custom word-level dataset:**  
   - Create a PyTorch Dataset that returns sequences of a fixed block size (`block_size=16`) for training.  
   - Each sample consists of an input `x` (a sequence of words) and a target `y` (the same sequence shifted one word ahead).

8. **DataLoaders:**  
   - Prepare PyTorch DataLoaders for training and validation.  
   - Batch size is set to 32.  
   - Training loader shuffles data, validation loader does not.
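
The tokenization and windowing steps can be checked on a single sentence. This is a sketch under assumptions: the sentence, the block size of 4, and the variable names are illustrative, and lists are used in place of tensors.

```python
import re

# One already-cleaned, lowercase sentence standing in for the book
cleaned = ("alice was beginning to get very tired of sitting by her sister "
           "on the bank, and of having nothing to do.")

# \w+ keeps runs of letters, digits, and underscores, so punctuation is dropped
tokens = re.findall(r"\b\w+\b", cleaned)
words = sorted(set(tokens))
stoi = {w: i for i, w in enumerate(words)}
ids = [stoi[w] for w in tokens]

# Sliding windows: one sample per start index, target shifted one word ahead
block_size = 4
num_samples = len(ids) - block_size
windows = [(ids[i:i + block_size], ids[i + 1:i + 1 + block_size])
           for i in range(num_samples)]

print(len(tokens), "tokens,", len(words), "unique words,", len(windows), "samples")
```

Consecutive windows overlap in all but one word, so the number of samples is close to the number of tokens rather than the number of tokens divided by the block size.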

In [4]:
# Tokenization (the cleaned text loaded above is already lowercase; \w+ drops punctuation)
tokens = re.findall(r"\b\w+\b", text)
words = sorted(list(set(tokens)))
vocab_size_word = len(words)



########################################################
# Encoding and decoding dictionaries
stoi_word = {w: i for i, w in enumerate(words)}
itos_word = {i: w for i, w in enumerate(words)}



########################################################
# Helper functions
def encode_word(tokens): return [stoi_word[w] for w in tokens]
def decode_word(ids): return " ".join([itos_word[i] for i in ids])

    
########################################################
#Convert text to tensor
data_word = torch.tensor(encode_word(tokens), dtype=torch.long)


########################################################
#Split data into training and validation sets
n = int(0.8 * len(data_word))
train_data_word = data_word[:n]
val_data_word = data_word[n:]


########################################################
# Custom word-level dataset
class WordDataset(Dataset):
    def __init__(self, data, block_size=16):
        self.data = data
        self.block_size = block_size
    def __len__(self): return len(self.data) - self.block_size
    def __getitem__(self, idx):
        x = self.data[idx:idx+self.block_size]
        y = self.data[idx+1:idx+1+self.block_size]
        return x, y


########################################################
#DataLoaders
train_loader_word = DataLoader(WordDataset(train_data_word), batch_size=32, shuffle=True)
val_loader_word = DataLoader(WordDataset(val_data_word), batch_size=32)

# STEP 3: Models (Char + Word)


## LSTM Model Definition and Initialization

## 1. Purpose
Define and initialize LSTM models for **both character-level and word-level modeling**.  
- Character-level model predicts the next character in a sequence.  
- Word-level model predicts the next word in a sequence.  

---

## 2. Model Architecture

1. **Embedding Layer:**  
   - Converts input indices (characters or words) into dense vector representations.  
   - Embedding size is set to 128 by default.

2. **LSTM Layer:**  
   - Processes sequences using Long Short-Term Memory units.  
   - Hidden size is 256 by default.  
   - Number of stacked LSTM layers is 2.  
   - `batch_first=True` ensures input tensors have shape `(batch_size, sequence_length, features)`.

3. **Fully Connected Layer:**  
   - Maps LSTM outputs to the vocabulary space.  
   - Produces logits for each character or word in the vocabulary.

---

## 3. Forward Pass

- Input sequence is first embedded.  
- LSTM processes the embedded sequence and returns output and updated hidden state.  
- Fully connected layer transforms LSTM output to logits.  
- Logits can be used with a loss function like `CrossEntropyLoss` for training.

---

## 4. Device Setup

- Check if GPU is available (`cuda`).  
- Move models to the appropriate device for faster computation.

---

## 5. Model Initialization

- `model_char`: Character-level LSTM model initialized with character vocabulary size.  
- `model_word`: Word-level LSTM model initialized with word vocabulary size.  

Both models are then ready to be trained on their respective datasets.  

---

## 6. Summary

- Both models are printed to show their layers and layer dimensions.  
- Supports training on GPU if available.


In [5]:
# Model Architecture
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size=128, hidden_size=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    # Forward Pass
    def forward(self, x, hidden=None):
        x = self.embed(x)
        out, hidden = self.lstm(x, hidden)
        logits = self.fc(out)
        return logits, hidden
# Device Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Model Initialization
model_char = LSTMModel(vocab_size_char).to(device)
model_word = LSTMModel(vocab_size_word).to(device)

print("Char-level model:\n", model_char)
print("Word-level model:\n", model_word)


Char-level model:
 LSTMModel(
  (embed): Embedding(37, 128)
  (lstm): LSTM(128, 256, num_layers=2, batch_first=True)
  (fc): Linear(in_features=256, out_features=37, bias=True)
)
Word-level model:
 LSTMModel(
  (embed): Embedding(2582, 128)
  (lstm): LSTM(128, 256, num_layers=2, batch_first=True)
  (fc): Linear(in_features=256, out_features=2582, bias=True)
)
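
The printed modules list layer shapes but not parameter counts. As a rough check, the counts can be worked out by hand. The sketch below assumes PyTorch's default `nn.LSTM` layout (for each layer, input weights, recurrent weights, and two bias vectors covering four gates, with no projection); the helper name `lstm_model_param_count` is made up for this example.

```python
def lstm_model_param_count(vocab_size, embed_size=128, hidden_size=256, num_layers=2):
    """Count trainable parameters of an Embedding -> LSTM -> Linear stack."""
    embedding = vocab_size * embed_size
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings; later layers read the previous hidden state
        input_size = embed_size if layer == 0 else hidden_size
        # Four gates, each with input weights, recurrent weights, and two bias vectors
        lstm += 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    linear = hidden_size * vocab_size + vocab_size  # weights plus bias
    return embedding + lstm + linear

char_params = lstm_model_param_count(37)    # vocabulary size from the printed char model
word_params = lstm_model_param_count(2582)  # vocabulary size from the printed word model
print(char_params, word_params)
```

The LSTM layers contribute the same amount to both models; the word-level model is larger only because its embedding and output layers scale with the vocabulary size.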


# STEP 4: Training Function

## LSTM Model Training

## 1. Loss Function
- **CrossEntropyLoss** is used as the criterion.  
- Suitable for predicting discrete classes (characters or words).  

---

## 2. Training Function: `train_model`

**Purpose:** Train the LSTM model on given datasets and monitor training and validation loss.

---

## 3. Steps in Training

1. **Optimizer Setup:**  
   - Use Adam optimizer with learning rate `lr` (default 0.003).  
   - Optimizer updates the model parameters based on computed gradients.

2. **Epoch Loop:**  
   - Repeat training for a number of epochs (default 3).  
   - Each epoch goes through all batches in the training dataset.

3. **Training Loop:**  
   - Set model to training mode (`model.train()`).  
   - For each batch:  
     - Move input `x` and target `y` to the selected device (CPU or GPU).  
     - Zero gradients in the optimizer.  
     - Forward pass through the model to get logits.  
     - Compute loss between predicted logits and actual targets.  
     - Backpropagate loss (`loss.backward()`).  
     - Update model parameters (`optimizer.step()`).  
   - Print average loss every `print_every` steps for monitoring.

4. **Validation Loop:**  
   - Set model to evaluation mode (`model.eval()`).  
   - Disable gradient computation with `torch.no_grad()`.  
   - Compute validation loss over the validation dataset.  
   - Print average validation loss after each epoch.

---

## 4. Key Points

- **Batching:** Inputs and targets are processed in batches for efficiency.  
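- **Loss Calculation:** Logits of shape `(batch, seq_len, vocab_size)` are flattened to `(batch * seq_len, vocab_size)` and targets to `(batch * seq_len,)`, a shape pair that `CrossEntropyLoss` accepts.  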
- **Device Management:** Ensures computations run on GPU if available for faster training.  
- **Monitoring:** Training and validation losses are printed to track model performance and convergence.
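
The flattening described under "Loss Calculation" can be illustrated without PyTorch. The following sketch uses nested lists and hypothetical helper names (`log_softmax`, `mean_cross_entropy`) to compute the mean cross-entropy over every position in a batch, which is what `CrossEntropyLoss` does with its default mean reduction.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def mean_cross_entropy(logits, targets):
    """logits: [batch][seq_len][vocab]; targets: [batch][seq_len] of class ids."""
    # Flatten (batch, seq_len, vocab) -> (batch * seq_len, vocab)
    flat_logits = [row for seq in logits for row in seq]
    # Flatten (batch, seq_len) -> (batch * seq_len,)
    flat_targets = [t for seq in targets for t in seq]
    losses = [-log_softmax(row)[t] for row, t in zip(flat_logits, flat_targets)]
    return sum(losses) / len(losses)

batch, seq_len, vocab = 2, 3, 5
# All-zero logits mean the model assigns equal probability to every class
uniform_logits = [[[0.0] * vocab for _ in range(seq_len)] for _ in range(batch)]
targets = [[0, 1, 2], [3, 4, 0]]
loss = mean_cross_entropy(uniform_logits, targets)
print(round(loss, 4), round(math.log(vocab), 4))
```

With uniform predictions the loss equals `log(vocab)`, which gives a useful reference point when reading the training and validation losses.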


In [7]:
criterion = nn.CrossEntropyLoss()

def train_model(model, train_loader, val_loader, vocab_size, epochs=3, lr=0.003, print_every=100):
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for i, (x, y) in enumerate(train_loader):
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits, _ = model(x)
            loss = criterion(logits.view(-1, vocab_size), y.view(-1))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            if (i+1) % print_every == 0:
                print(f"Epoch {epoch}, Step {i+1}, Loss: {total_loss/print_every:.4f}")
                total_loss = 0
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                logits, _ = model(x)
                loss = criterion(logits.view(-1, vocab_size), y.view(-1))
                val_loss += loss.item()
        print(f"Epoch {epoch} completed. Validation Loss: {val_loss/len(val_loader):.4f}")


# STEP 5: Train Both Models


## Training Execution

## 1. Purpose
Train both the character-level and word-level LSTM models using the prepared datasets.

---

## 2. Steps

1. **Training Character-level Model:**  
   - The character-level model is trained on sequences of characters.  
   - Number of epochs is set to 10.  
   - Training function monitors both training and validation loss.

2. **Training Word-level Model:**  
   - The word-level model is trained on sequences of words.  
   - Number of epochs is set to 10.  
   - Training function monitors both training and validation loss.

---

## 3. Monitoring

- For both models, the console prints:  
  - Step-wise average training loss during each epoch.  
  - Validation loss after each epoch.  

These printouts make it possible to check that the loss is decreasing and to spot convergence or overfitting.


In [15]:
print("Training Char-level model...")
train_model(model_char, train_loader_char, val_loader_char, vocab_size_char, epochs=10)

print("\nTraining Word-level model...")
train_model(model_word, train_loader_word, val_loader_word, vocab_size_word, epochs=10)


Training Char-level model...
Epoch 1, Step 100, Loss: 0.2955
Epoch 1, Step 200, Loss: 0.2815
Epoch 1, Step 300, Loss: 0.2776
Epoch 1, Step 400, Loss: 0.2790
Epoch 1, Step 500, Loss: 0.2818
Epoch 1, Step 600, Loss: 0.2846
Epoch 1, Step 700, Loss: 0.2866
Epoch 1, Step 800, Loss: 0.2837
Epoch 1, Step 900, Loss: 0.2885
Epoch 1, Step 1000, Loss: 0.2893
Epoch 1, Step 1100, Loss: 0.2864
Epoch 1, Step 1200, Loss: 0.2868
Epoch 1, Step 1300, Loss: 0.2863
Epoch 1, Step 1400, Loss: 0.2889
Epoch 1, Step 1500, Loss: 0.2848
Epoch 1, Step 1600, Loss: 0.2851
Epoch 1, Step 1700, Loss: 0.2842
Epoch 1, Step 1800, Loss: 0.2873
Epoch 1, Step 1900, Loss: 0.2833
Epoch 1, Step 2000, Loss: 0.2834
Epoch 1, Step 2100, Loss: 0.2878
Epoch 1, Step 2200, Loss: 0.2853
Epoch 1, Step 2300, Loss: 0.2851
Epoch 1, Step 2400, Loss: 0.2847
Epoch 1, Step 2500, Loss: 0.2869
Epoch 1, Step 2600, Loss: 0.2820
Epoch 1, Step 2700, Loss: 0.2871
Epoch 1, Step 2800, Loss: 0.2890
Epoch 1, Step 2900, Loss: 0.2868
Epoch 1, Step 3000, Los

# STEP 6: Evaluation & Perplexity

## Model Evaluation

## 1. Purpose
Evaluate the performance of both character-level and word-level LSTM models on the validation datasets.  
- Compute **validation loss** to measure how well the model predicts unseen data.  
- Compute **perplexity** to quantify prediction uncertainty.

---

## 2. Steps

1. **Set model to evaluation mode:**  
   - Switch training-specific layers such as dropout to inference behavior using `model.eval()`.  
   - Disable gradient computation with `torch.no_grad()` for efficiency.

2. **Iterate over validation data:**  
   - For each batch, move input `x` and target `y` to the selected device (CPU or GPU).  
   - Forward pass through the model to get logits.  
   - Compute loss using `CrossEntropyLoss`.  
   - Accumulate total loss across all batches.

3. **Compute average loss and perplexity:**  
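   - Average loss = sum of per-batch mean losses divided by the number of batches.  
   - Perplexity = `exp(average loss)`; lower perplexity indicates better model performance.

4. **Compare models:**  
   - Character-level and word-level validation losses and perplexities are printed.  
   - The model with the lower perplexity is reported as better. Because one perplexity is measured per character and the other per word, the two values are not directly comparable, so this comparison is only a rough indication.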

---

## 3. Output

- **Char-level model:** Validation loss and perplexity.  
- **Word-level model:** Validation loss and perplexity.  
- **Comparison statement:** Reports which model has the lower validation perplexity.


In [16]:
def evaluate(model, data_loader, vocab_size):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for x, y in data_loader:
            x, y = x.to(device), y.to(device)
            logits, _ = model(x)
            loss = criterion(logits.view(-1, vocab_size), y.view(-1))
            total_loss += loss.item()
    avg_loss = total_loss / len(data_loader)
    ppl = math.exp(avg_loss)
    return avg_loss, ppl

char_val_loss, char_val_ppl = evaluate(model_char, val_loader_char, vocab_size_char)
word_val_loss, word_val_ppl = evaluate(model_word, val_loader_word, vocab_size_word)

print(f"[Char-level] Validation Loss: {char_val_loss:.4f}, Perplexity: {char_val_ppl:.2f}")
print(f"[Word-level] Validation Loss: {word_val_loss:.4f}, Perplexity: {word_val_ppl:.2f}")

if word_val_ppl < char_val_ppl:
    print(" Word-level performs better.")
else:
    print(" Char-level performs better.")


[Char-level] Validation Loss: 2.8401, Perplexity: 17.12
[Word-level] Validation Loss: 12.7919, Perplexity: 359299.56
 Char-level performs better.
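
One way to read these numbers on their own terms is to compare each loss with the loss of guessing uniformly over the vocabulary, which is `log(vocab_size)`. The sketch below only reuses the vocabulary sizes and validation losses printed above; the dictionary layout and names are assumptions for illustration.

```python
import math

# (vocabulary size, reported validation loss) copied from the outputs above
reported = {"char": (37, 2.8401), "word": (2582, 12.7919)}

summary = {}
for name, (vocab_size, val_loss) in reported.items():
    uniform_loss = math.log(vocab_size)  # loss of a uniform guess over the vocabulary
    summary[name] = {
        "perplexity": math.exp(val_loss),
        "uniform_loss": uniform_loss,
        "beats_uniform": val_loss < uniform_loss,
    }
    print(f"{name}: loss={val_loss:.4f}, uniform={uniform_loss:.4f}, "
          f"beats_uniform={summary[name]['beats_uniform']}")
```

Under these numbers the character-level loss is below its uniform baseline (about 3.61), while the word-level loss is above its baseline (about 7.86). That pattern is consistent with the word-level model overfitting the training split, though the notebook does not include the experiments needed to confirm it.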


# STEP 7: Text Generation

## Text Generation with Trained LSTM Models

## 1. Purpose
Generate new text sequences using the trained **character-level** and **word-level** LSTM models.  
- Character-level model predicts one character at a time.  
- Word-level model predicts one word at a time.  

---

## 2. Generation Process

1. **Set model to evaluation mode:**  
   - Switch training-specific layers such as dropout to inference behavior with `model.eval()`.  
   - Use `torch.no_grad()` to prevent gradient computation.

2. **Prepare initial input (start text):**  
   - Convert starting characters or words to integer IDs using the respective encoding dictionaries (`stoi_char` or `stoi_word`).  
   - Initialize the hidden state of the LSTM to `None`.

3. **Iterative generation loop:**  
   - For a specified sequence length:  
     - Forward pass through the model to get logits for the next character or word.  
     - Apply temperature scaling to control randomness.  
     - Convert logits to probabilities using softmax.  
     - Sample the next character or word from the probability distribution.  
     - Append the generated character or word to the sequence.  
     - Update input IDs for the next step.

4. **Return generated text:**  
   - Character-level: Join generated characters into a string.  
   - Word-level: Join generated words into a string.

---

## 3. Parameters

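- **start_text:** Initial string to seed generation (e.g., `"the "` or `"the cat"`). It should be lowercase and use only characters or words present in the vocabulary.  
- **length:** Number of characters or words to generate.  
- **temperature:** Controls randomness; values above 1 produce more diverse outputs, values below 1 more conservative ones.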
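
The effect of `temperature` is easiest to see on a small set of logits. This is a pure-Python sketch with made-up logits; `softmax_with_temperature` is a hypothetical helper, and `random.choices` stands in for the sampling step that the notebook performs with `torch.multinomial`.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
cold = softmax_with_temperature(logits, temperature=0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, temperature=1.0)  # plain softmax
hot = softmax_with_temperature(logits, temperature=2.0)      # flatter distribution

# Sample one token id from the neutral distribution (seeded for repeatability)
rng = random.Random(0)
next_id = rng.choices(range(len(logits)), weights=neutral, k=1)[0]

for name, probs in [("cold", cold), ("neutral", neutral), ("hot", hot)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, and higher temperatures spread it out, which is why high values give more varied but less reliable text.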

---

## 4. Output

- Prints a sample of generated text for:  
  - **Character-level model** (e.g., 200 characters)  
  - **Word-level model** (e.g., 20 words)  
- Shows how closely the sequences generated by each model resemble the training data.


In [20]:
def generate_char(model, start_text="the ", length=200, temperature=1.0):
    model.eval()
    input_ids = torch.tensor([stoi_char[c] for c in start_text], dtype=torch.long).unsqueeze(0).to(device)
    hidden = None
    generated = list(start_text)
    with torch.no_grad():
        for _ in range(length):
            logits, hidden = model(input_ids, hidden)
            logits = logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, 1).item()
            generated.append(itos_char[next_id])
            input_ids = torch.tensor([[next_id]], dtype=torch.long).to(device)
    return "".join(generated)

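def generate_word(model, start_text="the cat", length=20, temperature=1.0):
    model.eval()
    # Keep only seed words that are in the vocabulary (the vocabulary is lowercase)
    seed = [w for w in start_text.lower().split() if w in stoi_word]
    if not seed:
        raise ValueError("start_text contains no words from the vocabulary")
    input_ids = torch.tensor([stoi_word[w] for w in seed],
                             dtype=torch.long).unsqueeze(0).to(device)
    hidden = None
    generated = list(seed)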
    with torch.no_grad():
        for _ in range(length):
            logits, hidden = model(input_ids, hidden)
            logits = logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, 1).item()
            generated.append(itos_word[next_id])
            input_ids = torch.tensor([[next_id]], dtype=torch.long).to(device)
    return " ".join(generated)

print("Char-level sample:\n", generate_char(model_char, start_text="the ", length=200)) # you can change the start_text
print("\nWord-level sample:\n", generate_word(model_word, start_text="the cat", length=20)) # you can change the start_text


Char-level sample:
 the ratched a couple knew that it was over at last: and i do see you re trying to me dear, and a bright to-days and camouse go on one, said alice. right, was in the distance, sitting sad and lonely on a l

Word-level sample:
 the cat or you are you fond of of dogs the mouse did not answer so alice went on eagerly there is
