# Assignment 2: NLP Basics
In this assignment, you will implement and explore three fundamental concepts in Natural Language Processing: word embeddings (GloVe), subword tokenization (BPE), and recurrent neural networks for text generation (LSTM). This will give you a hands-on understanding of how text is represented and processed in modern NLP models.

**Due date: 2025.11.16, 23:59**

### Environment Setup and Data Download
Before you begin, please make sure you have the necessary libraries installed. You can typically install them using pip.
```bash
pip install torch numpy requests
```

## Section 1: GloVe Embeddings
In this section, you will load the traditional GloVe embeddings, and explore the basic properties of the embedding space.

In [1]:
# Download the GloVe text files
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2025-11-13 20:39:33--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-11-13 20:39:34--  https://nlp.stanford.edu/data/glove.6B.zip
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-11-13 20:39:35--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8621826

### 1.1 Load the GloVe embeddings [5 pts]

In [2]:
import numpy as np

# Load GloVe word vectors from a text file into a dictionary
def load_glove_embeddings(glove_file_path):
    """
    Loads GloVe embeddings into a dictionary mapping words to their vector representations.
    """
    embeddings_dict = {}
    # TODO: Open the file and parse each line.
    # Each line contains a word followed by its vector values.
    # Convert the vector values to a numpy array.
    with open(glove_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings_dict[word] = vector

    return embeddings_dict

glove_file_path = 'glove.6B.100d.txt'  # Replace with the path to your GloVe file
glove_embeddings = load_glove_embeddings(glove_file_path)

### 1.2 Find closest words [5 pts]
For a given word, you should find the most similar words in the vocabulary based on cosine similarity and output them along with their similarity scores.
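As a small illustration of the metric itself, the sketch below computes cosine similarity (the dot product divided by the product of the two norms) on made-up 2-D vectors rather than GloVe data. The helper name `cosine_similarity` is hypothetical. The point is that the score depends only on direction, not on vector length.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up for illustration)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # anti-parallel vectors
print(same_direction, orthogonal, opposite)
```

Parallel vectors score close to 1, perpendicular vectors close to 0, and anti-parallel vectors close to -1, regardless of their magnitudes.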

In [3]:
# Function to find the closest words and their corresponding similarity values
def find_closest_words(word_vec, embeddings_dict, top_n=5):
    """
    Finds the top_n closest words to a given vector based on cosine similarity.
    Note: The input is a vector, not a word. This makes the function more versatile.
    """
    # TODO:
    # 1. Calculate the cosine similarity between the input word_vec and all other word vectors in embeddings_dict.
    # 2. Sort the words based on similarity in descending order.
    # 3. Return the top_n words and their similarity scores.
    keys = list(embeddings_dict.keys())
    mat = np.array(list(embeddings_dict.values()))
    # cosine similarity
    mat_norms = np.linalg.norm(mat, axis=1)
    vec_norm = np.linalg.norm(word_vec)
    sims = (mat @ word_vec) / (mat_norms * vec_norm + 1e-9)
    # sort by similarity descending and skip the identical vector (if present)
    sorted_idx = np.argsort(-sims)
    results = []
    for idx in sorted_idx:
        if np.allclose(mat[idx], word_vec, atol=1e-6):
            continue
        results.append((keys[idx], float(sims[idx])))
        if len(results) >= top_n:
            break

    return results

chosen_word = 'man'
if chosen_word in glove_embeddings:
    chosen_word_vec = glove_embeddings[chosen_word]
    closest_words = find_closest_words(chosen_word_vec, glove_embeddings, top_n=5)
    print(f"The words closest to '{chosen_word}' are:")
    for word, similarity in closest_words:
        print(f"{word} with similarity of {similarity:.4f}")

The words closest to 'man' are:
woman with similarity of 0.8323
boy with similarity of 0.7915
one with similarity of 0.7789
person with similarity of 0.7527
another with similarity of 0.7522


### 1.3 Find new analogies [5 pts]
In the lecture, we discussed how linear relationships exist in the embedding space (e.g. king - man + woman ≈ queen). Please demonstrate a new analogy that was not mentioned in the lecture. You should perform the vector arithmetic like `vec(word1) - vec(word2) + vec(word3)` and find the word closest to the resulting vector.
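One practical detail: the nearest neighbor of the result vector is often one of the three input words, so a common convention is to exclude them from the candidates. The following is a minimal sketch of that idea on made-up 2-D vectors; `toy_embeddings`, `cosine`, and `analogy` are hypothetical names, and the numbers are chosen only so that the arithmetic is easy to follow.

```python
import math

# Made-up 2-D "embeddings" chosen so the analogy works; real GloVe vectors have many more dimensions.
toy_embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(word1, word2, word3, embeddings, top_n=1):
    """Rank words by similarity to vec(word1) - vec(word2) + vec(word3), skipping the three inputs."""
    target = [a - b + c for a, b, c in zip(embeddings[word1], embeddings[word2], embeddings[word3])]
    candidates = [
        (word, cosine(target, vec))
        for word, vec in embeddings.items()
        if word not in (word1, word2, word3)
    ]
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:top_n]

result = analogy("king", "man", "woman", toy_embeddings)
print(result)
```

With real GloVe vectors the same filtering can be applied to the output of `find_closest_words` by requesting a few extra neighbors and dropping the input words.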

In [4]:
# TODO
vec = glove_embeddings['king'] - glove_embeddings['man'] + glove_embeddings['woman']
closest_words = find_closest_words(vec, glove_embeddings, top_n=5)
print(f"\nThe words closest to the vector 'king - man + woman' are:")
for word, similarity in closest_words:
    print(f"{word} with similarity of {similarity:.4f}")


The words closest to the vector 'king - man + woman' are:
king with similarity of 0.8552
queen with similarity of 0.7834
monarch with similarity of 0.6934
throne with similarity of 0.6833
daughter with similarity of 0.6809


## Section 2: BPE tokenizer

Byte Pair Encoding (BPE) is a subword tokenization technique that iteratively merges the most frequent pair of adjacent tokens (initially bytes or characters) into a new subword unit, creating a vocabulary that balances character-level granularity and whole-word tokens. This method is widely used in modern natural language processing to handle out-of-vocabulary words and to shorten the token sequences a model has to process.

Let's look at an example. Given a sample string "banana bandana", we can calculate the frequency of the character pairs: 
```python
('a', 'n'): 4, ('n', 'a'): 3, ('b', 'a'): 2, ('a', ' '): 1, (' ','b'): 1, ('n', 'd'): 1, ('d', 'a'): 1
```
Since ('a', 'n') is the most frequent pair, we merge it into a new token 'an'. In the next round, 'an' participates in the frequency count, giving:
```python
('b', 'an'): 2, ('an', 'a'): 2, ('an', 'an'): 1, ('a', ' '): 1, (' ','b'): 1, ('an', 'd'): 1, ('d', 'an'): 1
```
Two pairs are now tied at a count of 2, so we may pick ('b', 'an') and get 'ban' as a new token. In the following round, ('an', 'a') is the most frequent pair, which gives 'ana'. With three merges, we've added 'an', 'ban' and 'ana' to our vocabulary, and our string can now be converted to the following tokens:
```python
'ban', 'ana', ' ', 'ban', 'd', 'ana'
```
So now we can use 6 tokens to represent the 14 characters.
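The walkthrough above can be reproduced with a few lines of Python. This is a minimal sketch, not the required implementation: `count_pairs` and `merge` are hypothetical helper names, and ties are broken in favor of the pair that appears first in the text, which is one possible rule among several.

```python
from collections import Counter

def count_pairs(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with the concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("banana bandana")
learned = []
for _ in range(3):
    counts = count_pairs(tokens)
    # max() returns the first pair with the highest count, so ties go to the pair seen first
    best = max(counts, key=counts.get)
    learned.append(best[0] + best[1])
    tokens = merge(tokens, best)

print(learned)
print(tokens)
```

A different tie-breaking rule in the second round would select ('an', 'a') before ('b', 'an'), so the learned vocabulary can differ between implementations even on the same corpus.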

You may wonder how this is better than word-level tokenization. First of all, it is more robust in out-of-vocabulary scenarios. For example, though the word "bandana" does exist in the GloVe embedding (look it up if you're not convinced), something like "banada" does not. When using GloVe embeddings, encountering "banada" during training would result in the default \<UNK\> token. In contrast, a BPE tokenizer can still infer the word's meaning through its sub-word tokens. Secondly, sub-word tokens include prefixes and suffixes that allow the model to learn different variations of a single word more efficiently.
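To make the out-of-vocabulary point concrete, here is a small sketch that applies the three merges from the "banana bandana" example to the unseen string "banada". The function name `apply_merges` is hypothetical, and the merge list is copied from the example rather than learned from a real corpus.

```python
# Merge rules from the "banana bandana" example, in the order they were learned
merges = [("a", "n"), ("b", "an"), ("an", "a")]

def apply_merges(word, merges):
    """Tokenize `word` by applying each merge rule in order."""
    tokens = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

seen = apply_merges("bandana", merges)
unseen = apply_merges("banada", merges)
print(seen)
print(unseen)
```

The unseen word still shares the 'ban' token with "bandana" instead of collapsing to a single unknown token, and concatenating its tokens restores the original string.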

In this section, you are required to implement a BPE tokenizer, and use one of the provided corpora to train it. You may train it on character level (starting with a vocabulary of all characters in the corpus) or byte level (starting with a vocabulary of all 256 possible byte values). You should verify that encoding and then decoding a sentence produces the original sentence. You may refer to (but not copy) the following implementations:
1. The tiktoken library: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py
2. Karpathy's minbpe repository: https://github.com/karpathy/minbpe
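The difference between the character-level and byte-level starting points mentioned above shows up as soon as the text contains a non-ASCII character. The sketch below uses a made-up example string; it only illustrates the initial token sequences, not the merge procedure.

```python
text = "café"  # made-up example containing one non-ASCII character

# Character-level: one starting token per Unicode character
char_tokens = list(text)

# Byte-level: one starting token per UTF-8 byte, each an integer in 0..255
byte_tokens = list(text.encode("utf-8"))

print(char_tokens)
print(byte_tokens)

# Round trip for the byte-level view: bytes back to the original string
restored = bytes(byte_tokens).decode("utf-8")
```

A byte-level base vocabulary of 256 entries can represent any input, whereas a character-level vocabulary only covers characters that occur in the training corpus.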

### 2.1 Implementation and Verification [15 pts]

In [5]:
from typing import List, Tuple, Dict
import collections
class BPETokenizer:
    def __init__(self):
        self.vocab = {}
        self.reverse_vocab = {}
        self.next_id = 0
        self.merges = []  # ordered list of (pair, merged_token) rules

    def _get_stats(self, tokens: List[str]) -> Dict[Tuple[str, str], int]:
        """Count how often each pair of adjacent tokens occurs."""
        pairs = collections.defaultdict(int)
        for i in range(len(tokens) - 1):
            pairs[tuple(tokens[i:i+2])] += 1
        return pairs

    def _merge_pair(self, tokens: List[str], pair: Tuple[str, str], new_token: str) -> List[str]:
        """Replace every occurrence of `pair` in `tokens` with `new_token`."""
        i = 0
        new_tokens = []
        while i < len(tokens):
            # Check whether the current and next token form the pair to merge
            if i + 1 < len(tokens) and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
                new_tokens.append(new_token)
                i += 2  # skip the two tokens that were just merged
            else:
                new_tokens.append(tokens[i])
                i += 1
        return new_tokens
    def train(self, text: str, vocab_size: int):
        """
        Trains the BPE tokenizer. You will need to store the learned merge rules and the final vocabulary.
        """
        # TODO:
        # 1. Initialize vocabulary with all unique characters in the text.
        # 2. Calculate the number of merges needed (vocab_size - initial_vocab_size).
        # 3. Loop for the required number of merges:
        #    a. Find the most frequent adjacent pair of tokens in the text.
        #    b. Create a new token by merging this pair.
        #    c. Add the new token to the vocabulary and record the merge rule.
        #    d. Replace all occurrences of the pair in the text with the new token.

        initial_chars = sorted(list(set(text)))
        self.next_id = 0
        for char in initial_chars:
            self.vocab[char] = self.next_id
            self.reverse_vocab[self.next_id] = char
            self.next_id += 1
            
        initial_vocab_size = len(self.vocab)
        num_merges = vocab_size - initial_vocab_size
        tokens = list(text) 
        self.merges = []
        for merge_i in range(num_merges):
            # 3a. Count adjacent pairs over the whole token list
            stats = self._get_stats(tokens)

            if not stats:
                break

            best_pair = max(stats, key=stats.get)

            # 3b. Create the new token
            new_token_str = best_pair[0] + best_pair[1]

            # 3c. Add the new token to the vocabulary and record the merge rule (in order)
            self.merges.append((best_pair, new_token_str))
            self.vocab[new_token_str] = self.next_id
            self.reverse_vocab[self.next_id] = new_token_str
            self.next_id += 1
            if merge_i < 5:
                print(f"Merge {merge_i+1}: '{best_pair[0]}', '{best_pair[1]}' -> '{new_token_str}'")

            # 3d. Replace every occurrence of the pair
            tokens = self._merge_pair(tokens, best_pair, new_token_str)
        
    def _tokenize_string(self, s: str) -> List[str]:
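        """Convert a single string into BPE tokens by applying the learned merges."""
        tokens = list(s)

        # Apply the merge rules strictly in the order they were recorded during training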
        for pair, new_token_str in self.merges:
            tokens = self._merge_pair(tokens, pair, new_token_str)
            
        return tokens
            

    def encode(self, text: str) -> List[int]:
        """
        Encodes a string into a list of token indices using the learned merge rules.
        """
        # 1. Apply all merge rules
        tokens = self._tokenize_string(text)

        # 2. Map the resulting tokens to IDs
        encoded_tokens = []
        for token in tokens:
            if token in self.vocab:
                encoded_tokens.append(self.vocab[token])
            else:
                # OOV fallback: encode the token character by character.
                # This assumes every character was seen during training;
                # a character absent from the corpus raises a KeyError.
                for char in list(token):
                    encoded_tokens.append(self.vocab[char])
        
        return encoded_tokens

    def decode(self, tokens: List[int]) -> str:
        """
        Decodes a list of token indices back into a text string.
        """
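        # Map token IDs back to token strings.
        # Whitespace is never dropped or treated specially during training,
        # so plain concatenation reconstructs the text exactly.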
        token_strings = [self.reverse_vocab[token_id] for token_id in tokens]
        
        return "".join(token_strings)

In [None]:
# Load your training text here; we also provide some sample text for you
with open('tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    train_text = f.read()

tokenizer = BPETokenizer()
# We will create a small vocabulary for demonstration purposes
tokenizer.train(train_text, vocab_size=512)

# Verification step
test_string = "O Romeo, Romeo, wherefore art thou Romeo?"
encoded_tokens = tokenizer.encode(test_string)
decoded_string = tokenizer.decode(encoded_tokens)

assert decoded_string == test_string
print("Verification Successful!")
print(f"Original string: {test_string}")
print(f"Encoded tokens: {encoded_tokens}")
print(f"Decoded string: {decoded_string}")

Merge 1: 'e', ' ' -> 'e '
Merge 2: 't', 'h' -> 'th'
Merge 3: 't', ' ' -> 't '
Merge 4: 's', ' ' -> 's '
Merge 5: 'd', ' ' -> 'd '
Verification Successful!
Original string: O Romeo, Romeo, wherefore art thou Romeo?
Encoded tokens: [27, 1, 30, 101, 43, 294, 30, 101, 43, 294, 191, 365, 464, 81, 67, 455, 30, 101, 43, 53, 12]
Decoded string: O Romeo, Romeo, wherefore art thou Romeo?


### 2.2 Question [5 pts]

**Question:** List the first 5 merge rules your tokenizer learned during training. Based on the `tinyshakespeare.txt` corpus, provide a brief explanation for why these specific pairs of characters were likely the first to be merged.

**Answer:**

*Your answer here. You should list the merges (e.g., ('t', 'h') -> 'th') and explain their high frequency in the context of Shakespearean English.*

Merge 1: 'e', ' ' -> 'e '
Merge 2: 't', 'h' -> 'th'
Merge 3: 't', ' ' -> 't '
Merge 4: 's', ' ' -> 's '
Merge 5: 'd', ' ' -> 'd '

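Merges 1, 3, 4, and 5 each join a word-final letter with the following space, which is reasonable: e, t, s, and d are all common word endings. 'e' is the most frequent letter overall; 't' ends words such as art, not, it, and but; 's' ends words such as his and lords, as well as third-person singular verbs and plurals; 'd' forms the past-tense ending (as part of -ed) and ends high-frequency words such as and, had, and lord.  
'th' is the classic high-frequency English bigram.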

## Section 3: Text generation

In this section, you will implement an LSTM-based model to generate sentences that mimic the style of the Shakespearean corpus. You will use the GloVe embeddings from Section 1 as a pre-trained embedding layer.

In [6]:
# Load necessary packages. Feel free to add ones that you need.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

### 3.1 Load and preprocess text [5 pts]
You can choose a training corpus from the provided texts. Though the texts are much cleaner than random web crawls, you may still want to perform some preprocessing.

In [7]:
# Load the text file
with open('tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()


def preprocess_text(text):
    import re
    text = text.lower()
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'([.,?!;:"()])', r' \1 ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    tokens = text.split(' ')
    return tokens

processed_text = preprocess_text(text)

### 3.2 Build vocabulary and setup embedding matrix [5 pts]
Create a vocabulary from your processed text. Then, create an embedding matrix where the i-th row corresponds to the GloVe vector for the i-th word in your vocabulary.
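Because vocabulary words missing from GloVe end up with no pre-trained vector, it can be useful to check how much of the vocabulary is actually covered. The following is a sketch on made-up stand-in dictionaries; `glove_coverage`, `toy_word2idx`, and `toy_glove` are hypothetical names, and the toy data says nothing about which words real GloVe files contain.

```python
# Toy stand-ins for the real objects (made up for illustration)
toy_word2idx = {"thou": 0, "art": 1, "prithee": 2, ",": 3}
toy_glove = {"thou": [0.1, 0.2], "art": [0.3, 0.4], ",": [0.5, 0.6]}

def glove_coverage(word2idx, glove_embeddings):
    """Return (fraction of vocabulary words found in GloVe, sorted list of missing words)."""
    missing = sorted(word for word in word2idx if word not in glove_embeddings)
    covered = len(word2idx) - len(missing)
    return covered / len(word2idx), missing

coverage, missing = glove_coverage(toy_word2idx, toy_glove)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

Inspecting the missing words can also reveal preprocessing mismatches, for example tokens that still carry punctuation or casing that the GloVe vocabulary does not use.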

In [8]:
# Build Vocabulary from the processed text
# TODO: Create a set of unique words, then create word-to-index (word2idx) and index-to-word (idx2word) mappings.


# Create the embedding matrix from GloVe
def create_embedding_matrix(word2idx, glove_embeddings, embedding_dim):
    # Initialize matrix with zeros
    embedding_matrix = np.zeros((len(word2idx), embedding_dim))
    # TODO: 
    # For each word in your vocabulary, if it exists in glove_embeddings, 
    # add its vector to the matrix at the correct index.
    # Words not found in GloVe will remain as zero vectors.
    for word, idx in word2idx.items():
        if word in glove_embeddings:
            embedding_matrix[idx] = glove_embeddings[word]
    return torch.FloatTensor(embedding_matrix)
unique_words = sorted(list(set(processed_text)))
word2idx = {word: i for i, word in enumerate(unique_words)}
idx2word = {i: word for word, i in word2idx.items()}

vocab_size = len(word2idx)
embedding_dim = 100
embedding_matrix = create_embedding_matrix(word2idx, glove_embeddings, embedding_dim)
token_indices = [word2idx[word] for word in processed_text]

### 3.3 Implement the dataset [10 pts]
The text generation task uses next-word prediction as its objective. You should construct your dataset using a sliding window approach. For a sequence of length `n`, the first `n-1` words will be the input, and the `n`-th word will be the target.
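The sliding window described above can be sketched on a short list of made-up token indices. The generator name `sliding_windows` is hypothetical; it only illustrates how inputs and targets line up and is not a replacement for the `Dataset` class.

```python
def sliding_windows(token_indices, seq_length):
    """Yield (input_indices, target_index) pairs from windows of `seq_length` tokens."""
    for start in range(len(token_indices) - seq_length + 1):
        window = token_indices[start:start + seq_length]
        yield window[:-1], window[-1]

toy_indices = [10, 11, 12, 13, 14]  # made-up token indices
samples = list(sliding_windows(toy_indices, seq_length=3))
print(samples)
```

A corpus of `N` tokens yields `N - seq_length + 1` samples, and consecutive samples overlap in all but one token.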

In [9]:
# Construct your dataset
class TextDataset(Dataset):

    def __init__(self, token_indices, seq_length):
        # TODO: Create input sequences and their corresponding targets.
        # For a given seq_length, each sample should be (sequence_of_indices, next_word_index).
        self.token_indices = token_indices
        self.seq_length = seq_length

    def __len__(self):
        # TODO
        return len(self.token_indices) - self.seq_length + 1

    def __getitem__(self, idx):
        # TODO

        window = self.token_indices[idx:idx + self.seq_length]

        # Split into input and target
        x = torch.tensor(window[:-1], dtype=torch.long)
        y = torch.tensor(window[-1], dtype=torch.long)

        return x, y

### 3.4 Implement the LSTM model [10 pts]
You will use `nn.Embedding`, `nn.LSTM`, and `nn.Linear` to build your model. The embedding layer should be initialized with the pre-trained GloVe matrix.

In [10]:
# Construct your model
class TextGenLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, embedding_matrix):
        super(TextGenLSTM, self).__init__()
        # TODO:  
        # 1. Create an embedding layer (nn.Embedding). Load the pre-trained embedding_matrix and set freeze=True to prevent it from being trained.
        # 2. Create an LSTM layer (nn.LSTM).
        # 3. Create a fully connected layer (nn.Linear) to map LSTM output to vocabulary size.
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(0.2)
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim

    def forward(self, x, hidden):
        # TODO
        embed_out = self.embedding(x)
        # lstm_out shape: (batch_size, seq_len, hidden_dim)
        # hidden shape: ( (num_layers, batch_size, hidden_dim), ... )
        lstm_out, hidden = self.lstm(embed_out, hidden)
        lstm_out = self.dropout(lstm_out)
        # logits shape: (batch_size, seq_len, vocab_size)
        logits = self.fc(lstm_out)
        logits = logits[:, -1, :]  # Get the logits for the last time step
        return logits, hidden
    def init_hidden(self, batch_size):
        # Initialize hidden and cell states with zeros on the same device as the model
        device = next(self.parameters()).device
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=device)
        return (h0, c0)

### 3.5 Implement a generate_text function [10 pts]
This function will take a starting sequence (prompt) and generate a specified number of new words. To get more interesting results than simple greedy decoding (always picking the most probable word), try implementing a sampling strategy like top-k sampling.
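The idea behind top-k sampling can be shown without any deep learning library. The sketch below is a pure-Python illustration on made-up logits; `top_k_sample` is a hypothetical helper, and a PyTorch implementation would typically use tensor operations instead.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample an index from the k highest logits, weighted by softmax over those k values."""
    # Indices of the k largest logits
    top_indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax numerators over the selected logits (subtract the max for numerical stability)
    max_logit = max(logits[i] for i in top_indices)
    weights = [math.exp(logits[i] - max_logit) for i in top_indices]
    # random.choices normalizes the weights internally
    return rng.choices(top_indices, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, 3.0, -1.0, 1.0]  # made-up scores for a 5-word vocabulary
rng = random.Random(0)
draws = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(100)]
greedy = top_k_sample(toy_logits, k=1)
```

With `k=2` only the two highest-scoring indices can ever be drawn, and with `k=1` the procedure reduces to greedy decoding.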

In [11]:
# Generate text with your model
def generate_text(model, start_sequence, num_words_to_generate, word2idx, idx2word, device, top_k=5):
    """
    Generates text using the trained model and a top-k sampling strategy.
    """
    # TODO:
    # 1. Set the model to evaluation mode.
    # 2. Convert start_sequence to a tensor of indices.
    # 3. Generate one word at a time for num_words_to_generate:
    #    a. Feed the current sequence to the model.
    #    b. Get the output logits for the next word.
    #    c. Apply top-k sampling: get the top k logits and their indices, convert them to probabilities using softmax, and sample from this new distribution.
    #    d. Append the sampled word's index to the sequence and use it as input for the next step.
    # 4. Convert the final sequence of indices back to words and return as a string.
    model.eval()
    global SEQ_LENGTH 
    input_length = SEQ_LENGTH
        
    generated_indices = [word2idx[word] for word in start_sequence.split() if word in word2idx]
    
    # Keep at most the last input_length tokens of the start sequence
    current_indices = generated_indices[-input_length:]    
    input_seq = torch.tensor(current_indices, dtype=torch.long).unsqueeze(0).to(device)
    hidden = None
    
    for _ in range(num_words_to_generate):
        with torch.no_grad():
            output, hidden = model(input_seq, hidden)
            
            # output already has shape (1, vocab_size); drop the batch dimension
            logits = output.squeeze(0)  # shape is now (vocab_size,)
            
            top_k_logits, top_k_indices = torch.topk(logits, top_k)
            probabilities = torch.softmax(top_k_logits, dim=-1)
            sampled_index_in_k = torch.multinomial(probabilities, 1).item()
            next_word_index = top_k_indices[sampled_index_in_k].item()
            
            # Append the sampled index and slide the input window
            generated_indices.append(next_word_index)
            current_indices = generated_indices[-input_length:]
            input_seq = torch.tensor(current_indices, dtype=torch.long).unsqueeze(0).to(device)    
            
    generated_words = [idx2word.get(idx, '<UNK>') for idx in generated_indices]
    return " ".join(generated_words)

### 3.6 Implement the training loop [10 pts]
Train your model. During each epoch, log the average training and validation loss. It's also highly recommended to generate a short piece of text after each epoch to see how the model's creative abilities evolve.

In [12]:
def train_model(model, train_loader, val_loader, criterion, optimizer, device, epochs=10):
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            # TODO: Your training steps here (zero grad, forward pass, loss, backward, step)
            # Remember to detach the hidden state to prevent backpropagating through the entire history.
            optimizer.zero_grad()
            hidden = model.init_hidden(inputs.size(0))
            outputs, hidden = model(inputs, hidden)
            loss = criterion(outputs, targets)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)  # Gradient clipping
            optimizer.step()
            train_loss += loss.item()

        # Calculate average training loss
        avg_train_loss = train_loss / len(train_loader)

        # Validation loop
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                
                hidden = model.init_hidden(inputs.size(0)) 
                outputs, _ = model(inputs, hidden)                 
                loss = criterion(outputs, targets)
                val_loss += loss.item()

        # Calculate average validation loss
        avg_val_loss = val_loss / len(val_loader)

        # Logging and generating sample text
        print(f'Epoch {epoch+1}/{epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}')
        
        # TODO: Call your generate_text function with a fixed prompt (e.g., "shall i compare thee")
        # and print the generated text to observe the model's progress.
        prompt = "shall i compare thee"
        generated_text = generate_text(model, prompt, num_words_to_generate=20, word2idx=word2idx, idx2word=idx2word, device=device, top_k=5)
        print(f'Generated Text: {generated_text}\n')
        
    return model

In [16]:
# Initialize hyperparameters, model, optimizer, etc., and start the training process.
# TODO
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SEQ_LENGTH = 60
BATCH_SIZE = 128
HIDDEN_DIM = 128
NUM_LAYERS = 1
LEARNING_RATE = 0.001
EPOCHS = 10
EMBEDDING_DIM = 100

# 3. Dataset and DataLoader
split_idx = int(len(token_indices) * 0.9)  # 90% training, 10% validation
train_indices = token_indices[:split_idx]
val_indices = token_indices[split_idx:]

# Create the datasets
train_dataset = TextDataset(train_indices, SEQ_LENGTH)
val_dataset = TextDataset(val_indices, SEQ_LENGTH)

# Create the DataLoaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

# 4. Initialize the model, loss function, and optimizer
model = TextGenLSTM(vocab_size, embedding_dim, HIDDEN_DIM, NUM_LAYERS, embedding_matrix).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-5)

# 5. Start training
print(f"Starting training on {device}...")
trained_model = train_model(model, train_loader, val_loader, criterion, optimizer, device, 
                            epochs=EPOCHS)

Starting training on cuda...
Epoch 1/10, Train Loss: 6.0144, Val Loss: 5.7731
Generated Text: shall i compare thee , i am a of a : and he have , that is the world , and the his ,

Epoch 2/10, Train Loss: 5.5215, Val Loss: 5.5843
Generated Text: shall i compare thee , and my lord . king henry vi : the queen is a man . gloucester : my lord ,

Epoch 3/10, Train Loss: 5.3270, Val Loss: 5.4900
Generated Text: shall i compare thee ; and i have , my father . king henry vi richard : i have not to be , and

Epoch 4/10, Train Loss: 5.1919, Val Loss: 5.4429
Generated Text: shall i compare thee , and i have , for that i will be a little man . romeo : o , thou art

Epoch 5/10, Train Loss: 5.0902, Val Loss: 5.4056
Generated Text: shall i compare thee , to thy love to the tower , and the duke of york . king edward iv : i know

Epoch 6/10, Train Loss: 5.0081, Val Loss: 5.3851
Generated Text: shall i compare thee , and i will not so , that i have been a man of my heart , and i will

Epoch 7/10, Train L

### 3.7 Question: Design Choices [15 pts]

**Question:** Discuss **at least two** design choices you made during the implementation of your text generation model (Section 3) and explain how they impacted the final result. You can discuss any of the steps, from text preprocessing and dataset construction to model architecture and the text generation strategy.

For each choice, describe:
1.  **What was the choice?** (e.g., sequence length in the dataset, number of LSTM layers, using top-k sampling vs. greedy decoding).
2.  **What was your rationale for this choice?** (e.g., 'I chose a longer sequence length to capture more context...' or 'I used top-k sampling to avoid repetitive text...').
3.  **How did it affect the outcome?** (e.g., 'This resulted in more coherent sentences but increased training time.' or 'The generated text became more diverse and less predictable.').

*Your answer will be evaluated based on the clarity and depth of your rationale. **Please note:** The goal of this question is to encourage reflection. As long as you clearly explain your choices and your reasoning, you will receive full credit, so you don't need to write a lot or worry about losing points.*

In [17]:
# Initialize hyperparameters, model, optimizer, etc., and start the training process.
# TODO
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SEQ_LENGTH = 30
BATCH_SIZE = 128
HIDDEN_DIM = 128
NUM_LAYERS = 1
LEARNING_RATE = 0.001
EPOCHS = 10
EMBEDDING_DIM = 100

# 3. Dataset and DataLoader
split_idx = int(len(token_indices) * 0.9)  # 90% training, 10% validation
train_indices = token_indices[:split_idx]
val_indices = token_indices[split_idx:]

# Create the datasets
train_dataset = TextDataset(train_indices, SEQ_LENGTH)
val_dataset = TextDataset(val_indices, SEQ_LENGTH)

# Create the DataLoaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

# 4. Initialize the model, loss function, and optimizer
model = TextGenLSTM(vocab_size, embedding_dim, HIDDEN_DIM, NUM_LAYERS, embedding_matrix).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-3)

# 5. Start training
print(f"Starting training on {device}...")
trained_model = train_model(model, train_loader, val_loader, criterion, optimizer, device, 
                            epochs=EPOCHS)

Starting training on cuda...
Epoch 1/10, Train Loss: 6.2935, Val Loss: 6.1014
Generated Text: shall i compare thee . and and the the the a the the , and and and i , the the the his ,

Epoch 2/10, Train Loss: 6.0136, Val Loss: 5.9756
Generated Text: shall i compare thee to my of of his of of my lord , and you you have , the the is , that

Epoch 3/10, Train Loss: 5.9379, Val Loss: 5.9700
Generated Text: shall i compare thee have you to , that the this king . king ! , that he have i , that , the

Epoch 4/10, Train Loss: 5.9098, Val Loss: 5.9249
Generated Text: shall i compare thee . duke , and you have , i have be the the , and the the lord . the york

Epoch 5/10, Train Loss: 5.8967, Val Loss: 5.9248
Generated Text: shall i compare thee , but is , i have you to him to to , i have not , to me . and

Epoch 6/10, Train Loss: 5.8881, Val Loss: 5.9281
Generated Text: shall i compare thee . the york . the the king , but you not the the lord ; and the the the is

Epoch 7/10, Train Loss: 5.8836, 

In [13]:
# Initialize hyperparameters, model, optimizer, etc., and start the training process.
# TODO
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SEQ_LENGTH = 60
BATCH_SIZE = 128
HIDDEN_DIM = 128
NUM_LAYERS = 1
LEARNING_RATE = 0.001
EPOCHS = 10
EMBEDDING_DIM = 100

# 3. Dataset and DataLoader
split_idx = int(len(token_indices) * 0.9)  # 90% training, 10% validation
train_indices = token_indices[:split_idx]
val_indices = token_indices[split_idx:]

# Create the datasets
train_dataset = TextDataset(train_indices, SEQ_LENGTH)
val_dataset = TextDataset(val_indices, SEQ_LENGTH)

# Create the DataLoaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

# 4. Initialize the model, loss function, and optimizer
model = TextGenLSTM(vocab_size, embedding_dim, HIDDEN_DIM, NUM_LAYERS, embedding_matrix).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# 5. Start training
print(f"Starting training on {device}...")
trained_model = train_model(model, train_loader, val_loader, criterion, optimizer, device, 
                            epochs=EPOCHS)

Starting training on cuda...
Epoch 1/10, Train Loss: 6.0656, Val Loss: 5.8638
Generated Text: shall i compare thee . first : you have not be be , but i have , to you to you to the ,

Epoch 2/10, Train Loss: 5.5084, Val Loss: 5.7051
Generated Text: shall i compare thee , i am the world , and you have to the king ? what is my father ? o ,

Epoch 3/10, Train Loss: 5.2839, Val Loss: 5.6702
Generated Text: shall i compare thee to my brother ; and you , and i have been to the king ; and that you will not

Epoch 4/10, Train Loss: 5.1194, Val Loss: 5.6639
Generated Text: shall i compare thee : and , i do , and you have heard to the world , that we shall be a thousand

Epoch 5/10, Train Loss: 4.9792, Val Loss: 5.6878
Generated Text: shall i compare thee , and , i am not , i will not . i will not . king richard iii : i

Epoch 6/10, Train Loss: 4.8584, Val Loss: 5.7063
Generated Text: shall i compare thee , and , for i would not have been a happy . king henry vi : why , i am

Epoch 7/10, Train Lo

In [18]:
# Initialize hyperparameters, model, optimizer, etc., and start the training process.
# TODO
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SEQ_LENGTH = 60
BATCH_SIZE = 128
HIDDEN_DIM = 128
NUM_LAYERS = 1
LEARNING_RATE = 0.001
EPOCHS = 10
EMBEDDING_DIM = 100

# 3. Dataset and DataLoader
# Contiguous 90/10 split: the first 90% of tokens are used for training,
# the last 10% for validation.
split_idx = int(len(token_indices) * 0.9)
train_indices = token_indices[:split_idx]
val_indices = token_indices[split_idx:]

# Create the Datasets
train_dataset = TextDataset(train_indices, SEQ_LENGTH)
val_dataset = TextDataset(val_indices, SEQ_LENGTH)

# Create the DataLoaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

# 4. Initialize the model, loss function, and optimizer
model = TextGenLSTM(vocab_size, embedding_dim, HIDDEN_DIM, NUM_LAYERS, embedding_matrix).to(device)
criterion = nn.CrossEntropyLoss()
# Same setup as the previous run, but with L2 weight decay added to Adam
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-3)

# 5. Start training
print(f"Starting training on {device}...")
trained_model = train_model(model, train_loader, val_loader, criterion, optimizer, device, 
                            epochs=EPOCHS)

Starting training on cuda...
Epoch 1/10, Train Loss: 6.3057, Val Loss: 6.1139
Generated Text: shall i compare thee . i i , , i , , the my , and the the the . i be , that

Epoch 2/10, Train Loss: 6.0301, Val Loss: 6.0097
Generated Text: shall i compare thee . the , and and my lord , and my lord , and i , and , that have have

Epoch 3/10, Train Loss: 5.9571, Val Loss: 5.9806
Generated Text: shall i compare thee i will the my , and my , to the a the king : and i have be you have

Epoch 4/10, Train Loss: 5.9269, Val Loss: 5.9652
Generated Text: shall i compare thee have be be the , and , my lord , and i am , i have i do have ,

Epoch 5/10, Train Loss: 5.9140, Val Loss: 5.9458
Generated Text: shall i compare thee , and i have be not be to the my lord . king of the lord , my , and

Epoch 6/10, Train Loss: 5.9068, Val Loss: 5.9354
Generated Text: shall i compare thee i , the a , and and the , and my lord , and you have the a of ,

Epoch 7/10, Train Loss: 5.9019, Val Loss: 5.9478
Generated Tex

In [15]:
print(generate_text(trained_model, "shall i compare thee", num_words_to_generate=20, word2idx=word2idx, idx2word=idx2word, device=device, top_k=5))
print(generate_text(trained_model, "shall i compare thee", num_words_to_generate=20, word2idx=word2idx, idx2word=idx2word, device=device, top_k=10))
print(generate_text(trained_model, "shall i compare thee", num_words_to_generate=20, word2idx=word2idx, idx2word=idx2word, device=device, top_k=1))

shall i compare thee , but that i am not , and , i am not , for i will not , but ,
shall i compare thee now , but i am not to hear it , and so my lord . lady anne : ay ,
shall i compare thee , and i am a king , and i am not to be revenged . king richard ii : i


Answer: 

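#### **Choice 1: Dataset Sequence Length**

* **1. What did I choose?**
    I compared two sequence-length (`SEQ_LENGTH`) settings: 30 and 60.
* **2. Why?**
    To explore the trade-off between training efficiency (an LSTM's cost is O(n) in the sequence length) and model performance (longer vs. shorter context).
* **3. How did it affect the results?**
    * **Training time**: Training with `SEQ_LENGTH=30` took close to half the time of `SEQ_LENGTH=60`, which matches the linear scaling expected of LSTM training.
    * **Model loss**: The `SEQ_LENGTH=30` model converged to a somewhat higher final loss, which is also expected, since shorter sequences provide less context.
    * **Generation quality**: Despite the difference in loss, the two settings differed little in *generation quality*. This is probably because the test prompt is very short, far below the 30-token limit, so the model had no opportunity to exploit the longer context.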
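
The roughly 2x difference in training time can be checked with simple arithmetic. The sketch below assumes that the dataset yields stride-1 sliding windows of `seq_length` tokens (the actual `TextDataset` may differ) and uses a made-up corpus size; `lstm_steps_per_epoch` is a hypothetical helper, not part of the assignment code.

```python
def lstm_steps_per_epoch(num_tokens: int, seq_length: int) -> int:
    """Count recurrent time steps processed in one epoch.

    Assumption: samples are stride-1 sliding windows of `seq_length` input
    tokens with a one-token-shifted target, giving `num_tokens - seq_length`
    windows, each costing `seq_length` LSTM steps.
    """
    num_windows = max(num_tokens - seq_length, 0)
    return num_windows * seq_length


num_tokens = 250_000  # hypothetical corpus size, not the real token count
steps_30 = lstm_steps_per_epoch(num_tokens, 30)
steps_60 = lstm_steps_per_epoch(num_tokens, 60)
print(f"work ratio (30 vs 60): {steps_30 / steps_60:.3f}")
```

Because the number of windows barely changes when the corpus is much longer than the window, the per-epoch work is dominated by the window length, so halving `SEQ_LENGTH` roughly halves the work.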

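#### **Choice 2: Tokenization Strategy**

* **1. What did I choose?**
    Instead of plain whitespace splitting, I implemented a preprocessing step that adds spaces around punctuation marks so that they are explicitly separated from words (e.g., `"revenged."` becomes `"revenged"` and `"."`).
* **2. Why?**
    The pretrained GloVe embeddings provide vectors for "clean" words, without attached punctuation. Naive splitting produces many out-of-vocabulary (OOV) tokens such as `"revenged."`, which cannot be found in the embedding matrix. Separating the punctuation ensures that the valid word (`"revenged"`) maps to its pretrained vector, making the most of the GloVe embeddings.
* **3. How did it affect the results?**
    This choice significantly reduced the number of OOV tokens. It lets the model learn the syntactic role of punctuation as separate tokens and improves the overall mapping of words to meaningful embeddings, which in turn improves model performance and its ability to generate grammatically plausible text.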
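
A minimal sketch of this kind of punctuation-aware tokenizer is shown below. The exact punctuation set and the lowercasing step are assumptions made for illustration (GloVe 6B is uncased); the function name `tokenize` is hypothetical and may not match the one used earlier in the notebook.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split punctuation into separate tokens."""
    lowered = text.lower()
    # Surround each punctuation mark with spaces so that split() isolates it.
    spaced = re.sub(r"([.,!?;:])", r" \1 ", lowered)
    return spaced.split()


sentence = "I will be revenged. What, my lord?"
naive_tokens = sentence.lower().split()  # keeps "revenged." as one token
clean_tokens = tokenize(sentence)        # yields "revenged" and "." separately
print(naive_tokens)
print(clean_tokens)
```

With naive splitting, `"revenged."` would need its own (missing) embedding row; after the regex step, both `"revenged"` and `"."` can be looked up independently.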

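#### **Choice 3: Sampling Strategy for Text Generation**

* **1. What did I choose?**
    I compared three decoding strategies for text generation: greedy decoding (always pick the most probable word), Top-k sampling with $k=5$, and Top-k sampling with $k=10$.
* **2. Why?**
    The goal is to balance fluency and diversity. Greedy decoding tends to produce repetitive, deterministic, dull text. Introducing sampling (such as Top-k) adds randomness and yields more diverse, more "creative" text, but too much randomness (for example, too large a $k$) can make the text incoherent.
* **3. How did it affect the results?**
    The results show a clear trade-off. Greedy decoding produces text with high local fluency, but it quickly becomes repetitive and has low lexical diversity. Top-10 sampling often produces ungrammatical text that lacks coherence. Top-5 sampling offers the best compromise: its output is far more diverse than greedy decoding yet less chaotic than Top-10, while still keeping reasonable sentence structure and fluency.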
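
The relationship between the three strategies can be illustrated with a small, framework-free sketch of Top-k sampling. It operates on a plain list of logits rather than on model outputs, and `sample_top_k` is a hypothetical helper, not the notebook's `generate_text`; with `top_k=1` it reduces to greedy decoding.

```python
import math
import random
from typing import List, Optional


def sample_top_k(logits: List[float], top_k: int,
                 rng: Optional[random.Random] = None) -> int:
    """Sample an index from the softmax restricted to the top_k largest logits."""
    if rng is None:
        rng = random.Random()
    # Indices of the top_k largest logits, best first.
    top_indices = sorted(range(len(logits)),
                         key=lambda i: logits[i], reverse=True)[:top_k]
    # Unnormalized softmax weights; subtracting the max keeps exp() stable.
    max_logit = logits[top_indices[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top_indices]
    return rng.choices(top_indices, weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # toy scores over a 5-word vocabulary
greedy = sample_top_k(logits, top_k=1)  # always the argmax
samples = {sample_top_k(logits, top_k=2, rng=random.Random(seed))
           for seed in range(50)}       # only ever draws from the 2 best words
print(greedy, samples)
```

A larger `top_k` widens the candidate set, which increases diversity but also admits lower-probability words that can break coherence.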

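#### **Choice 4: Regularization Strategy**

* **1. What did I choose?**
    I experimented with different `weight_decay` values in the optimizer, comparing a low value (1e-5), a medium value (1e-3), and no weight decay.
* **2. Why?**
    `tinyshakespeare` is a very small dataset, which makes the model highly prone to overfitting.
* **3. How did it affect the results?**
    Given the small dataset, every setting showed signs of overfitting. However, 1e-5 struck the best balance. Compared with no decay, it noticeably slowed the onset of overfitting; at the same time, it converged to a better (lower) final validation loss than the more aggressive 1e-3 setting. This suggests that, on this particular dataset, 1e-5 is the most effective choice for finding a solution that generalizes.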
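
One simple way to locate the onset of overfitting is to find the epoch with the lowest validation loss. The sketch below applies a hypothetical helper, `best_epoch`, to the validation losses copied from the two training logs shown earlier (no weight decay, and `weight_decay=1e-3`); only the epochs visible in those truncated logs are included, and the 1e-5 run is omitted because its log is not reproduced here.

```python
from typing import List, Tuple


def best_epoch(val_losses: List[float]) -> Tuple[int, float]:
    """Return the 1-based epoch with the lowest validation loss, and that loss."""
    idx = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return idx + 1, val_losses[idx]


# Validation losses copied from the logs above (later epochs are truncated).
val_no_decay = [5.8638, 5.7051, 5.6702, 5.6639, 5.6878, 5.7063]            # epochs 1-6
val_decay_1e3 = [6.1139, 6.0097, 5.9806, 5.9652, 5.9458, 5.9354, 5.9478]   # epochs 1-7

print("no weight decay:", best_epoch(val_no_decay))
print("weight_decay=1e-3:", best_epoch(val_decay_1e3))
```

In the run without weight decay the validation loss bottoms out at epoch 4 while the training loss keeps falling, which is the overfitting pattern described above; the 1e-3 run overfits later but plateaus at a higher validation loss.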

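Note: For clarity, AI was used to help organize and write the answer above. The original text follows:

For sequence length, I chose two settings, 30 and 60. In terms of training time, a sequence length of 30 does take close to half the time of 60, which is consistent with the O(n) complexity of LSTM training. In terms of loss, the model trained with a sequence length of 30 ends with a somewhat higher loss, which matches the intuition that longer sequences carry more information. Overall, though, the two differ little in generation quality. I think this is reasonable, because the generation prompt is very short, far below the 30-token limit used in training.

For weight decay as a guard against overfitting, I tried three settings: 1e-5, 1e-3, and none. It is clear that this dataset is too small for a generation task. None of the settings produces satisfying sentences, and all of them show some degree of overfitting. Overall, 1e-5 is a fairly balanced choice: overfitting is somewhat alleviated, and the solution it converges to is also better behaved, reaching a relatively low loss.

For the tokenizer, I chose a method that splits punctuation marks off as separate tokens. Since the original word embeddings are relatively clean, forcing punctuation apart by adding spaces before and after it avoids invalid tokens such as `revenged.` and improves the model's performance.

For the sampling strategy, I tried Top-5, Top-10, and greedy decoding. The greedy results are relatively fluent, but the diversity of the generated vocabulary is poor, while the Top-10 sentences are less coherent and carry no clear meaning.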