## 1. Deep RNNs
---

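A deep RNN stacks multiple hidden layers: each layer passes its hidden state forward in time to itself and upward to the next layer at the same time step.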

In [2]:
from IPython.display import Image, display
url = 'https://d2l.ai/_images/deep-rnn.svg'
display(Image(url=url, width=200))

## 2. Mathematical formulation

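Hidden state of layer $l$ at time step $t$: $H_t^{(l)} = \phi(W_{hh}^{(l)}H_{t-1}^{(l)}+W_{xh}^{(l)}H_t^{(l-1)}+b_h^{(l)})$

In particular, the input is treated as layer zero: $H_t^{(0)} = X_t$

Predicted output at time step $t$, computed from the top layer $L$: $\hat X_t = \phi(W_{hx}^{(L)}H_{t}^{(L)}+b_x)$

The hidden-state update can be rewritten with block matrix multiplication: $H_t^{(l)} = \phi(\left[\begin{array}{c|c}W_{hh}^{(l)} & W_{xh}^{(l)}\end{array}\right] \left[\begin{array}{c}H_{t-1}^{(l)}\\ \hline H_t^{(l-1)}\end{array}\right]+b_h^{(l)})$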
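
The recurrence above can be sketched in plain Python to make the data flow concrete. This is a minimal illustration, not the implementation used later in this notebook: it assumes $\phi = \tanh$, random toy weights, and hypothetical helper names (`matvec`, `rnn_layer_step`, `rnn_layer_step_block`, `deep_rnn_step`). It also checks numerically that the block-matrix form gives the same result as the two-term form.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_layer_step(W_hh, W_xh, b, h_prev, h_below):
    """Two-term form: tanh(W_hh h_prev + W_xh h_below + b)."""
    a = matvec(W_hh, h_prev)
    c = matvec(W_xh, h_below)
    return [math.tanh(ai + ci + bi) for ai, ci, bi in zip(a, c, b)]

def rnn_layer_step_block(W_hh, W_xh, b, h_prev, h_below):
    """Block form: [W_hh | W_xh] applied to the stacked vector [h_prev; h_below]."""
    W_block = [r_hh + r_xh for r_hh, r_xh in zip(W_hh, W_xh)]  # concatenate columns
    stacked = h_prev + h_below                                   # stack the two vectors
    return [math.tanh(z + bi) for z, bi in zip(matvec(W_block, stacked), b)]

def deep_rnn_step(layers, h_prev_all, x_t):
    """Advance every layer by one time step; the input x_t plays the role of layer 0."""
    h_below, h_new_all = x_t, []
    for (W_hh, W_xh, b), h_prev in zip(layers, h_prev_all):
        h_below = rnn_layer_step(W_hh, W_xh, b, h_prev, h_below)
        h_new_all.append(h_below)
    return h_new_all

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen for illustration only
input_dim, hidden_dim, num_layers = 3, 2, 3
layers = []
for l in range(num_layers):
    in_dim = input_dim if l == 0 else hidden_dim
    layers.append((rand_matrix(hidden_dim, hidden_dim),
                   rand_matrix(hidden_dim, in_dim),
                   [0.1] * hidden_dim))

h0 = [[0.0] * hidden_dim for _ in range(num_layers)]
x_t = [1.0, -2.0, 0.5]
h1 = deep_rnn_step(layers, h0, x_t)

# Compare the two algebraic forms on the first layer with a nonzero previous state
W_hh, W_xh, b = layers[0]
h_prev = [0.3, -0.2]
plain = rnn_layer_step(W_hh, W_xh, b, h_prev, x_t)
block = rnn_layer_step_block(W_hh, W_xh, b, h_prev, x_t)
max_diff = max(abs(p - q) for p, q in zip(plain, block))
print(h1)
print(max_diff)
```

Note that only the first layer's input weight has `input_dim` columns; every higher layer consumes a hidden state of size `hidden_dim` from the layer below.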

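## 3. Language modeling

This section implements a basic RNN language model on the WikiText-2 dataset: the network reads a token sequence and outputs a distribution over the vocabulary at every position.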

### 3.1 Import the required packages

In [102]:
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2TokenizerFast, AutoModelForCausalLM
from datasets import load_dataset
import torch.optim as optim
from torch.amp import GradScaler, autocast

### 3.2 Set the hyperparameters

In [103]:
config = {
    "dataset_name": "wikitext",
    "dataset_config": "wikitext-2-raw-v1",
    "model_name": "gpt2",
    "batch_size": 32,
    "seq_length": 64,
    "embed_dim": 64,
    "hidden_dim": 32,
    "num_layers": 3,
    "dropout": 0.2,
    "lr": 1e-4,
    "epochs": 20,
    "device": "cuda" if torch.cuda.is_available() else "cpu"
}

### 3.3 Prepare the data

In [1]:
# Load the training split of the WikiText-2 dataset
dataset = load_dataset(config['dataset_name'], config['dataset_config'], cache_dir='./data')['train']


In [104]:
for i in range(10):
    print(dataset[i])

{'text': ''}
{'text': ' = Valkyria Chronicles III = \n'}
{'text': ''}
{'text': ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n'}
{'text': " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adju

In [99]:
# Tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained(config['model_name'], cache_dir='./cache')
# GPT-2 has no padding token by default, so register one (this grows the vocabulary by 1)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
print(len(tokenizer))

50258


In [86]:
# Example text
text = "The Hub has support for dozens of libraries in the Open Source ecosystem. \
Thanks to the huggingface_hub Python library, it’s easy to enable sharing your models on the Hub. \
The Hub supports many libraries, and we’re working on expanding this support. \
We’re happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward."

# Tokenize
tokens = tokenizer.tokenize(text)
print("Tokens:", len(tokens), tokens)

Tokens: 78 ['The', 'ĠHub', 'Ġhas', 'Ġsupport', 'Ġfor', 'Ġdozens', 'Ġof', 'Ġlibraries', 'Ġin', 'Ġthe', 'ĠOpen', 'ĠSource', 'Ġecosystem', '.', 'ĠThanks', 'Ġto', 'Ġthe', 'Ġhugging', 'face', '_', 'hub', 'ĠPython', 'Ġlibrary', ',', 'Ġit', 'âĢ', 'Ļ', 's', 'Ġeasy', 'Ġto', 'Ġenable', 'Ġsharing', 'Ġyour', 'Ġmodels', 'Ġon', 'Ġthe', 'ĠHub', '.', 'ĠThe', 'ĠHub', 'Ġsupports', 'Ġmany', 'Ġlibraries', ',', 'Ġand', 'Ġwe', 'âĢ', 'Ļ', 're', 'Ġworking', 'Ġon', 'Ġexpanding', 'Ġthis', 'Ġsupport', '.', 'ĠWe', 'âĢ', 'Ļ', 're', 'Ġhappy', 'Ġto', 'Ġwelcome', 'Ġto', 'Ġthe', 'ĠHub', 'Ġa', 'Ġset', 'Ġof', 'ĠOpen', 'ĠSource', 'Ġlibraries', 'Ġthat', 'Ġare', 'Ġpushing', 'ĠMachine', 'ĠLearning', 'Ġforward', '.']
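
The `Ġ` prefix and the `âĢ`, `Ļ` pieces in the output above are not corruption. GPT-2 uses byte-level BPE, which first maps every UTF-8 byte to a printable character. The sketch below re-implements that byte-to-character table as I understand it from the GPT-2 reference code; the function names `bytes_to_unicode` and `show` are used here for illustration and nothing is imported from `transformers`.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (space, control bytes, ...) is moved to 256, 257, ...
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

mapping = bytes_to_unicode()

def show(text):
    """Render text the way byte-level tokens are displayed."""
    return "".join(mapping[b] for b in text.encode("utf-8"))

print(show(" Hub"))      # the leading space becomes a visible marker
print(show("\u2019"))    # the typographic apostrophe is three UTF-8 bytes
```

Under this mapping a space is shown as `Ġ`, and the three bytes of the typographic apostrophe are shown as `âĢĻ`, which the tokenizer then splits into the two pieces seen in the token list.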


In [87]:
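# Preprocessing: split each text into fixed-length, overlapping windows
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=config["seq_length"],
        return_overflowing_tokens=True,  # keep the truncated remainder as extra windows
        stride=config["seq_length"] // 2,  # consecutive windows overlap by half a window
        return_tensors="pt"
    )

# Apply the preprocessing to the whole dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])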

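
The combination of `max_length`, `stride`, and `return_overflowing_tokens` in `tokenize_function` turns one long text into several overlapping windows. The following is a rough pure-Python sketch of that windowing idea, assuming `stride` means the number of tokens shared by consecutive windows. The helper `sliding_windows` is hypothetical and is not the tokenizer's actual implementation.

```python
def sliding_windows(token_ids, max_length, stride, pad_id):
    """Split token_ids into windows of max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far each window start advances
    windows, start = [], 0
    while True:
        chunk = token_ids[start:start + max_length]
        chunk = chunk + [pad_id] * (max_length - len(chunk))  # right-pad a short window
        windows.append(chunk)
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows

for w in sliding_windows(list(range(10)), max_length=4, stride=2, pad_id=-1):
    print(w)
```

In this sketch an empty token list yields a single window made entirely of padding, which is consistent with the all-padding rows that appear in the batches printed later: the dataset contains many empty `text` entries.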

In [95]:
encoding = tokenizer(text, padding=True, truncation=True, max_length=20, return_tensors="pt")
print(encoding)

{'input_ids': tensor([[  464, 14699,   468,  1104,   329,  9264,   286, 12782,   287,   262,
          4946,  8090, 13187,    13,  6930,   284,   262, 46292,  2550,    62]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [96]:
# Custom Dataset class
class TextDataset(Dataset):
    def __init__(self, tokenized_data):
        self.input_ids = tokenized_data["input_ids"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        input_ids = torch.tensor(self.input_ids[idx])
        # NOTE: the target is an unshifted copy of the input. For a true next-token
        # objective, return input_ids[:-1] and input_ids[1:] instead.
        return input_ids, input_ids.clone()

# Create the DataLoader
train_dataset = TextDataset(tokenized_dataset)
train_dataloader = DataLoader(train_dataset, batch_size=config["batch_size"], shuffle=True)
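
The comment in `__getitem__` describes the target as the next token, while the code returns the input and an identical copy. For reference, a next-token pair is usually built by shifting the sequence by one position. The sketch below shows the idea on a plain list; `make_lm_pair` is a hypothetical helper, and the same slicing would apply to a 1-D tensor.

```python
def make_lm_pair(token_ids):
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/target pair")
    return token_ids[:-1], token_ids[1:]

inputs, targets = make_lm_pair([464, 14699, 468, 1104, 329])
print(inputs)   # every token except the last
print(targets)  # every token except the first
```

With this shift both sequences are one token shorter than the window, so a window of `seq_length` tokens yields `seq_length - 1` training positions.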

In [97]:
for batch in train_dataloader:
    input_ids, target_ids = batch
    print(input_ids, target_ids)
    break

tensor([[12168,  6154,   423,  ..., 50256, 50256, 50256],
        [  796,   796, 12556,  ..., 50256, 50256, 50256],
        [  796,   796,  2159,  ..., 50256, 50256, 50256],
        ...,
        [50256, 50256, 50256,  ..., 50256, 50256, 50256],
        [  383,   749,  8018,  ..., 50256, 50256, 50256],
        [  554,  9162,   837,  ..., 50256, 50256, 50256]]) tensor([[12168,  6154,   423,  ..., 50256, 50256, 50256],
        [  796,   796, 12556,  ..., 50256, 50256, 50256],
        [  796,   796,  2159,  ..., 50256, 50256, 50256],
        ...,
        [50256, 50256, 50256,  ..., 50256, 50256, 50256],
        [  383,   749,  8018,  ..., 50256, 50256, 50256],
        [  554,  9162,   837,  ..., 50256, 50256, 50256]])


### 3.4 Define the network

In [46]:
# Define the RNN model: embedding -> stacked RNN -> linear projection onto the vocabulary
class RNNModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, dropout):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        x = self.embedding(x)
        x = self.dropout(x)
        x, _ = self.rnn(x)
        x = self.dropout(x)
        x = self.fc(x)
        return x

### 3.5 Instantiate the model

In [56]:
vocab_size = len(tokenizer)
model = RNNModel(vocab_size=vocab_size, embed_dim=config["embed_dim"], hidden_dim=config["hidden_dim"], 
                 num_layers=config["num_layers"], dropout=config["dropout"]).to(config["device"])

In [57]:
from torchinfo import summary
summary(model)

Layer (type:depth-idx)                   Param #
RNNModel                                 --
├─Embedding: 1-1                         3,216,512
├─RNN: 1-2                               7,360
├─Linear: 1-3                            1,658,514
├─Dropout: 1-4                           --
Total params: 4,882,386
Trainable params: 4,882,386
Non-trainable params: 0
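
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual layout of `nn.RNN` parameters (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors per layer); `rnn_param_count` is a hypothetical helper written for this check.

```python
def rnn_param_count(input_size, hidden_size, num_layers):
    """Count the parameters of a stacked, unidirectional vanilla RNN."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        total += hidden_size * in_size      # input-to-hidden weights
        total += hidden_size * hidden_size  # hidden-to-hidden weights
        total += 2 * hidden_size            # two bias vectors
    return total

# Values taken from the config and the tokenizer length printed earlier
vocab_size, embed_dim, hidden_dim, num_layers = 50258, 64, 32, 3

embedding = vocab_size * embed_dim
rnn = rnn_param_count(embed_dim, hidden_dim, num_layers)
linear = hidden_dim * vocab_size + vocab_size  # weights plus bias
print(embedding, rnn, linear, embedding + rnn + linear)
```

Almost all parameters sit in the embedding and the output projection; the three recurrent layers together account for only a few thousand.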

### 3.6 Choose the loss function and optimizer

In [58]:
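# Cross-entropy loss; positions whose target is the padding token are ignored
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = optim.AdamW(model.parameters(), lr=config["lr"])
scaler = GradScaler()  # gradient scaler for mixed-precision training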
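
To make the effect of `ignore_index` concrete, here is a small pure-Python sketch of a mean cross-entropy that skips ignored targets. It is meant to mirror the default mean reduction of `nn.CrossEntropyLoss` as I understand it. The function name `cross_entropy_ignore` and the toy logits are assumptions made for illustration.

```python
import math

def cross_entropy_ignore(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target != ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute neither loss nor count
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]
        count += 1
    return total / count if count else float("nan")

logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
targets = [0, 2, 1]

# Treat class 1 as the padding id: the third position is skipped
loss_with_ignore = cross_entropy_ignore(logits, targets, ignore_index=1)
loss_first_two = cross_entropy_ignore(logits[:2], targets[:2], ignore_index=-100)
print(loss_with_ignore, loss_first_two)
```

Because the average is taken over non-ignored positions only, heavily padded batches do not dilute the reported loss.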

### 3.7 Define the training function

In [66]:
# Training loop with mixed precision and gradient clipping
def train(model, dataloader, optimizer, criterion, scaler, epochs, device):
    model.train()
    use_amp = device == "cuda"  # autocast is only enabled on GPU here

    for epoch in range(epochs):
        total_loss = 0.0

        for input_ids, target_ids in dataloader:
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)

            optimizer.zero_grad(set_to_none=True)  # drop gradient tensors instead of zeroing them

            with autocast(device_type=device, enabled=use_amp):  # mixed-precision forward pass
                output = model(input_ids)
                loss = criterion(output.view(-1, output.size(-1)), target_ids.view(-1))

            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)  # unscale first so clipping sees the true gradient norm
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
            scaler.step(optimizer)
            scaler.update()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}")
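
The training loop clips gradients with `max_norm=1.0`. The sketch below illustrates what clipping by global norm does, using plain lists in place of parameter gradients. The helper `clip_by_global_norm` is hypothetical, and the small `eps` in the denominator is my assumption about how PyTorch guards against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one common factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    coef = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [[g * coef for g in vec] for vec in grads], total_norm

clipped, norm_before = clip_by_global_norm([[3.0, 0.0], [4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
print(clipped)
```

When a `GradScaler` is in use, gradients still carry the loss scale until `scaler.unscale_(optimizer)` is called, so clipping is normally applied after unscaling; otherwise the threshold is compared against scaled values.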

### 3.8 Train

In [60]:
train(model, train_dataloader, optimizer, criterion, scaler, config["epochs"], config["device"])

Epoch [1/20], Loss: 9.2707
Epoch [2/20], Loss: 7.4201
Epoch [3/20], Loss: 6.9457
Epoch [4/20], Loss: 6.7038
Epoch [5/20], Loss: 6.5250
Epoch [6/20], Loss: 6.3356
Epoch [7/20], Loss: 6.2328
Epoch [8/20], Loss: 6.1387
Epoch [9/20], Loss: 6.0454
Epoch [10/20], Loss: 5.9370
Epoch [11/20], Loss: 5.8748
Epoch [12/20], Loss: 5.8157
Epoch [13/20], Loss: 5.7470
Epoch [14/20], Loss: 5.6954
Epoch [15/20], Loss: 5.6381
Epoch [16/20], Loss: 5.5857
Epoch [17/20], Loss: 5.5471
Epoch [18/20], Loss: 5.5017
Epoch [19/20], Loss: 5.4573
Epoch [20/20], Loss: 5.4199
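
A cross-entropy value is easier to interpret next to two reference points: the loss of a uniform guess over the vocabulary, and the perplexity, which is the exponential of the mean loss. The short sketch below computes both for the final logged loss. The vocabulary size 50258 is the tokenizer length printed earlier, and `perplexity` is a hypothetical helper.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)

uniform_loss = math.log(50258)  # loss of a uniform distribution over the vocabulary
final_ppl = perplexity(5.4199)  # final loss taken from the training log above

print(f"uniform baseline loss: {uniform_loss:.2f}")
print(f"perplexity at loss 5.4199: {final_ppl:.1f}")
```

Keep in mind that these numbers describe the objective as actually trained. If the targets are unshifted copies of the inputs, the value does not measure next-token prediction quality.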
