## Text Classification Project: Model and Training Code Walkthrough

// unfinished

This project builds a text classifier with PyTorch.

The dataset is the IMDB-Review Dataset, available on [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). It is a binary classification task.

+ Video: [32、基于PyTorch的文本分类项目模型与训练代码讲解](https://www.bilibili.com/video/BV1eD4y1F7o4/) (Bilibili, in Chinese)

In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from loguru import logger
from pathlib import Path

## 1. Writing the GCN + DNN model code

+ Here GCN stands for Gated Convolutional Network (not a graph convolutional network), and the version implemented here is simplified.
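
The gating idea can be isolated in a single layer. The following is a minimal sketch, not the author's code: the class name `GatedConv1d` is hypothetical, and the kernel size 15 and stride 7 are simply copied from the model in this section. Two parallel convolutions produce a content tensor A and a gate tensor B, and the output is A multiplied elementwise by sigmoid(B).

```python
import torch
import torch.nn as nn


class GatedConv1d(nn.Module):
    """One gated 1D convolution block: H = A * sigmoid(B).

    The class name is hypothetical; it only isolates the gating step.
    """

    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int = 15, stride: int = 7) -> None:
        super().__init__()
        # Two parallel convolutions with identical shapes: one produces the
        # content A, the other produces the gate B.
        self.conv_a = nn.Conv1d(in_channels, out_channels, kernel_size, stride=stride)
        self.conv_b = nn.Conv1d(in_channels, out_channels, kernel_size, stride=stride)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [bs, in_channels, seq_len]
        a = self.conv_a(x)
        gate = torch.sigmoid(self.conv_b(x))  # every entry lies in [0, 1]
        return a * gate


torch.manual_seed(0)
layer = GatedConv1d(in_channels=64, out_channels=64)
x = torch.randn(2, 64, 113)  # [bs=2, channels=64, seq_len=113]
h = layer(x)
print(h.shape)
```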

In [2]:
class GCNN(nn.Module):
    
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 64,
        num_class: int = 2
    ) -> None:
        super().__init__()
        
        self.embedding_table = nn.Embedding(vocab_size, embed_dim)
        # xavier_uniform_ expects a tensor, so initialize the weight, not the module
        nn.init.xavier_uniform_(self.embedding_table.weight)
        
        self.conv_A_1 = nn.Conv1d(embed_dim, 64, 15, stride=7)
        self.conv_B_1 = nn.Conv1d(embed_dim, 64, 15, stride=7)
        
        self.conv_A_2 = nn.Conv1d(64, 64, 15, stride=7)
        self.conv_B_2 = nn.Conv1d(64, 64, 15, stride=7)
        
        self.output_linear1 = nn.Linear(64, 128)
        self.output_linear2 = nn.Linear(128, num_class)
    
    def forward(self, word_index: torch.Tensor):
        """
        Forward pass of the GCN: maps the word IDs of a sentence to classification logits.
        :param word_index: [bs, max_seq_len]
        """
        # 1. word_index -> word_embedding
        word_embedding = self.embedding_table(word_index)  # [bs, max_seq_len, embedding_dim]
        
        # 2. First 1D gated convolution block
        word_embedding = word_embedding.transpose(1, 2)  # [bs, embedding_dim, max_seq_len]
        A = self.conv_A_1(word_embedding)
        B = self.conv_B_1(word_embedding)
        H = A * torch.sigmoid(B)  # [bs, 64, L1], L1 = (max_seq_len - 15) // 7 + 1

        # Second 1D gated convolution block
        A = self.conv_A_2(H)
        B = self.conv_B_2(H)
        H = A * torch.sigmoid(B)  # [bs, 64, L2], L2 = (L1 - 15) // 7 + 1
        
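        # 3. Pool over the sequence dimension, then apply the fully connected layers
        pool_output = torch.mean(H, dim=-1)  # mean pooling, [bs, 64]
        # output_linear1 maps 64 -> 128, so it must run before output_linear2 (128 -> num_class)
        logits = self.output_linear2(
            self.output_linear1(pool_output)
        )  # [bs, num_class]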
        return logits
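
Because both convolution layers use kernel size 15 and stride 7 without padding, the sequence length shrinks sharply after each one. The sketch below applies the standard output-length formula for an unpadded, undilated `nn.Conv1d`; the helper names are hypothetical and exist only for this illustration.

```python
from typing import Tuple


def conv1d_out_len(length: int, kernel_size: int = 15, stride: int = 7) -> int:
    """Output length of an unpadded, undilated 1D convolution."""
    return (length - kernel_size) // stride + 1


def gcnn_lengths(max_seq_len: int) -> Tuple[int, int]:
    """Sequence lengths after the first and second gated convolution."""
    l1 = conv1d_out_len(max_seq_len)
    l2 = conv1d_out_len(l1)
    return l1, l2


print(gcnn_lengths(512))  # (72, 9)
print(gcnn_lengths(113))  # (15, 1)
```

With these settings, 113 tokens is the shortest input that still leaves 15 positions for the second convolution, so shorter batches would need padding before they reach the model.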

The PyTorch website has an even simpler model, shown here. It really is very simple.

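Note that this model uses `nn.EmbeddingBag`: given an input of shape [bs, seq_len], it returns a tensor of shape [bs, embed_dim], because in its default `mean` mode it averages the embeddings of all tokens in each sequence.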
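
A small check makes the averaging behavior concrete. This is an illustrative sketch with made-up sizes (a vocabulary of 10 and 4-dimensional embeddings): it copies the `nn.EmbeddingBag` weights into a plain `nn.Embedding` and compares the bag output with an explicit mean over the sequence dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim = 10, 4  # toy sizes chosen for this illustration

bag = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
emb = nn.Embedding(vocab_size, embed_dim)
with torch.no_grad():
    emb.weight.copy_(bag.weight)  # make both layers use the same table

token_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])  # [bs=2, seq_len=3]

bag_out = bag(token_ids)               # each row is treated as one bag
mean_out = emb(token_ids).mean(dim=1)  # explicit mean over seq_len

print(bag_out.shape)
print(torch.allclose(bag_out, mean_out, atol=1e-6))
```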

In [3]:
class TextClassificationModel(nn.Module):
    """
    Simple EmbeddingBag + DNN model.
    Included only for reference.
    """
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 64,
        num_class: int = 2
    ) -> None:
        super().__init__()
        
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        
    def forward(self, token_ids: torch.Tensor):
        embedding = self.embedding(token_ids)  # [bs, embedding_dim]
        return self.fc(embedding)

## 2. Building the IMDB DataLoader

In [5]:
from torchtext.datasets import IMDB
from torchtext.datasets.imdb import NUM_LINES
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset

In [13]:
conf = {
    'dataset_path': '/root/yubin/dataset/kaggle/IMDB-Dataset-of-50K-Movie-Reviews'
}

In [16]:
train_data_iter = IMDB(split='train')

NameError: name 'IterableWrapper' is not defined
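
The cell above stops with a `NameError`, and the notebook does not go further. One possible way to proceed without `torchtext.datasets.IMDB` is to read the local Kaggle copy referenced by `conf['dataset_path']` directly. The sketch below is an assumption-laden illustration rather than the author's approach: the function name is hypothetical, and it assumes a CSV with a header row containing `review` and `sentiment` columns, where `sentiment` is `positive` or `negative`.

```python
import csv
from pathlib import Path
from typing import Iterator, Tuple


def iter_imdb_csv(csv_path: Path) -> Iterator[Tuple[int, str]]:
    """Yield (label, review_text) pairs from a Kaggle-style IMDB CSV.

    Assumes a header row with columns named "review" and "sentiment",
    where sentiment is "positive" or "negative".
    """
    label_map = {"negative": 0, "positive": 1}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield label_map[row["sentiment"]], row["review"]


# Hypothetical usage; the file name "IMDB Dataset.csv" is an assumption:
# csv_path = Path(conf['dataset_path']) / "IMDB Dataset.csv"
# samples = list(iter_imdb_csv(csv_path))
```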