# IE6483 Project 1 — RNN (BiLSTM) Sentiment Classifier

**Notebook generated on:** 2025-11-04 05:35:16

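This notebook builds a reproducible **BiLSTM (RNN) sentiment classification** pipeline following the course project requirements:
- Data loading: `train.json` / `test.json` (fields: `reviews`, `sentiments`).
- Tokenization and vocabulary: `torchtext`'s `basic_english` tokenizer, falling back to a regex tokenizer if `torchtext` is unavailable.
- Model: a BiLSTM over variable-length sequences via `pack_padded_sequence`, with a binary classification head.
- Training: AdamW, gradient clipping, `ReduceLROnPlateau` learning-rate scheduling, early stopping, and best-checkpoint saving.
- Logging: Python `logging` writes to both the console and `train.log`, and tqdm progress bars are shown.
- Output: test-set predictions written to `submission.csv` (one column, `sentiments`, values 0/1).

> Before running, make sure `train.json` and `test.json` are at the paths set by `TRAIN_PATH` and `TEST_PATH` in section 3 (by default `../train.json` and `../test.json`, i.e. the parent of the working directory).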


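## 0. Environment and device

- Recommended OS: Ubuntu 22.04 (or an equivalent Linux)  
- Python: 3.10+  
- PyTorch: 2.x (2.7+ suggested), CUDA 12.x (e.g. 12.6/12.8); the recorded output below comes from a `cu118` build  
- GPU: NVIDIA RTX A6000 (Ampere GA102; a single card is enough, and the notebook also runs on CPU)  
- CPU: compatible with the Xeon Silver series (no special instruction sets required)

> The cell below detects `torch.cuda` and prints the device and key library versions.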


In [None]:

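# Optional: run this cell only if the dependencies below are missing from your environment.
# Network installs are disabled in this evaluation environment, so the command is for reference only: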
# import sys
# !{sys.executable} -m pip install torch torchtext pandas scikit-learn tqdm


In [6]:

import platform, torch, sys
import pandas as pd
import sklearn
import importlib

print("Python :", sys.version)
print("Platform:", platform.platform())
print("PyTorch :", torch.__version__)
print("Pandas  :", pd.__version__)
print("sklearn :", sklearn.__version__)

if torch.cuda.is_available():
    print("CUDA    : available")
    print("GPU     :", torch.cuda.get_device_name(0))
else:
    print("CUDA    : not available (running on CPU)")


Python : 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0]
Platform: Linux-5.15.0-1027-oracle-x86_64-with-glibc2.31
PyTorch : 2.7.1+cu118
Pandas  : 2.3.3
sklearn : 1.7.2
CUDA    : available
GPU     : NVIDIA RTX A6000


## 1. Logging setup (console output + train.log)

In [7]:

import logging, sys
from pathlib import Path

def setup_logger(log_file: str = "train.log", level=logging.INFO):
    logger = logging.getLogger()
    logger.handlers = []  # reset existing handlers in notebooks
    logger.setLevel(level)
    fmt = logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s", "%H:%M:%S")

    # console
    ch = logging.StreamHandler(stream=sys.stdout)
    ch.setLevel(level)
    ch.setFormatter(fmt)
    logger.addHandler(ch)

    # file
    fh = logging.FileHandler(log_file, mode="w", encoding="utf-8")
    fh.setLevel(level)
    fh.setFormatter(fmt)
    logger.addHandler(fh)

    logger.info("Logger initialized. Writing to %s", log_file)
    return logger

logger = setup_logger()


[14:55:25] [INFO] Logger initialized. Writing to train.log


## 2. Seeding and determinism

In [8]:

import os, random, torch, numpy as np

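def set_seed(seed: int = 42):
    # cuBLAS needs this workspace setting for deterministic matmuls on CUDA >= 10.2;
    # set it first so it is in place before any CUDA work starts.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":16:8")
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False  # avoid run-to-run autotuner differences
    torch.use_deterministic_algorithms(True)
    logger.info("Deterministic mode enabled with seed=%d", seed)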

set_seed(42)


[14:55:28] [INFO] Deterministic mode enabled with seed=42
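
Global seeds fix the starting RNG state, but the shuffle order of a `DataLoader` still depends on how many random numbers were drawn before it was iterated. A minimal sketch of one common way to pin the order down follows, assuming a dedicated `torch.Generator` and a `seed_worker` helper. Both names, and the toy dataset, are illustrative and not part of this notebook.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    # Hypothetical helper: derive each worker's NumPy/random seed
    # from the per-worker seed that PyTorch assigns.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(seed: int) -> DataLoader:
    # Toy dataset standing in for tr_ds; the generator fixes the shuffle order.
    ds = TensorDataset(torch.arange(10))
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(ds, batch_size=4, shuffle=True,
                      worker_init_fn=seed_worker, generator=g)


def batch_order(loader: DataLoader) -> list:
    # Collect the sample ids of every batch in iteration order.
    return [batch[0].tolist() for batch in loader]


order_a = batch_order(make_loader(42))
order_b = batch_order(make_loader(42))
print(order_a == order_b)
```

With the same generator seed, the two loaders should yield identical batch orders regardless of what else consumed the global RNG in between.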


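## 3. Data loading and preprocessing

- Expected fields: `reviews` (text) and `sentiments` (0/1, training set only).
- The loader tries JSON Lines, then regular JSON, then CSV, and accepts the first format that yields a `reviews` column.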
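
The loader in this section tries JSON Lines before regular JSON. To make the difference between the two layouts concrete, the sketch below writes the same records both ways and parses them with the standard library only. The sample reviews and the `parse_records` helper are made up for illustration; the notebook itself relies on `pandas.read_json`.

```python
import json

records = [
    {"reviews": "Great fit, would buy again.", "sentiments": 1},
    {"reviews": "Fell apart after a week.", "sentiments": 0},
]

# JSON Lines: one object per line. Regular JSON: a single array of objects.
jsonl_text = "\n".join(json.dumps(r) for r in records)
json_text = json.dumps(records, indent=2)


def parse_records(text: str) -> list:
    """Hypothetical helper: try line-by-line JSON first, then whole-document JSON."""
    try:
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        if all(isinstance(r, dict) for r in rows):
            return rows
    except json.JSONDecodeError:
        pass  # not JSON Lines; fall through to regular JSON
    return json.loads(text)


from_jsonl = parse_records(jsonl_text)
from_json = parse_records(json_text)
print(from_jsonl == from_json)
```

Both layouts decode to the same list of records, which is why checking for the `reviews` field after each attempt is enough to pick the right parser.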


In [9]:

import re
import pandas as pd
from pathlib import Path

def read_json_flexible(path: str) -> pd.DataFrame:
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")

    # Try JSON Lines
    try:
        df = pd.read_json(p, lines=True)
        if "reviews" in df.columns:
            return df
    except Exception:
        pass

    # Try regular JSON
    try:
        df = pd.read_json(p)
        if "reviews" in df.columns:
            return df
    except Exception:
        pass

    # Try CSV fallback
    try:
        df = pd.read_csv(p)
        if "reviews" in df.columns:
            return df
    except Exception:
        pass

    raise ValueError(f"Unsupported data format in {path}. Expected a 'reviews' column.")

# Adjust the paths as needed
TRAIN_PATH = "../train.json"
TEST_PATH  = "../test.json"

train_df = read_json_flexible(TRAIN_PATH)
test_df  = read_json_flexible(TEST_PATH)

assert "reviews" in train_df.columns and "reviews" in test_df.columns, "missing 'reviews' column"
assert "sentiments" in train_df.columns, "training set is missing the 'sentiments' column"

train_df.head(3)


| | reviews | sentiments |
|---|---|---|
| 0 | I bought this belt for my daughter in-law for ... | 1 |
| 1 | The size was perfect and so was the color. It... | 1 |
| 2 | Fits and feels good, esp. for doing a swim rac... | 1 |


## 4. Tokenization and vocabulary (`torchtext`'s `basic_english`, falling back to a regex tokenizer)

In [10]:

import re
from typing import List, Optional, Dict

def get_basic_tokenizer():
    try:
        from torchtext.data.utils import get_tokenizer
        tok = get_tokenizer("basic_english")
        logger.info("Using torchtext basic_english tokenizer.")
        return tok
    except Exception:
        pattern = re.compile(r"[A-Za-z0-9']+")
        logger.warning("torchtext unavailable; fall back to regex tokenizer.")
        return lambda s: pattern.findall(s.lower())

class Vocab:
    def __init__(self, counter: Dict[str, int], min_freq: int = 2, specials: Optional[List[str]] = None, max_size: Optional[int] = 50000):
        self.pad_token = "<pad>"
        self.unk_token = "<unk>"
        specials = specials or [self.pad_token, self.unk_token]

        items = [(w, c) for w, c in counter.items() if c >= min_freq]
        items.sort(key=lambda x: (-x[1], x[0]))
        if max_size is not None:
            items = items[:max_size - len(specials)]
        itos = specials + [w for w, _ in items]
        self.itos = itos
        self.stoi = {w: i for i, w in enumerate(itos)}
        self.pad_idx = self.stoi[self.pad_token]
        self.unk_idx = self.stoi[self.unk_token]

    def __len__(self):
        return len(self.itos)

    def encode(self, tokens: List[str]) -> List[int]:
        return [self.stoi.get(t, self.unk_idx) for t in tokens]

def build_vocab(texts: List[str], tokenizer, min_freq=2, max_size=50000) -> Vocab:
    counter = {}
    for t in texts:
        for tok in tokenizer(str(t)):
            counter[tok] = counter.get(tok, 0) + 1
    return Vocab(counter, min_freq=min_freq, max_size=max_size)

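def encode_text(text: str, tokenizer, vocab: Vocab, max_len: int = 256):
    tokens = tokenizer(str(text))
    ids = vocab.encode(tokens[:max_len])
    # pack_padded_sequence rejects zero-length sequences, so map a review
    # that yields no tokens to a single <unk> id.
    return ids if ids else [vocab.unk_idx]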

tokenizer = get_basic_tokenizer()
vocab = build_vocab(train_df["reviews"].astype(str).tolist(), tokenizer, min_freq=2, max_size=50000)
logger.info("Vocab size: %d", len(vocab))


[14:55:37] [INFO] Vocab size: 7797
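
The vocabulary rule used in this section (drop tokens rarer than `min_freq`, sort the rest by descending count and then alphabetically, and reserve the first two indices for `<pad>` and `<unk>`) can be traced on a toy corpus. This is a sketch only: the three sentences are invented and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

corpus = ["good fit good price", "bad fit", "good color"]
counts = Counter(tok for text in corpus for tok in text.split())

min_freq = 2
specials = ["<pad>", "<unk>"]

# Keep frequent tokens, ordered by (-count, token) as in the Vocab class.
kept = sorted((w for w, c in counts.items() if c >= min_freq),
              key=lambda w: (-counts[w], w))
itos = specials + kept
stoi = {w: i for i, w in enumerate(itos)}


def encode(tokens):
    # Unknown or rare tokens map to the <unk> index.
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]


ids = encode("good cheap fit".split())
print(itos)  # ['<pad>', '<unk>', 'good', 'fit']
print(ids)   # [2, 1, 3]
```

Here `price`, `bad`, and `color` each occur once and are dropped, so at encoding time they would map to `<unk>` just like the unseen word `cheap`.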


## 5. Dataset and DataLoader (variable-length padding via `collate_fn`)

In [12]:

from typing import Tuple
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, vocab, max_len=256):
        self.texts = list(texts)
        self.labels = None if labels is None else list(labels)
        self.tokenizer = tokenizer
        self.vocab = vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = encode_text(self.texts[idx], self.tokenizer, self.vocab, self.max_len)
        y = None if self.labels is None else int(self.labels[idx])
        return ids, y

def collate_batch(batch, pad_idx: int):
    lengths = [len(x[0]) for x in batch]
    max_len = max(lengths) if lengths else 0
    padded, labels = [], []
    for ids, y in batch:
        padded.append(ids + [pad_idx] * (max_len - len(ids)))
        if y is not None:
            labels.append(y)
    x = torch.tensor(padded, dtype=torch.long)
    lens = torch.tensor(lengths, dtype=torch.long)
    y = None if len(labels) == 0 else torch.tensor(labels, dtype=torch.float32)
    return x, lens, y

from sklearn.model_selection import train_test_split
tr_texts, va_texts, tr_labels, va_labels = train_test_split(
    train_df["reviews"].astype(str).tolist(),
    train_df["sentiments"].astype(int).tolist(),
    test_size=0.2, random_state=42, stratify=train_df["sentiments"].astype(int).tolist()
)

max_len = 256
tr_ds = TextDataset(tr_texts, tr_labels, tokenizer, vocab, max_len=max_len)
va_ds = TextDataset(va_texts, va_labels, tokenizer, vocab, max_len=max_len)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
collate = lambda b: collate_batch(b, vocab.pad_idx)
num_workers = 2 if os.name != "nt" else 0
pin_memory = torch.cuda.is_available()

tr_loader = DataLoader(tr_ds, batch_size=64, shuffle=True, collate_fn=collate, num_workers=num_workers, pin_memory=pin_memory)
va_loader = DataLoader(va_ds, batch_size=64, shuffle=False, collate_fn=collate, num_workers=num_workers, pin_memory=pin_memory)

logger.info("Train/Val sizes: %d / %d", len(tr_ds), len(va_ds))


[14:56:21] [INFO] Train/Val sizes: 5920 / 1481
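
What the collate step hands to the model is easiest to see on a tiny batch. The sketch below repeats the padding logic with plain Python lists, so it runs without PyTorch. The `pad_batch` name and the token ids are arbitrary placeholders.

```python
def pad_batch(seqs, pad_idx=0):
    """Right-pad each id list to the longest length in the batch."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths) if lengths else 0
    padded = [s + [pad_idx] * (max_len - len(s)) for s in seqs]
    return padded, lengths


padded, lengths = pad_batch([[5, 8, 2], [7], [4, 9]])
print(padded)   # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
print(lengths)  # [3, 1, 2]
```

The original lengths travel with the padded matrix so that the model can later tell real tokens from padding.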


## 6. BiLSTM model (variable-length sequences + `pack_padded_sequence`)

In [13]:

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 200, hidden_dim: int = 256,
                 num_layers: int = 2, dropout: float = 0.3, pad_idx: int = 0, bidirectional: bool = True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,
            bidirectional=bidirectional
        )
        self.bidirectional = bidirectional
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(out_dim, 1)

    def forward(self, x, lengths):
        emb = self.embedding(x)                       # [B, T, E]
        packed = pack_padded_sequence(emb, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)               # h_n: [L*D, B, H]
        if self.bidirectional:
            h_fwd, h_bwd = h_n[-2], h_n[-1]
            h = torch.cat([h_fwd, h_bwd], dim=1)      # [B, 2H]
        else:
            h = h_n[-1]                               # [B, H]
        h = self.dropout(h)
        return self.fc(h).squeeze(1)                  # [B]

model = BiLSTMClassifier(vocab_size=len(vocab), embed_dim=200, hidden_dim=256, num_layers=2, dropout=0.3, pad_idx=vocab.pad_idx).to(device)
total_params = sum(p.numel() for p in model.parameters())
logger.info("Model built: %s (params: %d)", model.__class__.__name__, total_params)


[14:57:58] [INFO] Model built: BiLSTMClassifier (params: 4074857)
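
The logged parameter count can be reproduced by hand. For one LSTM direction with input size I and hidden size H, PyTorch stores `weight_ih` (4H x I), `weight_hh` (4H x H), and two bias vectors of length 4H. The sketch below plugs in the values used in this notebook (vocabulary 7797, embedding 200, hidden 256, two bidirectional layers). The helper name is illustrative.

```python
def lstm_direction_params(input_size: int, hidden_size: int) -> int:
    # weight_ih (4H x I) + weight_hh (4H x H) + bias_ih (4H) + bias_hh (4H)
    return 4 * hidden_size * (input_size + hidden_size) + 8 * hidden_size


vocab_size, embed_dim, hidden_dim = 7797, 200, 256

embedding = vocab_size * embed_dim                               # 1,559,400
layer1 = 2 * lstm_direction_params(embed_dim, hidden_dim)        # 937,984
layer2 = 2 * lstm_direction_params(2 * hidden_dim, hidden_dim)   # 1,576,960
head = 2 * hidden_dim + 1                                        # 513 (weights + bias)

total = embedding + layer1 + layer2 + head
print(total)  # 4074857
```

The second layer is larger than the first because its input is the concatenated forward and backward output of layer one (2H = 512) rather than the 200-dimensional embedding.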


## 7. Training and validation (logging, early stopping, learning-rate scheduling)

In [None]:

import torch, torch.nn as nn, torch.optim as optim
from tqdm.auto import tqdm

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

def evaluate(model, loader, device):
    model.eval()
    total, correct, loss_sum = 0, 0, 0.0
    with torch.no_grad():
        for x, lengths, y in loader:
            x, lengths, y = x.to(device), lengths.to(device), y.to(device)
            logits = model(x, lengths)
            loss = criterion(logits, y)
            preds = (torch.sigmoid(logits) >= 0.5).long()
            total += x.size(0)
            correct += (preds == y.long()).sum().item()
            loss_sum += loss.item() * x.size(0)
    return loss_sum / total, correct / total

best_val = float("inf")
best_path = "checkpoints/bilstm_best.pt"
os.makedirs(os.path.dirname(best_path), exist_ok=True)
epochs = 8
patience, bad_epochs = 3, 0

for epoch in range(1, epochs + 1):
    model.train()
    pbar = tqdm(tr_loader, desc=f"Epoch {epoch}/{epochs}")
    for x, lengths, y in pbar:
        x, lengths, y = x.to(device), lengths.to(device), y.to(device)
        optimizer.zero_grad(set_to_none=True)
        logits = model(x, lengths)
        loss = criterion(logits, y)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        pbar.set_postfix(loss=f"{loss.item():.4f}")

    val_loss, val_acc = evaluate(model, va_loader, device)
    scheduler.step(val_loss)
    logger.info("[Val] loss=%.4f acc=%.4f lr=%.6f", val_loss, val_acc, optimizer.param_groups[0]["lr"])

    if val_loss < best_val - 1e-6:
        best_val = val_loss
        torch.save({"state_dict": model.state_dict(), "cfg": {
            "vocab_itos": vocab.itos,
            "pad_idx": vocab.pad_idx,
            "embed_dim": 200, "hidden_dim": 256, "num_layers": 2, "dropout": 0.3, "max_len": 256
        }}, best_path)
        logger.info("Saved best checkpoint -> %s", best_path)
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            logger.info("Early stopping at epoch=%d (no improvement for %d epochs).", epoch, patience)
            break


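
The scheduler (`patience=1`, `factor=0.5`) and the early-stopping rule (`patience=3`) interact in a way that is easier to follow on a trace. The sketch below is a simplified pure-Python model of both rules run over made-up validation losses. It ignores the small relative improvement threshold that the real `ReduceLROnPlateau` applies, and the `simulate` function is hypothetical.

```python
def simulate(val_losses, lr=2e-3, factor=0.5, sched_patience=1, stop_patience=3):
    """Simplified trace of LR-on-plateau plus early stopping (illustrative only)."""
    best_sched = best_stop = float("inf")
    sched_bad = stop_bad = 0
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        # Plateau rule: cut the LR once bad epochs exceed the scheduler patience.
        if loss < best_sched:
            best_sched, sched_bad = loss, 0
        else:
            sched_bad += 1
            if sched_bad > sched_patience:
                lr *= factor
                sched_bad = 0
        # Early-stopping rule, mirroring the training loop above.
        if loss < best_stop - 1e-6:
            best_stop, stop_bad = loss, 0
        else:
            stop_bad += 1
        history.append((epoch, loss, lr, stop_bad))
        if stop_bad >= stop_patience:
            break
    return history


history = simulate([0.60, 0.50, 0.52, 0.51, 0.53, 0.49])
for row in history:
    print(row)
```

With these invented losses the learning rate is halved at epoch 4 (the second epoch in a row without improvement) and training stops at epoch 5, so the sixth value is never reached.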

## 8. Load the best weights and write `submission.csv` for the test set

In [None]:

import torch, pandas as pd
from torch.utils.data import DataLoader

# Reload the best weights (and the vocabulary)
ckpt = torch.load("checkpoints/bilstm_best.pt", map_location=device)
model.load_state_dict(ckpt["state_dict"])
model.eval()
# Vocabulary: a standalone inference script would need to rebuild it from the saved itos list; here the notebook's in-memory vocab instance is reused.

test_ds = TextDataset(test_df['reviews'].astype(str).tolist(), labels=None, tokenizer=tokenizer, vocab=vocab, max_len=256)
test_loader = DataLoader(test_ds, batch_size=128, shuffle=False, collate_fn=lambda b: collate_batch(b, vocab.pad_idx),
                         num_workers=2 if os.name!="nt" else 0, pin_memory=torch.cuda.is_available())

preds = []
with torch.no_grad():
    for x, lengths, _ in tqdm(test_loader, desc="Predict"):
        x, lengths = x.to(device), lengths.to(device)
        logits = model(x, lengths)
        probs = torch.sigmoid(logits)
        preds.extend((probs >= 0.5).long().cpu().tolist())

sub = pd.DataFrame({"sentiments": preds})
out_csv = "submission.csv"
sub.to_csv(out_csv, index=False)
logger.info("Wrote predictions -> %s (head):\n%s", out_csv, sub.head(5))
sub.head()
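
For a standalone inference script, the token lookup has to be rebuilt from the `cfg` dict stored in the checkpoint. A minimal sketch follows, assuming the `vocab_itos`, `pad_idx`, and `max_len` keys saved by the training cell. The `vocab_from_cfg` helper and the four-token vocabulary are stand-ins, not part of the notebook.

```python
def vocab_from_cfg(cfg: dict):
    """Hypothetical helper: rebuild the token -> index lookup from a saved cfg dict."""
    itos = cfg["vocab_itos"]
    stoi = {tok: i for i, tok in enumerate(itos)}
    pad_idx = cfg["pad_idx"]
    unk_idx = stoi["<unk>"]
    return stoi, pad_idx, unk_idx


# Stand-in for ckpt["cfg"]; a real checkpoint holds the full vocabulary.
cfg = {"vocab_itos": ["<pad>", "<unk>", "good", "fit"], "pad_idx": 0, "max_len": 256}

stoi, pad_idx, unk_idx = vocab_from_cfg(cfg)
tokens = ["good", "belt"]
ids = [stoi.get(t, unk_idx) for t in tokens[: cfg["max_len"]]]
print(ids)  # [2, 1]
```

Because the index of each token is just its position in `vocab_itos`, storing that list is enough to reproduce the training-time encoding exactly.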


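## 9. Notes and requirement checklist

- The RNN always feeds the LSTM through `pack_padded_sequence`, so padded positions are skipped rather than computed;
- `torch.use_deterministic_algorithms(True)` plus fixed random seeds make runs easier to reproduce;
- Training logs go to both the console and `train.log`, and the best checkpoint is saved under `checkpoints/`;
- Project requirements covered: input/output fields, binary classification, and `submission.csv` export.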
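
The effect of packing named in the first note can be checked on a toy LSTM: with packing, the final hidden state matches the unpadded result, while feeding the padding through the LSTM changes it. The sketch uses small random weights and made-up dimensions, not this notebook's model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True, bidirectional=True)

seq = torch.randn(1, 2, 4)                              # one sequence, 2 real steps
padded = torch.cat([seq, torch.zeros(1, 3, 4)], dim=1)  # same sequence + 3 pad steps
lengths = torch.tensor([2])

with torch.no_grad():
    _, (h_ref, _) = lstm(seq)        # reference: no padding at all
    _, (h_plain, _) = lstm(padded)   # padding is processed like real input
    packed = pack_padded_sequence(padded, lengths, batch_first=True,
                                  enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)  # padding is skipped

same_as_ref = torch.allclose(h_packed, h_ref, atol=1e-6)
plain_differs = not torch.allclose(h_plain, h_ref, atol=1e-6)
print(same_as_ref, plain_differs)
```

This matters most for the backward direction, which would otherwise start reading from the padded end of each sequence.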