# Torch-Rechub Tutorial: DIN (A Simple Step-by-Step Guide)

- **Scenario**: Fine ranking (CTR prediction)
- **Models**: DIN, BST
- **Data**: Amazon-Electronics Sample

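## Learning Objectives
1. Learn to use torch-rechub to process different types of features.
2. Learn how to build historical behavior sequences with `create_seq_features`.
3. Understand how configuring `SequenceFeature` enables embedding sharing.
4. Compare the training workflow of DIN with that of other sequence models (such as BST).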

In [None]:
# 1. Installation and environment check
# !pip install torch-rechub
import torch
import torch_rechub
import pandas as pd
import numpy as np
import tqdm
import sklearn

print("Torch Version:", torch.__version__, "CUDA Available:", torch.cuda.is_available())
torch.manual_seed(2022) 

## Step 1: Supporting a New Dataset
Extract four feature columns: user_id, item_id, cate_id, and time.

In [None]:
file_path = '../examples/ranking/data/amazon-electronics/amazon_electronics_sample.csv'
data = pd.read_csv(file_path)
data.head()

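## Step 2: Feature Engineering
- **Dense features**: numerical values (not covered in this tutorial).
- **Sparse features**: categorical values, which are mapped to embeddings.
- **Sequence features**: the user's historical behavior sequence, the input that DIN's core innovation (attention conditioned on the target item) operates on.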
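
Sparse features need contiguous integer indices before they can index an embedding table. The following is a minimal sketch of that encoding in plain Python. It assumes index 0 is reserved for padding, and `encode_ids` is a hypothetical helper written for illustration, not a torch-rechub function.

```python
def encode_ids(raw_values):
    """Map raw categorical values to contiguous indices starting at 1.

    Index 0 is left unused so it can serve as the padding symbol
    in sequence features.
    """
    mapping = {}
    encoded = []
    for value in raw_values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # first unseen value gets index 1
        encoded.append(mapping[value])
    return encoded, mapping


raw_items = ["B00A", "B00C", "B00A", "B00Z"]  # made-up raw item IDs
item_idx, item_map = encode_ids(raw_items)
vocab_size = max(item_idx) + 1  # +1 accounts for the padding index 0
print(item_idx, vocab_size)
```

Under this scheme the embedding table needs at least `max index + 1` rows, which is why vocabulary sizes are derived from the maximum encoded ID.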

In [None]:
from torch_rechub.utils.data import create_seq_features

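# Build history sequences automatically: seq_feature_col lists the columns to turn into sequences;
# drop_short is the threshold below which users with short sequences are dropped.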
train, val, test = create_seq_features(data, seq_feature_col=['item_id', 'cate_id'], drop_short=0)
train.head()
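
Conceptually, sequence construction sorts each user's interactions by time and pairs every target item with the items that preceded it. The sketch below illustrates that idea in plain Python. `build_history_samples` is a hypothetical helper, and the right-padding with 0 and the `max_len` truncation are assumptions of the sketch. It leaves out the label generation and the train/validation/test split that `create_seq_features` returns.

```python
from collections import defaultdict


def build_history_samples(interactions, max_len=5, drop_short=0):
    """Turn (user_id, item_id, time) rows into (user, padded_history, target) samples.

    Users with fewer than `drop_short` interactions are skipped.
    Histories keep the most recent `max_len` items and are right-padded with 0.
    """
    per_user = defaultdict(list)
    for user_id, item_id, time in interactions:
        per_user[user_id].append((time, item_id))

    samples = []
    for user_id, events in per_user.items():
        if len(events) < drop_short:
            continue
        items = [item for _, item in sorted(events)]  # chronological order
        for i in range(1, len(items)):
            history = items[:i][-max_len:]
            padded = history + [0] * (max_len - len(history))
            samples.append((user_id, padded, items[i]))
    return samples


rows = [(1, 10, 3), (1, 11, 1), (1, 12, 2), (2, 20, 1)]  # toy data
samples = build_history_samples(rows, max_len=3)
for sample in samples:
    print(sample)
```

User 2 has a single interaction and therefore no preceding history, so it yields no sample here.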

## Step 3: Defining How Features Are Processed
Use `SparseFeature` and `SequenceFeature`. Note that the `shared_with` parameter is what enables embedding sharing.
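
Embedding sharing means a sequence feature looks up its vectors in the table of the feature it names, instead of owning a separate table. The following is a rough sketch of that idea using NumPy rather than torch-rechub's own embedding layer. The table size, the IDs, and the zeroed padding row are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, embed_dim = 5, 4

# One table serves both the target item and the history items.
# Row 0 is the padding row and is kept at zero (an assumption of this sketch).
item_table = rng.normal(size=(n_items + 1, embed_dim))
item_table[0] = 0.0

target_item = np.array([3])            # shape: (batch,)
history_item = np.array([[3, 1, 0]])   # shape: (batch, seq_len), 0 = padding

target_emb = item_table[target_item]    # shape: (batch, embed_dim)
history_emb = item_table[history_item]  # shape: (batch, seq_len, embed_dim)

# Item 3 maps to the same vector whether it is the target or part of the history.
print(np.array_equal(target_emb[0], history_emb[0, 0]))
print(history_emb.shape)
```

Sharing keeps target and history items in the same vector space, which is what lets a model compare them directly, and it avoids learning two tables for the same vocabulary.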

In [None]:
from torch_rechub.basic.features import SparseFeature, SequenceFeature

n_users, n_items, n_cates = data["user_id"].max(), data["item_id"].max(), data["cate_id"].max()

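# Each feature name must match a column in the DataFrames returned by create_seq_features.
# Check train.columns and adjust the names below if they differ.
features = [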
    SparseFeature("target_item", vocab_size=n_items + 2, embed_dim=64),
    SparseFeature("target_cate", vocab_size=n_cates + 2, embed_dim=64),
    SparseFeature("user_id", vocab_size=n_users + 2, embed_dim=64)
]
target_features = features

history_features = [
    SequenceFeature("history_item", vocab_size=n_items + 2, embed_dim=64, pooling="concat", shared_with="target_item"),
    SequenceFeature("history_cate", vocab_size=n_cates + 2, embed_dim=64, pooling="concat", shared_with="target_cate")
]

In [None]:
from torch_rechub.utils.data import df_to_dict, DataGenerator

train_y, val_y, test_y = train["label"], val["label"], test["label"]
train_x, val_x, test_x = df_to_dict(train.drop(columns="label")), df_to_dict(val.drop(columns="label")), df_to_dict(test.drop(columns="label"))

dg = DataGenerator(train_x, train_y)
train_dataloader, val_dataloader, test_dataloader = dg.generate_dataloader(x_val=val_x, y_val=val_y, x_test=test_x, y_test=test_y, batch_size=16)

## Step 4: Model Training (BST, for Comparison)

In [None]:
from torch_rechub.models.ranking import BST
from torch_rechub.trainers import CTRTrainer

model_bst = BST(features=features, history_features=history_features, target_features=target_features, mlp_params={"dims": [256, 128]})
trainer = CTRTrainer(model_bst, optimizer_params={"lr": 1e-5, "weight_decay": 1e-3}, n_epoch=10, device='cpu')
trainer.fit(train_dataloader, val_dataloader)
print(f"BST Test AUC: {trainer.evaluate(model_bst, test_dataloader)}")
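
The metric printed above is AUC, which can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one. The following is a minimal pairwise sketch in plain Python, for intuition only. `pairwise_auc` is a hypothetical helper, and its quadratic cost makes it unsuitable for real evaluation, where a library routine such as `sklearn.metrics.roc_auc_score` is the usual choice.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]          # toy click labels
scores = [0.9, 0.3, 0.4, 0.6]  # toy predicted probabilities
auc = pairwise_auc(labels, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly in this toy data, so the result is 0.75. A value of 0.5 corresponds to random ranking.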

## Step 5: Model Training (DIN)

In [None]:
from torch_rechub.models.ranking import DIN

model_din = DIN(features=features, history_features=history_features, target_features=target_features, mlp_params={"dims": [256, 128]}, attention_mlp_params={"dims": [256, 128]})
trainer_din = CTRTrainer(model_din, optimizer_params={"lr": 1e-3, "weight_decay": 1e-3}, n_epoch=3, device='cpu')
trainer_din.fit(train_dataloader, val_dataloader)
print(f"DIN Test AUC: {trainer_din.evaluate(model_din, test_dataloader)}")
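
What distinguishes DIN from plain pooling is that each history item is weighted by its relevance to the target item before the sequence is summed. The sketch below shows one common formulation of that attention pooling in NumPy. It is a simplified illustration, not torch-rechub's implementation: the weights are random instead of learned, ReLU stands in for the activation used in the paper, and `attention_pool` is a hypothetical name.

```python
import numpy as np


def attention_pool(history_emb, target_emb, w1, b1, w2, b2, mask):
    """Weighted-sum pooling of history embeddings, weighted by relevance to the target.

    history_emb: (seq_len, dim); target_emb: (dim,);
    mask: (seq_len,) with 1 for real items and 0 for padding.
    """
    seq_len = history_emb.shape[0]
    target = np.broadcast_to(target_emb, history_emb.shape)
    # Interaction features per history position: [h, t, h - t, h * t]
    feats = np.concatenate(
        [history_emb, target, history_emb - target, history_emb * target], axis=1
    )
    hidden = np.maximum(feats @ w1 + b1, 0.0)      # ReLU hidden layer
    scores = (hidden @ w2 + b2).reshape(seq_len)   # one scalar weight per position
    scores = scores * mask                         # padded positions contribute nothing
    return scores @ history_emb                    # pooled vector, shape (dim,)


rng = np.random.default_rng(0)
dim, hidden_dim = 4, 8
history = rng.normal(size=(3, dim))
history[2] = 0.0                      # last position is padding
mask = np.array([1.0, 1.0, 0.0])
target = rng.normal(size=dim)
w1, b1 = rng.normal(size=(4 * dim, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, 1)), np.zeros(1)

pooled = attention_pool(history, target, w1, b1, w2, b2, mask)
print(pooled.shape)
```

Because the weights depend on the target, the same history produces a different pooled vector for each candidate item, whereas mean or sum pooling gives one fixed user representation.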