# 🧠 How the Party Model Works and How It Handles Chinese

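## 1. What Is Party?

Party (PAge-wise Recognition of Text-y) is a text recognition model designed for automatic text recognition (ATR) systems. Its core idea is to **recognize text one page at a time** rather than line by line or character by character, as traditional systems do. This makes Party well suited to historical documents and manuscripts with complex layouts and multiple languages.

Party is designed to replace conventional OCR models that depend on "baseline + bounding polygon" or bounding-box line representations. It locates text lines from simpler prompts (a baseline or a box), so **laborious polygon annotation is no longer required**.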

---

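## 2. Model Architecture

The Party model consists of three main components:

- 🎯 **Swin Vision Transformer encoder**  
  Extracts visual features from the full page image and provides strong spatial understanding.

- 📐 **Baseline positional embeddings**  
  Embed the baseline position of each text line as a spatial prompt that helps the model locate the text.

- 🦙 **Tiny Llama decoder (octet tokenization)**  
  A small Llama model generates the text, predicting it as UTF-8 octets (octet tokenization).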

---

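## 3. Octet Tokenization: How Is Chinese Handled?

Chinese characters have Unicode code points well beyond the 8-bit range; "謝", for example, is `U+8B1D`. Party does not predict Unicode characters directly. It converts the text to UTF-8 and **predicts it byte by byte**:

### Example: "謝"

| Byte   | Binary   | Hexadecimal |
|--------|----------|-------------|
| Byte 1 | 11101000 | E8          |
| Byte 2 | 10101100 | AC          |
| Byte 3 | 10011101 | 9D          |

Party generates `E8 → AC → 9D` in sequence, and a post-processing step then decodes the bytes back into "謝". The advantages of this approach are:

- ✅ Supports any language or symbol, as long as it can be encoded in UTF-8
- ✅ No character set has to be defined in advance
- ✅ The model architecture is simpler and language-independent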
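The byte sequence in the table can be reproduced with Python's built-in UTF-8 codec. This is only a sketch of the encoding and decoding round trip, not Party's own tokenizer code; the variable names are illustrative.

```python
# Encode one character to UTF-8 and list the octets a byte-level decoder would emit.
char = "謝"
code_point = f"U+{ord(char):04X}"
octets = list(char.encode("utf-8"))

print(code_point)                      # U+8B1D
print([f"{b:08b}" for b in octets])    # ['11101000', '10101100', '10011101']
print([f"{b:02X}" for b in octets])    # ['E8', 'AC', '9D']

# Post-processing direction: join the octets and decode them back to text.
restored = bytes(octets).decode("utf-8")
print(restored == char)                # True
```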

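But there are also challenges:

- ❗ A single mispredicted byte can produce a sequence that does not decode to a valid character
- ❗ The output may not follow a consistent Unicode normalization form (for example, with combining marks)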
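Both challenges can be illustrated with the standard library. The sketch below assumes a post-processing step that decodes leniently and then normalizes to NFC; whether Party's own post-processing works this way is not stated here, and `decode_octets` is a hypothetical helper.

```python
import unicodedata

good = bytes([0xE8, 0xAC, 0x9D])  # valid UTF-8 for "謝"
bad = bytes([0xE8, 0xAC, 0x41])   # third octet mispredicted as an ASCII byte

def decode_octets(octets: bytes) -> str:
    """Decode predicted octets, substituting U+FFFD for undecodable sequences."""
    return octets.decode("utf-8", errors="replace")

decoded_good = decode_octets(good)
decoded_bad = decode_octets(bad)

# Strict decoding rejects the broken sequence outright.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Normalization: "e" followed by a combining acute accent versus precomposed "é".
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(decoded_good, repr(decoded_bad), strict_failed)
print(len(decomposed), len(composed))
```

Normalizing both the model output and the ground truth to the same form before comparing them avoids counting visually identical strings as errors.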

---

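## 4. Language Tokens and Chinese Inference

The latest version of Party introduces **language tokens**, which let you specify the language of each line at inference time and keep the model from switching languages at random in multilingual documents. For Chinese documents, it is advisable to set the language token explicitly (e.g. `-l zho`) to improve accuracy.

If no language is specified, the model generates the language tokens itself and guesses which languages a line may contain. For premodern Chinese texts this can lead to misidentification, especially where Classical and vernacular Chinese are mixed.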

---

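## 5. Fine-Tuning Recommendations and Chinese Applications

Party performs well on the languages that are common in its training data, but for less well represented languages such as Chinese, **fine-tuning is still recommended** so that the output matches your transcription guidelines and layout characteristics.

The fine-tuning workflow is:

1. Prepare annotated data in ALTO or PageXML format (with baselines and text)
2. Compile it into Party's dataset format:  
   `party compile -o dataset.arrow *.xml`
3. Train the model:  
   `party train --load-from-repo ... --prompt-mode curves`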
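Step 1 requires every line to carry both a baseline and a transcription, so it can help to check the files before compiling. The sketch below assumes ALTO input in which `TextLine` elements carry a `BASELINE` attribute and `String` elements carry `CONTENT`; the sample document and the `usable_lines` helper are made up for illustration and are not part of Party.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-line ALTO fragment: the second line has no baseline.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine ID="l1" BASELINE="100 50 100 900">
      <String CONTENT="謝靈運"/>
    </TextLine>
    <TextLine ID="l2">
      <String CONTENT="永嘉"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def local_name(tag: str) -> str:
    """Strip the XML namespace so the check works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]

def usable_lines(alto_xml: str):
    """Return (usable, skipped): lines with baseline and text, and the IDs of the rest."""
    root = ET.fromstring(alto_xml)
    usable, skipped = [], []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        text = "".join(
            s.get("CONTENT", "") for s in elem.iter() if local_name(s.tag) == "String"
        )
        if elem.get("BASELINE") and text.strip():
            usable.append((elem.get("ID"), text))
        else:
            skipped.append(elem.get("ID"))
    return usable, skipped

usable, skipped = usable_lines(SAMPLE_ALTO)
print(usable, skipped)
```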

---

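## 6. Summary

With page-level recognition, a flexible octet encoding strategy, and a language token mechanism, Party is a strong tool for processing multilingual documents. For Chinese documents, especially premodern books and manuscripts, its architecture offers highly promising recognition ability, and moderate fine-tuning should be enough to reach a practical level.

If you are working on a Chinese OCR project, Party is worth exploring in depth.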


In [None]:
# Mount Google Drive so that files persist across sessions
from google.colab import drive
drive.mount('/content/drive')

NotImplementedError: Mounting drive is unsupported in this environment. Use PyDrive2 instead. See examples at https://colab.research.google.com/notebooks/io.ipynb#scrollTo=7taylj9wpsA2.

In [None]:
# Install party; it must be reinstalled whenever the runtime is changed

!pip install git+https://github.com/mittagessen/party.git

Collecting git+https://github.com/mittagessen/party.git
  Cloning https://github.com/mittagessen/party.git to /tmp/pip-req-build-mr33i9lk
  Running command git clone --filter=blob:none --quiet https://github.com/mittagessen/party.git /tmp/pip-req-build-mr33i9lk
  Resolved https://github.com/mittagessen/party.git to commit 271d6a7cc2720d068004ac1e3e5d19886ff14cc5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting kraken@ git+https://github.com/mittagessen/kraken.git@348ff35 (from party==0.0.1.dev351)
  Cloning https://github.com/mittagessen/kraken.git (to revision 348ff35) to /tmp/pip-install-4b_wryss/kraken_060b75efe8564d00af66ac98d9ceee54
  Running command git clone --filter=blob:none --quiet https://github.com/mittagessen/kraken.git /tmp/pip-install-4b_wryss/kraken_060b75efe8564d00af66ac98d9ceee54
  Running command git checkout -q 348ff35
  Resolved h

In [None]:
# Compile the datasets: build the training and validation sets from their two directories

!party compile -o /content/drive/MyDrive/party/dataset_train.arrow /content/drive/MyDrive/party/Chinese/Training/*.xml
!party compile -o /content/drive/MyDrive/party/dataset_val.arrow /content/drive/MyDrive/party/Chinese/Validation/*.xml

Compiling dataset ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸  99% 0:00:01 0:00:05 135/136
Output file written to /content/drive/MyDrive/party/dataset_train.arrow


In [None]:
# Train on the CPU

!party train --load-from-checkpoint /content/drive/MyDrive/party/model/checkpoint_04-1.2983.ckpt --workers 4 --epochs 3 --no-validate-before-train -t /content/drive/MyDrive/party/train.lst -e /content/drive/MyDrive/party/val.lst

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
Loading from checkpoint /content/drive/MyDrive/party/model/checkpoint_04-1.2983.ckpt.
2025-09-10 19:04:10.389138: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1757531051.031972   19833 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1757531051.222357   19833 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1757531052.223194   19833 computation_placer.cc:177] computation placer already registered. Please check linkage and 

After the first training run, the model had learned only a single character: "學".

In [None]:
# Out of memory on the T4 GPU runtime
!party -d cuda:0 -v train --load-from-checkpoint /content/drive/MyDrive/party/model/checkpoint_04-1.2983.ckpt -t /content/drive/MyDrive/party/train.lst -e /content/drive/MyDrive/party/val.lst

/bin/bash: line 1: party: command not found


In [None]:
!party -d cuda:0 -v train --load-from-checkpoint /content/drive/MyDrive/party/model/checkpoint_04-1.2983.ckpt -t /content/drive/MyDrive/party/train.lst -e /content/drive/MyDrive/party/val.lst -o /content/drive/MyDrive/party/model/model_stageB

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
2025-09-10 20:58:32.781250: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-09-10 20:58:32.797426: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1757537912.817520   15917 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1757537912.823615   15917 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attemp

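### Splitting and Compiling the Hainan Dataset (9:1)

The cells below record how to split the PageXML/ALTO files in a directory 9:1 into a training set and an evaluation set, and how to generate the manifests and arrow data files that party needs for training.
The parameters shown point at the Yongjia gazetteer data; to reuse the cells for another dataset, only the paths in the parameter cell need to be changed.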


In [1]:
from pathlib import Path

# Locate the project root automatically so the notebook can be reused from different paths
project_root = Path.cwd().resolve()
while project_root.parent != project_root and not (project_root / 'party_env' / 'bin' / 'party').exists():
    project_root = project_root.parent

party_bin = project_root / 'party_env' / 'bin' / 'party'
if not party_bin.exists():
    raise FileNotFoundError('party_env/bin/party not found; adjust project_root or set party_bin manually.')

# ---- Parameters to adjust as needed ----
# data_root is an absolute path outside the project, so it is not joined onto project_root
data_root = Path('/mnt/d/光緒永嘉縣志/02086211.cn')
train_files_list = project_root / 'train/Chinese/yongjia_train_files.txt'
eval_files_list = project_root / 'train/Chinese/yongjia_files.txt'
train_arrow = project_root / 'train/Chinese/yongjia_train.arrow'
eval_arrow = project_root / 'train/Chinese/yongjia_eval.arrow'
train_manifest = project_root / 'train_yongjia.lst'
eval_manifest = project_root / 'val_yongjia.lst'
split_ratio = 0.9  # fraction of files used for training
seed = 42  # random seed so that the split is reproducible
# --------------------------------

print(f'Project root: {project_root}')
print(f'Data directory: {data_root}')


Project root: /home/sheng/party
Data directory: /mnt/d/光緒永嘉縣志/02086211.cn


In [2]:
import random
import subprocess
from pathlib import Path

def _write_list(path: Path, items):
    # Write one absolute file path per line, creating the parent directory if needed
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [str(p.resolve()) for p in items]
    path.write_text('\n'.join(lines) + '\n', encoding='utf-8')

def _rel_to_project(path: Path) -> str:
    # Prefer a path relative to project_root; fall back to the absolute path
    path = path.resolve()
    try:
        return str(path.relative_to(project_root))
    except ValueError:
        return str(path)

# Sort before shuffling so that the seeded split is reproducible
xml_files = sorted(data_root.glob('*.xml'))
if not xml_files:
    raise RuntimeError(f'No XML files found in {data_root}; please check the path.')

random.seed(seed)
random.shuffle(xml_files)
split_idx = int(len(xml_files) * split_ratio)
if split_idx <= 0 or split_idx >= len(xml_files):
    raise ValueError('split_ratio leaves one subset empty; adjust it and retry.')

train_files = xml_files[:split_idx]
eval_files = xml_files[split_idx:]

_write_list(train_files_list, train_files)
_write_list(eval_files_list, eval_files)

# Compile each file list into an arrow dataset with `party compile`
subprocess.run(
    [str(party_bin), 'compile', '-o', str(train_arrow), '-F', str(train_files_list)],
    check=True,
    cwd=project_root,
)
subprocess.run(
    [str(party_bin), 'compile', '-o', str(eval_arrow), '-F', str(eval_files_list)],
    check=True,
    cwd=project_root,
)

# Each manifest holds the path of its arrow file, relative to project_root where possible
train_manifest.parent.mkdir(parents=True, exist_ok=True)
eval_manifest.parent.mkdir(parents=True, exist_ok=True)
train_manifest.write_text(_rel_to_project(train_arrow) + '\n', encoding='utf-8')
eval_manifest.write_text(_rel_to_project(eval_arrow) + '\n', encoding='utf-8')

print(f'Total samples: {len(xml_files)} -> train {len(train_files)}, eval {len(eval_files)}')
print(f'Train list: {train_files_list}')
print(f'Eval list: {eval_files_list}')
print(f'Train manifest: {train_manifest}')
print(f'Eval manifest: {eval_manifest}')


Compiling dataset ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸  99% 0:00:01 0:00:02 130/131
Output file written to /home/sheng/party/train/Chinese/yongjia_train.arrow
Compiling dataset ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00 15/15
Outpu
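The split logic can be sanity-checked in isolation. The sketch below mirrors the sort, seeded shuffle, and slice from the cell above on 146 placeholder file names, a count inferred from the 131 and 15 files shown in the progress bars. `split_files` is a hypothetical helper, and it uses a private `random.Random` instance rather than the global seed used in the cell.

```python
import random

def split_files(files, ratio=0.9, seed=42):
    """Sort, shuffle with a fixed seed, and cut the list at int(len * ratio)."""
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    idx = int(len(shuffled) * ratio)
    return shuffled[:idx], shuffled[idx:]

# Placeholder names standing in for the real XML files.
names = [f"page_{i:03d}.xml" for i in range(146)]
train, held_out = split_files(names)

print(len(train), len(held_out))           # 131 15
print(set(train).isdisjoint(held_out))     # True: no page appears in both subsets
print(split_files(names) == (train, held_out))  # True: the same seed gives the same split
```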

In [None]:
# Convert the model format so that it can be used on different platforms
!party convert -o model.safetensors checkpoint.ckpt

In [2]:
# The input path contains a space, so it must be quoted. Without quotes the shell split it in two
# and `party ocr` stopped with: Error: Invalid value for '-i' / '--input' (file does not exist).
!party -d cuda:0 ocr -i "/mnt/c/Users/sheng/Downloads/export_doc15__alto_202509301620/92_正德琼台志_海南省.pdf_ 历代地方志_Z-Library1.pdf_page_251.xml" out.xml --load-from-file model.safetensors
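The input file name for `party ocr` contains a space, and an unquoted space makes the shell pass the path as two separate arguments. The sketch below uses Python's `shlex` module to show the difference. The command strings are built for illustration only and nothing is executed.

```python
import shlex

path = ("/mnt/c/Users/sheng/Downloads/export_doc15__alto_202509301620/"
        "92_正德琼台志_海南省.pdf_ 历代地方志_Z-Library1.pdf_page_251.xml")

unquoted = f"party ocr -i {path} out.xml"
quoted = f"party ocr -i {shlex.quote(path)} out.xml"

# shlex.split tokenizes the way a POSIX shell would.
unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(len(unquoted_args), unquoted_args[3])  # the path is cut off at the space
print(len(quoted_args), quoted_args[3] == path)
```

When commands are assembled in Python, passing the arguments as a list to `subprocess.run` avoids shell splitting altogether.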


In [None]:
# Train the model; adding the -B 8 option seems to give higher accuracy

!ketos -v -d cuda:0 train -i /home/sheng/models/chat_rec.mlmodel --resize union -f binary -t party/ketos_train.lst -e party/ketos_eval.lst -o hn_from_chat -B 8

In [None]:
!ketos segtrain -d cuda:0 -f alto -t train/Chinese/hainan_train_files.txt -e train/Chinese/hainan_eval_files.txt -o hn_from_chat_seg --workers 4

In [6]:
import torch
torch.set_float32_matmul_precision('high')  # or 'medium'

# `!cd` only changes directory inside a throwaway subshell; `%cd` changes it for the notebook
%cd /home/sheng/party/
!ketos segtrain -d cuda:0 -i /home/sheng/models/chat_seg.mlmodel -f alto -t /home/sheng/party/train/Chinese/hainan_train_files.txt -e /home/sheng/party/train/Chinese/hainan_eval_files.txt -o hn_from_chat_seg --resize both

"train/Chinese/hainan/92_正德琼台志_海南省.pdf_历代地方志_Z-Library1.pdf_page_150.xml" in /home/sheng/party/train/Chinese/hainan_train_files.txt
"train/Chinese/hainan/92_正德琼台志_海南省.pdf_历代地方志_Z-Library1.pdf_page_187.xml" in /home/sheng/party/train/Chinese/hainan_train_files.txt

In [None]:
# Evaluate the model
!party -d cuda:0 test --curves \
      --load-from-file model_15.safetensors \
      party/train/Chinese/hainan_eval.arrow
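To read evaluation results, it helps to know how a character-level accuracy is commonly computed. The sketch below is a generic edit-distance implementation and assumes accuracy is defined as 1 minus the character error rate; it is not taken from party's code, and the exact metric that `party test` reports may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def char_accuracy(ref: str, hyp: str) -> float:
    """1 - CER, clamped at 0; an empty reference only matches an empty hypothesis."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))

# One wrong character out of three (simplified 琼 predicted for traditional 瓊).
print(edit_distance("瓊臺志", "琼臺志"))
print(round(char_accuracy("瓊臺志", "琼臺志"), 3))
```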