# Tutorial: the full mbe-tools workflow (v0.2.0)

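Goal: XYZ → fragmentation → sampling → MBE subset geometries → input files → scheduler templates → parsing → analysis → quick inspection and calculation (show/info/calc/save/compare).
Example data: `notebooks/data/water3.xyz` (a water trimer). By default, output is written to `notebooks/data/demo_cn/`.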

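## 1) Environment and configuration

- Install: `python -m pip install -e .[analysis,cli]`
- Precedence: environment variables → `~/.config/mbe-tools/config.toml` → `./mbe.toml` → `load_settings(path=...)`.
- Data lives in `notebooks/data/`; this tutorial writes to `notebooks/data/demo_cn/`.
- Set `DO_RUN=False` first to inspect the commands, then switch it to `True` once they look right.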

In [None]:
import sys, pathlib, importlib

BASE_DIR = pathlib.Path.cwd().parent
DATA_ROOT = BASE_DIR / "notebooks" / "data"
DEMO_ROOT = DATA_ROOT / "demo_cn"
DEMO_ROOT.mkdir(parents=True, exist_ok=True)
print("Data directory:", DATA_ROOT)
print("Output directory:", DEMO_ROOT)

print("Python", sys.version.split()[0])
for mod in ["numpy", "pandas", "matplotlib", "seaborn", "mbe_tools"]:
    try:
        m = importlib.import_module(mod)
        ver = getattr(m, "__version__", "(no __version__)")
        print(f"{mod}: {ver}")
    except ImportError:
        print(f"{mod} is not installed")
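
The `DO_RUN` switch used throughout this tutorial can be factored into a small helper. The sketch below is not part of mbe-tools: `run_or_print` is a hypothetical name, and the demo command simply calls the current Python interpreter so that the example runs without the `mbe` CLI.

```python
import shlex
import subprocess
import sys


def run_or_print(cmd, do_run=False):
    """Print a command; execute it only when do_run is True."""
    print("command:", shlex.join(cmd))  # shell-quoted, safe to copy and paste
    if not do_run:
        return None
    return subprocess.run(cmd, check=True)


# Hypothetical demo command: ask the current interpreter to print a message.
demo_cmd = [sys.executable, "-c", "print('hello from a subprocess')"]
dry_result = run_or_print(demo_cmd, do_run=False)  # prints the command only
live_result = run_or_print(demo_cmd, do_run=True)  # prints, then runs it
```

Returning `None` in dry-run mode lets later code tell whether a step actually ran.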

## 2) CLI: fragmentation + sampling (including the water heuristic)
- Command: `mbe fragment <xyz> --n --seed --mode spatial --require-ion --prefer-special --out-xyz`
- Uses the example `notebooks/data/water3.xyz` and writes to `DEMO_ROOT`.

In [None]:
import subprocess, shutil, sys
from pathlib import Path

DO_RUN = False  # start with False to only print the commands; switch to True once they look right
xyz_path = DATA_ROOT / "water3.xyz"
sampled_xyz = DEMO_ROOT / "water3_sampled.xyz"
mbe_bin = Path(sys.executable).parent / "mbe"
cmd_fragment = [
    str(mbe_bin), "fragment", str(xyz_path),
    "--n", "2", "--seed", "42",
    "--mode", "spatial", "--require-ion", "--prefer-special",
    "--out-xyz", str(sampled_xyz),
]
print("Using CLI:", mbe_bin)
print("Fragment command:", " ".join(cmd_fragment))
if DO_RUN:
    if not mbe_bin.exists():
        raise SystemExit("Install first: python -m pip install -e .[analysis,cli]")
    subprocess.run(cmd_fragment, check=True)
    print("Wrote", sampled_xyz)
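
The cell above expects the `mbe` script to sit next to the running interpreter. If it might be installed elsewhere, a fallback to `PATH` is one option. This is a minimal sketch, not mbe-tools behaviour; `find_cli` is a hypothetical helper.

```python
import shutil
import sys
from pathlib import Path


def find_cli(name):
    """Return the path to a console script, or None if it cannot be found."""
    # First look next to the running interpreter (typical for virtualenvs).
    candidate = Path(sys.executable).parent / name
    if candidate.exists():
        return candidate
    # Otherwise fall back to whatever is on PATH.
    found = shutil.which(name)
    return Path(found) if found else None


# A deliberately nonexistent tool name, to show the "not found" case.
missing = find_cli("definitely-not-an-installed-cli-12345")
print("lookup result for a missing tool:", missing)
```

With mbe-tools installed, `find_cli("mbe")` would be used in place of the hard-coded `mbe_bin` path.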

## 3) CLI: generate MBE subset geometries (up to order K)
- Command: `mbe gen <xyz> --backend --max-order --cp --out-dir`.
- Output directory: `DEMO_ROOT/mbe_geoms_cli/`.

In [None]:
geom_dir_cli = DEMO_ROOT / "mbe_geoms_cli"
cmd_gen = [
    str(mbe_bin), "gen", str(sampled_xyz),
    "--backend", "qchem",
    "--max-order", "2",
    "--cp",
    "--out-dir", str(geom_dir_cli),
]
print("Geometry generation command:", " ".join(cmd_gen))
if DO_RUN:
    geom_dir_cli.mkdir(exist_ok=True)
    subprocess.run(cmd_gen, check=True)
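
To make "up to order K" concrete, the sketch below enumerates every fragment subset of size 1 through `max_order`. It illustrates the combinatorics only and is not the mbe-tools implementation. The naming scheme in `job_name` is an assumption modelled on the file name `qchem_k2_f0-1.geom` that appears in this tutorial.

```python
from itertools import combinations


def enumerate_subsets(n_fragments, max_order):
    """Return all fragment-index tuples of size 1..max_order, ordered by size."""
    subsets = []
    for order in range(1, max_order + 1):
        subsets.extend(combinations(range(n_fragments), order))
    return subsets


def job_name(backend, subset):
    """Hypothetical naming scheme: <backend>_k<order>_f<i>-<j>-..."""
    indices = "-".join(str(i) for i in subset)
    return f"{backend}_k{len(subset)}_f{indices}"


# Three fragments (a water trimer) up to second order: 3 monomers + 3 dimers.
subsets = enumerate_subsets(n_fragments=3, max_order=2)
names = [job_name("qchem", s) for s in subsets]
print(len(subsets), "subsets:", names)
```

The number of subsets grows as the sum of binomial coefficients C(N, k) for k = 1..K, which is why sampling a smaller cluster first keeps the job count manageable.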

## 4) CLI: generate Q-Chem / ORCA inputs
- Command: `mbe build-input <geom> --backend qchem|orca --method --basis --out`.
- Batch mode: pass a directory with `--glob "*.geom" --out-dir <dir>` to write one input per matching .geom file.
- The example shows Q-Chem REM parameters.

In [None]:
geom_example = geom_dir_cli / "qchem_k2_f0-1.geom"

# Fall back to the first available .geom file if the expected one is missing
if not geom_example.exists():
    geom_files = sorted(geom_dir_cli.glob("*.geom"))
    if geom_files:
        geom_example = geom_files[0]
        print(f"Using geometry file: {geom_example.name}")
    else:
        print(f"Error: no .geom files found in {geom_dir_cli}")
        print("Run section 3 first to generate the geometry files")
        geom_example = None

if geom_example:
    cmd_build_qchem = [
        str(mbe_bin), "build-input", str(geom_example),
        "--backend", "qchem",
        "--method", "wb97m-v", "--basis", "def2-svpd",
        "--out", str(DEMO_ROOT / "qchem.inp"),
        "--thresh", "14", "--scf-convergence", "8",
    ]
    cmd_build_orca = [
        str(mbe_bin), "build-input", str(geom_example),
        "--backend", "orca",
        "--method", "wb97m-v", "--basis", "def2-svpd",
        "--out", str(DEMO_ROOT / "orca.inp"),
    ]
    batch_out_dir = DEMO_ROOT / "inputs_batch"
    cmd_build_batch = [
        str(mbe_bin), "build-input", str(geom_dir_cli),
        "--backend", "qchem",
        "--method", "wb97m-v", "--basis", "def2-svpd",
        "--glob", "*.geom", "--out-dir", str(batch_out_dir),
        "--thresh", "14", "--scf-convergence", "8",
    ]
    print("Q-Chem input command:", " ".join(cmd_build_qchem))
    print("ORCA input command  :", " ".join(cmd_build_orca))
    print("Batch build command :", " ".join(cmd_build_batch))
    if DO_RUN:
        subprocess.run(cmd_build_qchem, check=True)
        subprocess.run(cmd_build_orca, check=True)
        batch_out_dir.mkdir(exist_ok=True)
        subprocess.run(cmd_build_batch, check=True)
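
For orientation, a Q-Chem input consists of a `$molecule` section (charge, multiplicity, coordinates) and a `$rem` section (job settings). The sketch below assembles such a file by hand. It is not the output of `mbe build-input`: `render_qchem_sketch` is a hypothetical function, the coordinates are illustrative, and the assumption that a .geom file holds plain `element x y z` lines is not confirmed by this tutorial.

```python
def render_qchem_sketch(geom_lines, method, basis, charge=0, multiplicity=1,
                        rem_extra=None):
    """Assemble a minimal Q-Chem single-point input as a string."""
    rem = {"jobtype": "sp", "method": method, "basis": basis}
    rem.update(rem_extra or {})  # e.g. thresh, scf_convergence
    lines = ["$molecule", f"{charge} {multiplicity}"]
    lines.extend(geom_lines)
    lines.extend(["$end", "", "$rem"])
    lines.extend(f"  {key} {value}" for key, value in rem.items())
    lines.append("$end")
    return "\n".join(lines) + "\n"


# Illustrative water monomer (coordinates in angstrom, not from water3.xyz).
water = [
    "O  0.000000  0.000000  0.000000",
    "H  0.960000  0.000000  0.000000",
    "H -0.240000  0.930000  0.000000",
]
qchem_text = render_qchem_sketch(
    water, method="wb97m-v", basis="def2-svpd",
    rem_extra={"thresh": 14, "scf_convergence": 8},
)
print(qchem_text)
```

Keeping the REM settings in a dictionary mirrors the CLI flags `--thresh` and `--scf-convergence`: extra keywords can be added without touching the template.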

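## 5) CLI: scheduler templates + run-control
- Command: `mbe template --scheduler pbs|slurm ... --chunk-size ... --out ...`. With `--wrapper`, it produces a submission script that can be run directly with `bash job.sh`; the script writes a hidden `._*.pbs/.sbatch` file internally and then calls qsub/sbatch.
- run-control: place `mbe.control.toml` or `<input>.mbe.control.toml` next to the input file.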

In [None]:
pbs_path = DEMO_ROOT / "qchem.pbs"
slurm_path = DEMO_ROOT / "orca.sbatch"
wrapper_path = DEMO_ROOT / "submit_qchem.sh"
cmd_pbs = [str(mbe_bin), "template", "--scheduler", "pbs", "--backend", "qchem", "--job-name", "mbe-qchem", "--chunk-size", "10", "--out", str(pbs_path)]
cmd_slurm = [str(mbe_bin), "template", "--scheduler", "slurm", "--backend", "orca", "--job-name", "mbe-orca", "--chunk-size", "5", "--out", str(slurm_path)]
cmd_wrapper = [str(mbe_bin), "template", "--scheduler", "pbs", "--backend", "qchem", "--job-name", "mbe-qchem-wrap", "--chunk-size", "10", "--out", str(wrapper_path), "--wrapper"]
print("PBS template:", " ".join(cmd_pbs))
print("Slurm template:", " ".join(cmd_slurm))
print("Wrapper submission script:", " ".join(cmd_wrapper))
if DO_RUN:
    subprocess.run(cmd_pbs, check=True)
    subprocess.run(cmd_slurm, check=True)
    subprocess.run(cmd_wrapper, check=True)
    print("Usage: bash", wrapper_path)
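
The effect of a chunk size can be pictured as splitting the list of input files into consecutive batches. The sketch below assumes that `--chunk-size` means "inputs per batch"; that reading is an assumption, and the file names are made up.

```python
def chunk(items, chunk_size):
    """Split a list into consecutive batches of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Hypothetical input files: 23 jobs in batches of 10 gives sizes 10, 10, 3.
inputs = [f"job_{i:03d}.inp" for i in range(23)]
batches = chunk(inputs, 10)
for number, batch in enumerate(batches):
    print(f"batch {number}: {len(batch)} inputs, first = {batch[0]}")
```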


## 6) CLI: parse outputs into JSONL
- Command: `mbe parse <root> --program auto --glob-pattern "*.out" --out parsed.jsonl`.
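- Auto-detects the program and metadata; embeds the cluster geometry from singleton outputs (dropping ghost atoms); if subset metadata is missing, falls back to the first parseable geometry as monomer 0.
- Default JSONL selection (shared by show/info/calc/save/compare): an explicit path wins; otherwise the order is `run.jsonl → parsed.jsonl → the only *.jsonl → the newest`.
- Put real `.out` files in `DEMO_ROOT/Output/`.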

In [None]:
output_root = DEMO_ROOT / "Output"
parsed_jsonl = DEMO_ROOT / "parsed.jsonl"
cmd_parse = [str(mbe_bin), "parse", str(output_root), "--program", "auto", "--glob-pattern", "*.out", "--out", str(parsed_jsonl)]
print("Parse command:", " ".join(cmd_parse))
if DO_RUN:
    output_root.mkdir(exist_ok=True)
    subprocess.run(cmd_parse, check=True)
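
The default JSONL selection rule described above can be written down in a few lines. The sketch below is one reading of that rule, not the mbe-tools source: `select_default_jsonl` is a hypothetical name, and "newest" is assumed to mean the most recent modification time.

```python
import os
import tempfile
from pathlib import Path


def select_default_jsonl(directory, explicit=None):
    """Pick a JSONL file: explicit path, run.jsonl, parsed.jsonl, then newest."""
    if explicit is not None:
        return Path(explicit)
    directory = Path(directory)
    for preferred in ("run.jsonl", "parsed.jsonl"):
        candidate = directory / preferred
        if candidate.is_file():
            return candidate
    others = sorted(directory.glob("*.jsonl"))
    if not others:
        return None
    if len(others) == 1:
        return others[0]
    return max(others, key=lambda p: p.stat().st_mtime)  # assumed: newest mtime


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for offset, name in enumerate(["old.jsonl", "new.jsonl"]):
        path = tmp / name
        path.write_text("{}\n")
        os.utime(path, (1_000_000 + offset, 1_000_000 + offset))  # fixed mtimes
    picked_newest = select_default_jsonl(tmp).name
    (tmp / "parsed.jsonl").write_text("{}\n")
    picked_parsed = select_default_jsonl(tmp).name
    (tmp / "run.jsonl").write_text("{}\n")
    picked_run = select_default_jsonl(tmp).name

print(picked_newest, picked_parsed, picked_run)
```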

## 7) CLI: analyze JSONL → CSV/Excel/Plot
- Command: `mbe analyze parsed.jsonl --to-csv --to-xlsx --plot`.

In [None]:
csv_path = DEMO_ROOT / "results.csv"
xlsx_path = DEMO_ROOT / "results.xlsx"
plot_path = DEMO_ROOT / "mbe.png"
cmd_analyze = [str(mbe_bin), "analyze", str(parsed_jsonl), "--to-csv", str(csv_path), "--to-xlsx", str(xlsx_path), "--plot", str(plot_path)]
print("Analyze command:", " ".join(cmd_analyze))
if DO_RUN:
    subprocess.run(cmd_analyze, check=True)

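## 7.5) CLI: quick view / calc / archive / compare
- New commands: `mbe show`, `mbe info`, `mbe calc`, `mbe save`, `mbe compare`.
- Default JSONL selection: an explicit path wins; otherwise the order is `run.jsonl → parsed.jsonl → the only *.jsonl → the newest`.
- `calc` supports `--scheme simple|strict`, `--unit hartree|kcal|kj`, `--to/--from`, and `--monomer`, and refuses data that mixes program/method/basis/grid/cp settings.
- `save` archives the JSONL under `cluster_id/<timestamp>/`; `compare` accepts a directory or glob and compares CPU usage and record counts across multiple JSONL files.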

In [None]:
cmd_show = [str(mbe_bin), "show", str(parsed_jsonl), "--monomer", "0"]
cmd_info = [str(mbe_bin), "info"]  # relies on the default JSONL selection
cmd_calc = [str(mbe_bin), "calc", str(parsed_jsonl), "--unit", "kcal", "--monomer", "0"]
archive_dir = DEMO_ROOT / "archives"
cmd_save = [str(mbe_bin), "save", str(parsed_jsonl), "--dest", str(archive_dir)]
cmd_compare = [str(mbe_bin), "compare", str(archive_dir)]

for name, cmd in [
    ("show", cmd_show),
    ("info", cmd_info),
    ("calc", cmd_calc),
    ("save", cmd_save),
    ("compare", cmd_compare),
]:
    print(f"{name}:", " ".join(cmd))

if DO_RUN:
    if not parsed_jsonl.exists():
        raise SystemExit("parsed.jsonl is missing; parse the outputs first")
    archive_dir.mkdir(exist_ok=True)
    subprocess.run(cmd_show, check=True)
    subprocess.run(cmd_info, check=True)
    subprocess.run(cmd_calc, check=True)
    subprocess.run(cmd_save, check=True)
    subprocess.run(cmd_compare, check=True)
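
Two of the `calc` behaviours listed above are easy to illustrate: unit conversion and the refusal to mix calculation settings. The sketch below is not the `mbe calc` implementation, and the function names are hypothetical. The conversion factors are the standard values 1 hartree ≈ 627.509474 kcal/mol ≈ 2625.499639 kJ/mol.

```python
HARTREE_TO = {"hartree": 1.0, "kcal": 627.509474, "kj": 2625.499639}


def convert_energy(value_hartree, unit):
    """Convert an energy in hartree to the requested unit."""
    return value_hartree * HARTREE_TO[unit]


def check_consistent(records, keys=("program", "method", "basis", "grid", "cp")):
    """Raise ValueError if the records disagree on any of the given keys."""
    for key in keys:
        values = {rec.get(key) for rec in records}
        if len(values) > 1:
            raise ValueError(f"mixed {key} values: {sorted(map(str, values))}")


same = [{"program": "qchem", "method": "wb97m-v"},
        {"program": "qchem", "method": "wb97m-v"}]
mixed = same + [{"program": "orca", "method": "wb97m-v"}]

check_consistent(same)  # passes silently
try:
    check_consistent(mixed)
    mixed_rejected = False
except ValueError as exc:
    mixed_rejected = True
    print("rejected:", exc)

print("1 hartree in kcal/mol:", convert_energy(1.0, "kcal"))
```

Checking consistency before combining energies matters because MBE terms are differences of total energies, which are only meaningful at a single level of theory.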

## 8) Python: read an XYZ file and fragment it (water heuristic vs. connectivity)

In [None]:
from mbe_tools.cluster import read_xyz, fragment_by_water_heuristic, fragment_by_connectivity

xyz = read_xyz(str(xyz_path))
frags_water = fragment_by_water_heuristic(xyz, oh_cutoff=1.25)
frags_conn = fragment_by_connectivity(xyz, scale=1.2)
print("Water-heuristic fragments:", len(frags_water))
print("Connectivity fragments   :", len(frags_conn))
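
The idea behind a water heuristic with an O-H cutoff can be sketched without mbe-tools: attach each hydrogen to its nearest oxygen if the distance is within the cutoff. This is an illustration under that assumption, not the code behind `fragment_by_water_heuristic`. The geometry is a made-up trimer of well-separated waters, and hydrogens outside the cutoff are simply left unassigned here.

```python
import math


def fragment_waters(atoms, oh_cutoff=1.25):
    """Group each H with its nearest O (within oh_cutoff); return index lists."""
    oxygens = [i for i, (element, *_) in enumerate(atoms) if element == "O"]
    fragments = {o: [o] for o in oxygens}
    for i, (element, x, y, z) in enumerate(atoms):
        if element != "H":
            continue
        nearest = min(oxygens, key=lambda o: math.dist((x, y, z), atoms[o][1:]))
        if math.dist((x, y, z), atoms[nearest][1:]) <= oh_cutoff:
            fragments[nearest].append(i)
    return [sorted(members) for members in fragments.values()]


# Illustrative geometry (angstrom): one monomer copied to three positions.
monomer = [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)]
shifts = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
atoms = [(el, x + dx, y + dy, z + dz)
         for dx, dy, dz in shifts
         for el, x, y, z in monomer]

water_fragments = fragment_waters(atoms)
print(water_fragments)
```

A connectivity-based fragmenter would instead link any two atoms closer than a scaled sum of covalent radii, so the two approaches can disagree for ions or stretched bonds.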

## 9) Python: deterministic sampling (seed + optional ion retention) and writing back to XYZ

In [None]:
from mbe_tools.cluster import sample_fragments, write_xyz
sampled_py = sample_fragments(frags_conn, n=min(2, len(frags_conn)), seed=42, require_ion=False)
sampled_py_path = DEMO_ROOT / "sampled_py.xyz"
write_xyz(str(sampled_py_path), sampled_py, comment="Python sampling example")
print("Wrote", sampled_py_path)
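
"Deterministic" here means that the same seed always yields the same selection. The sketch below shows one way to get that with a private `random.Random` instance while always keeping certain fragments, such as ions. It is not the `sample_fragments` implementation; `sample_deterministic` and its `required` parameter are hypothetical.

```python
import random


def sample_deterministic(fragments, n, seed, required=()):
    """Pick n fragments reproducibly; indices in 'required' are always kept."""
    required = list(dict.fromkeys(required))  # drop duplicates, keep order
    if len(required) > n:
        raise ValueError("more required fragments than requested samples")
    pool = [i for i in range(len(fragments)) if i not in required]
    rng = random.Random(seed)  # private generator, global state untouched
    chosen = required + rng.sample(pool, n - len(required))
    return [fragments[i] for i in sorted(chosen)]


# Placeholder fragment labels; index 4 stands in for an ion that must be kept.
fragments = ["w0", "w1", "w2", "w3", "ion"]
first = sample_deterministic(fragments, n=3, seed=42, required=[4])
second = sample_deterministic(fragments, n=3, seed=42, required=[4])
print(first, first == second)
```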

## 10) Python: generate MBE subsets (with CP) and write .geom files

In [None]:
from mbe_tools.mbe import MBEParams, generate_subsets_xyz
mbe_params = MBEParams(max_order=2, cp_correction=True, backend="qchem")
geom_dir_py = DEMO_ROOT / "mbe_geoms_py"
geom_dir_py.mkdir(exist_ok=True)
subset_jobs = list(generate_subsets_xyz(sampled_py, mbe_params))
print("Number of subsets:", len(subset_jobs))
if subset_jobs:
    job_id, subset_indices, geom_text = subset_jobs[0]
    print("Example job_id", job_id, "subset_indices", subset_indices)
for job_id, subset_indices, geom_text in subset_jobs:
    (geom_dir_py / f"{job_id}.geom").write_text(geom_text)
print("Wrote .geom files to", geom_dir_py)
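
Counterpoise (CP) correction evaluates a subset in the presence of the basis functions of atoms outside it, which are then entered as ghost atoms. In Q-Chem inputs a ghost atom is written with an `@` prefix. The sketch below builds such a geometry, assuming the ghosts are all remaining atoms of the cluster. Whether mbe-tools uses the full cluster or another reference for CP is not stated in this tutorial, and `cp_geometry` is a hypothetical name.

```python
def cp_geometry(fragments, subset):
    """Return geometry lines with atoms outside 'subset' marked as ghosts (@)."""
    lines = []
    for index, atoms in enumerate(fragments):
        prefix = "" if index in subset else "@"
        for element, x, y, z in atoms:
            lines.append(f"{prefix}{element} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines)


# Illustrative dimer: two waters 3 angstrom apart (made-up coordinates).
dimer = [
    [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)],
    [("O", 3.0, 0.0, 0.0), ("H", 3.96, 0.0, 0.0), ("H", 2.76, 0.93, 0.0)],
]
cp_text = cp_geometry(dimer, subset=(0,))
print(cp_text)
```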

## 11) Python: render Q-Chem / ORCA inputs (as strings)

In [None]:
from pathlib import Path
from mbe_tools.input_builder import render_qchem_input, render_orca_input

geom_pick = next(iter(geom_dir_py.glob("*.geom")), None)
if geom_pick is None:
    raise SystemExit("No .geom file found; run the previous step first")
geom_text = Path(geom_pick).read_text()

qchem_inp = render_qchem_input(
    geom_text,
    method="wb97m-v",
    basis="def2-svpd",
    charge=0,
    multiplicity=1,
    thresh=14,
    scf_convergence="8",
    rem_extra=None,
)
orca_inp = render_orca_input(
    geom_text,
    method="wb97m-v",
    basis="def2-svpd",
    charge=0,
    multiplicity=1,
    grid="Grid5",
)
(DEMO_ROOT / "input_qchem.inp").write_text(qchem_inp)
(DEMO_ROOT / "input_orca.inp").write_text(orca_inp)
print("Wrote input files to", DEMO_ROOT)

## 12) Python: generate scheduler scripts programmatically (PBS/Slurm)

In [None]:
from mbe_tools.hpc_templates import render_pbs_qchem, render_slurm_orca

pbs_text = render_pbs_qchem(
    job_name="mbe-qchem",
    walltime="24:00:00",
    ncpus=32,
    mem_gb=64.0,
    queue="normal",
    project="proj123",
    module="qchem/5.2.2",
    input_glob=str(DEMO_ROOT / "input_qchem*.inp"),
    chunk_size=10,
)
slurm_text = render_slurm_orca(
    job_name="mbe-orca",
    walltime="24:00:00",
    ntasks=1,
    cpus_per_task=32,
    mem_gb=64.0,
    partition="work",
    account="proj123",
    module="orca/5.0.3",
    input_glob=str(DEMO_ROOT / "input_orca*.inp"),
    chunk_size=5,
)
(DEMO_ROOT / "pbs_api.pbs").write_text(pbs_text)
(DEMO_ROOT / "slurm_api.sbatch").write_text(slurm_text)
print("Wrote scheduler scripts to", DEMO_ROOT)

## 13) Python: parse outputs, infer metadata, and write JSONL

In [None]:
import json
from mbe_tools.parsers.io import glob_paths, parse_files

out_paths = glob_paths(str(output_root), "*.out") if output_root.exists() else []
print("Output files found:", len(out_paths))
if out_paths:
    records = parse_files(out_paths, program="auto", infer_metadata=True)
    parsed_jsonl_py = DEMO_ROOT / "parsed_api.jsonl"
    with open(parsed_jsonl_py, "w") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    print("Wrote", parsed_jsonl_py)
else:
    print("Put .out files in", output_root)

## 14) Python: assemble the MBE energy and tabulate it

In [None]:
from mbe_tools.mbe_math import assemble_mbe_energy, order_totals_as_rows

records_for_energy = [
    {"subset_indices": [0], "energy_hartree": -1.0},
    {"subset_indices": [1], "energy_hartree": -1.1},
    {"subset_indices": [0, 1], "energy_hartree": -2.25},
]
res = assemble_mbe_energy(records_for_energy, max_order=2)
rows = order_totals_as_rows(res["order_totals"])
print("order_totals:", res["order_totals"])
print("rows:", rows)
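
The arithmetic behind the toy records above can be checked by hand. In a many-body expansion the contribution of a subset is its total energy minus the contributions of all its proper sub-subsets, so for a dimer ΔE(0,1) = E(0,1) − E(0) − E(1). The sketch below implements that recursion independently of `assemble_mbe_energy`, whose return structure may differ; `mbe_totals` is a hypothetical name, and the function assumes every sub-subset of each entry is present.

```python
from itertools import combinations


def mbe_totals(energies, max_order):
    """Return per-subset contributions and cumulative totals by order.

    energies maps sorted fragment-index tuples to total energies (hartree).
    """
    contributions = {}
    for subset in sorted(energies, key=len):
        if len(subset) > max_order:
            continue
        # ΔE(S) = E(S) minus the contributions of every proper sub-subset.
        lower = sum(
            contributions[sub]
            for k in range(1, len(subset))
            for sub in combinations(subset, k)
        )
        contributions[subset] = energies[subset] - lower
    totals, running = {}, 0.0
    for order in range(1, max_order + 1):
        running += sum(v for s, v in contributions.items() if len(s) == order)
        totals[order] = running
    return contributions, totals


toy_records = [
    {"subset_indices": [0], "energy_hartree": -1.0},
    {"subset_indices": [1], "energy_hartree": -1.1},
    {"subset_indices": [0, 1], "energy_hartree": -2.25},
]
energies = {tuple(sorted(r["subset_indices"])): r["energy_hartree"]
            for r in toy_records}
contributions, totals = mbe_totals(energies, max_order=2)
print("pair contribution:", contributions[(0, 1)])
print("totals by order:", totals)
```

For these records the first-order total is −1.0 + (−1.1) = −2.1 hartree and the pair contribution is −2.25 − (−2.1) = −0.15 hartree, so the second-order total recovers the dimer energy of −2.25 hartree.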

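## 15) End-to-end run notes
1) CLI workflow: run the cells in sections 2–7.5 (including show/info/calc/save/compare); put real `.out` files in `Output` before running the parse and analysis steps.
2) Python workflow: run the cells in sections 8–14; the CLI is not needed.
3) All artifacts end up in `notebooks/data/demo_cn/`.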