# Tutorial (EN): mbe-tools End-to-End (v0.2.0)

Goal: XYZ → fragmentation → sampling → MBE geometries → inputs → scheduler templates → parse → analyze + inspect (`show/info/calc/save/compare`).
Sample data: `notebooks/data/water3.xyz` (water trimer). Outputs will be written under `notebooks/data/demo_en/` by default.

## 1) Setup & Configuration
- Install with extras: `python -m pip install -e ".[analysis,cli]"` (the quotes keep shells such as zsh from expanding the brackets).
- Settings precedence: env → `~/.config/mbe-tools/config.toml` → `./mbe.toml` → explicit `load_settings(path=...)`.
- Data lives in `notebooks/data/`; this tutorial writes to `notebooks/data/demo_en/`.
- Leave `DO_RUN=False` (defined in section 2) to only print the commands; set it to `True` to actually run them.

In [6]:
import sys, pathlib, importlib

# Demo paths relative to repo root
BASE_DIR = pathlib.Path.cwd().parent
DATA_ROOT = BASE_DIR / "notebooks" / "data"
DEMO_ROOT = DATA_ROOT / "demo_en"
DEMO_ROOT.mkdir(parents=True, exist_ok=True)
print("Data root:", DATA_ROOT)
print("Demo output dir:", DEMO_ROOT)

print("Python", sys.version.split()[0])
for mod in ["numpy", "pandas", "matplotlib", "seaborn", "mbe_tools"]:
    try:
        m = importlib.import_module(mod)
        ver = getattr(m, "__version__", "(no __version__)")
        print(f"{mod}: {ver}")
    except ImportError:
        print(f"{mod} not installed")

Data root: /Users/jiarui/Downloads/mbe-tools/notebooks/data
Demo output dir: /Users/jiarui/Downloads/mbe-tools/notebooks/data/demo_en
Python 3.12.4
numpy: 2.4.1
pandas: 2.3.3
matplotlib: 3.10.8
seaborn not installed
mbe_tools: 0.1.0


## 2) CLI: Fragment & Sample (water heuristic)
- Command: `mbe fragment <xyz> --n --seed --mode spatial --require-ion --prefer-special --out-xyz`
- Using sample `notebooks/data/water3.xyz`. Outputs go to `DEMO_ROOT`.

In [7]:
import subprocess, shutil

DO_RUN = False  # set True to actually run CLI
xyz_path = DATA_ROOT / "water3.xyz"
sampled_xyz = DEMO_ROOT / "water3_sampled.xyz"

cmd_fragment = [
    "mbe", "fragment", str(xyz_path),
    "--n", "2", "--seed", "42",
    "--mode", "spatial", "--require-ion", "--prefer-special",
    "--out-xyz", str(sampled_xyz),
]
print("CLI fragment:", " ".join(cmd_fragment))
if DO_RUN:
    if not shutil.which("mbe"):
        raise SystemExit("Install mbe-tools first: python -m pip install -e .[analysis,cli]")
    subprocess.run(cmd_fragment, check=True)
    print("Wrote", sampled_xyz)

CLI fragment: mbe fragment /Users/jiarui/Downloads/mbe-tools/notebooks/data/water3.xyz --n 2 --seed 42 --mode spatial --require-ion --prefer-special --out-xyz /Users/jiarui/Downloads/mbe-tools/notebooks/data/demo_en/water3_sampled.xyz
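
The preview/run pattern above repeats in every CLI cell, and `" ".join(...)` prints paths containing spaces without quoting. A minimal sketch of a reusable helper follows; `run_or_preview` is a hypothetical name used only for illustration and is not part of mbe-tools.

```python
import shlex
import subprocess


def run_or_preview(cmd, do_run=False):
    """Print a shell-safe rendering of cmd; execute it only if do_run is True."""
    # shlex.join quotes arguments that contain spaces or shell metacharacters,
    # so the printed line can be pasted into a POSIX shell unchanged.
    rendered = shlex.join(cmd)
    print("CLI:", rendered)
    if do_run:
        subprocess.run(cmd, check=True)
    return rendered


# Hypothetical path with a space, to show the quoting.
preview = run_or_preview(["mbe", "fragment", "my data/water3.xyz", "--n", "2"])
```

With `do_run=False` nothing is executed, so the helper is safe to call even when the `mbe` executable is not installed.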


## 3) CLI: Generate MBE subset geometries (up to K)
- Command: `mbe gen <xyz> --backend --max-order --cp --out-dir`
- Writes `.geom` files under `DEMO_ROOT/mbe_geoms_cli/`.

In [None]:
geom_dir_cli = DEMO_ROOT / "mbe_geoms_cli"
cmd_gen = [
    "mbe", "gen", str(sampled_xyz),
    "--backend", "qchem",
    "--max-order", "2",
    "--cp",
    "--out-dir", str(geom_dir_cli),
]
print("CLI gen:", " ".join(cmd_gen))
if DO_RUN:
    geom_dir_cli.mkdir(exist_ok=True)
    subprocess.run(cmd_gen, check=True)

## 4) CLI: Render Q-Chem / ORCA inputs
- Command: `mbe build-input <geom> --backend qchem|orca --method --basis --out`.
- Demonstrates extra REM knobs for Q-Chem.

In [None]:
geom_example = geom_dir_cli / "qchem_k2_f0-1.geom"
cmd_build_qchem = [
    "mbe", "build-input", str(geom_example),
    "--backend", "qchem",
    "--method", "wb97m-v", "--basis", "def2-svpd",
    "--out", str(DEMO_ROOT / "qchem.inp"),
    "--thresh", "14", "--scf-convergence", "8",
]
cmd_build_orca = [
    "mbe", "build-input", str(geom_example),
    "--backend", "orca",
    "--method", "wb97m-v", "--basis", "def2-svpd",
    "--out", str(DEMO_ROOT / "orca.inp"),
]
print("CLI build qchem:", " ".join(cmd_build_qchem))
print("CLI build orca :", " ".join(cmd_build_orca))
if DO_RUN:
    subprocess.run(cmd_build_qchem, check=True)
    subprocess.run(cmd_build_orca, check=True)

## 5) CLI: Scheduler templates (PBS/Slurm) + run-control
- Command: `mbe template --scheduler pbs|slurm --backend ... --chunk-size ... --out ...`
- Run-control: optional `mbe.control.toml` colocated with inputs.

In [None]:
pbs_path = DEMO_ROOT / "qchem.pbs"
slurm_path = DEMO_ROOT / "orca.sbatch"
cmd_pbs = ["mbe", "template", "--scheduler", "pbs", "--backend", "qchem", "--job-name", "mbe-qchem", "--chunk-size", "10", "--out", str(pbs_path)]
cmd_slurm = ["mbe", "template", "--scheduler", "slurm", "--backend", "orca", "--job-name", "mbe-orca", "--chunk-size", "5", "--out", str(slurm_path)]
print("CLI template pbs  :", " ".join(cmd_pbs))
print("CLI template slurm:", " ".join(cmd_slurm))
print("Optional run-control: place mbe.control.toml next to inputs")
if DO_RUN:
    subprocess.run(cmd_pbs, check=True)
    subprocess.run(cmd_slurm, check=True)
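
The tutorial does not spell out what `--chunk-size` does inside the generated script. Assuming it is the number of inputs grouped into one batch, the grouping can be sketched as follows; `chunked` is a hypothetical helper and the file names are made up.

```python
def chunked(items, chunk_size):
    """Split items into consecutive lists of at most chunk_size elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# 23 made-up input files grouped ten at a time.
inputs = [f"job_{i:03d}.inp" for i in range(23)]
batches = chunked(inputs, 10)
print([len(b) for b in batches])
```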

## 6) CLI: Parse outputs → JSONL
- Command: `mbe parse <root> --program auto --glob-pattern "*.out" --out parsed.jsonl`
- Auto-detects the program and metadata. Cluster geometry is embedded from singleton outputs with ghost atoms dropped; if singleton metadata is missing, the first parsable geometry is used instead.
- Default JSONL selection used by other commands: run.jsonl → parsed.jsonl → single *.jsonl → newest *.jsonl (so you can often omit the path).
- Place real `.out` files under `DEMO_ROOT/Output/`.

In [None]:
output_root = DEMO_ROOT / "Output"
parsed_jsonl = DEMO_ROOT / "parsed.jsonl"
cmd_parse = ["mbe", "parse", str(output_root), "--program", "auto", "--glob-pattern", "*.out", "--out", str(parsed_jsonl)]
print("CLI parse:", " ".join(cmd_parse))
if DO_RUN:
    output_root.mkdir(exist_ok=True)
    subprocess.run(cmd_parse, check=True)
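
The default JSONL selection order listed above (run.jsonl → parsed.jsonl → single *.jsonl → newest *.jsonl) can be sketched in a few lines. This is an illustration of the stated order only, with "newest" assumed to mean latest modification time; `pick_default_jsonl` is a hypothetical name, not the library's function.

```python
import os
import tempfile
from pathlib import Path


def pick_default_jsonl(directory):
    """Return the JSONL path chosen by the fallback order, or None."""
    directory = Path(directory)
    for name in ("run.jsonl", "parsed.jsonl"):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    others = sorted(directory.glob("*.jsonl"))
    if not others:
        return None
    if len(others) == 1:
        return others[0]
    # Several candidates: take the most recently modified one (assumption).
    return max(others, key=lambda p: p.stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.jsonl").write_text("{}\n")
    only_one = pick_default_jsonl(root).name
    (root / "b.jsonl").write_text("{}\n")
    os.utime(root / "a.jsonl", (1_000_000_000, 1_000_000_000))  # make a.jsonl older
    newest = pick_default_jsonl(root).name
    (root / "parsed.jsonl").write_text("{}\n")
    with_parsed = pick_default_jsonl(root).name
    (root / "run.jsonl").write_text("{}\n")
    with_run = pick_default_jsonl(root).name
```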

## 7) CLI: Analyze JSONL → CSV/Excel/plot
- Command: `mbe analyze parsed.jsonl --to-csv --to-xlsx --plot`.

In [None]:
csv_path = DEMO_ROOT / "results.csv"
xlsx_path = DEMO_ROOT / "results.xlsx"
plot_path = DEMO_ROOT / "mbe.png"
cmd_analyze = ["mbe", "analyze", str(parsed_jsonl), "--to-csv", str(csv_path), "--to-xlsx", str(xlsx_path), "--plot", str(plot_path)]
print("CLI analyze:", " ".join(cmd_analyze))
if DO_RUN:
    subprocess.run(cmd_analyze, check=True)

## 7b) CLI: Inspect JSONL (show/info/calc/save/compare)
- `mbe show` and `mbe info` reuse default JSONL selection (so you can omit the path).
- `mbe calc` adds CPU totals + MBE energies; `--unit` supports hartree/kcal/kj; `--monomer N` reports monomer energy.
- `mbe save` archives JSONL under cluster_id/timestamp; `mbe compare` scans a dir/glob of JSONL files.

In [None]:
import subprocess

archive_dir = DEMO_ROOT / "runs"
cmd_show = ["mbe", "show", str(parsed_jsonl)]
cmd_info = ["mbe", "info"]  # uses default JSONL selection
cmd_calc = ["mbe", "calc", str(parsed_jsonl), "--unit", "kcal", "--monomer", "0"]
cmd_save = ["mbe", "save", str(parsed_jsonl), "--dest", str(archive_dir)]
cmd_compare = ["mbe", "compare", str(archive_dir)]

for label, cmd in [("show", cmd_show), ("info", cmd_info), ("calc", cmd_calc), ("save", cmd_save), ("compare", cmd_compare)]:
    print(f"CLI {label}: {' '.join(cmd)}")

if DO_RUN:
    if not parsed_jsonl.exists():
        raise SystemExit("parsed.jsonl missing; run parse first")
    archive_dir.mkdir(exist_ok=True)
    subprocess.run(cmd_show, check=True)
    subprocess.run(cmd_info, check=True)
    subprocess.run(cmd_calc, check=True)
    subprocess.run(cmd_save, check=True)
    subprocess.run(cmd_compare, check=True)
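
For reference, the three `--unit` choices differ only by a constant factor. The sketch below uses the standard conversion factors and assumes that `kcal` and `kj` mean kcal/mol and kJ/mol; `convert_energy` is an illustrative helper, not the function `mbe calc` uses.

```python
# Conversion factors from hartree (per-mole units assumed for kcal and kj).
HARTREE_TO = {
    "hartree": 1.0,
    "kcal": 627.509474,   # kcal/mol per hartree
    "kj": 2625.499639,    # kJ/mol per hartree
}


def convert_energy(value_hartree, unit="hartree"):
    """Convert an energy in hartree to the requested unit."""
    if unit not in HARTREE_TO:
        raise ValueError(f"unknown unit: {unit!r}")
    return value_hartree * HARTREE_TO[unit]


# A two-body correction of -0.15 hartree expressed in kcal/mol.
two_body_kcal = convert_energy(-0.15, "kcal")
print(round(two_body_kcal, 3))
```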

## 8) Python: Load XYZ and fragment (water vs connectivity)
Compare fragment counts on the same sample cluster.

In [None]:
from mbe_tools.cluster import read_xyz, fragment_by_water_heuristic, fragment_by_connectivity

xyz = read_xyz(str(xyz_path))
frags_water = fragment_by_water_heuristic(xyz, oh_cutoff=1.25)
frags_conn = fragment_by_connectivity(xyz, scale=1.2)
print("water heuristic fragments:", len(frags_water))
print("connectivity fragments  :", len(frags_conn))
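
As intuition for what an O-H cutoff such as `oh_cutoff=1.25` controls, here is a guess at how a water heuristic might group atoms: each hydrogen is attached to its nearest oxygen if that oxygen lies within the cutoff (in Å). This is not the mbe-tools implementation; `group_waters` and the coordinates are made up for illustration.

```python
import math


def group_waters(atoms, oh_cutoff=1.25):
    """Group atoms into fragments: each O plus the H atoms within oh_cutoff.

    atoms is a list of (symbol, x, y, z) tuples; returns lists of atom indices.
    Hydrogens with no oxygen inside the cutoff are left unassigned.
    """
    oxygens = [i for i, atom in enumerate(atoms) if atom[0] == "O"]
    fragments = {o: [o] for o in oxygens}
    for i, (symbol, *xyz) in enumerate(atoms):
        if symbol != "H" or not oxygens:
            continue
        # Attach the hydrogen to its nearest oxygen, if close enough.
        nearest = min(oxygens, key=lambda o: math.dist(xyz, atoms[o][1:]))
        if math.dist(xyz, atoms[nearest][1:]) <= oh_cutoff:
            fragments[nearest].append(i)
    return list(fragments.values())


# Two made-up water molecules, 3 Å apart along x.
demo_atoms = [
    ("O", 0.00, 0.00, 0.0), ("H", 0.96, 0.00, 0.0), ("H", -0.24, 0.93, 0.0),
    ("O", 3.00, 0.00, 0.0), ("H", 3.96, 0.00, 0.0), ("H", 2.76, 0.93, 0.0),
]
demo_fragments = group_waters(demo_atoms)
print(demo_fragments)
```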

## 9) Python: Sample fragments deterministically
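Use a fixed seed so the selection is reproducible; `require_ion=False` here because the water trimer contains no ion. The sampled fragments are written to an XYZ file.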

In [None]:
from mbe_tools.cluster import sample_fragments, write_xyz
sampled_py = sample_fragments(frags_conn, n=min(2, len(frags_conn)), seed=42, require_ion=False)
sampled_py_path = DEMO_ROOT / "sampled_py.xyz"
write_xyz(str(sampled_py_path), sampled_py, comment="sampled via Python API")
print("Sampled fragments written to", sampled_py_path)
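
Why a seed makes sampling deterministic can be shown with the standard library alone. The sketch below is not how `sample_fragments` is implemented; it only illustrates the idea of a private, seeded random generator, and `sample_indices` is a hypothetical helper.

```python
import random


def sample_indices(n_items, n, seed):
    """Pick n distinct indices from range(n_items), reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return sorted(rng.sample(range(n_items), n))


first = sample_indices(10, 3, seed=42)
second = sample_indices(10, 3, seed=42)
print(first == second)
```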

## 10) Python: Generate MBE subsets with CP
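Build `MBEParams` (library defaults here) and iterate over `generate_subsets_xyz`, which yields `(job_id, subset_indices, geom_text)` tuples; each geometry is written to a `.geom` file.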

In [None]:
from mbe_tools.mbe import MBEParams, generate_subsets_xyz
mbe_params = MBEParams()
geom_dir_py = DEMO_ROOT / "mbe_geoms_py"
geom_dir_py.mkdir(exist_ok=True)
subset_jobs = list(generate_subsets_xyz(sampled_py, mbe_params))
print("total subset jobs:", len(subset_jobs))
if subset_jobs:
    job_id, subset_indices, geom_text = subset_jobs[0]
    print("example job_id", job_id, "subset_indices", subset_indices)
for job_id, subset_indices, geom_text in subset_jobs:
    (geom_dir_py / f"{job_id}.geom").write_text(geom_text)
print("geom files written to", geom_dir_py)

## 11) Python: Render Q-Chem / ORCA inputs (string-based)
Read a `.geom`, render inputs with method/basis and REM extras.

In [None]:
from pathlib import Path
from mbe_tools.input_builder import render_qchem_input, render_orca_input

geom_pick = next(iter(geom_dir_py.glob("*.geom")), None)
if geom_pick is None:
    raise SystemExit("No .geom found; run previous cell first")
geom_text = Path(geom_pick).read_text()

qchem_inp = render_qchem_input(
    geom_text,
    method="wb97m-v",
    basis="def2-svpd",
    charge=0,
    multiplicity=1,
    thresh=14,
    scf_convergence="8",
    rem_extra=None,
)
orca_inp = render_orca_input(
    geom_text,
    method="wb97m-v",
    basis="def2-svpd",
    charge=0,
    multiplicity=1,
    grid="Grid5",
)
(DEMO_ROOT / "input_qchem.inp").write_text(qchem_inp)
(DEMO_ROOT / "input_orca.inp").write_text(orca_inp)
print("Wrote qchem/orca inputs to", DEMO_ROOT)

## 12) Python: Scheduler templates programmatically
Render PBS/Slurm scripts with chunk_size and resources.

In [None]:
from mbe_tools.hpc_templates import render_pbs_qchem, render_slurm_orca

pbs_text = render_pbs_qchem(
    job_name="mbe-qchem",
    walltime="24:00:00",
    ncpus=32,
    mem_gb=64.0,
    queue="normal",
    project="proj123",
    module="qchem/5.2.2",
    input_glob=str(DEMO_ROOT / "input_qchem*.inp"),
    chunk_size=10,
)
slurm_text = render_slurm_orca(
    job_name="mbe-orca",
    walltime="24:00:00",
    ntasks=1,
    cpus_per_task=32,
    mem_gb=64.0,
    partition="work",
    account="proj123",
    module="orca/5.0.3",
    input_glob=str(DEMO_ROOT / "input_orca*.inp"),
    chunk_size=5,
)
(DEMO_ROOT / "pbs_api.pbs").write_text(pbs_text)
(DEMO_ROOT / "slurm_api.sbatch").write_text(slurm_text)
print("Scheduler scripts written to", DEMO_ROOT)

## 13) Python: Parse outputs with metadata inference
Parse `.out` files (auto-detect program) and write JSONL.

In [None]:
import json
from mbe_tools.parsers.io import glob_paths, parse_files

out_paths = glob_paths(str(output_root), "*.out") if output_root.exists() else []
print("found outputs:", len(out_paths))
if out_paths:
    records = parse_files(out_paths, program="auto", infer_metadata=True)
    parsed_jsonl_py = DEMO_ROOT / "parsed_api.jsonl"
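    # ensure_ascii=False may emit non-ASCII text, so fix the file encoding explicitly
    with open(parsed_jsonl_py, "w", encoding="utf-8") as f: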
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    print("Wrote", parsed_jsonl_py)
else:
    print("Place .out files in", output_root)
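
Reading the JSONL back is the mirror image of the loop above: one JSON object per line. A minimal sketch follows, with `read_jsonl` as a hypothetical helper and made-up records used only for the round trip.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Round trip with made-up records.
sample = [
    {"subset_indices": [0], "energy_hartree": -1.0},
    {"subset_indices": [0, 1], "energy_hartree": -2.25},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    loaded = read_jsonl(demo_path)
```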

## 14) Python: Assemble MBE energies and tabulate
Use `assemble_mbe_energy` + `order_totals_as_rows` (works with real parsed records or mock data).

In [None]:
from mbe_tools.mbe_math import assemble_mbe_energy, order_totals_as_rows

records_for_energy = [
    {"subset_indices": [0], "energy_hartree": -1.0},
    {"subset_indices": [1], "energy_hartree": -1.1},
    {"subset_indices": [0, 1], "energy_hartree": -2.25},
]
res = assemble_mbe_energy(records_for_energy, max_order=2)
rows = order_totals_as_rows(res["order_totals"])
print("order_totals:", res["order_totals"])
print("rows:", rows)
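
The arithmetic behind the mock data can be checked by hand: the two-body correction is E(0,1) - E(0) - E(1) = -2.25 - (-1.0) - (-1.1) = -0.15 hartree. The sketch below applies the standard many-body expansion recursion to the same numbers. It is an independent illustration, not the body of `assemble_mbe_energy`, and it assumes cumulative totals per order, which may differ from the layout of `res["order_totals"]`.

```python
from itertools import combinations


def mbe_totals(energies, max_order):
    """Return (delta, totals) for a many-body expansion.

    energies maps a sorted tuple of fragment indices to that subset's total
    energy; every sub-subset of each key must also be present.
    """
    delta = {}
    for subset in sorted(energies, key=len):
        # Subtract the corrections of every proper, non-empty sub-subset.
        lower = sum(
            delta[t]
            for k in range(1, len(subset))
            for t in combinations(subset, k)
        )
        delta[subset] = energies[subset] - lower
    totals, running = {}, 0.0
    for k in range(1, max_order + 1):
        running += sum(d for s, d in delta.items() if len(s) == k)
        totals[k] = running  # cumulative energy through order k
    return delta, totals


mock = {(0,): -1.0, (1,): -1.1, (0, 1): -2.25}
delta, totals = mbe_totals(mock, max_order=2)
print(delta[(0, 1)], totals)
```

With only two fragments the order-2 total equals the dimer energy itself, which makes the mock data a convenient sanity check.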

## 15) End-to-end quick run (CLI flow)
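1) CLI flow: place real outputs under `DEMO_ROOT/Output/`, set `DO_RUN=True`, and run sections 2–7b (fragment through parse/analyze, then show/info/calc/save/compare).
2) Python-only flow: run sections 8–14. The `mbe` executable is not needed, but the path variables defined in the cells of sections 1, 2, and 6 (`DEMO_ROOT`, `xyz_path`, `output_root`) must already exist.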
3) All artifacts are under `notebooks/data/demo_en/`.