v0.4.0 — DF-MP2 end-to-end validation + NKI energy kernel
Highlights
- Real-molecule DF-MP2 validation against PySCF (#11): trnblas matches PySCF's own `mp.dfmp2.DFMP2` reference to nanohartree precision on H2O/STO-3G, H2O/cc-pvdz, CH4/cc-pvdz, NH3/cc-pvdz. New `pip install trnblas[pyscf]` extra, runnable `examples/df_mp2_pyscf.py` demo.
- Fused MP2 energy-reduction NKI kernel (#15, Phase 1): `trnblas.nki.nki_mp2_energy` with partition-dim sub-tiling. Validated on trn1 across `nvir ∈ {8, 16, 64, 256, 448}`. Scaffold landed; further perf work tracked under #15.
- DF-MP2 step-4 collapse (#14): the energy reduction now uses one chunked GEMM instead of `nocc²` sequential batched dispatches, via the algebraic identity `T_full = X @ X.T`. On `trn1.2xlarge`:

  | Shape | Flops | Cold | Warm | TFLOPS |
  | --- | --- | --- | --- | --- |
  | small (128/16/384) | 3.4 G | 0.025s | 0.008s | 0.43 |
  | medium (512/64/1536) | 2757 G | 12.9s | 9.77s | 0.28 |
  | large (768/96/2304) | 20352 G | 65.9s | 62.8s | 0.32 |

- Trainium CI infrastructure: Terraform module for a persistent trn1 test instance, SSM-driven runners (`scripts/run_neuron_tests.sh`, `scripts/run_df_mp2_bench.sh`), docs at `docs/aws_setup.md`.
- NKI GEMM kernel wired to real `nisa.nc_matmul` with stationary tile reuse + HBM padding for arbitrary shapes. `nki_batched_gemm` dispatches per-slice through the cached kernel. 17/17 hardware tests pass.
- Repository transfer: now at `trnsci/trnblas`. Docs at https://trnsci.dev/trnblas/.
- `neuronxcc` floor bumped `>=2.15` → `>=2.24` (NKI 2.24+ `nc_matmul` calling convention) to unify with the rest of the trnsci suite.
See the full CHANGELOG for details.
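
For readers who want to see the idea behind the step-4 collapse, here is a minimal NumPy sketch of the `T_full = X @ X.T` identity. It is not the trnblas implementation: the function names, the `(nocc, nvir, naux)` layout of the DF tensor `B`, and the orbital-energy arrays `e_occ` / `e_vir` are assumptions made for illustration. It uses the standard closed-shell MP2 energy expression and omits the chunking that a real run would need to keep `T_full` within memory.

```python
import numpy as np


def mp2_energy_pairwise(B, e_occ, e_vir):
    """Reference: one small GEMM per (i, j) pair, i.e. nocc**2 dispatches."""
    nocc = B.shape[0]
    energy = 0.0
    for i in range(nocc):
        for j in range(nocc):
            # T[a, b] = (ia|jb) for this fixed occupied pair.
            T = B[i] @ B[j].T
            # Energy denominator e_i + e_j - e_a - e_b.
            D = e_occ[i] + e_occ[j] - e_vir[:, None] - e_vir[None, :]
            # T.T[a, b] = (ib|ja), the exchange-like term.
            energy += np.sum(T * (2.0 * T - T.T) / D)
    return energy


def mp2_energy_collapsed(B, e_occ, e_vir):
    """Same energy from a single GEMM over the flattened (i, a) index."""
    nocc, nvir, naux = B.shape
    # X has one row per (i, a) pair; X @ X.T yields every (ia|jb) at once.
    X = B.reshape(nocc * nvir, naux)
    T = (X @ X.T).reshape(nocc, nvir, nocc, nvir)  # T[i, a, j, b] = (ia|jb)
    K = T.transpose(0, 3, 2, 1)                    # K[i, a, j, b] = (ib|ja)
    D = (
        e_occ[:, None, None, None]
        - e_vir[None, :, None, None]
        + e_occ[None, None, :, None]
        - e_vir[None, None, None, :]
    )
    return np.sum(T * (2.0 * T - K) / D)
```

Both functions should agree to floating-point tolerance on any input with negative occupied and positive virtual energies. The collapsed form trades `nocc²` small matrix products for one large one, which is the kind of shape a hardware GEMM tends to handle more efficiently.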