v0.4.0 — DF-MP2 end-to-end validation + NKI energy kernel

@scttfrdmn scttfrdmn released this 13 Apr 01:43
· 86 commits to main since this release

Highlights

  • Real-molecule DF-MP2 validation against PySCF (#11) — trnblas matches PySCF's own mp.dfmp2.DFMP2 reference to nanohartree precision on H2O/STO-3G, H2O/cc-pvdz, CH4/cc-pvdz, NH3/cc-pvdz. New pip install trnblas[pyscf] extra, runnable examples/df_mp2_pyscf.py demo.

  • Fused MP2 energy-reduction NKI kernel (#15, Phase 1): trnblas.nki.nki_mp2_energy with partition-dim sub-tiling. Validated on trn1 across nvir ∈ {8, 16, 64, 256, 448}. Scaffold landed; further perf work tracked under #15.

  • DF-MP2 step-4 collapse (#14): the energy reduction now runs as one chunked GEMM instead of nocc² sequential batched dispatches, using the algebraic identity T_full = X @ X.T. On trn1.2xlarge:

    Shape                  Flops     Cold     Warm     TFLOPS
    small  (128/16/384)    3.4 G     0.025s   0.008s   0.43
    medium (512/64/1536)   2757 G    12.9s    9.77s    0.28
    large  (768/96/2304)   20352 G   65.9s    62.8s    0.32

  • Trainium CI infrastructure — Terraform module for a persistent trn1 test instance, SSM-driven runners (scripts/run_neuron_tests.sh, scripts/run_df_mp2_bench.sh), docs at docs/aws_setup.md.

  • NKI GEMM kernel wired to real nisa.nc_matmul with stationary tile reuse + HBM padding for arbitrary shapes. nki_batched_gemm dispatches per-slice through the cached kernel. 17/17 hardware tests pass.

  • Repository transfer — now at trnsci/trnblas. Docs at https://trnsci.dev/trnblas/.

  • neuronxcc floor bumped >=2.15 → >=2.24 (NKI 2.24+ nc_matmul calling convention) to unify with the rest of the trnsci suite.
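
The step-4 collapse noted above (T_full = X @ X.T) can be illustrated with a small NumPy sketch. This is not trnblas code: the function names, the (nocc, nvir, naux) layout of the fitted tensor B, the closed-shell energy expression, and the choice to chunk over the occupied index are all assumptions made for illustration. It only shows that stacking the per-occupied blocks into one matrix X yields every (ia|jb) block from a single GEMM per chunk, matching the nocc² per-pair products.

```python
import numpy as np

def mp2_energy_pairwise(B, e_occ, e_vir):
    """Reference path: one small GEMM per occupied pair (i, j), nocc^2 in total."""
    nocc = B.shape[0]
    e_ab = e_vir[:, None] + e_vir[None, :]          # e_a + e_b
    energy = 0.0
    for i in range(nocc):
        for j in range(nocc):
            T = B[i] @ B[j].T                       # T[a, b] = (ia|jb)
            denom = e_occ[i] + e_occ[j] - e_ab
            energy += np.sum(T * (2.0 * T - T.T) / denom)
    return energy

def mp2_energy_collapsed(B, e_occ, e_vir, occ_chunk=2):
    """Collapsed path: rows of T_full = X @ X.T, computed one occupied chunk at a time."""
    nocc, nvir, naux = B.shape
    X = B.reshape(nocc * nvir, naux)                # row index is the compound (i, a)
    d = e_occ[:, None] - e_vir[None, :]             # d[i, a] = e_i - e_a
    energy = 0.0
    for i0 in range(0, nocc, occ_chunk):            # hypothetical chunking over i
        i1 = min(i0 + occ_chunk, nocc)
        T = (X[i0 * nvir:i1 * nvir] @ X.T).reshape(i1 - i0, nvir, nocc, nvir)
        denom = d[i0:i1, :, None, None] + d[None, None, :, :]
        exchange = T.transpose(0, 3, 2, 1)          # (ib|ja) aligned with T[i, a, j, b]
        energy += np.sum(T * (2.0 * T - exchange) / denom)
    return energy

# Random stand-in data (not a real molecule): occupied energies < 0 < virtual energies.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 4, 5))
e_occ = -1.0 - rng.random(3)
e_vir = 1.0 + rng.random(4)
e_pair = mp2_energy_pairwise(B, e_occ, e_vir)
e_full = mp2_energy_collapsed(B, e_occ, e_vir)
print(np.isclose(e_pair, e_full))
```

Chunking over whole occupied indices keeps each chunk reshapeable to (i, a, j, b), so the exchange term is a plain transpose; a production kernel may well chunk differently.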

See the full CHANGELOG for details.
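
The TFLOPS column in the step-4 table appears to be the Flops column divided by the warm time. That reading is inferred from the numbers and is not stated in these notes. A short check under that assumption, with the table values copied in by hand:

```python
# (shape label, flops, warm seconds, reported TFLOPS), copied from the step-4 table
rows = [
    ("small", 3.4e9, 0.008, 0.43),
    ("medium", 2757e9, 9.77, 0.28),
    ("large", 20352e9, 62.8, 0.32),
]

def tflops(flops, seconds):
    """Throughput in TFLOPS for a given flop count and wall time."""
    return flops / seconds / 1e12

for label, flops, warm, reported in rows:
    print(f"{label}: computed {tflops(flops, warm):.3f}, reported {reported}")
```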


⚠️ Erratum (v0.4.3): The "GEMM per-call kernel timing" and "DF-MP2 end-to-end" tables here reported trn1 numbers that were actually measured on a silent torch.matmul fallback running on the trn1 host's Xeon CPU, not on NKI kernels running on the Tensor Engine. Fixed in v0.4.3.