v0.4.2 — cuBLAS head-to-head on A10G

@scttfrdmn released this 13 Apr 03:10
· 81 commits to main since this release

Patch release. First cross-vendor DF-MP2 numbers published — closes #4.

Highlights

| Shape | trn1 warm | A10G warm | A10G vs trn1 |
|---|---|---|---|
| small (128/16/384) | 0.008s | 0.001s | |
| medium (512/64/1536) | 9.77s | 0.266s | 37× |
| large (768/96/2304) | 62.8s | 2.018s | 31× |
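
As a quick check on the last column, the speedup is the trn1 warm time divided by the A10G warm time. The sketch below recomputes it from the figures in the table; the dictionary layout and the `speedup` helper are illustrative and not part of the repository. The small shape is left out because the table reports no ratio for it.

```python
# Warm wall-clock times in seconds, copied from the table above.
timings = {
    "medium (512/64/1536)": {"trn1": 9.77, "a10g": 0.266},
    "large (768/96/2304)": {"trn1": 62.8, "a10g": 2.018},
}


def speedup(row):
    """Ratio of trn1 warm time to A10G warm time (illustrative helper)."""
    return row["trn1"] / row["a10g"]


speedups = {shape: speedup(row) for shape, row in timings.items()}

for shape, ratio in speedups.items():
    # Rounded to the nearest integer, as in the table.
    print(f"{shape}: {ratio:.0f}x")
```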

Energies agree across platforms to within fp32 reduction-order noise: the medium and large shapes differ only in the last ULP.
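
One way to make a "last ULP" claim checkable is to measure the distance between two fp32 values in units in the last place. The sketch below does this with the standard library only; `f32_ordered` and `ulp_distance` are hypothetical helper names, and the sample values are placeholders, not energies from this release.

```python
import struct


def f32_ordered(x):
    """Map a float, rounded to fp32, onto an integer that preserves ordering."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # fp32 is sign-magnitude: negative values are mirrored so that
    # adjacent floats always map to adjacent integers.
    return bits if bits < 0x80000000 else 0x80000000 - bits


def ulp_distance(a, b):
    """Number of representable fp32 values separating a and b."""
    return abs(f32_ordered(a) - f32_ordered(b))


# Placeholder values: -0.5 and its fp32 neighbour one step further from zero.
reference = -0.5
candidate = -0.5 - 2.0 ** -24

print(ulp_distance(reference, reference))  # identical values
print(ulp_distance(reference, candidate))  # adjacent fp32 values
```

A cross-platform check under this scheme would accept a distance of 0 or 1 and flag anything larger.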

What landed

  • infra/terraform-cuda/ — provisions a single-A10G g5.xlarge CI instance vintage-matched to trn1 (GA102 Ampere 2021 vs Trainium1 2022).
  • scripts/run_cuda_bench.sh — SSM runner mirroring the Trainium one.
  • examples/df_mp2.py --device cuda — moves inputs to GPU HBM; existing CPU path unchanged.
  • Honest cross-vendor table in docs/benchmarks.md with the vintage-parity rationale and the "close the gap" target for v0.5.0+ NKI kernel work.
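
The `--device cuda` behaviour described above can be sketched as an opt-in flag that defaults to the CPU path. This is an assumption about the shape of the interface, not the actual contents of examples/df_mp2.py: `parse_args` and `place_inputs` are hypothetical names, and the inputs are assumed to expose a `.to(device)` method the way `torch.Tensor` does.

```python
import argparse


def parse_args(argv=None):
    """Parse a --device flag; the default keeps the CPU path unchanged."""
    parser = argparse.ArgumentParser(description="DF-MP2 example (illustrative)")
    parser.add_argument(
        "--device",
        choices=["cpu", "cuda"],
        default="cpu",
        help="where to place the input tensors",
    )
    return parser.parse_args(argv)


def place_inputs(tensors, device):
    """Move inputs to the requested device; leave them untouched on CPU."""
    if device == "cpu":
        return list(tensors)
    # Assumes each input has a torch.Tensor-style .to(device) method.
    return [t.to(device) for t in tensors]
```

Keeping `cpu` as the default means existing invocations behave exactly as before, and only callers that pass `--device cuda` take the GPU path.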

What this tells us

Raw cuBLAS on 2021-vintage Ampere runs 31× to 37× faster than the current trnblas path on trn1 at medium/large DF-MP2 shapes. Trainium's Tensor Engine is under-utilized in the current pipeline; closing that gap is what #15, #18 (syrk), and #19 (trsm) target in v0.5.0.

See the full CHANGELOG.


⚠️ Erratum (v0.4.3): The "trn1 vs A10G" table in this release compared the A10G's Ampere GPU against trn1's Xeon CPU, not its Tensor Engine. A PATH misconfiguration caused the NKI path to fall back silently to torch.matmul. Fixed in v0.4.3. Re-measured numbers are on the benchmarks page.
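
A silent fallback of this kind can be guarded against by making the benchmark fail loudly when the requested backend is missing, and by recording the backend that actually ran next to each timing. The sketch below illustrates the idea only; it is not the fix shipped in v0.4.3, and `select_backend`, `run_benchmark`, and the backend names are hypothetical.

```python
import warnings


def select_backend(requested, available, allow_fallback=False):
    """Return the backend to use, refusing to fall back unless asked to."""
    if requested in available:
        return requested
    message = f"backend {requested!r} unavailable; available: {sorted(available)}"
    if not allow_fallback:
        # Benchmarks should stop here rather than time the wrong device.
        raise RuntimeError(message)
    warnings.warn(message + "; falling back to 'cpu'", RuntimeWarning)
    return "cpu"


def run_benchmark(requested, available, allow_fallback=False):
    """Record both the requested and the actual backend with the result."""
    backend = select_backend(requested, available, allow_fallback)
    return {"requested": requested, "backend": backend}
```

With `allow_fallback` left at its default, a missing accelerator path raises instead of producing CPU timings under an accelerator label.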