v0.4.2 — cuBLAS head-to-head on A10G
Patch release. First cross-vendor DF-MP2 numbers published — closes #4.
Highlights
| Shape | trn1 warm | A10G warm | A10G speedup over trn1 |
|---|---|---|---|
| small (128/16/384) | 0.008s | 0.001s | 8× |
| medium (512/64/1536) | 9.77s | 0.266s | 37× |
| large (768/96/2304) | 62.8s | 2.018s | 31× |
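The speedup column is the trn1 warm time divided by the A10G warm time, rounded to the nearest integer. A minimal sketch that reproduces it from the timings in the table; the `timings` dict and the `speedup` helper are illustrative names, not part of the repository:

```python
# Warm wall-clock times in seconds as (trn1, A10G), copied from the table.
timings = {
    "small": (0.008, 0.001),
    "medium": (9.77, 0.266),
    "large": (62.8, 2.018),
}


def speedup(trn1_s: float, a10g_s: float) -> int:
    """Return trn1 time over A10G time, rounded to the nearest integer."""
    return round(trn1_s / a10g_s)


# Hypothetical helper output: one integer ratio per shape.
speedups = {shape: speedup(*pair) for shape, pair in timings.items()}

for shape, ratio in speedups.items():
    print(f"{shape}: {ratio}x")
```

The small shape's 0.001s figure is at the precision limit of the reported timings, so its ratio is the least reliable of the three.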
Energies agree across platforms: bit-exact for the small shape, and within fp32 reduction-order noise in the last ULP for medium/large.
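One way to check a "last ULP" claim is to count how many representable fp32 values separate two results. The sketch below uses only the standard library; `f32_ordered` and `ulp_distance` are hypothetical helpers, and the values in the usage lines are placeholders, not energies measured for this release:

```python
import struct


def f32_ordered(x: float) -> int:
    """Round x to fp32 and map its bit pattern to an integer that sorts like the float."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    magnitude = bits & 0x7FFFFFFF
    # fp32 is sign-magnitude, so negative values map to negated magnitudes.
    return -magnitude if bits & 0x80000000 else magnitude


def ulp_distance(a: float, b: float) -> int:
    """Number of representable fp32 steps between a and b (0 means bit-exact)."""
    return abs(f32_ordered(a) - f32_ordered(b))


# Placeholder values: -1.0 and its nearest fp32 neighbour away from zero.
energy_a = -1.0
energy_b = -(1.0 + 2.0 ** -23)

print(ulp_distance(energy_a, energy_a))  # bit-exact pair
print(ulp_distance(energy_a, energy_b))  # adjacent fp32 values
```

A distance of 0 means the two platforms agree bit for bit; a distance of 1 is the kind of difference that a changed summation order in an fp32 reduction can produce.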
What landed
- `infra/terraform-cuda/`: provisions a single-A10G `g5.xlarge` CI instance, vintage-matched to trn1 (GA102 Ampere, 2021 vs Trainium1, 2022).
- `scripts/run_cuda_bench.sh`: SSM runner mirroring the Trainium one.
- `examples/df_mp2.py --device cuda`: moves inputs to GPU memory; the existing CPU path is unchanged.
- Honest cross-vendor table in `docs/benchmarks.md` with the vintage-parity rationale and the "close the gap" target for v0.5.0+ NKI kernel work.
What this tells us
Raw cuBLAS on 2021-vintage Ampere is 31× to 37× faster than the current trnblas path on trn1 at medium/large DF-MP2 shapes. Trainium's Tensor Engine is under-utilized in the current pipeline; closing that gap is what #15, #18 (syrk), and #19 (trsm) target in v0.5.0.
See the full CHANGELOG.
torch.matmul. Fixed in v0.4.3. Re-measured numbers are on the benchmarks page.