Release v0.4.2 — cuBLAS head-to-head on A10G · trnsci/trnblas

Patch release. First cross-vendor DF-MP2 numbers published — closes #4.

Highlights

Shape	trn1 warm	A10G warm	A10G speedup over trn1
small (128/16/384)	0.008s	0.001s	8×
medium (512/64/1536)	9.77s	0.266s	37×
large (768/96/2304)	62.8s	2.018s	31×
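
The speedup column can be reproduced from the two timing columns. The snippet below is a minimal sketch that hard-codes the warm timings from the table; the names `timings` and `speedup` are illustrative and do not come from the trnblas codebase.

```python
# Warm timings in seconds, copied from the table above: (trn1, A10G).
timings = {
    "small": (0.008, 0.001),
    "medium": (9.77, 0.266),
    "large": (62.8, 2.018),
}


def speedup(trn1_s, a10g_s):
    """How many times faster the A10G run is than the trn1 run."""
    return trn1_s / a10g_s


ratios = {shape: speedup(*pair) for shape, pair in timings.items()}

for shape, ratio in ratios.items():
    # The table rounds these ratios to whole numbers.
    print(f"{shape}: {ratio:.1f}x")
```

Rounded to whole numbers, the ratios match the 8×, 37×, and 31× entries in the table.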

Energies agree across platforms: bit-exact except for fp32 reduction-order noise in the last ULP at the medium and large shapes.
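
One way to check a "last ULP" claim is to count how many representable fp32 values separate two results. The sketch below uses only the standard library. The helper names are hypothetical, the sample values are made up for illustration, and this is not necessarily how the release's comparison was done.

```python
import struct


def f32_bits(x):
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ulp_distance(a, b):
    """Count fp32 steps between a and b; 0 means bit-exact (or +0 vs -0)."""

    def ordered(x):
        bits = f32_bits(x)
        magnitude = bits & 0x7FFFFFFF
        # Map sign-magnitude bit patterns onto a monotonic integer line.
        return -magnitude if bits & 0x80000000 else magnitude

    return abs(ordered(a) - ordered(b))


# Illustrative values only, not measured energies.
reference = -0.25
perturbed = -0.25 - 2.0 ** -25  # one fp32 step away from -0.25

print(ulp_distance(reference, reference))
print(ulp_distance(reference, perturbed))
```

A distance of 0 corresponds to a bit-exact match, and a distance of 1 is the kind of difference a changed summation order can produce in fp32.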

What landed

infra/terraform-cuda/ — provisions a single-A10G g5.xlarge CI instance vintage-matched to trn1 (GA102 Ampere 2021 vs Trainium1 2022).
scripts/run_cuda_bench.sh — SSM runner mirroring the Trainium one.
examples/df_mp2.py --device cuda — moves inputs to GPU memory; existing CPU path unchanged.
Honest cross-vendor table in docs/benchmarks.md with the vintage-parity rationale and the "close the gap" target for v0.5.0+ NKI kernel work.
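
The release does not spell out how "warm" timings are taken. A common approach, sketched below as an assumption, is to run the workload once or more untimed and then keep the best of several timed runs. `time_warm` and its parameters are hypothetical names, not part of scripts/run_cuda_bench.sh.

```python
import time


def time_warm(fn, warmup=1, repeats=3, sync=None):
    """Return the best wall-clock time of fn() after untimed warm-up runs.

    sync is an optional callable that blocks until queued device work has
    finished; for a PyTorch CUDA workload, torch.cuda.synchronize could be
    passed here. It is left as None for CPU-only code.
    """
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as compilation or caching
    if sync is not None:
        sync()

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # make sure asynchronous work is included in the timing
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    elapsed = time_warm(lambda: sum(range(100_000)))
    print(f"best warm run: {elapsed:.6f}s")
```

Without the synchronization step, asynchronous GPU launches can return before the work completes, which makes a device look faster than it is.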

What this tells us

Raw cuBLAS on 2021-vintage Ampere is roughly 31× to 37× faster than the current trnblas path on trn1 at medium/large DF-MP2 shapes. This suggests Trainium's Tensor Engine is under-utilized in the current pipeline; closing the gap is what #15, #18 (syrk), and #19 (trsm) target in v0.5.0.

See the full CHANGELOG.

⚠️ Erratum (v0.4.3): The "trn1 vs A10G" table in this release was comparing A10G's Ampere GPU to trn1's Xeon CPU, not its Tensor Engine. A PATH misconfiguration caused silent NKI fallback to torch.matmul. Fixed in v0.4.3. Re-measured numbers are on the benchmarks page.
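
A silent fallback of this kind can be caught by making benchmark runs fail loudly when the accelerated path is unavailable. The sketch below assumes the backend depends on an executable being found on PATH. `select_backend`, `BackendUnavailable`, and the tool name passed in are all hypothetical and do not describe the actual v0.4.3 fix.

```python
import shutil


class BackendUnavailable(RuntimeError):
    """Raised when the accelerated backend cannot be used."""


def select_backend(required_tool, allow_fallback=False):
    """Return "accelerated" if required_tool is on PATH.

    With allow_fallback=False (the setting a benchmark run would want),
    a missing tool raises instead of quietly switching to a slower path.
    """
    if shutil.which(required_tool) is not None:
        return "accelerated"
    if allow_fallback:
        return "fallback"
    raise BackendUnavailable(
        f"{required_tool!r} not found on PATH; refusing to fall back silently"
    )


if __name__ == "__main__":
    # Hypothetical tool name, used only to demonstrate the failure mode.
    try:
        select_backend("some-accelerator-compiler")
    except BackendUnavailable as err:
        print(f"benchmark aborted: {err}")
```

Recording the selected backend next to each timing would also make a mismatch like the one described in the erratum visible in the results table.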
