The initial ao_to_mo_transform kernel (shipped in 36564cd) is single-tile only: nbasis ≤ 128, nocc * naux ≤ 512, nvir * naux ≤ 512. Larger shapes raise NotImplementedError.
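The single-tile limits above can be written as a shape guard. This is a minimal sketch, not the shipped code: the function name and the constant names are hypothetical, and only the three bounds and the NotImplementedError come from the description above.

```python
# Hypothetical constants; the values are the single-tile limits quoted above.
MAX_NBASIS = 128
MAX_FREE = 512  # bound on nocc * naux and on nvir * naux


def check_single_tile_shape(nbasis, nocc, nvir, naux):
    """Raise NotImplementedError if the shape does not fit a single tile."""
    if nbasis > MAX_NBASIS:
        raise NotImplementedError(f"nbasis={nbasis} exceeds {MAX_NBASIS}")
    if nocc * naux > MAX_FREE:
        raise NotImplementedError(f"nocc * naux={nocc * naux} exceeds {MAX_FREE}")
    if nvir * naux > MAX_FREE:
        raise NotImplementedError(f"nvir * naux={nvir * naux} exceeds {MAX_FREE}")
```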
Most cc-pVDZ molecules have nbasis ≤ 128, but cc-pVTZ and above exceed it. K-tiling over the μ (and ν, in step 2) axis would extend the kernel to larger systems.
Approach: add a K-tile loop around each nc_matmul, accumulating partial sums in PSUM across tiles. The matmul_kernel already has this pattern for the single-GEMM case; port the structure into the two-step fused kernel.
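The accumulation pattern can be modeled on the host with NumPy. This is a sketch of the arithmetic only, not NKI code: the function name is hypothetical, the accumulator array stands in for PSUM, and it assumes the contraction axis is the leading axis of both operands with a tile size of 128 (taken from the nbasis limit above).

```python
import numpy as np

K_TILE = 128  # assumed contraction-tile size, from the nbasis <= 128 limit


def k_tiled_matmul(a, b, k_tile=K_TILE):
    """Compute a.T @ b by accumulating partial products over K tiles.

    a has shape (K, M) and b has shape (K, N); K is the contraction axis.
    """
    k, m = a.shape
    if b.shape[0] != k:
        raise ValueError("contraction dimensions do not match")
    n = b.shape[1]
    acc = np.zeros((m, n), dtype=np.result_type(a, b))  # stands in for PSUM
    for k0 in range(0, k, k_tile):
        k1 = min(k0 + k_tile, k)  # the last tile may be partial
        acc += a[k0:k1].T @ b[k0:k1]
    return acc
```

A reference check against the untiled product, such as `np.allclose(k_tiled_matmul(a, b), a.T @ b)`, is one way to validate the loop bounds before porting the structure into the fused kernel.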
Out of scope here: optimizing the intermediate's HBM round-trip (the intermediate currently passes through kernel-scratch HBM between the two matmul steps to handle the partition-dim change). A follow-up could explore an in-SBUF transpose primitive if NKI adds one.
Acceptance: ao_to_mo_transform succeeds for nbasis up to at least 256 with nocc * naux and nvir * naux each up to 2048; hardware tests at a cc-pVTZ-representative shape pass.
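For a sense of scale, the tile counts at the acceptance bounds can be worked out as below. This is a rough sketch with a hypothetical helper; it assumes the tile sizes stay at the current single-tile limits (128 along the contraction axis, 512 along the nocc * naux or nvir * naux axis), which the final design may not do.

```python
def tile_counts(nbasis, free_size, k_tile=128, free_tile=512):
    """Return (contraction tiles, free-axis tiles) using ceiling division."""
    k_tiles = (nbasis + k_tile - 1) // k_tile
    free_tiles = (free_size + free_tile - 1) // free_tile
    return k_tiles, free_tiles


# Acceptance bounds: nbasis = 256, nocc * naux (or nvir * naux) = 2048.
print(tile_counts(256, 2048))  # prints (2, 4)
```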