v0.72.0-dev20260601
Pre-release
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not the versions on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/26751550008
📦 Uncategorized
- Re-enable some DiT model tests
- PR: #45433
- [skip ci] Remove @nardoTT from CODEOWNERS
- PR: #45451
- #45440: Skip docs deploy for pre-release tags
- PR: #45442
- [Cleanup] Submerge temporary quasar API headers to impl/
- PR: #45289
- [skip ci] Revert stale SDPA disable for blackhole 3-chunk 4x768 case
- PR: #45449
- vLLM fixes
- PR: #45284
- [Fix] Migrate deprecated get_noc_addr callsites in embedding and rms kernels
- PR: #45318
- [Quasar] Add binary SFPU Div - Compute Side
- PR: #43988
- [Quasar] exp bug fix
- PR: #45459
- #42240: Remove dead transpose_utils helpers superseded by #45071
- PR: #43871
- #45394: Uplift multigammaln to use the new lgamma device operation
- PR: #45395
- [gpt-oss] skip fused MoE weight prep when cache files exist
- PR: #45438
- Take input and output shape into account for concat program cache
- PR: #45144
- [skip ci] Disable consistently failing tests in vllm-nightly-tests
- PR: #45313
- Update ttsim version to 1.7.0
- PR: #45297
- fix: `ttnn.to_layout(ttnn.TILE_LAYOUT)` hangs on `HEIGHT_SHARDED`
- PR: #44625
- [tt-train] Split K Gram matmul implementation
- PR: #40928
- [tt-train] Migrate ttml models to fused SwiGLU
- PR: #45335
- #45441: Sync published_versions.json to match gh-pages reality
- PR: #45457
- #42370 [Descriptor Migration] copy/typecast
- PR: #44257
- point_to_point: fix cache-hit stale-buffer bug from optional-output aliasing (#45422)
- PR: #45429
- Migrate deepseek_moe_fast_reduce_nc_fused + all_gather_via_broadcast to ProgramDescriptor (Contract-2)
- PR: #45076
- Removes MatmulMultiCoreReuseProgramFactory
- PR: #44587
- ring joint sdpa: bump 8k perf baseline 65.2 -> 65.6
- PR: #45527
- #45444: update kernel_lib to compile on Quasar
- PR: #45445
- Bug fix: Fix DeepSeek matmul expert LLK assert
- PR: #44850
- Ring joint SDPA: chunked-prefill accuracy, determinism, and per-chunk perf tests
- PR: #45525
- Adjust, reorder compiler options
- PR: #45348
- Fix failing CCL pytests in T3K e2e pipeline
- PR: #45434
- [Bug fix] Bump WH BRISC firmware size to 6KB+128 to fit profiler
- PR: #45505
- [#40012] Support In-Process Topology Updates after Link Recovery
- PR: #42866
- #44966: Add pre-publish RTL Sim CI gate to package-and-release.yaml
- PR: #45145
- [test] Relax TestSimpleL1Read tolerance from 0.05% to 0.5% to fix flaky CI
- PR: #45624
- Replace deprecated API calls in kernels and docs
- PR: #45316
- #43444 MoE: Full fused op on BlackHole
- PR: #45294
- [MeshTensor Integration] Matmul
- PR: #44220
- [Feature]: Implement Dispatch Telemetry
- PR: #44786
- [skip ci] Estimated number of machine hours per week per sku
- PR: #45615
- [Bug fix] Update ttnn.std/var and layernorm ops for FP32 precision (UnpackToDestFp32)
- PR: #45319
- Topology Mapper: Physical Grouping improved matching for heterogeneous groups
- PR: #43836
- Feature: Quasar SFPU binary add_int and metal test
- PR: #45028
- [emule] Uplift Fixes
- PR: #45568
- Move bit 11 from unpacker to math thread and fix 32bit bcast (WH)
- PR: #44412
- Feature: Quasar SFPU binary mul_int compute api and metal test
- PR: #45325
- bumping exalens to 0.3.20
- PR: #45640
- remove first token assert from prefill runner
- PR: #45638
- ci: reduce number of threads for Quasar compile job from 35 to 20
- PR: #45632
- [skip ci] Disable IntermeshSplit2x2FabricFixture.* in t3000-unit-tests t3k_tt_metal_multiprocess_tests
- PR: #45492
- [skip ci] Disable MeshDeviceFixture.Top32RmDevPipelineCompletes in tt-metal-l2-nightly llk-sd-unit-tests
- PR: #45484
- [skip ci] Disable swin_s test_e2e_performant / test_e2e_performant_dp in perf-models (TT_FATAL: trace buffer size exceeds trace_region_size)
- PR: #45529
- ci: increase verbosity level of TT_LOGGER for CI jobs
- PR: #45634
- [skip ci] Disable test_all_to_all_combine_no_trace in blackhole-e2e-tests ccl nightly tests
- PR: #45494
- [bug fix] DEVICE_PRINT was using the wrong field name for Quasar
- PR: #45575
- [test] Disable flaky TestSimpleL1Read timing assertion (#45624 followup)
- PR: #45647
- [Quasar] SFPU where: compute-side bringup
- PR: #44977
- Add quasar mxfp6 format metal infra.
- PR: #43374
- [skip ci] skip test_var_fp32_doscale_wt_gt_1 due to bit 11 precision regression
- PR: #45644
- Rtawfik/mul float bringup
- PR: #45645
- [Bug fix] moreh.layer_norm not populating rstd when mean=None
- PR: #45168
- fix: correct vector::reserve division error in erisc_datamover_builder
- PR: #44466
- [fabric][jit] Demote verbose log_info to log_trace
- PR: #44438
- ci: raise TTSim C++ tests step timeout from 5 → 10 minutes
- PR: #44375
- ci: move ttsim unit tests from merge-gate to sanity-tests
- PR: #45270
- Add quasar mxfp8 format metal infra.
- PR: #43342
- emule: role-aware L1_SLOT_MASK, per-NOC DRAM bank mapping, mcast include_self
- PR: #45623
- Update Fabric test infra Z link handling
- PR: #45420
- cmake: use CMAKE_COMPILE_WARNING_AS_ERROR to disable -Werror
- PR: #41600
- ci: add CPM source cache to clang-tidy-reusable workflow
- PR: #45672
- Fix DeepSeek MLA SDPA chunk sizes
- PR: #45658
- SDPA: hazard-free local reciprocal + enable chunked determinism tests
- PR: #45659
- [skip ci] Update release pipeline permissions
- PR: #45695
- #41662: Fix FP32 transpose RM sharded numerical regression
- PR: #44726