Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/27173121996
- no changes
What's Changed
- Fix conv3d WH pack destination race by @pavlejosipovic in #44079
- [LLK] Fix num_tiles == 1 handling in unpack_tilize wormhole by @halghTT in #44426
- CI: add pytest-log-mode for LLK smoke (ci-quiet in PR Gate) by @ndivnicTT in #44436
- PDL update pcc and perf by @nmilicevicTT in #44506
- Add allowed_worker_cores to matmul program configs by @rmillerTT in #44341
- [skip ci] Install latest ttnn wheel on exabox multihost runner by @dpopovTT in #44508
- [Test] Replace assert_with_pcc in eltwise unary/activation test files by @VirdhatchaniKN in #44373
- Cleanup: PackMode enum for pack tilize/untilize (tt-metal#34587) by @ndivnicTT in #44370
- Add wan2.2-distill (lightx2v), Index-AniSora V3.2 and LoRA adapter pipeline for Wan2.2 I2V. by @tvardhineniTT in #44031
- Optimise and improve accuracy of log (fp32/bf16) by @jasondavies in #44419
- Cleanup: centralize LLK eltwise SFPU helpers by @ndivnicTT in #44262
- Bug fix: Assert NoC write packet tags are cleared between kernels by @jbaumanTT in #44425
- Gemma3-4B Issues Fix by @ign-akshayr in #42534
- ttnn.isclose op for Integer support & LLK migration by @abdulla-TT in #43515
- [skip-ci] ci: disable galaxy-multiprocess tests (hanging ~1 month, tracked in #44339) by @tenstorrent-github-bot in #44340
- SFPI 7.50.0 614 by @nathan-TT in #44526
- ci: upgrade actions/checkout from v4 to v6.0.2 (Node.js 24) by @blozano-tt in #44056
- [skip ci] Add sku for allocating one galaxy in CI by @dpopovTT in #44380
- [Security] Bump torch module by @vtsilytskyiTT in #43782
- Fix bad optional access in CCL matmul paths by @rmillerTT in #44528
- Bump ttsim version to v1.6.1 by @mcraigheadTT in #44536
- Fixing DEVICE_PRINT on Quasar by @tt-vjovanovic in #44381
- [Cleanup] rename MeshTensor::legacy_shard_spec() to shard_spec() by @riverwuTT in #44535
- [cleanup] separate out const vs mutable accessor for MeshTensor by @riverwuTT in #44344
- #0: MoE TP8 Megatron refactor + integrate new matmuls for compressed tensor by @TT-BrianLiu in #44212
- Improve performance of MeshDispatchFixture unit test times by @anashTT in #44319
- [skip ci] [ttnn] fix batch_norm test compatibility with PyTorch 2.11+ by @tenstorrent-github-bot in #44541
- Bump WH resnet50 bs=16 BFP8 LoFi device-perf target by @pavlejosipovic in #44550
- security: bump protobuf 3.20.2 to 4.25.8 by @tenstorrent-github-bot in #43500
- #44555: fix batch norm test by @bbradelTT in #44556
- [wormhole] increase MEM_BRISC_FIRMWARE_SIZE to fix firmware overflow from a556756 by @tenstorrent-github-bot in #44545
- [Patch] Improve run-test tooling and sync Cursor to Claude by @njokovicTT in #43961
- Optimise int32/uint32/uint16 comparison ops by @jasondavies in #44435
- Update ttsim-version to 1.6.1 by @fvranicTT in #44568
- ci: nightly LLVM code coverage for ttnn by @tenstorrent-github-bot in #44574
- ci: Galaxy profiler pipeline time budgets by @tdowdallTT in #44324
- ci: forward build-type to wheels.yaml in build-artifact by @tenstorrent-github-bot in #44578
- [Test]: Enable SD Enable test_prefetcher.cpp by @sidnadTT in #43835
- Fix LoRA fusion in-place mutation corrupting base state dict by @tvardhineniTT in #44543
- [skip ci] ci: nightly LLVM code coverage for ttnn via reusable workflow by @tenstorrent-github-bot in #44576
- tests: tt_metal: Fix fabric microbench in multi-host cases by @p1-0tr in #44266
- ci: fix coverage missing ttnn C++ — collect all .so from build/lib by @tenstorrent-github-bot in #44592
- ci: standardise SKU list across all model test workflows by @handrewsTT in #44549
- [skip ci] ci: use setup-job in aggregate-coverage (fixes missing /opt/venv wheel install) by @tenstorrent-github-bot in #44597
- [Bug fix] #38973: Update gelu_bw (experimental) infra by @KalaivaniMCW in #44338
- #42398 [Descriptor Migration] prefetcher by @azaytsev-epam in #43619
- Fix DEVICE_PRINT linkage handling by @tt-vjovanovic in #44598
- [tt-train] PolyNorm3 FW kernel compute optimizations by @mdragulaTT in #42968
- [tt-train] Fix PolyNorm BW deadlock for non-4 block sizes by @mdragulaTT in #44511
- #44529: fix bad optional access for fused matmul callers by @bbradelTT in #44557
- Fix conv3d garbage output when batch size changes (#44565) by @pavlejosipovic in #44599
- [Test only] Fix nightly softmax_backward by @vtsilytskyiTT in #44523
- ci: Wire AI job/run summary into demo-sp-release by @ppetrovicTT in #43624
- Generalize moe_compute shape support beyond hardcoded model configs by @gajanan-choudhary in #43932
- ci: embed build-type in build artifact name by @tenstorrent-github-bot in #44585
- ci: add kernel ccache stats reporting to ops, tt-train, profiler, t3000 workflows by @tenstorrent-github-bot in #44563
- [tt-train] Fixing optimizer set_state_dict nanobind by @awliu-TT in #43320
- [Quasar] Remove duplicate llk_pack_reduce_mask_config, match with WH/BH function signature by @amokanTT in #44474
- [Feature] TTNN Metal 2.0 factory adapter by @akerteszTT in #44562
- operation : deepseek_moe_fast_reduce_nc_fused by @jungeunlim-TT in #42624
- [Bug fix] filter out kubernetes network interfaces in run_quad_galaxy_tests.sh by @maksim-tsishkouski-epam in #44487
- [Bugfix] Small Metal 2.0 validation fixes / improvements (from 1st op port attempt) by @akerteszTT in #44586
- ct/fabric cluster update by @ctaylorTT in #43899
- [Feature] Implicit DFBAccessor → uint32_t for LLK compatibility by @akerteszTT in #44646
- Feature: Persist H2D/D2H Socket Connector state in SHM across Processes by @tt-asaigal in #43967
- [skip ci] ci: fix aggregate-coverage — remove docker-job hack, use setup-job by @tenstorrent-github-bot in #44655
- Bump ttsim version to v1.6.2 by @mcraigheadTT in #44644
- Performance: Pipeline relay linear prefetcher reads with per-buffer TRIDs by @jbaumanTT in #44222
- Replace prepare_resources hook with MeshDescriptor API for mesh-workload ops by @dgomezTT in #44432
- ci: exclude .cpmcache/build/third_party from coverage report [skip ci] by @tenstorrent-github-bot in #44660
- [Feature] Tighten UnpackToDestMode legality checks in Metal 2.0 by @akerteszTT in #44656
- Adds INT32 MUL and COMP Binary SFPU kernels for Quasar by @pgardnerTT in #44427
- Adding Env Var to Disable Real Time Profiler by @awliu-TT in #44658
- [skip ci] : skip produce-cicd-data jobs on forks by @tenstorrent-github-bot in #44663
- ci: fix coverage ignore regex to match build_CodeCoverage dir [skip ci] by @tenstorrent-github-bot in #44666
- [tt-triage] Save LLM friendly output artifact in CI by @onenezicTT in #44609
- [BH] pack_untilize MOP perf: CFGSHIFTMASK + AddrMod-driven row advance by @pavlejosipovic in #44596
- [tt-triage] Triage run only for the first hanging test in a CI job by @onenezicTT in #44621
- [LLK] Enforce single-ADDR32 invariant for combined-mask packer RMW writes (BH) by @nstamatovicTT in #44251
- Add Quasar SFPU binary max/min kernel and tests by @njokovicTT in #43363
- #44519: add dst_full_sync_en and throttle propagation to mm program factories missing this by @bbradelTT in #44522
- sdxl: retune clip_encoder_1 blackhole baseline after unroll-pragma fix by @dstoiljkovicTT in #44692
- #43801: relax constraints for layer norm shapes by @bbradelTT in #44527
- ttnn: consolidate SDPA Q balancing on zigzag remapping by @pavlejosipovic in #44480
- [DCAMP-512] Force MPI OOB and BTL channels onto ens5f0np0 by @jpanasiukTT in #44497
- [DCAMP-215] Extend HD sockets tech report with Galaxy Rev C (Gen 5 PCIe) results by @jpanasiukTT in #43167
- [UMD Bump] Automated UMD Bump 15.05.2026 by @broskoTT in #44447
- ds_prefill(refactor) - move prefill_block and prefill_transformer out of moe/ by @mbezuljTT in #44600
- Enable traced prefill for APC in TT transformers (issue 268 @ vLLM) by @sanjaradylov in #43165
- [fuser] missing wh ops for sdpa by @markoradosavljevicTT in #44089
- SFPI 7.51.0 625 by @nathan-TT in #44700
- [Quasar] Remove llk_pack_rows_api.h for Quasar because it is not implemented by @amokanTT in #44709
- [skip ci] Remove uv pip install ttnn from multihost runner by @dpopovTT in #44712
- ci: merge per-core allocation tests into core ttnn unit test group by @tenstorrent-github-bot in #44581
- Add GELU activation support to `moe_compute` for Gemma 4 26B by @gajanan-choudhary in #44515
- Improve performance of accurate exp on bfloat16 by @nmauriceTT in #43973
- Cleanup: Update BGE vLLM generators after VLLM split to plugin by @viktorpusTT in #44081
- GPT-OSS: ensure L1 dispatch + harden unit tests by @sraizada-tt in #44724
- Generating MPI rank IDs to match Mesh IDs where possible by @Riddy21 in #44197
- sdpa: hoist id_of out of fetch_block for PaddedAddrGenerator by @skrsticTT in #44337
- [skip ci]: Galaxy profiler test env by @tdowdallTT in #44728
- Cleanup: Quasar tilize/untilize cfg structs and writes by @nmohammedTT in #44290
- [Feature] Quasar - Assert on valid Buffer Descriptor programming (#1424) by @jihoonyouTT in #44430
- Remove dependency on debug bit 11 for transpose_dest by @amahmudTT in #43757
- Untilize Eltwise Binary Tests - Use named constants and remove outdated files by @amokanTT in #44616
- [skip ci]: disable ds_r1_qwen N300 demo test pending timeout fix by @tenstorrent-github-bot in #44737
- [skip ci] Update moreh ops owners by @bbradelTT in #44732
- [skip ci] Rename auto-triage workflow surfaces to regression-analysis by @ebanerjeeTT in #44260
- [Feature] Replace column-parallel LM head with row-parallel LM head for DeepSeek v3 by @nbabinTT in #44505
- Migrate experimental ops (PR #41637 scope: dit_layernorm + fused_rmsnorm + deepseek_grouped_gate + hc_sum_reduce) to ProgramDescriptor by @dgomezTT in #43299
- [Bug fix] Fix mamba 1D depthwise conv by @pavlejosipovic in #44490
- [skip ci] ci: add aarch64 native Ubuntu 24.04 build workflow by @blozano-tt in #44763
- Add overlay namespace for Quasar hardware intrinsics by @vvukomanovicTT in #39716
- [Bug fix] #43831: fixes zero comparison ops, binary comparison ops to handle special value inputs by @KalaivaniMCW in #44085
- Disabling zero_init for MoE by @pmilojevicTT in #44635
- [Bug fix] Fix DeepSeek teacher-forced decode input args by @gwangTT in #44537
- [tt-train] Deduplicate benchmark utilities and fix from_vector startup by @mdragulaTT in #44133
- [skip ci] ttsim: use PYTEST_ADDOPTS so skip-list args reach all chained pytests by @tenstorrent-github-bot in #44781
- Topology Solver: Migrating to CaDiCaL SAT solver engine by @Riddy21 in #44037
- [Feature] Add DeepSeek KV cache + Decoder Layer validation by @bzhangTT in #44540
- [Feature] DFB "from borrowed memory" support in Metal 2.0 by @akerteszTT in #44662
- [skip ci] cleanup: Ensure metal-infra co-owns all CMakeLists.txt files by @tenstorrent-github-bot in #44794
- Improve FillPadDeviceOperation performance by @nmauriceTT in #42673
- Bug fix: Fix binary SFPU LLK perf dst output index by @ndivnicTT in #44757
- fix: reenable replay buffer optimization by the compiler by @fvranicTT in #44772
- ttnn.reshape: apply pad_value for tiled tensors by @VirdhatchaniKN in #42770
- Add L1-safe defaults for paged SDPA decode by @pavlejosipovic in #44617
- Adds validation to other reduce ops (argmax, cumprod, cumsum, moe, topk, prod, fast_reduce_nc, ema, sampling) by @ign-abirami in #43430
- Perf tests for DeepSeek Prefill by @nmilicevicTT in #43168
- Fix ETH heartbeat failure in Galaxy all_gather_async + raise sweep hang-detection timeout by @Aswinmcw in #44811
- [skip ci] Update llama bh tests in release workflow by @dpopovTT in #44731
- Revert #43973 (accurate exp on bfloat16) — fixes ring-joint SDPA accuracy regression by @skrsticTT in #44828
- [Bug fix] LLK: preserve UInt16 datums whose low byte is zero across SrcA/SrcB reconfig (#37571) by @nstamatovicTT in #43882
- Bump ttsim version to v1.6.3 by @mcraigheadTT in #44800
- pdl: re-baseline deeplab_v3_plus_110_cores device perf after pack_untilize MOP opt (#44596) by @dstoiljkovicTT in #44832
- [Cleanup] Update fullname with trade mark symbols by @dhelmuthTT in #44754
- chore: update ttsim version to 1.6.3 by @fvranicTT in #44807
- #38973: gelu_bw (tanh) - dst[8] out of bounds in DST SyncHalf by @KalaivaniMCW in #44818
- [Quasar] L1 acc compute api and test, matmul short init by @amokanTT in #43990
- Cleanup : Restore 4-hour schedule for runtime perf tests by @msudumbrekar-TT in #44764
- Fix moe_gpt tilize CB overflow causing flaky device hangs by @sraizada-tt in #44723
- [Bug fix] fixed fabric in Deepseek tests by @maksim-tsishkouski-epam in #44833
- [ttnn] paged_fill_cache: allow asymmetric num_heads under HMA tensor sharing by @handrewsTT in #44603
- ci(gemma3): migrate perf tests to HF hub weight path by @tenstorrent-github-bot in #44762
- [Performance + Test Only] Update DeepSeek decode perf target by @maksim-tsishkouski-epam in #44846
- #44612: Use -ftt-no-dyninit by @nathan-TT in #44744
- Relax MLA prefill chunked atol by @pavlejosipovic in #44856
- Propagate kernel compilation warnings by @nathan-TT in #44843
- DEVICE_PRINT dispatch by @tt-vjovanovic in #43509
- Bug fix: skip Tracy queue pushes from RT profiler when no viewer is connected by @yusufbashiTT in #44784
- [skip ci] ttsim: skip test_tiled_concat on wormhole_b0 by @ebanerjeeTT in #44790
- [Bug fix] Use `uintptr_t` for device-side pointer casts by @sagarwalTT in #44801
- Feature: Harden bug checker failures and improve PR comment UX by @stevendae in #44845
- [Bug fix] Write `MCAST_DESTS` register in `noc_write_with_state` and `noc_wwrite_with_state` functions on Quasar by @sagarwalTT in #44864
- [Feature] Config option to enable python stack traces by @dcblundell in #44727
- [Quasar] Exp Bringup by @ryanzhuTT in #44236
- [skip ci] Disable consistently failing tests in (Runtime) Unit Tests by @ebanerjeeTT in #44769
- [Feature]: Support for aliased dfbs by @abhullar-tt in #44806
- [Feature] Add resolve_id() to DFBAccessor for Quasar LLK APIs by @akerteszTT in #44665
- [LLK] Add LLK debug API for writing to and dumping DEST by @halghTT in #44209
- ci(pr-gate): consolidate to ASan-only builds (remove Release build job) by @tenstorrent-github-bot in #43322
- Fix fabric init/teardown config and mesh device closure ordering by @nnyamagoudar-TT in #44789
- [Bugfix] Fix legality checks for aliased DFBs by @akerteszTT in #44892
- ci: classify Docker registry errors as DOCKER_REGISTRY_FAILURE instead of GENERIC_FAILURE [skip ci] by @tenstorrent-github-bot in #44733
- [Bug Fix]: Fix Idle ERISC Timeouts on BH by @sidnadTT in #44874
- BH Exabox CCL unit tests by @llongTT in #42185
- feat: optimize quantization llks by @fvranicTT in #44844
- infra: track TtSmiReset attempts with structured Pydantic model by @tenstorrent-github-bot in #44774
- [skip ci] ci(llk): remove llk-tests-changed from llk-build-quasar trigger by @tenstorrent-github-bot in #44429
- ci(tests): add module-level device fixture to pool tests by @tenstorrent-github-bot in #44564
- ci: classify artifact upload failures as ARTIFACT_UPLOAD_FAILURE in Snowflake [skip ci] by @tenstorrent-github-bot in #44783
- Revert "track TtSmiReset attempts with Pydantic model" by @arshahTT in #44898
- infra: classify git checkout/submodule clone failures as CHECKOUT_FAILURE [skip ci] by @tenstorrent-github-bot in #44690
- llk blackhole simulator fix by @markoradosavljevicTT in #43545
- SDPA: global Q scheduling for single-chip path by @skrsticTT in #44614
- [ttnn] paged_update_cache: allow asymmetric num_heads under HMA tensor sharing by @handrewsTT in #44606
- [skip ci] Exabox cronjob push for one more hour by @dpopovTT in #44913
- RT profiler warning filter by @mo-tenstorrent in #44841
- [Test] Replace assert_with_pcc in eltwise test files by @VirdhatchaniKN in #44904
- [Quasar] Add quasar SFPU div LLK kernel by @njokovicTT in #44026
- Migrate matmul/factory/matmul_multicore_reuse to ProgramDescriptor by @dgomezTT in #44756
- Support head_dim=32 in ttnn.experimental.rotary_embedding_hf via dedicated Wt == 1 paths by @velonica0 in #43860
- Add trace-based PCC test for DeepSeek prefill transformer to CI by @ddjekicTT in #44376
- Add chunked trace validation for host I/O decoder sweep by @gwangTT in #44804
- [Bug Fix] DRAM Matmul - Fix Compute Reading Un-pushed Data by @edwinleeTT in #44872
- [Bug fix] Use `NOC_XY_ENCODING` and `NOC_MULTICAST_ENCODING` in Quasar HAL by @sagarwalTT in #44859
- Do not append opIDs of zero to cq_dispatch rt profiler ID buffer by @mo-tenstorrent in #44510
- [Cleanup] DeepSeek set trace_region_size=0 by @maksim-tsishkouski-epam in #44914
- SFPI 7.52.0 637 by @nathan-TT in #44926
- [Quasar] Remove skips for already implemented Quasar tests by @amokanTT in #44719
- Add N150 model traces: 17 models across LLM, CNN, VLM, embedding, TTS by @Aswinmcw in #44464
- [Test]: Fix test bug for intra+dmSxt6A dfbs where consumers don't process all tiles by @abhullar-tt in #44893
- [Bug fix] Finalize offsets before writing fast dispatch kernel runtime args by @sagarwalTT in #44931
- [ttnn] paged_scaled_dot_product_attention_decode: allow asymmetric num_heads by @handrewsTT in #44633
- Fix ring joint SDPA sigmoid reciprocal init by @pavlejosipovic in #44866
- [Bug fix] Replace flush_l2_cache_line with flush_l2_cache_range for watcher structs by @kstevensTT in #44964
- Add experimental Wan2.2 SVI infinite video pipeline + multi-LoRA stacking by @kevinmi-TT in #44887
- infra: track TtSmiReset attempts with structured Pydantic model by @tenstorrent-github-bot in #44902
- [Bug fix] Fixing eth_fw_api.h warnings by @kstevensTT in #44968
- [Feature] Adding handling of HW errors on compute cores by @arikTT in #44836
- Topology solver multi-solution enumeration API (solve_topology_mapping_n/all/next) by @Riddy21 in #43593
- infra: track TtSmiReset attempts with structured Pydantic model by @tenstorrent-github-bot in #44974
- [Bug fix] fused_rmsnorm_post_allgather: register weight/rope buffers as BufferBinding by @skrsticTT in #44963
- ring_joint_sdpa: trim program size to clear Tensix kernel-config ringbuffer by @skrsticTT in #44925
- [44242-ttnn-missing-api-files] Publish missing API files for ttnn ops by @rlewczuk in #44243
- [bug fix] Deassert ALL chip portions by @arikTT in #44972
- Bug fix: pinned-memory H2D write correctness (cache invalidation, relay alignment, prefetcher paging, per-MMIO budget) by @jbaumanTT in #44467
- Fix ttnn::pad stale pad-value buffer on program-cache hit (#44565) by @pavlejosipovic in #44601
- Better validation for matmul ops - Phase 1 by @ign-yaswanth in #43987
- Fix conv3d streaming output partial block race by @pavlejosipovic in #44973
- [Bug fix] Pool 2D compute: replace thread-unsafe pack HW configure with lightweight reconfig by @dstoiljkovicTT in #44842
- Expand bfp4_b testing (migrate from LLK repo) by @ldjurovicTT in #42149
- [Bug fix] Correctly unpack outputs of `Transform.prepare_inputs_prefill`. by @sanjaradylov in #44812
- ds_prefill(gate_refactor) - split TtMoERoutingSetup out of TtMoEGatePrefill by @mbezuljTT in #44623
- Fix conditions for using llk bcast with float32 output in binary_ng program factory by @bklockiewiczTT in #44699
- [skip ci] Fix upstream bh galaxy and multicard MLperf weights paths by @williamlyTT in #44879
- [Quasar] Sqrt and Recip Bringup by @ryanzhuTT in #44743
- Support Wan VAE width padding by @sosborne-TT in #43938
- Merge DeepSeek ds-rc1 onto main by @yieldthought in #44922
- refactor: remove pad_by_zero, replace with torch2tt_tensor (fixes #43919) by @tenstorrent-github-bot in #43958
- Added proper guard for DUAL_BH and QUAD_BH tests, remove single galaxy tests. by @llongTT in #45030
- Enable FP8 in tt-metal for combine DeepSeek V3 Prefill operator by @pmilojevicTT in #43775
- [skip ci] Disable consistently failing tests in t3000-perf-tests by @ebanerjeeTT in #45051
- Migrate ccl/all_broadcast to ProgramDescriptor (POC for workload-level prepare_resources) by @dgomezTT in #44407
- ci: increase timeout of LLK Blackhole tests to 23 minutes by @fvranicTT in #45057
- Bump ttsim version to v1.6.4 by @mcraigheadTT in #45059
- Migrate ccl/all_to_all_{combine,dispatch} to ProgramDescriptor by @dgomezTT in #44408
- Migrate point_to_point to ProgramDescriptor on mesh workload hook by @dgomezTT in #44411
- Migrate debug/apply_device_delay to ProgramDescriptor by @dgomezTT in #44401
- Replace old `stimuli_generator` with refactored implementation by @jmacanovicTT in #44990
- Fix LLK perf tests by @ndivnicTT in #44907
- ring_joint_sdpa: chunked prefill by @skrsticTT in #44929
- Cleanup: LLK several enum to enum class conversion and enum deletions (BH/WH) by @ndivnicTT in #44622
- #44837: Use sfpi::vBool by @nathan-TT in #45014
- Hoist BH DEST remap setup for conv3d hot loops by @pavlejosipovic in #44986
- Re-land #43973 accurate exp on bfloat16 by @pavlejosipovic in #44965
- tt-llk: read fast tilize guard from worker core by @nkapreTT in #44994
- Cleanup: LLK enum class for EltwiseBinaryType by @ndivnicTT in #44619
- Cleanup: LLK enum class refactor for VectorMode by @ndivnicTT in #44443
- [Bug fix] composite_reduce_scatter: typecast BFLOAT8_B → BFLOAT16 around split/pad/concat by @VirdhatchaniKN in #44920
- [skip ci] Add Blackhole e2e SDPA test selection by @pavlejosipovic in #45008
- #36217: uint8 support for relational ops. by @abdulla-TT in #44824
- [Performance] Update Swin-S device perf target by @pavlejosipovic in #45001
- Migrate experimental/transformer leftover ops (6) to ProgramDescriptor by @dgomezTT in #44400
- [Feature] Base pointer retrieval API for TensorAccessor by @akerteszTT in #45091
- Migrate experimental/ssm leftovers to ProgramDescriptor by @dgomezTT in #44403
- Data format reconfigure Quasar by @vkrsmanovicTT in #44221
- Migrate moreh_dot op to ProgramDescriptor framework by @dgomezTT in #42467
- Migrate transformer/sdpa family (7 variants) + ring_attention_all_gather to ProgramDescriptor by @dgomezTT in #44755
- Migrate moreh family to ProgramDescriptor framework by @dgomezTT in #43204
- #41827 MoE compute: Blackhole single-card bring-up by @dchenTT in #44195
- [Bug fix] Extend page-table alignment to `batch_size > 1` and guard `chunk_start_idx` by @sanjaradylov in #44993
- Fix DeepSeek MoE trace config and TG decode handling by @yieldthought in #45081
- Fix flexible chunked SDPA descriptor buffer binding by @pavlejosipovic in #45121
- ds_prefill [Feature]: ttnn.kv_cache.fill_cache_for_user_ update_idx exposed by @ipotkonjak-tt in #44827
- [DCAMP-510] Recursive cluster config aggregation in `merge_cluster_configs.py` by @jpanasiukTT in #44486
- Fix fusion clang-tidy missing MeshCoordinateRange include by @pavlejosipovic in #45124
- [Bug fix] Revert "LLK: preserve UInt16 datums whose low byte is zero across SrcA/SrcB reconfig" (#43882) by @nstamatovicTT in #45020
- sdpa: migrate kernels off deprecated dst and noc-tile APIs by @skrsticTT in #45033
- Migrate experimental/fusion/fusion_dispatch_op to ProgramDescriptor by @dgomezTT in #44405
- [Bug fix] test_bcast.cpp: qualify ELW* enumerators with EltwiseBinaryType:: by @ndivnicTT in #45131
- [tt-train] sdpa_bw: use 2-arg pack_reconfig_data_format to avoid SFPU drain by @vmelnykovTT in #44873
- Fix MoE kernel compile regressions by @pavlejosipovic in #45092
- Optimise exp2 fp32/bf16 by @aliraza556 in #44539
- ci(blackhole-e2e): Consolidate per-SKU AI run summaries into one by @ppetrovicTT in #44821
- #42373 [Descriptor Migration] embedding_backward by @azaytsev-epam in #44120
- data_movement: migrate transpose + permute + untilize + untilize_with_unpadding to ProgramDescriptor by @dgomezTT in #45071
- Performance: LLK CI — extend ci-quiet to e2e/perf and tighten PR Gate smoke sharding by @ndivnicTT in #45136
- Update fullname to include trademark symbol by @dhelmuthTT in #45043
- moreh_mean_backward: register output_grad as BufferBinding to fix stale-address cache hits by @dgomezTT in #45137
- [BugFix] DCAMP-595 and misc enhancements including --check mode, extra mpi arg… by @vdu-TT in #45115
- #45150: fix up test imports by @bbradelTT in #45151
- [Cleanup, Test only] Eliminate divergent code paths between legacy and metal 2.0 APIs by @shengxiangjiTT in #44638
- [Test Only] Migrate test_bmm and its kernels to Metal 2.0 API by @shengxiangjiTT in #44760
- Port device print to LLK by @iklikovacTT in #43981
- [Feature] Add MeshTensor overloads for RTArgs and CBDescriptor by @riverwuTT in #44327
- [Bug fix]: sem up wasn't using noc index to calculate noc addr by @abhullar-tt in #45120
- Add 31 N300 trace registry entries (traces 60-90) by @Aswinmcw in #45119
- #43337: Add support for snake_beta activation by @mouliraj-mcw in #43614
- Fix snake_beta VectorMode param type (compile failure on ttsim BH) by @tenstorrent-github-bot in #45171
- prefill_runner config loading and default iteration number by @jjovicicTT in #44819
- Fix ring joint SDPA streaming CB setup by @pavlejosipovic in #45104
- [Bug fix] binary_ng: disable PACK_RELU fast path for subtile-broadcast kernels by @Aswinmcw in #44912
- Migrate experimental/deepseek_prefill mesh-workload ops to ProgramDescriptor by @dgomezTT in #44409
- Eltwise binary init compute and llk api cleanup by @amokanTT in #44602
- [Bug fix][Quasar] noc_init: cap MAX_BYTES_IN_PACKET at 8KB and init cmd buffers on DM>0 by @vvukomanovicTT in #44847
- [Bug fix + Test Only] Fix multi-host tests by @maksim-tsishkouski-epam in #45202
- [ttnn] Relax to_layout ROW_MAJOR dtype assert: allow BFLOAT8_B -> BFLOAT16 by @dgolubovicTT in #45148
- Improve unary SFPU golden generation by @jmacanovicTT in #45126
- [skip ci] [cleanup] rename CPM googletest -> GTest to match upstream and UMD by @afuller-TT in #45167
- Cleanup: vLLM: Use additional_config, prefer it over plugin_config by @viktorpusTT in #44991
- [Bug fix] Fixed DeepSeek tests by @maksim-tsishkouski-epam in #45125
- [skip ci] ci: add ctcache (clang-tidy result caching) via Garage S3 by @tenstorrent-github-bot in #41796
- [skip ci] Update fullname to include trade mark symbol by @dhelmuthTT in #45044
- [vllm] Disable hybrid kv cache groups: emit FullAttentionSpec for all layers by @handrewsTT in #45184
- [Clean up]: Porting kernel lib to DFBs for WH/BH by @abhullar-tt in #45022
- Feature: Fix mock+silicon coexistence via MetalEnv (#38445) by @msudumbrekar-TT in #43754
- fix: pin host buffer in async cpu() to prevent use-after-free (issue #43638) by @Riddy21 in #44207
- emule: Metal 2.0 JIT support (named bindings + binding-aware cache key) by @arminaleTT in #45221
- Fix Quasar warnings by updating device_print and build configurations by @tt-vjovanovic in #45088
- Migrate experimental/reduction leftovers to ProgramDescriptor by @dgomezTT in #44402
- [Feature] Dynamic shape options for Metal 2.0 TensorParameters by @akerteszTT in #45106
- Fix finding MPI rank in tt-triage by @tt-vjovanovic in #44996
- [API Cleanup] Move DFB disable_implicit_sync to Gen2DMConfig in Metal 2.0 by @akerteszTT in #45160
- Migrate clone op to ProgramDescriptor framework by @dgomezTT in #42477
- [skip ci] Add tt-power-sidecar: light-weight power measurement by @tenstorrent-github-bot in #44983
- [Feature] paged_fill_cache: accept batched batch_idx_tensor for forge batch-32 perf+compile time improvement by @kmabeeTT in #45117
- Migrate data_movement family (26 ops) to ProgramDescriptor framework by @dgomezTT in #43840
- Bump ttsim version to v1.7.0 by @mcraigheadTT in #45275
- ci: enable kernel ccache on pull_request events [skip ci] by @tenstorrent-github-bot in #45283
- [Fix] Restore CircularBuffer in tilize_helpers.hpp after DFB migration (#45022) by @tenstorrent-github-bot in #45281
- [CI] Gate tt-train tests on path changes, not every merge queue entry [skip ci] by @tenstorrent-github-bot in #45223
- ci: fix tt-train merge queue deadlock (remove concurrency block from reusable workflow) by @tenstorrent-github-bot in #45293
- Fix SD3.5 tests by @sosborne-TT in #44024
- [ci] fix -Wunused-but-set-variable warnings in fabric_mux and cq_dispatch kernels by @tenstorrent-github-bot in #45165
- Reduce SDPA ring-joint code size by @pavlejosipovic in #45191
- [ttnn] paged_update_cache: validate input shard num_cores == num_users by @handrewsTT in #45016
- ci: move code-analysis from merge gate to PR gate (blocking) [skip ci] by @tenstorrent-github-bot in #45287
- [Model bringup] Add Z-Image-Turbo model demo by @svuckovicTT in #44611
- [ttnn/grid_sample] Honor align_corners + fix fp32_dest_acc + thread compute_kernel_config by @sott0n in #44901
- [Security] Bump protobuf from 4.25.8 to 5.29.6 in /tt_metal/python_env by @dependabot[bot] in #37210
- Fix kernel-build warnings in conv-team-owned kernels by @dstoiljkovicTT in #45187
- [BH] Bump ufld_v2 device perf baseline 595 -> 613 by @dstoiljkovicTT in #45176
- [GraphTracker] Make processors and hook thread_local to fix concurrent-access race by @rpavlovicTT in #44668
- [skip ci] Update cron job times by @dpopovTT in #45307
- [Cleanup] Quasar: Duplicate unpack_A llk api functions by @jihoonyouTT in #44876
- Bug Fix (Tier 3) mamba-2.8b E2E test by @fbuticTT in #44854
- [Quasar] SFPU ternary where: LLK fix by @njokovicTT in #45012
- SRAM expert integration to DeepSeek_b1 by @yugaoTT in #44773
- emule: define TRISC_UNPACK/MATH/PACK for non-Quasar compute kernels (fixes tt-emule#24) by @xanderchin in #45149
- [gpt-oss] test_decoder: cover paged kv-cache path; fix two pre-existing bugs by @handrewsTT in #45206
- Fix SDPA decode hang caused by c_8 CB ownership race (cur_pos_tensor path) by @djordje-tt in #45142
- [tt-train - bug fix] Separate stdout and stderr streams when generating training logs by @kevinwuTT in #45280
- [tt-train] Wire up or disable vocab parallel cross entropy loss across all trainers by @bklockiewiczTT in #45005
- [skip ci] Fix: mpi interface selection logic on exabox scripts by @gsarabandoTT in #45271
- [ttnn] paged ops: add cache_position_modulo for bounded sliding-window kv caches by @handrewsTT in #45193
- [Model demo] Z-Image-Turbo - Remove trace_region_size, l1_small_size, and allow external device ownership by @svuckovicTT in #45317
- [Skip CI] Fix DeepSeek B1 multihost tests for 4 and 16 Galaxies by @gwangTT in #44608
- [skip ci] clang-tidy: drop fetch-depth and fetch-tags from checkout by @tenstorrent-github-bot in #45330
- Minor fix to QwenImage VAE by @sosborne-TT in #45214
- [UMD Bump] Automated UMD Bump 19.05.2026 by @broskoTT in #44725
- Migrate mish to LLK and improve performance and accuracy by @mcw-anasuya in #44985
- fix: swap legacy TBB for oneAPI TBB 2021+ in manylinux build by @tenstorrent-github-bot in #45256
- [skip ci] Fix kernel ccache crashing fork/cross-repo PRs when Redis secrets unavailable by @tenstorrent-github-bot in #45361
- [Cleanup] Rename MeshTensor.device_mut() to mutable_device() by @riverwuTT in #45363
- Unary Broadcast Quasar by @vkrsmanovicTT in #41329
- MINFRA530: Reorg of profiler pipelines by @tdowdallTT in #44862
- [Cleanup] Delete empty tt_metal/impl/tensor/tensor_utils.cpp by @riverwuTT in #45343
- [API Cleanup] Structural API cleanup of Metal 2.0 by @akerteszTT in #45290
- Disable failing conv1d replicate pad case by @pavlejosipovic in #45393
- [tt_transformers] Remove deprecated lt script by @mtairum in #45353
- [gpt-oss-120b][tier-1] Fix dropdown, set throttle, and add --skip-model-load to CI run by @djordje-tt in #45301
- Fix ttnn examples test by @dpopovTT in #45401
- fix(tt_metal): add PCIe BDF alias for bh-qb-10 quietbox by @mbezuljTT in #39793
- [skip ci] Disable consistently failing tests in tt-metal-l2-tests by @ebanerjeeTT in #44860
- Add BH p150b/p100a traces + mesh-aware validation batching + Galaxy lead_models by @Aswinmcw in #45185
- [skip ci] Disable consistently failing tests in (T3K) T3000 e2e tests by @ebanerjeeTT in #45108
- [skip ci] Disable failing t3k_ttmetal and t3k_ttnn tests in T3000 unit tests pipeline by @ebanerjeeTT in #45306
- [skip ci] disable failing tests t3000 demo tests 20260521 by @ebanerjeeTT in #45406
- BH: Add FP8 to matmul by @rtawfik01 in #43481
- Support DRAM sharded input for untilize operation by @ctr-martemovTT in #44906
- Remove unused variable by @vtsilytskyiTT in #45403
- [Fix] page table allocation when hybrid is temporarily disabled. by @viktorpusTT in #45404
- [Quasar] Implement Mul LLK by @ryanzhuTT in #45344
- Remove deprecated DPRINT functionality and update to DEVICE_PRINT by @tt-vjovanovic in #44930
- Quasar - Ported Pack Untilize to TensorShape by @nmohammedTT in #45257
- [Qwen3-VL] enable batches of mixed image+text and text only prompts by @ssanjayTT in #44918
- ccl/all_to_all_combine: hash input buffer addresses to avoid stale-RTA cache hits (BHPC red since 2026-05-23) by @dgomezTT in #45332
- [Bugfix] Cap DFB HW configure loops at NUM_DFBS and skip non-participating slots by @ymoussaTT in #45156
- [Quasar] RSQRT bringup by @ryanzhuTT in #45358
- Fix typos in TT-Distributed-Architecture [skip ci] by @wusatosi in #22857
- [skip ci] Remove UF specific label for WH galaxies by @roseli-TT in #45362
- Add CODEOWNERS entries for tests/tt_metal/tt_metal/llk and LLK CMakeLists by @fvranicTT in #45427
- [Bug Fix] Enable Default Numeric Stability for Transformer attention_softmax by @edwinleeTT in #44849
- [Feature] GCBs with a DRISC sender (DRAM-core GlobalCircularBuffer) by @jbaumanTT in #44888
- experimental/transformer: migrate rotary embedding family to ProgramDescriptor by @dgomezTT in #45070
- Performance/accuracy: expm1 by @jasondavies in #45338
- fabric/ccl: add ProgramDescriptor variants of fabric_mux helpers by @dgomezTT in #45066
- [LLK] Move some SFPU ops into L4 by @halghTT in #45259
- [MeshTensor Integration] Reduction op family by @riverwuTT in #44654
- Add BH fast untilize support by @pavlejosipovic in #45103
- [tt-train] Python Implementation of TTML LR Schedulers by @awliu-TT in #43469
- Re-enable some DiT model tests by @sosborne-TT in #45433
- [skip ci] Remove @nardoTT from CODEOWNERS by @tenstorrent-github-bot in #45451
- #45440: Skip docs deploy for pre-release tags by @dimitri-tenstorrent in #45442
- [Cleanup] Submerge temporary quasar API headers to impl/ by @shengxiangjiTT in #45289
- [skip ci] Revert stale SDPA disable for blackhole 3-chunk 4x768 case by @ebanerjeeTT in #45449
- vLLM fixes by @pprajapatiTT in #45284
- [Fix] Migrate deprecated get_noc_addr callsites in embedding and rms kernels by @tenstorrent-github-bot in #45318
- [Quasar] Add binary SFPU Div - Compute Side by @njokovicTT in #43988
- [Quasar] exp bug fix by @ryanzhuTT in #45459
- #42240: Remove dead transpose_utils helpers superseded by #45071 by @mouliraj-mcw in #43871
- #45394: Uplift multigammaln to use the new lgamma device operation by @KalaivaniMCW in #45395
- [gpt-oss] skip fused MoE weight prep when cache files exist by @handrewsTT in #45438
- Take input and output shape into account for concat program cache by @ctr-martemovTT in #45144
- [skip ci] Disable consistently failing tests in vllm-nightly-tests by @ebanerjeeTT in #45313
- Update ttsim version to 1.7.0 by @fvranicTT in #45297
- fix: ttnn.to_layout(ttnn.TILE_LAYOUT) hangs on HEIGHT_SHARDED by @ctr-martemovTT in #44625
- [tt-train] Split K Gram matmul implementation by @zbaczewskiTT in #40928
- [tt-train] Migrate ttml models to fused SwiGLU by @imichalakTT in #45335
- #45441: Sync published_versions.json to match gh-pages reality by @dimitri-tenstorrent in #45457
- #42370 [Descriptor Migration] copy/typecast by @azaytsev-epam in #44257
- point_to_point: fix cache-hit stale-buffer bug from optional-output aliasing (#45422) by @dgomezTT in #45429
- Migrate deepseek_moe_fast_reduce_nc_fused + all_gather_via_broadcast to ProgramDescriptor (Contract-2) by @dgomezTT in #45076
- Removes MatmulMultiCoreReuseProgramFactory by @ign-abirami in #44587
- ring joint sdpa: bump 8k perf baseline 65.2 -> 65.6 by @skrsticTT in #45527
- #45444: update kernel_lib to compile on Quasar by @bbradelTT in #45445
- Bug fix: Fix DeepSeek matmul expert LLK assert by @ndivnicTT in #44850
- Ring joint SDPA: chunked-prefill accuracy, determinism, and per-chunk perf tests by @skrsticTT in #45525
- Adjust, reorder compiler options by @nathan-TT in #45348
- Fix failing CCL pytests in T3K e2e pipeline by @scardozaTT in #45434
- [Bug fix] Bump WH BRISC firmware size to 6KB+128 to fit profiler by @jbaumanTT in #45505
- [#40012] Support In-Process Topology Updates after Link Recovery by @jpanasiukTT in #42866
- #44966: Add pre-publish RTL Sim CI gate to package-and-release.yaml by @dimitri-tenstorrent in #45145
- [test] Relax TestSimpleL1Read tolerance from 0.05% to 0.5% to fix flaky CI by @tenstorrent-github-bot in #45624
- Replace deprecated API calls in kernels and docs by @vtsilytskyiTT in #45316
- #43444 MoE: Full fused op on BlackHole by @dchenTT in #45294
- [MeshTensor Integration] Matmul by @riverwuTT in #44220
- [Feature]: Implement Dispatch Telemetry by @anashTT in #44786
- [skip ci] Estimated number of machine hours per week per sku by @roseli-TT in #45615
- [Bug fix] Update ttnn.std/var and layernorm ops for FP32 precision (UnpackToDestFp32) by @fplavecTT in #45319
- Topology Mapper: Physical Grouping improved matching for heterogeneous groups by @Riddy21 in #43836
- Feature: Quasar SFPU binary add_int and metal test by @nmohammedTT in #45028
- [emule] Uplift Fixes by @arminaleTT in #45568
- Move bit 11 from unpacker to math thread and fix 32bit bcast (WH) by @markoradosavljevicTT in #44412
- Feature: Quasar SFPU binary mul_int compute api and metal test by @nmohammedTT in #45325
- bumping exalens to 0.3.20 by @tt-vjovanovic in #45640
- remove first token assert from prefill runner by @jjovicicTT in #45638
- ci: reduce number of threads for Quasar compile job from 35 to 20 by @fvranicTT in #45632
- [skip ci] Disable IntermeshSplit2x2FabricFixture.* in t3000-unit-tests t3k_tt_metal_multiprocess_tests by @ebanerjeeTT in #45492
- [skip ci] Disable MeshDeviceFixture.Top32RmDevPipelineCompletes in tt-metal-l2-nightly llk-sd-unit-tests by @ebanerjeeTT in #45484
- [skip ci] Disable swin_s test_e2e_performant / test_e2e_performant_dp in perf-models (TT_FATAL: trace buffer size exceeds trace_region_size) by @ebanerjeeTT in #45529
- ci: increase verbosity level of TT_LOGGER for CI jobs by @fvranicTT in #45634
- [skip ci] Disable test_all_to_all_combine_no_trace in blackhole-e2e-tests ccl nightly tests by @ebanerjeeTT in #45494
- [bug fix] DEVICE_PRINT was using the wrong field name for Quasar by @arikTT in #45575
- [test] Disable flaky TestSimpleL1Read timing assertion (#45624 followup) by @tenstorrent-github-bot in #45647
- [Quasar] SFPU where: compute-side bringup by @njokovicTT in #44977
- Add quasar mxfp6 format metal infra. by @uvelimirovicTT in #43374
- [skip ci] skip test_var_fp32_doscale_wt_gt_1 due to bit 11 precision regression by @markoradosavljevicTT in #45644
- Rtawfik/mul float bringup by @rtawfik01 in #45645
- [Bug fix] moreh.layer_norm not populating rstd when mean=None by @ign-sudha in #45168
- fix: correct vector::reserve division error in erisc_datamover_builder by @tenstorrent-github-bot in #44466
- [fabric][jit] Demote verbose log_info to log_trace by @tenstorrent-github-bot in #44438
- ci: raise TTSim C++ tests step timeout from 5 → 10 minutes by @tenstorrent-github-bot in #44375
- ci: move ttsim unit tests from merge-gate to sanity-tests by @tenstorrent-github-bot in #45270
- Add quasar mxfp8 format metal infra. by @uvelimirovicTT in #43342
- emule: role-aware L1_SLOT_MASK, per-NOC DRAM bank mapping, mcast include_self by @xanderchin in #45623
- Update Fabric test infra Z link handling by @nnyamagoudar-TT in #45420
- cmake: use CMAKE_COMPILE_WARNING_AS_ERROR to disable -Werror by @lu-zero in #41600
- ci: add CPM source cache to clang-tidy-reusable workflow by @tenstorrent-github-bot in #45672
- Fix DeepSeek MLA SDPA chunk sizes by @pavlejosipovic in #45658
- SDPA: hazard-free local reciprocal + enable chunked determinism tests by @pavlejosipovic in #45659
- [skip ci] Update release pipeline permissions by @dpopovTT in #45695
- #41662: Fix FP32 transpose RM sharded numerical regression by @mouliraj-mcw in #44726
- gpt-oss: row-sharded eval + 16K/65K OOM fixes by @sraizada-tt in #45633
- DeepSeek V3 Prefill - Moving single-chip tests to L2-nightly by @pmilojevicTT in https://github.com/tenstorrent/tt-metal/pull/45304
- ds_prefill(model_configs) - Add reference configs for GLM 5.1, MiniMax M2.7, GPT-OSS 120B, DeepSeek V4 Flash/Pro, Kimi K2.6 by @mbezuljTT in https://github.com/tenstorrent/tt-metal/pull/45408
- Add naive FSDP support by @philei-tt in https://github.com/tenstorrent/tt-metal/pull/44019
- Reimplement ASSERT by @nathan-TT in https://github.com/tenstorrent/tt-metal/pull/45627
- fix reduce_block_max_row uninit bit 11 leak by @markoradosavljevicTT in https://github.com/tenstorrent/tt-metal/pull/45518
- [WIP][tt-train] SDPA Forward: F9 multi-tile K/V chunking by @vmelnykovTT in https://github.com/tenstorrent/tt-metal/pull/44198
- Support indexed ND-sharded KV in ring joint SDPA by @pavlejosipovic in https://github.com/tenstorrent/tt-metal/pull/45315
- fix precision regression after #44412 by @markoradosavljevicTT in https://github.com/tenstorrent/tt-metal/pull/45692
- ci: upgrade actions/cache v4 → v5 (Node.js 24 compat) by @tenstorrent-github-bot in https://github.com/tenstorrent/tt-metal/pull/45669
- [Bug fix] batch_norm: update running stats after normalization (#41127) by @ign-sudha in https://github.com/tenstorrent/tt-metal/pull/45578
- gemma4: vLLM bridge for hybrid kv-cache-groups + kv-share aliasing by @handrewsTT in https://github.com/tenstorrent/tt-metal/pull/44265
- [Skip ci] Add BH Galaxy DeepSeek decoder sweep to demo SP release by @gwangTT in https://github.com/tenstorrent/tt-metal/pull/45263
- BSPM-on-SRAM: unify SRAM hot-expert allocation + demo wiring by @yugaoTT in https://github.com/tenstorrent/tt-metal/pull/45556
- [skip ci] Remove overly broad ttnn catch-all owner mapping by @ebanerjeeTT in https://github.com/tenstorrent/tt-metal/pull/45717
- Device 2.0 Semaphore: add relay_unicast / relay_multicast by @iwroszTT in https://github.com/tenstorrent/tt-metal/pull/45038
- [LLK Test Infra]Strategy based refactor of stimuli generator by @jmacanovicTT in https://github.com/tenstorrent/tt-metal/pull/45209
- [fuser] refactor yaml validator by @markoradosavljevicTT in https://github.com/tenstorrent/tt-metal/pull/45013
- Remove validate-metalium-deprecation pre-commit hook by @riverwuTT in https://github.com/tenstorrent/tt-metal/pull/45364
- Add support for fp8 -> bfp8 for tilize op by @jvegaTT in https://github.com/tenstorrent/tt-metal/pull/44307
- [Feature] More stats in jit_load_report.py by @ruizhangTT in https://github.com/tenstorrent/tt-metal/pull/45058
- [skip ci] fabric ubench: update T3K golden CSV after Z-link routing fix (#45420) by @tenstorrent-github-bot in https://github.com/tenstorrent/tt-metal/pull/45726
- [API Cleanup] Naming cleanup of Metal 2.0 API by @akerteszTT in https://github.com/tenstorrent/tt-metal/pull/45598
- Feature: DRAM-core DRISC prefetcher by @jbaumanTT in https://github.com/tenstorrent/tt-metal/pull/45169
- Make PACK, UNPACK,MATH macros variadic to help with templates by @nathan-TT in https://github.com/tenstorrent/tt-metal/pull/45730
- [Bug fix] Fixed perf microbenchmark kernel configs to define and disable dispatch telemetry by @anashTT in https://github.com/tenstorrent/tt-metal/pull/45733
- Add DeepSeek B1 stress and performance test coverage by @gwangTT in https://github.com/tenstorrent/tt-metal/pull/45508
- Add support for sub-torus topology and different stage sizes (pt1) by @aagarwalTT in https://github.com/tenstorrent/tt-metal/pull/44890
New Contributors
- @ign-akshayr made their first contribution in #42534
- @jihoonyouTT made their first contribution in #44430
- @dhelmuthTT made their first contribution in #44754
- @aliraza556 made their first contribution in #44539
- @vdu-TT made their first contribution in #45115
- @iklikovacTT made their first contribution in #43981
- @dgolubovicTT made their first contribution in #45148
- @ymoussaTT made their first contribution in #45156
- @wusatosi made their first contribution in #22857
- @lu-zero made their first contribution in #41600
Full Changelog: v0.71.2...v0.72.0