v0.74.0-dev20260627

Pre-release

github-actions released this 27 Jun 03:30
· 58 commits to main since this release
Immutable release. Only release title and notes can be modified.
4f93945

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch. There may be differences between the latest main and the previous release.

The changelog follows, showing the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/28273668996

LLK (low-level kernels)

  • Comparison-to-zero SFPU - Compute API PR 47804
  • ring_mla reduce_trigger determinism: two-phase handshake fix PR 48061
  • Feature: Add Quasar SFPU TopK kernel (local-sort / merge / rebuild) PR 44729
  • Enable square SFPU op PR 47813
  • Unpack tilize operands kernel compatible with reduce PR 47247
  • Add 32bit transpose-dest with unpack_to_dest PR 47936
  • fix llk blackhole ci timeout PR 48254
  • fast-untilize BFP: write SCRATCH_SEC0_val via ordered WRCFG PR 48161
  • Add GeLU kernel that uses tanh approximation PR 47399

Metalium (tt-metal core)

  • #45821: clear program cache on MeshDevice reconfiguration PR 47921
  • #48155: plumb quasar through stack to allow models pytests PR 48158
  • Comparison-to-zero SFPU - Compute API PR 47804
  • ring_mla reduce_trigger determinism: two-phase handshake fix PR 48061
  • Feature: Add Quasar SFPU TopK kernel (local-sort / merge / rebuild) PR 44729
  • Enable square SFPU op PR 47813
  • Add 32bit transpose-dest with unpack_to_dest PR 47936
  • Add support for Quasar device read/write to host over PCIe PR 47888
  • SFPI 7.62.0 722 PR 48282
  • LocalTensorAccessor: node-local tensor access (legacy + Metal 2.0) PR 48190
  • More descriptive error on cache miss in trace capture PR 46535
  • Add GeLU kernel that uses tanh approximation PR 47399
  • H2D stream service perf optimizations PR 47857

TT-NN

  • #47797: port more experimental/quasar ops to metal 2.0 api PR 47853
  • tilize/to_layout: support FP8_E4M3 input (-> any float TILE format) PR 48046
  • Adding padding awareness to moe_grouped_gate and dispatch PR 44272
  • fix hash collision 45821 all reduce create qkv heads device operation PR 47527
  • clamp_bw / clip_bw: fix inverted ge/le in tensor both-bounds branch PR 48204
  • Alexey zaytsev epam/fix hash collision 45821 dropout PR 47345
  • Fix [CI] test_pow_fractional_composite PCC PR 48227
  • #48155: plumb quasar through stack to allow models pytests PR 48158
  • #23179: INT32 and UINT32 large scalars in binary PR 48209
  • ring_mla reduce_trigger determinism: two-phase handshake fix PR 48061
  • group_norm: reject ROW_MAJOR interleaved input instead of hanging (#47972) PR 48143
  • indexer_score: MiniMax M3 GQA support (indexer_score_msa) PR 48205
  • Fix LLK assert sanity coverage PR 48067
  • Add 32bit transpose-dest with unpack_to_dest PR 47936
  • experimental/deepseek_prefill: migrate kernels to Device 2.0 PR 47137
  • fix hash collision 45821 ring joint sdpa device operation ring attention all gather async device operation PR 47559
  • fix hash collision 45821 all reduce async device operation PR 47526
  • fix hash collision 45821 minimal matmul strided reduce scatter async PR 47529
  • Alexey zaytsev epam/fix hash collision 45821 reduce scatter minimal async device operation PR 47531
  • fix hash collision 45821 neighbor pad async device operation PR 47530
  • Alexey zaytsev epam/fix hash collision 45821 slice reshard async device operation PR 47534
  • Alexey zaytsev epam/fix hash collision 45821 strided all gather minimal matmul async PR 47535
  • Migrate moreh dataflow kernels to Device 2.0 API PR 47923
  • Ign/reduce sum int32 PR 44061
  • sparse_sdpa_msa: add native GQA support for MSA prefill PR 48045
  • Refactor TTNN comparison mode feature to work with ttnn.graph_report PR 45448
  • Add GeLU kernel that uses tanh approximation PR 47399
  • H2D stream service perf optimizations PR 47857

tt-train

  • Bug fix: Remove int conversion for head_dim PR 48272

Models

  • #48195: adjust resnet50 quasar test for flexible grids PR 48196
  • GPT-OSS: fix 120B router-weights test via realistic dummy weight scale PR 47970
  • stable diffusion CI errors fix PR 48202
  • Adding padding awareness to moe_grouped_gate and dispatch PR 44272
  • Qwen3-32B bringup to TTTv2 PR 47353
  • #48155: plumb quasar through stack to allow models pytests PR 48158
  • Pipeline Prefill PR 47420
  • qwen25_vl: fix Qwen2.5-VL-32B on-device decode gibberish on wh_llmbox_perf (#48037) PR 47822
  • Llama-3.3-70B bringup to TTTv2 PR 47350
  • Remove skip_for_BH and add op tests to L2 tests pipeline PR 48063
  • Fix LLK assert sanity coverage PR 48067
  • #48242: point more resnet50/quasar op calls to experimental/quasar ops PR 48248
  • precise pipeline prefill chunk timing + code_debug default for kimi PR 48246
  • Qwen2.5-7B model bringup to TTTv2 PR 43814
  • Falcon-40B tests: fix transformers 5.x silent weight-load failure (PCC/NaN) — #47924 PR 47929
  • Manifest-Driven Prefill Migration Tests PR 48004
  • Add GeLU kernel that uses tanh approximation PR 47399
  • H2D stream service perf optimizations PR 47857

Infrastructure & CI

  • Adding padding awareness to moe_grouped_gate and dispatch PR 44272
  • Removing install_debugger.sh script and updating CODEOWNERS PR 48234
  • indexer_score: MiniMax M3 GQA support (indexer_score_msa) PR 48205
  • Remove skip_for_BH and add op tests to L2 tests pipeline PR 48063
  • Disable Galaxy DiT Flux.1 perf test (bricks runners with TLB-window leak) PR 47968
  • CI: Reduce Tier 2 Models Unit pipeline frequency to once a day PR 48223
  • Remove Llama from Galaxy stress pipeline (#47407) PR 48247
  • sparse_sdpa_msa: add native GQA support for MSA prefill PR 48045
  • Make mmfusedreduce codeowners for resnet50/quasar PR 48253
  • remove flux1 performance tests from the glx perf tests workflow PR 48286
  • remove flux1 performance tests from the t3k tests workflow PR 48284
  • Bump ttsim version to v1.9.2 PR 48310

Documentation

  • indexer_score: MiniMax M3 GQA support (indexer_score_msa) PR 48205
  • sparse_sdpa_msa: add native GQA support for MSA prefill PR 48045

Other

  • #48155: plumb quasar through stack to allow models pytests PR 48158
  • ring_mla reduce_trigger determinism: two-phase handshake fix PR 48061
  • ci: add check-merge-conflict hook to pre-commit config PR 48237
  • Refactor TTNN comparison mode feature to work with ttnn.graph_report PR 45448