Update MiniMax article: title change and Kimi Linear Attention update
- Change title to "Trouble in LA LA Land: Did MiniMax Just Kill Efficient Attention?"
- Add link to interleaved-thinking pattern explanation
- Add Oct 31 update section covering Kimi.ai's KLA release
- Discuss KLA architecture (3:1 FA ratio) and strong performance results
- Note the Streisand effect: MiniMax's post-mortem triggered accelerated activity
- Mention related releases (Brumby-14B, Higher-order Linear Attention, FLA kernel merges)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
site/posts/minimax-m2.md (+18 −2)
@@ -1,5 +1,5 @@
 ---
-title: "Did MiniMax Just Kill Efficient Attention? Not Quite"
+title: "Trouble in LA LA Land: Did MiniMax Just Kill Efficient Attention?"
 publication:
   status: published
   date: Oct 30, 2025
@@ -42,7 +42,7 @@ So MiniMax's decision makes sense for single-shot reasoning. They (and everyone
 
 ## The M2 Problem
 
-No, it doesn't. M2 follows the standard interleaved-thinking pattern: it emits `<think>...</think>` blocks, and these blocks accumulate over multi-turn conversations. Every bit of thinking history needs to be retained, and the [README](https://huggingface.co/MiniMaxAI/MiniMax-M2#inference-parameters) warns that removing them hurts performance. Thus every exchange adds more tokens, building up tens of thousands of reasoning tokens that must stay in memory.
+No, it doesn't. M2 follows the standard [interleaved-thinking pattern](https://huggingface.co/blog/MiniMax-AI/aligning-to-what#the-need-for-interleaved-thinking): it emits `<think>...</think>` blocks, and these blocks accumulate over multi-turn conversations. Every bit of thinking history needs to be retained, and the [README](https://huggingface.co/MiniMaxAI/MiniMax-M2#inference-parameters) warns that removing them hurts performance. Thus every exchange adds more tokens, building up tens of thousands of reasoning tokens that must stay in memory.
 
 Delethink can't help here because it only works for single reasoning chains, where you can truncate and carry forward a small state. But in multi-turn conversations, the thinking tokens from previous turns belong to conversation history. You can't delete all of them without negatively impacting performance.
 
@@ -69,3 +69,19 @@ So, where does that leave us? It's clear that we can't just proclaim that linear
 For single-shot reasoning, FA plus Delethink is pragmatic. Stable infrastructure, and the Markovian behavior is already there in pretrained models. That buys time for the efficient attention ecosystem to mature. At M2's scale, the infrastructure pain and model quality risks of hybrids outweigh the compute savings and speed benefits of efficient attention. FA with Delethink is the right call for the majority of single-shot reasoning workloads right now.
 
 For multi-turn interleaved thinking, hybrids with learned placement will eventually become essential. Context accumulates, Delethink can't reset it, and FA's quadratic cost will dominate. Optimized hybrids will win that regime, and that's where the field is heading. That's where my team is heading. See you in the efficient-attention trenches.
+
+---
+
+**Update Oct 31, 2025:** The day after I posted this, [Kimi.ai released](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct) [Kimi Linear Attention (KLA)](https://github.com/fla-org/flash-linear-attention/tree/main/fla/ops/kda), a new hybrid architecture that validates the core argument here in ways even I didn't expect, at least not so soon.
+
+KLA extends Gated DeltaNet with channel-wise gating (instead of scalar gating) and interleaves this efficient attention with full attention in a 3:1 ratio. That ratio matters: just like Jet Nemotron found that only 2-3 layers out of 36 need FA, Kimi found that roughly 25% FA is enough to maintain quality while cutting KV cache by 75% and delivering 6x decoding throughput at 1M context.
+
+The results are great. On synthetic tasks like MQAR and Stack, KLA significantly outperforms Mamba2 and beats Gated DeltaNet. On real reasoning tasks (AIME 2025, MATH500, LiveCodeBench), it matches or beats both MLA (DeepSeek's compressed full attention) and Gated DeltaNet-H after the same SFT recipe. Pretraining scaling laws show 1.16x better loss at the same compute budget.
+
+Two caveats: First, this is a 3B-activated, 48B-total parameter model. M2 is 230B total, so roughly 5x larger. We still don't know what happens at that scale, but the trend is promising. Second, Kimi uses a fixed 3:1 ratio rather than learned placement, so we don't know if that's optimal or just good enough.
+
+But here's what matters: within days of MiniMax saying "efficient attention has some way to go," another major industrial lab shipped a production-grade hybrid that beats full attention on the metrics that matter.
+
+The confusion MiniMax's post-mortem created also triggered a flurry of activity. Everyone working on efficient attention saw an opening and pushed out their work. [Brumby-14B](https://manifestai.com/articles/release-brumby-14b/) (an attention-free model converted from Qwen), [Higher-order Linear Attention](https://arxiv.org/abs/2507.04239), and several others all dropped within days. The Flash Linear Attention community [merged Kimi's optimized KDA kernel](https://github.com/fla-org/flash-linear-attention/pull/621) within hours. There's now active debate about which approach wins where.
+
+What looked like a setback for efficient attention turned into its Streisand moment. MiniMax forced everyone to clarify their positions, tighten their arguments, and ship their code. The infrastructure is maturing faster than anyone expected. The timeline just got a lot shorter.
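
The interleaved-thinking problem discussed in the diff comes down to retained `<think>...</think>` blocks compounding across turns. A minimal sketch of that accumulation, assuming a generic chat-message structure and a crude character-based token counter (neither is MiniMax's actual API):

```python
# Illustrative sketch: interleaved thinking accumulating across turns.
# The message format and count_tokens helper are assumptions, not MiniMax's API.

history = []  # full conversation history, re-sent to the model every turn


def add_turn(user_msg: str, assistant_reply: str) -> None:
    """Record one exchange; the <think>...</think> block is kept verbatim."""
    history.append({"role": "user", "content": user_msg})
    # Per the M2 README guidance, earlier thinking blocks are not stripped,
    # so the whole reply (thinking + answer) stays in history.
    history.append({"role": "assistant", "content": assistant_reply})


def count_tokens(messages) -> int:
    # crude stand-in for a real tokenizer: roughly 4 characters per token
    return sum(len(m["content"]) for m in messages) // 4


add_turn("Refactor this function.",
         "<think>" + "step-by-step reasoning " * 2000 + "</think>Here is the refactor.")
add_turn("Now add tests.",
         "<think>" + "more reasoning " * 2000 + "</think>Added the tests.")

print(count_tokens(history))  # grows every exchange; nothing is ever pruned
```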
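Delethink's single-chain workaround is the "truncate and carry forward a small state" loop mentioned above. A rough sketch of that idea follows; the chunk size, carryover length, stop marker, and `generate` callable are all illustrative assumptions, not the paper's exact algorithm. Note that this loop only manages one reasoning chain: it cannot touch the per-turn history from the previous sketch, which is why it doesn't help multi-turn workloads.

```python
# Illustrative sketch: bounded-chunk reasoning with a small carried-forward state.

def chunked_reasoning(prompt, generate, chunk_chars=20_000, carry_chars=2_000, max_chunks=16):
    """Reason in bounded chunks, keeping only a short tail of the previous chunk."""
    carry = ""
    for _ in range(max_chunks):
        chunk = generate(prompt + "\n" + carry, max_chars=chunk_chars)
        if "FINAL ANSWER:" in chunk:          # chain finished (illustrative convention)
            return chunk.split("FINAL ANSWER:", 1)[1].strip()
        carry = chunk[-carry_chars:]          # small state; everything else is dropped
    return None  # gave up


# Toy stand-in for a model so the sketch runs end to end.
def toy_generate(context, max_chars):
    return "...thinking... FINAL ANSWER: 42" if "thinking" in context else "...thinking..."


print(chunked_reasoning("What is 6 * 7?", toy_generate))  # 42
```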
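The 3:1 interleaving in the update is easiest to picture as a fixed layer schedule. A minimal sketch, assuming an illustrative layer count and placement rule rather than Kimi's exact configuration; only the full-attention layers keep a context-length-dependent KV cache, which is where the roughly 75% cache reduction at this ratio comes from:

```python
# Illustrative sketch: a fixed 3:1 efficient-attention to full-attention schedule.
# Layer count and placement rule are assumptions, not Kimi's published configuration.

NUM_LAYERS = 48
FA_EVERY = 4  # one full-attention layer per block of four

layer_types = [
    "full_attention" if (i + 1) % FA_EVERY == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

print(layer_types[:8])
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention', ...]

# Only the full-attention layers accumulate a KV cache that grows with context,
# so this schedule keeps roughly a quarter of the usual cache footprint.
fa_fraction = layer_types.count("full_attention") / NUM_LAYERS
print(f"full-attention fraction: {fa_fraction:.2f}")  # 0.25
```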