- [2025-05] We include LIMOPro, which refines reasoning for efficient and effective test-time scaling.
- [2025-05] We add more papers on adaptive reasoning, where a system or model autonomously switches between long and short reasoning chains based on problem complexity.
- [2025-05] Check out our latest paper "Scaling Reasoning, Losing Control", which shows that longer reasoning chains come with poorer instruction-following ability. Efficient reasoning may therefore also matter for instruction following in LRMs.
- [2025-04] We include AgentPrune, showing that efficient reasoning is also important for agent systems.
- [2025-04] We include benchmarks for Efficient Reasoning: MME-CoT, S1-Bench, DUMB500.
- [2025-04] We add Mamba reasoning models (e.g., M1) and hybrid models (e.g., Mamba-Transformer) to Efficient Reasoning during Pre-training; these architectures are naturally efficient at inference.
- [2025-04] We add a new "Model Merge" category to Efficient Reasoning during Inference, which looks like a promising direction.
- [2025-04] 📢 Our work is reported by both Synced (机器之心) and Zhuanzhi (专知).
- [2025-03] 📢 Our work is reported by both Deep Learning and NLP (深度学习自然语言处理) and Machine Learning and NLP (机器学习算法与自然语言处理).
- [2025-03] We released our survey "A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond". This is the first survey on efficient reasoning for Large Reasoning Models, covering language, multimodality, agents, and applications. We also outline several promising future directions.
- [2025-03] We created this repository to maintain the Awesome-Efficient-LRM-Reasoning paper list.
If you find our survey useful for your research, please consider citing:
    @article{qu2025survey,
      title={A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond},
      author={Qu, Xiaoye and Li, Yafu and Su, Zhaochen and Sun, Weigao and Yan, Jianhao and Liu, Dongrui and Cui, Ganqu and Liu, Daizong and Liang, Shuxian and He, Junxian and others},
      journal={arXiv preprint arXiv:2503.21614},
      year={2025}
    }
- Awesome-Efficient-LRM-Reasoning
In the age of LRMs, we propose that "Efficiency is the essence of intelligence." Just as a wise human knows when to stop thinking and start deciding, a wise model should know when to halt unnecessary deliberation. An intelligent model should manage the token economy, i.e., allocate tokens purposefully, skip redundancy, and optimize the path to a solution. Rather than naively traversing every possible reasoning path, it should emulate a master strategist, balancing cost and performance with elegant precision.
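As a minimal, hypothetical sketch of this "token economy" idea (not from any specific paper above): allocate a per-problem token budget and stop decoding as soon as the model signals it is done, rather than exhausting the full budget. Here `generate_step` and the toy model are stand-ins we invent for illustration; a real LRM's decoding loop would play their role.

```python
def budgeted_generate(generate_step, prompt, budget, stop_marker="</answer>"):
    """Decode at most `budget` tokens, halting early at `stop_marker`.

    `generate_step(prompt, tokens)` is assumed to return the next token
    (a string) given the prompt and the tokens produced so far.
    """
    tokens = []
    for _ in range(budget):
        tokens.append(generate_step(prompt, tokens))
        if stop_marker in "".join(tokens):
            break  # the model decided it is done: no wasted deliberation
    return "".join(tokens)

# Toy stand-in model: "thinks" for two tokens, then answers and stops.
def toy_model(prompt, tokens):
    script = ["think ", "think ", "42", "</answer>", "extra "]
    return script[len(tokens)]

print(budgeted_generate(toy_model, "2+40=?", budget=10))
# -> "think think 42</answer>" (the trailing "extra " token is never spent)
```

The same wrapper also caps runaway chains: with `budget` below the script length, generation simply truncates at the budget, which is the cost/performance trade-off the paragraph above describes.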
To summarize, this survey makes the following key contributions to the literature:
- Instead of offering a general overview of LRMs, we focus on the emerging and critical topic of efficient reasoning in LRMs, providing an in-depth and targeted analysis.
- We identify and characterize common patterns of reasoning inefficiency, and outline the current challenges that are unique to improving reasoning efficiency in large models.
- We provide a comprehensive review of recent advancements aimed at enhancing reasoning efficiency, structured across the end-to-end LRM development pipeline, from pretraining and supervised fine-tuning to reinforcement learning and inference.
- Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition
- Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting
- AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting
- Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens
- Dynamic Early Exit in Reasoning Models
- Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
- Reasoning Models Can Be Effective Without Thinking
- How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
- Chain of Draft: Thinking Faster by Writing Less
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
- s1: Simple test-time scaling
- Token-Budget-Aware LLM Reasoning
- Efficiently Serving LLM Reasoning Programs with Certaindex
- Make Every Penny Count: Difficulty-Adaptive Self-Consistency for Cost-Efficient Reasoning
- Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters
- Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost
- The Impact of Reasoning Step Length on Large Language Models
- The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models
- Guiding Language Model Reasoning with Planning Tokens
- DynamicMind: A Tri-Mode Thinking System for Large Language Models
- Fast-Slow-Thinking: Complex Task Solving with Large Language Models
- Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking
- Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
- Visual Agents as Fast and Slow Thinkers
- System-1.x: Learning to Balance Fast and Slow Planning with Language Models
- DynaThink: Fast or slow? A dynamic decision-making framework for large language models
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling
- Learning Adaptive Parallel Reasoning with Language Models
- SplitReason: Learning To Offload Reasoning
- SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
- MixLLM: Dynamic Routing in Mixed Large Language Models
- Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- RouteLLM: Learning to Route LLMs with Preference Data
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
- Speculative Decoding with Big Little Decoder
- Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
- Efficient Test-Time Scaling via Self-Calibration
- Scalable Best-of-N Selection for Large Language Models via Self-Certainty
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
- Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
- Fast Best-of-N Decoding via Speculative Rejection
- TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
- Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters
- DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models
- Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models
- Z1: Efficient Test-time Scaling with Code
- Self-Training Elicits Concise Reasoning in Large Language Models
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs
- Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models
- C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness
- Can Language Models Learn to Skip Steps?
- Distilling System 2 into System 1
- Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models
- From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
- LightThinker: Thinking Step-by-Step Compression
- Efficient Reasoning with Hidden Thinking
- Training Large Language Models to Reason in a Continuous Latent Space
- Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
- How Far Are We from Optimal Reasoning Efficiency?
- ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models
- When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
- Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
- Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning
- ARM: Adaptive Reasoning Model
- ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning
- HAWKEYE: Efficient Reasoning with Model Collaboration
- ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning
- Think When You Need: Self-Adaptive Chain-of-Thought Learning
- DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- Training Language Models to Reason Efficiently
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
- Kimi k1.5: Scaling Reinforcement Learning with LLMs
- Concise Reasoning via Reinforcement Learning
- Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
- Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
- LLM Pretraining with Continuous Concepts
- Scalable Language Models with Posterior Inference of Latent Thought Vectors
- Byte Latent Transformer: Patches Scale Better Than Tokens
- Large Concept Models: Language Modeling in a Sentence Representation Space
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
- LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- MoBA: Mixture of Block Attention for Long-Context LLMs
- MoM: Linear Sequence Modeling with Mixture-of-Memories
- Gated Delta Networks: Improving Mamba2 with Delta Rule
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
- Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention
- Gated Linear Attention Transformers with Hardware-Efficient Training
- Liger: Linearizing Large Language Models to Gated Recurrent Structures
- Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing
- LoLCATs: On Low-Rank Linearizing of Large Language Models
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
- Compositional Reasoning with Transformers, RNNs, and Chain of Thought
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
- Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners
- Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
- Fast-Slow Thinking for Large Vision-Language Model Reasoning
- Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
- Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
- Value-Guided Search for Efficient Chain-of-Thought Reasoning
- LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
- Efficient Test-Time Scaling via Self-Calibration
- Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling
- X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability
- Deliberative Alignment: Reasoning Enables Safer Language Models
- The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
- Chain-of-Retrieval Augmented Generation
- Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems
- THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
- S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
- DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
⭐ Join us in improving this repository! If you know of any important works we've missed, please contribute. Your efforts are highly valued!