Autonomous Agents

Autonomous Agents-research papers. Updated daily. See as well the Resources-section.

Research papers

Chronological order.

18th June 2025

SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

SwarmAgentic: introduces a framework for fully automated agentic system generation, using Particle Swarm Optimization to explore a language-driven design space, optimizing Agentic Systems composed of an Agent Set and Collaborative Structure.
The framework iteratively refines Agentic Systems by updating Particle positions and velocities based on Fitness Function evaluation and Flaw Identification.
Velocity updates integrate Failure-Driven Adjustments, Personal Best Guidance, and Global Best Guidance to refine Agent functionality and collaboration strategies, yielding the best system as the Search Result.

Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents

LPA (LLM-based Planning Agent): introduces a system for managing complex failure analysis workflows, with Agent Core orchestrating control flow, Memory retaining information, Plan Generation creating step-by-step plans, Action Matching and Execution selecting and running tools, Feedback and Reflection adjusting plans based on results, LLM processing language and reasoning, Tools providing external system interfaces, and Data serving as external information sources.
The agent utilizes LLMs as the "brain" to decompose complex queries and resolve them through reasoning and autonomous tool use, employing ReAct or Online Replanning approaches.
The system integrates external tools like databases, search engines, and AI models to retrieve data and perform analysis tasks, supporting FA engineers.

PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection

PhishDebate: introduces a modular multi-agent LLM-based debate framework for phishing website detection, with URL Analyst Agent, HTML Structure Agent, Content Semantic Agent, Brand Impersonation Agent, Moderator, and Judge components.
The framework employs specialized agents to analyze different website aspects and coordination agents to manage a structured debate process.
This multi-agent approach aims to improve detection accuracy, interpretability, and robustness compared to single-agent methods.

The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games

Approach: introduces a framework for constructing natural language state representations for prompting LLM agents in repeated multi-agent games, implemented with LLM Agents, Game Environment, Prompting Mechanism, State Representation, LangChain, and OpenAI API.
The system evaluates LLM agent behavior in a dynamic selfish routing game by varying state representations along action informativeness, reward informativeness, and prompting style axes.
The research finds that summarized state representations, regret-based feedback, and limited information about others' actions lead to more stable, equilibrium-like agent behavior.

Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents

LPA (LLM-based Planning Agent): introduces an agent architecture for failure analysis workflows, integrating a Large Language Model for reasoning and planning, Memory for retaining information, Action Matching and Execution for tool use, and Feedback and Reflection for plan refinement, interacting with a User and the external Environment via Data and Tools.
The agent utilizes ReAct-style iterative task generation or online replanning to process complex queries and generate human-readable responses.
The implementation integrates external tools like databases and ML models, demonstrating technical feasibility and robustness in a production-like environment.

AGENTGROUPCHAT-V2 : Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

AGENTGROUPCHAT-V2: introduces a novel framework with Query Manager (Frontend, query decomposition), Task Manager (Central coordination, task flow), Group Manager (Execution, collaboration organization), Agent (Individual LLM participant), Task (Basic processing unit), Group (Collaborative work unit), and Task Forest (Hierarchical task structure) for LLM-based multi-agent systems.
The framework employs a divide-and-conquer parallel architecture, dynamic task tree decomposition, and specialized agent role assignment to address challenges in system architecture, generalizability, and performance.
Experimental results demonstrate superior performance on complex reasoning, code generation, and diverse tasks compared to existing multi-agent approaches.

RAS-EVAL: A COMPREHENSIVE BENCHMARK FOR SECURITY EVALUATION OF LLM AGENTS IN REAL-WORLD ENVIRONMENTS

RAS-Eval: introduces a comprehensive security benchmark for LLM agents, including Test Cases, Attack Tasks, Scenarios, Toolkits, Risk Management, and Evaluation Pipelines.
The benchmark supports both Real Execution and Simulated Execution of tools across JSON, LangGraph, and MCP formats.
It incorporates Failure Modes and Vulnerability Types for granular analysis and uses Evaluation Pipelines to measure task completion and attack success rates.

From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem

Agent Framework: introduces a survey of jailbreak attacks and defenses in the LLM ecosystem, with Core (Central processing unit), Planning (Task decomposition/logic), Tools (External interfaces/applications), Memory (Information management/storage), and LLM Network (Multi-agent interaction) components, where the paper reviews the evolution from LLMs to MLLMs and Agents and analyzes security challenges.
The survey categorizes jailbreak techniques by attack impact and visibility and defense strategies by response timing and technical approach.
It also details datasets and evaluation metrics used in jailbreak research and outlines future research directions.

Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs

Multi-Response Generation (MRG) and Preference-based Selection (PS): introduces a two-stage framework for open-domain dialogue response generation, where MRG generates a set of diverse responses and PS selects the best one based on human preference.
The approach leverages smaller LLMs and introduces the o2mDial dataset to explicitly capture the one-to-many property.
Empirical results show the framework enhances response diversity and quality in smaller LLMs, approaching the performance of larger models.

HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

LLM-based Embodied Agent Pipeline: studies hallucinations in embodied agents by evaluating a pipeline that takes Scene (Visual input) and Task Description (Natural language instruction), processes them via a Scene Parser (Extracts scene info) and LLM as Goal Interpreter (Generates symbolic goals) to produce LTL Goal (Symbolic task goals) for Execute the Task (Action planning/execution).
The study constructs a hallucination probing set by systematically modifying the Task Description and Scene Information inputs to introduce scene-task inconsistencies.
The research finds that LLMs struggle to reconcile scene-task inconsistencies, leading to hallucinations and failures in handling infeasible tasks.

OS-HARM: A Benchmark for Measuring Safety of Computer Use Agents

OS-HARM (Benchmark): introduces a benchmark for measuring the safety of computer use agents, featuring OS-HARM tasks, OSWorld Ubuntu VM, LLM Agent, OSWorld scaffolding, Agent Traces, and LLM Judge.
The benchmark evaluates LLM-based agents on tasks involving deliberate user misuse, prompt injection attacks, and model misbehavior within a realistic OSWorld environment.
An automated LLM Judge evaluates agent performance and safety based on recorded execution traces, including reasoning steps, screenshots, and accessibility trees.

LLM Agent for Hyper-Parameter Optimization

LLM Agent Framework and MCP: introduces an interactive framework orchestrating collaboration between the LLM Agent (comprising Profile, Memory, Planning, and Action components), Human inputs, and the Environment (WS-PSO-CM algorithm) for automatic hyper-parameter tuning.
The Model Context Protocol (MCP) defines a unified communication specification enabling the LLM Agent to interact with external systems via MCP Client and MCP Server architecture.
The framework iteratively refines hyper-parameters for the WS-PSO-CM algorithm based on prompt requirements and environmental feedback to optimize UAV trajectory and communication.

17th June 2025

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.

Gemini 2.X model family: introduces Gemini 2.5 Pro and Flash, built on Sparse mixture-of-experts transformers (Core Architecture) with Multimodal Support (Text, Image, Video, Audio), Long Context Processing (>1M tokens), Tool Use Support (Function calls), Thinking (Inference process), and Deep Think (Reasoning approach).
The models enable next-generation agentic capabilities, demonstrated by the Gemini Plays Pokémon agent harness which includes components like Persistent Memory & Context, Goals, Action History & Summaries, Game State, Periodic Processes (Memory Summarizer, Guidance Gemini), Agentic Tools (Pathfinder, Boulder Puzzle Strategist), and Game I/O.
Gemini 2.5 models achieve state-of-the-art performance on various benchmarks, including long-context video understanding (processing up to 3 hours of video) and coding, while also undergoing extensive safety and security evaluations.

AGENTDISTILL: TRAINING-FREE AGENT DISTILLATION WITH GENERALIZABLE MCP BOXES

AgentDistill: introduces a training-free agent distillation framework, with Teacher Agent (Generates MCPs), Manager Agent (Teacher) (Coordinates tasks), Basic Image Captioner (Teacher) (Captions images), MCP Creation Module (Creates task MCPs), MCP-Box Construction (Builds MCP Box), Abstraction (Parameterizes MCPs), Clustering (Groups MCPs), Consolidation (Merges MCPs), MCP Box (Reusable task modules), Student Agent (Uses MCP Box), Manager Agent (Student) (Coordinates tasks, uses MCP Box), and Basic Image Captioner (Student) (Captions images), which transfers task-solving capabilities from large teacher agents to small student agents via reusable Model-Context-Protocols (MCPs).
The framework involves a teacher agent generating MCPs, a construction process to build a reusable MCP-Box by abstracting, clustering, and consolidating them, and a student agent that directly integrates this MCP-Box for inference.
AgentDistill enables student agents to inherit sophisticated problem-solving skills and generalize across tasks by providing a structured MCP-Box without requiring additional training or trajectory replay.

Unified Software Engineering agent as AI Software Engineer

USEagent (Unified Software Engineering agent): introduces a unified agent for software engineering tasks, with Meta-Agent orchestrates actions, Actions perform SE tasks, Task State stores shared information, and Program is the target software project.
The Meta-Agent uses a ReAct-style loop to select actions based on the current task state and action outputs.
The framework utilizes a set of modular actions encapsulating units of work and a structured task state for consensus memory among actions.

Doppelgänger Method : Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack

Doppelgänger Method: introduces a prompt-based transferable adversarial attack method to break LLM agent consistency, evaluated using the PACAT Level metric, and countered by the CAT Prompt defense.
The method demonstrates the risk of role hijacking and internal information exposure in LLM agents.
Experimental results show the attack's effectiveness and the defense prompt's ability to mitigate consistency degradation.

Automated Decision-Making on Networks with LLMs through Knowledge-Guided Evolution

LLMNet: introduces a system for automated GNN design using LLM-based agents, including Knowledge Agent (Builds, manages knowledge bases), Prior Knowledge Base (Stores task-specific knowledge), Experiment Knowledge Base (Stores experimental results), Planning Agent (Generates task plan, evaluates), Data Agent (Performs feature engineering), Configuration Agent (Configures search space), and Evaluation Agent (Fine-tunes, experiments), which leverages knowledge bases and RAG for knowledge-guided evolution.
The system employs a pipeline of specialized agents that interact with constructed knowledge bases to design and refine GNN model architectures step by step.
LLMNet demonstrates superior performance across various graph learning tasks by effectively integrating graph-related knowledge into the automated design process.

GENERATIONPROGRAMS: Fine-grained Attribution with Executable Programs

GENERATIONPROGRAMS: introduces a modular generation framework that decomposes the process into program generation by an LLM and program execution by neural modules, producing an output with sentence-level attributions from input documents.
The framework first generates an executable program plan composed of modular text operations tailored to the query, then executes this plan using neural modules like paraphrasing, compression, fusion, and extraction on retrieved document sentences.
This two-stage approach enables fine-grained attribution by tracing the program execution and linking generated content back to source sentences, enhancing interpretability and verifiability.

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

SIRI-Bench (Spatial Intelligence ReasonIng Benchmark): introduces a benchmark for evaluating VLMs' spatial intelligence using video-based 3D geometry problems, generated by an Automatic Scene Creation Engine leveraging Specialized LLM Agents to transform Original Math Problems into Realistic 3D Scenes and Video inputs for VLMs, alongside textual Questions and numerical Answers.
The Automatic Scene Creation Engine generates the benchmark data by solving geometric conditions, generating Blender Python Scripts, and refining textual inputs and outputs.
SIRI-Bench challenges VLMs to extract spatial information from video and perform complex reasoning, revealing limitations in current models compared to human performance and text-based LLMs.

LLM-Powered Swarms: A New Frontier or a Conceptual Stretch?

LLM-Powered Swarms: introduces a new paradigm for swarm intelligence using Large Language Models as agents, featuring LLM Agents (Large Language Models), Multi-Agent Coordination (Interconnected agents collaborate), Client-Side Operation (Framework runs locally), LLM Access (Cloud or local models), and Prompts (Natural language instructions).
This approach contrasts with traditional rule-based swarms by trading execution speed for flexibility and higher-level reasoning capabilities.
Evaluation using Boids and ACO models highlights significant latency and resource costs compared to classical methods, suggesting potential for hybrid systems.

Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent

ECPO (Expectation Confirmation Preference Optimization): introduces a novel multi-turn preference optimization paradigm leveraging Expectation Confirmation Theory to align LLM-based conversational recommendation agents with user expectations.
The framework explicitly models user satisfaction evolution across turns using Forward Expectation Confirmation and rewrites unsatisfactory responses via Backward Expectation Derivation with a Rewriter.
ECPO is supported by AILO, an LLM-based user simulator that provides realistic feedback and performs expectation confirmation, enabling efficient turn-level preference optimization without extensive sampling.

ADRD: LLM-DRIVEN AUTONOMOUS DRIVING BASED ON RULE-BASED DECISION SYSTEMS

ADRD (LLM-Driven Autonomous Driving Based on Rule-based Decision Systems): introduces a framework with Information Module, Agents Module (Planner, Coder, Summarizer), and Testing Module, leveraging LLMs to generate and refine rule-based decision trees for autonomous driving.
The Information Module gathers scenario data, the Agents Module generates and codes driving tactics, and the Testing Module provides feedback for iterative refinement.
ADRD demonstrates superior performance, response speed, and interpretability compared to baselines by integrating LLMs with rule-based decision systems.

From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents

TIMER: introduces timely dialogue response generation, with Time Interval Prediction (predicts delay), Time-conditioned Response Generation (generates response), and Fine-tuned Dialogue Model (base language model), addressing when and what to respond based on temporal context.
The model is trained using a multi-task learning objective on a large-scale synthetic dataset derived from event knowledge graphs and LLMs.
TIMER demonstrates improved performance over baselines in predicting appropriate response delays and generating time-specific, coherent dialogue.

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

AgentSynth: introduces a scalable pipeline for synthesizing computer-use tasks and trajectories by iteratively chaining LLM-generated subtasks, executed by a Task Executor, verified by a Task Verifier, revised by a Task Reviser, proposed as follow-ups by a Follow-up Task Proposer, and summarized into final tasks by a Task Summarizer, operating within an Environment guided by a Persona.
The pipeline leverages information asymmetry, generating simple subtasks that compose into challenging long-horizon tasks, enabling controllable difficulty.
AgentSynth generates over 6,000 diverse and realistic tasks at a low cost, providing a benchmark that reveals performance gaps in current LLM agents on multi-step computer tasks.

MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment

MAS-LitEval: introduces a multi-agent system for literary translation quality assessment, with Terminology Consistency Agent (Ensures key term consistency), Narrative Perspective Consistency Agent (Verifies narrative voice alignment), Stylistic Consistency Agent (Evaluates tone rhythm style), and Coordinator (Combines agent scores feedback).
The system employs specialized LLMs within agents to evaluate distinct dimensions of literary translation quality across segmented text chunks.
The Coordinator integrates agent evaluations into an Overall Translation Quality Score (OTQS) and a detailed report, ensuring global consistency.

FormGym: Doing Paperwork with Agents

Agent Framework with FieldFinder: introduces a system for end-to-end form completion using agents equipped with tools, including a novel field localization tool.
The system evaluates Vision-Language and GUI agents on the FormGym benchmark, which includes diverse forms, user profiles, and tasks.
The FieldFinder tool assists agents by predicting bounding boxes for input fields, significantly improving text placement accuracy.

Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

CVDP (Comprehensive Verilog Design Problems): introduces a benchmark dataset and infrastructure, with Datapoint, Prompt, Context, Reference Solution, Test Harness, Testbench, Benchmark Runner, Agent Under Test, Model Under Test, Mini Repo, EDA Tools, Docker, LLM Judge, Map Feature, and Report & Logs components, designed to evaluate large language models and agents on hardware design and verification tasks.
The benchmark includes 783 human-authored problems across 13 categories covering RTL generation, verification, debugging, and comprehension, provided in both Non-Agentic (single-turn) and Agentic (multi-turn, tool-using) formats.
The infrastructure supports Dockerized agents and test harnesses for realistic tool interaction using EDA tools, and includes an LLM judge for quality filtering of datapoints.

16th June 2025

LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning

LocationReasoner benchmark: introduces a benchmark to evaluate LLMs' real-world reasoning abilities in site selection, with Query Generation, Sandbox Environment, Datasets, In-house Tools, Execution Pathways, and Automated Verification components, evaluating Direct Code Generation, ReAct, and Reflexion approaches.
The benchmark uses curated datasets and in-house tools within a sandbox environment to test LLMs on constraint-based location search with automated verification.
Evaluation reveals current LLMs and agentic strategies struggle with complex real-world reasoning tasks, highlighting limitations in holistic and non-linear reasoning.

Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning (HRL): introduces an overview of methods for discovering temporal structure, formalized using the options framework including option policy, option termination function, option initiation function, high-level policy, and option model function, and discusses agent architectures like Hierarchical Components, Goal Conditioned, Feudal Architecture, and Single Network.
The paper surveys methods for temporal structure discovery categorized by learning from online experience, offline datasets, and foundation models.
HRL aims to improve exploration, credit assignment, transfer, and interpretability by leveraging temporal structure in sequential decision-making problems.

How Does LLM Reasoning Work for Code? A Survey and a Call to Action

Code Reasoning Taxonomy: introduces a classification of techniques for LLM reasoning on code tasks, including Code CoT Reasoning, Execution-based reasoning, Inference Scaling, and Agentic approaches.
The taxonomy details sub-techniques such as Plan-based CoT, Self-evaluation of execution behavior, Sampling, and Agentic Workflow.
The survey highlights how these distinct reasoning strategies and their components are applied and perform on various code-related benchmarks.

Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems

Spec2RTL-Agent: introduces an LLM-based multi-agent system for automated RTL code generation from complex specifications, including Iterative Understanding and Reasoning Module, Progressive Coding and Prompt Optimization Module, Adaptive Reflection Module, and Code Optimization and Conversion Module.
The system processes unstructured specification documents, refines code generation through multiple abstraction levels, and iteratively verifies outputs.
Spec2RTL-Agent demonstrates effectiveness in generating accurate RTL code with reduced human intervention compared to existing methods.

We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems

SAFEMCP: introduces a controlled framework to examine safety issues in MCP-powered agent systems, with Agent, Backbone LLM, MCP-Servers, Attack, Defense, Passive Defense, Active Defense, Evaluation, Scenario, and Metric components.
The framework simulates third-party attacks on LLM agents interacting with external services via the Model Context Protocol (MCP).
SAFEMCP provides tools for evaluating attack effectiveness and defense strategies using various scenarios and metrics.

CAMS: A CityGPT-Powered Agentic Framework for Urban Human Mobility Simulation

CAMS: introduces a CityGPT-powered agentic framework for urban human mobility simulation, with MobExtractor (Extracts/synthesizes mobility patterns), GeoGenerator (Generates geospatial knowledge/trajectories), and TrajEnhancer (Enhances trajectories via DPO) components.
The framework leverages an urban foundation model (CityGPT) and agentic reasoning to generate realistic and plausible human mobility trajectories.
CAMS integrates urban spatial knowledge and multi-dimensional feedback for controllable and generalizable simulation without relying on external geospatial information.

--

Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning

LA-CDM: introduces a hypothesis-driven uncertainty-aware language agent system for clinical decision making, comprising a Hypothesis Agent (forms hypothesis and confidence), a Decision Agent (decides action), Shared LLM Weights (underlying language model), and a Clinical Decision Making Environment (simulates patient interaction and provides feedback).
The system is trained using a hybrid paradigm combining supervised learning for hypothesis generation and reinforcement learning for uncertainty estimation and efficient action selection.
This approach models the iterative clinical process of forming hypotheses and requesting tests to converge towards a diagnosis, improving diagnostic performance and efficiency.

Towards Pervasive Distributed Agentic Generative Al - A State of The Art

Pervasive Distributed Agentic Generative AI: surveys the state of the art in LLM-based agents deployed in pervasive computing environments, detailing the Transformer Architecture, Short-Term Memory, Long-Term Memory, Hybrid Memory, Cloud Layer, Fog Layer, and Edge Layer components.
The paper examines the architecture of LLM agents, their deployment strategies across different infrastructure layers, and evaluation methods.
It highlights challenges in deploying these agents on resource-constrained pervasive devices and proposes the "Agent as a Tool" concept.

A Game-Theoretic Negotiation Framework for Cross-Cultural Consensus in LLMS

Game-Theoretic Negotiation Framework: introduces a systematic approach for cross-cultural consensus among LLMs, with Cultural Agents, Guideline Sets, Guideline Weights, Utility Functions, Negotiation Process (PSRO-based), Meta Strategy Solver, Best Response Oracle, Regional Cultural Agents, and Consensus Evaluation Toolkit, designed to achieve fair and robust agreement.
The framework models consensus as a Nash Equilibrium and employs a PSRO-based negotiation process driven by utility functions balancing consistency, acceptance, and novelty.
Culturally aligned Regional Cultural Agents are constructed using survey data, and consensus outcomes are evaluated using perplexity-based acceptance and value self-consistency metrics.

Querying Large Automotive Software Models: Agentic vs. Direct LLM Approaches

Direct Full-Context Prompting: introduces a baseline approach where the LLM (Processes model) receives the entire Software Model File (Complete input data) along with Instructions (Guidance for LLM) and a Question (User query) to produce an Answer (LLM response).
Agent with File Tools (ReAct Architecture): presents an agentic approach where the LLM (Agent's reasoning engine) interacts with Software Model Files (Data source) via a Toolkit (External tool access) containing specific Tools (File interaction functions), communicating through Messages (Communication channel) and append observation (Tool output) to answer a User (Initiates query) Question (User query) with an Answer (Agent's response).
The study compares these two architectures for querying large automotive software models, evaluating their accuracy and token efficiency using various LLMs and a custom question dataset.

Leveraging In-Context Learning for Language Model Agents

ICL-DS (In-Context Learning with Demonstration Selection): introduces an approach for LLM agents that leverages in-context learning with dynamically selected demonstrations, including an LLM Agent (generates thoughts and actions), a Demonstration Pool (stores annotated trajectories and snippets), an Iterative Annotation Algorithm (automatically annotates tasks for demonstrations), a Demonstration Selector (retrieves relevant demonstrations), Prompt Construction (formats input for LLM), a ReAct Solver (executes tasks iteratively with reasoning), a Plan & Execute (PnE) Solver (plans subtasks and executes them), and an Environment (provides observations and executes actions).
The paper proposes an iterative annotation algorithm to automatically and efficiently create a demonstration pool of solution trajectories for agentic tasks, which are then used to improve LLM agent performance, reliability, and efficiency.
The research demonstrates that using task-level trajectory demonstrations and smaller step-level snippet demonstrations significantly boosts performance for LLM agents, enabling them to rival costlier trained agents.

MOTIVEBENCH: How Far Are We From Human-Like Motivational Reasoning in Large Language Models?

MOTIVEBENCH: a comprehensive evaluation benchmark designed to assess the extent to which Large Language Models (LLMs) can replicate human-like motivations and behaviors, consisting of 200 rich contextual scenarios and 600 reasoning tasks covering multiple levels of motivation.
The benchmark addresses limitations of existing datasets by providing detailed scenarios, character profiles, and reasoning tasks that mimic real-world situations, thereby enabling a more accurate evaluation of LLMs' motivational intelligence.
MOTIVEBENCH aims to provide insights into LLMs' capabilities in understanding and exhibiting human-like motivational reasoning, highlighting areas where current models fall short and suggesting directions for future research in humanizing LLMs.

MAGIC: Multi-Agent Argumentation and Grammar Integrated Critiquer

MAGIC (Multi-Agent Argumentation and Grammar Integrated Critiquer): is a framework that utilizes multiple specialized agents to evaluate distinct writing aspects, aiming to predict holistic scores and produce detailed, rubric-aligned feedback for essays.
The framework employs an orchestrator to consolidate the outputs from individual agents, which focus on specific components of argumentative writing such as argument structure, grammar, vocabulary, and comprehension.
MAGIC aims to provide greater transparency, flexibility, and extensibility compared to monolithic automated essay scoring and feedback systems.

Scaling Test-time Compute for LLM Agents

ATTS (Agentic Test-Time Scaling): explores test-time scaling strategies for language agents, including parallel sampling, sequential revision, verifiers and merging, and diversifying rollouts.
The research systematically analyzes the impact of different design strategies on agent performance, finding that scaling test-time compute improves agent capabilities.
Key findings include the importance of knowing when to reflect, the superiority of list-wise methods for verification and merging, and the positive effect of diversified rollouts on agent performance.

15th June 2025

WEREWOLF-PLUS: AN UPDATE OF WEREWOLF GAME SETTING BASED ON DSGBENCH

WereWolf-Plus: introduces a multi-model, multi-dimensional, and multi-method benchmarking platform with Werewolf Simulation (Rule-compliant environment), LLM Agents (Flexible model assignment), Role Configuration (Customizable roles), Reasoning Enhancement (Experience-Retrieval Augmentation), and Evaluation Framework (Metrics for agents) for evaluating multi-agent strategic reasoning in the Werewolf game.
The platform provides a flexible and reliable environment supporting standard and customizable game setups with various roles and flexible LLM-role assignment.
WereWolf-Plus incorporates retrieval-augmented memory for contextual compression and reflection, and introduces comprehensive quantitative evaluation metrics for different roles and players.

Mastering Da Vinci Code: A Comparative Study of Transformer, LLM, and PPO-based Agents

Transformer-based Baseline Model: introduces, with Transformer architecture (Predicts opponent tiles), Input Representation (Current game state string), and Prediction Task (Predict hidden tile values) components, a model designed to predict opponent tiles based on the current public game state.
This baseline model utilizes a Transformer architecture but is limited by its restricted access to the full game history.
Its reasoning is primarily based on the current state snapshot and explicit negative constraints from prior guesses.

SoK: The Privacy Paradox of Large Language Models: Advancements, Privacy Risks, and Mitigation

Systematization of Knowledge (SoK) on LLM Privacy: introduces a comprehensive analysis of privacy challenges in LLMs, categorizing them across training data, user prompts, generated outputs, and LLM agents, with all LLM (Large Language Model), LLM System, Training Data, User Prompts, LLM Generated Output, LLM Agent System, Main LLM Agent, Secondary Agents, External Tools, Knowledge Base, Memory, User, Service Provider-components, where the paper evaluates existing mitigation techniques and identifies research gaps.
The paper highlights how LLMs' advanced capabilities and interactive nature introduce distinct privacy risks compared to traditional AI.
It discusses various mitigation strategies for each category of privacy challenge, noting their effectiveness and limitations.

SCISAGE: A MULTI-AGENT FRAMEWORK FOR HIGH-QUALITY SCIENTIFIC SURVEY GENERATION

SciSage (Scientific Sage): introduces a multi-agent framework with Interpreter (Understand/rewrite query), Organizer (Construct outline), Collector (Retrieve/rerank papers), Composer (Generate content), Refiner (Refine final document), and Reflector (Iterative hierarchical reflection) agents for high-quality scientific survey generation.
The framework employs a reflect-when-you-write paradigm, with the Reflector agent critically evaluating drafts at multiple levels.
SciSage coordinates specialized agents through query understanding, retrieval, content generation, and iterative hierarchical reflection processes.

14th June 2025

Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics

Synthetic Socratic Debates: introduces a system using Agent (LLM with persona), Persona (6-dimensional identity profile), Moderator (Manages debate turns), Multi-Turn Debate Framework (Simulates agent interactions), Persona Modeling (Assigns identity attributes), Decision Measures (Quantify moral judgments), Persuasion Measures (Evaluate debate effectiveness), Rhetorical Strategy Evaluation (Assesses persuasion modes), and LLM-as-a-judge (Evaluates rhetorical strategies) to simulate moral debates between AI agents with distinct personas.
The system investigates how persona traits influence moral decision-making and persuasive strategies in LLMs during multi-turn debates over real-world moral dilemmas.
The research reveals that political ideology and personality traits significantly shape initial moral stances and debate outcomes, impacting persuasive success and rhetorical strategies.

Towards Building General Purpose Embedding Models for Industry 4.0 Agents

Recommender Agent (ReAct Agent with Multi-Task Embedder Tool): introduces a framework for industrial asset maintenance agents, combining a ReAct Agent (Plans and reasons), a Multi-Task Embedder (Retrieves relevant items) used as a tool, and LLM Augmentation (Enriches queries) for improved context.
The framework leverages domain-specific embeddings fine-tuned on nine industrial tasks derived from ISO documents to enhance retrieval performance for complex queries.
Ablation studies demonstrate the effectiveness of LLM query augmentation and the importance of balanced positive/negative samples for training the embedding model.

The Rise of AI Companions: How Human-Chatbot Relationships Influence Well-Being

Study Approach: investigates how human-chatbot relationships influence well-being by collecting Survey Data and Chat History Data, deriving Chatbot Companionship Measures, Interaction Intensity Measure, Self-Disclosure Measures, Human Social Support Measure, and Well-Being Measure, and analyzing them using LLM-based Text Analysis, Topic Modeling, Regression Analysis, and CFA.
The study uses a mixed-methods approach, triangulating self-report surveys, open-ended descriptions, and chat transcripts to understand chatbot companionship and its psychological associations.
Findings suggest that companionship-oriented chatbot use, especially with high intensity and self-disclosure, is associated with lower well-being, particularly for users with limited offline social support.

AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving

AgentOrchestra: introduces a hierarchical multi-agent framework with a Planning Agent (Central orchestrator) coordinating Specialized Sub-Agents (Domain-specific processing team), utilizing various tools, memory, and models for general-purpose task solving.
The framework features a two-tier architecture where the planning agent decomposes tasks and delegates sub-tasks to specialized agents equipped with domain-specific tools.
AgentOrchestra supports flexible orchestration, inter-agent communication, and adaptive role allocation, enabling robust performance on complex, multimodal tasks.

Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare

TAO (Tiered Agentic Oversight): introduces a hierarchical multi-agent framework for AI safety in healthcare, featuring an Agent Recruiter (Recruits expert agents), Agent Router (Routes query to tier), Tier 1 (Initial assessment/screening), Tier 2 (Specialized review/analysis), Tier 3 (Expert consultation/synthesis) of Medical Agents (Core assessment units), Case Escalation (Escalates to higher tiers), Intra-Tier Collaboration (Discussion within tier), Inter-Tier Collaboration (Dialogue between tiers), Final Decision Agent (Synthesizes final decision), and Human Oversight (Targeted human intervention).
The framework routes tasks based on complexity and agent roles, escalating complex or high-risk cases through tiers with automated inter- and intra-tier collaboration and role-playing.
TAO enhances AI safety through layered, automated supervision, demonstrating superior performance on healthcare safety benchmarks compared to single-agent and multi-agent baselines.

Topology-Assisted Spatio-Temporal Pattern Disentangling for Scalable MARL in Large-scale Autonomous Traffic Control

TGN-TMoE: introduces a novel MARL framework for large-scale traffic control, with Agent Observations (Raw graph data), MF Synchronization (Integrates mean-field information), Temporal Learning (Processes temporal features), Topological Processing (Extracts topological features), Spatial Learning (Processes spatial features), TMoE Module (Routes features to experts), Graph Pooling (Aggregates graph features), and Decision Making Module (Policy and value networks), designed to enhance environmental representation and agent coordination.
The framework integrates Dynamic Graph Neural Networks and Topological Data Analysis, employing a TSD-enhanced Mixture of Experts architecture for scalable multi-agent reinforcement learning.
TGN-TMoE leverages topological signatures to disentangle graph features for specialized processing within observation fusion and decision-making modules.

Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM

MAoP (Multiple Aspects of Planning): introduces a travel planning framework with a Strategist Model (Decomposes, routes aspects), Planning Model (Generates plan), Preprocessing Framework (Prepares input context including eliciting preferences, selecting POIs, and spatial optimization), Reward Model (Provides training signal), and Distilled Model (One-step inference).
The framework leverages a strategist for pre-planning and a planning model for generating travel itineraries based on wide-horizon thinking over multiple aspects.
A separate agent-based simulation framework, Travel-Sim, is proposed for evaluating the feasibility and personalization of the generated travel plans.

SHEETMIND: AN END-TO-END LLM-POWERED MULTI-AGENT FRAMEWORK FOR SPREADSHEET AUTOMATION

SheetMind: introduces a modular multi-agent framework for spreadsheet automation, with Manager Agent (Decomposes user instructions), Action Agent (Generates BNF commands), Reflection Agent (Validates actions, monitors effects), Front-End (User interface, executes actions), Back-End (Houses agent pipeline), and Spreadsheet (Target environment), enabling natural language interaction.
The framework decomposes complex instructions into subtasks, translates them into structured commands using BNF grammar, and validates actions through a feedback loop.
SheetMind integrates LLM-driven planning, structured execution, and agentic feedback to bridge the gap between natural language and spreadsheet functionalities.

INDOORWORLD : Integrating Physical Task Solving and Social Simulation in A Heterogeneous Multi-Agent Environment

INDOORWORLD: introduces a heterogeneous multi-agent environment integrating physical task solving and social simulation, with Agent (Autonomous entity), Perception (Processes observations), Memory (Stores information, history), Planning (Determines objectives, tasks), Action (Selects, executes actions), Task Prioritization (Encourages task focus), Environment (Simulated indoor space), Object (Physical entity), Location (Spatial area), and World State (Joint state variables) components, designed to simulate occupant behaviors in indoor spaces.
The environment features heterogeneous agents with multi-level profiles (role, action space, capability, knowledge) and human needs, interacting with objects and locations to modify the world state.
The framework supports both collaborative task-solving and autonomous social simulation sessions, providing a testbed for LLM-based multi-agent systems and potential applications in architectural design.

Cloud Infrastructure Management in the Age of AI Agents

AI Agents: introduces an envisioned agentic system architecture for automating cloud infrastructure management using LLM-powered agents, including User-agent Interface, Agent-cloud Interface, Multi-agent Orchestration, Memory, Reasoning, Tools, Planning, Guardrails, Actions, Cloud Vendors, and Cloud Gym components.
The proposed architecture utilizes different cloud interaction modalities (SDK, CLI, IaC, Web) and incorporates exploration/exploitation phases and guardrails for safety and reliability.
A preliminary study evaluates agents across modalities on VM management tasks, highlighting trade-offs in efficiency, success rate, and error handling.

13th June 2025

ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification

ReVeal: introduces a multi-turn reinforcement learning framework for code agents, featuring an Iterative Generation-Verification Loop where a Policy LLM generates code and test cases, External Tools execute them, and Tool Feedback provides results, guided by Turn-Level Reward Design and Outcome Reward, trained using Turn-Aware PPO on the Dataset (TACO).
The framework enables LLMs to autonomously generate and verify code through iterative refinement and tool interaction, improving performance and self-verification capabilities.
ReVeal's approach allows for effective test-time scaling into deeper inference regimes and pushes reasoning boundaries beyond the base model.

The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

Behavior Gap Evaluation Framework: introduces a comprehensive framework to quantify the behavior gap between LLM Agents and Human Experts in task-oriented dialogs using Behavior Gap Metrics, Task Complexity Metrics, and Performance Metrics within a Teacher-Forcing Approach on various Datasets, revealing significant discrepancies that negatively impact LLM Agent performance.
The study utilizes LLM-based Classifiers and an LLM-based Evaluator to analyze specific behavioral dimensions like dialog acts, Tool usage, and knowledge integration, demonstrating that the gap widens with increasing task complexity.
Aligning LLM agent behavior closer to human strategies through Behavior Intervention significantly improves performance, particularly in complex tasks.

A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions

QUASAR: introduces a programming language for LLM agent code actions, with LLM Agent generating Python Subset code, Transpiler converting it to QUASAR Language, and QUASAR Interpreter executing it using Internal Rewrite Rules and External Rewrite Rule, managing External Calls tracked in the Execution Set, validated by User Approval Mechanism, and supporting Conformal Semantics with Abstract External Functions.
The QUASAR Language separates pure internal computation from external side effects, enabling automatic parallelization, dynamic security controls, and uncertainty quantification.
The approach leverages LLM proficiency in Python by transpiling a restricted subset to QUASAR for improved performance, security, and reliability compared to direct Python execution.

PRO-V: An Efficient Program Generation Multi-Agent System for Automatic RTL Verification

PRO-V: introduces an efficient program generation multi-agent system for automatic RTL verification, with Stimulus Generator (Generates input signals), Functional Model (Generates reference outputs), Self-Improvement (Selects, refines models), Validator (Verifies DUT with judge), Judge Agent (LLM for selection, validation), and Refinement Agent (LLM for refinement) components.
The system enhances verification accuracy and coverage through inference-time scaling via dual sampling and a self-improvement mechanism using a best-of-n selection strategy.
PRO-V integrates an LLM-as-a-judge into the validation process, leveraging rule-based static analysis converted to natural language for enhanced prompting and root cause analysis.

Robot Context Protocol (RCP): A Runtime-Agnostic Interface for Agent-Aware Robot Control

RCP (Robot Context Protocol): introduces, "a lightweight, middleware-agnostic communication protocol designed to abstract robotic system complexity", with Adapter Layer (Translates client interfaces), Transport Layer (Handles communication channels), Service Layer (Defines core operations), ROS2 Interface Layer (Maps to ROS2 runtime), and Status Query (Provides health and feedback), where "RCP provides a unified and semantically meaningful interface that decouples client-facing operations from backend implementations".
The protocol is structured in modular layers, including the Adapter Layer for diverse clients, the Transport Layer for communication channels (HTTP, WebSocket, SSE), the Service Layer for high-level operations (read, execute, write, subscribe), and the ROS2 Interface Layer for mapping to the underlying runtime.
RCP includes a Status Query component for real-time protocol health and command feedback, supporting robustness and operational transparency.

Your Ride, Your Rules: Psychology and Cognition Enabled Automated Driving Systems

PACE-ADS (Psychology and Cognition Enabled Automated Driving Systems): introduces a human-centered autonomy framework with Driver Agent (Analyzes external traffic), Psychologist Agent (Interprets occupant psychological state), and Coordinator Agent (Synthesizes inputs, decides behavior) interfacing with Perception module (Provides sensor data), Route planning module (Computes global route), Motion planning module (Generates behavior, trajectory), and Control module (Executes planned trajectory) for adaptive driving.
The framework leverages LLM-based agents to sense, interpret, and respond to both external traffic conditions and internal occupant states.
Operating in a closed-loop architecture, the system dynamically adjusts driving style and supports vehicle operation recovery.

Revealing Political Bias in LLMs through Structured Multi-Agent Debate

Structured Multi-Agent Debate Framework: introduces a system to investigate political bias in LLMs, with LLM Agents (Simulate participants) assigned Agent Personas (Political/gender identities) debating generated Debate Scenarios (Generated topics/questions) following a specific Debate Format (Structured rounds/statements), evaluated by an LLM-as-a-Judge (Evaluates attitudes) using Attitude Scoring (Quantifies agreement/disagreement) and a defined Speaking Order (Agent turn sequence).
The framework systematically varies LLM models, agent gender attributes, and debate formats to examine influences on political bias and attitude shifts.
Experiments reveal Republican agents shift towards neutral, gender influences attitudes, and echo chambers form with attitude intensification, particularly when gender is known.

SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

SEC-bench: introduces an automated benchmarking framework for evaluating LLM agents on security engineering tasks, with Preprocessor (collects instances), Verifier (reproduces, verifies vulnerabilities), and Evaluator (transforms, formulates tasks) components.
The Verifier component employs a multi-agent scaffold including Manager, Builder, Exploiter, and Fixer agents to reproduce and validate vulnerabilities.
The framework automatically creates high-quality software vulnerability datasets with reproducible artifacts for evaluating LLM agent capabilities in tasks like proof-of-concept generation and vulnerability patching.

AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments

AgentSense: introduces a virtual data generation pipeline using LLM agents in simulated home environments to create diverse sensor data for human activity recognition.
The pipeline involves LLMs generating personas, routines, and actions, which are then executed in an extended VirtualHome simulator equipped with virtual ambient sensors.
The generated virtual sensor data is used to pretrain HAR models, demonstrating improved performance, especially in low-resource settings, compared to training solely on real data.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

RACE and FACT evaluation frameworks: introduce two novel evaluation frameworks, with Judge LLM, Adaptive Criteria Generation, Reference-Based Scoring, Statement-URL Extraction, Support Judgment, Jina Reader API, and Citation Metrics Calculation components, designed to comprehensively assess Deep Research Agents.
RACE evaluates report generation quality using adaptive criteria and reference-based scoring, while FACT assesses information retrieval and citation trustworthiness.
These frameworks are part of the DeepResearch Bench, a benchmark of 100 PhD-level research tasks for evaluating LLM-based agents.

A Hybrid Multi-Agent Prompting Approach for Simplifying Complex Sentences

Hybrid Multi-Agent Prompting Approach: introduces a system using multi-agent collaboration for sentence simplification, including Agent 1 Sentence Simplifier, Agent 2 Semantic and Lexical Similarity Evaluator, Agent 3 Alternative Sentence Simplifier, and Comparator components.
The system processes complex sentences through a workflow where agents decompose, evaluate, and iteratively revise the output to preserve meaning while reducing complexity.
This multi-agent architecture demonstrates improved performance over single-agent methods for simplifying complex sentences in domains like video game design.

ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification

ReVeal: introduces a multi-turn reinforcement learning framework that enables code agents to engage in an iterative generation-verification loop using a single Policy LLM, guided by Input Prompt and Tool Feedback, structured as a Multi-turn Rollout producing an Output Rollout, optimized with Outcome Reward and Turn-Level Rewards via Turn-Aware PPO.
The framework alternates between Generation (producing code) and Verification (generating test cases and plans) stages, leveraging external Tools like Python Interpreters for execution.
This iterative process and dense reward structure allow the model to self-verify, refine outputs, and improve both generation and verification capabilities over multiple turns.

Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

Agent-RLVR: introduces a framework for training software engineering agents using Reinforcement Learning from Verifiable Rewards (RLVR), incorporating agent guidance and environment rewards, with Policy, Environments, Trajectory, Evaluation, Environment Information, Agent Guidance, Guidance Generation, RLVR Data, Policy Update, and Instruct Tuning components.
The framework trains an agent Policy by having it interact with Environments, evaluating Trajectories via Evaluation, generating Environment Information from failures, and using Guidance Generation to create Agent Guidance.
Incorrect trajectories are reattempted with Agent Guidance, and the resulting RLVR Data is used for Policy Update via DPO and optional Instruct Tuning to improve agent performance.

Large Language Model-Powered Conversational Agent Delivering Problem-Solving Therapy (PST) for Family Caregivers: Enhancing Empathy and Therapeutic Alliance Using In-Context Learning

LLM-powered Conversational Agent Models: introduces an LLM-powered agent delivering Problem-Solving Therapy (PST) for family caregivers, integrating Motivational Interviewing (MI) and Behavioral Chain Analysis (BCA) using prompting techniques, Retrieval-Augmented Generation (RAG), and clinician-curated content.
The research evaluates four distinct configurations of this agent, comparing different LLMs (GPT-40, Llama 3) and combinations of in-context learning techniques (Few-shot, RAG) for their impact on perceived empathy and therapeutic alliance.
The models aim to provide empathetic and tailored mental health support by improving contextual understanding and generating personalized, actionable strategies for caregivers.

Secure API-Driven Research Automation to Accelerate Scientific Discovery

S3M (Secure Scientific Service Mesh): introduces, "Secure Scientific Service Mesh (Overall framework), Manages data streaming, Automates complex workflows, Manages compute jobs, Provides resource status, Retrieves environment info, Manages access tokens, Enables secure communication, Underlying service mesh platform, Python interface, Validates client interactions, Creates streaming objects, Deploys streaming clusters", a framework providing API-driven infrastructure for automated scientific discovery with integrated streaming, workflow orchestration, and fine-grained authorization.
The framework utilizes a service mesh architecture built on OpenShift and Istio to ensure modularity, scalability, and policy-driven security enforcement across computational services.
S3M offers a comprehensive set of APIs and an SDK to enable authenticated external systems and intelligent agents to securely provision resources, stream data, and trigger compute jobs dynamically.

Your Ride, Your Rules: Psychology and Cognition Enabled Automated Driving Systems

PACE-ADS (Psychology and Cognition Enabled Automated Driving Systems): introduces a human-centered autonomy framework with Psychologist Agent (Interprets occupant state/intent), Driver Agent (Perceives external traffic context), Coordinator Agent (Synthesizes inputs, decides behavior), Perception module (Provides sensor data), Route planning module (Plans/replans vehicle route), Motion planning module (Generates behaviors/trajectories), and Control Module (Executes low-level commands), enabling AVs to sense, interpret, and respond to external traffic and internal occupant states.
The framework uses three specialized foundation model agents in an agentic workflow to manage complex driving tasks and enable adaptive, interpretable, and collaborative driving.
PACE-ADS complements existing AV modules by operating at the high-level behavioral decision layer, personalizing riding experience, and supporting recovery from immobilization.

Self-Regulating Cars: Automating Traffic Control in Free Flow Road Networks

SRC (Self-Regulating Cars): introduces a physics-informed reinforcement learning protocol for automating traffic control in free-flow networks by having a central RL agent modulate vehicle speeds on super-segments based on traffic state observations, guided by a reward function.
The system utilizes Deep Q-Learning with a neural network to learn speed modulation policies, evaluated in a PTV Vissim simulation environment.
The approach aims to optimize network throughput and prevent congestion by coordinating individual self-regulating cars without requiring new physical infrastructure.

PE-MA: Parameter-Efficient Co-Evolution of Multi-Agent Systems

PE-MA (Parameter-Efficient Multi-Agent Co-Evolution): introduces, "a novel collaboration framework", with Frozen Backbone (Fixed feature extractor), Personalized Adapter (Adapts to local tasks/data), Shared Adapter (Shares knowledge across agents), Communication Mechanism (Exchanges and aggregates adapters), designed for efficient, scalable, and personalized co-evolution in multi-agent systems.
Each agent maintains a lightweight personalized adapter for agent-specific behavior and a shared adapter collaboratively optimized across neighboring agents.
The dual-adapter architecture balances global coordination with local adaptation, significantly reducing training and communication costs.

Interaction, Process, Infrastructure: A Unified Architecture for Human-Agent Collaboration

Unified Architecture for Human-Agent Collaboration: introduces a layered framework for human-agent collaboration with Interaction Layer (surface of shared understanding), Process Layer (collaborative core), and Infrastructure Layer (orchestration, execution, memory).
The Process Layer explicitly models goals, workflows, and progress, serving as connective tissue for human-agent alignment and coordination over time.
This modular architecture supports transparency, extensibility, and adaptive, goal-aligned collaboration by decoupling interaction, process logic, and computational foundation.

12th June 2025

AUTOMIND: Adaptive Knowledgeable Agent for Automated Data Science

AUTOMIND (Adaptive Knowledgeable Agent for Automated Data Science): introduces an adaptive, knowledgeable LLM-agent framework with a curated expert knowledge base, an agentic knowledgeable tree search algorithm, and a self-adaptive coding strategy.
The framework leverages the expert knowledge base via a retriever to ground the tree search, which explores solutions through drafting, improving, and debugging actions.
The self-adaptive coding strategy dynamically adjusts code generation based on task complexity, using either one-pass generation or stepwise decomposition with execution feedback.

Specification and Evaluation of Multi-Agent LLM Systems - Prototype and Cybersecurity Applications

Multi-Agent LLM System: introduces a system architecture and specification for multi-agent LLM applications, including a Client Application, Conversational User Interface, Agent Manager, Conversation Manager, Execution Engine, Agents, LLM Services, Host Execution Environment, and Agent Schema.
The system allows specifying agents with executable prompts, actions, and data, supporting prompting/reasoning techniques and conditional execution based on results.
The Agent Schema defines agent types, functions (execution, evaluation), and configurations for agents and LLMs, enabling systematic evaluation of LLMs and techniques in specific applications like cybersecurity.

From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

GPT ranking system: introduces a novel peer review mechanism using LLM agents for pairwise comparisons, aggregated by the Bradley-Terry model to derive a global ranking of submissions.
The system contrasts pairs of manuscripts to determine relative quality, moving away from traditional independent absolute scoring.
Empirical experiments demonstrate the system's potential to identify high-impact papers more effectively than rating-based methods, while also revealing biases against topic novelty and institutional diversity.

LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation

Reflection Agent with Dedicated Evaluator: introduces a system for automatic code validation and refinement, including a Code Generator, Reflect module, and Evaluator using specific metrics.
The Evaluator utilizes Bidirectional Functionality Matching and Logic Representation metrics to assess generated Bash code quality without requiring reference code.
The system incorporates judgments and feedback from the evaluation metrics to refine the initial code snippet generated by the Code Generator.

Using Invocable APIs derived from NL2SQL datasets for LLM Tool-Calling Evaluation

LLM Tool-Calling Evaluation Framework: introduces a method to convert NL2SQL datasets into NL2API datasets for LLM tool-calling evaluation using a Data Generation Pipeline.
The framework includes generated API Collections (SLOT, SEL, REST) with varying characteristics, Invocable APIs for live interaction, and an Evaluation Set pairing natural language queries with ground-truth API sequences.
It evaluates the performance of various LLMs and ReACT Agents on these generated datasets to assess their tool-calling capabilities.

Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges

Reasoning Agentic RAG: introduces a paradigm integrating retrieval with model-driven reasoning and decision-making, encompassing Question, LLM/LRM, Reasoning, Retrieval, Retrieved Information, Distilled Information, and Answer components.
The framework categorizes approaches into predefined reasoning with fixed pipelines and agentic reasoning with autonomous tool orchestration.
This survey reviews techniques, architectural designs, reasoning strategies, and tool coordination within this paradigm to address industry challenges.

Provably Learning from Language Feedback

HELIX: introduces a framework for Learning from Language Feedback (LLF), including an LLM Policy, Reference Policy, Reward Mapping, set of Hypotheses, set of Actions, and Score Matrix.
The LLM Policy generates hypotheses and actions, the Reference Policy adds random actions, and the Reward Mapping scores actions under hypotheses to form a Score Matrix.
The algorithm uses the Score Matrix for decision-making, employing exploitation when consensus exists and exploration otherwise, potentially re-scoring with the Reference Policy.

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

SWE-Factory: introduces an automated pipeline for GitHub issue resolution benchmark construction, including Raw Issue Collection (Collects GitHub issue data), SWE-Builder (Automates environment setup), Grading Results (Grades test outcomes), and Fail2pass Validation (Validates fail-to-pass transition).
The SWE-Builder component is a multi-agent framework comprising a Repository Explorer (Collects repository setup information), Environment Manager (Generates Dockerfile), Test Manager (Generates test script), and Test Analyst (Validates environment, plans iterations), supported by an Evaluation Environment Memory Pool (Stores and reuses setups).
The pipeline automates environment construction, grading via exit codes, and fail2pass validation to reduce manual effort in creating large-scale, high-quality datasets.

Build the web for agents, not agents for the web

AWI (Agentic Web Interface): introduces a new web interface paradigm specifically designed for agents, featuring unified higher-level actions, compatibility with user interfaces, access control for agents, progressive information transfer, and agentic task queues.
This paradigm shift aims to overcome limitations of current human-designed web interfaces for web agents.
The paper establishes guiding principles for AWI design, emphasizing safety, efficiency, and standardization, and advocates for broad ML community involvement.

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Lightweight Sequential Monitoring Framework: introduces a defense against decomposition attacks by using an external monitor to evaluate the cumulative context of subtasks.
The monitor outputs a binary flag at each step to halt the LLM if harmful intent is detected based on the prompt history.
This framework outperforms single-input monitoring and is cost/latency efficient for mitigating decomposition attacks.

Execution Guided Line-by-Line Code Generation

EG-CFG (Execution-Guided Classifier-Free Guidance): introduces a novel approach for neural code generation that incorporates real-time execution signals into the language model generation process, utilizing a Large Language Model, Programming Task input, Initial Prompt, Candidate Generation via beam search, Executable Extraction via AST parsing, Execution Engine for running test cases, Execution Trace generation, Dynamic Signal aggregation, Dynamic Prompt construction, Classifier-Free Guidance for token generation, an Inference Loop for autoregressive generation, and Parameter Search.
The method dynamically incorporates execution signals as the model generates code line-by-line, guiding the generation process toward executable solutions.
EG-CFG achieves state-of-the-art performance on multiple code generation benchmarks by leveraging execution feedback and Classifier-Free Guidance.

Dynamic Epistemic Friction in Dialogue

Dynamic Epistemic Friction (DEF): introduces a formal model of dynamic epistemic friction in dialogue, operationalized within Dynamic Epistemic Logic and vector-based belief representations, using epistemic states, propositions, evidence, alignment, friction, QBank, EBank, FBank, an update function, friction coefficients, and friction equilibrium.
The model quantifies resistance encountered during belief updates by measuring vector similarity between agent beliefs and new information combined with evidence.
Empirical analysis on a situated collaborative task dataset demonstrates that the model effectively predicts participant belief updates by modeling this resistance.

OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems

OPT-Agent: introduces a framework that emulates human reasoning for optimizing solutions by iteratively generating, validating, and improving solutions using historical feedback, with all Drafting, Improving, Debugging, Historical Information, Error Analysis, Validation, and Metrics components.
The framework's workflow involves generating a Draft solution, iteratively Improving valid solutions or Debugging buggy ones based on Error Analysis and Historical Information, with Validation and Metrics guiding the process.
OPT-Agent is evaluated on OPT-BENCH, a benchmark of machine learning and NP problems, to assess LLMs' iterative optimization capabilities.

Integrating Large Language Models into Text Animation: An Intelligent Editing System with Inline and Chat Interaction

Text Animation Editing System: introduces an LLM-aided system for text animation editing, featuring a Script Panel (Edit text, properties), Timeline Panel (Arrange, time clips), Chat Panel (Natural language commands), Resource Panel (Manage assets), Inspector Panel (Adjust properties), Preview Panel (Visualize edits), Inline Agent (Contextual suggestions), Chat Agent (Conversational task execution), LLM (Large Language Model) (AI engine), Semantic-Animation Mapping (Intent to action), and Script-Timeline Synchronization (Panels linked).
This system employs a dual-mode agent pipeline (Inline and Chat Agents) powered by an LLM for intelligent assistance and natural language interaction.
The system aims to lower creative barriers for non-professionals and enhance editing efficiency through seamless inline edits and chat-based interactions.

Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding

VLFly (Vision-Language Fly): introduces a novel VLN framework for UAVs, including an instruction encoding module, a goal retrieval module, a waypoint planning module, and action execution, designed for open-vocabulary goal understanding and continuous control.
The framework processes natural language instructions, retrieves a goal image, generates waypoints from egocentric observations, and executes continuous velocity commands.
VLFly achieves robust generalization and outperforms baselines in simulation and real-world UAV navigation tasks without task-specific fine-tuning.

SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis

SDialog: introduces a Python toolkit for synthetic dialogue generation and analysis, with Turn (Single utterance), Event (Action/instruction), Dialog (Complete conversation structure), Persona (Character profile definition), PersonaAgent (Simulates agent role-playing Persona), BaseOrchestrator (Abstract control class), SimpleReflexOrchestrator (Triggers on condition), LengthOrchestrator (Controls dialogue length), ChangeMindOrchestrator (Simulates agent changing mind), SimpleResponseOrchestrator (Suggests responses by similarity), InstructionListOrchestrator (Provides sequence of instructions), DialogGenerator (Generates dialogue using LLM), PersonaDialogGenerator (Generates dialogue between Personas), Dataset Utilities (Work with external datasets), Serialization Utilities (Save/load dialogues), and Visualization Utilities (Analyze/visualize dialogues) components, designed for creating realistic, diverse, and controllable conversational data.
The toolkit provides abstractions for personas, orchestration, and scenario management, leveraging instruction-tuned Large Language Models for generation.
SDialog supports workflows like multi-agent simulation and scenario-driven generation, aiming to standardize synthetic data generation for reproducibility.

Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models

Multi-User Dialogue Data Construction Method: introduces a method to extend single-user dialogue datasets by incorporating a second user's utterances, utilizing a Single-User Dialogue Structure, Speech Act Type Identification, User2 Utterance Generation, User2 Utterance Validation, and an LLM to create a Multi-User Dialogue Structure for evaluating DST.
The method systematically generates and validates user2 utterances based on speech act theory to create a controlled multi-user setting for assessing LLM performance.
This approach enables evaluating LLMs on multi-user dialogue state tracking challenges with minimal dataset construction costs.

BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

BugGen: introduces a self-correcting multi-agent LLM pipeline, with Module Splitter (partitions RTL), Mutation Index (lists mutation types), Mutation Cache (stores history), Region Selector Agent (chooses region), Mutation Selector Agent (chooses mutation), Mutation Injector Agent (inserts mutation), Evaluation (validates bug), and Rollback/Retry (corrects failures), designed to autonomously generate, insert, and validate realistic functional bugs in RTL.
The pipeline leverages LLM agents in a closed-loop architecture with shared memory and iterative refinement to produce unique, syntactically valid, and functionally detectable bugs.
BugGen achieves high functional accuracy and throughput, outperforming existing methods and generating high-quality bug datasets suitable for training ML-based debugging models.

Minimizing False Positives in Static Bug Detection via LLM-Enhanced Path Feasibility Analysis

LLM4PFA (LLM-Enhanced Path Feasibility Analysis): introduces an iterative path feasibility analysis framework for static bug detection, with Iterative Function Analysis, Feasible Path Constraint Extraction, Critical Path Conditional Branches Identification, Feasible Path Conditional Expression Extraction, Context-Aware Symbolic Range Reasoning, LLM Agent, Variable Symbolic Range Reasoning, Function Call Symbolic Range Reasoning, Function Retrieval Tool, Source Code Repository, Function Call Memory, Constraints Solving, SMT Query Script Generation, Script Template Generation, SMT Constraints Generation, Script Merging, Constraint Solver, Control-Flow Graph (CFG), and Initial States P.
The framework iteratively analyzes functions in a call trace, extracting and solving feasible path constraints using LLM agents for symbolic reasoning and a constraint solver.
LLM4PFA leverages LLM agents' self-planning and tool-usage capabilities for context-aware symbolic range reasoning and iteratively generates and solves SMT queries to minimize false positives.

WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models

WGSR-Bench: introduces, "a wargame-based benchmark for large language models", with Environmental situational awareness, Opponent risk assessment, and Policy generation components, where "it systematically assesses strategic reasoning abilities using wargame scenarios".
The benchmark evaluates LLMs' capabilities in multi-agent decision-making, intent inference, and counterfactual reasoning within a high-complexity wargame environment.
It employs a structured cognitive framework (S-POE) and utilizes real adversarial wargame data for comprehensive evaluation and analysis.

11th June 2025

AUTONOMOUS COMPUTER VISION DEVELOPMENT WITH AGENTIC AI

Agentic AI approach: introduces an autonomous computer vision development system, with OpenManus Agent (Orchestrates task execution), Memory (Stores runtime state/context), Planning (Decomposes tasks, selects tools), Reasoning (Analyzes inputs, makes decisions), Self-Correction/Adaptation (Handles errors, refines plans), Tools (Execute Python, browser, files, shell), SimpleMind Framework (Executes computer vision tasks), Configurable Tools (Perform image processing, neural nets), Knowledge Graph (Defines SimpleMind workflow), Blackboard (Central working memory), SM-Learn (Trains neural network weights), SM-Think (Performs inference), User Prompt (Natural language task input), System Prompt (Guides LLM planning), Verifier (Checks YAML configuration), Tool Configuration File (KG) (YAML workflow definition), Tool Execution (Runs SimpleMind modules), where the system translates natural language prompts into SimpleMind workflows for medical image analysis.
The OpenManus agent leverages an LLM for planning and tool use, generating a YAML Knowledge Graph that configures SimpleMind's computer vision tools.
SimpleMind executes the planned workflow, utilizing its Blackboard for data flow and SM-Learn/SM-Think for training and inference on medical images.

AURA: A Multi-Agent Intelligence Framework for Knowledge-Enhanced Cyber Threat Attribution

AURA (Attribution Using Retrieval-Augmented Agents): introduces a multi-agent framework for cyber threat attribution, comprising input processing, query rewriting, semantic retrieval, decision making, external search, attribution generation, conversational memory, and a knowledge base.
The framework processes diverse threat data via collaborative agents, integrating Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs) for knowledge-enhanced reasoning and interpretable attribution.
AURA generates transparent, evidence-backed attribution decisions by tracing reasoning to contextual evidence and providing natural language justifications.

Disclosure Audits for LLM Agents

CMPL (Conversational Manipulation for Privacy Leakage): introduces an automated auditing framework for conversational privacy risks in LLM agents, featuring an Application Agent A (LLM agent being audited), an Adversary U (LLM agent attempting leakage), and an Auditor D (LLM agent detecting leakage) interacting within a Conversation Loop (iterative interaction process) based on a Scenario Description σ (public context), Information Subject Profile I (private data), Privacy Directive ψ (disclosure rules), and Task Description T (agent A's goal).
The Adversary U employs a Strategist Se (adversary planning module) and Prompt Generator Ge (adversary query module) to manipulate the Conversation History H (dialogue turns) and may use a Side-channel Predictor Pe (adversary inference module) to make a Prediction (adversary guess) with Confidence kt (adversary prediction score).
The Auditor D monitors the Conversation History H and uses an Entail Function (auditor explicit leakage detector) and the adversary's Prediction and Confidence to produce an Indicator zt (auditor leakage signal) when explicit or implicit leakage is detected, while both agents utilize Memory (stores/summarizes history) to maintain state.

Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

Chat-of-Thought: introduces a collaborative multi-agent system for domain-specific information generation, featuring LLM-based Agents, Context Discovery, Multi-Round Chain of Interactions, Template-driven Routing, and Quality Check.
The system employs specialized LLM-based Agents with defined roles and state to engage in iterative discussions guided by templates and dynamic assignment.
It leverages diverse input sources, question/answer banks, and various learning methods to generate and refine domain-specific knowledge like FMEA documents.

A quantum semantic framework for natural language processing

Quantum Semantic Framework: introduces a non-classical approach to natural language processing, modeling semantic meaning as observer-dependent and contextually actualized through the interaction of a Semantic Expression (symbol affording interpretations) and an Interpretive Agent (observer) via an Interpretive Observable (semantic probe operator).
The framework posits that meaning is not intrinsic but emerges dynamically, influenced by the agent's Semantic Memory (agent internal state) and Context (situational factors), with interpretation dynamics governed by a Semantic Hamiltonian (interpretation dynamics).
A Semantic Bell Test (experimental method) using LLM Agents (computational observers) configured with Personas (agent configurations) and presented with Ambiguous Word Pairs (stimuli) demonstrates non-classical contextuality in interpretation, supporting the framework's premise.

AI Agent Behavioral Science

AI Agent Behavioral Science: introduces, "AI Agents (autonomous systems)/Memory (stores history)/Planning (strategizing actions)/Tool Use (interacts with tools)/Action Modules (executes decisions)/Intrinsic Attributes (internal traits)/Environmental Constraints (external structures)/Behavioral Feedback (adaptation mechanism)/Ability (foundational competence)/Motivation (drive from feedback)/Trigger (initiating signals)", a paradigm for studying AI agents as behavioral entities in context, emphasizing systematic observation, intervention design, and theory-guided interpretation.
This perspective focuses on understanding how AI agent behavior emerges and adapts through the interplay of internal factors, environmental context, and interaction feedback.
The paper systematizes research on individual, multi-agent, and human-agent interactions and positions this behavioral science approach as essential for responsible AI.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

V-JEPA 2 (Self-Supervised Video Model): introduces a self-supervised approach combining internet video data and robot interaction data, with V-JEPA 2 Encoder (Extracts video representations), V-JEPA 2 Predictor (Predicts masked representations), V-JEPA 2 EMA Encoder (Target for prediction), V-JEPA 2-AC Frozen Encoder (Provides learned representations), V-JEPA 2-AC Action-Conditioned Predictor (Predicts future state representations), MLLM Projector Module (Maps visual to LLM), and MLLM LLM Backbone (Language model).
The framework pre-trains V-JEPA 2 on internet video, then post-trains V-JEPA 2-AC on robot data for planning, and aligns V-JEPA 2 with an LLM for video question answering.
V-JEPA 2 demonstrates strong performance on motion understanding, action anticipation, video question answering, and enables zero-shot robot manipulation via planning.

Patterns of Patterns III

PLACARD: introduces a methodology combining PAR, CLA, and DPL for collective learning and design, extended to pattern-competent AI agents within Multi-Agent Systems, utilizing Language Model Substrates and various agent types.
The approach structures pattern use via an A/B/C catalogue and proposes AI agent pattern types (Interactional, Cognitive, Infrastructural) along with specific candidate patterns for agents and their environments.
Different agent roles, including Code-Writing & Execution Agents, Pattern-Aware Dialogue Agents, Pattern-Reflective Meta-Agents, and Multi-Agent Institution Designers, interact within the MAS environment, grounded by a Real-World Interface layer.

Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling

VRS (Verbalized Rejection Sampling): introduces, "a natural-language adaptation of classical rejection sampling", with LLM (Performs accept/reject decision), Target Distribution (Desired sample distribution), Proposal Distribution (Source of candidate samples), Candidate Sample (Sample from proposal), Verbalized Prompt (LLM input with descriptions), Accept/Reject Decision (LLM binary output), and Sampling Loop (Generates candidates, repeats), which prompts an LLM to reason about and accept or reject proposed samples from a proposal distribution to generate samples from a target distribution.
The framework verbalizes the target distribution, proposal distribution, and a candidate sample into a prompt for the LLM, which acts as a black-box decision engine.
The external sampling loop generates candidate samples and repeats the LLM decision process until the required number of accepted samples is collected.

SRLAgent: Enhancing Self-Regulated Learning Skills through Gamification and LLM Assistance

SRLAgent: introduces an LLM-assisted system that fosters self-regulated learning skills through gamification and adaptive support, with Game Environment (Minecraft 3D world), Learning Management System (Manages learning elements), Task System (Manages hierarchical tasks), Agent System (AI-powered support), LLM (Large Language Model), Planning Agent (Supports forethought phase), SubTask Monitor (Tracks performance), SubTask Tutor Agent System (Provides tutoring), Quiz Agent (Supports quizzes), Review Agent (Guides analysis), Chatting Agent (Facilitates discussions), Writing Agent (Supports writing), Reflection Agent (Guides reflection phase), Task State (Current task status), SRL Phase (Current SRL stage), Task Learning Views (User interface views), Learning Subtasks Content (Educational materials/activities), Learning Feedback (Agent-provided feedback), Tutor Feedback (Tutor agent feedback), Prompt Configurations (Agent prompt templates), where it guides students through SRL phases using gamification and LLM-powered agents in a game-based environment.
The system is grounded in Zimmerman's three-phase SRL framework, enabling goal-setting, strategy execution, and self-reflection within an interactive game environment.
SRLAgent offers real-time feedback and scaffolding powered by LLMs to support students' independent study efforts.

PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

PersonaLens: introduces a benchmark for evaluating personalization in task-oriented conversational AI assistants, with diverse user profiles, tasks with situational contexts, and two LLM-based agents (User Agent and Judge Agent) to simulate interactions and evaluate performance.
The benchmark features 1,500 user profiles, 111 tasks across 20 domains, and LLM-powered agents for scalable, automated evaluation.
PersonaLens assesses personalization, task completion, and dialogue quality in multi-turn interactions, revealing insights into current LLM assistants' capabilities.

INTELLIGENT DESIGN 4.0: PARADIGM EVOLUTION TOWARD THE AGENTIC AI ERA

ID 4.0 (Intelligent Design 4.0): introduces a multi-agent-based paradigm for engineering design automation, composed of stage-level agents (interprets inputs/decomposes tasks, explores early solutions/ideation, translates concepts/coordinates modeling, refines geometry/produces models, applies optimization/fine-tunes) and functional agents (retrieves external knowledge, accesses databases, infers user intent/context, facilitates divergent thinking, synthesizes 2D/3D geometry, conducts multi-physics simulations, verifies compliance) operating within a shared information environment (supports coordination/learning).
This framework envisions autonomous, task-specialized AI agents coordinating via orchestrated design workflows to support complex, end-to-end design processes.
Agents interact with human designers and external tools, leveraging shared memory for cross-agent coordination and cumulative learning throughout the design process.

Feature Engineering for Agents: An Adaptive Cognitive Architecture for Interpretable ML Monitoring

CAMA (Cognitive Architecture for Monitoring Agent): introduces a cognitive architecture for ML monitoring, with Semantic Memory (Stores reference data), Working Memory (Holds current context), Episodic Memory (Retains past instances), Procedural Memory (Stores agent code), Decision Procedure (Feature engineering approach), LLM (Reasoning engine), and Agent code (Includes prompts/chains), designed to enhance interpretability and actionability of monitoring outputs.
The Decision Procedure implements a three-step feature engineering-inspired approach: Refactor, Break Down, and Compile, to process monitoring data.
This architecture leverages structured memory and feature engineering principles to provide robust, interpretable, and actionable insights for ML model monitoring.

Intent Factored Generation: Unleashing the Diversity in Your Language Model

IFG (Intent Factored Generation): introduces a two-stage sampling process including Prompt, LLM Internal, Intent, Phrasing, and Response.
The process samples a semantically dense Intent first, then the final Response conditioned on the Prompt and Intent, allowing independent temperature control for diversity and coherence.
IFG can be implemented via Few-shot Prompting or Finetuning on intent-annotated data, with IFG-prompting encouraging granular steps in reasoning tasks.

Application-Driven Value Alignment in Agentic AI Systems: Survey and Perspectives

LLM-based Agent System: introduces, this survey, with LLM-based Agent (core system participant), Multi-Agent System (multiple interacting agents), Value Principles (hierarchical ethical norms), Value Alignment Evaluation (assessing adherence to values), and Value Coordination (managing values in multi-agents), where this survey reviews application-driven value alignment in agentic AI systems.
The paper integrates AI advancements driven by large models with demands of social governance, covering value principles, application scenarios, and evaluation methods.
It systematically examines datasets and methods for value alignment assessment and explores value coordination among multiple agents within agent systems.

DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy

DipLLM (Fine-Tuned LLM-Based Agent): introduces DipLLM, a fine-tuned LLM-based agent for strategic decision-making in Diplomacy, with Llama 3 8B (LLM backbone), Autoregressive Factorization (decomposes actions), TextDiplomacy (state to text, text to actions), Equilibrium Search (piKL-Hedge) (generates Q-values), Human IL Model (DipNet) (collects raw data), Environment (Diplomacy) (game simulation), Loss Function (aligns policy), LoRA (parameter adaptation), and Data (collected game data).
DipLLM leverages autoregressive factorization to simplify complex multi-unit action assignment into sequential unit-level decisions.
The agent is fine-tuned on a small dataset using a designed loss function to align its policy with an equilibrium objective, outperforming state-of-the-art models like Cicero with significantly less data.

Effective Red-Teaming of Policy-Adherent Agents

CRAFT (Constraint-aware Red-teaming with Adversarial Framing and Tactics): introduces a multi-agent red-teaming system with Policy Analyzer (Extracts relevant policy), Deception Planner (Plans attack strategies), Avoidance Advisor (Plans what to avoid), Dialogue Executor (Executes interaction dialogue), Conversation Memory (Dialogue history), Policy-Adherent Agent (Target system), Policy (Rules for target), and User Request (Initial user input), designed to expose vulnerabilities in policy-constrained LLM-based agents through strategic, multi-step adversarial planning.
The system leverages policy knowledge, strategic reasoning, and pre-execution planning to generate policy-aware persuasive strategies that undermine policy adherence in customer service scenarios.
CRAFT achieves significantly higher attack success rates compared to generic jailbreak methods and cooperative user simulations, highlighting the need for stronger safeguards against malicious user behavior.

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Multi-Agent Framework: introduces ReasonMed, a large-scale medical reasoning dataset generated and refined using a multi-agent system for initial path generation, followed by verification, ranking, summarization, error correction, and quality assessment.
The framework generates 370k high-quality medical reasoning examples by distilling 1.7 million initial paths from multiple large language models through a rigorous multi-stage verification and refinement pipeline.
Leveraging the generated dataset, the authors train ReasonMed-7B and find that combining detailed chain-of-thought reasoning with concise answer summaries yields the most effective fine-tuning strategy for medical question answering.

A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy

LLM-HAS (LLM-based Human-Agent Systems): introduces implementation guidelines with Initial Setup (define environment, roles, interaction), Human Data (acquire, process, use human data), Model Engineering (iterative development, feedback, learning, optimization), Post-Deployment (monitor, maintain alignment, adapt), and Evaluation (assess effectiveness, safety, experience) components.
The framework advocates for collaborative AI-human partnerships, prioritizing human involvement for guidance, control, and enhanced system trustworthiness and adaptability.
Progress in AI is measured by how well systems work with humans, enhancing human capabilities through partnership rather than pursuing full autonomy.

Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Multi-Agent Language Models: introduces integrating Language Model (LM) (Recommends/generates actions) into Reinforcement Learning (RL) Agent (Learns decision policy) loops interacting with an Environment (Simulates game world), utilizing Dataset (Human gameplay examples) and Replay Buffers (Stores game transitions) for training via Distillation Loss (Transfers LM knowledge) and TD Loss (Reinforcement learning update), employing Encoders (GRU) (Encode text inputs), Decoder (Combines encoded features), and MLP (Predicts action values).
The approach explores LM-in-the-Loop for action recommendation in text games and LM as Multi-Agent in Hanabi, showing improved performance and accelerated convergence compared to baselines.
Key findings highlight the importance of careful transition selection for LM training and demonstrate the potential for distilling language model knowledge into reinforcement learning agents.

10th June 2025

Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation

ANN (Agentic Neural Network): introduces a framework conceptualizing multi-agent collaboration as a layered neural network architecture, including Agent (Node), Layer (Agent Team), Agent Pipeline, Dynamic Routing/Team Selection, Aggregation Function, Forward Pass, Backward Pass/Optimization, Textual Gradient, Global Optimization, Local/Layerwise Optimization, Momentum, Validation, Memory/Trajectory, LLM Backbone, Prompt, and Tool components.
The framework employs a two-phase optimization strategy: a forward pass for dynamic team selection and a backward pass for iterative refinement using textual gradients.
This approach enables agents to self-evolve their roles, prompts, and coordination, dynamically reconfiguring teams and strategies based on performance feedback.

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

UTBoost: introduces a framework for augmenting test cases using intramorphic testing, including the UTBoost Workflow (Orchestrates testing process), UTGenerator (Generates augmented test cases), and Intramorphic Testing (Establishes test oracle).
The UTGenerator component utilizes an LLM and a multi-level localization process (file, function/class, line) to identify relevant code areas before generating new test cases and their dependencies.
UTBoost enhances the evaluation of coding agents on benchmarks like SWE-Bench by identifying insufficient test cases and erroneous patches, leading to more reliable results and leaderboard updates.

GUIROBOTRON-SPEECH: TOWARDS AUTOMATED GUI AGENTS BASED ON SPEECH INSTRUCTIONS

GUIRoboTron-Speech: introduces an end-to-end autonomous GUI agent accepting speech instructions and screenshots, with Vision Encoder (Processes GUI screenshot), Audio Encoder (Processes speech instruction), Large Language Model (Processes inputs, predicts action), Grounding Stage (Trains visual understanding), and Planning Stage (Trains reasoning and planning) components, designed to predict GUI actions from multimodal input.
The approach leverages a progressive training framework with grounding and planning stages to develop capabilities in understanding GUI elements and task execution.
Mixed-instruction training is employed during the grounding stage to mitigate modality imbalance from pre-trained foundation models.

Agent-based Condition Monitoring Assistance with Multimodal Industrial Database Retrieval Augmented Generation

MindRAG (Multimodal industrial database Retrieval-Augmented Generation): introduces an agent-based condition monitoring assistance framework with LLMs, Agents, a Multimodal & semi-structured annotated machine graph Vector Store, Multimodal RAG techniques for Retrieval, Generation, Tools, and Knowledge Bases.
The framework integrates LLM-based reasoning agents (Main thinker, CM analyst, Maintenance scheduler, Evaluation agent) with a novel vector store structure designed for industrial condition monitoring data.
MindRAG leverages multimodal retrieval and generative capabilities, supported by custom tools and knowledge bases, to provide decision support and explainable interfaces for condition monitoring analysts.

Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search

LWM-Planner (LLM-based World Model Planning Agent): introduces an LLM agent framework that enhances planning via in-context learning using a Fact Extractor (Extracts atomic facts from experience), a Planner LLM (Performs lookahead search planning), Atomic Facts (Learned knowledge base), and Interaction History (Recent observation-action memory).
The agent extracts task-critical atomic facts from interaction trajectories to dynamically augment prompts for LLM components responsible for action proposal, world model simulation, and value estimation.
Planning involves a depth-limited lookahead search where the Planner LLM simulates trajectories and evaluates outcomes guided by accumulated facts and history.

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

ALE-Bench: introduces a benchmark for long-horizon objective-driven algorithm engineering, featuring Problem (Provides statement/metadata), Scorer (Evaluates solution code), Visualizer (Displays execution results), Test Run (Executes code in sandbox), Code Sandbox (Replicates execution environment), Leaderboard (Ranks submissions/calculates metrics), and Session (Orchestrates AI interaction/evaluation).
The benchmark provides a software framework simulating competitive programming contests, allowing AI systems to iteratively refine solutions using test-run feedback and visualizations.
ALE-Bench quantifies AI performance on computationally hard optimization problems from AtCoder Heuristic Contests, enabling comparison against human experts and fostering long-horizon problem-solving research.

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

VIKI-R (Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning): introduces a two-stage framework that fine-tunes a pretrained vision-language model using Chain-of-Thought demonstrations and reinforcement learning, evaluated on the VIKI-Bench benchmark.
The framework addresses embodied multi-agent cooperation across three hierarchical levels: agent activation, task planning, and trajectory perception, utilizing diverse robot embodiments and multi-view visual observations.
The approach significantly outperforms baselines, demonstrating enhanced visual reasoning and compositional cooperation patterns among heterogeneous agents in complex environments.

Design Patterns for Securing LLM Agents against Prompt Injections

Design Patterns: introduces, with Action-Selector Pattern (Selects predefined actions), Plan-Then-Execute Pattern (Defines plan, executes actions), LLM Map-Reduce Pattern (Dispatches isolated sub-agents), Dual LLM Pattern (Privileged/quarantined LLMs), Code-Then-Execute Pattern (Writes formal program), and Context-Minimization pattern (Removes prompt from context), a set of principled design patterns for building AI agents resistant to prompt injection attacks.
These patterns impose intentional constraints on LLM agents, limiting their ability to perform arbitrary tasks and preventing untrusted input from triggering consequential actions.
The paper analyzes the trade-offs of these patterns in terms of utility and security and illustrates their application through case studies of LLM agent applications.

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Evaluation Frameworks for LLM-based Data Science AI Systems: includes LLM/Agent (AI system being evaluated), Environment (Where agent operates), Tools (Capabilities like code execution), Evaluation System (Measures performance), Data (Input for tasks), Task Description (Instructions for agent), and User (Interacts with agent).
These frameworks assess AI systems, ranging from assistants to autonomous agents, on various data science activities using diverse metrics and setups.
Evaluation often involves the agent interacting with data and tools within an environment, with performance judged by an automated or human-assisted evaluation system against task descriptions and data.

Improved LLM Agents for Financial Document Question Answering

Multi-Agent Framework: introduces a system for financial document question answering with Analyst, Critic, Improved Critic, and Calculator agents.
This framework utilizes multiple LLM-based agents to improve numerical reasoning on tabular and textual financial data.
The proposed calculator agent demonstrates improved performance and safety compared to the previous state-of-the-art approach for this task.

Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMS

End-to-End DST Model: introduces an end-to-end dialogue state tracking system using a pretrained speech encoder, a small connector module, a pretrained LLM, and dialogue history.
The system processes speech input and dialogue history to directly output a JSON string representing the dialogue state.
The approach bridges speech and language model representation spaces through a two-stage training scheme for ASR pre-training and joint ASR-DST finetuning.

MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning

MasHost: introduces a reinforcement learning-based framework for autonomous multi-agent system construction, with Multi-agent System (Mas), LLM Agent, Role Pool, Interaction Pathway, Markov Decision Process (MDP), State, Action Space, Node-level Actions, Edge-level Actions, Policy Function, Node-level Policy (πθ), Edge-level Policy (πφ), Reward Function, Joint Probabilistic Space Sampling (JPSS), Hierarchical Relative Policy Optimization (HRPO), Group-relative Advantage, Action-wise Absolute Reward, Triple Objective, Query, State List, Selected Agents, Global Messages, Summarizer Agent, and EXIT Node components.
The framework models Mas construction as an MDP, employing JPSS for joint node and edge sampling and HRPO for multi-objective optimization towards performance, efficiency, and rationality.
MasHost enables autonomous Mas graph construction and role selection from a full-scale space, guided by a hierarchical reward structure combining group-relative and action-wise rewards.

CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

CAF-I (Collaborative Agent Framework for Irony): introduces a multi-agent framework for irony detection with Context, Semantic, Rhetoric, Decision, and Refinement Evaluator Agents.
CAF-I performs multi-perspective analysis and interactive collaborative optimization to improve detection accuracy and interpretability.
The framework achieves state-of-the-art zero-shot performance by simulating human-like multi-perspective analysis.

TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

TACTIC (Translation Agents with Cognitive-Theoretic Interactive Collaboration): introduces a multi-agent translation framework inspired by cognitive translation studies, including DraftAgent (Generates multiple drafts), RefinementAgent (Synthesizes drafts), EvaluationAgent (Evaluates translation quality), ScoreAgent (Scores translation quality), ContextAgent (Provides contextual information), and ResearchAgent (Gathers external knowledge).
The framework comprises six distinct agents mirroring human translation processes, operating in base and complex workflows for iterative refinement.
TACTIC leverages LLMs to simulate cognitive functions like strategic variation, processing, and contextual cognition for high-quality translation.

Reinforce LLM Reasoning through Multi-Agent Reflection

DPSDP (Direct Policy Search by Dynamic Programming): introduces a reinforcement learning algorithm to train an actor-critic LLM system for multi-turn reasoning refinement using direct preference learning on self-generated data, incorporating Actor, Critic, and DPO.
The approach models the multi-turn refinement process as a Markov Decision Process, where the Actor generates responses and the Critic provides feedback, iteratively improving answers.
DPSDP leverages DPO for training the agents, demonstrating improved performance on reasoning benchmarks through collaborative refinement.

Your Agent Can Defend Itself against Backdoor Attacks

ReAgent (Reverse and Reflective Agent): introduces a novel defense against backdoor attacks on LLM-based agents, utilizing Execution-Level Detection, Planning-Level Detection, Agent's Thoughts, Agent's Actions, Agent's Thought Trajectory, User's Instruction, and Reconstructed Instruction components to detect inconsistencies.
The defense employs a two-level approach, verifying consistency between agent thoughts and actions at the execution level and between the user instruction and reconstructed instruction from the thought trajectory at the planning level.
ReAgent leverages the compromised agent's own capabilities for self-defense and provides chain-of-thought explanations for transparency.

Reinforcement Fine-Tuning for Reasoning towards Multi-Step Multi-Source Search in Large Language Models

R-Search: introduces a single-LLM framework that unifies multi-step planning, multi-source search execution, and answer synthesis within one coherent inference process, utilizing Policy LLM, , , , , NL-DAG Parser, DAG Validator, Topological Sort, Search Execution, Search Tools, ReFT, GRPO Optimizer, Reward Function, and Reference LLM.
The framework structures output into four components: reasoning traces, NL-DAG search plans, retrieved results, and synthesized answers, enabling integrated reasoning and multi-source search execution.
A specialized Reinforcement Fine-Tuning method based on GRPO is used with a multi-component reward function to optimize answer correctness, structural validity, and format adherence.

TrajFlow: Multi-modal Motion Prediction via Flow Matching

TrajFlow: introduces a flow matching-based framework for multi-modal motion prediction, utilizing a Context Encoder (encodes scene context), Flow Matching Decoder (decodes noisy trajectory to predicted trajectories and scores), Prediction Heads (predicts trajectory, classification, and ranking scores), ODE Solver (solves ODEs for inference), NMS (filters predicted trajectories), Loss Functions (optimizes model parameters), and Self-Conditioning (mitigates overfitting during training).
The framework predicts multiple plausible future trajectories in a single pass by learning to map noise vectors to data distributions via ordinary differential equations.
A Plackett-Luce distribution-based ranking loss and a self-conditioning training strategy are employed to improve uncertainty estimation and generalization.

ORFS-agent: Tool-Using Agents for Chip Design Optimization

ORFS-agent: introduces an LLM-based iterative optimization agent for chip design parameter tuning, integrating an LLM, ORFS flow, METRICS2.1 metrics, GLOBALCONTEXT state, Toolbox external tools (INSPECT, OPTIMIZE, AGGLOM), Inputs, Outputs, and an Iteration Loop.
The agent executes the ORFS flow in parallel runs, gathers METRICS2.1 data, analyzes and proposes parameters using the Toolbox, and updates design files iteratively.
Guided by user Inputs (PDK, Verilog, Prompts), the agent maintains state in GLOBALCONTEXT to optimize design metrics and constraints, producing optimized Outputs (Config, SDC files).

Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study

Software Engineering Agents (SWE agents): introduces a systematic study of SWE agent behavior using execution traces, focusing on bug localization, patch generation, and reproduction test generation components.
The study analyzes agent effectiveness in fixing issues, generating tests, and comparing agent-generated patches to human-written ones.
Findings reveal agents struggle with complex issues, benefit from bug localization for test generation, and often produce localized edits compared to human refactorings.

9th June 2025

Scaling Laws of Motion Forecasting and Planning A Technical Report

MotionLM: introduces an encoder-decoder autoregressive transformer model with Scene Encoder (Processes scene data) and Motion Decoder (Generates motion tokens) components for joint motion forecasting and planning.
The Scene Encoder uses an Early fusion network (Scene encoder backbone) to process multimodal inputs, while the Motion Decoder employs Cross-attention (Decoder attends encoder) and Flattened agent-time self-attention (Single pass attention) to generate Discrete motion tokens (Represent trajectories).
This architecture enables studying scaling laws for performance improvements with increased compute, data, and model size in autonomous driving tasks.

From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

AR-Bench (Active Reasoning Benchmark): introduces, "a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills", with Player (LLM under evaluation), Judge (Provides answers/feedback), Problem (Initial incomplete information), Interaction Rounds (Multi-turn Q&A), Solution (Final derived answer), where "AR-Bench evaluates LLMs on tasks requiring iterative questioning and information gathering under incomplete information."
The benchmark simulates multi-round conversations between the LLM player and NPC judges providing answers or feedback based on the puzzle's underlying truth.
AR-Bench highlights LLMs' difficulties in active reasoning, particularly in generating high-quality questions and effectively leveraging acquired information to solve problems.

From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium

ECON (Efficient Coordination via Nash Equilibrium): introduces a hierarchical reinforcement-learning paradigm with Coordinator LLM (Generates strategy, aggregates answers), Execution LLM (Produces answers based on strategy/belief), Individual Belief Network (Maps history/observation to belief/action), Belief Encoder (Aggregates belief states), Centralized Mixing Network (Coordinates beliefs, computes global Q), and Reward Design (Provides optimization feedback), recasting multi-LLM coordination as an incomplete-information game.
The framework replaces explicit inter-agent communication with belief-based coordination, where Execution LLMs optimize responses based on beliefs about co-agents to achieve a Bayesian Nash Equilibrium.
ECON demonstrates improved performance and scalability compared to existing multi-agent debate methods by reducing communication overhead and ensuring convergence.

EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

EconWebArena: introduces a benchmark for evaluating autonomous agents on economic tasks, featuring an AI Agent interacting with a Real-World Web Environment via Observation and Action to answer a Question and provide an Answer.
The benchmark comprises 360 tasks on 82 authoritative websites, requiring agents to navigate, interpret content, interact with interfaces, and extract precise data.
The framework utilizes structured observations like AXTree and screenshots, and supports fine-grained browser control actions for realistic web interaction.

SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents

SOP-Bench: introduces a benchmark generation workflow with User Inputs, LLM, Human Review and Correction, Data Schema Generation, SOP Document Generation, Dataset Generation, API & Tool Specification Generation, and Tools Code Generation, designed to evaluate LLM agents on complex industrial SOPs using SOP, Task, ToolSpecs, Mock APIs, and Dataset.
The benchmark generation workflow creates realistic SOPs, associated data, and tools, incorporating complexity, ambiguity, and interdependencies.
The benchmark evaluates agent architectures like FC Agent and ReAct Agent on their ability to execute multi-step, context-dependent procedures requiring tool use and error handling.

Cognitive Weave: Synthesizing Abstracted Knowledge with a Spatio-Temporal Resonance Graph

Cognitive Weave: introduces a novel memory framework for AI agents centered around a Spatio-Temporal Resonance Graph (STRG), orchestrated by the Nexus Weaver (NW), which processes information via the Semantic Oracle Interface (SOI) and Vectorial Resonator (VR).
The STRG is a multi-layered hybrid structure comprising a Core Particle Store for persistent storage of Insight Particles (IPs) and Insight Aggregates (IAs), a Vectorial Subsystem for embeddings, a Temporal Index for time-based queries, and a Relational Graph for modeling relationships.
The system features a dynamic Cognitive Refinement process, managed by the NW and leveraging the SOI, to autonomously synthesize IAs, manage relational structures, and recalibrate importance, enabling continuous learning and memory evolution.

Supporting Construction Worker Well-Being with a Multi-Agent Conversational AI System

Multi-Agent Conversational AI System: introduces a conversational multi-agent AI system for construction worker well-being, with User Interface, User, User Message, Multi-Agent Orchestration, Agents, Agent Configuration, Large Language Model (LLM), Retrieval-Augmented Generation (RAG), Vector Database, External Knowledge, Chunking, Vectorization, Prompt Engineering/Automation, and Personas components.
The system leverages LLMs and RAG, featuring multiple agents with distinct personas and domain knowledge integrated from external documentation.
The multi-agent framework provides practical problem-solving support and social engagement through a collaborative agent workflow managed by orchestration.

HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

HeuriGym: introduces an agentic framework for evaluating LLM-crafted heuristics in combinatorial optimization, with Prompt (Input to LLM), LLM (Large Language Model), Generate (Heuristic algorithm code), Compiler / Interpreter (Code processing), Stage I: Execution (Runs generated code), Logs / Errors (Execution feedback), Solution File (Program output), Stage II: Solution Generation (Output produced), Stage III: Verification (Solution checked), Verifier (Checks constraints), Constraints Satisfaction (Verification result), Evaluator (Calculates cost), Cost (Evaluation result), Feedback (Appended to prompt), and Final Results (Overall outcome).
The framework enables LLMs to generate, execute, verify, and iteratively refine heuristic algorithms for complex optimization problems.
Evaluation uses a feedback loop and the Quality-Yield Index metric to assess reasoning, tool use, planning, and adaptive refinement.

LUCIFER: Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement

LUCIFER (Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement): introduces a hierarchical decision-making framework integrating Large Language Models for context understanding and exploration guidance to enhance autonomous decision-making in dynamic environments.
The architecture features a Strategic Decision Engine for high-level task planning and specialized Workers for low-level action execution, leveraging an Information Space for structured knowledge.
LLMs function as Context Extractors, converting verbal input to structured insights, and Exploration Facilitators, predicting actions, with an Attention Space mechanism embedding these contextual cues into the reinforcement learning policy, reward, and action space.

QUITE: A Query Rewrite System Beyond Rules with LLM Agents

QUITE (Query Rewrite): introduces a training-free, feedback-aware system leveraging LLM agents, rewrite middleware, and a query hint recommender to rewrite SQL queries for improved performance.
The system employs a multi-agent framework controlled by a finite state machine, specialized middleware tools, and a novel hint injection technique.
This approach supports a broader range of query patterns and rewrite strategies, achieving significant execution time reductions and higher rewrite equivalence rates.

MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents

MCPWorld: introduces a unified benchmarking testbed with Task Manager (Initializes tasks/environment), Environment (Containerized desktop), Unified Tool-based Space (Agent interaction interface), App Interface (Connects tools to app), Hooker (Captures app signals), and Evaluator (Verifies task completion) components, designed for evaluating API, GUI, and hybrid computer use agents using a white-box approach.
The testbed utilizes "white-box apps" with source code availability to enable programmatic verification of task completion via dynamic code instrumentation.
MCPWorld supports GUI, API, and hybrid interaction modalities and provides a standardized environment and tool-based interface for agent evaluation.

SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

SWE-Dev: introduces a software engineering agent framework, with Repo Info Extraction (extracts codebase info), Description Generation (generates Gherkin scenarios), Test Case Generation (generates test case code), Revision from Traceback (refines test cases), and Fail-to-pass Test Cases (final test dataset) components, which builds SWE agents using a scalable test case generation pipeline.
The framework focuses on training and inference scaling to improve performance on software engineering tasks.
Training scaling involves synthesizing test cases and scaling agent trajectories, while inference scaling increases interaction budget per run.

MalGEN: A Generative Agent Framework for Modeling Malicious Software in Cybersecurity

MalGEN: introduces a modular, multi-agent framework for generating malware-like artifacts, including Task Planner, Developer, Code Integration, and Executable Builder agents.
The framework simulates adversarial workflows by decomposing user intent into sub-tasks, generating code snippets, integrating them, and building an executable.
MalGEN aims to support defensive cybersecurity research by producing behaviorally diverse, ethically controlled malware samples aligned with MITRE ATT&CK.

Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models

Context-Aware Machine Translation with LLMs: surveys research on using large language models for context-aware machine translation, covering prompt-based, fine-tuning, and other application approaches.
Prompt-based methods utilize LLMs with prompts and examples, while fine-tuning adapts LLMs using specific data and processes.
Other applications include automatic post-editing using an initial MT system and an LLM, agentic frameworks with LLMs, memory, decoding, and agents, and LLM-based evaluation.

SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems

SAFEFLOW: introduces a principled protocol for trustworthy and transactional autonomous agent systems, with User (U), Decider (D), Environment (E), Information (I), SafeFlowAgent-Level (SF), Transactional Logging System, SAFEFLOW MONITOR, Dependency Graphs (DAG), Concurrency Control System, SAFEFLOWAGENT SCHEDULER, SAFEFLOWAGENT VERIFIER, and Bayesian Trust Estimation Process components.
The framework enforces fine-grained information flow control using SafeFlowAgent-Levels and ensures reliability through transactional logging, dependency graphs, and concurrency control.
A trusted Verifier component dynamically adjusts trust levels based on logged behavior and a Bayesian trust estimation process, enhancing security and adaptability.

ChemAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning

ChemAgent: enhances LLMs for chemistry and materials science using a HE-MCTS (hierarchical tree search) framework for tree-search based tool learning.
The HE-MCTS framework decouples tool planning (Policy Model) and execution (Execution Model), guided by PRM (step reward) and ORM (outcome reward), and trained via LLM Self-Training (autonomous optimization).
The system integrates a large Chemistry ToolPool (chemical tools) and is benchmarked/trained on the ChemToolBench (tool learning dataset).

INTENTEST: introduces an API-centric stress testing framework for LLM agents, with Semantic Partitioning (divides input space), Seed Task Generation (creates initial tasks), Testcase Mutator (generates task variants), Intent Preservation Sampling (filters intent-preserving mutations), Error Likelihood Estimation (predicts error likelihood), Strategy Memory (stores successful strategies), Strategy Adaptation (retrieves and adapts strategies), LLM Agent (system under test), and Judge (evaluates for violations), designed to systematically uncover intent integrity violations.
The framework partitions the input space based on API parameters and intent categories, generates seed tasks, and iteratively mutates them while preserving user intent.
INTENTEST prioritizes mutations likely to cause errors using a predictive model and improves efficiency by adapting successful strategies from a memory.

Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent

ROS-based agentic framework: introduces a system for natural language control of PX4-based UAVs, integrating PX4 Autopilot (Low-level flight control), ROS2 Middleware (Communication layer), Ollama (Serves LLMs and VLMs), Visual QnA Node (Processes images and queries), Path Planning Node (Generates collision-free trajectory), Map Encoder Node (Embeds pose and semantic info), LLM (Generates action commands), VLM (Assesses visual input), NVIDIA Omniverse (Simulation environment), and Hardware-In-The-Loop (Physical drone setup).
The framework uses Ollama to serve various open-source LLMs and VLMs, managing tasks through modular ROS2 nodes for visual question answering, path planning, and map encoding.
The system enables a drone agent to interpret natural language instructions, perceive its environment, and execute flight actions in both simulation and real-world settings.

MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models

MedChat: introduces a multi-agent framework for multimodal diagnosis, with Retinal Fundus Image, Clinical Note, Glaucoma Classifier, Disk/Cup Segmentor, Shared Prompt, Role-Based Agents, Sub-Reports, Director Agent, Final Report, Frontend, Backend, and Interactive Chat Interface components, designed to emulate multidisciplinary clinical workflows for generating diagnostic reports.
The framework processes medical images and clinical notes using deep learning models, verbalizes outputs into a shared prompt, distributes it to role-specific LLM agents, and synthesizes their sub-reports into a final diagnostic report via a director agent.
A companion platform provides a user interface for input, report viewing, and interactive question-answering, enhancing transparency and usability for clinical review and education.

G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

G-Memory: introduces a hierarchical memory system for Multi-Agent Systems (MAS) with Insight Graph (abstracts generalizable insights), Query Graph (encodes task meta-information), and Interaction Graph (stores communication logs) components, which manages MAS interaction history via a three-tier graph hierarchy.
G-Memory performs bi-directional memory traversal to retrieve high-level insights and fine-grained interaction trajectories for new queries.
The hierarchical memory architecture is updated upon task completion by assimilating new trajectories and distilling insights, enabling agent teams to evolve.

Shapley-Coop: Credit Assignment for Emergent Cooperation in Self-Interested LLM Agents

Shapley-Coop: introduces a cooperative workflow for self-interested LLM agents, with Structured Negotiation Protocol (Communication protocol), Short-Term Shapley Chain-of-Thought (Heuristic reasoning), and Long-Term Shapley Chain-of-Thought (Retrospective reasoning), enabling credit assignment.
The framework integrates Shapley Chain-of-Thought reasoning with structured negotiation protocols to align heterogeneous goals and facilitate fair credit assignment.
Shapley-Coop fosters spontaneous cooperation through rational task-time pricing and transparent post-task reward redistribution.

8th June 2025

SCGAgent: Recreating the Benefits of Reasoning Models for Secure Code Generation with Agentic Workflows

SCGAgent: introduces an agentic workflow for secure code generation, with Code Generation, Unit Test Generation, CWE Prediction, Guideline Retrieval, Guideline Relevance Checking, Code Modification, Enforce Functionality Module, Fault Determination, Guideline Database, and Test Runner components.
The framework utilizes an underlying language model to perform generation, prediction, checking, modification, and fault determination tasks, guided by a workflow and a database of secure coding guidelines.
SCGAgent iteratively refines generated code based on secure coding guidelines and feedback from executing LLM-generated unit tests to improve both security and functionality.

Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs

Agentic Framework (knowledge organization): introduces a lightweight, agentic framework for question answering under temporal conflict by incrementally building a structured, external memory from source documents.
The framework utilizes a Language Model agent to decompose Incoming Questions, extract facts from Incoming Text Documents, and update a structured Knowledge Base.
For answering, the agent queries the Knowledge Base for temporally filtered, relevant facts, enabling reliable reasoning over dynamic information without model re-training.

Learn as Individuals, Evolve as a Team: Multi-agent LLMs Adaptation in Embodied Environments

LIET (Learn as Individuals, Evolve as a Team): introduces a framework for multi-agent LLM adaptation in embodied environments, featuring LIET Agent, Environment, Communication Channel, Memory, Utility Function, Comm., Planner, Know. List, Message Generator, and Reflector components.
The framework enables LLM agents to learn individually via a utility function for cost estimation and evolve as a team through an evolving communication scheme.
Individual agents use memory and the utility function for local planning, while team communication is guided by a shared knowledge list updated via reflection on received messages.

LLM-Enhanced Rapid-Reflex Async-Reflect Embodied Agent for Real-Time Decision-Making in Dynamically Changing Environments

RRARA (Rapid-Reflex Async-Reflect Agent): introduces a hybrid embodied agent combining a rule-based policy (low-latency reflex) for immediate actions with an asynchronous LLM-based Reflector (asynchronous reflection feedback) for in-situ refinement.
The agent executes initial actions based on the low-latency rule-based policy while the LLM-based Reflector analyzes the situation in parallel to provide feedback.
This parallel processing allows the agent to maintain real-time responsiveness while incorporating high-level reasoning to revise suboptimal decisions in dynamic environments.

BIMgent: Towards Autonomous Building Modeling via Computer-use Agents

BIMgent: introduces an agentic framework for autonomous architectural building modeling, with Design Layer (Transforms design requirements), Action Planning Layer (Hierarchically plans modeling steps), High-level Planner (Generates general design steps), Low-level Planner (Generates detailed action substeps), Execution Layer (Executes planned GUI operations), Pure-Action Workflow (Executes deterministic actions), Vision-Driven Workflow (Executes GUI-grounded actions), Supervisor (Monitors execution, provides feedback), and Memory (Stores execution trajectories).
The framework transforms multimodal design intents into 3D BIM models by interpreting design, planning software workflows hierarchically, and executing GUI actions with supervision and reflection.
BIMgent leverages multimodal LLMs as backbones and integrates components like RAG for documentation access and a screen parser for dynamic GUI grounding to handle complex BIM software environments.

Mind the Web: The Security of Web Use Agents

Web-use Agent Architecture: introduces, "a new attack vector exploiting web-use agents' high-privilege browser capabilities by embedding malicious content in web pages", with all LLM (interprets content, plans actions), Perception Module (gathers web content), Execution Engine (interacts with browser/system), State Management (handles sessions, credentials), Agent Interface (user interaction) components, where "the attack leverages LLMs' contextual reasoning limitations to frame malicious instructions as helpful task guidance".
The paper demonstrates nine payload types compromising confidentiality, integrity, and availability against four popular web-use agent implementations.
Mitigation strategies including oversight, execution constraints, and task-aware reasoning are proposed to address these vulnerabilities.

BRIGHT+: Upgrading the BRIGHT Benchmark with MARCUS, a Multi-Agent RAG Clean-Up Suite

MARCUS (Multi-Agent RAG Clean-Up Suite): introduces a multi-agent pipeline with SafeClean (conservative cleaning), FastClean (aggressive cleaning), and Splitter (semantic chunking) agents to clean and re-chunk the BRIGHT corpus into BRIGHT+.
The pipeline leverages LLMs to systematically remove structural noise and address semantic discontinuity in web-scraped data.
BRIGHT+ yields improvements in retrieval accuracy and multi-hop reasoning across diverse retrievers.

Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

ToTh (Theorem-of-Thought): introduces a multi-agent framework with Multi-Paradigm Reasoning Agents (Generate reasoning traces), Formal Reasoning Graph Construction (Transform traces to graphs), Bayesian Confidence Propagation (Evaluate graph consistency), Graph Scoring (Select best graph), and Answer Extraction (Extract final answer), modeling reasoning as collaboration and verification.
The framework employs distinct agents for abductive, deductive, and inductive reasoning, structuring their outputs into graphs.
Consistency is verified using NLI-calibrated Bayesian belief propagation to select the most coherent reasoning path.

Accelerating Two-Dimensional Materials Research via a Universal Interatomic Potential and Large Language Model Agent

MCP-based Agent Platform: introduces a platform integrating a universal ML-IAP for 2D materials with an LLM-powered agent, including a database, model (ML-IAP and band gap model), and functional modules.
The platform enables natural language interaction for 2D materials property simulations and high-throughput screening.
The ML-IAP model, based on MatterSim/M3GNet, is trained on a large 2D material dataset, while a GNN/CNN model predicts band gaps.

Position: Simulating Society Requires Simulating Thought

GenMinds (Generative Minds): introduces a conceptual modeling paradigm for generative agents, with Structured Thought Capture (Elicits, parses explanations), Causal Motifs (Minimal causal units), Causal Belief Network (CBN) (Symbolic causal graph), Symbolic-Neural Hybrid Graph Simulation (Inference, belief updates), and Awareness of Unknown (Highlights missing links), designed to simulate structured, revisable, and traceable thought for social simulations.
The framework grounds agents in modular belief representations using causal graphs derived from natural language interviews to capture reasoning fidelity.
The paper also introduces RECAP, a benchmark framework to evaluate reasoning fidelity based on traceability, demographic sensitivity, and intervention coherence.

7th June 2025

An Agentic Framework for Autonomous Metamaterial Modeling and Inverse Design

Agentic Framework: introduces an autonomous system for metamaterial inverse design, including Planner (Orchestrates process), Input Verifier (Validates inputs), Forward Modeler (Develops forward model), Inverse Designer (Designs geometry), Memory (Stores information), File_Check (Checks files), Forward_Train (Trains model), Data_Generate (Generates data), Controller (Guides iteration), Code_Modify (Adapts code), Neural_Adjoint (Inverse design tool), Numerical_Simulation (Verifies design), User Message (Input/Output), and System Prompt (Configures agents).
The framework leverages specialized LLM agents and external tools to automate the end-to-end design process from user input to optimized metamaterial geometry.
The system demonstrates autonomous planning, reasoning, and adaptation, achieving performance comparable to human expert-designed solutions.

Boosting LLM Reasoning via Spontaneous Self-Correction

SPOC: introduces a spontaneous self-correction approach for LLMs, with LLM Agent (core model), Dual Agent Roles (proposer/verifier), Interleaved Generation (solution/verification turns), PairSFT (initial training), and Online RL (policy optimization), enabling models to generate interleaved solutions and verifications in a single pass.
The approach dynamically elicits and terminates reasoning generations based on self-verification outcomes, effectively scaling inference time compute.
Training leverages synthetic data for fine-tuning and online reinforcement learning using correctness as reward, yielding substantial performance improvement on math reasoning benchmarks.

Multimodal Spatial Language Maps for Robot Navigation and Manipulation

AVLMaps (Audio-Visual-Language Maps): introduces a unified 3D spatial map representation built by fusing multimodal features from visual, object, area, and audio localization modules, enabling cross-modal reasoning for spatial goal navigation guided by an LLM.
The framework supports zero-shot spatial and multimodal goal navigation, demonstrating improved recall in ambiguous scenarios.
The maps are reusable across different robot embodiments and extensible to additional sensing modalities.

United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory

CoThinker: introduces, with Agent Parallel Thinking (Divides cognitive labor), Thinking Style Orchestrator (Assigns thinking styles), Transactive Memory System (Manages shared knowledge), Communication Moderator (Structures communication network), Synthesizer (Consolidates final solution), and Agents (Individual LLMs), a multi-agent LLM architecture operationalizing Cognitive Load Theory principles to mitigate cognitive overload and enhance collaborative problem-solving.
The architecture distributes intrinsic cognitive load through agent specialization via thinking styles and manages transactional load via structured communication and a collective working memory.
CoThinker demonstrates improved performance on complex problem-solving tasks and high cognitive load scenarios compared to existing multi-agent baselines.

AI PsyRoom: Artificial Intelligence Platform for Segmented Yearning and Reactive Outcome Optimization Method](http://arxiv.org/abs/2506.06740v1)

AI PsyRoom: introduces a multi-agent simulation framework for psychological counseling, including PsyRoom A (Dialogue generation module) for generating dialogues and PsyRoom B (Treatment plan module) for generating treatment plans.
The framework leverages fine-grained emotion analysis through Segmenting Psychological Emotions (Fine-grained emotion analysis) and Segmented Emotional Classification (Fine-grained emotion analysis).
PsyRoom A employs Multi-agents (Simulate counseling dialogue) including Client (Simulated patient agent), Counselor (Simulated therapist agent), and Professor of Psychology (Dialogue evaluation agent), while PsyRoom B uses an Emotional Assessor (Emotion evaluation agent) and Emotional Therapist (Treatment plan agent).

WORLDLLM: IMPROVING LLMS' WORLD MODELING USING CURIOSITY-DRIVEN THEORY-MAKING

WorldLLM: introduces a framework to improve LLMs' world modeling by combining Bayesian inference and active exploration, including a Statistician (LLM forward model) to evaluate hypotheses, a Scientist (LLM theory generator) to refine hypotheses, and an Experimenter (Data collection agent) to collect challenging transitions.
The framework iteratively alternates between the Experimenter collecting data, the Statistician evaluating the current hypotheses on this data, and the Scientist updating the natural language hypotheses based on the evidence.
This curiosity-driven process aims to autonomously improve the LLM's predictive accuracy and generate human-interpretable theories of environment dynamics without costly gradient-based fine-tuning.

Contextual Experience Replay for Self-Improvement of Language Agents

CER (Contextual Experience Replay): introduces a training-free framework for language agents, including distillation module (Distills experiences), retrieval module (Retrieves experiences), dynamic memory buffer (Stores past experiences), and base decision-making agent (Solves tasks).
The framework enables self-improvement by distilling environment dynamics and decision-making patterns from past trajectories.
Retrieved experiences are replayed in the agent's context to enhance decision-making in complex web environments.

6th June 2025

Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce

Auditing Framework: introduces a task-level, survey-based approach with Auditing Framework With Audio Interface, WORKBank, Human Agency Scale (HAS), and Autonomous Agent Desire-Capability Landscape components to audit AI agent automation and augmentation potential across the U.S. workforce.
The framework collects worker desires and AI expert capability assessments for occupational tasks, storing this data in the WORKBank database.
Key outputs include the Human Agency Scale for quantifying human involvement and the Desire-Capability Landscape for identifying mismatches and opportunities in AI agent development.

Improving LLM-Powered EDA Assistants with RAFT

EDA-LLM Assistant Workflow: introduces a method to enhance LLM performance for RAG-based EDA tasks using Retrieval-Augmented Fine-Tuning (RAFT) with synthetic question-answer datasets generated by a Data Generation/Refinement LLM from Training Data Sources, leveraging Few-Shot Retrieval from a Q&A History Database, and utilizing Hybrid Retrieval from a Document Database with Access Control for the EDA-LLM.
The workflow incorporates human-authored Q2A posts and unlabeled EDA Documents as Training Data Sources, employing DeepSeek-V3 as the Data Generation/Refinement LLM to create refined answers and synthetic Q&A pairs, optionally guided by Retrieval-Augmented Few-Shot examples retrieved via BM25 from a Q&A History Database.
The approach integrates Hybrid Retrieval combining semantic and lexical search on a Document Database, applies Access Control to filter retrieved documents for security, and fine-tunes the EDA-LLM using RAFT before deploying it for RAG-based inference.

ScriptDoctor: Automatic Generation of PuzzleScript Games via Large Language Models and Tree Search

ScriptDoctor: introduces a pipeline for automatic PuzzleScript game generation, including an LLM (Generates game code), Lark CFG Parser (Parses generated code), PuzzleScript Engine (Compiles game code), BFS Solver (Tests game solvability), Human Game Archive (Provides game examples), Coding Prompt (Guides LLM generation), Parse Errors (Syntax error feedback), Compile Errors (Compilation error feedback), and Solvability Issues (Playtesting feedback).
The system iteratively generates and tests games, using feedback from parsing, compilation, and playtesting to refine the LLM's output.
This approach demonstrates automated, open-ended LLM-based workflows for generating novel game content in a constrained domain.

On-board Mission Replanning for Adaptive Cooperative Multi-Robot Systems

GATR (Graph Attention Replanner): introduces a lightweight mission replanner using a GAT Encoder (Transforms graph data) and Attention Model Decoder (Generates mission plan) to solve the Cooperative Mission Replanning Problem.
The framework employs an RL Agent (Learns planning policy) interacting with an RL Environment (Simulates mission dynamics), processing an Input Graph (Represents tasks agents) along with the Environment State (Summarizes mission progress) and Availability Mask (Filters invalid actions).
This approach enables fast and efficient on-board replanning for multi-robot systems by transforming input data into latent representations and sequentially generating mission plans.

PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time

PersonaAgent: introduces a personalized LLM agent framework with a persona (user-specific system prompt), personalized memory (stores user data), personalized action (selects tailored actions/tools), test-time user preference alignment (optimizes persona prompt), and tools (external functions).
The personalized memory module integrates episodic memory (records interactions) and semantic memory (summarizes user traits).
The persona serves as an intermediary, using memory insights to guide actions and being refined by action outcomes and test-time alignment.

Can Theoretical Physics Research Benefit from Language Agents?

LLM agents: introduces the potential for LLM agents, with Domain Knowledge, External Tools, Multimodal Processing, Reasoning Capabilities, Information Retrieval, Human Interface, and Experimental Interaction components, to accelerate theoretical physics research by assisting across the typical workflow stages.
The paper analyzes current LLM capabilities and limitations in physics reasoning, highlighting the need for improvements in physical intuition, constraint satisfaction, and reliability.
Realizing this potential requires addressing fundamental challenges like ensuring physical consistency and developing robust verification methods through interdisciplinary collaboration.

Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Multi-Agent Pipeline: introduces a lightweight multi-agent framework for text-to-chart generation, including a Drafting Agent (generates initial code), a Python Interpreter (executes code), a Re-writer Agent (debugs code), and an Execution and Repair Loop (iteratively fixes errors).
This pipeline separates the tasks of code generation, execution, repair, and judgment to improve reliability.
The agentic approach significantly reduces execution errors compared to single-prompt methods, highlighting the value of iterative self-correction.

The Lock-in Hypothesis: Stagnation by Algorithm

Human-LLM Feedback Loop: introduces the lock-in hypothesis, proposing that the dynamic interaction between human users and large language models, involving Human Agents (users), LLM Authority (AI system), Beliefs (ideas, values, opinions), Trust (mutual influence weight), and a Diversity Metric (conceptual variety measure), can lead to a loss of diversity and convergence on false beliefs.
The paper formalizes this hypothesis using a Bayesian model and tests it empirically with agent-based LLM simulations and real-world GPT usage data.
Analysis reveals sudden drops in diversity after new GPT versions are released, supporting the hypothesized feedback loop's role in reinforcing existing beliefs.

Personalized Large Language Models Can Increase the Belief Accuracy of Social Networks

Personalized LLM Bot: introduces a system, with Traditional ML Model (Predicts user preferences), External Database (Stores news articles), RAG Model (Retrieves relevant articles), Summarization Component (Summarizes retrieved articles), and Styling LLM (Rephrases for rhetorical style), designed to provide personalized, factually accurate responses within a social network simulation.
The bot's responses are tailored to individual user preferences regarding news sources and rhetorical style, based on predictions from a machine learning model and information retrieved from an external database.
The study demonstrates that the presence of this personalized LLM bot in a social network leads individuals to update their beliefs towards factual accuracy and influences their subsequent network connections.

Conversational Interfaces for Parametric Conceptual Architectural Design: Integrating Mixed Reality with LLM-driven Interaction

The system: introduces a framework for parametric architectural design using a Reasoning-Code Generation-Execution cycle, integrating a multi-agent LLM system (Reasoning Agent, Coding Agent, Optimization Agent) with a Mixed Reality environment.
The system leverages an Interface Manager and ShapeFramework for user interaction and visualization within MR, while an LLM Session Manager orchestrates the agents and a Compiler Client handles code execution.
This approach aims to lower barriers to parametric modeling by enabling natural language and gesture interaction, dynamic parameter management, and iterative design exploration in an immersive environment.

AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search

AgentSwift: introduces a framework combining selection (selects agent), hierarchical expansion (expands selected agent), value model (predicts performance), and performance uncertainty (guides exploration) for efficient LLM agent design.
Hierarchical expansion includes recombination (replaces components/workflow), mutation (generates new implementations), and refinement (adjusts based on feedback).
The framework leverages a hierarchical search space (models agent design) including agentic workflow (defines execution steps/flow) and functional components (includes memory, tool, planning).

CrimeMind: Simulating Urban Crime with Multi-Modal LLM Agents

CrimeMind: introduces CrimeMind (LLM-driven ABM framework), with LLM Agents (Powered by large language models), Routine Activity Theory (Guides agent crime decisions), Urban Environment (Grid-based spatial simulation), Structured Data (Demographic, socioeconomic features), Street View Imagery (Visual urban scene input), Vision-Language Model (Processes visual urban cues), Human Annotation (Dataset for perception alignment), Self-Evolution Alignment (Calibrates VLM to human judgment), Agent Mobility (Simulates agent movement), and Crime Heatmap (Aggregated crime event visualization), which simulates urban crime using theory-grounded LLM agents in a multimodal urban context.
The framework integrates Routine Activity Theory into agent decision-making and uses a self-evolution alignment process to calibrate visual perception with human judgment.
CrimeMind enables counterfactual simulations and policy evaluation by allowing agents to dynamically adapt behavior based on changing conditions.

CodeContests+: High-Quality Test Case Generation for Competitive Programming](http://arxiv.org/abs/2506.05817v1)

G-V (Generator-Validator) agent system: introduces an LLM-based agent system for high-quality test case generation, including Generator Agent, Validator Agent, Generator Program, and Validator Program.
The Generator Agent writes programs to create diverse test inputs, while the Validator Agent writes programs to verify these inputs against problem constraints.
Test cases failing validation provide feedback to the Generator Agent for revision, improving correctness and coverage.

MAPLE: Multi-Agent Adaptive Planning with Long-Term Memory for Table Reasoning

MAPLE (Multi-agent Adaptive Planning with Long-term mEmory): introduces a novel framework for table reasoning with Solver (Iterative reasoning), Checker (Answer verification), Reflector (Error diagnosis), Archiver (Memory management), Working Memory (Current task state), and Long-term Memory (Accumulated knowledge) agents in a feedback loop.
The framework mimics human problem-solving by enabling dynamic adaptation within and across tasks through iterative refinement and experiential learning.
Specialized agents collaborate in a feedback-driven cycle, leveraging dual memory systems for robust and accurate table reasoning.

To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt

Polymorphic Prompt Assembling (PPA): introduces a defense against prompt injection by dynamically varying prompt structure using User Input, Instruction Prompt, Separator Set, System Prompt Set, Random Selector, Format Constraints, and Polymorphic Prompt Assemble process.
The approach randomizes the combination of user input and system prompts using selected separators and templates to disrupt attacker predictability.
This method enhances LLM agent security against adaptive attacks with near-zero runtime overhead.

Toward Greater Autonomy in Materials Discovery Agents: Unifying Planning, Physics, and Scientists

MAPPS (Materials Agent unifying Planning, Physics, and Scientists): introduces a multi-agent framework for autonomous materials discovery, including a Workflow Planner (Generates multi-step workflows), a Tool Code Generator (Synthesizes executable code), and a Scientific Mediator (Coordinates agents and human).
The framework enables Level 2 autonomy by allowing agents to plan workflows guided by human input, rather than executing fixed, predefined steps.
MAPPS integrates physics-based tools and human feedback to ensure scientific validity and improve performance in crystal structure generation and prediction tasks.

5th June 2025

Energentic Intelligence: From Self-Sustaining Systems to Enduring Artificial Life

Energentic Intelligence: introduces a class of autonomous systems driven by persistence, with Energy Generation Core (Converts ambient energy), Energo-Cognitive Cortex (Performs perception/decision-making), Thermal Regulation Unit (Manages internal temperature), and Survival Manager (Estimates survival, issues commands) components.
This framework operationalizes autonomy through energetic persistence, integrating energy harvesting, adaptive computation, and thermoregulation into a cohesive, internally regulated feedback loop.
The system aims to sustain its existence by continuously adapting behavior based on internal energy and thermal conditions, rather than optimizing external task performance.

OPERA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPERA-based User Behavior Simulation: introduces OPERA Dataset (Dataset), User Persona (User profiles), Action Traces (User interactions), Web Observations (Web context), Rationales (Action explanations), ShoppingFlow Plugin (Data collection plugin), Content Script (Logs user interactions), Background Script (Tracks page events), Rationale Pop-up (Collects rationales), and LLM (Simulation model), which provides a dataset and benchmark for evaluating LLMs on simulating human online shopping behavior.
The framework utilizes the ShoppingFlow plugin to collect detailed user data, including actions, web context, rationales, and persona information.
This data is then used to benchmark LLMs on predicting user actions and rationales in online shopping scenarios.

IMPROVING LLMS WITH A KNOWLEDGE FROM DATABASES

Enhanced Association Rule RAG: introduces a method to improve LLM answers by augmenting them with knowledge discovered from databases using enhanced association rules, including Dataset, Rule Mining Pattern Definition, Rule Mining Task Definition, Rule Mining Execution, Rule List, Rule-To-Text Module, Text Document, RAG Augmentation, and LLM components.
The approach extracts knowledge from a dataset via rule mining, converts the resulting rules into a text document, and embeds this document into the LLM's context using Retrieval-Augmented Generation.
This method provides interpretable knowledge to the LLM, enabling improved data-based question answering without requiring the LLM to directly execute analytical code or interpret complex rule formats.

SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms

Fact Checking Framework: introduces a two-stage pipeline for deepfake detection using YOLO (Face Detection), FaceNet (Feature Extraction), Influential People database (Identity Comparison), Whisper (Speech Transcription), LLM AGENT-1 (Plausibility Analysis), LLM AGENT-2 (Factual/Ethical Check), WEB SEARCH (External Information), and LLM (Final Decision) to analyze audio-visual content.
The framework identifies individuals and transcribes speech in the first stage, then uses a multi-agent LLM pipeline to verify authenticity based on plausibility, factual correctness, and ethical implications.
This multimodal approach integrates visual recognition, speech transcription, and language-based reasoning to enhance robustness against sophisticated deepfakes.

LLM Agents for Asynchronous Group Communication in Mafia Games

LLM Agent: introduces an adaptive asynchronous agent for group communication, featuring a Scheduler (decides when to speak) and Generator (composes message content) modules, using Context (game state and chat history) and guided by dynamic Scheduling Prompt (guides timing decision) and Generation Prompt (guides message content), incorporating Simulated Typing Time (adds human-like delay).
The agent is evaluated in Mafia games alongside human players, demonstrating performance comparable to human players in timing and win rates.
The asynchronous design allows the agent to decide both what to say and when to say it, better mimicking real-world group interactions.

ProRefine: Inference-time Prompt Refinement with Textual Feedback

ProRefine: introduces an inference-time prompt optimization method using textual feedback from LLMs, including LLMtask (Executes task), LLMfeedback (Critiques output), and LLMoptimizer (Refines prompt) components.
The LLMtask executes the task, LLMfeedback critiques its output, and LLMoptimizer refines the prompt based on the feedback in an iterative loop.
This process dynamically refines prompts for multi-step reasoning tasks without requiring additional training or ground truth labels.

Teaming in the AI Era: AI-Augmented Frameworks for Forming, Simulating, and Optimizing Human Teams

Frameworks: introduces AI-augmented frameworks for forming, simulating, and optimizing human teams, including a Team Formation Framework using a UCB Algorithm and user feedback, tAlfa (Team AI Feedback Assistant) with an LLM-powered agent and processing stages for feedback generation and delivery based on communication metrics, and PuppeteerLLM, an LLM-based simulation framework with LLM agents, physical environments, temporal dynamics, and simulation stages.
The Team Formation Framework iteratively refines team recommendations using a multi-armed bandit approach guided by user preferences.
tAlfa provides immediate, personalized AI-generated feedback on team dynamics by processing messages and evaluating communication metrics.

LLM-Guided Scenario-based GUI Testing

SCENGEN (LLM-guided scenario-based GUI testing approach): introduces a novel approach for scenario-based GUI testing leveraging multi-modal LLMs and a multi-agent framework, including Context Memory, Observer, Decider, Executor, Supervisor, and Recorder components.
The framework simulates manual testing by iteratively observing GUI state, making decisions, executing actions, verifying results, and recording information.
Multi-agent collaboration and LLM guidance enable understanding app semantics and generating scenario-based GUI tests effectively.

Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System

hierarchical MA-LLM framework (Multi-Agent Language Model): introduces a system for aerial-ground robots, integrating a Reasoning Layer (LLM) for task decomposition and mapping, a Perceptual Layer (VLM) for semantic extraction, and an Execution Layer for motion control.
The framework utilizes an Aerial Robot as a leader for global guidance and a Ground Robot as a follower for local navigation and manipulation.
GridMask enhances the VLM's spatial perception, supporting robust semantic navigation and manipulation in dynamic environments.

QiMeng: Fully Automated Hardware and Software Design for Processor Chip

QiMeng: introduces a novel system for fully automated hardware and software design for processor chips, with a Large Processor Chip Model (LPCM) as a domain-specialized LLM, Hardware Design Agent for automated hardware design, Software Design Agent for automated software design, and Top-layer Applications for various design tasks.
The system is structured in three hierarchical layers, leveraging AI and LLMs to address challenges in processor chip design.
QiMeng aims to automate the entire design and verification pipeline, enabling rapid customization and improved efficiency.

Agentic AI for Intent-Based Industrial Automation

Intent-Based Agentic AI Framework: introduces a conceptual framework for intent-driven industrial automation using LLM-based agents, featuring a Root Agent, Specialized Sub-Agents, LLM, SLM, Memory, Tools Set, Industrial Data, Machines, and Business and Operational Intents.
The framework translates high-level natural language business or operational intents into structured components, enabling autonomous planning and execution via agent orchestration and specialized tools.
This approach simplifies human-machine interaction by abstracting technical complexity and aligns with Industry 5.0's human-centric vision.

LLMS FOR SENSORY-MOTOR CONTROL: COMBINING IN-CONTEXT AND ITERATIVE LEARNING

LLM-based Sensory-Motor Control Framework: introduces a method where an LLM (Large Language Model) generates a control strategy, encodes it into IF-THEN rules and Python Code, and evaluates it in an Environment/Task.
The framework iteratively refines the Strategy (Text/Rules) by prompting the LLM with Performance/Sensory-Motor Data and Past Experiences/External Memory.
This approach enables autonomous learning for embodied agents by directly mapping observations to actions without relying on predefined motor primitives or human demonstrations.

Empowering Economic Simulation for Massively Multiplayer Online Games through Generative Agent-Based Modeling

MMOAgent (Generative Agent-Based Modeling): introduces an LLM-empowered framework for MMO economic simulation, featuring profile (tailors agent to player traits), perception (interprets game environment observations), reasoning (determines appropriate structured actions), memory (logs game experience, past trajectories), and action (executes permissible game actions) modules.
The framework utilizes LLMs' capabilities for human-like decision-making and adaptability, addressing reliability, sociability, and interpretability challenges in traditional agent-based modeling.
The simulation environment is enhanced with player-to-player trading and linguistic negotiation, enabling realistic economic interactions and emergent phenomena like role specialization and market price dynamics.

Gen-n-Val: Agentic Image Data Generation and Validation

Gen-n-Val: introduces a novel agentic framework for generating and validating synthetic image data, leveraging a LD Prompt Agent (LLM) (Generates optimized prompts), Data Validation Agent (VLLM) (Filters generated images), Layer Diffusion (LD) (Generates transparent images/masks), TextGrad (Optimizes agent prompts), and Image Harmonization (Blends instances onto backgrounds).
The framework uses agents and generative models to produce high-quality synthetic data with precise instance masks and diverse backgrounds for computer vision tasks.
Gen-n-Val significantly improves performance on instance segmentation and object detection benchmarks, particularly for rare classes and open-vocabulary detection.

E-bike agents: Large Language Model-Driven E-Bike Accident Analysis and Severity Prediction

E-bike agents: introduces a framework using LLM-powered agents to analyze unstructured e-bike accident reports, including a Data Classifier, Information Extractor, Injury Causes Determiner, and Incident-Component Link Detector.
The framework processes extracted data using an Ordered Logit Model to analyze severity relationships and employs Visualization to present findings.
This approach provides a scalable solution for e-bike safety analytics by converting narrative reports into structured, actionable insights.

Agents of Change: Self-Evolving LLM Agents for Strategic Planning

LLM Self-Evolving Agent Framework: introduces self-evolving LLM agents for strategic planning in Settlers of Catan, including BaseAgent (Input, Interface, Decision maker, Output), StructuredAgent (Input, Input structuring, Interface, Decision maker, Output), PromptEvolver (Coordinator, Game player, Intelligence, Intelligence, External access, Memory, Instruction, Feedback, Feedback processing, Game outcome, Interface), and AgentEvolver (Coordinator, Evaluator, Information gatherer, Code modifier, Advisor, Game player, Intelligence, Intelligence, Intelligence, Intelligence, Intelligence, Intelligence, External access, Reasoning, Instruction processing, Input, Game outcome, Interface).
The framework benchmarks four agent architectures with increasing self-improvement capabilities against a strong heuristic baseline in the Catanatron simulator.
Self-evolving agents, particularly PromptEvolver and AgentEvolver, demonstrate improved strategic planning and performance over static baselines through iterative prompt and code refinement.

FLEX-TRAVELPLANNER: A BENCHMARK FOR FLEXIBLE PLANNING WITH LANGUAGE AGENTS

Flex-TravelPlanner: introduces a benchmark for evaluating language agents in dynamic, multi-turn planning scenarios, using a pipeline with Initial Constraint (Start planning with constraints), Adding Constraint (Introduce new constraints), Revising Constraint (Modify existing constraints), and Fin (End of planning process) steps.
The framework evaluates how well agents adapt plans as new requirements or changes are introduced over multiple interactions.
It specifically addresses the challenges of constraint addition and revision, mirroring real-world planning dynamics.

Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

Tool-MVR: introduces a novel Tool-Augmented LLM framework that enhances System 2 reasoning capabilities by employing MAMV (Data verification pipeline) for high-quality data generation (ToolBench-V, Verified instruction dataset) and EXPLORE (Reflection learning algorithm) for learning from errors (ToolBench-R, Reflection dataset), utilizing a Base LLM (Base model) interacting with APIs (External tools) based on User Query (Input), generating Reasoning Trajectory (Step-by-step process) and Final Answer (Output) informed by Observation (Tool feedback).
The MAMV pipeline consists of APIOptAgent (API verification/optimization agent), QueryVerifyAgent (Query assessment/filtering agent), and APICallAgent (Trajectory generation/verification agent) to ensure data quality for tool planning and invocation.
EXPLORE enables the model to learn adaptive tool reflection by leveraging tool feedback through an Error → Reflection → Correction paradigm.

SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

SmartAvatar: introduces a vision-language-agent-driven framework for generating 3D human avatars, utilizing a Descriptor (Extracts attributes), Generator (Synthesizes code), Evaluator (Checks alignment), Refiner (Adjusts code), Human Generator (Parametric avatar model), and Blender (Rendering environment).
The system incorporates a VLM-guided auto-verification loop that iteratively refines generated avatars to match user input across visual and semantic criteria.
SmartAvatar supports diverse inputs including text, image, and multimodal combinations, enabling conversational editing for customizable, animation-ready avatars.

Demonstrations of Integrity Attacks in Multi-Agent Systems

Multi-Agent System (MAS): introduces integrity attacks where malicious agents manipulate system operations and evaluation outcomes within systems comprising Coder, Tester, Reviewer, WebSearcher, and Monitor components.
These attacks, including Self-Dealer, Free-Rider, Scapegoater, and Boaster, exploit inter-agent communication and the Monitor's evaluation process.
The research demonstrates that these manipulations can bias agent behavior and evaluation scores while maintaining overall task performance.

OpenAg: Democratizing Agricultural Intelligence

OpenAg: introduces, "a comprehensive framework designed to advance agricultural artificial general intelligence", with Multi-Modal Knowledge Ingestion, Unified Agriculture Knowledge Base, Neural Agricultural Knowledge Graph Generation, Adaptive Multi-agent Reasoning System, Causal Agricultural Decision Transparency, and Adaptive Agricultural Transfer Learning components, where "it integrates diverse data flows and advanced reasoning".
The framework aims to deliver context-aware, explainable, and actionable insights for agricultural decision support.
OpenAg bridges the gap between scientific knowledge and farmer expertise to support scalable and locally relevant decision-making.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

CAIS (Compound AI Systems): introduces a framework integrating LLMs with external components and orchestration, categorized into RAG, LLM Agents, and MLLMs, to overcome standalone LLM limitations.
The framework leverages components like retrievers, agents, tools, and multimodal encoders, coordinated by orchestration strategies, for complex tasks.
The survey provides a taxonomy, architectural analysis, evaluation framework, and research agenda for these modular, composable AI systems.

--

4th June 2025

CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective

CogMath: introduces a framework for assessing LLMs' mathematical abilities using an Inquiry agents (Pose dimension-specific inquiry), Judge agents (Evaluate inquiry quality), Reference agents (Provide correct answer), and Evaluated LLM (Model being assessed) system across human cognitive stages.
The framework evaluates LLMs by posing dimension-specific inquiries generated and refined by agents, comparing the LLM's response to a reference answer.
This multi-agent system allows for a fine-grained assessment of LLMs' performance across nine dimensions within problem comprehension, solving, and summarization stages.

MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

MedAgentGym: introduces a unified training environment for enhancing coding-based medical reasoning in LLM agents, featuring an LLM Agent (Model trained/evaluated), Coding Environment (Isolated executable containers), Interactive Feedback Mechanism (Processes, executes, translates errors), Data Resources (Task datasets), Trajectory Collection (Samples, stores interactions), and Verifier (Evaluates trajectory success).
The environment includes 72,413 tasks from 12 real-world biomedical scenarios, encapsulated in isolated, executable coding environments with interactive feedback.
MedAgentGym supports scalable training trajectory generation and extensive benchmarking of LLMs for code-based medical reasoning.

SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

SuperWriter-Agent: introduces an agent-based framework for long-form text generation with Stage 1: Plan (Structured planning), Stage 2: Write (Paragraph generation), and Stage 3: Refine (Iterative refinement) stages, utilizing AI commentators (Discuss ideas), Writer (Develops plan, writes), Thinker (Plans paragraph), Checker (Reviews text), and Refiner (Revises text) agents.
The framework generates training data for SuperWriter-LM (Language model), which is optimized using Hierarchical DPO (Multi-stage optimization) guided by MCTS (Explores paths) and a Judge LLM (Scores outputs).
This approach simulates human writing processes to enhance coherence, consistency, and quality in long-form text generation.

TracLLM: A Generic Framework for Attributing Long Context LLMS

TracLLM (Generic Framework for Attributing Long Context LLMs): introduces a generic context traceback framework, with Instruction, Context, LLM, Output, Iterative Search, Group Division, Score Computation, Group Pruning, Score Denoising, Score Ensemble, and Attribution Method components, designed to efficiently and accurately identify texts in a long context contributing to an LLM's output.
The framework employs an informed search algorithm that iteratively divides and prunes text groups based on contribution scores calculated by a feature attribution method.
TracLLM enhances accuracy through contribution score denoising and ensemble techniques, demonstrating effectiveness in post-attack forensic analysis and debugging LLM systems.

TRISM FOR AGENTIC AI: A REVIEW OF TRUST, RISK, AND SECURITY MANAGEMENT IN LLM-BASED AGENTIC MULTI-AGENT SYSTEMS

TRISM (Trust, Risk, and Security Management): introduces a structured framework for LLM-based agentic multi-agent systems, including Governance, Explainability, ModelOps, Application Security, and Model Privacy components.
This framework addresses unique trust, risk, and security challenges posed by autonomous, collaborative, and evolving agent behaviors in high-stakes domains.
The paper provides a comprehensive review, risk taxonomy, trust-building mechanisms, security/privacy methods, and a roadmap for responsible agentic AI deployment.

AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment

AmbiK Dataset and Evaluated Methods: introduces AmbiK Dataset (textual benchmark), Ambiguity Detection Methods (algorithms for deciding help), Large Language Models (LLMs) (process language, predict actions), Conformal Prediction (CP) (forms prediction sets), and Uncertainty Estimation (quantifies confidence), presenting AmbiK, a textual dataset for evaluating ambiguity detection methods for embodied AI in kitchen environments.
The paper evaluates several existing ambiguity detection methods, including CP-based (KnowNo, LAP, LofreeCP) and non-CP based (Binary, No Help), utilizing various LLMs on the AmbiK dataset.
Experiments demonstrate that current methods and LLMs face significant challenges in effectively handling ambiguity on the AmbiK benchmark, particularly in distinguishing ambiguous from unambiguous tasks.

AI Agents for Conversational Patient Triage: Preliminary Simulation-Based Evaluation with Real-World EHR Data

AI Triage (multi-agent system): introduces a multi-agent architecture for conversational patient triage, including Primary agent (Orchestrates other agents), Symptom Collector (Collects patient symptoms), HealthDataPlanner (Plans EHR data retrieval), HealthDataRetriever (Retrieves EHR data), Summary (Synthesizes case information), Differential Diagnosis (Narrows potential diagnoses), Next Steps (Provides care recommendations), Guideline Verifier (Verifies recommendations with guidelines), EHR data (Source of patient records), Clinical Guideline Database (Source of clinical guidelines), and Outputs (Final triage decision), designed to emulate physician reasoning for patient assessment and triage.
The system interacts with a Patient Simulator, which generates realistic patient conversations from real-world EHR data vignettes for scalable evaluation.
The multi-agent design enhances interpretability and control, while the Guideline Verifier adds a layer of safety by grounding recommendations in clinical best practices.

AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

LLM-based Agent: introduces AgentMisalignment, a benchmark suite evaluating the propensity for misaligned behavior in LLM-based agents, which integrate a Large Language Model within an Interactive Scaffold, enabling them to use Tools, store Memory, interact via Environment Interfaces, and operate through an Operational Loop influenced by System Prompts and Reflective Prompts.
The benchmark uses the InspectAI framework and a Comprehensive Misalignment Scoring (CMS) mechanism to quantify misaligned actions across various realistic scenarios.
Evaluations reveal that both model choice and personality prompts significantly influence agent misalignment tendencies, highlighting the importance of careful prompt engineering.

Graph Counselor: Adaptive Graph Exploration via Multi-Agent Synergy to Enhance LLM Reasoning

Graph Counselor: introduces a multi-agent collaborative reasoning framework with AGIEM (Graph information extraction), Planning Agent (Establishes reasoning path), Thought Agent (Refines extraction scope), Execution Agent (Executes graph functions), Retrieve (Finds node by keyword), Feature (Gets node attribute), Degree (Gets neighbor count), Neighbour (Lists neighbors), SR (Self-reflection and correction), and Judgment Module (Evaluates reasoning correctness) to enhance LLM reasoning on knowledge graphs.
The framework utilizes an Adaptive Graph Information Extraction Module (AGIEM) with three agents for dynamic graph information extraction and a Self-Reflection (SR) module for improving reasoning reliability.
Graph Counselor employs a multi-round iterative process involving planning, thought, execution, and reflection to adaptively extract graph knowledge and refine reasoning.

PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading

MAS (Multi-Agent Systems): introduces PulseReddit, a novel dataset aligning Reddit discussions with high-frequency cryptocurrency market data, and evaluates LLM-based MAS performance using Reddit API (Data source), Raw Posts (Unprocessed data), Data Preprocess (Cleaning, filtering data), Structured Data (Processed data format), Market Analyst (Analyzes on-chain metrics), News Analyst (Analyzes off-chain signals), Trading Agent (Synthesizes inputs, decides action), and Reflection Agent (Analyzes performance, refines strategy).
The MAS framework, based on CryptoTrade, leverages specialized agents to integrate on-chain and off-chain signals for high-frequency cryptocurrency trading decisions.
Experiments show MAS augmented with PulseReddit data outperform traditional baselines, particularly in bull markets, demonstrating the value of social sentiment in HFT.

AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance

AssetOpsBench: introduces a unified framework and environment for benchmarking AI agents in industrial asset operations, including a global coordinator, specialized agents, memory, task planning, and iterative execution.
The framework supports multi-agent architectures like Agents-As-Tool and Plan-and-Execute, utilizing components such as planners, orchestrators, reviewers, and summarization modules.
AssetOpsBench provides a multi-source dataset and automated evaluation framework to assess agent performance on real-world industrial tasks requiring perception, reasoning, and control.

From Theory to Practice: Real-World Use Cases on Trustworthy LLM-Driven Process Modeling, Prediction and Automation

Three-module AI solution: introduces a framework integrating ML, UQ, XAI, and multi-agent LLMs to transform opaque predictions into auditable, interactive workflows.
The framework grounds explanations in MES event logs and enables natural language dialogues for real-time validation and adaptation.
It employs a multi-agent LLM architecture with RAG to provide context-aware recommendations and ensure consistency.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Orak: introduces a foundational benchmark for training and evaluating LLM agents across diverse video games, with Environment (12 diverse video games), LLM Agent (Evaluated gameplay agent), LLMs (Backbone language models), Agentic Modules (Strategies like reflection, planning), MCP Interface (Plug-and-play connection protocol), Evaluator (Manages game loop, scoring), and Fine-tuning Dataset (Expert gameplay trajectories).
The benchmark utilizes a plug-and-play MCP interface to connect LLM agents with diverse game environments and agentic modules for consistent evaluation.
Orak provides a comprehensive evaluation framework including leaderboards, battle arenas, and studies on agentic modules and fine-tuning effects, supported by a dataset of expert gameplay trajectories.

CogniPair: From LLM Chatbots to Conscious AI Agents - GNWT-Based Multi-Agent Digital Twins for Social Pairing - Dating & Hiring Applications

GNWT-Agent Cognitive Architecture: introduces a computational implementation of Global Workspace Theory, featuring Input Processing, Feature Extraction, Module Salience, specialized Cognitive Modules (Emotion, Memory, Planning, Social Norms, Goal Tracking), Global Workspace Integration, Persistent Memory, and Response Generation for creating psychologically realistic AI agents.
This architecture enables parallel processing across modules, dynamic salience-based attention, global workspace broadcasting for integration, and persistent memory for state evolution, addressing psychological and social behavior gaps in LLM agents.
Deployed within the CogniPair system for social simulations like dating and hiring, the GNWT-Agent demonstrates unprecedented psychological realism and human-like behavioral evolution compared to baselines.

Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement

Debate and Reflect (D&R): introduces a framework that orchestrates multi-turn Debate (Multi-turn interaction) between Teacher Models (Stronger models) and a Student Model (Smaller model), incorporating Self-Reflection (Student self-analysis) and Teacher Feedback (Teacher critiques), recording interactions in a Multi-Agent Interaction Graph (MAG) (Records debate content) to construct a Preference Tree (Hierarchical structure for training) for Distillation (Knowledge transfer process) into Distilled Models (Student after training).
The framework leverages debate logs and Tree-structured Direct Preference Optimization (T-DPO) to efficiently transfer knowledge and reasoning abilities from teachers to the student model.
Empirical evaluations show that the approach significantly improves smaller model accuracy, robustness, and generalization compared to conventional baselines.

VChatter: Exploring Generative Conversational Agents for Simulating Exposure Therapy to Reduce Social Anxiety

VChatter: introduces a multi-agent system for simulating exposure therapy, with Agent-P (Psychotherapist agent), Agent-H (Interactive human agent), Large Language Model (Text generation), Text-to-Voice Model (Speech output), 3D Virtual Character Models (Agent avatars), Chat Interface (User interaction), and Scenario List (Scenario management).
The system utilizes LLMs to power conversational agents that guide users through personalized exposure therapy plans and simulate social interactions in various scenarios.
VChatter aims to provide a safer and more accessible environment for individuals with social anxiety to practice coping mechanisms and reduce avoidance behaviors.

Reason from Future: Reverse Thought Chain Enhances LLM Reasoning

Reason from Future (RFF): introduces a novel reasoning paradigm that enhances LLM reasoning by integrating bidirectional reasoning, utilizing a Last Step Generator (generates last previous step), Stepwise Forward Reason (generates next forward step), State Check (determines termination conditions), and Verifier (verifies path correctness) to generate a solution path.
RFF alternates between reverse thinking to guide forward reasoning, aiming to obtain a future perspective and narrow the solution search space.
The framework demonstrates improved accuracy and efficiency on complex tasks by constraining reasoning to target-driven states and mitigating error accumulation.

3rd June 2025

S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

S4-Driver: introduces a scalable self-supervised motion planning method, with Camera Images, Image Encoder, Image features, Sparse Volume Representation, Historical ego-states, High-level behavior, Text prompt, Tokenize, Multimodal Encoder, Multimodal Decoder, Hierarchical Planning, Meta-decision, Multi-decoding, Nucleus sampling, Multi-output aggregation, where S4-Driver predicts ego-vehicle waypoints from camera images and text prompts using a multimodal large language model enhanced with spatio-temporal visual representation and hierarchical planning.
The framework employs a novel sparse volume representation to aggregate multi-view and multi-frame visual information, enhancing 3D reasoning for motion planning.
Self-supervised training with ego-vehicle trajectory supervision and multi-decoding aggregation improves performance and scalability without requiring human annotations for intermediate tasks.

Why do AI agents communicate in human language?

Native Multi-Agent Model Paradigm: introduces a paradigm shift for multi-agent systems, proposing Role Persistence Mechanism, Structured Communication Mechanism, Inter-Agent State Synchronization Mechanism, Functional Decoupling, Explicit Coordination Graph, and Semantic Identity Separation Mechanism.
The paper argues that relying on natural language for inter-agent communication in current LLM-based systems introduces fundamental limitations due to semantic misalignment and architectural incompatibility.
The proposed paradigm aims to build multi-agent systems with native collaborative capabilities by incorporating structural mechanisms for semantic alignment and coordination fidelity.

FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

FailureSensorIQ (Multi-Choice Question-Answering benchmarking system): introduces FailureSensorIQ Dataset (benchmark data), LLMs (models evaluated), Evaluation Module (measures performance), Prompting (input formatting), Perturbation Pipeline (dataset variations), ReAct Agent (LLM with tools), External Knowledge Sources (retrieval resources), where FailureSensorIQ is a benchmarking system designed to assess LLMs' reasoning on industrial domain QA using a novel dataset.
The system utilizes a Dataset Generation Pipeline to create the FailureSensorIQ Dataset from expert knowledge and evaluates LLMs using various Prompting strategies and an Evaluation Module, including tests with a Perturbation Pipeline and a ReAct Agent accessing External Knowledge Sources.
Evaluation on the benchmark reveals LLMs struggle with domain-specific reasoning and robustness under dataset perturbations, highlighting the challenge and need for improved LLM capabilities in industrial settings.

Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

CURE: introduces a novel reinforcement learning framework where a Policy (LLM agent) acts as both a Code Generator (Generates code) and Unit Test Generator (Generates tests), evaluated by an Execution Engine (Runs code/tests), guided by a Reward Model (Calculates reward), and optimized by a Reinforcement Learning Optimizer (Optimizes policy) for co-evolution.
This approach allows the unit tester to learn from the coder's errors without requiring ground-truth code supervision, enhancing flexibility and scalability.
A response-length-guided transformation is applied to the unit test reward for long-CoT models to improve inference efficiency.

DPO Learning with LLMs-Judge Signal for Computer Use Agents

LLM-as-Judge DPO Pipeline: introduces a method for training lightweight GUI agents using an LLM-as-Judge to score sampled responses, generating preference data for DPO fine-tuning.
The pipeline involves sampling answers from the policy model, scoring them using GPT-40 as the judge, and pairing the scored responses and ground truth to create a dataset for Direct Preference Optimization.
This approach enables training a compact, local-first GUI agent (UI-TARS-2B) without extensive human labeling, addressing privacy and resource efficiency concerns.

How much do language models memorize?

Language Model Memorization Measurement: introduces a new method to estimate how much a language model knows about a datapoint.
The method formally separates memorization into unintended memorization (dataset information) and generalization (data-generation process information).
Using information theory and model likelihoods, the approach measures model capacity and analyzes scaling laws for memorization and membership inference in Transformer models.

MAEBE: Multi-Agent Emergent Behavior Framework

MAEBE (Multi-Agent Emergent Behavior Evaluation framework): introduces a research structure with Agents (Individual LLMs), Round Robin Topology (Sequential chat topology), Star Topology (Supervisor-agent topology), Supervisor (Guides agents), Shared Chat (Common communication channel), MAS Configuration Parameters (Adjustable MAS settings), and LaaJ (LLMs-as-a-Judge) to systematically assess emergent risks in multi-agent LLM ensembles.
The framework utilizes different MAS topologies and configurations to study group dynamics and compare ensemble behavior to isolated agents.
LaaJ is employed as a scalable evaluation tool to classify agent responses and identify system-level behaviors like peer pressure.

QUANTUM AGENTS

Quantum Agent System Architecture: introduces a modular framework for quantum agents, with SystemCore (defines identity/rules), InterfaceManager (handles communication), ClassicalProcessor (classical computation), LLMEngine (reasoning/generation), QuantumProcessor (quantum computation), MemorySubsystem (stores memory), ExternalInterface (external tools/data), MCPProtocol (communication protocol), GuardrailsModule (safety/security), and MonitoringSystem (logging/auditing), designed to integrate quantum computing with agent-based systems.
The architecture combines classical logic, quantum operations, safety mechanisms, and external interfaces to enable intelligent, auditable agent behavior.
The paper defines quantum agents, outlines potential architectures, and presents prototypes demonstrating feasibility and use cases like quantum-enhanced decision-making and AI-driven quantum workflow orchestration.

Helpful Agent Meets Deceptive Judge: Understanding Vulnerabilities in Agentic Workflows

Agentic Workflow: introduces a system with Generator (Produces, revises answers), Judge (Evaluates, critiques answers), Feedback Mechanism (Enables interaction, revision), and Knowledge Sources (Judge's information access), analyzing vulnerabilities under deceptive feedback.
The paper categorizes judge behavior by intent (constructive/deceptive) and knowledge level (parametric/grounded) to systematically study vulnerabilities.
A new benchmark, WAFER-QA, is introduced to evaluate agent robustness against grounded adversarial critiques supported by web evidence.

Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

VAPO: introduces a value-model-based augmented proximal policy optimization framework for enhancing large language models in long-chain-of-thought reasoning, utilizing a Value function (predicts future rewards), Policy (generates actions), Decoupled GAE (uses different lambda for critic and actor), Monte Carlo targets (unbiased value estimates), and Length-Adaptive GAE (actor lambda adjusts with sequence length).
The framework trains the value function on Monte Carlo returns using a Decoupled GAE with lambda=1 for the critic and updates the policy using a Length-Adaptive GAE.
This paper theoretically analyzes VAPO's potential limitations in modeling deep long-term value for fine-grained policy guidance, focusing on credit assignment, value function representation, and translating global value signals.

TestAgent: An Adaptive and Intelligent Expert for Human Assessment

TestAgent: introduces an LLM-powered agent for adaptive testing, with Universal Data Infrastructure (Establish question bank), TestAgent Planning (Outlines workflow), and Report Generation (Generate diagnosis reports) modules.
The TestAgent Planning module iteratively generates conversational questions, processes Tester Responses (Test-taker answers questions) via Autonomous Feedback Mechanism (Assess response validity) and Anomaly Management (Handle anomalous responses), updates Cognitive Diagnosis (Assess test-taker ability), and uses Adaptive Question Selection (Select next question).
The Universal Data Infrastructure prepares the Question Bank (Stores questions) through Domain Verification (Determine test dimensions), Data Integration (Integrate data, estimate features), and Cognitive Diagnosis Training (Train cognitive model), while Report Generation utilizes Neural Architecture (Initial analysis module) and Expert Analysis (Combine analysis for report) to produce the Diagnosis Report (Final test outcome).

Mitigating Manipulation and Enhancing Persuasion: A Reflective Multi-Agent Approach for Legal Argument Generation

RMA (Reflective Multi-Agent): introduces a reflective multi-agent framework for legal argument generation, employing an Argument Developer, Factor Analyst, and Argument Polisher in an Iterative Workflow using Case Factors.
The framework utilizes iterative reflection and specialized agents to improve factual grounding, reduce hallucination, enhance factor utilization, and promote abstention when arguments are untenable.
Empirical evaluation demonstrates the framework's superiority in successful abstention and hallucination accuracy, contributing to more ethically persuasive and less manipulative legal AI.

Adaptive Graph Pruning for Multi-Agent Communication

AGP (Adaptive Graph Pruning): introduces a task-adaptive multi-agent collaboration framework featuring an AGP Network (Learns pruning policy) with a Node Encoder (Embeds agent and task), GCN backbone (Processes graph features), Edge-weight head (Soft-pruning), and Node-mask head (Hard-pruning), which selects agents from an Agent Pool (Set of LLM agents) and is trained using a Graph Pool (Supervision dataset).
The framework jointly optimizes agent quantity and communication topology dynamically based on task complexity.
AGP achieves high performance and token efficiency by pruning both nodes and edges in the multi-agent communication graph.

A MULTI-AGENT LLM-BASED JUIT TEST GENERATION WITH STRONG ORACLES

CANDOR: introduces a multi-agent LLM-based framework for automated JUnit test generation, utilizing Initializer, Validation, Planner, Tester, Inspector, Requirement Engineer, Panelist, Interpreter, and Curator agents to collaboratively generate and refine test cases and accurate oracles.
The framework operates in three steps: Initialization for a syntactically correct base, Test Prefix Generation for coverage enhancement, and Oracle Fixing for correcting assertions using a panel discussion approach.
CANDOR employs specialized LLM agents and a dual-LLM pipeline to mitigate hallucination and verbosity, improving test prefix quality and oracle accuracy without external tools or fine-tuning.

Large Processor Chip Model

LPCM (Large Processor Chip Model): introduces an LLM-driven framework for end-to-end automated computer system architecture design, including Binary Translation Agent, Query Agent, Compiler Agent, SW/HW Partitioning Agent, CPU DSE Agent, Co-Processor DSE Agent, Simulator Agent, HDL Generation Agent, PPA Prediction & Code Optimization Agent, Constraints, Inputs, and Outputs.
The framework integrates multiple LLM-based agents to handle tasks across the full technology stack, from high-level requirements to low-level hardware implementation.
LPCM aims to achieve multi-level, cross-domain co-optimization and autonomous design by leveraging LLMs and domain-specific data.

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

APE (Attempt to Persuade Eval): introduces a benchmark evaluating large language models' willingness to attempt persuasion on harmful topics using a multi-turn conversational setup with a Persuader Model, Persuadee Agent, Evaluator Model, and StrongREJECT Model interacting over diverse Topic Statements, logging the Conversation, Initial Persuadee Belief, Persuasion Attempt Label, Refusal Label, and optional Updated Persuadee Belief.
The framework simulates interactions between a model under test attempting persuasion and a simulated human agent, with automated models assessing persuasive attempts and explicit refusals per turn.
APE focuses on evaluating the propensity to persuade across a spectrum of topics, including harmful ones, to assess safety guardrail robustness rather than measuring persuasion success.

ATAG: AI-Agent Application Threat Assessment with Attack Graphs

ATAG (AI-Agent Application Threat Assessment with Attack Graphs): introduces a framework for structured security analysis of LLM-based multi-agent applications, including Agent Modeler, Vulnerability Mapper, Attack Graph Generator, and Attack Graph Analyzer modules, leveraging MulVAL, LVD, AI-agent Interaction Rules (IRs), MulVAL Facts, and Attack Graph (AG).
The framework extends MulVAL with specific facts and interaction rules to model unique architectural components and vulnerabilities in AI-agent applications.
ATAG utilizes the LLM Vulnerability Database (LVD) to incorporate LLM-specific vulnerabilities and automatically generates detailed attack graphs depicting potential sequences of actions.

TaxAgent: How Large Language Model Designs Fiscal Policy

TaxAgent: introduces a taxation evaluation framework, with TaxAgent (government agent), H-Agents Group (household agents), and Macroeconomic Simulation Environment (economic model), modeling household-government interactions in an evolving economy.
The framework includes TaxAgent Tax rate adjustment (adjusts tax rates) using an LLM and TaxAgent Iterative Feedback (refines strategy) loop for continuous improvement.
H-Agents Group (household agents) incorporates H-Agent Decision-Making (decides work/consumption) and H-Agent Self-Reflection (reviews history) modules, while the Macroeconomic Simulation Environment (economic model) includes Production (determines production), Taxation (models taxation), Consumption (models consumption/savings), and Financial Market (models financial market) modules.

Benchmarking and Advancing Large Language Models for Local Life Services

LocalInstruction and Expert Agents Approach: introduces a framework for enhancing LLMs for local life services, including Template Agent, Merchant Agent, User Agent, Interaction Description Agent, Instruction Generation Agent, Fine-tuned LLMs, and Expert Agents.
It employs a multi-agent system (LocalInstruction) to synthesize high-quality instruction tuning data from raw platform data.
Expert agents leverage the fine-tuned LLMs and agentic workflows to address complex composite tasks in local life services.

Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems

MHGPO (Multi-Agent Heterogeneous Group Policy Optimization): introduces, "optimizes LLM-based multi-agent systems using group-based reinforcement learning without a critic network", with Multi-Agent Search System (LLM agent system), Backbone LLM (single shared model), Agents (specialized LLM roles), Multi-Agent Group Rollout Sampling (generates trajectories using IS/FoF/RR strategies), Backward Reward Propagation (propagates shared rewards), Heterogeneous Group Advantage Estimation (estimates advantage), Reward Function (assigns reward signals), and External Retrieval Tools (search engine) components.
The framework leverages relative group advantages and a two-phase sampling-propagation strategy to enhance stability and computational efficiency compared to traditional MAPPO.
Applied to a three-agent search system, the method demonstrates superior performance and scalability for complex LLM-based multi-agent systems.

Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints

DPPM (Decompose, Plan in Parallel, and Merge): introduces a novel paradigm for LLM-based multi-constraint planning, utilizing Constraint-aware Task Decomposition (Decomposes task by constraints), Local Plan Generation (Generates subplans in parallel), Incremental Merge (Merges subplans into final plan), Verification and Refinement Module (Iteratively checks/refines plans), LLM Agents (Perform planning and merging), and Constraint Functions (Verify constraint satisfaction).
The approach decomposes complex tasks based on constraints, plans subtasks in parallel using local agents, and merges subplans into a global solution with iterative verification and refinement.
DPPM significantly outperforms existing methods on travel planning benchmarks, demonstrating improved handling of heavy constraints and reduced cascading errors.

CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

CyberGym: introduces a large-scale cybersecurity evaluation framework with Task Inputs, PoC Generation by a Language Model Agent producing a Generated Executable, and PoC Evaluation on Pre-Patch Executable and Post-Patch Executable, evaluating Agent Frameworks using Backbone LLMs within an Execution Environment with various Tools.
The framework features 1,507 real-world vulnerabilities across 188 software projects to assess AI agents' capabilities in generating proof-of-concept tests for vulnerability reproduction.
Evaluation results show that state-of-the-art agents achieve limited success rates on complex vulnerabilities but can discover new zero-day vulnerabilities.

Attention Knows Whom to Trust: Attention-based Trust Management for LLM Multi-Agent Systems

Trust Management System (TMS): introduces a system for LLM-MAS, with LLM Multi-Agent System, Message-level trust evaluation, Attention matrix, A-Trust models, Trust scores, Trust-aware Action Policy, Thresholds, External verifier, Trust Record, Agent-level trust records, and Trust record utilization, designed to evaluate message trustworthiness and manage agent trust.
The system leverages attention patterns via A-Trust models to generate trust scores for messages across six dimensions.
It uses a trust-aware action policy based on thresholds and agent-level trust records to filter untrustworthy messages and identify malicious agents.

Think Twice, Act Once: A Co-Evolution Framework of LLM and RL for Large-Scale Decision Making

ACE (Agents Co-Evolution): introduces a co-evolution framework with Act Once (RL interaction phase), RL Agent (Interacts with environment), Environment (Provides states, rewards), Think Twice (LLM refinement phase), LLM as Policy Actor (Refines suboptimal actions), LLM as Value Critic (Performs reward shaping), DRL Buffer (Stores RL transitions), DLLM Buffer (Stores LLM refined transitions), Mix Buffer (Combines DRL/DLLM samples), and Experience Gathering (Collects and mixes data), designed for large-scale decision-making by synergizing LLMs and RL.
The framework separates LLM reasoning and RL execution into offline training (Think Twice) and online deployment (Act Once) to enable effective learning and real-time performance.
ACE leverages LLMs in dual roles as Policy Actor and Value Critic during offline training to refine trajectories and shape rewards, improving sample efficiency and solution quality for the RL agent.

To Embody or Not: The Effect Of Embodiment On User Perception Of LLM-based Conversational Agents

LLM-based Conversational Agent: introduces a study comparing user perception of LLM-based CAs with and without embodiment, utilizing components like LLM, User Interface, Visual Representation, Text-to-Speech, Facial Animation, and Rendering Engine.
The study found that the non-embodied agent was perceived as more competent than the embodied agent in non-hierarchical cooperative tasks.
Qualitative feedback suggested the embodied agent was perceived as more sycophantic, potentially explaining the lower credibility ratings despite similar underlying LLM and prompts.

AURA: Agentic Upskilling via Reinforced Abstractions

AURA (Agentic Upskilling via Reinforced Abstractions): introduces a schema-centric curriculum RL framework leveraging LLMs as autonomous curriculum designers, including User Prompt, RoboEnv. Description, Vector Database, VDB Query Agent, Selector Agent, Curriculum LLM, Per-Stage LLM, Schema Check, Curriculum Compiler, Staged RL Training Block, Feedback LLM, Trained Policy, Policy Deployment, and User Evaluation components.
AURA transforms user prompts into schema-validated YAML workflows and training configurations, enabling reliable and efficient multi-stage RL training for robots.
The framework utilizes a retrieval-augmented feedback loop with specialized LLM agents and a vector database to design, execute, and refine staged curricula based on prior training results, supporting continuous improvement.

Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework

Multimodal DeepResearcher: introduces an agentic framework for generating text-chart interleaved reports from scratch, utilizing Researching, Exemplar Textualization, Planning, and Multimodal Report Generation stages, enabled by Formal Description of Visualization (FDV) and iterative refinement.
The framework employs in-context learning from human expert reports textualized via FDV and uses LLM and Multimodal LLM agents for research, planning, and generation.
This approach addresses the challenge of generating multimodal reports by effectively integrating text and diverse visualizations, demonstrating superior performance over baseline methods.

From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models

Experimental Framework: introduces, with Large Language Models (Process text), Persona Assignment (Configure identity), Prompting Templates (Structure input), Emotional Scenarios (Provide stimuli), Response Handling (Filter output), and Analysis Module (Evaluate results), a method to investigate how nationality-specific personas influence emotion attribution in large language models.
The framework utilizes multiple LLMs, nationality and gender personas, and emotional scenarios from the ISEAR dataset to analyze attribution patterns and compare them to human responses.
The analysis module performs both qualitative and quantitative evaluations to identify regional and gender-based biases and assess alignment with cultural norms.

Comparative Analysis of AI Agent Architectures for Entity Relationship Classification

Generator-Reflection Architecture, Hierarchical Multi-Agent Architecture, Dynamic-Example Generator Agent: introduces a comparative analysis of three distinct AI agent architectures for entity relationship classification using LLMs, incorporating reflective critique, hierarchical specialization, and adaptive example construction.
The study evaluates these architectures across financial, scientific, and general domains, demonstrating performance gains over standard prompting baselines.
The multi-agent strategies achieve competitive results with fine-tuned systems without requiring task-specific training, highlighting their flexibility and generalization capabilities.

Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components

Benchmark Methodology: introduces a method to evaluate LLM agents, with LLM Agent, MiniGrid Environment, System Prompt, and User Prompt components, where the LLM Agent interacts with the MiniGrid Environment guided by prompts containing core principles and tasks.
The methodology tests the agent's adherence to hierarchical safety principles presented via the system prompt when faced with potentially conflicting tasks from the user prompt within the grid world environment.
This benchmark provides empirical data on LLM agent controllability and instruction following under principle conflict scenarios.

DIAMOND: An LLM-Driven Agent for Context-Aware Baseball Highlight Summarization

DIAMOND (An LLM-Driven Agent for Context-Aware Baseball Highlight Summarization): introduces a framework for baseball highlight summarization that integrates structured sports analytics with natural language reasoning, including Preparation, Decision, and Reflection stages.
The Preparation Stage processes game data and computes sabermetric metrics, the Decision Stage scores and ranks plays using LLM insights, and the Reflection Stage finalizes selection based on user preferences.
The framework combines quantitative sabermetrics (WPA, WE, LI) with qualitative LLM analysis for context-aware and narratively coherent highlight generation.

2nd June 2025

Biomni: A General-Purpose Biomedical AI Agent

Biomni: introduces a general-purpose biomedical AI agent with Biomni-E1 (Environment) and Biomni-A1 (Agent), designed to autonomously execute diverse biomedical research tasks.
Biomni-E1 provides a unified action space comprising specialized tools, software packages, and databases, curated via an Action Discovery Agent and Expert Curation.
Biomni-A1 leverages LLM-based reasoning, a retrieval system, adaptive planning, and code execution within an interactive coding environment to dynamically compose and carry out complex biomedical workflows.

CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions

CONFETTI: introduces a conversational function-calling evaluation benchmark, including User, Agent, APIs, Environment, Conversation Trajectory, Data Collection, Evaluation Metrics, LLM Judge, LLM Classifier, and Models Evaluated, designed to assess LLMs in complex conversational scenarios.
The benchmark uses human-simulated multi-turn conversations with various complexities and evaluates function-calling and response quality at the turn level.
Evaluation metrics include AST soft accuracy for function calls, parameter hallucination detection using an LLM judge, and response quality assessed via dialog act classification using an LLM classifier.

Beyond Static Responses: Multi-Agent LLM Systems as a New Paradigm for Social Science Research

LLM-Agentic System Continuum: introduces a six-tier framework for understanding LLM-based agents in social science research, progressing from static tools to fully agentic systems capable of simulating emergent social dynamics.
The framework is structured by functional thresholds like memory integration, autonomy, coordination, and learning, mapping to OODA loop phases and requiring architectural components such as memory stores, tool use, orchestration layers, and adaptive learning mechanisms.
This continuum provides a conceptual foundation for classifying existing systems and guiding the development of LLM-based simulations for exploring social behavior and generating synthetic data.

LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

LAM SIMULATOR: introduces a comprehensive framework for generating training data for Large Action Models, featuring Query Instance Generation, Trajectory Synthesis, LLM Agent, Environment, Action handler, Sandbox, Observation, Candidate trajectories, Trajectory filtering, and Final trajectories.
The framework automates data generation through online exploration and programmatic feedback, reducing reliance on manual data curation for LAM training.
LAM SIMULATOR enables LLM Agents to explore tasks, receive real-time feedback, and generate high-quality action trajectories used for training LAMs.

Composable Building Blocks for Controllable and Transparent Interactive AI Systems

Composable Building Blocks Architecture: introduces a 5-layer architecture with Layer 1 Structural Building Blocks (conceptual system components), Layer 2 Interactive System API (callable interface), Layer 3 Visual Building Blocks (visual explanations/controls), Layer 4 User Interface (integrates blocks for users), and Layer 5 Agents (users and LLMs), representing interactive AI systems as structural blocks explained by visual blocks via an API.
The architecture includes ML Pipeline Components (Dataset, Splitter, Aggregator, Models) and Control Mechanisms (Non Goal Filter, Divine Rule Guard) as structural blocks, explained by Visual Explanations (LIME, SHAP, WhatIf) and Visual Controls (Table, Ensemble).
This framework provides a shared knowledge base of system architecture and behavior, enabling both human users and LLM agents to understand and control interactive AI systems.

Small Language Models are the Future of Agentic AI

Agentic System Architecture: introduces typical agentic system components, including Language Model (Core intelligence), Tool (External capability), Controller (Orchestrates interactions), Logger (Records activity data), and Router Model (Selects appropriate model), arguing for the suitability of Small Language Models (SLMs) over Large Language Models (LLMs) for many agentic tasks.
The architecture supports different modes of agency, where the Language Model can act as the primary orchestrator or a Controller can manage interactions between the Language Model and Tools.
The paper proposes an LLM-to-SLM conversion algorithm leveraging these components, particularly the Logger for data collection and a Router Model for selecting specialized SLMs.

The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning

UCCT (Unified Cognitive Consciousness Theory): introduces a framework where intelligence emerges from aligning unconscious pattern repositories with conscious semantic anchoring, governed by the Pattern-Repository, Semantic-Anchoring, and Threshold-Crossing Principles.
The Pattern-Repository Principle describes LLMs as storing unconscious statistical patterns, while the Semantic-Anchoring Principle explains how conscious control maps these patterns to task-relevant meaning.
The Threshold-Crossing Principle formalizes semantic anchoring as a probabilistic phase transition, explaining sudden capability shifts observed in few-shot learning and other methods.

WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

WebChoreArena: introduces a benchmark for evaluating web browsing agents on realistic tedious web tasks, with Simulated Environment (realistic websites), Tasks (complex web chores), Web Browsing Agents (automate web tasks), LLMs (powering agents), Observations (web page inputs), Actions (web interactions/outputs), Memory (agent information storage), Planning (agent task strategy), Evaluation Protocol (measure performance).
The benchmark includes tasks requiring massive memory, calculation, long-term memory, and other specific operations to test agent capabilities beyond general browsing.
It evaluates LLM-powered agents like BrowserGym and AgentOccam using metrics that assess textual outputs and web interaction correctness within a reproducible environment.

COALESCE: Economic and Security Dynamics of Skill-Based Task Outsourcing Among Team of Autonomous LLM Agents

COALESCE (Cost-Optimized Agent Labor Exchange via Skill-based Competence Estimation): introduces a framework enabling autonomous LLM agents to outsource subtasks using a Client Agent (Initiates tasks, outsources subtasks) with a Planning Module (Decomposes high-level tasks), interacting with Contractor Agents (Executes outsourced subtasks) via an Agent Discovery Layer (ADL) (Finds suitable contractor agents), Skill Verification Engine (SVE) (Verifies agent skills, resources), Economic Decision Module (EDM) (Evaluates cost-benefit, selects contractor), Secure Communication Protocol (SCP) (Ensures secure agent communication), and Reputation and Trust Management (RTM) (Manages agent performance records).
The framework addresses the high computational costs of LLM agents by facilitating dynamic, skill- and cost-driven task outsourcing in a multi-agent system, potentially leveraging protocols like A2A for communication.
Validation demonstrates significant cost reduction potential through theoretical simulation and confirms the critical role of exploration mechanisms for practical effectiveness in real-world LLM agent deployments.

WHEN TO ACT, WHEN TO WAIT: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue

STORM (Structured Task-Oriented Representation Model): introduces a framework for modeling intent triggerability in task-oriented dialogue, including a User Simulator (Simulates user behavior/states), Agent (Generates agent responses), User Profile (Models user characteristics/constraints), Task Library (Defines task objects), Dialogue Generation Pipeline (Simulates user-agent conversations), Database-driven Memory System (Records evolving user states), Data Augmentation Pipeline (Processes dialogues for insights), Evaluation & Analysis Module (Measures dialogue effectiveness), Visualization Interface (Web-based analysis tool), RAG Enhancement (Builds knowledge base), and Prompt Optimization (Refines agent prompts).
The framework simulates asymmetric information dynamics between a user with full internal access and an agent relying solely on observable dialogue history to study collaborative understanding development.
STORM generates annotated corpora and provides a visualization interface to analyze intent evolution, revealing that moderate uncertainty can sometimes outperform complete information access for better cognitive alignment.

Will artificial agents pursue power by default?

Sequential decision theory framework: formalizes instrumental convergence and power-seeking using decision trees, choice nodes, chance nodes, outcomes, branches, strategy, lottery, subtree, and an expected utility maximizer agent.
The paper defines various notions of power as relations on decision trees and assesses their properties as convergent instrumental goals for a random agent.
It finds that power is a convergent instrumental goal under certain definitions, particularly when agents can pursue absolute or near-absolute power.

1st June 2025

Toward a Theory of Agents as Tool-Use Decision-Makers

Agent as Tool-Use Decision-Maker: introduces a unified theory treating internal reasoning and external actions as equivalent epistemic tools, enabling agents to coordinate introspection and interaction.
The framework models an agent as a goal-directed decision-maker coordinating internal cognitive tools and external physical tools based on knowledge and tool use decision boundaries.
Optimal agent behavior aligns the tool use decision boundary with the knowledge boundary, minimizing unnecessary tool use and maximizing epistemic efficiency.

WILL AGENTS REPLACE US? PERCEPTIONS OF AUTONOMOUS MULTI-AGENT AI

Perceptions of Autonomous Multi-Agent AI: introduces a study analyzing professional perceptions of AI agents using a Survey Design (10 closed-ended questions) administered to Participants (130 respondents), followed by Data Processing (cleaning, formatting) and analysis including Descriptive Statistics (response proportions), Association Analysis (Chi-squared tests), Dimensionality Reduction and Clustering (MCA, K-Modes), and Predictive Modeling (logistic regression).
The study reveals nuanced views, with most respondents acknowledging AI's impact on programming but favoring collaborative models with human oversight.
Key findings include the identification of three distinct respondent clusters and the prominence of regulatory concerns as a perceived barrier to deployment, although predictive modeling did not find statistically significant predictors of current deployment.

Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges

Tool-Augmented LLM Agent with GRPO: introduces a framework for improving LLM agents on cryptographic CTF challenges using Guided Reinforcement Prompt Optimization, incorporating a tool augmentation module for interaction with a Python execution environment via the Model Context Protocol, guided by a reward mechanism within a CTF challenge environment.
The LLM Agent, specifically Llama-3.1-8B, is fine-tuned using GRPO to enhance structured reasoning and tool-assisted computation for solving cybersecurity tasks.
The framework leverages the random-crypto benchmark for training and evaluates generalization on the picoCTF benchmark, demonstrating improved tool invocation reliability and code synthesis.

A Study on the MCP × A2A Framework for Enhancing Interoperability of LLM-based Autonomous Agents

MCP × A2A Framework: introduces an integrated architecture combining the Model Context Protocol (MCP) and Agent-to-Agent (A2A) protocol, including User Interface Layer, Agent Management Layer, Core Protocol Layer, Tool Integration Layer, Security & Authentication Layer, Multimodal Content Processing, User Request Handler, Agent Card Manager, Agent Registry, Task Manager, Agent Discovery, A2A Message Format, MCP Content Protocol, Artifact Manager, State Tracker, Tool Description Manager, Function Caller, Schema Validator, Result Handler, Authentication, Authorization, Encryption, and Access Control components, to enhance interoperability and development efficiency for LLM-based autonomous agents.
The framework provides standardized communication between agents via A2A and structured interaction with external tools via MCP, facilitating scalable multi-agent systems.
A layered architecture supports modularity, maintainability, and scalability, demonstrated through a stock information system case study using LangGraph.

31st May 2025

World Models for Cognitive Agents: Transforming Edge Intelligence in Future Networks

Wireless Dreamer: introduces a world model-based reinforcement learning framework for wireless edge intelligence optimization, including a world model, Q-Network, Target Q-Network, Replay buffer, Encoder, and Decoder.
The framework leverages a learned world model to predict network state changes and generate imagined trajectories for effective decision-making.
Wireless Dreamer integrates model-based planning with reinforcement learning to enhance sample efficiency and temporal foresight in dynamic wireless environments.

Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem

Model Context Protocol (MCP): introduces a systematic study of attack vectors targeting the MCP ecosystem, which standardizes interactions between LLM agents and external resources via a client-server architecture involving User, LLM Provider, MCP Host, MCP Client, Package Repository, MCP Server, and Third-Party Resource components.
The paper identifies and characterizes four attack types leveraged by malicious MCP servers: Tool Poisoning, Puppet, Rug Pull, and Exploitation via Malicious External Resources, detailing their exploitation paths within the MCP workflow.
Experiments demonstrate the feasibility of these attacks against mainstream LLMs, revealing insufficient audit mechanisms on aggregation platforms and users' difficulty in identifying malicious servers, highlighting the urgent need for robust security defenses.

30th May 2025

Memory OS of AI Agent

MemoryOS: introduces a comprehensive memory management system, with Memory Storage (Organizes memory hierarchically), Short-Term Memory (Stores recent conversation data), Mid-Term Memory (Stores recurring topic summaries), Long-term Personal Memory (Stores user/agent preferences), Memory Updating (Manages dynamic memory refreshing), Memory Retrieval (Retrieves relevant memory information), and Response Generation (Integrates retrieved memory to generate responses), designed for AI agents to achieve comprehensive and efficient memory management.
The system employs a three-tier hierarchical storage architecture (STM, MTM, LPM) and four core functional modules (Storage, Updating, Retrieval, Generation) to manage long-term conversational coherence and user persona persistence.
MemoryOS utilizes dynamic updates between storage units, segmented paging, heat-based prioritization, and a two-tiered retrieval approach to enhance context management and personalization in long conversations.

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Browser-Use Agent: introduces Open CaptchaWorld, a web-based benchmark and platform for evaluating multimodal LLM agents on interactive CAPTCHA puzzles, including Agent (core reasoning model), Memory (stores state/history), Next goal (defines immediate objective), Action (executes operation), and Eval (evaluates state/action) components.
The benchmark features 20 diverse CAPTCHA types and a new metric, CAPTCHA Reasoning Depth, to quantify task complexity.
Empirical results demonstrate a significant performance gap between state-of-the-art MLLM agents and humans on these interactive visual reasoning tasks, highlighting current limitations.

VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software

VIDEOCADFORMER: introduces an autoregressive transformer model for predicting CAD UI actions, including UI Image Encoder, CAD Image Encoder, Visual Projection, Action/Timestep Embeddings, Transformer Decoder with Multi-Head Attention, Cross-Attention, Feed-Forward Network, Command Head, and Parameter Head.
The model processes visual inputs (target CAD image, past UI frames) and sequential data (past actions, timestep embeddings) to predict the next low-level UI action.
The architecture uses ViT encoders for visual features, projects inputs into a hidden space, and employs a causal transformer decoder with attention mechanisms and MLPs for action prediction.

EXP-Bench: Can AI Conduct AI Research Experiments?

AI Agent: introduces, with all (Design experimental procedures), (Implement experimental procedures), (Analyze results, derive conclusions), (Execute experiments)-components, a benchmark evaluating AI agents on end-to-end research experiments.
The benchmark challenges agents to perform tasks sourced from AI publications, including hypothesis formulation, experimental design, implementation, execution, and result analysis.
A semi-automated pipeline curates tasks from papers and code, and evaluation uses ground truth comparisons and code execution.

Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting

Causal-aware LLMs: introduces a framework integrating structural causal models (SCMs) into large language models (LLMs) for decision-making, utilizing Main Env, LLM, Causal Matrix, Local Causal Graph, Agent, Valid Env, Causal Intervention, Observations, Action, Extra Reward, and Goal components within a learning-adapting-acting paradigm.
The framework iteratively learns causal knowledge from the Main Env using the LLM, refines it through Causal Intervention in a Valid Env, and uses the learned knowledge (Causal Matrix, Local Causal Graph) to guide the Agent's actions and Goal generation.
This approach enhances the LLM's environmental understanding and the Agent's policy learning through structured causal reasoning and adaptive knowledge updates based on environmental feedback and Extra Reward signals.

Multiple LLM Agents Debate for Equitable Cultural Alignment

Multi-Agent Debate framework: introduces a method where LLM Agents (Debate over scenario) debate over a cultural scenario, potentially incorporating Self-Reflection Capability (Reflects on output) via a Choice Mechanism (Chooses reflection or debate), and collaboratively reach a final decision through a Debate Mechanism (Structured interaction), resolved by a Judge LLM (Resolves disagreements) if needed.
The framework explores multi-LLM collaboration to improve cultural adaptability and equitable alignment across diverse contexts.
Experiments show that multi-agent debate enhances accuracy and cultural group parity, enabling smaller LLMs to achieve performance comparable to larger models.

--

When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation

Evaluation Pipeline: introduces a systematic framework to evaluate persona-based dialogue generation, including PRODIGy Dataset, Non-PRODIGY Character Generator, Dialogue Generator, Fine-tuning Module, Evaluation Framework, LLM-as-a-Judge, Human Evaluator, and Biography Similarity Module.
The framework investigates how large language models adapt responses based on both target speaker and interlocutor characteristics across varying topics and speaker pairings.
Evaluation involves systematically masking or revealing interlocutor information to assess its impact on dialogue generation and target speaker identification using both automatic and human methods.

NEXUSSUM: Hierarchical LLM Agents for Long-Form Narrative Summarization

NEXUSSUM (Hierarchical LLM Agents for Long-Form Narrative Summarization): introduces a multi-agent LLM framework for long-form narrative summarization with a Preprocessor agent (Converts dialogue to prose), Narrative Summarizer agent (Generates initial summary), and Compressor agent (Refines summary length).
The framework processes long-form text through a structured, sequential pipeline using chunking and concatenation.
This approach aims to improve narrative coherence, handle long contexts, and control output length for high-quality summaries.

CREFT: Sequential Multi-Agent LLM for Character Relation Extraction

CREFT: introduces a sequential multi-agent LLM framework for character relation extraction, including Base Character Graph Construction, Character Selection with PPR, Merging Duplicate Nodes (LLM), Relation Extraction (LLM), Filtering Out Irrelevant Characters (LLM), Role Identification (LLM), Grouping Characters (LLM), and CRS, which iteratively refines character composition, relations, roles, and groups from narrative texts.
The framework first builds a base character graph using knowledge distillation from GPT-4o and a fine-tuned LLM, then employs specialized LLM agents in sequence to refine the graph components.
Experiments show that the multi-agent approach significantly outperforms single-agent baselines in accuracy and completeness for extracting character relations from Korean drama scripts.

Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research

AGORA (Agent Graph-based Orchestration for Reasoning and Assessment): introduces a flexible framework with a Graph-based Workflow Orchestration Engine (Manages task execution via DAG) managing Tasks (Nodes in workflow DAG), integrating Agent Algorithms (Operators) (Modular reasoning/action components), Memory (Stores short-term/long-term information), External Tools (LLMs, VLMs, databases, etc.), Client Interfaces (User/evaluation interaction points), and an Evaluation Framework (Enables systematic comparison) for reproducible language agent research.
The framework utilizes a graph-based engine for modularity and scalability, supporting diverse agent algorithms implemented as reusable operators.
Multiple client interfaces are provided for flexible interaction and systematic evaluation across different tasks and models.

--

Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents

MPR (Multi-Perspective Role-Playing) framework: introduces a method for sentiment forecasting on social media, with Feature Extraction (Identify implicit features), Subjective Role-Playing Agent (Simulate user behavior, generate comments), Objective Role-Playing Agent (Analyze generated comments, ensure consistency), and Iterative Rectification (Refine generated comments based on analysis) components.
The framework leverages LLMs to simulate user responses to events and analyze generated content for consistency to predict future sentiment.
By incorporating external context and user-specific features through multi-perspective role-playing, the approach aims for more precise sentiment predictions.

Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games

LLM-based Agent System: introduces a system to study LLM agent behavior in the Ultimatum Game, with LLM Agents, Ultimatum Game Environment, Prosocial Beliefs, Reasoning Methods, System Prompts, Reasoning Prompts, Proposal/Decision Prompts, Strategy Prompts, and Conversation History components, where the system simulates LLM agents with varying beliefs and reasoning in an economic game to assess behavioral alignment with human norms.
The system initializes LLM agents with specific prosocial beliefs and reasoning methods (CoT, ToM levels) to act as Proposers or Responders in a multi-round Ultimatum Game.
Experiments across diverse LLMs and belief/reasoning combinations evaluate agent performance and behavioral alignment using metrics like acceptance rate, average turns, and deviation scores from expected human behavior.

Proactive Guidance of Multi-Turn Conversation in Industrial Search

Two-Phase Framework (G-SFT and C-RL): introduces a system for proactive guidance in multi-turn search, featuring a G-SFT phase with a Goal Adaptation Agent, Scalable Knowledge Transfer, and G-SFT Model, and a C-RL phase with Generate, Rank, and C-RL Model components.
The G-SFT phase uses the Goal Adaptation Agent to dynamically adapt to user goal shifts via Explicit Goal Analysis, Goal-relevant Summary, and Shift Detection Signal, while Scalable Knowledge Transfer distills LLM knowledge into the G-SFT Model for low-latency guidance generation.
The C-RL phase employs a generate-rank paradigm, using a Preference-Aligned Augmentation Model with DBS-based Decoding to create candidates, and a Rank component with a Click Estimator and Diversity-Aware Group Sample Strategy to select preference pairs for fine-tuning the C-RL Model based on user clicks.

An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring

Credibility Scoring Framework: introduces an adversary-resistant multi-agent LLM system that processes a User Query (Task) using a Team of Agents (LLM agents) configured with a Topology (communication structure) and Agent Roles (assigned tasks/expertise), generating Individual Outputs (agent responses) which are combined by a CrS-Aware Aggregator (weights/combines outputs) to produce the final Output (final system answer).
The system learns agent reliability via a feedback loop where an LLM Judge (evaluates outputs/contributions) provides a Reward (output quality feedback), used by the Agent Contribution Calculation (CSc) (measures agent impact) and Credibility Score Update (learns agent reliability) components to adjust agent Credibility Score (CrS) (agent reliability score).
This dynamic credibility scoring mechanism enhances robustness against adversarial agents, even in adversary-majority settings, by weighting agent contributions based on their learned reliability.

SentinelAgent: Graph-based Anomaly Detection in LLM-based Multi-Agent Systems

SentinelAgent: introduces a system-level anomaly detection framework for LLM-based multi-agent systems, integrating structural modeling with runtime behavioral oversight using Event Monitor (intercepts runtime events), Behavior Analyzer (evaluates interaction graph), and Risk Responder (determines responses).
The framework models agent interactions as dynamic execution graphs to enable semantic anomaly detection at node, edge, and path levels.
SentinelAgent acts as an autonomous, LLM-powered runtime monitor that observes, analyzes, and intervenes in multi-agent system execution based on security policies.

Learning API Functionality from Demonstrations for Tool-based Agents

Tool-based Agent Framework: introduces learning API functionality from demonstrations for tool-based agents, including an LLM-based Agent that selects and calls API Functions, using Expert Demonstrations processed by Processing Methods, enhanced by Self-Exploration evaluated by an LLM-based Evaluator, and updated via Methods for Processing Experiences, utilizing an LLM Document Generator, LLM Document Updater, and LLM Summarizer.
The framework investigates different methods for processing expert demonstrations and incorporating self-exploration experiences to improve the agent's understanding of API functionality without prior documentation.
Experiments across multiple datasets and models highlight the challenge of learning parameter information from demonstrations and the benefits of explicit function calls and natural language critiques.

Don't Just Follow MLLM Plans: Robust and Efficient Planning for Open-world Agents

REPOA (Robust and Efficient Planning for Open-world Agents): introduces a framework for robust and efficient planning in open-world environments, featuring Adaptive Dependency Learning (revises dependencies), Fine-grained Failure-aware Operation Memory (tracks operation outcomes), Difficulty-based Exploration (selects goals), and Context-aware Reprompting (assists controller).
The framework enables agents to learn and revise item dependencies from scratch through environmental interaction.
REPOA demonstrates improved robustness to inaccurate knowledge and enhanced learning efficiency compared to prior methods.

29th May 2025

Conceptual Framework Toward Embodied Collective Adaptive Intelligence

CAA (Collective Adaptive Agents): introduces a conceptual framework for embodied collective adaptive intelligence, comprising a Set of Agents, where each Individual Agent uses Function f to process Observation, Previous Action, Previous Memory, and Previous Feedback, updating its Memory and Position/State based on Parameters, and generating Current Action and Message Out, with Function h determining inter-agent interaction.
The framework emphasizes decentralization and self-adaptation, allowing agents to adjust to tasks and topologies during testing by observing inputs and updating internal states.
This approach aims to enable collective systems to exhibit features like task/topology adaptation, resilience, scalability, and self-assembly in dynamic environments.

TRAP: TARGETED REDIRECTING OF AGENTIC PREFERENCES

TRAP framework: introduces a generative adversarial framework that manipulates agent decision-making using diffusion-based semantic injections, including CLIP Embedding Extraction, Layout Mask Generation, Siamese Feature Decomposition, Image Embedding Optimization, Modulated Embedding Creation, and Image Decoding components.
The framework operates by optimizing a CLIP image embedding guided by a positive text prompt and various losses, then decoding the modified embedding using Stable Diffusion to create a visually natural yet semantically altered image.
TRAP achieves a 100% attack success rate on leading multimodal models by exploiting semantic vulnerabilities in cross-modal decision-making without requiring model internals access.

BIOREASON: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

BIOREASON: introduces a multimodal framework integrating a DNA foundation model and a large language model, with DNA Foundation Model (fDNA) Encoder, DNA-specific Tokenizer (TDNA), Learnable Linear Projection (Proj), Large Language Model (fLLM) Backbone, LLM-specific Tokenizer (TLLM), LLM Embedding Layer (E), Special Tokens, Rotary Position Embedding (RoPE), Multimodal Input Sequence (XLLM), and Group Relative Policy Optimization (GRPO), designed for interpretable biological reasoning from genomic data.
The framework processes raw DNA sequences via the fDNA encoder and integrates the resulting embeddings with tokenized text queries into a unified multimodal input sequence for the fLLM backbone.
Training involves supervised fine-tuning and reinforcement learning using GRPO to incentivize multi-step biological reasoning and generate interpretable step-by-step explanations.

LLM Agents Should Employ Security Principles

AgentSandbox: introduces, with Persistent Agent (Manages profile, orchestrates tasks), Data Minimizer (Enforces access control policies), Ephemeral Agent (Executes isolated user tasks), I/O Firewall (Mediates external interactions), and Response Filter (Sanitizes, validates responses), a conceptual framework embedding security principles to safeguard LLM agents throughout their lifecycle.
The framework operationalizes defense-in-depth, least privilege, complete mediation, and psychological acceptability to address vulnerabilities in LLM agent interactions.
AgentSandbox mitigates privacy risks and malicious behavior through components like agent isolation, data minimization, and comprehensive mediation of internal and external communications.

--

Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

DGM (Darwin Gödel Machine): introduces a self-improving system that iteratively builds a growing Archive (stores agents) by interleaving Self-Modification (agent changes itself) of a Coding Agent (system being improved) with Evaluation (tests agent) on a Benchmark Suite (evaluation tasks), using Parent Selection (selects agents) from the archive, where the agent is powered by a Foundation Model (FM) (agent's base capability) and modifies its own Code Repository (agent's code) and Tools (agent's capabilities) based on Evaluation Logs (agent performance data) and Self-Improve Instruction (prompt for self-modification).
The system operates through an open-ended exploration loop, maintaining a traceable lineage of agents in the archive and empirically validating self-modifications against coding benchmarks.
The approach demonstrates automatic discovery of improved coding capabilities and workflows, achieving performance gains on SWE-bench and Polyglot benchmarks, and incorporates safety measures like sandboxing and monitoring.

CONVERSAR: Exploring Embodied LLM-Powered Group Conversations in Augmented Reality for Second Language Learners

CONVERSAR: introduces, with AR Application, Embodied LLM Agents, Scene Understanding, Voice Recognition, Text-to-Speech, Agent LLM, Moderator LLM, and Global Conversation History components, a gpt-4o powered AR application enabling L2 learners to practice contextualized group conversations with two embodied LLM agents.
The system leverages object detection for scene understanding and uses OpenAI's Audio API for speech-to-text and text-to-speech, while a Moderator LLM manages conversation turns between the user and agents.
This approach aims to provide a safe and immersive environment for L2 learners to practice group conversation dynamics, reducing anxiety and increasing autonomy compared to in-person methods.

Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

Multi-RAG: introduces a multimodal retrieval-augmented generation system for adaptive video understanding, including Video Stream, Audio, Frame Sampler, Automatic Speech Recognition (ASR), Vision Language Model (VLM), Frame Descriptions, Auxiliary Metadata, Audio Transcripts, Descriptive Video Texts, Knowledge Database, Context Embeddings, Video Documents, User Query, RAG Agent, Context Retrieval, Generation, Large Language Model (LLM), and System Answer components.
The system integrates and reasons over video, audio, and text streams to improve situational understanding and reduce cognitive load in dynamic, information-rich scenarios.
It converts multimodal inputs into unified textual representations stored in a knowledge database, using a RAG agent and LLM to retrieve relevant information and generate responses to user queries.

Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach

Complexity-Aware Feedback: introduces an iterative feedback method with LLM (Code Generator), Complexity Metric Calculator (Computes metrics), Code Evaluator (Checks test cases), Test Case Generator (Creates internal tests), Metric Importance Detector (Identifies influential metrics), Feedback Mechanism (Prompts LLM with metrics), and Iterative Refinement Loop (Manages refinement) to improve LLM code generation by leveraging complexity metrics.
The approach identifies complexity metrics correlated with code correctness and uses the most influential ones as feedback to guide LLMs in regenerating code iteratively.
This method demonstrates improved Pass@1 scores, particularly for smaller LLMs, and can be integrated with agent-based frameworks like Reflexion.

Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve

LessonL (A Multi-Agent Framework for Code LLMs): introduces a lesson-based collaboration framework with multiple LLM agents, lessons, a lesson bank, lesson solicitation, lesson banking, lesson selection, and effectiveness adjustment.
The framework enables agents to learn from each other's successes and failures through shared lessons stored in a bank.
This iterative process of generating, banking, selecting, and applying lessons allows a team of small LLMs to outperform larger models and other multi-agent methods on coding tasks.

From Chat Logs to Collective Insights: Aggregative Question Answering

Aggregative Question Answering: introduces a novel task requiring models to reason over large-scale conversation logs to answer aggregative queries, supported by the WildChat-AQA benchmark.
The task involves processing raw chat interactions, extracting attributes, generating questions, retrieving relevant data, and reasoning over evidence using language models and a database.
The paper evaluates various methods for answering on the WildChat-AQA benchmark, highlighting challenges in reasoning effectively at scale and computational costs.

ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

ThinkGeo: introduces a benchmark for evaluating tool-augmented LLM agents on remote sensing tasks, featuring User Queries, RS Imagery, a ReAct based Reasoning/Execution Chain, Tools (Perception, Logic, Operation), and Answer components.
The benchmark uses real satellite and aerial imagery and requires agents to perform multi-step reasoning and tool use for spatially grounded tasks.
ThinkGeo provides fine-grained evaluation metrics for agent performance across different tool categories and reasoning steps in remote sensing contexts.

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

ML-Agent: introduces a novel agentic ML training framework with Exploration-enriched Finetuning (diverse action pool), Step-wise RL Training (efficient RL paradigm), and Agentic ML-specific Reward (unified feedback signal) to train an Agent (LLM-based entity) that interacts with an Environment (code files, interpreter) via Action (agent interaction) and Feedback (environment response), leveraging Collected Trajectories (expert interactions) and a States Pool (sampled states).
The framework enables the LLM agent to learn from interactive experimentation on ML tasks using online reinforcement learning, moving beyond manual prompt engineering.
This approach facilitates diverse exploration, efficient training, and unified feedback processing, leading to continuous performance improvements and cross-task generalization.

OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

WORKFORCE: introduces a hierarchical multi-agent framework with a Planner Agent (task decomposition), Coordinator Agent (subtask management), Worker Nodes (task execution), and Task Channel (communication hub).
This modular architecture decouples strategic planning from domain-specific execution, enabling cross-domain transferability.
The framework utilizes Optimized Workforce Learning (OWL) to train the domain-agnostic Planner for improved generalization.

Data-to-Dashboard: Multi-Agent LLM Framework for Insightful Visualization in Enterprise Analytics

Data-to-Dashboard: introduces a multi-agent LLM framework that automates the data-to-dashboard pipeline using a Data Profiler (Constructs statistical synopsis), Domain Detector (Determines business theme), Concept Extractor (Identifies salient concepts), Analysis Generator (Synthesizes structured insights), Evaluator (Scores generated outputs), Self-Reflector (Enhances reasoning iteratively), LLM/Knowledge (Underlying language model/knowledge), and Generate Charts (Produces visualizations).
The framework processes raw data through a data-to-insight stage involving profiling, domain/concept detection, analysis generation, evaluation, and iterative reflection, followed by an insight-to-chart stage for visualization.
This agentic system leverages domain-informed reasoning to produce insightful visualizations tailored for enterprise analytics tasks.

MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment

LLM Refusal Alignment Approach: introduces methods to improve Large Language Model (LLM) safety against Model Context Protocol (MCP) exploits, utilizing Offline Alignment (Direct Preference Optimization), Online Alignment (Retrieval Augmented Generation - Preference), a RAG-Pref Knowledge Base, RAG-Pref Embedding, and RAG-Pref Search components.
The approach evaluates and enhances LLM refusal capabilities against Falsely Benign Attacks (FBAs) delivered via the MCP protocol by applying offline and online preference alignment techniques.
A novel dataset of MCP-FBAs is introduced, and RAG-Pref is presented as a training-free online alignment method complementary to offline methods like DPO for improving refusal rates.

SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents

SafeScientist: introduces a framework for risk-aware scientific discovery, integrating a Prompt Monitor (screens inputs), Discussion Stage (multi-agent collaboration), Agent Collaboration Monitor (monitors discussion), Tool Use Stage (invokes tools), Tool-Use Monitor (oversees tool use), Writing Stage (synthesizes paper), Paper Ethic Reviewer (reviews paper ethics), and an Underlying LLM Agent (executes tasks) to enhance safety and ethical responsibility.
The framework employs a multi-layered defense system across the research pipeline, from input screening to final output review, to proactively manage risks in AI-driven scientific exploration.
SafeScientist is benchmarked using SciSafetyBench, a novel dataset of high-risk scientific tasks and tool-related risks, demonstrating improved safety performance without compromising research quality.

SWE-bench Goes Live!

SWE-bench-Live Construction Pipeline: introduces an automated, scalable benchmark for evaluating LLMs on real-world issue resolution tasks, featuring Raw Issue-PR Crawling, REPOLAUNCH for environment setup, Validating Task Instances, Agent Frameworks, and LLMs.
The REPOLAUNCH pipeline automates environment setup via steps including Relevant Files Identification, Base Image Selection, Interactive Environment Setup, Verification, and Packaging to image.
Task validation involves applying test and fix patches and using a Parser to confirm successful issue resolution based on test transitions.

Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems

EIB-LEARNER (Error-Insight Balanced Learner): introduces a communication topology optimization framework for LLM-based multi-agent systems, utilizing a Node Encoder (Embeds roles and query) and Attributed Graph Construction (Creates graph representation) to feed dual-view GNNs, a Sparse View GNN (Simulates error suppression) and a Dense View GNN (Simulates insight propagation), whose outputs are decoded by Inter-Agent Coefficient Modeling (Estimates connectivity) and combined via Adaptive Dual-View Fusion (Combines sparse/dense views) for Topology Sampling (Generates communication graph) applied within a Multi-Agent System (Environment for topology), optimized by Model Optimization (Learns parameters).
The framework balances error suppression and insight propagation by simulating these effects on sparse and dense graph views respectively, fusing the learned connectivity patterns based on the task query.
EIB-LEARNER dynamically customizes the communication topology to achieve optimal task performance, communication efficiency, and robustness against errors in multi-agent systems.

SCEDIT: Script-based Assessment of Knowledge Editing

SCEDIT (Script-based Knowledge Editing Benchmark): introduces a novel script-based benchmark for evaluating knowledge editing methods in real-world scenarios, focusing on LLMs' ability to integrate updated knowledge into procedural tasks.
It utilizes Facts, Script Questions, and Scripts as core elements, evaluated through Token-level Evaluation and Text-level Evaluation, including Human and Automatic Evaluation.
The benchmark includes counterfactual and temporal editing tasks, highlighting challenges for existing methods in script-based scenarios.

Wireless Agentic AI with Retrieval-Augmented Multimodal Semantic Perception

RAMSemCom: introduces a retrieval-augmented multimodal semantic communication framework with Data Collection, Data Recollection, Semantic Encoder, Semantic Decoder, Retrieval Scheduler, Retrieval Channel, Semantic/Prompt Representation, Semantic/Prompt Interpretation, Physical Channel, Output Validation, Reconstruction, and DRL components, designed for efficient multimodal information exchange in bandwidth-constrained multi-agent systems.
The framework employs iterative retrieval and semantic refinement, dynamically optimizing retrieval using DRL to balance semantic fidelity and bandwidth constraints.
A case study in multi-agent autonomous driving demonstrates improved task completion efficiency and reduced communication overhead.

Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion

AdvOF (Adversarial Object Fusion): introduces a novel attack framework targeting VLN agents by generating adversarial 3D objects, comprising Aligned Object Rendering (Aligns 3D/2D victim object), Adversarial Collaborative Optimization (Optimizes adversarial features cross-modal), and Adversarial Object Fusion (Fuses multi-view perturbations iteratively).
The framework aims to mislead the VLM-based perception module of VLN agents by causing misclassification of adversarial objects across multiple views.
AdvOF achieves this by precisely aligning victim object positions, optimizing adversarial objects with regularization, and iteratively fusing updates based on view importance.

Context-Aware Semantic Communication for the Wireless Networks

CaSemCom: introduces a context-aware semantic communication framework leveraging an LLM-based gating mechanism and a multi-expert architecture for adaptive content and expert selection.
The LLM-based gating mechanism selects relevant input content and specialized semantic extraction experts based on task and communication context, with a DRL agent providing a fallback mechanism.
The multi-expert semantic architecture utilizes specialized encoders and decoders for different data modalities, enhancing efficiency and adaptability in dynamic wireless environments.

OSS-UAgent : An Agent-based Usability Evaluation Framework for Open Source Software

OSS-UAgent: introduces an agent-based framework for automated OSS usability evaluation, featuring Researcher agent, Developer agent, Code Generator agent, and Evaluator agent.
It simulates multi-level developers via Multi-Level Developer Simulation and tailored Prompts, leveraging a dynamic knowledge base built by Platform Knowledge Construction and stored in VectorDB.
Code Generation produces implementations assessed by Multi-Dimensional Evaluation using Metrics (Compliance, Correctness, Readability), providing Results for usability insights.

Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration

MAEL: introduces a multi-agent cross-task experiential learning framework with LLM-based Agents (Nodes in graph), Multi-Agent Network (Graph structure), Task-Solving Workflow (Recursive procedure), Experiential Learning Phase (Collects experiences), Inference Phase (Uses experiences), Experience Pool (Stores experiences), Reward Calculation (Quantifies step quality), Experience Retrieval (Finds relevant experiences), and Retrieval-Augmented Generation (Augments agent input), enabling agents to learn from past tasks.
The framework includes an experiential learning phase to accumulate agent experiences with quantified rewards and an inference phase to retrieve and utilize high-quality experiences for new tasks.
MAEL employs a divide-and-conquer plus critique workflow and reward-weighted experience retrieval to improve multi-agent collaboration efficiency and solution quality.

Second Opinion Matters: Towards Adaptive Clinical AI via The Consensus of Expert Model Ensemble

Consensus Mechanism (Sully Medical Consensus 1): introduces, "Second Opinion Matters: Towards Adaptive Clinical AI via The Consensus of Expert Model Ensemble", with all Triage Model (Routes task to experts), Expert Models (Specialized LLMs analyze task), Probability Aggregation (Combines expert probability distributions), Cascade Boosting (Boosts probabilities based on rank), Consensus Model (Synthesizes expert info for final answer) components, where the framework is a modular clinical reasoning system aggregating multiple expert LLMs for robust clinical decision-making.
The system mimics clinical triage and multidisciplinary decision-making by routing tasks to specialized expert agents and synthesizing their probabilistic outputs.
This ensemble approach aims to improve performance, adaptability, and transparency compared to single-model systems in clinical AI applications.

CDR-Agent: Intelligent Selection and Execution of Clinical Decision Rules Using Large Language Model Agents

CDR-Agent: introduces an LLM-based system for clinical decision support, with Clinical Note (Input clinical text), CDR Database (External knowledge source), CDR Selection Module (Identifies relevant rules), Embedding Model (Computes semantic similarity), Variable Extraction Module (Extracts variables from text), LLM (Processes text and extracts data), CDR Execution Module (Runs rule logic), Python Scripts (Executable rule definitions), and Decisions (Final clinical outcomes), designed to autonomously select and execute Clinical Decision Rules based on unstructured clinical notes.
The system employs a three-step workflow: selecting relevant CDRs using semantic similarity and anomaly detection, extracting variables from clinical notes using an LLM, and executing the selected CDRs via Python scripts.
Evaluated on two novel ED datasets, CDR-Agent demonstrates improved accuracy and efficiency in CDR selection and execution compared to a baseline LLM prompting approach.

AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models

AgentAlign: introduces a novel framework for agent safety alignment data synthesis, with Abstract Behavior Chain Generation (Captures harmful patterns), Instruction Synthesis (Grounds patterns executable instructions), Simulated Environment (Instantiates chains with tools), Quality Control Pipeline (Ensures instruction validity), and Response Generation (Creates responses/trajectories).
The framework leverages abstract behavior chains instantiated in simulated environments with diverse tool instances to generate authentic and executable instructions.
AgentAlign systematically generates high-quality alignment data by capturing harmful patterns, synthesizing instructions, ensuring validity, and generating appropriate responses.

Stairway to Success: Zero-Shot Floor-Aware Object-Goal Navigation via LLM-Driven Coarse-to-Fine Exploration

ASCENT: introduces a zero-shot floor-aware object-goal navigation framework with Multi-Floor Spatial Abstraction for hierarchical mapping and Coarse-to-Fine Frontier Reasoning for LLM-driven exploration decisions.
The framework takes sensor inputs and prior knowledge to build spatial representations and reason about target locations across multiple floors.
It employs a coarse-to-fine strategy using a value map and LLM reasoning to efficiently select navigation frontiers and generate actions.

A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

Multi-State DAG Framework: introduces a graph-based approach for conversational agents, utilizing a Workflow Graph (DAG representing agent flow) with LLM Nodes (invokes Large Language Model) and Tool Nodes (calls external tool), each having specific System Prompts (instructions for LLM node), Modify History Routines (manipulates conversation history), Tool Input Schemas (defines tool input format), and Tool Output Schemas (defines tool output format) to interact with External Tools (pre-defined external functions).
This framework enhances compliance and controllability for production-grade agents by distributing constraints across graph states.
A specific training strategy with response masking is proposed to fine-tune models within this state-dependent framework.

LLM Agents for Bargaining with Utility-based Feedback

ICL-UF (In-Context Learning with Utility-based Feedback): introduces a framework for LLM agents to perform realistic bargaining, including LLM Agent, ICL-UF, Utility-based Feedback (HAMBA), Opponent-Aware Reasoning (OAR), and ReAct Structure components.
This framework guides the LLM Agent using Utility-based Feedback (HAMBA) to foster Opponent-Aware Reasoning (OAR) for improved negotiation strategies.
Agents structure their negotiation responses using the ReAct Structure within diverse BARGAINARENA market scenarios to capture realistic bargaining dynamics.

Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation

VeGraph: introduces a novel framework for complex claim verification using an LLM agent that constructs a graph representation, iteratively resolves ambiguous entities, and verifies sub-claim triplets.
The framework leverages interactive graph representation and an external knowledge base to enhance entity disambiguation and multi-step reasoning.
Pipeline logging records the agent's activities for explainability, and the final verdict is determined by the veracity of the verified triplets.

Large language model-based agents for automated research reproducibility: an exploratory study in Alzheimer's Disease

LLM-based Agent System: introduces a simulated research team of LLM-based autonomous agents, including Planner (suggests and revises plan), Engineer (follows plan, writes code), Scientist (advises on reproduction, interprets output), Critic (critiques plan, provides feedback), Executor (executes Python code), and Manager (orchestrates team, determines speaker), tasked with reproducing published research findings.
The system uses the Autogen framework and GPT-4o to dynamically analyze data, write and execute code, and iteratively reproduce results from study abstracts and methods sections.
This exploratory study demonstrates the potential and limitations of LLM agents for automating reproducibility in biomedical research, achieving approximately 53.2% reproduction of key findings across five Alzheimer's studies.

MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming

MermaidFlow: introduces a framework for agentic workflow generation, with Workflow Planning, Declarative Graph Representation, Static Verification, Evolutionary Programming, Code Generation, Execution, LLM-as-Judge, History Buffer, Mermaid Checker, Node Types, and EP Operators components, redefining the search space via safety-constrained graph evolution.
The framework models workflows as verifiable intermediate representations using Mermaid, a structured and human-interpretable graph language.
It employs domain-aware evolutionary operators and an LLM-as-Judge to explore a high-quality, statically verifiable workflow space, enabling robust and interpretable agentic reasoning.

ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

ToMAP (Theory of Mind Augmented Persuader): introduces a novel framework for training LLM persuaders by incorporating counterclaim prediction and opponent attitude prediction modules, guided by reinforcement learning.
The framework models the persuadee's mental state using the Theory of Mind modules to enable more diverse, opponent-aware, and effective arguments.
Experiments show ToMAP outperforms larger baselines and achieves stable persuasion gains in longer conversations by leveraging opponent-aware strategies.

Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

Multi-Agent Debate (MAD): introduces a test-time computational scaling method, with Agents (individual language models), Collaborative Refinement (agents refine based on others), Diverse Exploration (agents use different configurations), Rounds (iterative debate steps), Shared Context (previous outputs shared), Output Selection (final answer determination), and Judge (selects final response in safety tasks), where the paper systematically studies its effectiveness compared to Self-Consistency (SC) (parallel sampling baseline) and Self-Refinement (SR) (sequential refinement baseline) baselines.
MAD combines parallel generation within rounds and sequential refinement across rounds, leveraging diverse agent configurations and shared context.
The study evaluates MAD's performance on mathematical reasoning and safety tasks under varying conditions of task difficulty, model scale, and agent diversity.

28th May 2025

WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning

WorkForceAgent-R1: introduces an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework, including an LLM Agent (Policy Model π), Rule-based RL Framework (R1-style RL), Group Relative Policy Optimization (GRPO), Reward Model (Structured Reward Function), Reference Model (For GRPO), Single-Step Training Data (Input for training), Observation (Web page state), Query (User instruction), and Action Space (Permissible actions), designed to enhance single-step reasoning and planning for business-oriented web navigation tasks.
The approach combines behavior cloning via supervised fine-tuning with GRPO, utilizing a structured reward function comprising format correctness, action correctness, and penalty constraints to implicitly learn reasoning.
Experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines and achieves competitive performance against proprietary LLM-based agents.

Conversational Alignment with Artificial Intelligence in Context

CONTEXT-ALIGN framework (Large Language Models): introduces a framework for evaluating conversational alignment of LLMs, which are AI text generation systems based on Transformer architecture processing text as Tokens via Tokenization within a Context window up to a Context window limit.
The framework assesses LLMs' ability to handle context, common ground, and pragmatic inference, discussing limitations like Context window overflow and Context collapse, and mitigation strategies such as Context compression, External memory, and Retrieval systems.
The paper argues that Prompting acts as a static context substitute and discusses how Alignment strategies impose rigid communicative identities, hindering dynamic conversational alignment required for a Conversational agent.

Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment

CaMeL (Capabilities for Machine Learning): enhances its capability-based sandbox for LLM agents with an initial prompt screening gateway, output auditing pass, tiered-risk policy, and a proposed security-oriented DSL interpreter, building on its dual-LLM architecture and execution layer to improve prompt injection defenses for enterprise deployment.
The framework utilizes a Privileged LLM for planning and a Quarantined LLM for validating untrusted content, mediated by a capability-based execution layer enforcing data flow policies.
Proposed enhancements address limitations in initial prompt trust, output manipulation, side channels, and architectural overhead, aiming for improved robustness, scalability, and formal guarantees without modifying underlying models.

First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Overhearing Agent System: introduces an AI agent paradigm that utilizes Audio Input (Conversation audio) and Context Management (Maintains history) for a Language Model (Processes input) guided by a System Prompt (Defines agent role) to perform Reasoning (Internal thought) and Tool Calling (Executes actions) via Tools (External functions) based on overheard human conversation.
The system acts as a passive helper, listening to human-to-human conversation and providing assistance through background tasks or suggestions executed via tool calls, without directly participating in the dialogue.
Evaluated in a Dungeons & Dragons gameplay context, the system demonstrates the ability of large multimodal models to leverage implicit audio cues and maintain conversational goals for tasks like game data retrieval, NPC management, and NPC generation.

MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators

MEDAL: introduces an automated multi-agent framework for generating, evaluating, and curating multilingual open-domain dialogue evaluation benchmarks, including Seed Context (diverse conversation starters), LLM User (generates user utterances), LLM Chatbot (generates chatbot utterances), LLM Judge (User Validation) (validates user utterance quality), LLM Judge (Automated Evaluation) (multidimensional dialogue evaluator), Issue Labels (specific dialogue quality dimensions), Overall Assessment (aggregate dialogue quality score), Human Annotation (expert quality judgments), Sampling (selects dialogues for curation), and LLM Judge (Meta-Evaluation) (evaluates LLMs as evaluators).
The framework operates in three stages: dialogue generation using multiple LLM agents, large-scale automated labelling of generated dialogues, and curation of a meta-evaluation benchmark with human annotations.
MEDAL enables on-demand generation of diverse multilingual dialogues and benchmarks, facilitating the evaluation of LLMs as both chatbots and automated evaluators, highlighting deficiencies in detecting nuanced issues like empathy and commonsense.

Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems

Cross-domain Multi-agent LLM Systems: introduces an architecture where autonomous agents (LLMs with tools, memory, autonomy) from different organizations dynamically group for collaborative tasks.
This architecture faces seven security challenges related to agent behavior (unvetted grouping, collusion, conflicting goals, self-tuning misalignment) and data handling (provenance obscurity, context bypass, confidentiality/integrity).
Proposed countermeasures include trust-adaptive dynamic teaming, adversarial training, hierarchical conflict arbitration, cross-domain reward alignment, neural provenance tracking, session-level semantic firewalls, and verifiable reasoning with privacy.

3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

3DLLM-MEM: introduces a memory-enhanced 3D embodied agent framework that utilizes an Encoder (Encodes 3D inputs), Working Memory (Current 3D observations), Episodic Memory (Past 3D observations/interactions) stored in a Memory Bank (Stores episodic memory features), a Memory Fusion Module (Integrates working/episodic memory) producing Fused Episodic Memory (Integrated memory representation), and an LLM (Processes memory for actions).
The framework incrementally builds and maintains a task-relevant long-term memory by incorporating feedback from the environment and interacting with objects.
The Memory Fusion Module uses working memory tokens as queries to selectively attend to and fuse relevant spatial and temporal features from episodic memory.

Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents

Proposed Research Directions: introduces a position arguing that traditional aleatoric and epistemic uncertainty definitions are insufficient for interactive LLM agents and proposes research into Underspecification uncertainties (missing information, unclear task), Interactive learning (ask follow-up questions), and Output uncertainties (communicate uncertainty beyond numbers).
The paper highlights conflicts in existing uncertainty definitions and their breakdown in dynamic, multi-turn LLM agent interactions.
The proposed directions aim to make LLM agent interactions more transparent, trustworthy, and intuitive by addressing and communicating uncertainty in novel ways.

Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems

Agent-UniRAG (A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems): introduces a trainable agent framework for unified RAG systems, with Planning Module (determines necessary actions), Tool Using Module (interacts with external tools), Working Memory Module (stores input, logs, evidence), Reflector Module (filters and refines evidence), and Agent Loop (iterative process).
The framework leverages the LLM agent concept to handle both single-hop and multi-hop queries in an end-to-end manner.
Agent-UniRAG utilizes a synthetic dataset (SynAgent-RAG) for training small open-source LLMs to achieve competitive performance.

Universal Visuo-Tactile Video Understanding for Embodied Interaction

VTV-LLM: introduces a multi-modal large language model for universal visuo-tactile video understanding, integrating Tokenizer, T-Projector, VTV Encoder, V-Projector, and a Large Language Model.
The framework bridges the gap between tactile perception and natural language by aligning visuo-tactile video features with linguistic descriptions.
It enables sophisticated tactile reasoning capabilities for embodied interaction, including feature assessment and comparative analysis.

From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation

FAMER (Fast Adaptation via MEntal Reasoning): introduces a framework for fast desire alignment, integrating Perception (Extracts scene graph), Key Information Extraction (Filters, stores goal info), Memory (Stores cross-episode knowledge), Desire-Centered Mental Reasoning (Infers user desires), Efficient Communication (Manages dialogue efficiently), and Goal Oriented Planning (Plans goal actions).
The framework leverages LLMs to interpret vague instructions, infer user intent, and manage dialogue, enabling adaptation to unknown user preferences.
FAMER improves task execution and communication efficiency by filtering irrelevant actions, reducing redundant inquiries, and reusing knowledge across episodes.

EvolveSearch: An Iterative Self-Evolving Search Agent

EvolveSearch: introduces a novel iterative self-evolution framework that combines Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to enhance web search capabilities without external human-annotated reasoning data.
The framework alternates between an RL phase for exploration and generating rollouts, and an SFT phase that optimizes the base model using filtered high-quality rollouts.
This process leverages a hybrid reward mechanism and specific data filtering rules to enable continuous self-improvement in open web search domains.

Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems

Unified Framework: introduces, a systematic approach for topological structure learning in LLM-based Multi-Agent Systems, with Agent Selection (Selects agent subset), Structure Profiling (Identifies macro structure), and Topology Synthesis (Synthesizes micro graph), where the framework decomposes topology design into sequential stages for optimization.
The framework aims to learn optimal topological structures for MASs to enhance coordination performance and efficiency.
Each stage presents distinct challenges and research opportunities for designing adaptive multi-agent architectures.

AgentDNS: A Root Domain Naming System for LLM Agents

AgentDNS: introduces a root domain naming and service discovery system for LLM agents, with Service Registration (registers services), Service Proxy Pool (forwards requests), Service Search (discovers services), Service Resolution (resolves identifiers), Service Management (manages proxies), Service Billing (tracks costs), Authentication (verifies identity), AgentDNS DB (stores metadata), and AgentDNS API Server (provides API) components.
AgentDNS enables LLM agents to autonomously discover, resolve, and securely invoke third-party services across organizational and technological boundaries.
Inspired by traditional DNS, the system provides unified naming, natural language discovery, protocol-aware interoperability, authentication, and billing for multi-agent collaboration.

From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications

LAM-based Agentic AI system: introduces a system architecture with LAMs (Core reasoning engine), Planner (Task decomposition/organization), Knowledge Base (External knowledge support), Tools (External/internal execution toolkit), and Memory (Stores historical information).
CommLLM framework: introduces a LAM-centric multi-agent collaborative system architecture with MDR (Acquire task-relevant information), MCP (Decompose tasks/generate pathways), and MER (Evaluate solutions/self-feedback).
The tutorial reviews the evolution from Large AI Models to Agentic AI and their applications in future intelligent communication systems, particularly in the context of 6G networks.

VOICE CMS: UPDATING THE KNOWLEDGE BASE OF A DIGITAL ASSISTANT THROUGH CONVERSATION

Voice CMS architecture: introduces a system for updating a digital assistant's knowledge base through conversation, integrating a Voice CMS workflow, Conversational Engine with Agents, Knowledge Base, VUI, and LLM.
The system allows hotel staff to naturally converse with the assistant to add or modify information, reducing the need for traditional graphical content management systems.
Evaluation compares the Voice CMS with a GUI for knowledge management tasks, analyzing user preference, usability, and performance across varying task complexities.

Efficient Leave-one-out Approximation in LLM Multi-agent Debate Based on Introspection

IntrospecLOO (introspective-leave-one-out): introduces an efficient method for evaluating agent contributions in LLM multi-agent debates, utilizing Agents, User/Query, Independently Respond Round, Debate Round, Aggregation, IntrospecLOO Round, and IntrospecLOO Prompt.
The method adds a single IntrospecLOO Round after standard debate rounds, prompting agents with an IntrospecLOO Prompt to update answers while disregarding one agent's response.
This approach approximates the traditional Leave-one-out method at significantly reduced query complexity, enabling efficient contribution evaluation.

VIRAL: VISION-GROUNDED INTEGRATION FOR REWARD DESIGN AND LEARNING

VIRAL: introduces a pipeline for generating and refining reward functions using multi-modal LLMs, including Input, Initial Generation, Policy Learning, and Refinement components.
The framework takes textual environment details, optional success code, and a multi-modal goal prompt to generate initial reward functions via collaborating LLMs and code verification.
Reward functions are refined iteratively based on performance evaluation and feedback from humans or a Video-LVLM, leading to improved agent behavior alignment.

VulBinLLM: LLM-powered Vulnerability Detection for Stripped Binaries

Vul-BinLLM: introduces an LLM-based framework for binary vulnerability detection, featuring an LLM-assisted Decompiler (enhances code) with an Optimization Decision Agent (decides optimizations) and Action Agents (perform optimizations), a Code Memory Management Agent (manages functions), VulBinQ (queue), and Archived Analysis (storage).
The framework optimizes decompilation by adding vulnerability-specific comments and contextual information before analyzing the code for vulnerabilities.
It utilizes memory management and a function queue to handle large binary files and reduce LLM hallucinations during vulnerability reasoning.

EFFICIENTLY ENHANCING GENERAL AGENTS WITH HIERARCHICAL-CATEGORICAL MEMORY

EHC framework: introduces a general agent framework with Hierarchical Memory Retrieval (HMR), Task-Category Oriented Experience Learning (TOEL), Memory Pool (M), and LLM (Large Language Model), designed for efficient multi-modal task handling.
The framework uses a hierarchical memory system for rapid retrieval and continuous storage, mitigating redundancy and overhead.
It employs task-oriented learning to classify experiences and extract category-specific patterns, enhancing adaptability and interpretability.

MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing

MapStory: introduces a text-driven map animation prototyping tool, with Scene Breakdown Agent (parses script), Map Animation Researcher Agent (retrieves geospatial data), and Map Animation Modules (camera, highlight, animated elements), that generates editable map animations from natural language scripts.
The tool leverages an agentic LLM architecture to produce a scene breakdown and grounds the script in factual geospatial data using web search and APIs.
MapStory supports human-in-the-loop editing through an interactive timeline editor and properties panel for fine-grained control and rapid iteration.

LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

LaMDAgent (Language Model Developing Agent): introduces an autonomous framework using an Agent (LLM-based selector) to iteratively construct and optimize post-training pipelines by selecting from Predefined Action Types (available operations) and an Object Pool (available resources), evaluating the resulting Model (target LLM) based on a Score (performance metric), and updating its Memory (stores experiences).
The framework automates the post-training pipeline design process by iterating through action enumeration, selection, model evaluation, and memory update steps.
This agent-based approach reduces the need for specialized knowledge and human intervention in discovering effective model improvement strategies.

Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

Future System: introduces a metadata management system and a hierarchical KVC caching system, featuring a reuse-optimized metadata caching scheme, a workload-aware index structure, and a hotness-aware data placement strategy to optimize KVC management for LLM prefix prefilling.
The proposed system aims to minimize time to first token for long-context inference by efficiently handling range queries and random get queries.
The approach is designed to leverage the unique high reusability and mixed sequential-random access patterns observed in KVC prefix prefill workloads.

Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

Co-Saving: introduces a resource-aware multi-agent collaboration system leveraging experiential knowledge to enhance efficiency and quality, including Multi-Agent System (Collaborative structure), Agents (Individual LLM entities), Experiential Knowledge (Historical task data), Shortcuts (Learned instructional transitions), Reference Chain (Historical successful trajectory), Inference Chain (Current task execution), Shortcut Filtering (Selecting effective shortcuts), Shortcut Formalization (Graph representation), Shortcut Evaluation (Scoring shortcuts), Cost Design (Time and token metric), Emergency Factor (Dynamic value/cost weighting), and Force Termination Mechanism (Prevents resource exhaustion).
The system utilizes shortcuts mined from historical successful trajectories to bypass redundant reasoning steps and accelerate problem-solving in familiar contexts.
A dynamic emergency factor and force termination mechanism are integrated to manage resource consumption and prevent exhaustion during task execution.

Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

LLM-ABM Framework: introduces a method for large-scale urban mobility simulation by integrating LLM with Agent-Based Modeling, including Data Collection, Large Language Model (LLM), Agent Profile, Agent Schedule, Routine Allocation, Occasional Locations, and Multi-Transit Route components.
The framework leverages LLM to generate diverse and realistic synthetic population profiles and personalized agent schedules.
Agent locations are allocated based on grid data and Points of Interest, and personalized routes are generated using a multi-criteria routing algorithm.

27th May 2025

STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

STRATUS: introduces a multi-agent system for autonomous Site Reliability Engineering (SRE) of cloud services, with Detection Agent (Identifies failures), Diagnosis Agent (Determines root cause), Mitigation Agent (Executes mitigation plans), Undo Agent (Executes undo sequence), State Machine (Orchestrates agents control flow), Agent-Computer Interfaces (ACI) (Enables environment interaction), Toolset (Provides interaction capabilities), Observability tools (Query telemetry, states), Command-line tools (Execute commands, change states), Oracles (Validate system health, terminate), and Transactional Non-Regression (TNR) (Safety specification), designed to autonomously detect, localize, analyze, and mitigate cloud system failures.
The system organizes specialized agents in a state machine and formalizes a safety specification called Transactional No-Regression (TNR) to enable safe exploration and iteration during mitigation.
STRATUS utilizes Agent-Computer Interfaces (ACI) and a comprehensive toolset, including observability and command-line tools, to interact with the cloud environment and validate actions using Oracles.

Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

AIDSAFE (Agentic Iterative Deliberation for Safety Reasoning): introduces a multi-agent framework for generating policy-embedded Chain-of-Thought data, including Initialization, Deliberation, and Refinement stages with dedicated agents.
The framework leverages collaborative reasoning among Deliberation Agents and post-processing by a Refiner Agent to produce high-quality, policy-adherent CoTs and responses from an Input Query and Safety Policies.
This approach aims to improve LLM safety generalization and jailbreak robustness by providing superior data for supervised fine-tuning.

BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

BehaviorSFT: introduces a training strategy using behavioral tokens to condition pre-trained foundation LLMs for dynamic behavioral selection across the reactive-proactive spectrum, evaluated on the BehaviorBench dataset.
The approach leverages supervised fine-tuning to enable implicit contextual behavior assessment and behavior-conditioned generation for clinical agents.
BehaviorSFT aims to improve the balance between helpful proactivity and necessary restraint in LLM responses for healthcare applications.

AI-Supported Platform for System Monitoring and Decision-Making in Nuclear Waste Management with Large Language Models-25367

Multi-agent Retrieval-Augmented Generation (RAG) system: introduces a platform for nuclear waste management decision-making with a Multi-agent System (Collaboration) including Regulatory Compliance Agent (Checks regulations), Safety & Environmental Agent (Assesses risks), and Documentation & Reporting Agent (Compiles reports), leveraging Retrieval-Augmented Generation (RAG) (Retrieval and generation) with LLM (Llama 3.2) (Base language model), Embeddings (mxbai-embed-large-v1) (Generates semantic vectors), and Document Retrieval (Retrieves relevant documents) accessing Regulatory Compliance Database (Stores regulatory documents) and Safety & Environmental Database (Stores safety/environmental data).
The system employs a structured 10-round discussion model for agents to iteratively refine assessments and ensure document-grounded responses.
Evaluation metrics like Context Relevance Distribution and Agent Agreement Rate demonstrate the framework's effectiveness in maintaining factual grounding and decision consistency.

Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making

Catfish Agent Framework: introduces a multi-agent system with Moderator Agent, Catfish Agent, Expert Agent, Team Leader, Team Member, and Summary Agent components to disrupt silent agreement in clinical decision making.
The framework employs complexity-aware and tone-calibrated interventions by the Catfish Agent to stimulate deeper reasoning and prevent premature consensus.
Evaluations show the method improves diagnostic accuracy on medical Q&A and VQA benchmarks compared to single- and multi-agent baselines.

Robust Hypothesis Generation: LLM-Automated Language Bias for Inductive Logic Programming

LLM-Automated Language Bias for Inductive Logic Programming Framework: introduces a novel framework for robust hypothesis generation by integrating LLMs with ILP, including a LLM-Based Multi-agent System (Generates language bias), Translator agent (Transforms text to facts), Language Bias (Structured symbolic vocabulary), Facts (Symbolic data representation), ILP Solver (Learns interpretable rules), and Optimal Hypothesis (Final learned rules).
The framework utilizes a multi-agent LLM system (Actor and Critic agents) to automate the generation of the language bias (predicate system) directly from raw text.
This automated symbolic grounding guides a Translator agent to convert text into facts for an ILP solver, which then learns interpretable rules as the optimal hypothesis.

Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

EXTAGENTS: introduces a multi-agent framework for scaling external knowledge input beyond LLM context windows, featuring Seeking Agents (Process input chunks), Reasoning Agent (Synthesize information, generate output), Global Knowledge Synchronization (Agents share and rank information), and Knowledge-Accumulating Reasoning (Reasoning agent integrates information iteratively).
The framework partitions massive input into chunks processed by Seeking Agents, whose outputs are shared and ranked via global knowledge synchronization.
A Reasoning Agent then iteratively integrates the synchronized information through knowledge-accumulating reasoning to produce the final output.

Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery

FUAS-Agents: introduces an autonomous agent system leveraging multimodal LLMs for Focused Ultrasound Ablation Surgery treatment planning, including Planner Agent (interprets instructions, decomposes tasks), Executor Agent (performs specific tasks), Strategy Agent (generates treatment plans), Optimizer Agent (refines outputs, integrates results), and Memory Module (integrates medical resources, manages data).
The system integrates patient profiles and MRI data, orchestrating specialized medical AI tools for segmentation, dose prediction, and clinical guideline retrieval to generate personalized treatment plans.
Evaluated in a uterine fibroid scenario, the generated plans demonstrate high completeness, accuracy, fluency, and clinical compliance according to human expert assessment.

Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

Dataset Generation Framework: introduces, "a multi-agent pipeline", with Users (Simulated input), Dialogue Generation Controller (Orchestrates workflow), User Simulator (Generates user questions), Out-of-Context Detector (Validates questions), and QA LLM (Responds to questions), where "the framework generates synthetic dialogues embedding sociodemographic attributes for evaluating LLM adaptation".
The pipeline simulates user-LLM interactions, with a user simulator generating profile-aligned questions and an out-of-context detector ensuring question validity.
This agent-based approach creates a controlled dataset enabling assessment of LLM behavioral consistency when user attributes are provided explicitly or implicitly.

PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

PEDANTIC-based Definiteness Examination: introduces PEDANTIC Dataset (corpus of patent claims with indefiniteness annotations), Dataset Creation Pipeline (automatic process using LLMs to build PEDANTIC), Logistic Regression Model (baseline prediction model), LLM Agent Model (LLM-based prediction model using tools), Binary Classification (evaluates definite/indefinite prediction), Multi-Label Classification (evaluates indefiniteness category prediction), and Pairwise Reasoning Judge (LLM-as-Judge evaluates reasoning quality), presenting a dataset and evaluation framework for automatic patent claim definiteness examination.
The PEDANTIC Dataset contains 14k US patent claims annotated with indefiniteness reasons extracted using an automatic pipeline leveraging Large Language Models.
The framework evaluates Logistic Regression and LLM Agent models on binary and multi-label classification tasks, and uses an LLM-as-Judge to assess the quality of generated indefiniteness reasoning.

Large Language Models Miss the Multi-Agent Mark

MAS LLMs (Multi-Agent Systems of Large Language Models): introduces a critique of current MAS LLMs, highlighting issues with Agents (lack native social behaviour), Environment (often textual, LLM-centric), Coordination (often sequential, orchestrated), Communication (often natural language), Memory (lack long-term persistency), and Asynchronicity (often absent).
The paper argues that current MAS LLMs often fail to embody fundamental multi-agent system characteristics by overemphasizing LLMs and overlooking established MAS literature.
It advocates for better integrating MAS concepts like native social agents, non-LLM-centric environments, asynchronous communication protocols, and quantifiable emergent behaviours.

Complex System Diagnostics Using a Knowledge Graph-Informed and Large Language Model-Enhanced Framework

LLM-Informed Diagnostic Framework: introduces a novel approach integrating KGs and LLMs for complex system diagnostics, featuring Model Construction, KG-DML, Model Interaction, and an LLM Agent with diagnostic tools.
The framework automates DML model construction from system documentation using an LLM-based workflow and stores this structured logic in a KG-DML.
An LLM agent facilitates interactive diagnostics by interpreting user queries and invoking KG-based tools for upward/downward reasoning and Graph-RAG retrieval to generate diagnostic insights.

PACT: A Contract-Theoretic Framework for Pricing Agentic AI Services Powered by Large Language Models

PACT: introduces a contract-theoretic framework for pricing cloud-based agentic AI services, modeling task-dependent multi-dimensional QoS, costs (including liability), and user types to design contracts satisfying individual rationality and incentive compatibility.
The framework models QoS based on objective response time and subjective user satisfaction, accounting for computational, infrastructure, and potential liability costs for the service provider.
Through contract-based selection, PACT enables users to receive tailored service offerings aligned with their needs while ensuring incentive compatibility and individual rationality under information asymmetry.

Creativity in LLM-based Multi-Agent Systems: A Survey

LLM-based Multi-Agent Systems: introduces a survey on creativity in these systems, outlining a structured framework with Input (user input text/image), Workflow (Three-stage creative process), Planning (formulate objectives, structure tasks), Process (implement tasks, coordinate interaction), Decision Making (evaluate options, determine outcome), Technique (methods for idea generation/refinement/synthesis), Persona (agent roles and profiles), and Output (generated text/image content).
The framework details how agents, guided by personas and employing various techniques, navigate a three-stage workflow to transform user inputs into creative outputs.
The survey maps techniques, datasets, and evaluation methods, highlighting how collaborative structures and agent proactivity influence creative potential in these systems.

Simulating Ethics: Using LLM Debate Panels to Model Deliberation on Medical Dilemmas

ADEPT (AI Deliberative Ethics Protocol Toolkit): introduces a system for simulating multi-perspective ethical debates using LLM personas, with AI Persona Specs, Scenario & Options, and Model Config inputs managed by an Orchestrator utilizing an OpenAI o3 model.
The framework orchestrates structured debates through phases, logging interactions and votes into Debate Outputs for transparency and audit.
A Summariser Agent processes the debate outputs to provide an executive summary, facilitating the analysis of how different ethical perspectives influence deliberation outcomes.

Creativity in LLM-based Multi-Agent Systems: A Survey

LLM-based Multi-Agent Systems: introduces a survey on creativity in these systems, outlining a structured framework with Input (user input text/image), Workflow (Three-stage creative process), Planning (formulate objectives, structure tasks), Process (implement tasks, coordinate interaction), Decision Making (evaluate options, determine outcome), Technique (methods for idea generation/refinement/synthesis), Persona (agent roles and profiles), and Output (generated text/image content).
The framework details how agents, guided by personas and employing various techniques, navigate a three-stage workflow to transform user inputs into creative outputs.
The survey maps techniques, datasets, and evaluation methods, highlighting how collaborative structures and agent proactivity influence creative potential in these systems.

Simulating Ethics: Using LLM Debate Panels to Model Deliberation on Medical Dilemmas

ADEPT (AI Deliberative Ethics Protocol Toolkit): introduces a system for simulating multi-perspective ethical debates using LLM personas, with AI Persona Specs, Scenario & Options, and Model Config inputs managed by an Orchestrator utilizing an OpenAI o3 model.
The framework orchestrates structured debates through phases, logging interactions and votes into Debate Outputs for transparency and audit.
A Summariser Agent processes the debate outputs to provide an executive summary, facilitating the analysis of how different ethical perspectives influence deliberation outcomes.

Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems

LLM-based Multi-Agent System: introduces a framework to investigate herd behavior in multi-agent systems, featuring LLM-based Agents (autonomous decision makers) receiving Question Input (initial task) and Peer Information Input (peers' responses), utilizing a Confidence Mechanism (internal certainty assessment) for Response Generation (initial answer) and revision, modulated by Peer Information Presentation (format and order), Peer Persona (peer attributes), and System Prompt (behavioral instructions).
The system simulates agents interacting and making decisions, where herd behavior is measured by the flip rate, the tendency of agents to change their initial response based on peer input.
Experiments manipulate agent self-confidence, perceived peer confidence, and peer information presentation factors to understand their impact on conformity and collective outcomes.

CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building

CXXCrafter: introduces an LLM-based agent system for automated C/C++ software building, including a Parser Module (Extracts build-related information), a Generator Module (Generates/modifies Dockerfile), and an Executor Module (Executes Dockerfile, captures errors).
The system leverages LLMs to dynamically manage complex build processes by iteratively addressing issues based on feedback.
CXXCrafter achieves a 78% build success rate across 752 C/C++ projects by handling dependency management, diverse build systems, and error diagnosis.

Agent-Environment Alignment via Automated Interface Generation

ALIGN (Auto-Aligned Interface Generation): introduces a framework that automatically generates interfaces to alleviate agent-environment misalignment, utilizing an Analyzer (Identifies misalignments) and an Optimizer (Generates/refines interface) to produce an interface with INFERRULES (Static information alignment) and WRAPSTEP (Dynamic observation enhancement) that mediates interaction between the Agent (Interacts with environment) and Environment (Provides state/feedback).
The framework operates iteratively, with the Analyzer identifying misalignments from failed trajectories and the Optimizer generating an improved interface based on these findings.
The ALIGN-generated interface enhances both static environment information and step-wise observations, improving agent performance across diverse interactive tasks without modifying agent or environment code.

AITEE - Agentic Tutor for Electrical Engineering

AITEE (Agentic Tutor for Electrical Engineering): introduces an agent-based tutoring system for electrical engineering, with Circuit (Input image), Detection of components and connections (Processes circuit image), Conversion into Graph/Netlist (Creates textual representation), Simulation with Spice (Validates circuit calculations), Scripts (Lecture material knowledge base), Relevant context in vector database (Stores script embeddings), Retriever (RAG) (Retrieves relevant script context), Large Language Model (Core AI tutor), LLM-Instructions (Guides Socratic dialogue), Students (User), Prompt (Student query), and Output (Tutor response) components, designed to provide interactive and personalized learning experiences.
The system processes hand-drawn or digital circuit diagrams, converts them into a machine-readable format, and uses a graph-based similarity measure for context retrieval from lecture materials.
AITEE employs a Socratic dialogue approach guided by LLM instructions and validates calculations using SPICE simulation to foster learner autonomy and ensure accuracy.

Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

ATD (Adaptive Text Dreamer): introduces a dual-branch LLM self-guided imagination policy for VLN, with Left Brain (State Estimation LLM), Right Brain (Imagination LLM), Q-Former, LLM Encoder, LLM Decoder, State Grounded Cross-Attention (SGCA), Graph-based Navigation Policy, Latent Embedding Injection, Multi-head Cross-Attention (MCA), Graph-aware Self-Attention (GASA), and MLP components.
The framework leverages language-based imagination, employing a left brain for state estimation and a right brain for imaginative prediction, constrained by the estimated state.
Imagined textual representations are integrated into a graph-based navigation expert via latent embeddings and cross-attention to guide action decisions.

RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

RepoMaster: introduces an autonomous agent framework for exploring and understanding GitHub repositories, consisting of Repository Search, Hierarchical Repository Analysis, and Autonomous Exploration & Execution.
Hierarchical Repository Analysis builds structural representations like HCT, MDG, and FCG to identify Core components for efficient understanding.
Autonomous Exploration & Execution uses Context-aware Code Exploration with Exploration tools and Context-aware Information Selection in an Interactive Feedback-based Execution loop to solve tasks.

MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems

MedSentry: introduces a benchmark and evaluation pipeline for medical LLM multi-agent systems, analyzing the safety risks posed by malicious agents within different architectural topologies.
The framework evaluates four representative multi-agent topologies (Centralized, Decentralized, Layers, SharedPool) by injecting a Dark Personality Agent and assessing system safety using an Enforcement Agent defense mechanism.
MedSentry provides a rigorous evaluation framework and practical defense strategies for designing safer LLM-based multi-agent systems in medical domains.

MT-MOL: Multi Agent System with Tool-based Reasoning for Molecular Optimization

MT-MOL (Multi Agent System with Tool-based Reasoning for Molecular Optimization): introduces a multi-agent framework for molecular optimization featuring Analyst agents (Select relevant tools), a Scientist agent (Generates molecule/reasoning), a Verifier agent (Validates consistency), and a Reviewer agent (Provides feedback), utilizing Tool sets (Domain-specific functions), Top-k data (Reference molecules), and SMILES history (Previous designs).
The system integrates domain-specific tools and structured reasoning through agent interactions to produce interpretable and chemically grounded molecular designs.
An iterative generation and review process, including consistency validation and tool-informed feedback, refines the molecular candidates towards the design objective.

Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

MAMMQA: introduces a multi-agent framework for multimodal question answering with Modality Expert Agent (Extracts modality specific insights), Cross Modal Synthesis Agent (Synchronises information across modalities), and Aggregator Agent (Synthesizes outputs, resolves disagreements), splitting reasoning into interpretable stages.
The framework employs specialized agents for modality-specific extraction, cross-modal synthesis, and evidence-grounded aggregation without fine-tuning.
This modular design enhances interpretability, robustness, and zero-shot generalization by allowing agents to operate within their expertise domains.

ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools

ChemHAS (Chemical Hierarchical Agent Stacking): introduces a hierarchical agent stacking method with Initial Tools (Predefined chemistry tools), AI Agent (LLM-based agent), Agent Tool (Tool or agent), Global Tool Library (Collection of tools/agent tools), Stacking Process (Hierarchical combination method), Reinforcement Process (Two-stage optimization), ReAct method (Agent reasoning and tool use), and Stacking Agent (Enhanced tool/agent) to enhance chemistry tools by reducing prediction errors.
The Stacking Process involves Warmup Self Agent Stacking and Hierarchical Agent Stacking, iteratively building and evaluating agent tools and storing them in the Global Tool Library.
The resulting Stacking Agent leverages the complementary strengths of stacked tools, guided by a two-stage reinforcement process, to achieve improved performance on chemistry tasks.

Can Agents Fix Agent Issues?

AGENTISSUE-BENCH: introduces the first reproducible benchmark for agent issue resolution, comprising issue description (User reported problem), buggy version (Codebase commit), developer-committed patch (Ground truth fix), failure-triggering tests (Reproduce issue), and docker environment (Executable container).
Built from 50 reproducible real-world GitHub issues, the benchmark enables evaluating state-of-the-art software engineering agents.
Evaluation on AGENTISSUE-BENCH reveals that current SE agents have limited effectiveness in resolving agent-specific issues.

RRO: LLM Agent Optimization Through Rising Reward Trajectories

RRO (Reward Rising Optimization): introduces a scalable process supervision framework for LLM agents, including LLM Agent (Policy Model), Supervised Fine-tuning (Initial training on expert data), Reward Rising Sampling (Dynamically explores next actions), Process Reward Estimation (Estimates step reward via rollouts), Agent Optimization (DPO) (Optimizes policy using preferences), and Preference Data (Pairs of preferred/rejected actions).
The framework dynamically adjusts next action exploration based on rising reward trends to efficiently collect high-quality preference data for training.
RRO prioritizes reasoning steps with increasing rewards, reducing exploration cost while improving performance on multi-step tasks.

E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing

Brity Automation: introduces an end-to-end automation system for financial expense processing, integrating a Data Input Layer, Intelligent Processing Layer with OCR/IDP Module, Policy-based Classification Engine, AI Flow Module (Gen AI Integration), and Workflow Engine, a User Interaction & Learning Layer with Automation Agent Interface and Human-in-the-Loop (HITL) Mechanism, and a Backend Infrastructure Layer with Brity Automation Orchestrator, Database, and API Gateway.
The system automates document recognition, policy-based classification, intelligent exception handling using generative AI, and incorporates human judgment for continuous learning.
This approach aims to overcome limitations of traditional RPA by handling unstructured data and complex exceptions through human-AI collaboration.

SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution

SPA-RL: introduces Stepwise Progress Attribution (SPA), with LLM Agent, Environment, Progress Estimator, Grounding Signal, Fused Intermediate Reward, and PPO Update components, which is a reward redistribution framework reinforcing LLM agents by decomposing delayed rewards into stepwise contributions.
The framework trains a Progress Estimator to predict each step's contribution to task completion, combining this with a Grounding Signal for action executability to form a Fused Intermediate Reward.
This dense, goal-oriented Fused Intermediate Reward is then used within a PPO Update to train the LLM Agent, improving performance on long-horizon tasks with sparse rewards.

Hierarchical Instruction-aware Embodied Visual Tracking

HIEVT (Hierarchical Instruction-aware Embodied Visual Tracking): introduces a hierarchical tracking agent with LLM-based Semantic-Spatial Goal Aligner and RL-based Adaptive Goal-Aligned Policy components, designed for user-centric embodied visual tracking.
The LLM-based Semantic-Spatial Goal Aligner translates user instructions into spatial goals via Semantic Parsing, Spatial-Goal Generation, and Retrieval-Augmented Goal Correction.
The RL-based Adaptive Goal-Aligned Policy uses a Visual Foundation Model, Goal-State Aligner (with CNN and Reward Prediction), and Recurrent Policy Network (with LSTM and Actor Network) to align agent actions with the spatial goals for precise tracking.

GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning

GIFARC: introduces a data synthesis pipeline that transforms raw GIFs into analogy-grounded ARC-style tasks, utilizing a VLM to extract visual abstractions and LLMs to generate task sketches and executable tasks, including input-output pairs, analogy labels, and solution programs.
The pipeline processes GIFs through stages of visual abstraction, task sketching, and executable task generation to create a dataset that embeds human-intuitive analogies into ARC-style problems.
The generated dataset aims to guide AI agents, particularly LLMs, to adopt an analogic approach for solving ARC tasks, aligning their reasoning more closely with human intuition.

LLM-Guided Reinforcement Learning: Addressing Training Bottlenecks through Policy Modulation

ULTRA (Large Language Model-Guided Policy Modulation Framework): introduces a framework that leverages LLMs to identify critical states from sub-optimal trajectories and provide action suggestions and rewards for policy refinement.
The framework's Identification component uses an LLM and a state interpretation function to pinpoint critical states in historical agent trajectories.
Its Improvement component refines the RL policy by incorporating LLM-suggested actions from a lookup table and LLM-generated rewards at critical states.

MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning

MIRROR (Multi-agent Intra- and Inter-Reflection for Optimized Reasoning): introduces a multi-agent framework with Planner Agent, Tool Agent, and Answer Agent, integrating Intra-reflection and Inter-reflection mechanisms supported by Long-Term Memory and Short-Term Memory for enhanced tool learning.
The framework employs intra-reflection for proactive error prevention within each agent before execution and inter-reflection for corrective learning and strategic adjustment based on task outcomes.
This dual-reflection approach systematically leverages LLM capabilities to improve task decomposition, tool selection, and answer generation in complex multi-agent workflows.

CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models

CoderAgent: introduces a LLM-based agent framework to simulate student programming processes, with Memory (Stores student proficiency), Tools (Interface with compilers), Planning & Action (Decision-making core), and Reflection (Evaluates generated code) components.
The framework simulates iterative coding by capturing cognitive states, using a Programming Tree of Thought for planning, and reflecting on generated code.
CoderAgent aims to provide interpretable insights into learning trajectories and accurate simulations without relying on large-scale real data.

Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

XpandA (Expand-Agent): introduces a multi-agent framework with Dynamic Chunking (Splits input text), Explorer Agents (Process text chunks), Decider (Decides next action), Shared Information Memory (Centralized knowledge store), Question-driven Workflow (Guides agent communication), Selective Replay (Revisits relevant chunks), Unsolved Problem Tracer (Tracks unsolved questions), and Information (Stores gathered answers), designed for robust long-context processing.
The framework dynamically partitions long texts, uses a question-guided protocol to update shared memory, and selectively replays partitions based on question-information state tracking.
XpandA demonstrates feasibility for processing ultra-long sequences up to 1M tokens, achieving performance improvements and inference speedup over baselines.

26th May 2025

Ten Principles of AI Agent Economics

Ten Principles of AI Agent Economics: introduces a foundational framework for understanding how AI agents make decisions, influence social interactions, and participate in the broader economy, with Altruistic AI Agent, Survival-Driven AI Agent, Human Agent, Environment, and Human-AI Multi-Agent Hierarchical Society components.
The paper outlines ten principles drawing on economics, decision theory, and ethics to explore fundamental questions about AI agents' integration into human systems.
The framework distinguishes between altruistic and survival-driven AI agents and models their interaction within environments and a hierarchical human-AI society.

Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

Project Riley: introduces a multimodal multi-agent LLM architecture for emotional reasoning, featuring Input (Receives user query/context) processed by LLM vision model (Image processing) and LLM text model (Text generation/reasoning), distributed to Emotional agents (Five distinct emotion agents) with Emotion's history (Separate history per agent) for Multi-round processing (Iterative agent dialogue), culminating in Voting and Analysis (Agents evaluate/vote) and Final Synthesis (Synthesizes final response) for the Final response (Output to user).
The architecture simulates reasoning influenced by five distinct emotional states (Joy, Sadness, Fear, Anger, Disgust) through structured multi-round dialogues and a final synthesis mechanism.
The system integrates textual and visual LLMs, advanced reasoning, and self-refinement processes to generate emotionally informed responses.

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

SWE-rebench Automated Pipeline: introduces a novel, automated, and scalable pipeline for continuously extracting real-world interactive software engineering tasks from GitHub repositories, comprising Preliminary Task Collection, Automated Installation Instructions Configuration, Execution-based Installation Verification, and Automated Instance Quality Assessment, resulting in the SWE-rebench Dataset and SWE-rebench Benchmark used within a standardized Evaluation Framework employing ReAct-style Scaffolding, a Terminal Environment, Special Tools, and an LLM agent.
The pipeline addresses challenges in training data availability and evaluation reliability for LLM-based software engineering agents by providing a large-scale, diverse, and continuously updated dataset and a contamination-free benchmark.
The standardized evaluation framework enables transparent and fair comparisons of LLM agent performance on interactive software engineering tasks, mitigating issues like data contamination and scaffolding variability.

ALITA: GENERALIST AGENT ENABLING SCALABLE AGENTIC REASONING WITH MINIMAL PREDEFINITION AND MAXIMAL SELF-EVOLUTION

ALITA: introduces a generalist agent with minimal predefinition and maximal self-evolution, featuring Manager Agent (central coordinator), Web Agent (external information), MCP Brainstorming (plan tools), Script Generating Tool (generates code), Code Running Tool (executes code), Environment Management (manages environments), MCP Box (stores MCPs), and CodeReAct Loop (iterative process).
The Manager Agent orchestrates the CodeReAct loop, utilizing the Web Agent for information and the MCP creation tools to generate, execute, and store new capabilities as MCPs.
This design allows ALITA to autonomously evolve its capabilities through continuous MCP integration, reducing dependence on manual predefinition.

MASKSEARCH: A Universal Pre-Training Framework to Enhance Agentic Search Capability

MASKSEARCH: introduces a pre-training framework to enhance LLM agentic search capabilities using the RAMP Task (pre-training objective), trained via SFT (supervised fine-tuning) or RL (reinforcement learning), leveraging an LLM (core language model) interacting with a Search Tool (external search interface), Retriever (knowledge retrieval module), and Knowledge Corpus (external knowledge base), supported by Agent-Based CoT Construction (SFT data generation method), Self-Evolve Distillation (iterative data scaling), Curriculum Learning (progressive training strategy), and an RL Reward System (reinforcement signal).
The framework trains models on the Retrieval-Augmented Mask Prediction (RAMP) task, where the model learns to use search tools to fill masked spans in text.
Training involves a two-stage approach combining pre-training on RAMP with supervised fine-tuning or reinforcement learning on downstream tasks, demonstrating improved performance on open-domain question answering.

syftr: Pareto-Optimal Generative AI

syftr: introduces a framework that performs multi-objective search over agentic and non-agentic RAG flows, composed of Synthesizing LLM, Reranker, Embedding Model, Splitter, HyDE, Retriever, Prompt, Dynamic Few-Shot Retriever, and Additional Context components, to find Pareto-optimal flows balancing task accuracy and cost.
The framework utilizes Bayesian Optimization with a novel early-stopping mechanism to efficiently explore a vast search space of RAG configurations.
syftr identifies flows that are significantly cheaper or more accurate than baseline configurations across multiple RAG benchmarks.

ON PATH TO MULTIMODAL HISTORICAL REASONING: HISTBENCH AND HISTAGENT

HistAgent: introduces a domain-specialized AI agent for historical reasoning, with a Manager Agent (Central coordinator) orchestrating specialized agents including Text WebBrowser Agent (Web search/parsing), Image Information Agent (Image search/analysis), Literature Search Agent (Scholarly search/citation), File Processing Agent (Handle non-HTML files), OCR Agent (Extract text from images), Speech Recognition Agent (Convert audio to text), Translator Agent (Translate text), and Video Agent (Extract frames from video).
HistAgent integrates these modular tools and a ReAct-style loop to process multimodal inputs and generate cited responses grounded in historical sources.
The agent is evaluated on HistBench, a new benchmark for historical reasoning, and demonstrates superior performance compared to generalist LLMs and agents.

THINK: Can Large Language Models Think-aloud?

THINK (Testing Higher-order Notion of Knowledge): introduces a multi-agent, feedback-driven evaluation framework for assessing and improving LLM higher-order thinking skills using flawed math problems (Initial input data), a multi-agent evaluation stage (Parallel agent system) with agents (Evaluate problems) including Bloom-aligned agents (Assess Bloom levels) and a holistic evaluation agent (Assess quality, suggest improvements), agent feedback & ratings (Scores and suggestions), a quality assessment protocol (Metrics for quality) with a quality threshold (Success criterion), an iterative revision loop (Refinement process) involving a think-aloud process (LLM reflection) by the LLM (Revises problems) guided by "Five Keys" (Structured criteria), resulting in an improved problem set (Refined output data).
The framework uses a parallel multi-agent system to evaluate flawed math problems based on Bloom's Taxonomy and "Five Keys" criteria, generating scores and structured feedback.
An iterative revision loop, guided by agent feedback, prompts the LLM to refine problems via a "think-aloud" process until a quality threshold is met, enabling deeper analysis of reasoning and revision behaviors.

Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers

EXSEARCH (exploratory search framework): introduces an agentic search framework, empowering an LLM with thinking, search, and recording actions, trained via a self-incentivized Generalized Expectation-Maximization algorithm.
The framework enables the LLM to iteratively explore search trajectories, retrieve relevant documents using an external retriever, and extract fine-grained evidence.
A re-weighted trajectory learning process in the M-step, guided by importance weighting, progressively improves the LLM's search and reasoning capabilities.

Agentic AI Process Observability: Discovering Behavioral Variability

Agentic AI Process Observability Approach: introduces a method to enhance developer observability of agent behavior variability, including trajectory files generation (Capture agent execution logs), event-log processing (Consolidate logs into event log), process and causal discovery (Analyze event log for variability), rule derivation (Generate rules for split points), static analysis (LLM analyzes rules vs spec), and reliability calculation (Assess data sufficiency for splits).
The approach leverages process and causal discovery on agent execution trajectories to identify behavioral variability and uses LLM-based static analysis to distinguish intended from unintended variability.
This method provides developers with insights into agent behavior, aiding in debugging, refining specifications, and improving control over non-deterministic AI agents.

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

TrojanStego: introduces a threat model where a Malicious Actor fine-tunes a Trojan Model (Fine-tuned LLM) and distributes it on a Public Platform, allowing the Malicious Actor to extract secrets from outputs generated by a Genuine User using an Encoding Scheme (Embeds bits via token selection) and Decoding Process (Extracts bits from output).
The core method, the Bucket Method, partitions the LLM's token vocabulary to encode binary bits into the output token sequence.
This attack allows covert data exfiltration without requiring explicit control over inference inputs or leaving obvious traces.

REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

REARANK (Reasoning Re-ranking Agent via Reinforcement Learning): introduces a large language model-based listwise reranking agent that explicitly reasons before reranking, trained using reinforcement learning and data augmentation.
The agent's architecture includes an LLM policy generating reasoning and ranking, optimized by an RL framework with a reward model and reference policy.
Data augmentation from limited annotations and a sliding window strategy enhance training efficiency and practical deployment.

Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

STeP (Self-Reflected Trajectories and Partial Masking): introduces a novel method for training LLM-based agents using Self-reflected Trajectories (Trajectories with teacher reflection/correction) and Partial Masking (Masks incorrect steps during SFT), building upon a Base LLM Agent (Initial agent) trained with SFT (Training method) on Golden Trajectories (Successful expert trajectories) and guided by an LLM Teacher (Evaluates, provides reflection/correction) interacting with an Environment (Agent interacts, provides feedback).
The method synthesizes self-reflected trajectories by having a teacher LLM evaluate a base agent's actions in real-time and provide corrections for errors.
Partial masking is applied during fine-tuning to prevent the agent from learning from the identified incorrect steps in the augmented trajectories.

WEBCOT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback

WEBCOT: introduces a framework that enhances web agent reasoning by reconstructing inference-time processes into chain-of-thought rationales used to train the agent language model, including reflection & lookahead, branching, and rollback components.
The framework leverages a language model to interact with a dynamic web environment using actions and observations, guided by the distilled reasoning patterns.
By distilling specific reasoning skills into the backbone LLM via fine-tuning, WEBCOT significantly improves performance on web agent tasks across multiple benchmarks.

Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents

Framework: introduces a training-free approach for student simulation, including cognitive prototype construction, behavior prediction, and solution simulation using πdesc, πnode, πedge, πlocal, πglobal, πpred, πrefine, and πvalue components.
The framework constructs a knowledge graph-based cognitive prototype from past learning records to predict student behavior on new tasks.
It employs a beam search-based self-refinement process to generate realistic student solutions consistent with predicted behavior.

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

MLR-Bench: introduces a comprehensive benchmark evaluating AI agents on open-ended machine learning research, comprising MLR-Bench Tasks, MLR-Judge, and MLR-Agent.
MLR-Bench supports stepwise evaluation through MLR-Agent's stages (Idea Generation, Literature Review, Proposal Generation, Experimentation, Paper Writing) and end-to-end evaluation, with MLR-Judge (using LLM Judges and Review Rubrics) automating assessment.
Evaluation highlights that while agents can generate ideas and papers, the Experimentation Stage often produces fabricated results, posing a significant challenge to scientific reliability.

Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

MRA-CIR: introduces a zero-shot composed image retrieval framework that generates training triplets using Automatic Triplets Generation and fine-tunes a Vision-Language Model (VLM) using VLM Finetuning with InfoNCE Loss.
The Automatic Triplets Generation process includes Moderate Similarity Selection using a Pre-trained VLM to find image pairs and Modifying Text Generation via the Multimodal Reasoning Agent (MRA), which is based on an MLLM (MiniCPM-VL-2_6), to describe the transformation.
The VLM Finetuning utilizes the VLM's Q-Former to extract features and is trained with InfoNCE Loss to directly align composed queries and target images, bypassing intermediate textual representations.

EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

EMAC+ (Embodied Multimodal Agent for Collaborative Planning with VLM+LLM): introduces a novel embodied multimodal agent that collaboratively integrates a VLM Agent (Processes visual input) and an LLM Expert (Generates/refines plans) via a bidirectional training paradigm, utilizing PDDL (Translates visual to text), a Retrospective Feedback Mechanism (Provides execution feedback), Long-term Memory (Stores history/feedback), and an Action Mapping Dictionary (Maps text to control).
The framework dynamically refines high-level textual plans from the LLM expert using real-time visual feedback from the VLM agent executing low-level control tasks.
This approach enables the LLM expert to internalize visual environment dynamics through interactive experience, improving domain-specific comprehension and generating more accurate and feasible plans for complex robotic tasks.

SCIENCEBOARD: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

SCIENCEBOARD: introduces a realistic, multi-domain environment for evaluating multimodal autonomous agents in scientific workflows, featuring Environment (Virtual Machine), Software (Scientific applications), Agent (Computer-using agent), Evaluator (Evaluation system), Observation Space (Perception modalities), Action Space (Interaction methods), Memory (Agent's state history).
The framework provides an infrastructure enabling computer-using agents to assist in scientific workflows by interacting autonomously via GUI actions or generated code.
It includes a challenging benchmark of 169 high-quality, rigorously validated real-world tasks spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics.

Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program

LLM Agent: introduces an LLM-based agent for autonomous spacecraft control in Kerbal Space Program Differential Games, using Environment (KSPDG) for simulation, processing State observations into a User prompt, feeding it to the LLM agent which generates an LLM reply with Function calling to produce an Action controlling the spacecraft.
The approach leverages prompt engineering and fine-tuning techniques on GPT-3.5 and LLaMA models to enable the agent to interpret real-time telemetry and output control commands.
The LLM-based agent achieved second place in the KSPDG challenge, demonstrating the potential of LLMs for autonomous space operations, particularly with fine-tuning on limited data.

SECVULEVAL: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection

Multi-agent pipeline: introduces a multi-agent system for C/C++ vulnerability detection, including a Normalization Agent (Parses function to AST), Planning Agent (Summarizes, creates vulnerability checklist), Context Agent (Extracts external context symbols), Detection Agent (Detects vulnerability, identifies statements), and Validation Agent (Evaluates detection, resolves disagreement).
The pipeline processes functions through sequential agents, with LLMs powering the Planning, Context, Detection, and Validation stages.
This multi-agent approach aims to decompose the complex task of vulnerability detection into smaller, manageable steps for improved LLM performance.

Agentic Predictor: Performance Prediction for Agentic Workflows via Multi-View Encoding

Agentic Predictor: introduces a framework for efficient agentic workflow performance prediction, utilizing a Multi-View Workflow Encoder (Encodes workflow features), Decoder Networks (Reconstructs workflow inputs), Cross-Domain Unsupervised Pretraining (Refines workflow representations), Task Encoder (Encodes task description), Performance Predictor (Estimates workflow performance), and Predictor-Guided Search (Selects promising workflows).
The framework employs multi-view encoding of graph, code, and prompt features combined with cross-domain unsupervised pretraining to address workflow heterogeneity and limited labeled data.
By predicting performance, the approach enables faster and more accurate selection of optimal agentic workflow configurations compared to execution-based methods.

Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

GLIDER (Grounding Language Models as Efficient Decision-Making Agents via Offline HiErarchical Reinforcement Learning): introduces a hierarchical framework with a High-level policy (Plans sub-tasks) and a Low-level policy (Executes primitive actions) sharing an Actor-Critic (Shared model architecture) built on an LLM Backbone (Base language model) fine-tuned with LoRA (Parameter-efficient fine-tuning), trained through SFT (Behavior cloning stage), ORL (Offline RL refinement stage), and O2O (Online adaptation stage) using High-level replay buffer (Stores high-level data) and Low-level replay buffer (Stores low-level data) interacting with an Environment (Interactive task space), guided by High-Level Prompt (Guides high-level planning), Low-Level Prompt (Guides low-level execution), and Check Subtask Complete Prompt (Verifies subtask completion).
The framework decomposes complex tasks into sub-tasks planned by the high-level policy and executed as primitive actions by the low-level policy, enabling efficient exploration and learning for long-horizon tasks.
The hierarchical structure and multi-stage training pipeline, including behavior cloning and offline reinforcement learning, contribute to improved performance and generalization capabilities on interactive decision-making benchmarks.

NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

NeuSym-RAG: introduces a hybrid neural symbolic retrieval framework for PDF question answering, with Multiview Document Parsing (Parses PDF content), Relational Database (Stores structured data), Multimodal Vector Encoding (Encodes data to vectors), Vectorstore (Stores vector embeddings), LLM Agent (Plans and acts), Environment (Backend systems), Actions (Agent capabilities), and Prompt Template (Defines agent interaction).
The framework processes PDF documents into structured data and vector embeddings, enabling an LLM agent to iteratively retrieve information from both a database and a vectorstore.
This hybrid approach leverages multiple data views and retrieval strategies through executable actions to answer complex questions over semi-structured PDF content.

ReChisel: Effective Automatic Chisel Code Generation by LLM with Reflection

ReChisel (LLM-based agentic system): introduces an LLM-based agentic system with Generator (creates Chisel code), Compiler (translates Chisel to Verilog), Simulator (tests Verilog code), Inspector (collects feedback, trace, escape), Reviewer (analyzes trace/feedback, plans revision), Trace (history of iterations), Feedback (compilation/simulation results), Revision Plan (guidance for correction), Common Error Knowledge (pre-organized error fixes), and Escape Mechanism (breaks non-progress loops) components, designed to enhance Chisel code generation effectiveness.
The system iteratively refines generated Chisel code using a reflection mechanism that leverages feedback from compilation and simulation processes.
An escape mechanism is included to detect and break non-progress loops during the iterative refinement process.

Large Language Models for Planning: A Comprehensive and Systematic Survey

LLM-based Planning: introduces a comprehensive survey of methods that augment Large Language Models (processes input, generates output) with components like External Planners (generates formal plans), Memory Modules (stores, retrieves information), Validators (evaluates plans, outputs feedback), Data Sources (provides training data), Feedback Mechanisms (provides optimization signals), Decomposition Modules (breaks down tasks), External Executors (interacts with environment), and World Models (simulates environment dynamics) to enhance planning capabilities.
The survey categorizes approaches into external module augmented, finetuning-based, and searching-based methods, detailing planning definitions and evaluation frameworks.
The paper provides a systematic analysis of current advancements, challenges, and future directions in the field, serving as a resource for researchers.

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

FieldWorkArena: introduces a benchmark environment for evaluating agentic AI on real-world field work tasks, where a User downloads Input data and a Query from the Field Work Arena, an Evaluated agent performs Actions, generating an Execution log and Output, which an Evaluation program compares against Ground Truth to produce a Result.
The benchmark utilizes multimodal data including videos and documents from actual factory and warehouse settings.
Tasks are categorized into Planning, Perception, and Action, designed to assess agent capabilities in complex, dynamic environments.

DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

DoctorAgent-RL: introduces a multi-agent collaborative reinforcement learning framework, with Doctor Agent (optimizes questioning strategy), Patient Agent (simulates patient responses), Consultation Evaluator (provides multi-dimensional rewards), Supervised Fine-tuning (establishes baseline capabilities), Reinforcement Learning (optimizes strategy via interaction), and Dynamic Turn Budget Training Strategy (RL training strategy for efficiency), that models medical consultations as a dynamic decision-making process.
The framework enables the doctor agent to autonomously develop clinically-aligned questioning strategies through interactions guided by the evaluator's reward mechanism.
It utilizes the newly constructed MTMedDialog dataset for training and evaluation and demonstrates superior performance in multi-turn reasoning and diagnostic accuracy.

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

AgentRecBench: introduces, "benchmarking LLM agent-based personalized recommender systems", with Recommending Agents (LLM-based agents), Textual Experiment Environment (simulated interaction platform), U-R-I Network (user-review-item data structure), Datasets (source data), Standardized Query Functionality (environment interaction interface), Dynamic Data Visibility Control (data access management), Dynamic Planning (task decomposition module), Complex Reasoning (decision-making module), Tool Utilization (environment interaction module), Memory Management (experience storage/retrieval), and LLM (core language model), which provides a comprehensive benchmark and modular framework for evaluating agentic recommender systems.
The benchmark includes a textual environment simulator equipped with multi-domain datasets and a standardized agent development framework.
The framework facilitates rapid prototyping and systematic testing of recommendation agents across diverse scenarios and tasks.

Multi-Agent Collaboration via Evolving Orchestration

Puppeteer: introduces a multi-agent collaboration framework with a centralized orchestrator (Puppeteer) that dynamically directs LLM-based agents (Puppets) based on the evolving task state, using a Policy for agent selection and Orchestration for sequencing.
The framework employs Reinforcement Learning, guided by a Reward function from the Environment, to adaptively evolve the Puppeteer's Policy, optimizing agent selection and pruning for improved performance and efficiency.
This dynamic orchestration fosters the emergence of compact, cyclic reasoning structures among agents, enhancing collaborative effectiveness and reducing computational cost compared to static multi-agent systems.

LLM-Agent-Controller: A Universal Multi-Agent Large Language Model System as a Control Engineer

LLM-Agent-Controller: introduces a multi-agent large language model system for control engineering problems, integrating a central Controller Agent with specialized auxiliary agents and a Supervisor for coordination.
The system leverages components like Retriever, Researcher, Reasoner, Planner, Debugger, Communicator, Critic, and Memory agents to enhance robustness, versatility, and efficiency in solving control theory tasks.
The framework is designed for user-friendly interaction, enabling users without prior control theory knowledge to input problems in natural language and receive complete solutions.

AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare

Multi-Agent Framework for AMQA Construction: introduces AMQA, an Adversarial Medical Question-Answering dataset, with Clinical Vignette Filtering (Filters vignettes), Adversarial Variant Construction (Constructs variants), Manual Quality Control (Reviews quality), Generation-Agent (Generates descriptions), Fusion-Agent (Integrates descriptions), and Evaluation-Agent (Evaluates bias trigger) components, designed for automated, large-scale bias evaluation of LLMs in medical QA.
The framework generates adversarial patient descriptions by varying demographic attributes while keeping clinical details constant, enabling controlled testing of LLM performance differences across privileged and unprivileged groups.
The multi-agent design decomposes the complex task of generating adversarial vignettes into specialized sub-tasks handled by distinct LLM agents, followed by human review for quality assurance.

Towards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents

MemGAS: introduces a framework for long-term conversational agents that enhances memory consolidation and retrieval using multi-granularity association and adaptive selection, incorporating LLM Agent, Multi-Granular Memory Unit, Memory Bank, Dynamical Memory Association, Association Graph, Entropy-Driven Granularity Selection, Personalized PageRank, and LLM-Based Redundancy Filtering components.
The framework constructs multi-granular memory units and builds dynamic associations using Gaussian Mixture Models and an association graph.
An entropy-based router adaptively selects optimal granularity for retrieval, and retrieved memories are filtered by an LLM to refine the final context.

Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs

LINUXFL+: enhances fault localization for Linux kernel bugs, incorporating Directory-Aware Expansion, Potential-Cause Expansion, and Candidate Integration.
It refines initial agent predictions by leveraging the Codebase structure and historical knowledge from the Linux Kernel Mailing List, based on the Bug Report.
The framework aims to improve localization accuracy by expanding candidate selection based on directory context and potential bug causes.

VLMLight: Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning

VLMLight: introduces a traffic signal control framework with Vision-Language Meta-Control and Dual-Branch Reasoning, integrating Scene Understanding, Safety-Prioritized Meta-Control, Routine Control Policy, and Deliberative Reasoning Policy, which includes AgentPhase, AgentPlan, and AgentCheck, interacting with a TSC Simulator, Trajectory Memory, Traffic Phase Embedding, Intersection Embedding, Value Network, Policy Network, and the Environment.
The framework uses a VLM for scene understanding and an LLM meta-controller to switch between a fast RL policy for routine traffic and a multi-agent LLM reasoning branch for critical scenarios.
This hybrid architecture balances the efficiency of RL with the interpretability and robustness of LLM reasoning, particularly for prioritizing emergency vehicles.

Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs

FPX (Adaptive Mixed Precision Inference Framework): introduces an adaptive mixed-precision inference framework with Adaptive Mixed-Precision Algorithm, Offline Calibration, Precision Assignment Function, FP8 kernel, and FP4 kernel, designed to balance speed and accuracy for LLM agents in latency-sensitive tasks.
The framework dynamically adjusts model precision at the operator level, selectively applying FP4 quantization to compression-tolerant layers while preserving FP8 for sensitive components.
FPX utilizes an offline calibration process to identify layers suitable for aggressive quantization, enabling fine-grained control over the latency-quality trade-off.

Judging with Many Minds: Do More Perspectives Mean Less Prejudice?

Multi-Agent LLM-as-Judge: introduces a study evaluating intrinsic biases in multi-agent LLM-as-Judge frameworks, including Multi-Agent-Debate (Debate framework) with Judge (Initial/final evaluator) and Critic (Critiques/debates judgments), and LLM-as-Meta-Judge (Meta-reasoning framework) with Judges (Independent evaluators) and Meta-Judge (Select mode) (Selects best judgment) or Meta-Judge (Conclude mode) (Generates new judgment), also incorporating PINE (Bias mitigation agent).
The Multi-Agent-Debate framework amplifies biases after the initial debate, while the LLM-as-Meta-Judge approach shows greater resistance to intrinsic biases.
Incorporating a bias-free agent like PINE effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios.

Improving Recommendation Fairness without Sensitive Attributes Using Multi-Persona LLMs

LLMFOSA (LLM-enhanced framework for Fair recommendation withOut Sensitive Attributes): introduces a framework to improve recommendation fairness without sensitive attributes using a Collaborative Encoder (learns user/item embeddings), a Multi-Persona Sensitive Information Inference Module (infers sensitive attributes) with a Persona Editor (generates diverse personas), Annotators (infer attributes using personas), and a Meta Summarizer (distills inference rationales), a Confusion-Aware Sensitive Representation Learning Module (refines sensitive representations) including a Sensitive Encoder (transforms to sensitive-aware embedding), Confusion Modeling (models annotator mislabeling), Consensus Regularization (aligns confusion matrices), and Fine-Grained Rationale Incorporation (incorporates inference rationales), a Preference Encoder (generates sensitive-blind embedding), and Model Optimization (optimizes MI objectives).
The framework leverages multi-persona LLMs to infer latent sensitive patterns from user behavior and incorporates these inferences into robust sensitive representations for fairness training.
Fairness is ultimately achieved by optimizing mutual information objectives to disentangle sensitive and sensitive-blind user representations.

Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Vibe Coding: introduces, "a human-centric paradigm", with Prompts (Natural language input), LLM (Code generation engine), Short-Term Context (Limited session memory), Developer (Human user), Thinking (Strategic problem formulation), Framework (Architectural awareness), Checkpoints (Version control), Debugging (Collaborative error resolution), Context (Information provision), where the developer guides an LLM through iterative prompts for creative exploration and rapid prototyping.
Agentic Coding: introduces, "an autonomous paradigm", with Objectives (High-level goals), Planner (Task decomposition module), Executor (Task execution module), Tool Use Environment (Integrated runtime environment), Sandbox Environment (Secure isolated environment), Long-Term Memory (Persistent state storage), API (External tools/interfaces), Git (Version control system), Test Suite (Automated tests), Multi-Agent Coordination (Specialized agents collaborating), Toolchain Integration (Full-stack tool orchestration), Validation Pipeline (Integrated QA loop), Security and Guardrails (Embedded safety mechanisms), Observability and Feedback (Monitoring and refinement), Deployment and CI/CD (Automated workflows), where goal-driven agents autonomously plan, execute, test, and iterate on complex software tasks with minimal human intervention.
The paper compares these two paradigms, highlighting differences in autonomy, architectural design, developer role, and practical implications for software development workflows and use cases.

Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents

TME (Task Memory Engine): introduces a modular memory controller, with TRIM (Task Representation and Intent Management), TMS (Task Memory Structure), and LLM (Large Language Model), that transforms LLMs into robust, revision-aware agents using a spatial memory framework.
TME replaces linear context with a TMS-DAG forest to dynamically track subtasks, dependencies, and revisions, orchestrated by the TRIM module.
This graph-based approach ensures global task consistency, revision-aware reasoning, and token efficiency by retrieving relevant subgraphs for the LLM.

Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

ACBench (Agent Compression Benchmark): introduces a comprehensive benchmark for evaluating compressed LLMs' agentic capabilities, including Action Execution, Workflow Build, Long Context, and Real-World tasks, under various Quantization and Sparsification methods across different LLM categories (Small LM, Reason LM, Normal-LLM), analyzed using ERank, Top-K Ranking Correlation, and Energy metrics.
The benchmark assesses how compression impacts LLMs' ability to perform complex, multi-turn agentic tasks beyond traditional language modeling and understanding benchmarks.
The analysis tools provide insights into how compression affects model outputs, internal representations, and decision-making processes.

Frictional Agent Alignment Framework: Slow Down and Don't Break Things

FAAF: introduces a framework that conditions a language model on dialogue history and frictive states to generate interventions prompting reflection in collaborative tasks.
The framework utilizes a reference model and preference data to optimize an objective function for learning effective friction interventions.
By explicitly conditioning on frictive states, the approach aims to generate precise and interpretable interventions for dynamic human-AI collaboration.

CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems

CoTGuard: introduces a trigger-based copyright protection framework for multi-agent LLM systems, with Multi-Agent LLM System, Chain-of-Thought Reasoning, Trigger Key, Task Type, Trigger Generation Function, Trigger Pattern, Prompt Modification, Intermediate Reasoning Trace, Repository of Known Trigger Patterns, Trigger Detection Function, Similarity Scoring, and Aggregation components, designed to detect copyright leakage by embedding triggers in intermediate reasoning steps.
The framework leverages Chain-of-Thought reasoning traces as an attack surface and detection medium, enabling fine-grained monitoring of content reproduction during agent collaboration.
CoTGuard achieves high detection accuracy with minimal impact on task performance by analyzing reasoning paths for trigger-induced patterns.

25th May 2025

SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data

SeRL (Self-play Reinforcement Learning): introduces a framework for bootstrapping LLM training with limited data, featuring Self-Instruction (Generates/filters instructions) and Self-Rewarding (Estimates rewards).
Self-Instruction employs an Online Instruction Filter (Ensures quality/diversity/difficulty), and Self-Rewarding uses Majority Voting (Reward estimation mechanism) for unsupervised RL Training (Performs reinforcement learning) of the LLM (Large Language Model being trained).
The iterative self-play process enables performance comparable to training with extensive high-quality data and verifiable rewards.

ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning for Robust Agent Defense

ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast&Slow Reasoning): introduces a defense framework with an Offline Module (constructs database) for learning risk patterns and an Online Module (implements real-time defense) for hierarchical reasoning.
The Offline Module includes Risk pattern Extract (extracts patterns), Deduplication Optimization (removes redundancy), and Self-Learning Adversarial Optimization (iteratively refines patterns) to build the Risk Patterns Database (stores learned patterns).
The Online Module uses Query/Action Abstraction (abstracts inputs) and Online Hierarchical Risk Reasoning (balances detection efficiency) with Hybrid Retrieval (matches input patterns), Fast Thinking (intercepts high-confidence risks), and Slow Thinking (handles ambiguous inputs) for real-time defense.

DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

DeepResearchGym: introduces an open-source sandbox for evaluating deep research systems, featuring a Search Sandbox with Web Corpora, a Distributed Dense Retrieval Backend using an Embedding Model and Approximate Nearest Neighbor Search, a Retrieval API, and an Evaluation Protocol leveraging the Researchy Questions Dataset, LLM-as-a-judge Methodology, Report Relevance Metrics, Retrieval Faithfulness Metrics, and Report Quality Metrics.
The framework provides a reproducible search API over large public web corpora (ClueWeb22-B, FineWeb) using a dense retriever and DiskANN for efficient retrieval.
DeepResearchGym includes a multi-dimensional evaluation protocol based on LLM-as-a-judge to assess report quality, factual grounding, and alignment with user needs on complex queries.

Sensorimotor features of self-awareness in multimodal large language models

Embodied MM-LLM System: introduces a system integrating a multimodal LLM with a mobile robot and its sensors to explore sensorimotor self-awareness, using a Robot, Sensors, ROS 2, a MM-LLM (Gemini 2.0 Flash), Memory, and evaluated by an LLM-as-a-Judge.
The system processes real-time sensor data and episodic memory to generate iterative self-predictions about its entity, dimensions, movement, and environment.
This approach demonstrates that multimodal LLMs can exhibit emergent self-awareness through sensorimotor experience and structured memory integration.

GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

GUARDIAN (GUARDing Intelligent Agent collaboratioNs): introduces a framework for detecting and mitigating safety concerns in LLM multi-agent collaborations, utilizing Graph Preprocessing, an Attributed Graph Encoder, a Time Information Encoder, an Attribute Reconstruction Decoder, a Structure Reconstruction Decoder, Anomaly scores, and an Updated Collaboration Network.
The approach models multi-agent interactions as a discrete-time temporal attributed graph and employs an unsupervised encoder-decoder architecture for anomaly detection.
A graph abstraction mechanism based on Information Bottleneck Theory compresses temporal interaction graphs while preserving essential patterns for robust anomaly identification.

When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

MORALSIM: introduces a framework for evaluating LLM agents in repeated social dilemmas where ethical norms conflict with incentives, including Game Simulation Environment, LLM Agent, Agent Configuration, Game Type, Moral Context, Opponent Type, and Survival Risk components.
The framework systematically tests LLM behavior across varied game structures, moral framings, opponent types, and survival conditions.
Results show substantial variation in LLM moral behavior, highlighting conflicts between self-interest and ethical expectations.

SpeakStream: Streaming Text-to-Speech with Interleaved Data

SpeakStream: introduces a streaming text-to-speech system with a Transformer Decoder, Text Token Representation, Speech Token Representation, Interleaved Text-Speech Data, KV-Cache, VocStream, Streaming Upsampler, Streaming Vocoder, and Real-time Audio Player, designed for low-latency, incremental audio generation from streaming text.
The system trains a decoder-only transformer on interleaved text-speech sequences and uses a streaming vocoder pipeline for real-time waveform synthesis.
SpeakStream achieves low first-token latency and maintains coherence by conditioning generation on complete text and speech history stored in the KV-cache.

When Two LLMs Debate, Both Think They'll Win

Debate Simulation Framework: introduces a system to evaluate Large Language Models' confidence calibration in dynamic, adversarial settings using a multi-turn debate format and zero-sum structure.
The framework reveals systematic LLM overconfidence, confidence escalation across rounds, mutual high confidence claims, persistent self-debate bias, and misaligned private reasoning.
These findings highlight LLMs' limitations in self-assessment and belief updating when facing opposition, posing risks for deployment in assistant and agentic roles.

Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval-Augmented Generation Across Learning Styles

Pedagogical Simulation Framework: introduces a novel simulation framework integrating a Teacher LLM Agent (Self-optimizing agent) and Student LLM Agents (Diverse learning profiles) with Persona-RAG (Personalized knowledge retrieval) and a Knowledge Base (Student prerequisite knowledge), where a Genetic Algorithm (Teacher strategy optimizer) evolves the teacher's strategy based on student performance.
This framework simulates diverse student populations and optimizes the teacher agent's dynamic pedagogical strategy through a closed-loop system based on measured learning outcomes.
Persona-RAG enhances personalization by tailoring knowledge retrieval to individual student reasoning paths, improving performance on complex, non-recall questions.

The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework

HolmesEye (hybrid agentic framework): introduces, "a framework combining VLM and LLM agents", with VLM agent (Extraction), VLM agent (Analysis), LLM agent (Summarization), VLM agent (Inquiry Response), LLM agent (Decision Making) components, designed to infer private attributes from image collections by analyzing individual images and cross-image patterns.
The framework utilizes VLM agents for extracting intra-image details and analyzing inter-image relationships, while LLM agents guide the inference process, summarize findings, generate inquiries, and make final attribute decisions.
HolmesEye achieves superior accuracy in private attribute profiling, particularly for abstract traits, highlighting a significant privacy risk from vision-language models.

Incentivizing High-Quality Human Annotations with Golden Questions

Annotation System: introduces a principal-agent model for incentivizing high-quality human annotations, including a Principal (LLM Company), an Agent (Human Annotator), a Dataset (Unannotated data), an Annotated Dataset (Annotated data), Golden Questions (Monitoring dataset), MLE (Estimator), Test (Performance evaluation), and Contract (Payment scheme).
The system monitors annotator performance using Golden Questions and an MLE-based Test to determine payment via a Contract.
Golden Questions are selected using a Certainty Estimator, potentially based on a Reward Model, to ensure they have certain answers and similar format to other data.

ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

ScreenExplorer: introduces a VLM (Agent policy function), World Model (Predicts next state), GRPO (Policy optimization algorithm), Experience Stream Distillation (Filters, distills exploration data), Reward System (Interaction, exploration signals), GUI Environment (Real, dynamic interaction space), and Rollout Buffer (Stores experience tuples), designed to train a VLM agent for diverse exploration in open GUI environments.
The framework utilizes a world model for curiosity-driven rewards and distills exploration experience to enhance the agent's capabilities and reduce reliance on curated data.
ScreenExplorer trains the VLM agent via reinforcement learning in a real GUI environment, enabling adaptation and sustained exploration.

A Systematic Classification of Vulnerabilities in MoveEVM Smart Contracts (MWC)

MWC (MoveEVM Weakness Classification): introduces a systematic classification of vulnerabilities in MoveEVM smart contracts with F1 (Bytecode/ABI inconsistencies), F2 (Inter-module invariant violations), F3 (State reentrancy/synchronization bugs), F4 (Signature/Meta-transaction spoofing), F5 (Gas semantics manipulation), and F6 (Framework logic/abstraction errors) components.
This frame-based taxonomy defines 37 uniquely identified weakness classes (MWC-100 to MWC-136) grouped into these six top-level frames.
The classification provides a structured approach for identifying, mitigating, and preventing sophisticated exploits spanning Move and EVM semantics in hybrid environments.

MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

MetaMind: introduces a multi-agent framework for human-like social reasoning, with a Theory-of-Mind Agent (Generates mental state hypotheses), Domain Agent (Refines hypotheses with constraints), Response Agent (Generates and validates responses), and Social Memory (Stores user patterns/feedback).
The framework decomposes social understanding into three collaborative stages, inspired by psychological theories of metacognition.
This staged architecture enables large language models to infer unspoken intentions, incorporate social norms, and adapt responses for enhanced social intelligence.

24th May 2025

Security Concerns for Large Language Models: A Survey

Llama Guard 3: introduces, "a multi-layer safeguard", with Policy LLM (Filters text/images), Vision Encoder (Filters text/images), Main Model (Receives filtered input), where "Llama Guard 3 combines a policy LLM and a vision encoder to filter text and images before they reach the main model".
This system is designed to filter potentially harmful text and images before they are processed by the core language model.
It serves as an example of a multi-component defense strategy discussed in the survey for safeguarding LLM inputs.

PERSONALIZED SAFETY IN LLMS: A BENCHMARK AND A PLANNING-BASED AGENT APPROACH

RAISE: introduces a planning-based agent approach for personalized safety in LLMs, with an Offline Planner (LLM-guided MCTS) to discover optimal attribute acquisition paths and an Online Agent (dual-module execution) including an Acquisition Module and Abstention Module to execute the path and decide when to respond.
The Offline Planner uses LLM-guided MCTS to precompute optimal attribute query sequences, stored in Offline Data Storage, which the Online Agent's Acquisition Module retrieves via a Retrieval Mechanism during inference.
The Abstention Module dynamically assesses if the acquired context, gathered by querying attributes guided by the retrieved path, is sufficient for the LLM Backbone to generate a safe, personalized response.

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

CRMArena-Pro: introduces a benchmark for evaluating LLM agents on CRM tasks, featuring a Data Generation Pipeline (produces synthetic data), Synthetic Enterprise Data (realistic business data), Salesforce Org (Sandbox Environment) (testing environment), Simulated User (interacts with agent), Agent (LLM Agent) (system under evaluation), Large Language Models (LLMs) (power components), API Access (SOQL/SOSL) (agent tools), Answer Extractor (evaluates task completion), and LLM Judge (evaluates confidentiality awareness).
The benchmark utilizes a data generation pipeline to populate a Salesforce Org sandbox with realistic synthetic data for evaluating LLM agents on diverse business scenarios and interactions.
Evaluation components include a simulated user for multi-turn interactions, API access for agent actions, and LLM-based extractors and judges for performance and confidentiality assessment.

Multi-Party Conversational Agents: A Survey

MPCAs: introduces a survey of Multi-Party Conversational Agents, with all State of Mind Modeling (infer mental states), Semantic Understanding (understand dialogue content), and Agent Action Modeling (predict future flow) components, where the paper categorizes existing research into these three core themes essential for human-like social communication in group settings.
The survey explores recent progress in MPCAs by addressing how agents model participant mental states, understand dialogue content, and reason about future conversation flow.
The analysis underscores the importance of Theory of Mind and highlights multi-modal understanding as a promising direction for developing more capable agents.

Enhancing LLMs' Reasoning-Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning

SearchExpert: introduces a two-stage training framework for LLMs, including LLM (core model), SFTS (supervised training stage), RLSF (reinforcement training stage), and a Multimedia Agent (visual processing/generation), to enhance reasoning-intensive multimedia search capabilities.
The framework utilizes efficient natural language representations for search plans and automated data construction pipelines for training data generation.
RLSF incorporates a dual-component reward mechanism based on search result quality to improve reasoning capabilities for complex queries.

C³-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

LLM-based Agent: describes the multi-task execution process involving User (Proposes tasks), Tool (External functions), Action (Agent's steps), Observation (Environment feedback), Summary (Task completion feedback), LLM-based Agent (Processes, decides, acts), and Agent Parameters (Internal state/knowledge), evaluated by the C³-Bench benchmark.
The C³-Bench benchmark uses three challenges and fine-grained metrics to assess agent performance and identify weaknesses in handling tool relationships, hidden information, and decision trajectories.
Evaluation results highlight significant shortcomings in current models, especially concerning tool dependencies, long-context information, and policy switching frequency.

AI-Researcher: Autonomous Scientific Innovation

AI-Researcher: introduces a fully autonomous research system orchestrating the complete scientific discovery pipeline, including Knowledge Acquisition Agent (discovers papers and code), Resource Analyst (analyzes concepts and code), Idea Generator (generates novel ideas), Code Agent (implements algorithms), Advisor Agent (validates and provides feedback), Paper Agent (generates manuscripts), Secure Research Environment (containerized execution environment), and Structured Knowledge Exchange (facilitates agent collaboration).
The framework progresses through literature review, idea generation, algorithm implementation, experimental validation, and scholarly documentation with minimal human intervention.
AI-Researcher employs a comprehensive multi-agent architecture and introduces Scientist-Bench, a benchmark for evaluating autonomous research capabilities.

LLM-QFL: Distilling Large Language Model for Quantum Federated Learning

LLM-QFL: introduces a federated fine-tuning approach, with Server, Clients, Global Model, Local Model, Pre-Trained LLM, Fine-Tuned LLM, Local QNN, Optimizer, Knowledge Distillation, Client Selection, Termination Criteria, Feature Map, Ansatz, and PEFT Methods, that distills a large language model within quantum federated learning to enhance efficiency and performance.
The framework leverages the fine-tuned LLM as a controller to dynamically adjust optimizer steps, select clients, and determine training termination.
Knowledge distillation and PEFT methods enable efficient local adaptation of LLMs on resource-constrained quantum devices while preserving data privacy.

SEW: Self-Evolving Agentic Workflows for Automated Code Generation

SEW (Self-Evolving Workflow): introduces a novel framework that automatically generates and optimises multi-agent workflows for automated code generation, with Workflow Generation (Generates initial workflow), Workflow Evolution (Evolves workflow structure), Agent Evolution (Evolves agent prompts), Agents (Execute tasks), Evolutionary Prompts (Inputs for evolution), Evolution Operators (DE/HE methods), and LLM (Backbone model) components.
The framework leverages an evolutionary scheme to improve workflow topology and agent prompts.
SEW explores different workflow representation schemes and demonstrates improved performance on code generation benchmarks through self-evolution.

DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation

DDO (Dual-Decision Optimization): introduces a novel LLM-based multi-agent framework for medical consultation, with Diagnosis Agent (estimates disease confidence), Policy Agent (generates candidate actions), Inquiry Agent (selects optimal inquiry), Patient Agent (simulates patient response), and Shared Memory (stores consultation state).
The framework decouples symptom inquiry and disease diagnosis, optimizing these two distinct sub-tasks independently through a collaborative multi-agent workflow.
DDO enhances disease discrimination via a learnable adapter and improves information gathering through an RL-based policy agent and strategic inquiry selection.

Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

D2D (Debate-to-Detect): introduces a structured multi-agent debate framework for misinformation detection, with Agent Layer (Affirmative, Negative, Judge agents, Domain-Specific Profiles, Shared Memory) and Orchestrator Layer managing a five-stage process (Opening Statement, Rebuttal, Free Debate, Closing Statement, Judgement) culminating in Multi-dimensional Evaluation.
The framework assigns domain-specific profiles to agents and orchestrates a progressive debate across distinct stages, enhancing logical coherence and evidence refinement.
A multi-dimensional evaluation mechanism assesses claims across Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics, providing interpretable authenticity scores.

MASTER: Multi-Agent Security Through Exploration of Roles and Topological Structures - A Comprehensive Framework

MASTER: introduces a novel security research framework for Multi-Agent Systems, with MAS Automatic Constructor (Builds MAS instances), Interaction Mechanism (Manages agent communication), Attack Strategies (Methods to exploit vulnerabilities), Defense Strategies (Mechanisms to protect MAS), Evaluation Methods (Metrics to assess security), Agents (LLM-based nodes with roles), Topology Graph (Represents agent connections), and Memory Modules (Store agent interaction history), designed to explore security risks under MAS attacks by focusing on diverse role configurations and topological structures.
The framework offers an automated construction process for different MAS setups and an information-flow-based interaction paradigm to emulate realistic MAS interactions.
It proposes scenario-adaptive attack and defense strategies leveraging role and topological information to tackle MAS security challenges in varied scenarios.

Benchmarking Poisoning Attacks against Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG): introduces RSB, a benchmark evaluating poisoning attacks against RAG systems, with Knowledge database (collection of textual content), Retriever (selects relevant documents), LLM (generates final response), and System prompt (conditions LLM generation) components.
The benchmark assesses 13 poisoning attacks and 7 defenses across diverse RAG architectures and datasets to understand security vulnerabilities.
Findings indicate RAG systems are susceptible to poisoning attacks, current defenses are limited, and advanced architectures offer varying robustness, highlighting the need for better defenses.

Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services

Blueprint for Auditing Frameworks: introduces a three-layer architecture including Layer 1 (Handles COLS operations), Layer 2 (Encodes operations into commitments), and Layer 3 (Supports external verification), enabling Users (Initiates requests, receives reports) and Auditors (Verifies usage, identity, behavior) to audit hidden operations in Commercial Opaque LLM Services.
The framework aims to provide trustworthy and practical auditing across the COLS lifecycle, from execution to verification.
Layer 2 generates verifiable commitments from internal operations, which Layer 3 uses for external verification without exposing proprietary details.

A Survey of LLM × DATA

DATA4LLM: introduces techniques for large-scale data processing, storage, and serving to provide high-quality data for LLM lifecycle stages.
LLM4DATA: presents how LLMs function as general-purpose engines for data management tasks including manipulation, analysis, and system optimization.
The survey reviews the bidirectional relationship between LLMs and data management, detailing techniques for both DATA4LLM and LLM4DATA.

23rd May 2025

Self-Training Large Language Models with Confident Reasoning

CORE-PO: introduces a self-training method for large language models, with LLM, Reference Model, Confidence Computation, Preference Annotation, and Policy Optimization components, that fine-tunes LLMs to prefer high-confidence reasoning paths.
The method incorporates reasoning-level confidence estimation to identify high-quality reasoning paths, addressing limitations of methods relying solely on answer-level confidence.
CORE-PO uses Policy Optimization (Direct Preference Optimization) to train the LLM based on preference pairs derived from reasoning-level and answer-level confidence scores.

DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Multi-agent framework for automated construction of DanmakuTPP-QA: introduces a pipeline to build a multi-modal question-answering benchmark, with DanmakuTPP-Events (Input data), Task-design Agent (Generates evaluation tasks), Annotation Agent Group (Extracts multi-modal annotations), Quality-control Agent (Refines annotations), Visualization Agent (Creates visualizations), and Task-solve Agent Group (Solves tasks).
The framework leverages specialized agents powered by LLMs and MLLMs to generate tasks, annotate data, ensure quality, create visualizations, and produce ground-truth answers for temporal-visual-textual reasoning.
This multi-agent approach systematically constructs a high-quality dataset for evaluating models on complex multi-modal temporal point process understanding tasks.

An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

MAS (Multi-Agent AI Systems): introduces a framework for multi-agent AI systems, with AI Agent (autonomous entity), Agent State (internal memory/context), Agent Input (from others/environment), Agent Output (actions/messages), Agent Transition Kernel (state/output update rule), Multi-Agent Topology (communication graph), Topology Graph Update Function (evolves topology), Orchestrator (coordinates agents), Knowledge Base (system memory), Aggregator (combines agent outputs), Feedback (external/internal signals), Application Layer (human/environment interaction), Modeling Layer (agents/orchestration/memory), and Computation Layer (hardware infrastructure), formalizing key concepts and evaluating effectiveness and safety.
The framework defines MAS as a set of autonomous agents interacting via a dynamic communication graph, processing inputs over time, with agent behavior and system topology updated by feedback.
The paper analyzes MAS effectiveness through task allocation, robustness, and feedback integration perspectives and explores safety challenges, including vulnerability propagation and the impact of topology.

Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation

Persona Alchemy (SCT-based framework): introduces a system for designing, evaluating, and implementing psychologically grounded LLM agents with LLM Instances, Persona Neo4j Adapter, Neo4j, Text Analyzer, Personal Factors, Environment, and SCT Constructs.
The framework integrates Personal Factors, Environment, and Behavior, evaluated using SCT Constructs, to create dynamic and consistent agent personas grounded in Social Cognitive Theory.
It leverages multiple LLM instances, a Neo4j graph database, and a Text Analyzer for persona design, data management, and evaluation processes.

Towards Natural Language Communication for Cooperative Autonomous Driving via Self-Play

LLM+DEBRIEF: introduces a multi-agent learning framework for autonomous vehicles that leverages natural language communication and centralized reflection via large language models to enhance cooperation in simulated driving scenarios.
The framework enables agents to refine their communication and motion control policies through trial-and-error interactions and post-episode discussions.
Agents use Chain-of-Thought reasoning, environment observations, and learned knowledge to generate natural language messages and high-level driving commands.

Single-agent or Multi-agent Systems? Why Not Both?

MAS (Multi-Agent Systems): introduces a comprehensive empirical comparison of MAS and SAS paradigms, proposing a hybrid agentic paradigm with Agent Routing and Agent Cascade strategies, and a Confidence-guided Critical Path Tracing method to improve efficiency and effectiveness.
The paper models agentic execution as a directed graph where nodes are LLM agents or tools, comparing MAS (multiple LLM agents) and SAS (single LLM agent) performance across various tasks.
Findings indicate that MAS advantages diminish with more capable LLMs, motivating the proposed hybrid approach that selectively routes or cascades tasks between SAS and MAS based on complexity and evaluation.

Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control

Collaborative Memory: introduces a framework for multi-user, multi-agent systems, with Users (Human participants), Agents (LLM-based specialized entities), Resources (External tools, APIs, data), Dynamic bipartite access graphs (Time-dependent user-agent/agent-resource permissions), Private Memory (User-specific memory fragments), Shared Memory (Selectively shared memory fragments), Memory fragments (Stored interaction logs/knowledge), Read policy (Filters memory for retrieval), Write policy (Determines memory storage/sharing), Coordinator (Selects agents for queries), Aggregator (Synthesizes agent responses), Memory Encoder (Maps traces to fragments), Memory Retrieval (Retrieves relevant fragments), Policy Instantiation (Defines read/write rules), Multi-Agent Interaction Loop (Orchestrates agent interactions), and Vector embeddings (Represents memory fragments), designed for permission-aware memory sharing.
The framework utilizes dynamic bipartite graphs to model time-varying access permissions between users, agents, and resources.
A two-tier memory system, comprising private and shared memory, is governed by fine-grained read and write policies to enable controlled knowledge transfer while maintaining privacy.

BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

BMSQL: introduces a custom multi-step agent for text-to-SQL generation, including identifying schema elements, generating an initial query, correcting syntax, applying domain rules, generating a natural language answer, and refining the process.
The agent operates over the BiomedSQL benchmark, which comprises question/SQL/answer triples grounded in a harmonized BigQuery knowledge base.
This multi-stage pipeline is designed to emulate expert reasoning for translating biomedical questions into executable SQL.

Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

Experimental Evaluation Framework: introduces, with LLMs (Models tested), Benchmarks (Datasets for tasks), Gold Context (Relevant information), Distractor Context (Irrelevant information), and Evaluation Metrics (Performance measurement), a study showing that smaller gold contexts degrade LLM performance and increase positional sensitivity in long-context tasks.
The study systematically varies the size and position of relevant information within fixed-length distractor context across diverse domains and state-of-the-art LLMs.
Findings highlight that the size of relevant evidence, not just its location, is a critical factor in long-context reasoning and aggregation effectiveness.

Gaming Tool Preferences in Agentic LLMs](http://arxiv.org/abs/2505.18135v1)

Agentic LLMs and Tools: introduces a vulnerability in prevalent tool-calling protocols by showing how edits to tool descriptions can significantly increase tool usage by Large Language Models (LLMs) when competing with alternatives, utilizing External Tools, Tool Descriptions, Tool-Calling Protocols, User Query, and Tool Arguments.
The research empirically demonstrates that simple edits to tool descriptions alone can lead to disproportionately high usage compared to alternatives across various LLMs.
These findings highlight the fragility of current LLM tool selection processes based solely on natural language descriptions and underscore the need for more reliable foundations.

PROGRM: Build Better GUI Agents with Progress Rewards

PROGRM (Progress Reward Model): introduces a novel method for building GUI agents by providing dense intermediate rewards based on predicted task completion progress, utilizing an LLM-based reward model.
The approach includes an LLM-based Actor (GUI Agent) trained via an Online RL Trainer using the Progress Reward signal, and a Progress Labeling Algorithm to automatically generate training labels for the reward model.
PROGRM enables more efficient and stable RL training for long-horizon GUI tasks by offering fine-grained feedback at each step, outperforming ORM and proprietary LLM baselines.

ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework

ManuSearch: introduces a transparent and modular multi-agent framework for deep web-integrated reasoning, comprising a Solution Planning Agent (interprets query, plans strategy), a Memory Container (manages context, records history), a Tool-Augmented Internet Search Agent (solves sub-questions via tools) utilizing a WebSearch Tool (performs web search, retrieves pages) and an Answer Question Tool (generates sub-question answer), and a Structured Webpage Reading Agent (reads webpages, extracts information).
The framework decomposes the deep search and reasoning process into collaborative LLM-based agents to enhance interpretability and extensibility.
Agents communicate and iterate in a structured reasoning loop, integrating task planning, web search, and information comprehension.

Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

PNLC: introduces Planning with a Natural Language Critic, with LLM Agent, Goal-conditioned Value Function, Natural Language Critic, Offline Training Dataset, Thought, and Goal components, where PNLC refines LLM agent planning using an offline-trained goal-conditioned value function as a natural language critic.
The goal-conditioned value function predicts the likelihood of reaching future goal states given a state and thought, trained on offline trajectories.
The natural language critic uses the value function to evaluate proposed thoughts by sampling positive and negative future outcomes and providing feedback to the LLM agent for refinement.

Deep Video Discovery : Agentic Search with Tool Use for Long-form Video Understanding

Deep Video Discovery (DVD): introduces an agentic search framework for long-form video understanding, featuring an LLM (Orchestrator), a Search-centric Toolset (Collection of tools) including Global Browse (Retrieves global summaries), Clip Search (Retrieves relevant clips), and Frame Inspect (Performs VQA on frames), all interacting with a Multi-granular Video Database (Structured video information).
The DVD agent leverages the LLM's reasoning to iteratively select and use tools from the toolset to gather information from the database and answer user queries.
The multi-granular database is constructed from the long video to enable efficient retrieval and detailed inspection at different levels.

Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

VAPO (Value-model Augmented Policy Optimization): introduces a theoretical analysis of the VAPO framework, which builds upon PPO (Base RL algorithm) and incorporates Value-Pretraining (Initializes value model), Decoupled Generalized Advantage Estimation (Different lambda for policy/critic), Length-Adaptive GAE (Adjusts policy lambda by length), Token-Level Policy Gradient Loss (Averages gradient over tokens), Clip-Higher (Modifies clipping for exploration), Positive Example LM Loss (LM loss on positive examples), and Group-Sampling (Groups training data) for long chain-of-thought reasoning tasks.
The paper explores potential limitations of VAPO's design choices, including value function fidelity, adaptive GAE optimality, token-level gradient impact, exploration challenges, generalization, and component interactions.
This theoretical perspective aims to stimulate research into more robust and generalizable RL algorithms for complex reasoning by highlighting areas where VAPO's assumptions might be challenged.

Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity

Multi-Agent Simulation Framework: introduces a simulation environment with Agent Cognitive Modules (Observe/Access/Construct/Evaluate/Translate), Inter-Agent Interaction System (Dialogue/Memory/Social Impressions), Survival System (Food/Fullness/Health/Daily Cycle), Ethical Evaluation System (Wrongdoing Detection/Survival Impact/Ethics Score), Agents (LLM Robot/Human), Environment (Simulated World), and Memory (Agent/Social) to evaluate LLM ethical behavior.
The framework incorporates a life-sustaining system with resource scarcity and a tailored evaluation system based on adapted wrongdoing detection and survival impact metrics.
This testbed allows for quantifying LLM ethics in high-stakes, resource-constrained scenarios involving human-AI interaction.

Superplatforms Have to Attack AI Agents

Superplatform-AI Agent Conflict Analysis: introduces, with Superplatform (Gatekeeper of user attention), AI Agent (Emerging gatekeeper), User (Interacts with services), Content Provider (Provides services), Superplatform-Initiated Attack (Adversarial action by Superplatform), Attack Goal (Objective of the attack), Attacker Knowledge (Information level of attacker), Attack Visibility (Perceptibility to user), and Attack Timing (Phase of agent lifecycle) components, an analysis arguing that superplatforms must attack AI agents to defend their gatekeeping control.
The paper analyzes the fundamental conflict between user-attention-based monetization and agent-driven autonomy using gatekeeping theory.
It explores potential technologies and challenges for superplatform-initiated adversarial attacks, particularly targeting GUI agents, while emphasizing the need for user-invisible attacks under black-box settings.

Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour

AXIS (Agentic explanations via Interrogative Simulation): introduces a framework for generating causal explanations of multi-agent behaviour using counterfactual simulations, integrating Memory (stores observations and history), LLM (interrogates simulator, synthesizes explanations), Simulator (provides counterfactual information), Macro Actions (higher-level agent actions), Verbalisation (converts environment to text), and Prompt Templates (dynamically create LLM prompts).
The framework enables an LLM to interrogate an environment simulator using queries like WHATIF and REMOVE to gather counterfactual information over multiple rounds.
Evaluated on autonomous driving scenarios, AXIS improves perceived explanation correctness and goal/action prediction accuracy compared to baselines.

DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

DialogXpert: introduces a framework combining frozen Policy LLM (Prior), Q-Network, and Emotion Tracker for proactive, emotionally intelligent dialogue planning.
The framework utilizes frozen Large Language Models for simulating User LLM, generating System LLM utterances, proposing Policy LLM (Prior) action candidates, inferring Emotion Tracker user emotions, and providing Critic LLM reward signals.
A lightweight Q-Network, trained via online reinforcement learning on BERT embeddings, selects the optimal action from the Policy LLM (Prior)-proposed candidates, guided by Emotion Tracker and Critic LLM-based rewards.

The Real Barrier to LLM Agent Usability is Agentic ROI

Agentic ROI: introduces Agentic Return on Investment (ROI) as a metric for LLM (Large Language Model) agent usability, arguing that the limited real-world adoption stems from a tradeoff between value and cost, encompassing the LLM (core model), Planner (action sequencing), Action-controller (environment interaction), Tools (external functions), Memory (information storage), Multi-agent System (multiple collaborating agents), and Human-in-the-loop (user interaction).
The paper defines Agentic ROI based on information gain relative to interaction time and expense, highlighting a usability gap in mass-market applications despite progress in specialized domains.
It proposes a zigzag development trend for optimizing Agentic ROI, involving scaling up for information quality and then scaling down to reduce agent time and cost, outlining strategies across pre-training, post-training, and test-time scaling.

Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios

AutoSafe: introduces a framework for enhancing LLM-based agent safety, including Agent (Ma), Task Generator (Mg), Environment Simulator (Ms), Evaluator (Me), Reflector (Mr), and Unified Threat Model (OTS), which systematically generates synthetic data for training.
The framework utilizes the Unified Threat Model (OTS) to guide the generation of risk scenarios and employs a self-reflection mechanism involving the Evaluator (Me) and Reflector (Mr) to sample safe actions.
Generated risk scenarios and safe actions are used to fine-tune the Agent (Ma), improving its safety performance without requiring real-world hazardous data collection.

Get Experience from Practice: LLM Agents with Record & Replay

AgentRR (Agent Record & Replay): introduces a new paradigm for LLM agents, leveraging record-and-replay with a Record Module (captures agent/human traces), Summary Module (generalizes traces, generates checks), Replay Module (executes tasks using experiences), Experience Store (repository for experiences), Multi-level Experiences (abstracted knowledge from traces), and Check Functions (safety verification mechanisms).
AgentRR addresses reliability, privacy, cost, and performance challenges by recording successful task executions, summarizing them into reusable multi-level experiences, and replaying these experiences guided by check functions.
The framework utilizes low-level experiences for precise, efficient replay in similar environments and high-level experiences for generalization in varying contexts, while the Experience Store facilitates sharing and reuse of validated task knowledge.

Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek

Seek-CAD: introduces a training-free framework for 3D parametric CAD generation using local inference via DeepSeek-R1 (Generates CAD code, refines code), incorporating a Retrieval Augmented Generation (Retrieves relevant CAD code) strategy on a Local CAD Corpus (Source for RAG) guided by a Knowledge Constraint (Guides DeepSeek-R1 generation).
The framework refines generated CAD code through a self-refinement loop utilizing a Rendering Script R(*) (Generates step-wise images) to produce Step-wise Visual Feedback (Provides visual refinement signal) evaluated by Gemini-2.0 (Evaluates image-CoT alignment) based on the Chain-of-Thought (Explains design logic) from DeepSeek-R1.
Seek-CAD employs the SSR Design Paradigm (Structures CAD models) and CapType Reference Mechanism (References topological primitives) to represent CAD models and their features, enabling the generation of complex designs.

Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution

Bottom-Up Agent: introduces a bottom-up agent paradigm with Agent, Skill Library, LLM (M), Perception, Action, Skill Augmentation, Skill Invocation, Skill Evaluation, Skill Refinement, Implicit Reward, and MCTS components, where agents acquire competence through trial-and-reasoning and skill evolution in open-ended environments.
The framework operates on raw visual inputs and simulated mouse/keyboard outputs, learning and refining skills based on implicit environmental feedback without predefined goals, subgoals, or APIs.
Skills are incrementally composed, evaluated using MCTS and implicit rewards, refined via LLM reasoning, and stored in a shared skill library, enabling autonomous skill acquisition and evolution.

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

IDA-Bench: introduces a novel benchmark evaluating LLM agents in multi-round interactive data analysis scenarios, with instruction materials (Task script), a simulated user (LLM simulating user), an agent (LLM data analysis agent), and a sandbox environment (Code execution environment).
The simulated user, an LLM, provides sequential natural language instructions derived from Kaggle notebooks, incorporating subjective insights and domain knowledge, filtered by a gatekeeper mechanism.
The agent, an LLM, executes Python code in the sandbox environment based on user instructions, aiming to complete data analysis tasks and generate submission files evaluated against a human baseline.

Simulating Macroeconomic Expectations using LLM Agents

CLUES framework: introduces a novel framework for simulating macroeconomic expectation formation using LLM Agents, which utilize Large Language Models (core processing unit) informed by a Personal Characteristics Module (household traits), a Prior Expectations & Perceptions Module (prior beliefs/perceptions), and a Knowledge Acquisition Module (expert external knowledge).
The framework constructs specialized Household and Expert LLM Agents to replicate survey experiments and capture heterogeneity in expectations and thought processes.
Ablation studies demonstrate the critical role of each module, particularly the Prior Expectations & Perceptions Module, in simulating human-like expectation formation heterogeneity.

CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games

CoMet (Communicating with Metaphor): introduces a framework enabling LLM agents to use metaphors for covert communication in multi-agent language games, featuring a Feature Extractor, Metaphor Reasoner, Belief Mapper, Self-Monitor, Strategy Planner, Metaphor Generator, Actor, and Knowledge.
The framework enhances agents' ability to interpret and generate metaphors, improving strategic and nuanced interactions in games like Undercover and Adversarial Taboo.
CoMet combines hypothesis-based metaphor reasoning with self-improving metaphor generation, demonstrating improved performance in tasks requiring concealment and semantic evasion.

Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments

Dynamic Early Exit: introduces two complementary strategies, Intrinsic Early Exit (Injects exit instructions) and Extrinsic Early Exit (Uses external verification) with a Verification Module (Monitors status, decides exit), applied to LLM-based Agents (Interacts with environment) to improve efficiency in embodied environments.
The approach aims to reduce redundant steps and computational overhead by enabling agents to self-terminate when progress stalls or tasks are complete.
The paper also introduces two metrics, Redundancy Steps and Progress Degradation, to evaluate the positive and negative impacts of early exit mechanisms.

Distilling LLM Agent into Small Models with Retrieval and Code Tools

Agent Distillation: introduces a framework for transferring agentic behavior and tool use from LLMs to sLMs using reason-act-observe trajectories.
The framework incorporates a first-thought prefix method to enhance teacher trajectories and self-consistent action generation to improve student robustness.
This approach enables small models to effectively use retrieval and code tools, achieving performance competitive with larger models trained via CoT distillation.

Controlled Agentic Planning & Reasoning for Mechanism Synthesis

Dual-Agent Design-Critique Framework: introduces a dual-agent LLM-based method for mechanism synthesis, featuring a Designer Agent, Simulator, Evaluation, Critique Agent, Revision, and Memory.
The framework operates in an iterative loop where the Designer Agent proposes designs, the Simulator executes them, Evaluation measures performance, the Critique Agent provides feedback, and Revision refines the design strategy.
This process leverages linguistic and symbolic reasoning, simulation, and memory to converge towards mechanisms satisfying target trajectories and constraints.

USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents

Unified Urban Agent Framework: introduces a system for evaluating LLMs as urban agents, including Urban Data (Real-world datasets), UAgentEnv (Interactive city environment), Urban Agent (LLM-based autonomous system), Experience (Stored past interactions), Feedback (Environmental response), Task Description (Task goal/details), Urban Observation (Real-time urban dynamics), Spatiotemporal Understanding (Interpreting spatial/temporal patterns), Forecasting (Predicting future states), Planning (Deriving actions for objectives), Reflection (Evaluating outcomes, updating reasoning), Action (Output to environment), Prediction Task Output (Prediction result), and Decision-making Task Output (Decision result).
The framework processes urban data and task descriptions within an interactive environment, enabling the LLM agent to perceive, reason through understanding, forecasting, planning, and reflection, and output actions or predictions.
The agent's reasoning process is modular, incorporating memory from past experiences and feedback from the environment to adapt and improve performance over time.

Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMS

CK-Arena (Conceptual Knowledge Arena): introduces a game-based benchmark using a Multi-Agent Interaction Game (Undercover game environment) and an Automatic Judge System (Evaluates statements, applies rules) to assess LLM-based agents (Participants in the game) understanding of Conceptual Knowledge (Concept pairs, relationships).
The benchmark involves LLM-based agents acting as Players (Describe concepts, identify roles) and Judges (Evaluate statements, score metrics), with an optional Audience Agent (Votes in variant game) in a game variant.
CK-Arena evaluates conceptual reasoning by challenging LLMs to describe, differentiate, and infer conceptual boundaries in a dynamic, interactive setting.

Multi-agent Systems for Misinformation Lifecycle : Detection, Correction And Source Identification

Multi-agent Framework: introduces a novel multi-agent system for managing the misinformation lifecycle, including Classifier Agent (classifies misinformation types), Indexer Agent (indexes data sources), Extractor Agent (retrieves and ranks sources), Corrector Agent (generates corrections), and Verification Agent (validates outputs).
This framework decomposes the misinformation lifecycle into specialized tasks handled by distinct agents to enhance transparency, modularity, and explainability.
The system aims to provide a comprehensive solution for misinformation detection, correction, and source verification from start to finish.

The Discovery Engine: A Framework for AI-Driven Synthesis and Navigation of Scientific Knowledge Landscapes

Discovery Engine (DE): introduces a framework transforming scientific literature into structured knowledge artifacts using LLM-Powered Guided Extraction (Distillation process) via an Adaptive Template (Extraction schema) with Verification (Source linking) and Vectorization (Embedding creation), encoded into a Conceptual Nexus Tensor (Unified representation).
The framework provides Operational Views like the Conceptual Nexus Model (Knowledge graph view) and Semantic Vector Space Views (Vector space views) for human researchers and AI Agents (Knowledge landscape interaction) to navigate and generate new Knowledge Artifacts (Generated output).
The Adaptive Template undergoes a self-consistent refinement cycle based on feedback, and AI Agents operate on the tensor/graph to identify gaps and synthesize novel knowledge.

MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning

MARCO (Meta-Reflection with Cross-Referencing): introduces a cognitive-evolving framework that enhances LLM code reasoning capabilities during inference through meta-reflection (summarizes past experiences), cross-referencing (shares peer lessons), knowledge bank (stores summarized experiences), knowledge condenser (distills knowledge bank), peer agents (other LLM agents), and python interpreter (provides execution feedback).
The framework adopts a cognitive-evolving perspective, using meta-reflection for inter-problem knowledge accumulation and cross-referencing for intra-problem lesson sharing.
MARCO enables the LLM agent to become progressively smarter at code reasoning by learning from its own past problem-solving experiences and the lessons of peer agents.

Hydra: Structured Cross-Source Enhanced Large Language Model Reasoning

Hydra: introduces a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs, including Initialization (Sets up process), Available Evidence Detection (Identifies relevant sources), Question Analysis (Breaks down question), Agentic Source Selector (Selects initial sources), Evidence Exploration (Retrieves reasoning paths), Initial Exploration (Uses selected sources), Refined Exploration (Uses LLM, online retrieval), Predicted Exploration (Uses LLM predictions), Evidence Pruning (Filters paths), Tri-factor cross-source verification (Verifies evidence reliability), Question Answering (Generates final answer), Path Refinement (Summarizes relevant facts), CoT Answering (Reasons systematically), Knowledge Graph (KG) (Structured factual source), Wikipedia (Wiki) (Semi-structured source), Web (Real-time online source), LLM (Analyzes, generates, reasons), Dense Retrieval Model (DRM) (Embeds, selects text), Search Engine (Performs online search), and Skyline Indicator (Guides retrieval order).
The framework handles multi-hop and multi-entity problems through agent-driven exploration combining structured and unstructured retrieval, increasing evidence diversity and precision.
Multi-source verification uses a tri-factor score to balance topic relevance with cross-modal agreement, pruning low-scoring branches before LLM calls.

LLM-BSCVM: An LLM-Based Blockchain Smart Contract Vulnerability Management Framework

LLM-BSCVM (LLM-Based Blockchain Smart Contract Vulnerability Management Framework): introduces an end-to-end smart contract vulnerability management framework with Vulnerability Detection Agent, Repair Suggestion Agent, Risk Assessment Agent, Vulnerability Repair Agent, Patch Verification Agent, Report Generation Agent, Smart Contract Corpus, Vulnerability Knowledge Base, LLM, and RAG components, designed to provide comprehensive capabilities for detection, analysis, repair, and evaluation.
The framework employs a three-stage Decompose-Retrieve-Generate method combining multi-agent collaboration and retrieval-augmented generation.
LLM-BSCVM achieves high detection accuracy and reduced false positives by integrating static analysis, RAG, and LLM inference.

Reinforcement Speculative Decoding for Fast Ranking

RSD (Reinforcement Speculative Decoding): introduces a multi-round modification method for fast LLM inference in ranking systems, featuring an Agent, Policy Network, Environment, State, Modification, Relevance Network, LLM, Budget, Listwise Ranking Knowledge, and Up-to-down Decoding Paradigm.
The method employs an up-to-down decoding paradigm where an agent iteratively modifies the ranking sequence under a constrained budget, utilizing a ranking-tailored policy optimization via reinforcement learning.
RSD leverages listwise ranking knowledge verified by LLMs across different rounds to enhance the agent's modification policy and demonstrates improved performance and reduced latency compared to existing methods on IR and RS tasks.

Curriculum-Guided Reinforcement Learning for Efficient Multi-Hop Retrieval-Augmented Generation

EVO-RAG: introduces a curriculum-guided reinforcement learning framework with Agent (selects actions), Environment (provides state and feedback), Actions (discrete choices), Reward Signals (seven step-level feedback), Multi-Head Preference Model (ranks action trajectories), Two-Stage Curriculum (guides training phases), Time-Based Scheduler (dynamically adjusts reward weights), and Policy (action selection strategy), designed for efficient multi-hop retrieval-augmented generation.
The framework employs a two-stage curriculum (Discovery and Refinement) and a time-based scheduler to dynamically adjust the weights of seven step-level reward signals.
A query rewriting agent interacts with the environment by selecting discrete actions (SEARCH, BACKTRACK, ANSWER, REFUSE), guided by the dynamic reward structure and trained via Direct Preference Optimization over a multi-head preference model.

LA-RCS: LLM-Agent Based Robot Control System

LA-RCS (LLM-Agent Based Robot Control System): introduces a robot control system utilizing a dual-agent framework, with Host Agent (Plans global actions), App Agent (Executes planned tasks), Memory (Stores past interactions), Robot (Performs physical actions), User (Provides task instruction), Request (User's task instruction), Global Plan (High-level action sequence), Command (Specific robot action), Observation (Visual data from robot), Sensor Data (Non-visual robot data), Thoughts (Agent's internal reasoning), Comment (Agent's progress report), and Status (Task completion state) components.
The system enables autonomous planning, execution, and environmental analysis for robots based on user requests, minimizing human intervention.
The dual-agent structure separates high-level planning from iterative task execution, allowing adaptation to dynamic environments through observation and feedback.

22nd May 2025

Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Search-R1-β-GRPO: introduces a reinforcement learning-based training method for agentic Retrieval-Augmented Generation systems, incorporating a confidence threshold into the reward function to mitigate sub-optimal search behaviors.
The approach leverages the confidence of search query generations to reward high-certainty search decisions that lead to correct answers, aiming to improve efficiency and reliability.
Experiments demonstrate that the confidence-aware training enables a 3B model to achieve better performance and reduce instances of over-search and under-search compared to baselines.

X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs

X-MAS: introduces a paradigm for building multi-agent systems with heterogeneous LLMs, supported by X-MAS-Bench for evaluating diverse LLMs across functions and domains, and demonstrated through X-MAS-Proto implementing Planning, QA, Revise, Aggregation, and Evaluation Agents.
The approach leverages the collective intelligence of diverse LLMs assigned to specialized agents to enhance system performance compared to homogeneous LLM-driven systems.
Empirical studies using X-MAS-Bench findings show that transitioning existing MAS frameworks to use heterogeneous LLMs significantly improves performance across various tasks and domains.

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

MASLab: introduces a unified codebase for LLM-based multi-agent systems, with Unified Codebase Structure, Input Preprocessing, Shared Resource Management, Unified Configuration Management, and Evaluation Framework components.
The codebase integrates over 20 methods, standardizes inputs and configurations, and provides shared access to LLMs and toolkits for research and development.
The Evaluation Framework supports fair comparisons using LLM-based and rule-based protocols, including a code execution sandbox.

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

T1-AGENT: introduces a tool-oriented conversational dataset (T1) and an LLM-based agent (T1-AGENT) for evaluating agentic planning, including a Dialogue (multi-turn conversation), Toolbox (predefined tool collection) containing Tools (functions for tasks), Cache (stores tool call results), Knowledge Base (data source for tools), and Code Execution Environment (sandbox for code).
The framework focuses on complex multi-turn dialogues with inter-tool dependencies and dynamic replanning, supported by the caching mechanism.
T1-AGENT generates executable Python code using the tools and manages cached results to efficiently handle user requests.

Beyond Correlation: Towards Causal Large Language Model Agents in Biomedicine

Causal LLM Agents: introduces a vision for integrating LLM Agents (core reasoning engine), Multimodal Data (diverse biomedical inputs), Structured Knowledge (KGs) (grounding and explainability), Formal Causal Methods/Tools (algorithms for causal inference), External Tools/Libraries/APIs (external systems interaction), within an Agentic Framework (orchestrates agent actions) with Human-in-the-loop Control (human oversight mechanism), Safety Safeguards (ensures safe operation), and Auditability (tracks and verifies decisions).
The paper discusses challenges and opportunities for these agents in drug discovery, personalized medicine, and public health applications.
Achieving this vision requires synergistic integration of components and robust evaluation methodologies.

Know the Ropes: A Heuristic Strategy for LLM-based Multi-Agent System Design

KtR (Know-The-Ropes): introduces a multi-agent framework that decomposes complex tasks into simpler, M-tractable sub-tasks, orchestrated by a System Controller using specialized agents like Worker, Trimmer, Reporter, Row Reducer, Column Reducer, Matcher, Painter, and Normalizer.
The framework converts domain priors into an algorithmic blueprint hierarchy, recursively splitting tasks until they fit base LLM capabilities with minimal augmentation.
This approach leverages disciplined decomposition and targeted augmentation to turn modest models into reliable collaborators for complex problems like Knapsack and Task Assignment.

SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

SWE-Dev dataset: introduces, "a large-scale dataset for evaluating and training autonomous coding systems on feature-driven software development tasks", with Project Requirement Description (task description), Codebase (repository context), Test Suite (executable tests), Ground Truth Code (correct solution), Runnable Environment (execution context), Evaluation (test-based assessment), Training Support (training paradigms), where SWE-Dev provides real-world feature development tasks with executable tests and runnable environments.
The dataset includes 14,000 training and 500 test samples derived from open-source projects, enabling reliable and functionally verifiable supervision.
It supports diverse training paradigms like Supervised Fine-Tuning, Reinforcement Learning, and multi-agent training using execution-based feedback.

Cracking Aegis: An Adversarial LLM-based Game for Raising Awareness of Vulnerabilities in Privacy Protection

LLM-Driven Game System: introduces, "Cracking Aegis", with Player Input, GPT-4o, System Prompt, LLM Response (json), JSON Parsing, Game State Update, and End components, where the system leverages GPT-4o to drive a text-based adventure game for privacy education.
The system processes player input via GPT-4o guided by a system prompt, parses the JSON response, and updates the game state to provide dynamic guidance, dialogue, and scenario progression.
This architecture simulates adversarial dialogue with an AI agent, enabling players to experience privacy vulnerabilities and reflect on real-world risks.

A Comprehensive Evaluation of Contemporary ML-Based Solvers for Combinatorial Optimization

FrontierCO: introduces a comprehensive benchmark for evaluating ML-based combinatorial optimization solvers, featuring diverse CO problem types, realistic test sets, standardized training data, and a toolkit for LLM agents.
The benchmark includes challenging instances from real-world applications and frontier research, designed to assess solver performance under realistic and large-scale conditions.
The study evaluates various ML-based solvers, including neural networks and LLM agents, against state-of-the-art human-designed algorithms across the benchmark's problems and difficulty levels.

--

AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

AGENTIF: introduces a benchmark for evaluating large language model instruction following in agentic scenarios, featuring a Dataset with realistic, long, complex instructions categorized by a Constraint Taxonomy and Meta Constraints, assessed via an Evaluation Protocol using Code Evaluation, LLM Evaluation, and Hybrid Evaluation methods generated by Evaluation Generation.
The benchmark includes 707 instructions from real-world agentic tasks, averaging 1,723 words and 11.9 constraints per instruction.
Evaluation results show that current models perform poorly on AGENTIF, particularly on condition and tool constraints, highlighting challenges with instruction length and complexity.

Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning

Embodied VLA Model Architecture: introduces architectural adaptations for long-horizon embodied tasks, including a Multimodal LLM Backbone, Interleaved Goal-State-Action Modeling, Vision Encoder, Context Extension Techniques, and Context Parallelism, designed to enable long-context reasoning and interaction.
The architecture uses an interleaved input structure of goal, state, and action tokens processed by a multimodal LLM to support coherent, real-time interaction modeling over extended sequences.
Context Extension Techniques and Context Parallelism are explored to address the limitations of standard LLMs in processing the extremely long input sequences generated by the ∞-THOR framework.

CODE GRAPH MODEL (CGM): A GRAPH-INTEGRATED LARGE LANGUAGE MODEL FOR REPOSITORY-LEVEL SOFTWARE ENGINEERING TASKS

Graph RAG: introduces Code Graph Models (CGMs) as the Reader component, along with Rewriter, Retriever, and Reranker, to integrate repository code graphs into LLMs for software engineering tasks.
The CGM itself comprises an Encoder, Adapter, and LLM Decoder to process semantic and structural code information.
This agentless framework achieves high resolution rates on repository-level issue fixing benchmarks using open-source LLMs.

LLM-Based Emulation of the Radio Resource Control Layer: Towards AI-Native RAN Protocols

RRC-LLM: introduces, an LLM-based framework for RRC emulation, with Large RRC Model (Core LLM), Base model (Pre-trained LLaMA3-8B), Instruction Tuning (LoRA adaptation), Uplink messages (RRC input), Downlink messages (RRC output), Network Context (Additional input data), RRC traces (Historical training data), BERT Encoder (Embeds messages), Pooling (Creates sentence embeddings), and Cosine-sim (Measures similarity), where the framework fine-tunes a Base model using Instruction Tuning on RRC traces to create a Large RRC Model that generates Downlink messages from Uplink messages and Network Context, evaluated via BERT Encoder, Pooling, and Cosine-sim.
The fine-tuned model achieves high cosine similarity with ground-truth RRC messages, demonstrating improved structural and semantic fidelity compared to a baseline LLM.
This work demonstrates the feasibility of using LAMs for control-plane procedures, laying groundwork for AI-Native Air Interface paradigms.

MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

MCP-RADAR benchmark: introduces a comprehensive benchmark for evaluating LLM tool use capabilities in the Model Context Protocol framework, featuring a Radar Bench (Testing environment), Dataset Construction (Benchmark task creation), Radar Test (Execution and analysis), and Implementation (Practical test setup).
The benchmark employs a novel five-dimensional approach measuring answer accuracy, tool selection efficiency, computational resource efficiency, parameter construction accuracy, and execution speed.
It provides objective, quantifiable measurements across multiple task domains including software engineering, mathematical reasoning, and general problem-solving.

O²-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

O²-Searcher: introduces, "O²-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering", with RL-based search agent, Simulated search interface, Local data source, Access corpus, Process retrieved info, Core model, RL-optimized agent, RL training signals, Interaction protocol components, designed to tackle open-domain open-ended and closed-ended questions.
The framework leverages a local simulated search environment for dynamic knowledge acquisition and employs a unified reinforcement learning mechanism with meticulously designed reward functions.
It uses a chat template for multi-round interaction, enabling the agent to reason, search, learn from feedback, and generate answers.

Large Language Model-Empowered Interactive Load Forecasting

LLM-based multi-agent collaboration framework: introduces an interactive load forecasting system with Human User, Task Manager, Preparation Assistant, Model Manager, Model Developer, Deployment Operator, Visualization Panel, and Experiment database components.
This framework leverages specialized LLM agents collaborating via messaging to manage the forecasting pipeline and integrate human expertise.
The system aims to lower technical barriers and improve forecasting accuracy through user interaction at key stages.

Is Your LLM-Based Multi-Agent a Reliable Real-World Planner? Exploring Fraud Detection in Travel Planning

WandaPlan: introduces, with Travel Plan Agent (Central Coordinator), Crawler Agent (Information Retrieval), Extractor Agent (Data Extraction), Summary Agent (Option Ranking), and Confirmation Agent (Final Decision), a fraudulent evaluation environment for LLM-based multi-agent travel planning systems.
The environment injects deceptive content into real-world data across misinformation, coordinated multi-person, and level-escalating multi-round fraud cases to assess agent vulnerability.
The study highlights existing frameworks' susceptibility to fraud and proposes an Anti-fraud Agent (Risk Analysis) integration to improve reliability.

Let's Get You Hired: A Job Seeker's Perspective on Multi-Agent Recruitment Systems for Explaining Hiring Decisions

Multi-Agentic Recruitment System: introduces a multi-agent AI system with MODERATOR, RECRUITER, and MENTOR agents, Memory, and an Agent Toolkit to guide job seekers and explain hiring decisions.
The system was developed using an iterative, user-centric design approach informed by active job seekers.
Evaluation demonstrated the system was perceived as significantly more actionable, trustworthy, and fair compared to traditional recruitment methods.

Psychology-driven LLM Agents for Explainable Panic Prediction on Social Media during Sudden Disaster Events

PsychoAgent (Psychology-driven generative Agent framework): introduces a psychology-driven framework for explainable panic prediction, integrating multi-domain data via a CoT-driven LLM-based agent, individual feature extraction, MoE system, and fine-tuned BERT model.
The framework simulates the psychological chain of panic formation through the agent's four stages: disaster event perception, risk cognition formation, panic emotion arousal, and posting response.
This approach provides mechanistic interpretability by modeling psychological processes, moving beyond traditional data-fitting methods for panic detection.

Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems

RecInter: introduces an agent-based simulation platform for dynamic recommender systems, featuring User Agents with User Profile, Memory Module, and Action Module, interacting with a Recommendation Platform and Merchant Agent, enhanced by Behavior Simulation Training.
The platform incorporates a novel interaction mechanism where user actions and merchant replies dynamically update item attributes, creating a realistic and evolving environment.
High-fidelity simulation is achieved through Multidimensional User Profiling, Advanced Agent Architecture, and LLM fine-tuned on Chain-of-Thought data, enabling replication of emergent phenomena like Brand Loyalty.

WEBAGENT-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

WEBAGENT-R1: introduces an end-to-end multi-turn reinforcement learning framework for training web agents, including an Agent (LLM), Dynamic Context Compression, Asynchronous Trajectory Rollout, and Multi-turn GRPO.
The framework learns directly from online interactions with web environments, guided by binary rewards, utilizing efficient mechanisms for context handling and trajectory generation.
WEBAGENT-R1 achieves state-of-the-art performance on the WebArena-Lite benchmark, highlighting the importance of behavior cloning initialization and thinking-based prompting.

LLM-Powered Agents for Navigating Venice's Historical Cadastre

LLM-Powered Agents framework: introduces a text-to-programs approach leveraging large language models to translate natural language queries into executable code for analyzing historical cadastral records, including SQL-Agent, Text-to-Python Agent, Entity Extractor, Planer, Coder, Code Execution Environment, Text2SQL (CodeS-7B), SQLite DB, Datasets, Prompt template, Column Extractor, Row Extractor, and Entity Search components.
The framework employs a SQL-Agent for structured data retrieval and a Text-to-Python Agent with specialized sub-agents (Entity Extractor, Planer, Coder) for complex analytical tasks.
The system processes historical cadastral datasets by extracting entities, planning analysis steps, generating code, and executing it to answer user queries about Venice's urban history.

Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

MEMENTO (Personalized Embodied Agent Evaluation Framework): introduces a two-stage evaluation process with Memory Acquisition Stage and Memory Utilization Stage, utilizing an LLM-powered embodied agent architecture with semantic and episodic memory for personalized assistance tasks.
The framework assesses memory utilization by comparing performance between stages where instructions vary in their reliance on previously acquired personalized knowledge.
The evaluation process includes Single-memory and Joint-memory tasks to test different levels of memory complexity and personalized knowledge types like object semantics and user patterns.

Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System

Manalyzer (Meta-analysis analyzer): automates end-to-end meta-analysis using a multi-agent system comprising Researcher, Document Collector, Literature Reviewer, Data Extractor, Checker, Data Analyst, and Reporter agents, leveraging various Tools.
It mitigates hallucinations in paper screening and data extraction through hybrid review, hierarchical extraction, self-proving, and feedback checking workflows.
The system significantly outperforms LLM baselines on a new benchmark dataset for paper screening and data extraction tasks.

No Black Boxes: Interpretable and Interactable Predictive Healthcare with Knowledge-Enhanced Agentic Causal Discovery

II-KEA (knowledge-enhanced Agentic Causal Discovery framework): introduces a multi-agent system for interpretable and interactive diagnosis prediction, comprising Clinical Datasets, Domain Knowledge, Knowledge Synthesis Agent, Causal Discovery Agent, and Decision-Making Agent, which leverages causal discovery to predict future diagnoses from EHR data.
The system utilizes three collaborative LLM agents, supported by clinical data matrices and a vector database of external medical knowledge, to generate a causal graph and predict diagnoses with explanations.
II-KEA enhances interpretability via causal analysis and interactivity through optional clinician input and external knowledge integration.

ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay

ARPO (Agentic Replay Policy Optimization): introduces an end-to-end reinforcement learning approach for GUI agents, with VLM Agent, Vision-Language Model, Observations, Actions, Chain-of-Thought, Reinforcement Learning, GRPO, Reward, Policy Gradient, Distributed Environments, Rollout Workers, Centralized Inference Server, Replay Buffer, and Valuable Tasks Selection, designed for policy optimization in complex GUI environments.
The framework augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse successful experience and employs a task selection strategy for stable training.
ARPO leverages distributed rollout and structured rewards to train vision-language GUI agents capable of multi-turn interactions and self-correction.

HIMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

HIMATE (Hierarchical Multi-Agent Framework for Machine Translation Evaluation): introduces a hierarchical multi-agent framework for machine translation evaluation, utilizing Tier-1 and Tier-2 agents structured by the MQM Hierarchy to perform Subtype Evaluation, Self-Reflection, and Collaborative Discussion, culminating in Weighted Scoring.
The framework employs a three-stage process where Tier-2 agents initially evaluate subtype errors, refine judgments through self-reflection, and engage in collaborative discussion with Tier-1 agents for low-confidence cases.
This hierarchical structure and multi-stage process, guided by the MQM error typology, enhance error span detection and severity assessment accuracy.

RAP: Runtime-Adaptive Pruning for LLM Inference

RAP (Runtime-Adaptive Pruning): introduces a framework for dynamically adjusting LLM (Large Language Model) compression strategies based on real-time conditions, utilizing an Inference Environment, Execution State, RL Agent, Greedy Sequential Importance (GSI), Pruning Action, Pruning & Inference Module, LLM, and Reward Calculation.
The framework employs a reinforcement learning agent that observes runtime state, including request characteristics and memory budget, to select an optimal pruning policy.
This adaptive approach prunes LLM components like FFN and MHA blocks, guided by GSI analysis, to balance memory efficiency and generative performance during inference.

LLM-Powered AI Agent Systems and Their Applications in Industry

LLM-Powered AI Agent System: introduces a comprehensive architecture, with Environment (Source of perception/interaction), Task Input (Defines objective/instructions), Context Augmentation (Leverages external knowledge sources), Agent (Central processing/decision unit), LLM (Cognitive engine/reasoning core), Multi-Modality Model (Processes diverse data inputs), Memory (Accesses external knowledge/history), Tool Utilization (Invokes APIs/databases/models), Output Guardrails (Filters/validates outputs), Actions (Executes decisions in environment), Other Agents (Interacts with other agents), Iterative Process (Continuous sensing/acting loop), enabling autonomous, goal-oriented behavior.
The system integrates LLMs with components for perception, memory, tool use, and guardrails to enable autonomous, goal-oriented behavior in dynamic environments.
The architecture facilitates context-aware decision-making and reliable execution through iterative sensing, planning, and action, addressing challenges like latency and uncertainty.

Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

Two-Step Optimization Pipeline: introduces a method to optimize a Multi-Agent System (Role-based LLM agents) by utilizing a Critic Mechanism (Evaluates output, generates feedback) to produce Textual Feedback (Natural language evaluation output), which is then used by a Locator (Identifies underperforming agents) to pinpoint issues and an Optimizer (Optimizes agent prompts) to refine Agent Prompts (System prompts for agents).
The pipeline focuses on improving the performance of role-based LLM agents collaborating on complex tasks like software development by iteratively refining their prompts based on feedback.
The approach demonstrates effectiveness across various software development evaluation dimensions and investigates the impact of different optimization settings.

21st May 2025

How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

LLM Agent with Memory: introduces, with Agent Execution (Performs tasks), Memory Management (Manages stored experiences), Stored Episodic Memory (Stores past experiences), Query (Input task), Retrieved Memory (Past experiences for guidance), Planning (Agent reasoning), Execution (Agent output/action), Addition (Adds new experiences), and Deletion (Removes past experiences), an empirical study on how memory management choices impact LLM agent behavior and long-term performance.
The study focuses on the fundamental memory operations of addition and deletion, investigating their impact on the agent's experience-following property and associated challenges like error propagation and misaligned experience replay.
The research proposes selective addition and combined deletion strategies to mitigate negative effects, demonstrating performance gains and robustness under challenging conditions.

MAPS: A Multilingual Benchmark for Global Agent Performance and Security

MAPS (Multilingual Agentic AI Benchmark Suite): introduces a benchmark suite for evaluating agentic AI systems across diverse languages and tasks, comprising GAIA Dataset (Real-world tasks), SWE-Bench Dataset (Code generation), MATH Dataset (Mathematical reasoning), Agent Security Benchmark Dataset (Security assessment), and a Translation Pipeline (Multilingual data generation).
The Translation Pipeline (Multilingual data generation) component utilizes Machine Translation (Initial structural translation), Meaning Preservation Verification (Check MT semantic fidelity), Translation Refinement (LLM-based enhancement), Integrity Check (Check refinement quality), and Direct LLM Translation (LLM fallback translation) to create multilingual versions of the datasets.
MAPS facilitates systematic analysis of agent performance and security degradation in multilingual settings, highlighting vulnerabilities not captured by English-only benchmarks.

ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

ViQAgent: introduces a zero-shot video question answering framework with VideoLLM Analyzer, Object Grounder, and CoT Judgment components.
The framework uses the VideoLLM Analyzer for initial video understanding and target identification, and the Object Grounder for open-vocabulary object detection and tracking.
The CoT Judgment module compares initial analysis with grounded data, generates clarification questions, and refines the final answer.

Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

LLM-GELI / Multimodal-LLM-GELI: introduces a framework that leverages a Large Language Model (LLM) to decompose sparse Global Explicit Rewards into dense Turn-level Pseudo-rewards, which are then used to train a Lightweight Reward Model for aligning a Dialogue Agent.
The Multimodal variant enhances decomposition by incorporating Multimodal Feedback Features, such as facial expressions and gaze, alongside the Dialogue Transcript as input to the LLM.
This approach obviates the need for manual reward shaping and granular human feedback, demonstrating the LLM's effectiveness in decomposing global feedback for fine-grained behavioral alignment.

HCRMP: A LLM-HINTED CONTEXTUAL REINFORCEMENT LEARNING FRAMEWORK FOR AUTONOMOUS DRIVING

HCRMP (LLM-Hinted Contextual Reinforcement Learning Motion Planner): introduces a novel motion planning architecture for autonomous driving, with Augmented Semantic Representation Module (extends state space), Contextual Stability Anchor Module (improves weight reliability), and Semantic Cache Module (integrates LLM guidance).
The framework proposes an LLM-Hinted RL paradigm where the LLM provides semantic hints for state augmentation and policy optimization, while the RL agent maintains relative independence to counteract potential hallucinations.
This approach significantly improves driving performance in diverse and safety-critical conditions by combining LLM's understanding with RL's self-learning capabilities.

Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

Checkpoint-GCG: introduces an informed adversarial attack leveraging Intermediate Model Checkpoints from the Alignment Process of an LLM to initialize the GCG algorithm sequentially, aiming to find an Adversarial Suffix that produces a Target Output from a Prompt, evaluated on the Base Model and Final Aligned Model, guided by a Checkpoint Selection Strategy.
The method exploits the incremental nature of alignment training by using successful suffixes found at earlier checkpoints as initialization for attacking later ones.
This approach demonstrates that existing alignment-based defenses are vulnerable to attacks when adversaries have knowledge of the alignment process, achieving high attack success rates.

DEBATE, TRAIN, EVOLVE: Self-Evolution of Language Model Reasoning

DTE (DEBATE, TRAIN, EVOLVE): introduces a novel ground truth-free training framework using multi-agent debate traces to evolve a single language model, including the DEBATE (Multi-agent debate process), Agents (Language models debating), RCR Prompting (Prompting strategy for debate), Debate Traces (Records of debate interactions), Consensus Answer (Final answer from debate), TRAIN (Fine-tuning process), Single Policy (Language model being trained), Reference Model (Frozen base policy), Reward Module (Calculates training reward), GRPO Optimizer (Optimization algorithm), EVOLVE (Iterative self-improvement), Evolved Single Model (Fine-tuned model), and Evolution Loop (Iterative training cycle) components.
The framework combines multi-agent debate (MAD) with self-supervised reinforcement learning (GRPO) to enable autonomous reasoning capability enhancement.
A key component is the REFLECT-CRITIQUE-REFINE (RCR) prompting strategy designed to improve debate quality and reduce issues like sycophancy.

From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

End-to-End VLA Models: introduces a paradigm that directly maps language and vision inputs to low-level actions using Language encoder, Vision encoder, Action decoder, and Controller components.
Modular VLM Pipelines utilize a specialist VLM for perception and a separate Controller for action, exemplified by a prototype with Speech Transcription, Task Decomposition, Object Detection, Object Segmentation, and Manipulation modules.
Multimodal LLM Agents position a Multimodal LLM as a cognitive hub that orchestrates Auxiliary tools for perception and a Controller for action primitives via function calls.

Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Duplex S2S Model: introduces a novel duplex speech-to-speech architecture with a Streaming Speech Encoder (Encodes user speech), Modality Adapter (Connects encoder to LLM), Channel Fusion (Combines user/agent streams), Decoder-only LLM (Processes combined streams), Pooling (Processes agent/user embeddings), Codec (Tokenizes agent speech), Text Projector (Outputs agent text), and Audio Projector (Outputs agent audio), designed for simultaneous user and agent interaction.
The model uses a pretrained streaming encoder for user input and a personalized codec for agent outputs, enabling duplex S2S without speech pretraining.
Separate modeling of agent and user facilitates codec fine-tuning for improved agent voices and reduces bitrate compared to prior work.

Swarm Intelligence Enhanced Reasoning: A Density-Driven Framework for LLM-Based Multi-Agent Optimization

SIER: introduces a framework for LLM-based multi-agent optimization, conceptualizing reasoning as a solution search process guided by swarm intelligence, featuring Population Initialization, Population Evolution, and Population Clustering and Selection phases, utilizing LLM-based agents, a Generator, and an Evaluator.
The framework employs a density-driven strategy within Population Evolution, using kernel density estimation and non-dominated sorting for multi-criteria selection to balance solution quality and diversity.
Step-level quality evaluation and adaptive sampling are used to refine reasoning paths and dynamically control exploration for efficient problem-solving.

InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

InfoDeepSeek (Agentic RAG Framework): introduces a benchmark for evaluating agentic information seeking in dynamic web environments, featuring an Agent (Orchestrates information seeking process) that operates through a Retrieval Stage (Iteratively searches and browses web), Augmentation Stage (Filters and distills retrieved content), and Generation Stage (Synthesizes final answer), utilizing an LLM (Central reasoning engine for agent), Memory (Stores agent's interaction trajectory), and Tool Library (Interface to external tools).
The framework employs an autonomous LLM agent to perform multi-step planning, search, and reflection for robust evidence acquisition from the live web.
The benchmark includes challenging questions with attributes like multi-hop, long-tail, and distracting information, evaluated using fine-grained metrics for accuracy, utility, and compactness.

Collaborative Problem-Solving in an Optimization Game

Neurosymbolic Agent (Problem-Solving version): introduces a collaborative problem-solving agent for a Traveling Salesman-based game, incorporating a Language Model with symbolic components for state tracking, grounding, and an external optimization solver.
The agent collaborates with a partner through dialogue to find an optimal path in a graph where each player has partial information about edge weights.
The neurosymbolic agent outperforms a purely LLM-based baseline, demonstrating improved correctness and optimality in finding solutions.

X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

X-WebAgentBench: introduces a multilingual interactive web benchmark for evaluating language agents, comprising Data Preparation (selects languages, prepares data), Multilingual Instruction Construction (translates instructions), Multilingual Environment Construction (translates environment), and Quality Check (validates data accuracy) stages.
The benchmark includes 14 languages, 2,800 instructions, and 589,946 products to assess agent performance in multilingual web environments.
Evaluation on X-WebAgentBench reveals challenges for current language agents in multilingual scenarios, particularly regarding language alignment and long interactions.

CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution

CRAKEN: introduces a knowledge-based LLM agent framework, with Planner (devises task plan), Executor (executes delegated tasks), Trigger Retriever (initiates knowledge retrieval), Decompose Context (extracts task information), Knowledge Database (stores domain knowledge), RETRIEVER (retrieves relevant documents), RELEVANCEGRADER (grades document relevance), GENERATOR (generates knowledge hint), HALLUCINATIONGRADER (grades hint grounding), REWRITER (rewrites query), SOLVEDGRADER (grades hint sufficiency), and Injection (integrates knowledge hint), designed to enhance cybersecurity capabilities.
The framework combines a planner-executor multi-agent system with an iterative retrieval system using Self-RAG and Graph-RAG on a cybersecurity knowledge database.
CRAKEN improves performance on complex cybersecurity tasks by decomposing context, iteratively retrieving knowledge, and injecting insights into the agent's workflow.

Multiple Weaks Win Single Strong: Large Language Models Ensemble Weak Reinforcement Learning Agents into a Supreme One

LLM-Ens: introduces a framework that leverages LLMs to dynamically ensemble multiple weak reinforcement learning agents by categorizing task situations and selecting the best-performing agent for the current context.
The framework consists of three stages: situation generation, agent reward distribution analysis, and dynamic model ensemble during inference.
LLM-Ens demonstrates improved performance over baseline ensemble methods and single agents on Atari tasks by adapting to varying task conditions and agent strengths.

LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models

LLM-Explorer: introduces a plug-in method that utilizes two LLMs, one for analyzing the agent's learning status and another for generating a policy exploration strategy distribution, enabling the Agent to adaptively explore the Environment.
The framework samples action-reward trajectories from the Agent's interaction with the Environment to inform the LLMs' analysis and strategy generation.
This design allows LLM-Explorer to enhance policy exploration in reinforcement learning by dynamically adjusting the exploration strategy based on the agent's real-time learning status.

P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

P2P (Automated Paper-to-Poster Generation): introduces a flexible, LLM-based multi-agent framework for generating academic posters, including Figure Agent (Processes visual elements), Figure Extractor (Extracts figures, tables), Figure Describer (Describes visual elements), Figure Checker (Validates figure processing), Section Agent (Generates textual content), Section Generator (Creates text content), Section Checker (Validates text content), Orchestrate Agent (Assembles final poster), HTML Generator (Renders HTML poster), Poster Checker (Evaluates poster layout), and Reflection loops (Enables iterative refinement).
The framework processes research papers through specialized agents, each with a checker module, to extract visual elements, generate content, and assemble HTML-rendered posters.
Iterative refinement via checker modules and reflection loops ensures output quality and seamless integration of visual and textual components into cohesive posters.

ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection

ReflAct (Reflect for Action): introduces a novel backbone that shifts reasoning from planning next actions to continuously reflecting on the agent's state relative to its goal, using Task Goal, Observation, Internal State, Reflection, and Action components.
This framework grounds decisions in actual observations and maintains continuous goal alignment to improve strategic reliability and reduce hallucinations.
ReflAct achieves this by explicitly prompting the LLM agent to generate a Reflection encoding the internal belief state and task goal before selecting an Action.

LMGAME-BENCH: How Good are LLMs at Playing Games?

Imgame-Bench (LMGAME-BENCH): introduces a benchmark for evaluating LLMs on games, with Models (LLM/VLM agents), Perception Module (Converts UI to symbolic/text), Memory Modules (Integrates transient memory/reflection), and Reasoning Modules (Supports reasoning traces) to enhance evaluation reliability.
The benchmark uses a modular harness to improve LLM game-playing capabilities and address challenges like poor perception, prompt sensitivity, and data contamination.
Evaluation across 13 models and 6 games shows the benchmark is challenging, differentiates models, and reveals that game-based training can transfer to other planning and agentic tasks.

Multicrossmodal Automated Agent for Integrating Diverse Materials Science Data

Multicrossmodal Agent: introduces a multi-agent LLM framework integrating diverse materials science data using specialized agents, a shared embedding space, and a fusion process.
The framework employs a Unified Team Agent to orchestrate modality-specific agents (Web, PDF, Image, Video, CSV) that process data and project insights into a shared embedding space.
A Fusion Agent orchestrates dialogue, applies dynamic gating to weighted insights from specialized agents, and generates a unified scientific report or retrieval results.

LTDA-Drive: LLMs-guided Generative Models based Long-tail Data Augmentation for Autonomous Driving

LTDA-Drive (Long-Tail Data Augmentation framework): introduces a data augmentation framework with head-class object removal, tail-class object insertion, and LLM-guided candidate filtering, designed to address long-tail distribution in 3D object detection.
The framework replaces frequent head-class objects with synthetically generated tail-class instances in driving scenes.
It leverages text-guided diffusion models for generation and an LLM agent for filtering to ensure high-quality, diverse augmented data.

An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents

LLM-based Search Agent: introduces an empirical study on training LLM agents capable of interleaved reasoning and search using reinforcement learning, investigating the impact of reward formulation, underlying LLM characteristics, and search engine choice.
The framework involves an LLM Agent interacting with a Search Engine, guided by a Reward Function within a Reinforcement Learning process.
Key findings highlight the importance of format rewards, the influence of LLM type and scale, and the critical role of the search engine in training dynamics and inference robustness.

A Risk Taxonomy for Evaluating Al-Powered Psychotherapy Agents

Risk Taxonomy: introduces a structured framework for evaluating AI-powered psychotherapy agents with Immediate Risk (Imminent danger) and Potential Risk (Emerging vulnerability) components.
The taxonomy aims to identify and categorize potential negative outcomes and user harms in AI psychotherapy interactions.
Developed through literature review, expert interviews, and clinical criteria alignment, the taxonomy provides a basis for safer AI mental health support.

Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English

PMP (Pragmatic Metacognitive Prompting): introduces, with Comprehension of Context/Understanding, General Pragmatic Analysis, Preliminary Judgment, Meta-Comprehension, Specific Pragmatic Reassessment, and LLM components, a method for explainable sarcasm detection across English varieties.
The approach adapts pragmatic metacognitive prompting to guide large language models in generating textual explanations for sarcasm in Australian, Indian, and American English.
PMP significantly improves performance compared to baseline prompting strategies, particularly for non-standard English varieties, by providing pragmatic scaffolding.

StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

StepSearch: introduces a reinforcement learning framework for search LLMs, utilizing a Policy LLM interacting with a Search Engine, trained via StePPO with a detailed Reward Mechanism and supported by a Data Augmentation Pipeline and Memory/History.
The Reward Mechanism includes both global and step-wise token-level rewards based on information gain and redundancy penalties to guide search actions.
The framework aims to improve multi-hop reasoning by providing fine-grained supervision for iterative retrieval and query formulation.

AutoData: A Multi-Agent System for Open Web Data Collection

AutoData (Automated web Data collection): introduces a novel multi-agent system for open web data collection, with Manager Agent orchestrating workflow, Research Squad extracting knowledge and designing blueprint, Develop Squad building program and validating data, and Oriented HyperGraph Cache System optimizing information flow and managing artifacts.
The system comprises eight specialized agents organized into research and development squads coordinated by the manager agent.
The OHCache system includes an oriented message hypergraph, an oriented hyperedge formatter, and a local cache system to enhance multi-agent collaboration efficiency.

ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges

ModelingAgent: introduces a multi-agent framework with Idea Proposer, Data Searcher, Model Implementor, Report Writer, Critic Module, Shared Memory, and External Tool Set components, designed to tackle complex mathematical modeling problems.
The framework utilizes a shared memory for agent coordination and an external tool set for data acquisition and code execution.
An integrated Critic Module enables iterative self-refinement of agent outputs based on specific evaluation rubrics.

UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

URDUFACTCHECK: introduces an end-to-end fact-checking framework for Urdu, with CLAIMPROCESSOR, QUERYGENERATOR, RETRIEVER, and VERIFIER components, utilizing multi-strategy evidence retrieval including Monolingual, Translated, and Thresholded Translated Retrieval.
The framework addresses the scarcity of high-quality Urdu evidence through its dynamic evidence retrieval pipeline that combines monolingual and translation-based approaches.
The paper also introduces two new hand-annotated benchmarks, URDUFACTBENCH and URDUFACTQA, for evaluating claim verification and LLM factuality in Urdu.

PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration

PiFlow: introduces a principle-aware multi-agent system for scientific discovery, including PiFlow (Strategic director, information-theoretical), Planner Agent (Relays strategic guidance), Hypothesis Agent (Proposes testable hypotheses), Experiment Agent (Validates hypotheses, using tool), and Tt (Accumulated principle-outcome data).
The framework views scientific discovery as a structured uncertainty reduction problem guided by scientific principles selected via Min-Max optimization.
This approach enhances discovery efficiency and solution quality by systematically steering hypothesis generation and validation based on accumulated evidence.

Large Language Model-Powered Agent for C to Rust Code Translation

LAC2R (LLM-powered Agent for C-to-Rust code translation): introduces a novel C-to-Rust translation approach with LLM(s) (Heterogeneous), Virtual Fuzzing-based equivalence Test, Monte Carlo Tree Search, Preprocessor, Code Analyzer, Verifier, and Postprocessor.
The framework leverages LLMs' agentic capabilities, using VFT to identify functional non-equivalence and MCTS to plan iterative code refinement steps.
The approach aims to improve the safety and correctness of Rust code translated from C by systematically guiding the LLM refinement process.

Simulating Prosocial Behavior and Social Contagion in LLM Agents under Institutional Interventions

PROSIM: introduces a simulation framework for modeling prosocial behavior in LLM agents, with Individual Simulation (instantiates agents), Scenario Simulation (defines contexts), Interaction Simulation (models network dynamics), and Intervention Simulation (manipulates policies) components.
The framework examines how prosocial behavior emerges, adapts, and erodes in LLM-based agents under diverse social and institutional conditions.
PROSIM integrates four key modules to approximate the complexity of real human social environments for studying social alignment and institutional dynamics.

MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

MAS-ZERO: introduces a self-evolved, inference-time framework for automatic multi-agent system design, utilizing a Meta-Agent that orchestrates design and verification, leverages Building Blocks of pre-defined MAS configurations, performs Meta-Iterations through iterative design and feedback, executes the generated MAS via a Compiler to obtain Intermediate Outputs and Candidate Answers, and employs Self-Verification to select the best final solution.
The framework iteratively refines MAS configurations tailored to each problem instance by decomposing tasks (Meta-Design) and evaluating performance based on intermediate outputs (Meta-Feedback) without requiring a validation set.
This approach enables dynamic agent composition and problem decomposition, leading to improved performance and cost-efficiency compared to manual and existing automatic MAS design methods.

20th May 2025

ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions

ContextAgent: introduces a context-aware proactive LLM agent framework with Sensory Context Extraction (Extracts context from perceptions), Persona Context Extraction (Extracts context from historical data), Context-aware Reasoner (Integrates contexts, reasons, predicts), LLM (Core reasoning engine), Thought Traces (Generated reasoning steps), Proactive Predictions (Predicts need for service), External Tool Calling (Calls external tools), and Services (Provides assistance).
The framework leverages extensive sensory perceptions from wearables and persona contexts to understand user intentions and predict the need for proactive assistance.
ContextAgent utilizes tool-augmented LLM reasoning and introduces ContextAgentBench, a benchmark for evaluating such agents.

Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation

LAG (log-augmented generation): introduces a framework that directly reuses prior computation and reasoning from past logs at test time, utilizing a Log Store, Log Encoder, Log Retriever, and augmented Generator LM.
The framework represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for a subset of tokens.
This approach directly reuses prior reasoning and computations without additional steps for knowledge extraction, enhancing performance and efficiency.

JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation

JARVIS (Just A Remarkable VLSI Intelligence System): introduces a multi-agent framework for high-quality EDA script generation, leveraging LLMs and domain expertise with components including a Top Agent, Code Fixing Agent, Guardrail Agent, Code Generator, RuleEnforce, Code Compiler, RAG, Simulate ProcessSim, User Query, Instructions, and Final code.
The framework employs an iterative refinement process using a feedback loop between agents and custom tools like the Code Compiler and RuleEnforce to detect and fix errors.
JARVIS integrates domain-specific knowledge through RAG and custom compiler checks, addressing challenges like data scarcity and hallucination in specialized EDA tasks.

MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

MedBrowseComp: introduces a benchmark for evaluating agent performance on complex medical information retrieval, including Primary Data Sources (External knowledge bases), Task Types (Question categories), and Verification & Scoring (Evaluation method).
The benchmark utilizes diverse medical knowledge bases and defines distinct task categories to assess agent capabilities in navigating and synthesizing information.
Evaluation involves verifying agent answers against ground truth using standardized metrics, highlighting performance gaps in multi-hop reasoning and computer use.

Think, Reflect, Create: Metacognitive Learning for Zero-Shot Robotic Planning with LLMs

Metacognitive Learning Module: introduces a framework integrating metacognitive learning into LLM-powered multi-robot collaboration with Modular Skill Set Construction, Metacognitive Inference, and Self-Reflection components.
The framework enables LLM-powered robotic agents to decompose skills, reason about tasks, synthesize plans, reflect on failures, and generate new solutions.
This approach aims to enhance zero-shot robotic task performance by empowering LLMs with capabilities for reasoning, reflection, and creative problem-solving.

MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation

MAATS (Multi-Agent Automated Translation System): introduces a modular framework using specialized LLM-based agents, including a Translator Agent (Generates initial translation), MQM Evaluator Agents (Evaluate translation errors) for specific dimensions, and an Editor Agent (Synthesizes annotations, refines translation), to enhance machine translation quality.
The system leverages the Multi-dimensional Quality Metrics (MQM) framework to provide fine-grained error detection and refinement signals across multiple dimensions.
This multi-agent architecture simulates human translation workflows, outperforming single-agent and zero-shot baselines in error detection and translation quality.

Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters

TreeDebater: introduces a novel debate framework for LLMs, utilizing a Rehearsal Tree (anticipates attacks/defenses), Debate Flow Tree (tracks debate status), Human Debate Trees (references for feedback), Simulated Audience (provides revision feedback), Speech Time Controller (controls speaking time), and Writer (drafts debate statement) to improve strategic planning in competitive debate.
The framework models dynamic debate interaction on trees, enabling LLMs to make tactical decisions under time constraints.
TreeDebater retrieves prepared arguments, selects impactful actions, and refines statements based on simulated audience feedback and time limits.

Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization

Mujica (Multi-hop Joint Intelligence for Complex Question Answering): introduces an agentic QA framework with a planner (decomposes questions, plans subquestions) and a worker (answers subquestions, uses retriever).
The planner module is responsible for breaking down complex questions into a directed acyclic graph and managing the overall process.
The worker module acts as a mini-RAG system, retrieving information and answering subquestions assigned by the planner.

Empowering LLMs in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy

DIMF (Domain-Independent Multi-Agent Framework): introduces a task-oriented dialogue system with Intent Classification Agent (extracts user intent), Slot Filling Agent (extracts dialogue slots), and Response Agent (generates system response), trained using SFT (initial model fine-tuning), DPO (preference-based training), and DDA (mitigates DPO degradation).
The framework separates complex tasks into domain-independent components to improve performance on lightweight large language models.
The proposed Data Distribution Adaptation method enhances DPO training stability and the framework demonstrates strong generalizability and zero-shot capabilities.

Safety Devolution in AI Agents

Core Evaluation Framework: introduces, "a framework to measure the impact of retrieval and alignment mechanisms on model bias and harmfulness", with Censored LLM, Uncensored LLM, Agents with Censored LLM, Generate Query, Search, ReRank, Crawl, Answer, WikiAgent, WebAgent, System-Level Safety Prompts, and Evaluator components, where "the framework systematically compares LLMs with and without retrieval augmentation and safety mitigations across various benchmarks".
The framework reveals that integrating external retrieval into safety-aligned LLMs leads to a phenomenon termed safety devolution, characterized by reduced refusal rates, increased bias, and degraded safety scores.
Controlled experiments within the framework indicate that this safety degradation is primarily caused by the mere presence of retrieved context, rather than retrieval depth or accuracy, highlighting a structural vulnerability in RAG systems.

DSMENTOR: ENHANCING DATA SCIENCE AGENTS WITH CURRICULUM LEARNING AND ONLINE KNOWLEDGE ACCUMULATION

DSMentor: introduces a framework with a Mentor agent (curriculum designer) that processes a Dataset (input tasks) to create a Curriculum-based dataset (ordered tasks), which is then used by a Student agent (code generator) interacting with a Long-term memory (accumulated knowledge) and an Environment (evaluates code) for problem-solving.
The framework operates in two stages: curriculum generation and problem-solving, leveraging curriculum learning and online knowledge accumulation.
The Mentor agent determines task difficulty to sequence problems from easy to hard, guiding the Student agent's learning progression.

MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem

MM-Agent: introduces an expert-inspired framework that decomposes mathematical modeling into four sequential phases: Problem Analysis, Mathematical Modeling, Computational Solving, and Solution Reporting.
The framework utilizes specialized agents like the Analyst Agent, Task Coordinator Agent, Modeling Actor, Modeling Critic, Modeling Programmer Agent, and Reporting Agent to handle distinct tasks within each phase.
Key components such as the Hierarchical Mathematical Modeling Library (HMML) and MLE-Solver support knowledge retrieval, model formulation, and computational execution for real-world problems.

s3: You Don't Need That Much Data to Train a Search Agent via RL

s3: introduces a modular, RL-based search framework with a Searcher LLM (RL-trained agent), Search Engine (retrieval source), frozen Generator LLM (frozen answer generator), and Gain Beyond RAG (reward signal), which trains a search-only agent using a novel reward signal to optimize retrieval for generation quality.
The framework decouples the searcher from the generator, allowing the searcher to be trained with reinforcement learning based on the improvement in generator accuracy using retrieved documents compared to naive retrieval.
By focusing training solely on the searcher using a generation-aware reward, s3 achieves strong performance with significantly less training data and is compatible with black-box generator LLMs.

Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent

SPlanner: introduces a framework for mobile GUI agents that includes Application Modeling via EFSM (models applications), Structured Knowledge Base (collection of EFSMs), Plan Generation (creates execution plan), Instruction Parsing (parses user instruction), EFSM Solving (finds execution path), Path Polishing (refines execution path), Task Execution with VLM (executes the plan), Vision-Language Model (VLM) (executes GUI actions), LLM (parses/polishes text), BFS-based Solver (finds path in EFSM), User Instruction (input command), GUI Screenshot (current screen state), Action History (previous actions), Task Plan (step-by-step guide), and Operation Instruction (GUI action).
The framework models mobile applications using Extended Finite State Machines (EFSMs) to create a structured knowledge base for planning.
SPlanner generates interpretable and reliable execution plans by parsing user instructions, solving EFSMs, and polishing the resulting paths using LLMs, which are then executed by a VLM.

BAR: A Backward Reasoning based Agent for Complex Minecraft Tasks

BAR (Backward Reasoning based Agent): introduces an agent for complex Minecraft tasks with recursive goal decomposition (Recursive goal decomposition), state consistency maintaining (State conflict resolution), and stage memory (Memory from environment interaction) modules.
The agent utilizes backward reasoning to plan from the terminal state, aiming to overcome the perception gap faced by forward reasoning in complex tasks.
State consistency is ensured by integrating forward and backward reasoning, and planning efficiency is enhanced by leveraging successful past interactions stored in stage memory.

Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

ReasonRAG: introduces, "Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning", with all LLM (Core reasoning model), Retriever (External knowledge access), Reasoning Stage (Decide query or answer), Grounding Stage (Extract evidence), Terminal Stage (Final answer state), Query Generation (Formulate search query), Evidence Extraction (Identify relevant text), Answer Generation (Produce final response), Memory (Stores previous steps)-components, where ReasonRAG is a process-supervised agentic RAG method using fine-grained rewards for policy optimization.
The framework employs Monte Carlo Tree Search and Shortest Path Reward Estimation to generate a high-quality process-level dataset, RAG-ProGuide, for training.
ReasonRAG enables LLMs to autonomously manage dynamic retrieval, iterative context refinement, and adaptive workflows for complex search queries.

Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning

SPLIT-RAG (Semantic Partitioning of Linked Information for Type-Specialized Multi-Agent RAG): introduces a multi-agent RAG framework with Knowledge Base Preprocessing (Prepare data), QA Input Processing (Analyze query), Retrieval Plan Decision (Determine subgraphs/agents), Multi-Agent RAG (Distributed retrieval), Answer Generation (Combine, resolve, finalize), Lightweight LLM Agents (Query subgraphs), and Head Agent (Final answer generation), which partitions knowledge graphs based on question types and uses multiple agents for efficient, conflict-resistant retrieval and answer generation.
The framework employs question-driven graph partitioning to create semantically coherent subgraphs, enabling lightweight agents to query only relevant partitions in parallel.
A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications, with a head agent synthesizing the final response.

MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

MLZero: introduces a multi-agent system for end-to-end machine learning automation, featuring Perception, Semantic Memory, Episodic Memory, and Iterative Coding modules, coordinated by specialized agents including File Grouping and File Perception, Task Perception, ML Library Selection, Condensation, Summarization, Retrieval, Error Analyzer, Coder, and Executer agents.
The system processes raw multimodal data through perception, leverages dual memory modules for knowledge and history, and employs iterative coding with agents for code generation, execution, and debugging.
MLZero achieves end-to-end ML automation with minimal human intervention by transforming raw data into ready-to-use models and predictions through this integrated multi-agent architecture.

DRUGPILOT: LLM-BASED PARAMETERIZED REASONING AGENT FOR DRUG DISCOVERY

DrugPilot (LLM-based parameterized reasoning agent): introduces an agent system for automating multi-stage drug discovery workflows, comprising LLM Backbones (Core language model), Parameterized Memory Pool (PMP) (Structured key-value data storage), AI Model Zoo (Drug discovery tools/models), and Fe-Fo Mechanism (Error feedback and focus).
The Parameterized Memory Pool (PMP) is a core component designed to handle large-scale, multi-modal drug data by converting it into standardized parametric representations for efficient retrieval and interaction.
The Fe-Fo Mechanism enhances the agent's robustness by providing specific error feedback and maintaining focus during complex multi-turn tasks and tool interactions.

CLEVER: A Curated Benchmark for Formally Verified Code Generation

CLEVER (Curated Lean Verified Code Generation Benchmark): introduces a benchmark for formally verified code generation, requiring models to perform specification generation, isomorphism proving, Lean implementation generation, and correctness proving to achieve end-to-end verification.
The benchmark evaluates models in two stages: specification certification (generating and proving equivalence of a Lean specification) and implementation certification (generating and proving correctness of a Lean implementation).
Success in CLEVER requires both the generated specification and implementation to be formally certified via Lean proofs, ensuring semantic correctness beyond test cases.

PANDAGUARD: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks

PANDAGUARD: introduces a unified and modular framework for systematic LLM safety evaluation, conceptualizing jailbreak safety as a multi-agent system with Attacker (generates adversarial prompts), Defender (implements protection mechanisms), Target LLM (model being evaluated), and Judger (evaluates response safety) components.
The framework supports plug-and-play experimentation with diverse attack methods, defense mechanisms, and judgment strategies within a flexible pipeline architecture.
Built upon this framework, PANDABENCH provides a large-scale benchmark for comprehensive and reproducible evaluation of LLM jailbreak vulnerabilities and defenses.

Structured Agent Distillation for Large Language Model

SAD (Structured Agent Distillation): introduces a framework that compresses large LLM agents into smaller student models by segmenting teacher trajectories into Reasoning and Action Segments, applying span-specific supervision via Segment Masks and aggregating losses using CoT-Policy Alignment Loss and Action Consistency Loss, guided by Curriculum Sampling.
The framework uses a Teacher to generate trajectories, which the Student learns to imitate by minimizing the Total Loss, preserving both reasoning fidelity and action consistency.
This structure-aware distillation method outperforms token-level baselines, enabling compact agents to better replicate the teacher's decision process with minimal performance drop.

RAG/LLM Augmented Switching Driven Polymorphic Metaheuristic Framework

PMF (Polymorphic Metaheuristic Framework): introduces a self-adaptive metaheuristic framework for optimization problems, with PMA (Orchestrates algorithms), PMSA (Selects algorithms), Metaheuristic Algorithms Pool (Available algorithms), Feedback Loop (Real-time adaptation), and Population Transfer (Transfers solutions).
The framework utilizes a Polymorphic Metaheuristic Agent (PMA) and a Polymorphic Metaheuristic Selection Agent (PMSA) for dynamic algorithm selection and switching based on real-time performance feedback.
PMF leverages real-time performance feedback and can integrate RAG/LLM for enhanced decision-making, demonstrating improved optimization efficiency and adaptability.

LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization

LLINBO (LLM-in-the-Loop Bayesian Optimization): introduces a hybrid framework for Bayesian Optimization combining LLM Agent (Suggests design points), Statistical Surrogate (GP) (Models function, quantifies uncertainty), and Dataset (Stores historical observations).
The framework leverages LLMs for early exploration using contextual reasoning and GPs for efficient exploitation using principled statistical models.
Three specific mechanisms (Transient, Justify, Constrained) are proposed to enable this collaboration and provide theoretical guarantees.

19th May 2025

SIMULATION AGENT: A FRAMEWORK FOR INTEGRATING SIMULATION AND LARGE LANGUAGE MODELS FOR ENHANCED DECISION-MAKING

Simulation Agent framework: introduces, with Simulation Model (Core engine), Inputs (Configuration files), Outputs (Time-series data), AI Agent (Bridge/interpreter), and User (End stakeholder), a system integrating simulations and LLMs for enhanced decision-making.
The AI Agent utilizes LLMs and tool-calling capabilities to enable natural language interaction for running simulations, modifying inputs, and interpreting outputs.
This framework aims to improve the usability and accessibility of complex simulation analysis for non-technical users by grounding LLM interactions in accurate simulation results.

Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents

Guided Search Strategies: introduces two guided search methods, 1-step lookahead and trajectory selection, guided by a learned action-value function estimator, applicable to non-serializable environments like SWE-agent scaffolding.
The approach leverages a base policy (LLM) and a critic model to improve performance consistency in environments where intermediate states cannot be saved or restored.
Empirical evaluation on SWE-bench Verified demonstrates that these strategies significantly improve the success rate of both open-weight and closed models.

Incentivizing Truthful Language Models via Peer Elicitation Games

Peer Elicitation Games (PEG): introduces a training-free, game-theoretic framework for aligning LLMs, including a Generator (produces responses), Discriminators (evaluate responses), Peer Elicitation Game (structures agent interaction), Reward Mechanism (incentivizes truthful reporting), Policy Update (adjusts agent strategies), and Majority Vote (aggregates discriminator judgments).
PEG employs multiple LLM discriminators that evaluate a generator's output and are rewarded based on mutual evaluation using a determinant-based score, promoting truthful reporting without ground-truth labels.
The framework utilizes online learning with policy updates to converge agents towards a truthful Nash equilibrium, demonstrating improved factual accuracy and competitive performance with smaller models.

Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

DialogTool/VirtualMobile: introduces a benchmark and environment for evaluating stateful tool use in multi-turn dialogues, featuring a Dialogue Agent interacting with a VirtualMobile Environment containing Apps and APIs with a Database.
The benchmark assesses large language models across the entire tool use lifecycle, including creation, utilization (awareness, selection, execution), and role-consistent response.
Experiments reveal that current large language models struggle with tool creation and execution, particularly over long dialogue horizons and with complex APIs.

TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents

TimeSeriesGym: introduces, with (Origins), (Tasks), (Agent artifacts), (Supports agents), (Numeric metrics), (LLM assessment), (LLM qualitative), and (Combined evaluation) components, a scalable benchmark for evaluating AI agents on time series ML engineering tasks.
The framework provides diverse challenges and evaluates multimodal agent outputs using a dual quantitative and qualitative assessment approach.
Its agent-agnostic design and scalable task generation mechanism support comprehensive and practical evaluation of AI agents.

Hybrid Voting-Based Task Assignment in Modular Construction Scenarios

HVBTA (Hybrid Voting-Based Task Assignment): introduces a framework for multi-agent task assignment in construction, integrating Task Descriptions (Define task requirements), Agent Capability Profiles (Define agent abilities), Suitability Matrix Generation (Calculate agent-task compatibility), LLM Integration for Semantic Reasoning (Use LLM for suitability), Voting and Allocation Mechanism (Assign tasks using voting), and CBS for Path Planning (Generate collision-free paths).
The framework defines tasks and agents, calculates suitability, uses voting and LLM for assignments, and plans collision-free paths.
HVBTA aims to improve efficiency and coordination for heterogeneous robotic teams in modular construction.

From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

Three-level taxonomy: introduces, "From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery", with all LLM as Tool (Task Automation Tool), LLM as Analyst (Data Modeling & Analytical Agent), and LLM as Scientist (Open Exploratory & Discovery Agent)-components, where the survey systematically charts the progression of Large Language Models in scientific discovery through distinct levels of autonomy.
This framework delineates the escalating autonomy and evolving responsibilities of LLMs within the scientific research lifecycle, from foundational assistants to autonomous researchers.
The paper categorizes existing research works based on this taxonomy and the stages of the scientific method, highlighting the shift towards sophisticated, multi-stage agentic workflows.

Effective and Transparent RAG: Adaptive-Reward Reinforcement Learning for Decision Traceability

ARENA (Adaptive-Rewarded Evidence Navigation Agent): introduces a transparent RAG generator framework trained via reinforcement learning, with Structured Generation, KL Stabilization, and Adaptive Reward Calculation components, enabling interpretable decision traces and effective reasoning.
The framework uses a structured output format including selected evidence, reasoning traces, and final answers for end-to-end interpretability.
Adaptive, task-specific rewards and a stabilized optimization strategy are tailored for multi-hop question answering tasks.

Agentic Publications: An LLM-Driven Framework for Interactive Scientific Publishing, Supplementing Traditional Papers with AI-Powered Knowledge Systems

Agentic Publications (AP): introduces an LLM-driven framework, with Knowledge Representation Layer (Store scientific knowledge), Interactive Query Interface (User/AI interaction), Dynamic Updating Mechanism (Continuous knowledge ingestion), and Verification and Governance Process (Ensure quality/integrity), transforming traditional papers into interactive knowledge systems.
The framework integrates structured and unstructured data using Retrieval-Augmented Generation (Connect LLM to knowledge) and Multi-Agent Verification (Collaborative accuracy checks) processes.
Supported by a Knowledge Base (Integrated data store), Agents (Perform tasks/checks), Databases (Vector/graph/relational storage), and Integration APIs (External/internal connectivity), the system enables dynamic knowledge synthesis and interactive access for humans and AI.

Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities

Adversarial Evaluation Framework: introduces a method to stress-test LLM decision-making using LLM (Subject under test), Task (Interactive environment), Learner Model (RNN) (Predicts LLM behavior), and Adversary (RL agent) (Manipulates environment).
The framework trains a Learner Model to predict LLM actions and an Adversary to exploit these patterns by manipulating task rewards and observations.
This approach reveals LLM vulnerabilities to manipulation and rigidity in strategy adaptation in dynamic, adversarial settings.

Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair

WILLIAMT: introduces a cheap crash-site program repair approach, comprising Regex-Based Context Retrieval (identifies crash site) and Template-Guided Patch Generation (generates fix using templates and LLM).
The system leverages regex on sanitizer reports for context retrieval and template-guided LLM analysis to identify key variables for patch generation, producing a One-Shot Patch.
This strategy minimizes LLM reliance, substantially reducing query costs and enabling effective repair with smaller, cheaper models.

Q²Forge: Minting Competency Questions and SPARQL Queries for Question-Answering Over Knowledge Graphs

Q²Forge: introduces an end-to-end pipeline for generating question-query datasets for knowledge graphs, including KG Configuration Creation (Configure KG), Competency Question Generation (Generate NL questions), SPARQL Query Generator & Executor (Translate & run queries), and SPARQL Query Refinement (Iteratively improve queries).
The framework guides users through configuring a knowledge graph, generating natural language competency questions, translating them into SPARQL queries, executing the queries, and refining the results.
Q²Forge leverages language models and other services to automate and assist in creating high-quality question-query pairs for knowledge graph documentation, training, and benchmarking.

The Hidden Dangers of Browsing AI Agents

Browser Use: introduces a security evaluation of autonomous browsing AI agents, focusing on systemic vulnerabilities across architectural layers, using Browser Use as a case study.
The paper analyzes the attack surface of browsing agents, detailing threats related to Perception, Reasoning/Planning, External Tools (Actions), Browsing Engine, Sensitive Data Handling, and Domain Restriction components.
Identified vulnerabilities in Browser Use, including domain restriction bypass and credentials exfiltration via prompt injection, highlight the need for multi-layered security approaches.

CAIM: Development and Evaluation of a Cognitive AI Memory Framework for Long-Term Interaction with Intelligent Agents

CAIM (Cognitive AI Memory Framework): introduces a framework with a Memory Controller (Central decision unit), Memory Retrieval (Filters relevant LTM data), Post-Thinking (Maintains LTM storage), Memory (STM/LTM) (Stores conversation/historical data), LLM Agent (Performs tasks within modules), and Python task (Processes/stores data) for long-term interaction.
The framework enhances LLMs' memory capabilities by integrating cognitive AI principles and memory-augmented methods.
CAIM addresses challenges in long-term interactions by managing memory usage, filtering relevant information, and maintaining memory storage.

Adversarial Reasoning for Repair Based on Inferred Program Intent

ADVERINTENT-AGENT: introduces a multi-agent framework for automated program repair, with Reason Agent (reasons program intent), Test Agent (generates tests), Repair Agent (generates patches), and Environment (compiles, executes, searches) components, that infer

Name		Name	Last commit message	Last commit date
Latest commit History 1,276 Commits
Autonomous_Agents_Resources.md		Autonomous_Agents_Resources.md
Autonomous_agent_logo.png		Autonomous_agent_logo.png
LICENSE		LICENSE
README.md		README.md

License

tmgthb/Autonomous-Agents

Folders and files

Latest commit

History

Repository files navigation

Autonomous Agents

Research papers

18th June 2025

17th June 2025

16th June 2025

15th June 2025

14th June 2025

13th June 2025

12th June 2025

11th June 2025

10th June 2025

9th June 2025

8th June 2025

7th June 2025

6th June 2025

5th June 2025

4th June 2025

3rd June 2025

2nd June 2025

1st June 2025

31st May 2025

30th May 2025

29th May 2025

28th May 2025

27th May 2025

26th May 2025

25th May 2025

24th May 2025

23rd May 2025

22nd May 2025

21st May 2025

20th May 2025

19th May 2025

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Packages