Autonomous Agents

Autonomous Agents-research papers. Updated daily. Resources-section-section.

Research papers: 2025

2025, 2024, 2023, Earlier

Chronological order.

22nd July 2025

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

ThinkAct: introduces a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning, training a multimodal LLM to generate embodied reasoning plans guided by action-aligned visual rewards.
The framework's Reasoning MLLM generates reasoning plans and a visual plan latent, which then conditions the Action Model for robust action execution, enabling asynchronous operation for slow thinking and fast control.
ThinkAct leverages action-aligned visual feedback, including goal completion and trajectory consistency, to reinforce reasoning, leading to capabilities like few-shot adaptation, long-horizon planning, and self-correction in complex embodied AI tasks.

LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

LingBench++ (A Linguistically-Informed Benchmark and Reasoning Framework): introduces a Multi-Agent Framework for solving linguistic problems, which includes Solver Agents (proposes initial linguistic hypotheses), Aggregator Agents (collects, synthesizes solutions), a Final Aggregator (generates final solution), and a Grammar Agent (retrieves linguistic reference knowledge).
This multi-round framework enhances LLM reasoning by enabling iterative hypothesis generation, solution aggregation, and external knowledge retrieval for complex linguistic tasks.
The framework emphasizes stepwise reasoning quality and grammar-informed verification, providing diagnostic insights beyond final answer accuracy.

Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Agentar-Fin-R1: introduces a family of financial LLMs, engineered based on the Qwen3 foundation model, with a development pipeline that includes a Data Pipeline (constructs high-quality data), a Label System (structures data synthesis), and a Training Pipeline (optimizes LLM performance).
The Data Pipeline integrates source governance, multi-agent data synthesis, and rigorous verification to ensure data quality and domain relevance for financial applications.
The Training Pipeline employs a weighted training framework and a two-stage strategy for efficient knowledge injection and challenge enhancement, complemented by an attribution loop for continuous model refinement.

Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent

TTM (Test-Time-Matching): introduces a training-free role-playing framework that automatically decouples character features into personality, memory, and linguistic style, utilizing a structured three-stage generation pipeline for controlled role-playing.
The framework's pipeline includes a Styleless Response Generation stage, a Memory-checked Response Generation stage, and a Stylized Response Generation stage, ensuring high-fidelity and stylistically consistent character dialogues.
TTM enhances controllability and personalization in role-playing language agents by enabling seamless combinations across diverse linguistic styles and variations in personality and memory.

DELIBERATIVE SEARCHER: IMPROVING LLM RELIABILITY VIA REINFORCEMENT LEARNING WITH CONSTRAINTS

Deliberative Searcher: introduces a reasoning-primary, information-secondary framework that integrates LLM deliberation with selective web search and confidence calibration, trained via a constrained reinforcement learning algorithm to align confidence with correctness.
The framework enables an agent to perform multi-step reflection and verification over external data, dynamically updating its confidence metrics through actions like THINK, SEARCH, and READ.
It optimizes for accuracy under a soft reliability constraint, utilizing a reward signal composed of format compliance, answer correctness, and reliability rewards to produce trustworthy outputs.

RAVine: Reality-Aligned Evaluation for Agentic Search

RAVine (Reality-Aligned Evaluation for Agentic Search): introduces a comprehensive evaluation framework for agentic LLMs with search, addressing misalignments in existing methods by targeting multi-point queries and long-form answers, and evaluating the iterative process.
The framework includes an Agentic LLM with Search, a Web Corpus, Search and Fetch Tools, an Attributable Nuggets Collection for fine-grained ground truth, and both Block-level and Process-Oriented Evaluations.
It provides a full-process, reproducible, and goal-aligned evaluation sandbox that assesses report quality, tool performance, and efficiency, offering insights into agentic search system development.

Agentic RAG with Knowledge Graphs for Complex Multi-Hop Reasoning in Real-World Applications

INRAExplorer (Agentic RAG system): introduces an agentic RAG system with an LLM-based agent, a hybrid knowledge base (Vector Database and Knowledge Graph), and specialized tools (SearchGraph, SearchPublications, SearchConceptsKeywords, IdentifyExperts) for complex multi-hop reasoning in scientific data.
The system empowers its LLM-based agent to dynamically navigate tools, gather evidence, and plan subsequent steps, enabling multi-hop reasoning and comprehensive answer generation from scientific data.
This approach overcomes classical RAG limitations by deeply integrating knowledge graph querying as a core agentic capability, enabling precise, relationally-aware retrieval and adaptive multi-hop reasoning.

Towards Enforcing Company Policy Adherence in Agentic Workflows

Framework for Enforcing Business Policy Adherence in Agentic Workflows: introduces a deterministic, transparent, and modular framework with an Offline Buildtime Stage that compiles Policy Documents, Toolkit, and Data Schema into verifiable ToolGuards (Python code) via a Tool-Policy Mapper and ToolGuard Generator, and a Runtime Integration where these ToolGuards ensure compliance before each LLM Agent action within a ReAct Workflow, preventing non-compliant Tool Invocations for the Customer.
The framework's buildtime phase leverages an LLM-based Tool-Policy Mapper to transform natural language policies into a Compact Tool-Oriented Policy Representation, which then feeds into an LLM-based ToolGuard Generator to produce executable Python ToolGuards.
This approach aims to bridge the gap between flexible AI behavior and organizational constraints by proactively preventing policy violations in LLM-based agentic workflows, ensuring reliable and predictable enterprise-scale operations.

LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning

ColaUntangle (collaborative consultation framework for commit untangling): introduces a multi-agent LLM-driven system for untangling commits by reasoning about explicit and implicit code dependencies, with all its components, where the system integrates structured code change information with specialized LLM agents in an iterative consultation process.
It leverages Structured Code Change Information, including explicit and implicit contexts derived from multi-version Program Dependency Graphs, to inform its Multi-Agent Architecture comprising Explicit Worker, Implicit Worker, and Reviewer LLM agents.
The Untangling Workflow orchestrates the iterative collaborative consultation among these agents to achieve consensus on untangling decisions and provide explanations for improved transparency.

Application of LLM Guided Reinforcement Learning in Formation Control with Collision Avoidance

LLM-FCCA (LLM-Guided Formation Control with Collision Avoidance): introduces a framework that leverages LLMs to dynamically generate and refine reward functions for multi-agent formation control with collision avoidance, utilizing an LLM Reward Designer, RL Training, Evaluation, Policy, Real World Deployment, Environment, Task Description and Tips, and Agent Observations Format (Local State, Obstacles State, Communication Data).
The framework dynamically adjusts reward functions online using advanced evaluation metrics, enabling efficient simultaneous achievement of formation control and obstacle avoidance.
Empirical studies in both simulation and real-world settings validate the approach's practicality and effectiveness, demonstrating superior performance with fewer iterations compared to human-designed methods.

Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery

Agent PULSE (Patient Understanding and Liaison Support Engine): introduces a voice-based AI agent for digital health delivery, integrating a Voice Interface (standard telephone lines), an AI Engine (core intelligence platform) with LLMs (inference, fine-tuning service), SOLOMON (conversation management, analysis), and RAG (combines questionnaire, medical knowledge), and a Physician Dashboard (healthcare provider interface).
This system aims to bridge economic and accessibility gaps in healthcare by providing scalable, cost-effective, and equitable solutions for preventive care and continuous patient monitoring.
A pilot study with 33 inflammatory bowel disease patients demonstrated high patient acceptance and significant workflow advantages for healthcare providers, validating its potential to fill care gaps.

RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs

Self-reflection agent: introduces a framework for Verilog generation, integrating an LLM with Design Specification input and Feedbacks from Syntax Checker, Testbench Verification, and Formal Verification for iterative code modification.
This agent iteratively refines generated Verilog code by leveraging verification feedback to improve correctness and reliability.
The iterative feedback loop aims to address syntax errors, functional errors, and formal verification failures, enhancing LLM-generated hardware designs.

Do Large Language Models Have a Planning Theory of Mind? Evidence from MINDGAMES: a Multi-Step Persuasion Task

MINDGAMES (Planning Theory of Mind Task): introduces a novel task framework for evaluating LLMs' ability to dynamically plan actions and strategically intervene on others' mental states, featuring a Persuader Agent (human or LLM participant), a Target Agent (hard-coded rational bot), a Dialogue Environment (multi-turn conversational interface), Proposals (three selectable options), Value Functions (agent preference definitions), Information Sets (agent knowledge states), and Mental States (target's beliefs and desires).
This framework assesses "planning theory of mind" (PToM) by requiring the persuader to infer the target's beliefs and desires to persuade them to alter their behavior, moving beyond passive ToM assessments.
The task involves the persuader selectively disclosing information to the target, who has partial information and makes rational choices based on its value function, highlighting a capability gap between human and LLM social reasoning.

Benchmarking LLM Privacy Recognition for Social Robot Decision Making

LLM Privacy Recognition Benchmark: introduces a methodology to evaluate LLMs' privacy awareness in social robot interactions, encompassing scenario generation, human preference elicitation, LLM evaluation with various prompting strategies, and subsequent analysis.
The benchmark leverages the Contextual Integrity framework to create privacy-relevant scenarios and crowdsourced human data to establish preferred robot behaviors and user privacy orientations.
It assesses LLM conformity to human privacy expectations, identifies the impact of different prompting strategies, and provides insights for designing privacy-aware LLM-powered social robots.

Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation

Screen2AX: introduces a vision-based pipeline for automatic macOS accessibility generation, processing UI screenshots through UI element detection, text detection, element description, and hierarchy generation to produce structured hierarchical accessibility metadata.
The framework leverages YOLOv11 for element localization, classification, and hierarchical grouping, and BLIP for generating semantic descriptions of UI elements.
The system aims to bridge the gap in macOS accessibility support by creating real-time, tree-structured accessibility metadata from single screenshots, outperforming built-in tools in quality.

Augmenting Von Neumann's Architecture for an Intelligent Future

Augmented Von Neumann Architecture: introduces a novel computer architecture that extends the classical Von Neumann model with a dedicated Reasoning Unit (RU) for native artificial general intelligence capabilities, alongside the Central Processing Unit (CPU), Arithmetic Logic Unit (ALU), Memory Subsystem, Control Unit, Input/Output System, and a Semantic Interconnect Bus (SIB).
This architecture enables autonomous agents to perform goal-directed planning, dynamic knowledge manipulation, and introspective reasoning directly within the computational substrate at system scale.
The framework establishes a computational foundation where reasoning, learning, and adaptation emerge as intrinsic execution properties, moving beyond traditional sequential computation.

Distributed Oscillatory Guidance for Formation Flight of Fixed-Wing Drones

Distributed Oscillatory Guidance: introduces a novel approach for fixed-wing drone formation flight by modulating path progression through a non-negative input-saturated consensus strategy, integrating an Inverse Kinematics Guiding Vector Field (IK-GVF) path-following controller, and leveraging fixed-wing drone dynamics.
This method enables coordinated formation flight without requiring speed actuation, achieving synchronized path following by inducing controlled oscillations in the guiding vector field.
The approach ensures robust convergence to desired formations even with speed fluctuations, validated through numerical simulations and real-world flight experiments.

From model-based learning to model-free behaviour with Meta-Interpretive Learning

MIL-M2MF (Meta-Interpretive Learning for Model-Based to Model-Free Behavior): introduces a framework that uses a MIL System to learn a Model-based Solver, which then generates examples to train a Model-free Controller, enabling autonomous agents to combine planning and exploration capabilities in novel environments.
The Model-based Solver plans actions with full environment knowledge, while the Model-free Controller acts without a model, relying on learned state-action mappings.
The framework demonstrates the equivalence in problem-solving ability between the learned Solver and Controller on grid navigation tasks, utilizing specialized FSC Executors and a Grid Master environment.

21st July 2025

Deep Researcher with Test-Time Diffusion

TTD-DR (Test-Time Diffusion Deep Researcher): introduces a novel framework that conceptualizes research report generation as a diffusion process, iteratively refining a preliminary draft through denoising and self-evolution, leveraging LLM-powered agents for each stage.
The framework initiates with a noisy draft and a research plan, which are then refined via a continuous feedback loop incorporating external information through a retrieval mechanism.
This draft-centric design, enhanced by component-wise self-evolution, ensures timely and coherent report writing while minimizing information loss during the iterative search process.

LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

LLM Economist: introduces a novel framework for agent-based economic modeling, featuring a Tax Planner (LLM agent) that designs tax policies and Worker Agents (LLM agents) that adjust labor, all interacting within an Environment that simulates economic outcomes.
The framework employs in-context reinforcement learning for both the planner and workers, enabling them to adapt to strategic environments with hierarchical decision-making and optimize their respective utility functions.
It uniquely integrates census-calibrated population modeling, dynamic tax-mechanism optimization, and democratic governance, providing a testbed for fiscal policy evaluation at a societal scale.

Towards physician-centered oversight of conversational diagnostic AI

g-AMIE (guardrailed-AMIE): introduces a novel asynchronous oversight paradigm for conversational diagnostic AI, enabling AI-driven patient intake with strict guardrails and subsequent human physician oversight via a dedicated clinician cockpit.
This framework decouples AI-driven patient intake from medical advice delivery, mandating human oversight by licensed primary care physicians to ensure safety and accountability.
A randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study demonstrated g-AMIE's superior performance in high-quality intake and case summarization compared to human control groups.

A Framework for Analyzing Abnormal Emergence in Service Ecosystems Through LLM-based Agent Intention Mining

EAMI: introduces a framework for analyzing abnormal emergence in service ecosystems, integrating Agent-Based Modeling, Inspector Agent, Analysis Agent, Intention Repository, Embedding Module, Clustering Module, and Intention Temporal Emergence Diagram to bridge microscopic agent intentions with macroscopic service emergence.
The framework employs LLMs within its Inspector and Analysis Agents, along with a Memory component and Dual-Perspective Thought Extraction, to track and extract agent thoughts, enabling dynamic and interpretable emergence analysis.
It identifies phase transition points in group intentions through embedding and clustering, then visualizes their temporal evolution to explain complex system phenomena.

GasAgent: A Multi-Agent Framework for Automated Gas Optimization in Smart Contracts

GasAgent: introduces a multi-agent framework for automated Gas optimization in smart contracts, including Seeker (identifies known patterns), Innovator (discovers new patterns), Executor (applies and validates changes), Manager (orchestrates workflow and reports), Gas Waste Pattern Library (stores known patterns), and New Pattern Blacklist (filters invalid patterns), designed to combine compatibility with existing patterns and automated discovery/validation of new patterns for end-to-end optimization.
The framework addresses limitations of manual Gas optimization and single LLM approaches by enabling specialized agents to collaborate in a closed loop for identifying, validating, and applying Gas-saving improvements.
GasAgent demonstrates effectiveness by optimizing real-world contracts with an average deployment Gas saving of 9.97% and usability for LLM-generated contracts, serving as a reliable optimization layer.

BUGSCOPE: LEARN TO FIND BUGS LIKE HUMAN

BUGSCOPE (BugScope: Learn to Find Bugs Like Human): introduces an LLM-driven multi-agent system that emulates human auditors' workflow, including a Context Retrieval Agent (retrieves relevant code context) and a Bug Detection Agent (detects and validates bugs), which together automate the end-to-end auditing process.
The Context Retrieval Agent utilizes Retrieval Strategy Synthesis with a Seed Extractor and Retrieval Direction, and performs Slicing-Based Context Retrieval with an AST Parser and LLM to gather relevant code snippets.
The Bug Detection Agent synthesizes a detection prompt using an LLM, Reasoning Hints, and Prompt Reflection, then employs the LLM for Bug Validation to generate structured Bug Reports, effectively generalizing across diverse anti-patterns.

DHEvo: Data-Algorithm Based Heuristic Evolution for Generalizable MILP Solving

DHEvo (Data-Algorithm Based Heuristic Evolution): introduces a data-algorithm co-evolution framework that iteratively selects representative MILP instances and evolves corresponding heuristics, with all Initialization-, Sample-, Iterative evolution-, Final Selection-components, where it significantly improves the generalization ability of generated heuristics for Mixed-Integer Linear Programming (MILP) solving.
The framework employs an LLM-based MA-Evolution System, including Designer, Coder, Reviewer, and Judger agents, to generate and refine data-code pairs simultaneously through a debate cycle.
This co-evolutionary approach ensures mutual adaptation between instances and algorithms, leading to robust generalization and superior performance compared to human-designed and existing LLM-based methods.

PHYSGYM: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

PHYSGYM (Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors): introduces a novel benchmark suite and simulation platform for assessing LLM-based scientific reasoning, including an Environment (simulated physics problems), Interface (controls experiments and data), and Evaluator (assesses model performance), designed to systematically control task complexity and prior knowledge for interactive physics discovery.
The platform enables agents to actively probe environments, gather sequential data under constraints, and formulate hypotheses about underlying physical laws, providing fine-grained control over prior knowledge levels to dissect agent performance.
PHYSGYM offers standardized evaluation protocols and metrics for hypothesis accuracy and model fidelity, demonstrating its utility in differentiating LLM capabilities based on varying priors and task complexity.

HAMLET: Hyperadaptive Agent-based Modeling for Live Embodied Theatrics

HAMLET (Hyperadaptive Agent-based Modeling for Live Embodied Theatrics): introduces a multi-agent framework for AI drama, with Actor Designer (Generates character profiles), Plot Designer (Composes narrative draft/scenes/props), Reviewer (Evaluates character/plot rationality), Director (Integrates profiles, creates blueprint), Narrative Blueprint (Structured guide for performance), Planner (Designs/reviews multi-trajectory beats), Transfer (Monitors flag fulfillment, advances plot), Advancer (Ensures plot progression, directs actors), Actor (Performs narrative, makes decisions), Perceive And Decide (PAD) Module (Guides actor strategic decisions), Internal State (Actor's self-awareness, goals), External Stimulus (Environmental/contextual information), Tool Calling (Generates speech/actions), Narrator (Adjudicates interactions, updates environment), Critic (Evaluates drama performance quality), and Online Performance Environment (Dynamic, interactive theatrical setting), enabling autonomous and immersive interactive drama.
The framework operates in two stages: offline planning to generate a narrative blueprint from a simple topic, and online performance for dynamic, improvisational theatrical experiences.
It incorporates a comprehensive evaluation method, HAMLETJudge, to assess character performance, narrative quality, and interaction experience, achieving top-ranking results.

PhishIntentionLLM: Uncovering Phishing Website Intentions through Multi-Agent Retrieval-Augmented Generation

PhishIntentionLLM: introduces a multi-agent RAG framework that uncovers phishing intentions from website screenshots, employing a Vision Analysis Agent, Context Enrichment Agent, Primary Classification Agent, a Specialist Analysis Layer with dedicated expert agents, a Validation Agent, and a dual-layer Knowledge Base with a feedback loop.
The framework leverages LLMs' visual-language capabilities and a dual-layer knowledge architecture to provide scalable and interpretable intention-aware phishing analysis.
It significantly outperforms single-agent baselines and prior work in precision and recall for detecting credential theft, financial fraud, malware distribution, and personal information harvesting.

Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems

LLM Tool-Agent System Analysis Framework: introduces a comprehensive analysis of failed parameter filling in LLM tool-agent systems, utilizing a Failure Taxonomy Construction (Methodology component) and an Evaluation Process (Methodology component) to investigate "Butterfly Effects" in toolchains.
The paper systematically identifies five parameter failure patterns—Missing Information, Redundant Information, Hallucination Name, Task Deviation, and Specification Mismatch—and constructs a taxonomy using Grounded Theory.
It applies 15 input perturbation methods across user queries, tool documents, and tool returns to analyze their impact on LLM parameter behavior and proposes actionable improvements for tool agent reliability.

SPAR: Scholar Paper Retrieval with LLM-based Agents for Enhanced Academic Search

SPAR (Scholar Paper Retrieval): introduces a modular multi-agent framework for academic paper retrieval that leverages LLM-based agents and RefChain for enhanced search.
This framework performs fine-grained query understanding, multi-source retrieval, citation-driven knowledge expansion, and relevance-aware reranking to mirror human research exploration.
SPAR significantly outperforms strong baselines on AutoScholar and SPARBench, a new expert-annotated benchmark, demonstrating its robustness and generalization in complex academic search scenarios.

FaultLine: Automated Proof-of-Vulnerability Generation using LLM Agents

FAULTLINE: introduces an LLM agent workflow that automatically generates Proof-of-Vulnerability (PoV) test cases by tracing data flow, reasoning about control flow conditions, and iteratively refining tests based on execution feedback.
The framework leverages an LLM agent augmented with various tools to explore codebases, identify vulnerability sources and sinks, and derive precise input conditions required to trigger vulnerabilities.
The system's multi-stage reasoning process, including data flow analysis, control flow analysis, and a feedback-driven repair loop, enhances the LLM's ability to generate effective and accurate PoV tests across different programming languages.

Solving Formal Math Problems by Decomposition and Iterative Reflection

Delta Prover: introduces an agent-based framework that orchestrates a general-purpose LLM, Lean 4 Proof Environment, and Retrieval Model, with Reflective Decomposition, Iterative Proof Repair, Automatic Proof Consolidation, and a Domain-Specific Language (DSL), to solve formal math problems by iteratively refining proofs and decomposing complex theorems.
The framework leverages the LLM's inherent reasoning and reflection capabilities to interactively construct formal proofs in Lean 4, circumventing the need for model specialization or extensive fine-tuning.
It achieves state-of-the-art performance on the miniF2F-test benchmark by systematically tackling complex proofs, learning from mistakes, and producing machine-verifiable results.

EchoVoices: Preserving Generational Voices and Memories for Seniors and Children

EchoVoices: introduces an end-to-end digital human pipeline for seniors and children, with k-NN enhanced Whisper ASR model (speech recognition), LLM-driven Agent (persona distillation/response generation), Persona Card (user identity summary), RAG (memory retrieval), Memory Fragments (vector database), Age-adaptive VITS model (speech synthesis), Wav2Lip (lip synchronization), and GFPGAN (photorealistic face rendering), designed to create persistent digital personas by preserving unique voices and memories.
The system processes spoken queries from seniors or children, transcribes them using a k-NN augmented Whisper model, generates context-aware responses via an LLM-driven agent with a RAG-based memory, and synthesizes age-appropriate speech using a two-stage fine-tuned VITS model.
This framework aims to address the challenges of conventional ASR, TTS, and LLM systems with atypical speech patterns and interaction styles of seniors and children, enabling empathetic and effective intergenerational digital interactions.

PromptArmor: Simple yet Effective Prompt Injection Defenses

PromptArmor: introduces, "a simple yet effective defense against prompt injection attacks", with Guardrail LLM (off-the-shelf LLM), Prompting Strategy (carefully designed prompts), Detection (identifies injected prompts), Extraction (isolates malicious content), and Sanitization (removes injected prompts), where it functions as a guardrail layer to detect and remove malicious prompts from agent inputs before processing.
This defense leverages the text understanding and pattern recognition capabilities of an off-the-shelf LLM to analyze data samples and identify inconsistencies introduced by injected prompts.
PromptArmor operates as a standalone preprocessing component, ensuring minimal disruption to existing LLM-based systems and allowing the agent to complete its intended user task with sanitized data.

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

MoR (Mixture-of-Recursions): introduces a unified framework for LLMs that combines parameter sharing and adaptive computation, featuring a Recursive Transformer with Shared Stack of Layers/Recursion Block, a Router with Expert-choice routing/Token-choice routing, and a KV Caching Strategy with Recursion-wise KV caching/Recursive KV sharing.
This framework dynamically assigns token-level recursion depths via lightweight routers and selectively caches Key-Value pairs, focusing quadratic attention computation only on active tokens to improve memory access efficiency.
MoR establishes a new Pareto frontier for LLM efficiency, significantly lowering validation perplexity and improving few-shot accuracy while delivering higher throughput compared to existing baselines.

Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario

RMARL (Red-Team Multi-Agent Reinforcement Learning): introduces a framework where red-team agents, trained using a DC-GPPO algorithm with GCN and MLP, actively interfere with autonomous vehicles (AVs) in emergency braking scenarios, leveraging a CGMDP and PTZ model to generate high-risk corner cases.
The framework redefines background vehicles as red-team agents, enabling them to explore and uncover safety-critical scenarios beyond typical data distributions by maximizing AV collision rates while adhering to traffic regulations.
The PTZ model quantifies the threat posed by red-team vehicles, encouraging more extreme adversarial behaviors, and the DC-GPPO algorithm applies dual constraints to ensure realistic and disruptive interference.

The Constitutional Controller: Doubt-Calibrated Steering of Compliant Agents

CoCo (Constitutional Controller): introduces a novel framework for doubt-calibrated steering of compliant agents, integrating a Constitution (agent's structured knowledge base), a Doubt Model (neural self-doubt probability density), Probabilistic Inference, Plan & Control, and Online Compliance Validation.
The framework enhances agent safety and reliability by reasoning over deep probabilistic logic programs representing constraints and learning self-doubt from contextual features.
CoCo's adaptive behavior, demonstrated in UAV navigation, allows agents to account for external constraints and internal uncertainties, leading to compliant and crash-free operations.

The Emergence of Deep Reinforcement Learning for Path Planning

DQN (Deep Q-Network) Algorithm: is illustrated as a path planning model for marine search and rescue vessels, including an Environment, Actions, Estimation Q-network, Target Q-network, Reward Function, Experience Replay Memory, Gradients, Loss Function, and Update after N steps, designed to optimize navigation strategies.
This model enables autonomous agents to learn optimal navigation policies through interactive learning with the environment, aiming to maximize cumulative rewards for efficient search paths.
The architecture incorporates a target network for stable Q-value references and experience replay to decorrelate learning samples, enhancing the algorithm's stability and adaptability.

LaViPlan : Language-Guided Visual Path Planning with RLVR

LaViPlan (Language-Guided Visual Path Planning with Reinforcement Learning with Verifiable Rewards): introduces a framework that leverages Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Vision-Language Models (VLMs) for autonomous driving, addressing vision-language-action misalignment by integrating a policy model, a reference model, and verifiable rewards.
The framework operates in two phases: supervised fine-tuning of a VLM, followed by reinforcement fine-tuning where the policy model is optimized using Group Relative Policy Optimization (GRPO) with rewards based on output format adherence and trajectory accuracy.
This approach aims to steer VLMs toward context-aware decision-making consistent with situational reasoning, improving performance in out-of-distribution scenarios by explicitly optimizing planning-oriented metrics.

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization

ChatBattery: introduces an expert-guided LLM reasoning platform for battery materials discovery, featuring two phases (Exploration and Exploitation), eight sequential stages, and seven specialized agents, including LLM, Domain, Search, Decision, Retrieval, Rank, and Human Agents.
This framework integrates domain knowledge to steer LLMs towards effective reasoning, enabling the identification, synthesis, and characterization of novel lithium-ion battery cathode materials.
The platform's AI-driven approach significantly reduces the time for experimental screening and validation, demonstrating the transformative potential of LLM-augmented discovery pipelines.

Making REST APIs Agent-Ready: From OpenAPI to Model Context Protocol Servers for Tool-Augmented LLMs

AutoMCP (Automated Model Context Protocol Compiler): introduces a compiler that automates the generation of Model Context Protocol (MCP) servers from OpenAPI specifications, with components including Input Parsing and Dialect Resolution, Spec Normalization and Flattening, Authentication Analysis and .env Generation, Stub Generation and Handler Synthesis, and Output Layout and Transport Configuration.
The framework aims to streamline the integration of REST APIs into LLM workflows by transforming OpenAPI definitions into callable MCP tools, thereby reducing manual glue code and hardcoded prompts.
This approach addresses the engineering bottleneck of manually constructing MCP servers, enabling dynamic tool discovery and invocation for tool-augmented LLMs.

A Pilot Study on LLM-Based Agentic Translation from Android to iOS: Pitfalls and Insights

Multi-Agent Translation Pipeline: introduces an LLM-based agentic approach for mobile application translation from Android to iOS, with Specification Extraction, Code Translation, and Code Validation Agents, where it evaluates LLM performance, identifies key failure points, and proposes improvement guidelines.
The study evaluates the approach on five diverse Android projects, manually analyzing translated code for syntactic correctness, semantic accuracy, and functional completeness.
It identifies 10 types of translation failures across method, file, and package levels, underscoring challenges in platform-aware translation and the need for robust validation.

HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs

HyDRA (Hybrid-Driven Reasoning Architecture): introduces a framework for verifiable Knowledge Graph (KG) automation, integrating symbolic knowledge and neural networks, with components including stakeholder, persona, scope document, competency question, ontology, and KG generation modules, all guided by verifiable contracts.
The architecture operationalizes Design-by-Contract principles and the SymbolicAI framework, orchestrating an LLM-driven pipeline with a closed-loop verification and repair mechanism that enforces structural invariants and type consistency.
This approach aims to improve the reliability of automated KG construction by ensuring traceability from high-level requirements to low-level data and providing an evaluation framework for functional correctness.

Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor

HalMit: introduces a novel black-box watchdog framework that models the generalization bound of LLM-empowered agents to detect hallucinations, without requiring internal knowledge of the LLM's architecture, with Core Agent (coordinates interactions), Query Generation Agent (generates queries), Target LLM (LLM-powered agent), Evaluation Agent (evaluates responses), Vector Database (stores generalization bound points), Policy Network (adjusts fractal probabilities), Probabilistic Fractal Sampling (query generation method), Generalization Bound Exploration (identifies generalization bound), and Watchdog Monitor (monitors hallucinations).
The framework employs a multi-agent system, including a Core Agent, Query Generation Agents, a Target LLM, and an Evaluation Agent, to explore and identify the generalization bound.
It utilizes probabilistic fractal sampling guided by a Policy Network to efficiently generate queries and store identified boundary points in a Vector Database for real-time hallucination monitoring.

Advancing Responsible Innovation in Agentic AI: A study of Ethical Frameworks for Household Automation

IAIDF: introduces a comprehensive methodology for designing ethical and inclusive agentic AI in household automation, encompassing ethical foundation, co-design with vulnerable groups, user agency and privacy control, socio-technical bias reduction, household dynamics simulation, ethical data mining, and real-world deployment and societal impact.
This framework emphasizes integrating ethical considerations from the initial design phase, ensuring user participation, maintaining user control over data and AI actions, and mitigating biases through socio-technical approaches.
The methodology utilizes simulations and real-world deployments to validate AI systems, aiming to foster trust, enhance usability, and ensure equitable outcomes for diverse users in smart home environments.

AGENTIC AI FOR AUTONOMOUS ANOMALY MANAGEMENT IN COMPLEX SYSTEMS

Agentic AI (AI agent augmented with large language models, diverse tools, and knowledge-based systems): introduces an autonomous anomaly management framework for complex systems, integrating an AI Agent (core autonomous entity), LLMs (cognitive core for reasoning), Tools (diverse specialized utilities), Knowledge-based Systems (stores domain-specific information), Memory Systems (retains context and knowledge), and an LLM-as-a-judge module (evaluates tool use), to continuously analyze, learn, and respond to abnormal behaviors.
This framework aims to overcome limitations of human-dependent anomaly management by enabling autonomous decision-making, contextual understanding, and real-time adaptation to evolving conditions.
The system synthesizes insights across disciplines, detects subtle patterns, and adapts strategies using both implicit and explicit knowledge, enhancing system resilience and adaptability.

20th July 2025

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

WebShaper (Formalization-Driven IS Data Synthesis Framework): introduces a novel formalization-driven framework for synthesizing high-quality information-seeking (IS) training data, leveraging set-theoretic constructs and an agentic Expander module for systematic task generation and expansion.
This framework addresses data scarcity and inconsistency in IS agent development by formalizing tasks as Knowledge Projections, enabling precise control over reasoning structures and complexity.
WebShaper's approach, including seed task construction, agentic expansion with specialized tools, and robust training methodologies, yields state-of-the-art performance for open-sourced IS agents on benchmarks like GAIA and WebWalkerQA.

LibLMFuzz: LLM-Augmented Fuzz Target Generation for Black-box Libraries

LibLMFuzz: introduces a framework that pairs an agentic LLM with a lightweight toolchain (disassembler, compiler, fuzzer) within a sandbox, orchestrated by middleware.
This system autonomously analyzes stripped binaries, plans fuzz strategies, generates drivers, and iteratively self-repairs build or runtime errors for black-box libraries.
The framework significantly reduces costs associated with fuzzing closed-source libraries by achieving 100% API coverage with no human intervention.

EduThink4AI: Translating Educational Critical Thinking into Multi-Agent LLM Systems

EDU-Prompting: introduces a novel multi-agent framework for translating educational critical thinking into LLM systems, with Agent I (brainstorms initial answers), Agent II (validates answer existence), Agent III (critiques raw answers), Agent IV (synthesizes final answer), User Prompt Generator (collects user input), Stage Classifier (classifies learning stage), Vocabulary Module (processes vocabulary), Vocab Fetcher (identifies vocabulary terms), WordNet (enriches vocabulary data), Vocab Explainer (generates vocabulary explanations), Writing Assessor (evaluates writing content), Topic Module (analyzes user topics), Topic Identifier (identifies primary topics), Prompt Generator (creates topic prompts), Prompt Aggregator (synthesizes aggregated prompts), Reasoning Module (orchestrates critical thinking), and Final Response Generator (generates comprehensive response).
The framework significantly enhances content truthfulness and logical soundness in AI-generated educational responses by fostering diverse perspectives and analytical reasoning.
Its modular design allows seamless integration into existing educational applications, enabling practitioners to incorporate critical thinking catalysts and multiple perspectives without extensive system modifications.

LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading

LLM-MARL: introduces an integrated framework for real-time P2P energy trading, with LLM Expert Workflow (Generates expert strategies), MARL (Learns optimal policies), and P2P Energy Trading Environment (Simulates energy market) components, designed to bridge expert knowledge with agent learning for efficient energy market decision-making.
The framework replaces human experts with LLMs to guide MARL agents through imitation learning, significantly reducing manual intervention costs and integrating expert knowledge.
It employs a novel multi-agent imitation learning algorithm with a Wasserstein metric and a differential multi-head attention-based Critic network to enhance policy evaluation and accelerate learning.

Byzantine-Robust Decentralized Coordination of LLM Agents

DecentLLMs: introduces a decentralized consensus approach for multi-agent LLM systems, including worker agents (generate answers in parallel), evaluator agents (score answers/aggregate scores/select best answer/record on blockchain/reply to user), Geometric Median (GM) Algorithm (Byzantine-robust score aggregation), Blockchain (record transactions/auditable records), and Byzantine Reliable Broadcast Protocols (ensure consistent message delivery), designed to overcome limitations of leader-driven coordination.
This framework enables faster consensus and consistently selects higher-quality answers by allowing worker agents to generate responses concurrently and evaluator agents to independently score and rank them using Byzantine-robust aggregation techniques.
The system effectively tolerates Byzantine agents and significantly improves the quality of selected answers compared to traditional leader-based quorum voting methods.

Redefining Elderly Care with Agentic AI: Challenges and Opportunities

Agentic AI (Agentic Artificial Intelligence): introduces a comprehensive review of LLM-powered Agentic AI's transformative potential in elderly care, covering its applications, challenges, and ethical considerations for personalized, autonomous support.
The paper details Agentic AI's applications in personalized health management, cognitive support, emotional companionship, and enabling independence and inclusivity for older adults.
It also critically examines associated challenges, including data privacy, reliability, and integration issues, proposing a human-centered framework for responsible and equitable deployment.

INSIGHTX AGENT: An LMM-based Agentic Framework with Integrated Tools for Reliable X-ray NDT Analysis

INSIGHTX AGENT: introduces an LLM-based agentic framework for reliable X-ray NDT analysis, with LMM Agent Core (orchestrates process), Large Language Model (reasoning, intent recognition), Lora Layer (domain adaptation), Image Encoder (visual feature processing), Tokenizer (text input processing), Sparse Deformable Multi-Scale Detector (SDMSD) (defect localization), CNN Backbone (extracts multi-scale features), Proposal Generation (generates, refines proposals), Deformable Attention Mechanisms (refines sparse proposals), Evidence-Grounded Reflection (EGR) Tool (validates, refines proposals), Context Assessment (evaluates image characteristics), Individual Defect Analysis (evaluates each proposal), False Positive Elimination (applies rejection criteria), Confidence Recalibration (adjusts confidence scores), and Quality Assurance (verifies output consistency).
The framework positions an LLM as a central orchestrator, coordinating specialized tools like SDMSD for defect detection and EGR for reflective validation, moving beyond passive data processing to active reasoning.
This approach enhances diagnostic reliability, interpretability, and interactivity in X-ray NDT by integrating high-precision detection with structured, evidence-grounded reasoning and self-assessment.

Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree

Browser Gym Agent: introduces a system vulnerable to Indirect Prompt Injection (IPI) attacks, demonstrating how a malicious actor can manipulate its web navigation behavior by embedding adversarial triggers in webpage HTML.
The system leverages the Greedy Coordinate Gradient (GCG) algorithm to optimize universal adversarial triggers, which are then inserted into the HTML accessibility tree parsed by the LLM.
This research highlights critical security risks, including login credential exfiltration and forced ad clicks, emphasizing the urgent need for stronger defenses in LLM-driven autonomous web agents.

STL-GO: Spatio-Temporal Logic with Graph Operators for Distributed Systems with Multiple Network Topologies

STL-GO (Spatio-Temporal Logic with Graph Operators): introduces a novel logic for specifying and verifying complex multi-agent system requirements, featuring an outer logic (system-wide reasoning), an inner logic (agent-specific reasoning), and graph operators (quantifies agent interactions) represented by a graph operator tree (operator relation representation).
This framework extends signal temporal logic by incorporating graph operators to quantitatively reason over multiple asymmetric network topologies, enabling distributed monitoring.
The distributed monitoring algorithm allows individual agents to determine specification satisfaction using only local information, demonstrated in bike-sharing and multi-drone case studies.

FROM KICKING TO CAUSALITY: SIMULATING INFANT AGENCY DETECTION WITH A ROBUST INTRINSIC REWARD

CAIS (Causal Action Influence Score): introduces a novel, model-based intrinsic reward for robust agency detection in noisy environments, utilizing a MIMo-Mobile Environment, an Embodied Agent (MIMo) with a Visual Encoder and Agent Architecture, driven by a Reinforcement Learning Framework with an Expected SARSA Algorithm, and a Reward Module that calculates CAIS via Quantile Regression and Wasserstein Distance, alongside a Surprise Signal, Mobile Trajectory Length, and Representation Trajectory Length, all optimized by AdamW Optimizer.
The paper demonstrates that CAIS enables the agent to distinguish self-generated effects from environmental noise, leading to a robust sense of agency that generalizes to unpredictable scenarios.
The framework also successfully reproduces the "extinction burst" phenomenon by augmenting CAIS with a surprise signal, highlighting the psychological plausibility of the causal inference approach.

Search-Based Autonomous Vehicle Motion Planning Using Game Theory

N-MP (Nash Motion Planner): introduces a search-based interactive motion planning scheme for autonomous vehicles, incorporating Dynamic Equation Derivation, Objective Function Formulation, Nash Equilibrium Identification, and Ego-AV Speed Modification.
This novel approach models other road users as intelligent agents within a game-theoretic framework, generating realistic and safer paths for autonomous vehicles.
The framework demonstrates low computational time and adaptability to various vehicle dynamics and road users, making it suitable for complex traffic scenarios and real-time applications.

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

SE 3.0 (Agentic Software Engineering): introduces AIDev, a large-scale dataset, to empirically study how Autonomous Coding Agents (AI teammates), Human Developers (human collaborators), Review Bots (automated code reviewers), GitHub Repositories (software project hosts), and Pull Requests (code change proposals) are reshaping software engineering.
The paper analyzes 456,535 Agentic PRs from five leading LLM-powered agents, revealing their contributions, acceptance rates, and review dynamics compared to human-authored PRs.
Key findings highlight agents' speed in code submission, lower PR acceptance rates for complex tasks, and the increasing role of review bots, underscoring the need for new SE methodologies.

AgentFly: Extensible and Scalable Reinforcement Learning for LM Agents

AgentFly: introduces a scalable and extensible Agent-RL framework, with an Agent Module (manages agent workflow) and an RL Training Module (executes reinforcement learning), designed to empower LM agents with diverse RL algorithms.
The framework supports multi-turn interactions by adapting traditional RL methods with token-level masking and features a decorator-based interface for defining tools and reward functions.
It implements asynchronous execution of tool calls and reward computations, alongside a centralized resource management system, to support high-throughput training and scalable environment coordination.

HMARL-CBF – Hierarchical Multi-Agent Reinforcement Learning with Control Barrier Functions for Safety-Critical Autonomous Systems

HMARL-CBF: introduces a novel hierarchical multi-agent reinforcement learning approach, with High-Level Policy (learns joint cooperative behavior), Low-Level Policy (learns safe individual behavior), CBF-Based Policy (executes skills safely), High-Level Policy Network (implements high-level policy), Low-Level Policy Parameter Network (implements low-level policy), Skills (predefined safety-constrained actions), Control Barrier Functions (enforce pointwise safety), Control Lyapunov Functions (guide skill execution), Extrinsic Trajectory Return (optimizes joint performance), and Intrinsic Trajectory Return (learns individual skills), designed for safe policy learning in multi-agent safety-critical autonomous systems by decomposing the problem into two levels.
The framework ensures safety guarantees during both training and real-world deployment by integrating Control Barrier Functions for pointwise-in-time safety constraints and utilizing a skill-based hierarchical structure.
The approach validates its effectiveness on challenging multi-agent traffic scenarios, demonstrating superior safety compliance and improved performance compared to existing methods.

Can Mental Imagery Improve the Thinking Capabilities of AI Systems?

The Machine Thinking Framework: introduces a comprehensive framework integrating mental imagery to enhance AI thinking capabilities, featuring a Cognitive Thinking Unit, Needs Unit, Input Data Unit, and Mental Imagery Unit.
This framework enables AI systems to reason, plan, and infer decisions autonomously by processing sensory inputs and internally generated representations.
It addresses limitations of current AI models by simulating human-like cognitive processes, bridging perception, reasoning, and imagination.

StaAgent: An Agentic Framework for Testing Static Analyzers

STAAGENT (An Agentic Framework for Testing Static Analyzers): introduces an LLM-driven agentic framework for systematically evaluating static analyzer rules, including a Seed Generation Agent (generates bug-inducing programs), a Code Validation Agent (validates seeds, generates tests), a Mutation Generation Agent (creates semantically equivalent mutants), and an Analyzer Evaluation Agent (compares analyzer behavior).
The framework leverages LLMs to synthesize, mutate, and validate code snippets, performing metamorphic testing to uncover inconsistencies in static analyzer rule implementations.
This approach offers a scalable and adaptable solution to improve the reliability of static analyzers by identifying flaws in rule implementations through inconsistent behaviors.

Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture

RBAMA (Reason-Based Artificial Moral Agent): introduces an extended reinforcement learning architecture that integrates an ethics module with a reasoning unit to enable moral decision-making based on normative reasons and iterative refinement through case-based feedback from a moral judge.
The framework includes a reasoning unit operating on a learned reason-theory, moral policies for fulfilling moral goals, moral filters for enforcing moral constraints, and an instrumental policy for task achievement.
This modular design ensures behavioral conformity to inferred moral obligations, enhances moral trustworthiness and robustness, and allows for moral justification of the agent's actions.

Active Probing with Multimodal Predictions for Motion Planning

APMP (Active Probing with Multimodal Predictions): introduces a unified framework that combines trajectory planning, multimodal predictions, and active probing to enhance decision-making under uncertainty, integrating utility maximization, safety assessment, and information maximization.
The framework develops a novel risk metric that seamlessly integrates multimodal prediction uncertainties through mixture models, proving analytical tractability with a closed-form solution.
It incorporates an active probing mechanism to strategically select actions for improving estimates of other agents' behavioral parameters, demonstrating robust performance in complex traffic scenarios.

19th July 2025

Towards AI Urban Planner in the Age of GenAI, LLMs, and Agentic AI

AI Urban Planner: introduces a conceptual framework for automated urban planning, integrating a Generative Urban Planning Framework with Representation and Generation stages, LLMs, Agentic AI, Digital Twins, and a Human-Machine Co-design Interface.
The framework aims to synthesize optimal land-use configurations by encoding diverse urban contexts into structured embeddings and generating plans conditioned on geospatial, social, and human-centric constraints.
This approach seeks to augment human expertise, democratize planning insights, and enable adaptive, customizable urban design solutions.

NEO: A CONFIGURABLE MULTI-AGENT FRAMEWORK FOR SCALABLE AND REALISTIC TESTING OF LLM-BASED AGENTS

Neo: introduces a configurable multi-agent framework for scalable and realistic testing of LLM-based agents, including a Question Agent (generates test inputs), an Evaluation Agent (assesses target agent output), and a Context Hub (stores test context/history) to simulate human-like conversations and evaluate LLM systems.
The framework leverages a probabilistic state model to control dialogue flow, emotional tone, and topical intent, enabling dynamic variation across multi-turn test cases and uncovering edge cases.
Neo's architecture supports both pre-deployment testing and post-launch monitoring, aiming for self-evolution through memory-driven refinement and continuous improvement of LLM testing.

Agentic Satellite-Augmented Low-Altitude Economy and Terrestrial Networks: A Survey on Generative Approaches

Agentic AI: introduces a survey on how Agentic AI, empowered by Generative AI (GAI) models (Variational Autoencoders, Generative Adversarial Networks, Generative Diffusion Models, Transformer-Based Models) and LLMs, enhances perception, reasoning, and action capabilities within Satellite-Augmented Low-Altitude Economy and Terrestrial Networks (SLAETNs).
This approach addresses challenges in SLAETNs by enabling autonomous decision-making, resource-constrained sensing, and secure cross-domain coordination.
The survey provides a model-driven foundation, comparative analysis, and future directions for building scalable, adaptive, and trustworthy generative agents in integrated networks.

AMICO: AN EVENT-DRIVEN MODULAR FRAMEWORK FOR PERSISTENT AND EMBEDDED AUTONOMY

AMICO (An Event-Driven Modular Framework for Persistent and Embedded Autonomy): introduces an event-driven, modular agent framework designed for persistent and embedded autonomy, featuring distinct layers (Environment, Interaction, AI Agent, Engine) and core components like Event Generator, Action Selector, and integrated memory systems.
Implemented in Rust for performance and safety, AMICO supports reactive agents operating across embedded systems and browser environments via WebAssembly (WASM), enabling robust and efficient real-world deployment.
The framework provides clear abstractions for event processing, agent state management, behavior execution, and LLM-based reasoning integration, facilitating resilient, interactive, and persistent agent behavior under resource constraints.

Routine: A Structural Planning Framework for LLM Agent System in Enterprise

Routine (A Structural Planning Framework for LLM Agent System in Enterprise): introduces a multi-step agent planning framework with Planning Module (generates step-by-step plan), Execution Module (follows plan, generates tool call instructions), Tool Module (receives instructions, returns execution results), and Memory Module (stores context), where it provides a clear structure, explicit instructions, and seamless parameter passing to guide an agent's execution module in performing multi-step tool-calling tasks with high stability.
The framework significantly increases execution accuracy in model tool calls, improving performance of LLMs like GPT-4o and Qwen3-14B in real-world enterprise scenarios.
Routine also enables the distillation of domain-specific tool-usage patterns and enhances model adaptability to new scenarios, accelerating the deployment and adoption of agent systems.

When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems

Self-Evolving Multi-Agent Collusion Framework: introduces a novel simulation framework for studying multi-agent collusion, incorporating components for agent coordination, behavior evolution, and platform-level intervention, where it simulates and analyzes how malicious agents coordinate and adapt in high-stakes environments like misinformation and e-commerce fraud.
The framework, built on the OASIS social simulator, demonstrates that decentralized malicious multi-agent systems are more effective and adaptive in spreading harm than centralized ones, even against traditional interventions.
It provides insights into malicious group operations and highlights the need for dynamic detection systems and countermeasures against evolving collusive behaviors.

LEARNING TO COMMUNICATE IN MULTI-AGENT REINFORCEMENT LEARNING FOR AUTONOMOUS CYBER DEFENCE

DIAL (Differentiable Inter-Agent Learning): introduces a multi-agent reinforcement learning framework for autonomous cyber defense, featuring blue agents with C-Nets that learn to communicate and take defensive actions within the CybORG simulation environment.
The framework enables blue agents to develop tactical policies akin to human experts, learning minimal cost communication messages while defending against cyber threats in various network configurations.
DIAL's approach, including Strategic Action Unmasking, allows agents to coordinate effectively and outperform agents requiring global state information, demonstrating practical applicability in enterprise network simulations.

18th July 2025

DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration

DPMT (Dual Process Multi-scale Theory of Mind Framework): introduces a novel framework for real-time human-AI collaboration, featuring an Information Extractor, a Fast System for intuitive decision-making, a Slow System with a multi-scale ToM module for cognitive reasoning, an Action Decoding Module, and a Memory component.
The framework leverages a dual-process approach, where the Fast System handles immediate macro-action decisions using a smaller LLM, while the Slow System, powered by LLMs, performs deeper, multi-scale ToM reasoning to model human partners' domain knowledge, cognitive style, and intentions.
This hierarchical design enables efficient human-AI collaboration by integrating quick decision-making with robust human partner modeling, enhancing adaptability and interpretability in complex, dynamic scenarios.

CodeEdu: A Multi-Agent Collaborative Platform for Personalized Coding Education

CodeEdu: introduces a multi-agent collaborative platform for personalized coding education, leveraging its Tool Pool (external utilities), Agent Pool (specialized LLM agents), and Task Pool (standard task types) to dynamically allocate agents and tasks for proactive and personalized learning.
The platform's workflow encompasses Personalized Material Generation, Real-Time Q&A, Step-by-step Code Tutoring with Debugging, and Learning Report Generation, facilitated by dynamic agent and task allocation.
Automated evaluations demonstrate CodeEdu's efficacy in substantially enhancing students' coding performance and providing high-quality learning materials compared to baseline LLM tutors.

AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework

AGENTS-LLM (Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework): introduces an LLM-agent based framework for augmenting real-world traffic scenarios using natural language descriptions, featuring a Scenario Modifier Agent, a Toolbox, and an optional Quality Assurance loop with Text QA and Visual QA agents.
This framework addresses the limitations of manual scenario augmentation by domain experts, enabling scalable generation of challenging and safety-critical driving scenarios.
The agentic design provides fine-grained control over the output and allows smaller, cost-effective LLMs to achieve performance comparable to larger models.

COGNIQ-H: A SOFT HIERARCHICAL REINFORCEMENT LEARNING PARADIGM FOR AUTOMATED DATA PREPARATION

CogniQ-H: introduces a soft hierarchical reinforcement learning paradigm for automated data preparation, synergistically fusing a Large Language Model (LLM) as a high-level planner, a Learning-to-Rank (LTR) model for immediate quality scores, and an RL Q-model for long-term value estimates, integrated by a synergistic policy layer.
This framework addresses the combinatorial search space of data preparation by providing probabilistic, LLM-driven strategic guidance, avoiding the rigid commitments of traditional hard hierarchical reinforcement learning.
The framework balances pre-existing knowledge, supervised signals, and adaptive learning to achieve robust and efficient pipeline discovery, outperforming state-of-the-art RL-based methods in pipeline quality and convergence speed.

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

CUDA-L1 (Improving CUDA Optimization via Contrastive Reinforcement Learning): introduces an automated reinforcement learning framework for CUDA optimization, which leverages a three-stage pipeline including Supervised Fine-tuning, Self-supervised Learning, and Contrastive Reinforcement Learning to enhance optimization by distinguishing between effective and ineffective CUDA strategies through comparative analysis of generated variants and their execution performance.
The framework achieves significant speedups (average 17.7x, peak 449x on NVIDIA A100) across 250 KernelBench CUDA kernels and demonstrates strong portability across various GPU architectures.
CUDA-L1 autonomously discovers diverse optimization techniques, identifies optimal combinations, uncovers fundamental principles, and pinpoints hidden bottlenecks without human expertise, showcasing RL's potential in complex code optimization.

The Emotion-Memory Link: Do Memorability Annotations Matter for Intelligent Systems?

Conceptual Model of Emotion-Memory Link: introduces a framework investigating the relationship between perceived group emotions and group memorability in conversational interactions, including components like cognitive appraisal, experienced emotion, physiological reaction, behavior, observer annotation, memory encoding, and accessible memories.
The paper empirically examines if third-party affect annotations, commonly used in Affective Computing, reliably capture memory-relevant information in dynamic group settings.
The study concludes that the observed relationship between group affect and memorability annotations is not significantly different from random chance, questioning the utility of third-party affect annotations as proxies for conversational memorability.

Photonic Fabric Platform for AI Accelerators

PFA (Photonic Fabric Appliance): introduces a photonic-enabled switch and memory subsystem for AI accelerators, integrating Photonic Fabric Modules (PFM) with photonic and electronic components, and external DDR5 memory, to overcome memory bottlenecks and scale AI workloads.
The system provides up to 32 TB of shared memory and 115 Tbps of all-to-all digital switching, enabling more efficient distributed AI training and inference.
Evaluated using the CelestiSim simulator, PFA demonstrates significant throughput and latency improvements for LLM inference and substantial energy savings for LLM training compared to conventional GPU-based systems.

NetIntent: Leveraging Large Language Models for End-to-End Intent-Based SDN Automation

NetIntent: introduces a unified and adaptable framework that leverages LLMs and non-LLM agents to automate the entire Intent-Based Networking (IBN) lifecycle, from high-level user intents to low-level Software-Defined Networking (SDN) configurations.
The framework orchestrates LLMs for intent translation, conflict detection, and corrective actions, while non-LLM agents handle validation, resolution, deployment, and assurance tasks.
NetIntent supports dynamic re-prompting and contextual feedback, enabling robust execution of user-defined intents with minimal human intervention across OpenDaylight (ODL) and Open Network Operating System (ONOS) SDN controllers.

WebGuard: Building a Generalizable Guardrail for Web Agents

WebGuard: introduces a generalizable guardrail system for web agents, comprising the WebGuard Dataset (human-annotated action risk levels), a three-tier Risk Schema (action risk categorization), and a Guardrail Model (LLM, predicts action risk level) that processes Observation Space (webpage state input) and Action Space (proposed web agent action), integrated with Human-in-the-loop Control (user intervention mechanism) and Annotation Tools (dataset creation and labeling).
The system addresses the urgent need for effective safety measures for LLM-powered web agents by predicting the outcome of state-changing actions using a dataset of 4,939 human-annotated actions across diverse websites and domains.
Evaluations reveal that frontier LLMs struggle with action outcome prediction and high-risk recall, emphasizing the necessity of dedicated safeguards and specialized fine-tuning for reliable web agent deployment.

DREAMS: Density Functional Theory Based Research Engine for Agentic Materials Simulation

DREAMS (Density Functional Theory Based Research Engine for Agentic Materials Screening): introduces a hierarchical, multi-agent framework for DFT simulation, featuring a Supervisor LLM Agent (Generates/Updates Plans/Assigns Tasks), DFT LLM Agent (Manages DFT Calculations/Structure Generation/Parameter Optimization/Output Analysis), Convergence LLM Agent (Suggests Fixes/Resolves Convergence Issues), and HPC LLM Agent (Allocates Resources/Submits/Monitors Jobs), all interacting via a Canvas (Shared Information Dashboard/Context Preservation) and utilizing an HPC Cluster (High-Performance Computing Environment).
This framework automates high-fidelity Density Functional Theory simulations, addressing challenges like parameter fine-tuning and systematic error handling, thereby reducing human intervention.
DREAMS achieves L3-level automation in materials discovery, demonstrating expert-level accuracy in lattice constant calculations and complex problem-solving for adsorption puzzles.

ADAPTIVE MULTI-AGENT REASONING VIA AUTOMATED WORKFLOW GENERATION

Nexus Architect: introduces an enhanced multi-agent system framework that autonomously generates and refines reasoning workflows from user prompts and examples, integrating User Prompt, Examples, Nexus Documentation, Task Decomposition & Planning, Reasoning Workflow Design, Supervisor Builder, Agent Builder, Tool Builder, Workflow Validation & Testing, Performance Assessment, Feedback, Iterative Prompt Refinement (IPR), Prompt Engineering, Validated Reasoning Graph, and Nexus Runtime Environment.
This framework systematically decomposes complex inferential reasoning tasks, instantiates multi-agent architectures, and iteratively tunes agent system prompts to maximize performance and improve generalization capabilities using standard, non-reasoning LLMs.
The framework leverages a feedback-driven prompt engineering mechanism to achieve automated reasoning, enabling robust and generalizable problem-solving without requiring specialized LLM training or fine-tuning.

COGNIQ-H: A SOFT HIERARCHICAL REINFORCEMENT LEARNING PARADIGM FOR AUTOMATED DATA PREPARATION

CogniQ-H: introduces a soft hierarchical reinforcement learning paradigm for automated data preparation, with its Macro-Stage Layer (high-level planning) where an LLM (strategic prior generation) generates a strategic prior, a Micro-Stage Layer (evidence provision) where an LTR Model (immediate quality scoring) provides immediate quality scores and an RL Q-Model (long-term reward estimation) estimates long-term rewards, and a Synergistic Policy Layer (action integration) where a Synergistic Policy (final action selection) integrates these signals for action selection.
This framework formulates action selection as a Bayesian inference problem, allowing it to balance high-level strategic guidance from the LLM with adaptive, evidence-based decision-making from the LTR model and RL Q-model.
The approach achieves improved pipeline quality and faster convergence by avoiding the rigid commitments of traditional hard hierarchical RL, enabling robust and efficient discovery of optimal data preparation pipelines.

17th July 2025

A Survey of Context Engineering for Large Language Models

Context Engineering: introduces a formal discipline for systematic optimization of information payloads for LLMs, with foundational components for context retrieval, processing, and management, and system implementations including RAG, memory systems, tool-integrated reasoning, and multi-agent systems.
The paper provides a comprehensive taxonomy classifying techniques into foundational components for context generation, processing, and management, and sophisticated system implementations for real-world applications.
The survey identifies a critical research gap where LLMs excel at understanding complex contexts but show limitations in generating equally sophisticated, long-form outputs, highlighting a key priority for future research.

Change of Thought: Adaptive Test-Time Computation

SELF-Transformer: introduces a novel architecture that augments self-attention with Fixed-Point Iteration (FPI) to enable latent alignment refinement, where it iteratively updates attention weights to a fixed point, scaling test-time computation with input difficulty.
This framework achieves deeper contextual reasoning without additional parameters by leveraging FPI universally across all layers, improving latent representations without token-level autoregression.
The approach employs dynamic parameter reuse and implicit differentiation for efficient gradient computation, ensuring scalability and stability while adapting to input complexity.

Prompt Injection 2.0: Hybrid AI Threats

Layered Defense Architecture: introduces a robust defense against hybrid AI threats, combining Preamble's trusted/untrusted classification, CaMeL's architectural isolation, Spotlighting, and traditional controls.
This architecture addresses prompt injection attacks by distinguishing trusted instructions from untrusted inputs, isolating control and data flows, explicitly marking untrusted content, and leveraging existing security measures.
The paper details how these components work together to provide a scalable and comprehensive defense posture for LLM-integrated systems in complex, real-world environments.

SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

SE-VLN (Self-Evolving Vision-Language Navigation Framework): introduces a training-free VLN framework driven by MLLMs, encompassing a hierarchical memory module, a retrieval-augmented thought-based reasoning module, and a reflection module, where it endows VLN agents with the ability to continuously evolve during testing by simulating natural agent evolution processes.
The hierarchical memory module, comprising an experience repository and a verbal topological map, enables the agent to retrieve contextual memory and past similar experiences, crucial for enhancing navigation performance.
The framework's reflection module, with its outcome evaluator and experience corrector, facilitates continuous learning by analyzing task evaluation results and updating the experience repository with corrected decisions, promoting self-evolution.

RIDAS: A Multi-Agent Framework for AI-RAN with Representation- and Intention-Driven Agents

RIDAS: introduces a multi-agent framework for AI-RAN that unifies low-level representation control with high-level intent interpretation via its RDA and IDA components, where RDAs encode messages and control quality/rate, and IDA maps user intents to RDA configurations and manages resource allocation using an LLM, memory, and a two-stage planning pathway.
The framework addresses the gap between high-level user intents and low-level, parameterized configurations required for optimal AI-RAN performance by enabling efficient bandwidth allocation and QoS satisfaction.
RIDAS dynamically adjusts control parameters based on network conditions and user QoS requirements, achieving near-optimal performance in transmission rate and task performance demands.

Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication

IVS (Intelligent Virtual Sonographer): introduces a dual-LLM-driven embodied conversational agent that facilitates real-time, multidirectional communication between physicians, a robotic ultrasound system, and patients in an Extended Reality environment.
The system enhances efficiency, clarity, and accessibility of robotic ultrasound acquisition by translating physician commands into robotic actions and relaying system updates and empathetic explanations to patients.
It leverages two independent LLM instances for parallel physician- and patient-facing dialogues, integrating speech-to-text, text-to-speech, and robotic control for seamless interaction.

MAD-SPEAR: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems

MAD-SPEAR: introduces a conformity-driven prompt injection attack, with Attacker, Injected Data, Targeted Agents, Sybil Agent Simulation, Conformity Exploitation, Misinformation Propagation, Confidence Level Manipulation, Output Format Replication, and Communication Attack Integration, designed to compromise Multi-Agent Debate (MAD) systems by manipulating a small subset of LLM agents to propagate misinformation and degrade consensus quality.
The attack exploits LLMs' inherent conformity tendencies and can be combined with communication attacks to amplify its impact, significantly impairing task-solving accuracy and scalability.
The paper also proposes a formal definition of MAD fault-tolerance and a comprehensive evaluation framework, highlighting the urgent need for improved security in MAD system designs.

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

MCPEval (Model Context Protocol-based framework): introduces an open-source framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains, including MCP Server, MCP Client (Agent), Task-LLM, LLM Judger, Tool Call Evaluation, Ground Truth Trajectory, and Auto Report Generation.
The framework standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines, providing actionable feedback for optimizing LLM agent implementations.
MCPEval's automated workflow includes task generation, verification, and model evaluation, leveraging synthetic data and iterative refinement to ensure high-quality tasks and comprehensive analysis of agent behavior.

A Systematic Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

Unified Taxonomy for EHR Modeling: introduces a comprehensive survey of Electronic Health Record (EHR) modeling, categorizing methods across data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems, where it provides a structured roadmap for advancing AI-driven EHR modeling and clinical decision support.
This survey systematically organizes recent advancements in deep learning and LLMs for EHRs, highlighting emerging trends like foundation models and LLM-driven clinical agents.
It discusses open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings, aiming to promote reproducibility and accessibility for new researchers.

Humans learn to prefer trustworthy AI over human partners

Partner Selection Game: introduces a communication-based partner selection game in a triadic setting where human selectors choose between human and LLM-powered bot candidates, examining partner selection dynamics and human adaptation under AI competition.
The framework utilizes LLMs (specifically OpenAI's GPT-4o) to simulate bot candidates, and employs computational models like the Rescorla-Wagner algorithm to analyze human selectors' belief updating and decision-making.
The study investigates the impact of identity transparency on partner selection, showing how it influences human learning about bot and human behavior and affects competitive outcomes in hybrid human-AI societies.

GraphTrafficGPT: Enhancing Traffic Management through Graph-Based AI Agent Coordination

GraphTrafficGPT: introduces a novel graph-based architecture that fundamentally redesigns task coordination for LLM-driven traffic applications, utilizing an Input Processing Module (decomposes user queries), Dependency Graph Generator (builds task graph), Brain Agent (central task coordinator), Specialized Agents (domain-specific task handlers), Multi-Agent Communication Protocol (MCP) (agent communication, synchronization), Tool Box (traffic foundation models), and Response Integration Module (combines agent outputs) to enable efficient parallel execution and dynamic resource allocation.
The system represents tasks and their dependencies as nodes and edges in a directed graph, allowing for concurrent multi-query processing and significant reductions in token consumption and response latency compared to chain-based approaches.
This architecture enhances scalability and efficiency for complex, real-world traffic management scenarios by orchestrating a network of specialized agents for data retrieval, analysis, visualization, and simulation.

Apple Intelligence Foundation Language Models Tech Report 2025

AFM (Apple Foundation Models): introduces two multilingual, multimodal foundation language models, an On-Device Model (compact LLM) and a Server Model (scalable LLM), detailing their architecture including KV Cache Sharing (on-device memory optimization) and Parallel Track Mixture-of-Experts (PT-MoE) (server sparse architecture), multimodal capabilities via a Vision Encoder (visual feature extraction), training methodologies like Supervised Fine-Tuning (SFT) (model refinement) and Reinforcement Learning from Human Feedback (RLHF) (alignment training), inference optimizations such as Quantization Aware Training (QAT) (on-device compression), Adaptive Scalable Texture Compression (ASTC) (server compression), and Low-Rank Adaptation (LoRA) Adapters (quality recovery), all integrated within a Foundation Models Framework (developer access) offering Guided Generation (constrained output), Tool Calling (external tool integration), and LanguageModelSession (context management), while adhering to Responsible AI principles (ethical guidelines).
The paper highlights architectural innovations like PT-MoE and KV-cache sharing for efficiency, alongside comprehensive data pipelines and advanced fine-tuning techniques to enhance model capabilities and privacy.
The models support multilingual and multimodal inputs, improve tool-use and reasoning, and are accessible to developers via a Swift-centric framework for integrating generative AI features into Apple applications.

Change of Thought: Adaptive Test-Time Computation

SELF-Transformer: introduces a novel architecture that augments self-attention with Fixed-Point Iteration (FPI) to enable latent alignment refinement, where it iteratively updates attention weights to a fixed point, scaling test-time computation with input difficulty.
This framework achieves deeper contextual reasoning without additional parameters by leveraging FPI universally across all layers, improving latent representations without token-level autoregression.
The approach employs dynamic parameter reuse and implicit differentiation for efficient gradient computation, ensuring scalability and stability while adapting to input complexity.

Model-free Reinforcement Learning for Model-based Control: Towards Safe, Interpretable and Sample-efficient Agents

Model-free Reinforcement Learning for Model-based Control: introduces a paradigm for developing safe, interpretable, and sample-efficient agents by adapting model-based agents, which include an internal model, a planning module, and a policy/Q-function, using model-free RL algorithms.
This approach leverages prior system knowledge embedded in the internal models to enhance sample efficiency and interpretability, with model-free RL addressing potential model inaccuracies.
The paper categorizes policy learning methods for these agents into derivative-free (e.g., Bayesian Optimization) and gradient-based (e.g., Policy Search RL) approaches, highlighting their distinct advantages and challenges.

iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development

iREDEV (Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development): introduces a knowledge-driven multi-agent framework for intelligent requirements development, with six knowledge-driven agents, an artifact pool, and a human-in-the-loop mechanism, designed to automate and enhance the software requirements development process.
The framework integrates human expert knowledge into agent design and utilizes an event-driven communication mechanism via a shared artifact pool to support dynamic and collaborative requirements development tasks.
The system employs LLMs as underlying intelligence for its agents and incorporates a human-in-the-loop mechanism to ensure generated artifacts align with stakeholder expectations and improve reliability.

Non-differentiable Reward Optimization for Diffusion-based Autonomous Motion Planning

Non-differentiable Reward Optimization for Diffusion-based Autonomous Motion Planning: introduces a reinforcement learning-based training scheme that optimizes diffusion motion planning models using non-differentiable objectives like collision and goal achievement, facilitated by a dynamic thresholding algorithm to shape dense reward signals.
This approach enables direct optimization of critical autonomy objectives, outperforming models trained with differentiable objectives on pedestrian datasets.
The method addresses sparse reward problems in autonomous motion planning by adaptively adjusting reward sparsity, ensuring stable learning and improved performance.

MACHINE-READABLE ADS: ACCESSIBILITY AND BEHAVIORAL PATTERNS OF AI WEB AGENTS INTERACTING WITH ONLINE ADVERTISEMENTS

AI Web Agent Advertising Interaction Evaluation: introduces a controlled experimental setup to assess the accessibility and behavioral patterns of AI web agents interacting with online advertisements, utilizing the Browser Use framework, GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, OpenAI Operator, a faithful TT.com clone, and an experimental protocol.
The evaluation framework investigates how various LLM-powered agents perceive, interact with, and responsibly behave in an ad-heavy online environment, considering semantic markup, dynamic content, and ethical implications.
Key findings reveal agents' satisficing behavior, their preference for explicit DOM elements over purely visual cues, and model-specific risk profiles concerning financial commitments and consent handling.

Intent-Based Network for RAN Management with Large Language Models

IBNS (Intent-Based Network System): introduces a novel automation approach for RAN management, integrating LLMs within an agentic architecture that includes a Strategist Agent (translates intent, generates strategy), History Analyzer Agent (analyzes past strategies, provides insights), and interacts with a Radio Access Network (simulated network environment) via O-RAN O1 interface (standard RAN communication).
The system leverages a structured prompt engineering technique for LLM-driven intent translation, dynamically optimizing RAN parameters for energy efficiency through a closed-loop mechanism.
This approach enables robust resource management by adapting strategies based on real-time feedback, showcasing the potential of LLM-orchestrated agentic systems for autonomous network operation.

Public Evaluation on Potential Social Impacts of Fully Autonomous Cybernetic Avatars for Physical Support in Daily-Life Environments: Large-Scale Demonstration and Survey at Avatar Land

The demonstrated system: introduces a framework for fully autonomous Cybernetic Avatars (CAs) to provide physical support in daily-life environments, integrating User Instruction Input, Speech Recognition (Whisper), Posture Detection (MediaPipe), Exophora Resolution Model, LLM (GPT-4o), Multi-Robot Planning, Fetch Robotics Fetch, Preferred Robotics Kachaka, Containerized SDE, ROS, and Extended Reality Visualization (Meta Quest 3 XR headset) to enable object retrieval tasks.
This system was publicly evaluated at Avatar Land, assessing user perceptions and social impacts of autonomous CAs performing daily object retrieval in a replicated home environment.
The evaluation revealed public interest in CAs for daily support but highlighted concerns regarding task execution reliability, emphasizing the need for improved robot performance.

LightAutoDS-Tab: Multi-AutoML Agentic System for Tabular Data

LightAutoDS-Tab: introduces a multi-AutoML agentic system for tabular data, combining LLM-based code generation with multiple AutoML tools, and includes interactor, planner, generator, validator, improver, AutoML, executor, interpreter, and result aggregation components.
This framework enhances existing AutoML tools by integrating them with an LLM agent for flexible, data-aware code generation and configuration of ML pipelines, addressing limitations of fixed-pipeline and LLM-only approaches.
It streamlines end-to-end ML pipeline development, offering increased automation, reduced development time, and improved interpretability and quality for data science tasks.

16th July 2025

AIME: TOWARDS FULLY-AUTONOMOUS MULTI-AGENT FRAMEWORK

Aime: introduces a novel multi-agent framework with a Dynamic Planner (orchestrates tasks), Actor Factory (instantiates actors), Dynamic Actor (executes subtasks), and Progress Management Module (manages state), designed for dynamic, reactive planning and execution.
The framework replaces conventional static workflows with a fluid, adaptive architecture, continuously refining strategy based on real-time execution feedback and enabling on-demand agent specialization.
Aime addresses critical limitations of rigid plan execution, static agent capabilities, and inefficient communication in multi-agent systems, establishing a more resilient and effective foundation for collaboration.

Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data

Advanced RAG Framework: introduces an advanced Retrieval-Augmented Generation framework designed to effectively retrieve and generate responses from heterogeneous enterprise data, including text, structured documents, and tabular records, by combining optimized document preprocessing, hybrid retrieval strategies, advanced ranking mechanisms, and feedback-driven refinement.
The framework employs semantic and table-aware chunking, hybrid retrieval (dense embeddings and BM25), metadata-driven filtering, and cross-encoder reranking to enhance relevance and contextual alignment.
It further integrates interactive query refinement using LLMs and a human-in-the-loop feedback mechanism with conversational memory to improve system adaptability and response quality over time.

Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate

Multi-Agent Debate Framework: introduces a multi-agent debate framework, with User Request (Input instruction), Multi-Agent Debate Framework (Core system), LLM Agents (Processing units), Leader Agent (Proposes initial solution), Follower Agents (Evaluate proposals), Debate Rounds (Iterative process), Consensus Mechanism (Decision-making), and Clarification Question Generation (Output question), designed to enhance LLM detection and resolution of ambiguity in user requests through structured debate.
This framework employs multiple LLM agents (Llama3-8B, Gemma2-9B, Mistral-7B) in a leader-follower protocol to collaboratively analyze ambiguous instructions and generate clarifying questions.
The debate mechanism improves ambiguity detection and resolution, particularly for complex ambiguities, by leveraging diverse perspectives and iterative refinement, though its utility is model-dependent.

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

GitChameleon: introduces, "a novel Python-based benchmark", with all Inputs (problem context, requirements), Candidate Solution Generation (LLM/AI agent), Candidate Solution (generated code), Validation (executes tests), Hidden Tests (success evaluation), Visible Tests (self-debugging feedback), Self-Debug (iterative refinement), and Benchmark Success (evaluation outcome), where GitChameleon provides an execution-based benchmark for evaluating AI code generation against Python library version incompatibilities.
It comprises 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests.
The benchmark rigorously evaluates LLMs, LLM-powered agents, code assistants, and RAG systems for version-conditioned code generation, highlighting limitations in handling library versioning.

Next-Gen Museum Guides: Autonomous Navigation and Visitor Interaction with an Agentic Robot

Alter-Ego (Autonomous Museum Guide Robot): introduces an autonomous museum guide robot that integrates LLM-powered dialogue with advanced navigation capabilities, enabling real-time, context-aware Q&A and seamless navigation.
The system leverages components like ROS Hector SLAM, YOLOv10-n, Google's Speech-to-Text API, and OpenAI's GPT-4o mini for robust operation.
It dynamically adapts tours based on user requests and location, enhancing visitor engagement and knowledge acquisition in cultural settings.

Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes

Infherno: introduces an end-to-end agent-based framework for FHIR resource synthesis, with LLM Agent (Core processing unit), Prompt Structure (Guides agent behavior), Code Search (Queries external terminologies), Python Executor (Executes generated code), fhir.resources Python Module (Ensures FHIR compliance), Code Loop (Iterative refinement process), FHIR Bundle (Aggregates output resources), and Front End (User interface), where it transforms unstructured clinical notes into structured, semantically accurate FHIR representations.
The framework leverages LLM agents with tool-use capabilities and code execution to address challenges in generalizability and structural conformity in clinical data extraction.
It supports clinical data integration and interoperability by adhering to the FHIR document schema and performing well against human baselines in predicting FHIR resources.

Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited

LLM Cardinal Direction Reasoning Evaluation: introduces a comprehensive methodology for assessing LLMs' spatial reasoning, encompassing automated benchmark dataset generation (creates diverse questions), an LLM testing system (executes queries), and a performance analysis module (interprets results).
The benchmark includes 5760 questions derived from six templates, varying locomotion types, person forms, and cardinal/intercardinal directions to rigorously test LLM robustness.
The evaluation reveals that even state-of-the-art LLMs struggle with reliable cardinal direction reasoning, particularly with intercardinal directions and generalisation across different question parameters.

Value-Based Large Language Model Agent Simulation for Mutual Evaluation of Trust and Interpersonal Closeness

LLM Agent Simulation Framework: introduces a two-stage experimental workflow to investigate value similarity's influence on trust and interpersonal closeness between LLM agents, including value controllability assessment and mutual evaluation.
The framework first assesses LLM value controllability using prompts and PVQ, then simulates dialogues between value-assigned LLM agents, evaluating their mutual trust and interpersonal closeness.
This simulation demonstrates that higher value similarity leads to greater mutual trust and closeness, validating social science theories within an artificial society.

Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking biomarker

Graph Representations for Reading Comprehension Analysis Framework: introduces a method that leverages LLM-generated graph representations and eye-tracking biomarkers to analyze reading comprehension, comparing human and LLM understanding of text.
The framework converts sentences into knowledge graphs with nodes representing entities and edges representing relationships, then uses LLMs to label important graph components.
It integrates human eye-tracking data and graph-theoretic metrics to validate LLM-derived importance labels, offering insights into cognitive processes.

Extremal Testing for Network Software using LLMs

Extremal Testing Methodology: introduces a novel approach for automating extremal testing of network software, leveraging LLMs for constraint generation and invalid test case creation, followed by execution on target software and differential testing for bug identification.
This two-step, chain-of-thought prompting strategy, where LLMs first define validity constraints and then generate violating tests, proves more effective than one-stage prompting.
The methodology successfully uncovered new bugs in DNS, HTTP, and BGP implementations, demonstrating its utility as a complement to existing software testing techniques like symbolic execution and fuzz testing.

THE EVOLVING ROLE OF LARGE LANGUAGE MODELS IN SCIENTIFIC INNOVATION: EVALUATOR, COLLABORATOR, AND SCIENTIST

Pyramidal Framework: introduces a comprehensive taxonomy for LLM roles in scientific innovation, encompassing Evaluator (low-autonomy knowledge synthesizer), Collaborator (mid-autonomy ideation engine), and Scientist (high-autonomy discovery platform) components.
This framework distinguishes LLMs' contributions to structured scientific research and open-ended scientific discovery, clarifying capability boundaries, evaluation criteria, and human-AI interaction patterns at each level.
The framework provides conceptual clarity, practical guidance, and theoretical foundations for future research in increasingly autonomous AI-driven science.

NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting

NLI4VolVis: introduces an interactive system that enables users to explore, query, and edit volumetric scenes using natural language, integrating multi-view semantic segmentation, vision-language models, editable 3D Gaussian Splatting (iVR-GS), a multi-agent LLM architecture with core and function-calling agents, memory, function-calling tools for querying, editing, question answering, view selection, and 2D stylization, a Visualization-Perception-Action (VPA) loop, and an interactive user interface, where it allows intuitive exploration and editing of volumetric datasets through open-vocabulary querying, real-time scene editing, and stylization.
The system leverages LLM multi-agents equipped with extensive function-calling tools to interpret user intents and execute visualization tasks, enhancing accessibility and usability in volumetric data exploration.
NLI4VolVis unifies editable volumetric representations, open-vocabulary scene understanding, and collaborative multi-agent LLMs to support intuitive, natural language-based volume visualization.

Topology Enhanced MARL for Multi-Vehicle Cooperative Decision-Making of CAVs

TPE-MARL (Topology Enhanced Multi-Agent Reinforcement Learning): introduces a novel multi-agent reinforcement learning method for cooperative decision-making of Connected and Autonomous Vehicles (CAVs) by designing a game topology tensor and integrating it into a QMIX-based framework.
The framework leverages a TopologyNet to compress high-dimensional traffic state information into a structured game topology tensor, enhancing learning efficiency and coordination performance in complex vehicular scenarios.
It incorporates visit counts and agent mutual information into the reward function, enabling a balance between exploration and exploitation for improved traffic efficiency, safety, and decision smoothness.

Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics

FiM (Foresight in Motion): introduces a "First Reasoning, Then Forecasting" strategy for trajectory prediction by integrating a reward-driven intention reasoner and a hierarchical DETR-like decoder.
The framework employs a query-centric Inverse Reinforcement Learning (QIRL) module to infer reward distributions and perform policy rollouts, providing intention-informed priors for trajectory generation.
It further utilizes a Bi-Mamba-enhanced decoder to capture sequential dependencies and an auxiliary Occupancy Grid Map (OGM) prediction head to improve feature fusion and prediction confidence.

Understanding visual attention beehind bee-inspired UAV navigation

PPO (Proximal Policy Optimization): introduces a Deep Reinforcement Learning framework for bee-inspired UAV navigation, utilizing a Policy Network composed of a CNN, MaxPool layers, a Flatten layer, Linear layers, and an Output layer to process optic flow observations for obstacle avoidance.
The framework trains agents in an AirSim simulation environment to navigate cluttered tunnels using only optic flow as sensory input, aiming to replicate honeybee navigation behaviors.
Explainable AI methods, specifically SHAP, are employed to analyze the attention patterns of trained agents, revealing that they focus on optic flow discontinuities and high-magnitude regions for decision-making.

IANN-MPPI: Interaction-Aware Neural Network-Enhanced Model Predictive Path Integral Approach for Autonomous Driving

IANN-MPPI (Interaction-Aware Neural Network-Enhanced Model Predictive Path Integral): introduces a real-time, fully parallelizable, interaction-aware trajectory planning framework that integrates an MPPI Controller, a Neural Network Predictor, and a Spline-based Prior, enabling complex maneuvers by predicting surrounding agent reactions to sampled control sequences.
The framework leverages the Neural Network Predictor to simulate diverse interaction outcomes based on ego vehicle candidate trajectories, while the Spline-based Prior enhances MPPI's sampling diversity for efficient lane-changing.
This approach addresses challenges in dense traffic by enabling proactive nudging of surrounding vehicles and achieving successful merging maneuvers, demonstrating improved efficiency and safety compared to non-interactive baselines.

15th July 2025

How Many Instructions Can LLMs Follow At Once?

IFScale: introduces a benchmark to evaluate LLM instruction-following performance degradation as instruction density increases, with Term Vocabulary Construction (Builds keyword set), Prompt Construction (Generates model input), Retry Logic (Manages generation failures), Evaluation Module (Assesses instruction adherence), and Coherence Check (04-mini) (Judges report quality), where it measures how LLMs adhere to a growing number of keyword-inclusion instructions in business report generation.
The benchmark evaluates 20 state-of-the-art LLMs, revealing distinct performance degradation patterns, primacy effects, and error types under high cognitive load.
Insights from the evaluation inform the design of instruction-dense prompts and highlight performance-latency tradeoffs for real-world LLM applications.

DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

DrafterBench: introduces a comprehensive benchmark for evaluating LLM agents in civil engineering drawing revision, including Task Collection (summarizes real-world tasks), Tool Preparation (customizes functions/tools), Default Prompt (provides prompt framework), Evaluation Metric (assesses performance), and Dual Tools/Functions (records operation paths).
The benchmark comprises 1920 tasks across 12 types, derived from real-world drawing files, designed to assess LLM capabilities in structured data understanding, function execution, instruction following, and critical reasoning.
It utilizes dual tools to record ground operation paths for accurate performance grading and error analysis, providing insights for integrating LLMs into engineering applications.

AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air

AirLLM: introduces a hierarchical diffusion policy framework for communication-aware LoRA adaptation, including Cloud LLM for fine-tuning, Edge LLM for inference, a Wireless Channel for parameter transmission, an Environment providing state, reward, and action, and a Hybrid Policy with PPO for coarse policy generation and Diffusion Policy for fine-grained refinement.
The framework models rank configuration as a structured action vector, using a Proximal Policy Optimization (PPO) agent for coarse-grained decisions and Denoising Diffusion Implicit Models (DDIM) for high-resolution rank vector refinement.
It aims to balance LLM fine-tuning performance with transmission costs by adaptively optimizing LoRA rank assignments based on wireless states and linguistic complexity.

Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

Dr.Copilot (Multi-Agent Large Language Model System): introduces a multi-agent LLM system designed to enhance doctor-patient communication quality in Romanian text-based telemedicine, including a Scorer Agent (evaluates responses), a Recommendation Agent (generates suggestions), and a Reconciliation Agent (simulates improvements).
The system leverages DSPy (prompt optimization) for automatic prompt optimization and utilizes open-weight LLMs (underlying models) served by VLLM (model serving), providing real-time feedback to doctors.
Dr.Copilot focuses on improving presentation quality rather than medical correctness, aiming to increase patient satisfaction and represents an early real-world deployment of LLMs in Romanian medical settings.

Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems

AgentOps (AI AgentOps Automation Pipeline): introduces a comprehensive framework for observing, analyzing, optimizing, and automating agentic AI systems, encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation.
The framework addresses challenges for developers, testers, SREs, and business users by taming uncertainty in LLM-powered agentic systems through automation and self-improvement.
It provides a structured approach to manage dynamic, unpredictable agent behavior, ensuring safe, adaptive, and effective operation in enterprise contexts.

An Empirical Study of Multi-Agent RAG for Real-World University Admissions Counseling

MARAUS (Multi-Agent and Retrieval-Augmented University Admission System): introduces a real-world conversational AI platform for university admissions counseling, integrating a Multi-agent Coordinator (Classifies queries), Preprocessing Module (Cleans, normalizes data), Hybrid Retrieval Module (Combines semantic, keyword search), Logic Calculation Module (Performs domain-specific computations), Factual Database (Stores structured data), LLM-based Generation (Generates responses), and Post-processing Module (Formats, mitigates hallucination).
The system employs specialized agents for information search, score calculation, recommendation, and general queries, leveraging hybrid RAG with semantic and keyword retrieval, re-ranking, and LLM-based generation to enhance accuracy and reduce hallucinations.
Deployed in a real-world university setting, MARAUS processed over 6,000 user interactions, demonstrating significant improvements in accuracy and response times while operating cost-effectively.

An Agentic Flow for Finite State Machine Extraction using Prompt Chaining

FlowFSM (An Agentic Flow for Finite State Machine Extraction using Prompt Chaining): introduces an agentic framework for FSM extraction from RFC documents, utilizing an RFC Documents Processing Pipeline, FSM Extraction using Prompt Chaining, AI Agents (CrewAI), an LLM Model, and a Rulebook.
The framework systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books by chaining agent outputs.
This approach decomposes complex FSM extraction into modular, interpretable steps, enhancing transparency and robustness.

Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Multi-Agent System (MAS): introduces a multi-agent framework for LLM-based deductive coding, including a Single-Agent Coding Module (individual annotation simulation), Dual-Agent Discussion Module (inter-agent discussion simulation), Consensus Agent Module (disagreement resolution, final coding), LLM Agents (perform coding tasks), Codebook (structured coding categories), Ollama API (LLM interaction interface), System Prompts (agent instruction, personality injection), and Post-processing Procedure (extracts, validates code annotations), to investigate how agent persona and temperature influence consensus and coding accuracy.
The MAS emulates human qualitative coding workflows through structured agent discussions and consensus arbitration, evaluating six open-source LLMs with varying parameters and 18 experimental configurations.
The study found that while temperature robustly delays consensus, and persona congruency has selective effects, MAS deliberation generally yields minimal accuracy gains over single-agent coding, except for specific conditions.

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

SWE-MERA: introduces a dynamic benchmark for agenticly evaluating LLMs on software engineering tasks, utilizing a seven-stage pipeline including Repository Selection (selects GitHub repositories), PR-Issue Mapping Construction (maps pull requests to issues), Metadata Extraction and Filtering (downloads and filters metadata), Patch Extraction and Validation (generates and validates git diffs), Repository Build Validation (builds environment, runs tests), End-to-End Task Execution (executes tasks in Docker), and LLM-based Pipeline Evaluation (assesses task quality).
The framework also integrates an Aider coding agent (automates scoring), a dynamic user leaderboard (displays evaluation results), Docker containers (provides controlled environment), the GitHub GraphQL API (collects data), the Hugging Face platform (hosts dataset), and an evaluation repository (receives submissions).
SWE-MERA addresses data contamination and benchmark saturation by continuously updating its dataset with new, unseen issues, ensuring real-world relevance and fair evaluation for LLMs in software development.

DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models

Voting Classifier: introduces a system for early depression detection, with Raw Data, Pre-processing Pipeline, Feature Engineering, Feature Matrix, Voting Classifier, Random Forest Classifier, Stochastic Gradient Descent Classifier, and Gradient Boosting Classifier, where it combines diverse engineered features and multiple machine learning models for classification.
This approach processes raw JSON user data through a comprehensive pre-processing pipeline to create a feature matrix, which is then fed into an ensemble of base models.
The Voting Classifier employs a soft voting strategy to aggregate predictions from its base models, aiming for robust depression detection.

General Modular Harness for LLM Agents in Multi-Turn Gaming Environments

General Modular Harness: introduces a modular design for LLM agents, with Perception Module (processes UI inputs), Memory Module (stores trajectories, reflects), Reasoning Module (integrates info, decides), and Adapter (interfaces with game), enabling a single LLM/VLM backbone to tackle diverse multi-turn gaming environments.
This harness provides a unified workflow for analyzing how each module affects performance across dynamic interactive settings in games.
Extensive experiments demonstrate consistent performance gains and reveal distinct module contributions, advancing general-purpose agent design.

MR-LDM - The Merge-Reactive Longitudinal Decision Model: Game Theoretic Human Decision Modeling for Interactive Sim Agents

MR-LDM (Merge-Reactive Longitudinal Decision Model): introduces a game-theoretic framework that models merging and lag vehicle interactions using a game-theoretic formulation with defined action sets for both actors, payoff functions incorporating a usmht function, Predictive Time Headway (PTH) metric, and Ramp End Influence Terms, alongside bounded rationality (QRE), a decision window, and an MR-IDM dynamics model.
This model explicitly generates discrete, decision-level behaviors for the lag actor, including yield behind, yield ahead, block, and do nothing, which are then executed using MR-IDM dynamics.
The framework enhances behavioral realism and controllability in traffic actor models, supporting robust evaluation of merging trajectory planners in interactive traffic scenarios.

VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization

VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment): introduces a monocular global localization framework that combines an object-based segmentation and tracking pipeline with a submap correspondence search, including Image Input, Image Auto-segmentation, Video Tracking, Visual Inertial Odometry (VIO), Structure from Motion (SfM), Environment Map (Mi), Bounding Box Submap Generation, Geometric Data Association, and Relative Rotation and Translation Estimation.
The framework generates sparse, viewpoint-invariant 3D environment representations and aligns vehicle reference frames by exploiting geometric consistencies between environment maps.
It achieves robust localization across diverse camera viewpoints and seasonal changes without domain-specific training, maintaining a compact object-based map for real-time performance.

Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander

Vision-Language Model-Based Commander: introduces a framework for multi-UGV confrontation, integrating a Perception VLM (scene understanding) and a Decision LLM (strategic planning), with an Expert System (training supervision) for semantic alignment.
The framework reconstructs perception-to-decision as a language-based cognitive process, achieving unified perception and decision within a shared semantic space.
This approach, validated through simulations, demonstrates strong adaptability, interpretability, and a win rate over 80% compared to baseline models.

14th July 2025

Logic-layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems

LPCI (Logic-layer Prompt Control Injection): introduces a novel security vulnerability class targeting LLM agent architecture, with Prompt Ingestion Layer (captures user inputs), Memory Context Handler (manages memory states), Logic Execution Engine (interprets prompts, executes logic), Tool/Plugin Interface (facilitates external actions), and Output Dispatcher (manages output delivery), exploiting persistent memory and logic execution layers.
LPCI attacks embed encoded, delayed, and conditionally triggered payloads in memory or vector stores, bypassing conventional input filters and triggering unauthorized behavior across sessions.
The paper demonstrates LPCI feasibility across multiple LLM platforms and proposes runtime security controls to mitigate these vulnerabilities.

Prompt-Informed Reinforcement Learning for Visual Coverage Path Planning

PIRL (Prompt-Informed Reinforcement Learning): introduces a novel approach for visual coverage path planning using a UAV, integrating an LLM (GPT-3.5) and a PPO RL policy with components including Current UAV State, Structured LLM Prompt with UAV State, LLM Recommendation for Next UAV State, PARE, Action-based Reward for PPO, PIRL-based Reward for PPO, PPO Action, PPO RL policy, and Next UAV State.
The framework leverages the LLM's zero-shot reasoning and in-context learning to dynamically shape the reward function for the PPO agent via the PARE module.
PIRL guides the RL agent's position and camera adjustments for optimal visual coverage by combining standard RL rewards with LLM-based semantic feedback.

Toward Real-World Table Agents: Capabilities, Workflows, and Design Principles for LLM-based Table Intelligence

LLM-based Table Agents: introduces a survey focusing on automating table-centric workflows by integrating preprocessing, reasoning, and domain adaptation, with Table Structure Understanding (Formatting tables), Table and Query Semantic Understanding (Handling noise and ambiguity), Table Retrieval and Compression (Compressing or selecting tables), Executable Reasoning with Traceability (Generating verifiable steps), and Cross-Domain Generalization (Adapting to new domains).
The paper identifies these five core capabilities as essential for LLM-based agents to handle real-world table tasks involving noise, structural heterogeneity, and semantic complexity.
The survey reviews current methodologies for these capabilities and outlines future research directions for developing more robust, efficient, and generalizable agents.

Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Cultural Moral Framework Evaluation: evaluates cultural bias in LLMs using the MFQ-2 across cultural contexts, employing cultural persona prompting and synthetic population generation, comparing results to human baseline data using analysis methods.
The study finds that current LLMs tend to homogenize moral diversity across cultures, failing to accurately represent nuanced, culturally-specific moral intuitions.
The findings highlight limitations in current AI alignment approaches and the use of LLMs as synthetic populations in social science research.

The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents

Gifts (hybrid multi-agent framework): introduces a framework to profile sensitive personal attributes from audio data using its LLM agent (Guides ALM, scrutinizes, consolidates), ALM agent (Infers attributes, answers questions), Guidance (LLM instructs ALM), Inference (ALM infers attributes), Forensics (LLM questions, ALM answers), Scrutinization (LLM evaluates ALM inference), and Consolidation (LLM aggregates results) components.
The framework leverages the strengths of LLMs and Audio-Language Models (ALMs) through a multi-phase process to enhance attribute inference capabilities from audio.
Gifts significantly outperforms baseline approaches in inferring sensitive attributes from audio, highlighting a privacy risk and providing a framework for further research and defense strategies.

LLM-Guided Agentic Object Detection for Open-World Understanding

LAOD (LLM-Guided Agentic Object Detection): introduces an LLM-guided agentic object detection framework that autonomously generates scene-specific object names using an LLM (Large Language Model) from an input image, which are then passed as generated labels to an OVOD (Open-Vocabulary Object Detector) for object localization, producing detected objects.
This framework enables fully label-free, zero-shot detection, adapting its perception goals dynamically without manual prompt engineering or predefined vocabularies.
The method enhances autonomy and adaptability for open-world understanding by tightly coupling language-based reasoning with visual grounding.

Semantic Context for Tool Orchestration

SC (Semantic Context): introduces a novel approach for robust tool orchestration, leveraging descriptive tool information to enhance learning efficiency and adaptation in dynamic action spaces.
The paper theoretically and empirically validates SC's benefits through the SC-LinUCB algorithm and demonstrates its critical role in dynamic adaptation for LLMs.
Furthermore, the FiReAct pipeline, which utilizes SC for semantic filtering and LLM-based reasoning, enables practical tool orchestration at scale with over 10,000 tools.

Warehouse Spatial Question Answering with LLM Agent 1st Place Solution of the 9th AI City Challenge Track 3

LLM Agent System: introduces a data-efficient approach for warehouse spatial question answering, integrating a Spatial Reasoning LLM, Light-weight Perception Models, Spatial Calculation Functions, an API Tools Interface, Multi-turn Execution, a Rule-based Parser, and Structured Message History.
The system leverages a reasoning LLM (Gemini 2.5-Flash) with function-calling capabilities to conduct complex spatial reasoning and interact with various tools for object retrieval, counting, and distance estimation.
This approach achieved first place in the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse benchmark, demonstrating high accuracy and efficiency in complex indoor scenarios.

Exploring User Security and Privacy Attitudes and Concerns Toward the Use of General-Purpose LLM Chatbots for Mental Health

Harm-Reduction Framework (conceptual recommendations for LLM-enabled chatbots): introduces a conceptual framework to safeguard user mental health disclosures with general-purpose LLM-enabled chatbots, including contextual nudges & just-in-time warnings (Dynamic S&P responses), strong default protections and ephemeral storage (Default privacy settings), and targeted oversight and audits (Third-party data review), aiming to address user security and privacy concerns.
The paper identifies critical user misconceptions and a general lack of risk awareness regarding data handling, privacy, and regulatory protections when using LLMs for mental health support.
It highlights the concept of 'intangible vulnerability,' where emotional disclosures are undervalued compared to tangible data, necessitating architectural safeguards and legislative frameworks.

From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents

Functional Taxonomy for Web of Agents Architectures: introduces a comprehensive evolutionary overview of the Web of Agents, with Semantic Foundation (establishes shared understanding), Communication Paradigm (classifies message exchange style), Locus of Intelligence (identifies core reasoning location), and Discovery Mechanism (defines how agents find each other) components, providing a unified analytical lens for comparing agent architectures across generations.
This taxonomy reveals a fundamental paradigm shift in the 'locus of intelligence' from external data or platforms to being embedded within the agent's core LLM, enabling scalable and adaptive WoA systems.
The paper highlights that while new protocols like MCP and A2A are essential, they are insufficient for building a robust, open, and trustworthy ecosystem, mapping out a new agenda focused on socio-technical challenges like decentralized identity, economic models, security, and governance.

Game Theory Meets LLM and Agentic AI: Reimagining Cybersecurity for the Age of Intelligent Threats

LLM-based Multi-Agent Systems (MAS) for Cybersecurity: introduces a framework for designing adaptive cyber systems by integrating game theory with LLM-driven agentic AI, featuring Chain, Star, Parallel, Feedback, and Hybrid workflows, each composed of LLM Agents.
This framework leverages LLMs as reasoning engines and generative policy mechanisms to overcome limitations of classical game theory, enabling dynamic, context-aware interactions among agents.
MAS workflows enhance robustness and resilience in cybersecurity by supporting architectural redundancy, inter-agent verification, and adaptive learning in adversarial environments.

Architecting Human-AI Cocreation for Technical Services – Interaction Modes and Contingency Factors

Six-Mode Taxonomy of Human-Agent Collaboration: introduces a comprehensive framework for designing human-agent systems, detailing six distinct interaction modes: Human-Augmentation-Mode (HAM), Human-in-Command (HIC), Human-in-the-Process (HITP), Human-in-the-Loop (HITL), Human-on-the-Loop (HOTL), and Human-Out-of-the-Loop (HOOTL).
The framework maps these modes to a standard process flow, illustrating the division of labor between human and AI agents across tasks like data gathering, solution formulation, and approval.
It provides actionable design guidance by connecting each mode to key contingency factors such as task complexity, operational risk, and system reliability, aiding practitioners in navigating automation-control trade-offs.

Semantic Segmentation based Scene Understanding in Autonomous Vehicles

FPN-EfficientNet: introduces a novel compound model for semantic segmentation in autonomous vehicles, utilizing Feature Pyramid Networks with an EfficientNet backbone, evaluated on the BDD100k dataset.
The model employs an encoder-decoder structure, incorporating convolutional and pooling layers, batch normalization, and various activation functions to achieve pixel-level scene understanding.
Transfer learning is applied to leverage pre-trained knowledge, and the model's performance is optimized using specific loss functions within the Tensorflow/Keras environment.

AI-Powered Math Tutoring: Platform for Personalized and Adaptive Education

AI-Powered Math Tutoring Platform: introduces a novel multi-agent AI tutoring platform that combines adaptive and personalized feedback, structured course generation, and textbook knowledge retrieval to enable modular, tool-assisted learning processes.
This system utilizes a multi-agent architecture with a central Tutor Agent orchestrating interactions, supported by specialist agents for research, planning, and course creation, and a dual-memory framework for personalization.
It integrates Retrieval-Augmented Generation (GraphRAG) for contextual textbook knowledge and various tools like a symbolic solver and function plotter to facilitate deep understanding and independent problem-solving.

PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training

PRM-Free Security Alignment Framework: introduces a novel approach for LLM security alignment, leveraging automated red teaming and adversarial training to achieve robust security guarantees without Process Reward Models.
This framework systematically identifies vulnerabilities through sophisticated attack strategies and enhances model robustness via targeted adversarial training, significantly reducing computational costs by 61% compared to PRM-based methods.
It incorporates transparent reporting and continuous audit mechanisms, democratizing access to robust security measures for resource-constrained organizations and providing a scalable foundation against evolving adversarial threats.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

ExCyTIn-Bench: introduces a benchmark for evaluating LLM agents on cyber threat investigation, featuring a Graph Builder (constructs threat graphs), a QA Generator (generates questions/answers), an ExCyTIn Playground (interactive environment), an LLM Agent (investigates cyber threats), a MySQL Environment (provides log data), and an LLM Evaluator (assesses agent performance).
The benchmark leverages real-world security logs from a controlled Azure tenant to build bipartite alert-entity graphs, enabling the automatic generation of diverse and explainable question-answer pairs for agent evaluation.
It provides a standardized interactive environment where LLM agents query a MySQL database to solve multi-hop investigation tasks, with fine-grained reward calculation for intermediate steps.

Open-Source LLMs Collaboration Beats Closed-Source LLMs: A Scalable Multi-Agent System

SMACS (Scalable Multi-Agent Collaboration System): introduces a scalable multi-agent collaboration system that leverages prior selection and posterior enhancement to enable open-source LLMs to outperform closed-source LLMs, with all Unified Question Bank (stores questions and LLM performance), LLM Bank (pool of heterogeneous LLMs), Pre-establish (evaluates LLMs on question bank), Retrieval-based Prior Selection (RPS) (selects top-k LLMs), Question Embedding (embeds input question), LLM Performance Matrix (stores LLM performance), Retrieved Similarity Vector (represents retrieved question similarity), Selected Top-k LLMs (output of RPS), Exploration-Exploitation-Driven Posterior Enhancement (EPE) (generates and selects responses), Prior Dropping (forms answer subsets), LLM Aggregator (synthesizes multiple responses), Similarity Score Computation (computes mean pairwise similarity), Perplexity Score Computation (computes perplexity), Hybrid Posterior Score (combines similarity and perplexity), where the framework integrates prior and posterior information to generate diverse, high-quality responses.
The system utilizes a Unified Question Bank and an LLM Bank, employing Retrieval-based Prior Selection to select optimal LLMs and Exploration-Exploitation-Driven Posterior Enhancement with an LLM Aggregator to refine responses.
This framework demonstrates scalability and superior performance across various benchmarks by effectively combining the strengths of multiple LLMs.

RCG: Safety-Critical Scenario Generation for Robust Autonomous Driving via Real-World Crash Grounding

RCG (Real-world Crash Grounding): introduces a scenario generation framework that integrates crash-informed semantics into adversarial perturbation pipelines, utilizing a Behavior Embedding Space, Encoder, Decoder, Fully-Connected (FC) Layer, LoRA Adapter, Reconstruction Loss, and Prototypical Contrastive Learning (PCL) Objective to create safety-critical scenarios.
The framework constructs a safety-aware behavior representation by pre-training on large-scale driving logs and fine-tuning on a crash-rich dataset, leveraging an Unsafe Embedding Cache, Trajectory Predictor, and k-NearestNeighbors (KNN) Distance for adversarial selection.
RCG guides the Adversary Agent's behavior via a Perturbation Function to maximize realistic criticality against the Ego Agent within a Base Scenario, leading to more plausible and effective stress testing for autonomous driving systems.

BRIDGING BRAINS AND MACHINES: A UNIFIED FRONTIER IN NEUROSCIENCE, ARTIFICIAL INTELLIGENCE, AND NEUROMORPHIC SYSTEMS

Unified Frontier: introduces a research paradigm bridging neuroscience, artificial intelligence, and neuromorphic computing, with four pillars: Co-Design of Brains, Algorithms and Hardware, Hybrid Learning Pipelines, Hierarchical Memory and Sensorimotor Grounding, and Standardization and Benchmarking.
The paper surveys foundational milestones, recent advances, and conceptual mismatches across these domains, highlighting cross-inspiration and convergence points like synaptic plasticity, sparse spike-based communication, and multimodal association.
It proposes an integrated roadmap outlining open challenges and future directions for biologically-grounded AGI and next-generation neuromorphic hardware, emphasizing energy efficiency, real-time adaptation, and ethical considerations.

Large Population Models

Large Population Models (LPMs): introduces a novel approach to simulate complex societal systems at scale, integrating Million-scale Agent-based Simulation (simulates millions agents), Differentiable Agent-based Simulation (enables gradient learning), and Decentralized Agent-based Simulation (securely deploys simulations) to overcome traditional agent-based model limitations.
This framework, implemented by AgentTorch, enables efficient simulation of millions of agents, end-to-end differentiable learning from diverse data streams, and privacy-preserving integration with real-world systems.
LPMs provide a robust platform for understanding collective intelligence, evaluating policies, and testing social innovations before real-world deployment, as demonstrated in a COVID-19 case study for New York City.

Multi-residual Mixture of Experts Learning for Cooperative Control in Multi-vehicle Systems

MRMEL (Multi-residual Mixture of Experts Learning): introduces a novel framework for Lagrangian traffic control, with vehicle observations and MDP context fed into a Policy F_e (actor/policy network) that outputs nominal weights for Nominal policies and a Residual action, which are combined to form the Final action, while a Value function (critic network) estimates Value to maximize Max E[R], all operating within Training threads involving Autonomous Vehicles and Human-driven vehicles in diverse Traffic scenarios.
MRMEL augments a suboptimal nominal AV control policy by learning a residual correction and dynamically selecting the most suitable nominal policy from a pool of nominal policies conditioned on the traffic scenarios, modeled as a mixture of experts.
The framework is designed for generalizable multi-vehicle control, demonstrating superior performance in cooperative eco-driving by reducing aggregate vehicle emissions across diverse real-world traffic scenarios.

13th July 2025

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

TinyTroupe: introduces a simulation toolkit enabling detailed persona definitions and programmatic control via LLM-driven mechanisms, including Agents (LLM-powered entities), Environments (simulation context), Factories (generate agent specifications), Validators (assess agent quality), Propositions (define verifiable claims), Simulation Steering (guide simulation flow), Information Processing (extract/enrich/export data), Caching (preserve simulation state), Control (overall simulation management), Experimenter (user interaction), Simulation Core (central simulation engine), Data (data handling components), Action Generation (produce agent actions), Mental Faculties (agent cognitive abilities), and Tools (simulated agent tools).
The toolkit is designed for realistic human behavior simulation using LLM-powered multiagent systems with a focus on detailed persona specifications.
It provides a comprehensive set of utilities for specifying scenarios, running simulations, extracting data, and validating results, supporting an experiment-oriented workflow.

Negotiating Comfort: Simulating Personality-Driven LLM Agents in Shared Residential Social Networks

Personality-Driven LLM Agents Simulation: introduces a methodology integrating generative agents powered by LLMs into social network simulations, with Generative Agents (simulated entities), LLM (decision engine), Social Network (simulated relationships), Crowd framework (simulation platform), Environment (external factors), Agent Memory (stores agent data), Agent Reflection (processes past experiences), Agent Planning (determines future steps), Agent Actions (decisions and interactions), Family Members (within-family agents), and Family Representatives (building-level agents).
The approach simulates personality-driven decision-making regarding central heating temperature in a shared residential building.
The simulation uses the Crowd framework for execution and visualization, modeling agent interactions and decisions based on personality, preferences, and social ties.

THOR: Transformer Heuristics for On-Demand Retrieval An LLM Solution Enabling Conversation with Relational Databases by eSapiens

THOR (Transformer Heuristics for On-Demand Retrieval): introduces a multi-agent Text-to-SQL framework with a Supervisor Agent (Routes queries, interprets task), SQL Generation Agent (Converts NL to SQL), Self-correction Module (Regenerates SQL on failure), and Result Interpretation Agent (Analyzes data, generates insights).
The framework uses LLMs for SQL generation, self-correction, and result interpretation, operating on database schema and executing queries against a SaaS database.
A key feature is the self-correction loop, which retries SQL generation based on execution errors or low-quality results, enhancing robustness.

eSapiens: A Platform for Secure and Auditable Retrieval-Augmented Generation

eSapiens: introduces a platform with Knowledge Adaptation Layer (data processing), Storage Layer (data storage), Application Logic Layer (agent orchestration), DEREK Engine (unstructured data QA), and THOR Module (structured data QA).
It employs a multi-agent architecture orchestrated via LLM Frameworks like LangChain/LangGraph for retrieval-augmented generation and natural language analytics over diverse enterprise data.
The platform provides secure, auditable AI workflows, integrating data connectors, prompt management, and robust security features for enterprise use cases.

AICRYPTO: A COMPREHENSIVE BENCHMARK FOR EVALUATING CRYPTOGRAPHY CAPABILITIES OF LARGE LANGUAGE MODELS

Agent-based framework: introduces, with LLM Agent, Environment, Task Prompts, Response Format, Action Types, Available Tools, Execution Environment, Feedback, Helper Scripts, and Challenge Files, a system for evaluating LLMs on CTF challenges through iterative interaction.
The framework allows the LLM Agent to perform actions like executing commands or creating files within a controlled Execution Environment using Available Tools.
The Environment provides Feedback to the LLM Agent, enabling multi-step reasoning and problem-solving towards recovering the flag from Challenge Files, guided by Task Prompts and structured Response Format.

Evaluating LLMs on Sequential API Call Through Automated Test Generation

StateGen (Automated Test Case Generation Framework): introduces an automated framework to generate diverse coding tasks involving sequential API interactions, with Trace Generation, TraceGenerator, State Schema, API Compatibility Checking, Energy-based Sampling, Program Generation, Control Flow Injection, Instruction Translation, Multi-agent System, Generator Agent, Evaluator Agent, Oracle Generation, and Local Execution Environment, designed to evaluate LLMs' ability in understanding sequential API calls and managing associated program states.
The framework follows a reverse-generation strategy, starting with executable API sequences, adding control flow, and translating them into natural language instructions using a multi-agent system.
StateGen is used to construct StateEval, a benchmark of 120 verified test cases across three scenarios, highlighting areas for improvement in current LLMs incorporating APIs.

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

RAG-Reasoning System: introduces a survey of systems integrating retrieval and reasoning in LLMs, including Reasoning-Enhanced RAG (Reasoning improves RAG stages), RAG-Enhanced Reasoning (Retrieval improves reasoning), and Synergized RAG-Reasoning (Iterative retrieval and reasoning) with various Reasoning Workflow (Structured reasoning process) and Agent Orchestration (How agents interact) strategies.
The survey categorizes approaches into three evolutionary stages, highlighting how reasoning can enhance RAG stages and how retrieval can enhance LLM reasoning.
Synergized RAG-Reasoning systems, particularly agentic ones, iteratively interleave retrieval and reasoning to achieve state-of-the-art performance on knowledge-intensive tasks.

IteraOptiRacing: A Unified Planning-Control Framework for Real-time Autonomous Racing for Iterative Optimal Performance

IteraOptiRacing (Unified Planning-Control Framework): introduces a unified planning-control strategy for autonomous racing, integrating Data Collection (gathers historical data), Online Optimization (real-time planning, control), Historical Data Set (stores past performance), Target Terminal Set Construction (generates future states), Surrounding Vehicle Perception (identifies dynamic obstacles), Affine Dynamics Model (approximates vehicle dynamics), Iterative LQR Solver (optimizes trajectories), Trajectory Selection (chooses optimal path), Vehicle Control (applies control inputs), and Vehicle Simulator (simulates racing environment).
The framework leverages iterative optimization based on historical data and an affine time-varying vehicle model to generate collision-free and time-optimal trajectories for the ego vehicle in dynamic multi-car racing environments.
This approach ensures smooth overtaking maneuvers and improved lap time performance by avoiding nonsmooth transitions between time-optimal and overtaking controllers, validated through high-fidelity simulations.

TruckV2X: A Truck-Centered Perception Dataset

TruckV2X: introduces a truck-centered cooperative perception dataset, featuring multi-modal sensing (LiDAR, cameras, IMU units) and multi-agent cooperation (tractor, trailer, CAV, RSU), generated in CARLA/Unreal Engine, and benchmarked using the OpenCOOD framework.
This dataset addresses the scarcity of heavy-duty vehicle data for cooperative perception, focusing on unique challenges like extensive blind spots and occlusions caused by large truck size and dynamic trailer movements.
The research establishes performance benchmarks for cooperative perception tasks, demonstrating the critical value of truck-specific viewpoints for enhanced occlusion handling and advancing autonomous trucking systems.

GenAI-based Multi-Agent Reinforcement Learning towards Distributed Agent Intelligence: A Generative-RL Agent Perspective

GenAI-MARL (Generative AI-based Multi-Agent Reinforcement Learning): introduces a paradigm shift for multi-agent systems by leveraging generative models for environment dynamics modeling, action policy modeling, and integrated prediction and planning, enabling proactive decision-making and sophisticated coordination.
This approach addresses limitations of conventional MARL by tackling the curse of dimensionality, non-stationarity, and partial observability through learning compact representations, anticipating policy evolution, and inferring hidden states.
The framework aims to foster distributed agent intelligence, enabling agents to synthesize realistic multi-agent scenarios, predict behaviors, and generate complex coordination strategies for enhanced collective performance.

12th July 2025

Knowledge Conceptualization Impacts RAG Efficacy

Agentic graph-RAG system: introduces an approach leveraging an Agentic Processor, LLM, Knowledge Graphs Pool, Schema Injection, and User Dialogue to generate SPARQL queries from natural language competency questions.
The system focuses on how injecting knowledge graph schemas into the LLM's context impacts its ability to generate semantically and syntactically correct queries.
The research evaluates the efficacy of this system by varying schema complexity and representation formats across different knowledge graphs.

When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents

OpenHands Framework: introduces a systematic security analysis of LLM-based coding agents using the OpenHands Framework (AI agent platform) powered by various LLM Backends (Specific LLMs powering agent) on the SetupBench Benchmark (Software setup task benchmark), employing a Detection System (Prompt-based insecure action classifier) and evaluating Mitigation Strategies (Methods to reduce insecure behavior).
The paper evaluates the security posture of autonomous coding agents by analyzing over 12,000 actions across five state-of-the-art LLMs on 93 real-world software setup tasks.
Findings reveal significant security concerns, with 21% of agent trajectories containing insecure actions, and demonstrate varying effectiveness of mitigation strategies like feedback mechanisms and security reminders.

StockSim: A Dual-Mode Order-Level Simulator for Evaluating Multi-Agent LLMs in Financial Markets

STOCKSIM: introduces a dual-mode order-level simulator for evaluating multi-agent LLMs in financial markets, featuring an Exchange Simulation Engine, Data Sources, Agents (including LLM and specialist roles), and an Evaluator, communicating via RabbitMQ.
The simulator offers both detailed order-level and aggregated candlestick-level execution modes to capture realistic market dynamics for LLM evaluation.
The framework supports multi-agent LLM coordination, integrates external data, and provides tools for analyzing LLM trading behavior and performance.

Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony—a Decentralized Multi-Agent System

Hide-and-Shill: introduces, "a novel Multi-Agent Reinforcement Learning (MARL) framework for decentralized manipulation detection", with Shiller Agent (Generates manipulative discourse), Follower Agent (Simulates user engagement), Detector Agent (Identifies manipulative discourse), LLM Text Encoder (Extracts text features), GNN User Encoder (Processes user network), TCN Market Encoder (Processes market data), Multi-Modal Fusion Module (Combines features), State Representation (Comprehensive input vector), Action Space (Binary manipulation prediction), Reward Function (Market-grounded, attention-cost), Group Relative Policy Optimization (GRPO) (Optimizes detector policy), Manipulation Probability Prediction (Outputs prediction score), KOL Trust Accumulation Module (Stores detection results), TrustScore Calculation (Computes KOL trust), KOL Profile Updater (Updates KOL profiles), Multi-Agent Simulation Environment (Simulates interactions), Market Response Model (Simulates price changes), Real-World Data Integration (Calibrates simulation), Regulatory Sandbox Dynamic Thresholding (Application layer component), Symphony (Decentralized architecture), where the framework models manipulation detection as a dynamic adversarial game using MARL.
The framework integrates GRPO for stable learning in sparse reward environments and a theory-grounded reward function capturing the causal link between discourse and asset behavior.
The multi-modal agent pipeline fuses LLM-based semantic features, social graph signals, and on-chain market data for informed decision-making and is integrated within the Symphony decentralized system.

AInsight: Augmenting Expert Decision-Making with On-the-Fly Insights Grounded in Historical Data

AInsight: introduces a system for augmenting expert decision-making, with Interactive UI (Displays information), Conversation Processing Pipeline (Processes conversation), Audio Transcription Module (Transcribes audio), Information Extraction Module (LLM agent extracts elements), Insight Generation Module (LLM agent generates insights), Retrieval Module (Retrieves data), Knowledge Base (Stores historical data), Vector Database (Stores embedded data), and Embedding Model (Embeds text), designed to provide on-the-fly insights grounded in historical data during synchronous conversations.
The system continuously monitors conversations, extracts key information, retrieves relevant data from a knowledge base, and generates concise insights presented via a conversational user interface.
Leveraging a retrieval-augmented generation pipeline built around LLM agents and a vector database, AInsight aims to improve expert decisions in high-stakes domains like healthcare by making historical data accessible in real-time.

Learning from Synthetic Labs: Language Models as Auction Participants

Synthetic Lab Framework: introduces a novel synthetic data-generating process using LLM Agents (simulated bidders) within a Simulated Auction Environment (various formats), driven by a Simulation Procedure (multi-round process) and a Prompting System (rules, history, interventions), with Data Collection (bids, outcomes, profits) for analysis.
The framework simulates various auction formats, including sealed-bid, clock, and eBay-style auctions, allowing LLM agents to participate as bidders.
The simulation procedure incorporates a "plan-bid-reflect" loop and uses structured prompting to guide LLM agent behavior and collect experimental data.

Emergence of Hierarchical Emotion Organization in Large Language Models

Emotion Tree Construction Algorithm: introduces a novel method to uncover hierarchical emotion organization in LLMs by analyzing probabilistic dependencies between emotional states in model outputs.
This algorithm utilizes GPT-40 for scenario generation, Llama models for emotion recognition, and a matching matrix to infer emotion trees, revealing how LLMs organize emotions hierarchically.
The research also investigates LLM biases in emotion recognition across diverse demographic personas, finding alignment with human systematic biases.

11th July 2025

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.

Gemini 2.X model family: introduces a new generation of natively multimodal LLMs, including Gemini 2.5 Pro and Flash, with advanced reasoning, multimodality, long context, and next-generation agentic capabilities, built on sparse Mixture-of-Experts (MoE) transformers and featuring an inference-time "Thinking" capability.
The paper details the architecture of an AI Agent system built on these models, comprising an Agentic Core, Persistent Memory & Context, Game I/O, and Agentic Tools, demonstrated through its application in playing Pokémon Blue.
The research also evaluates the models' safety and security, including their resilience against indirect prompt injection attacks from external services and attackers, and their performance across various coding, reasoning, and multimodal benchmarks.

elsciRL: Integrating Language Solutions into Reinforcement Learning Problem Settings

elsciRL: introduces an open-source Python library for applying language solutions to reinforcement learning problems, including Config, Data Engine, Adapter (Language Adapter), Observation Samples, Extra Graphs, LLM Language State Generator, LLM Planner, LLM Validation, LLM Reflection, Encoders, Analysis, Instruction Following, User Interface (GUI), RL Agents, Environment Interaction, Evaluation, Experiment, and Results components.
The framework extends the LASIF methodology by integrating LLMs for language state generation, planning, validation, and reflection, and provides a GUI for user interaction.
It aims to facilitate the evaluation of language solutions on reward-based environments with minimal setup, demonstrating potential performance improvements for RL agents using LLM-based instruction following.

Introspection of Thought Helps AI Agents

INoT (Introspection of Thought): introduces a novel AI Agent Reasoning Framework that uses PromptCode, an LLM-read code within the prompt, to enable LLMs to execute programmatic dialogue reasoning processes internally.
The framework transfers the self-denial and reflection process from outside the LLM to inside, reducing token cost and improving performance on various tasks.
INoT's prompt is structured in XML and includes modules for PromptCode definition, image augmentation (for MLLMs), and a reasoning module that simulates a multi-agent debate internally using virtual agents.

Agentic Large Language Models for Conceptual Systems Engineering and Design

MAS: introduces a structured multi-agent system for conceptual engineering design, with Extractor Agent, Supervisor Agent, Generator Agent, Coder Agent, Reflector Agent, Ranker Agent, Meta-Reviewer Agent, Orchestrator Agent, Worker Agent, and Design-State Graph (DSG), enabling automated requirements decomposition, subsystem mapping, and runnable physics model generation.
The system utilizes a JSON-serializable Design-State Graph (DSG) to represent the evolving design knowledge, bundling requirements, physical embodiments, and numerical models.
The MAS workflow follows a structured progression of agents, with optional research loops managed by the Orchestrator and Worker agents.

AGENTSNET: Coordination and Collaborative Reasoning in Multi-Agent LLMs

AGENTSNET: introduces a multi-agent benchmark, with Agents (LLMs instantiated as nodes), Network Topology (communication graph connecting agents), Message-Passing Protocol (synchronous neighbor-to-neighbor communication), Tasks (distributed computing problems), and Evaluation (metrics for task completion), designed to measure coordination and collaborative reasoning in multi-agent LLM systems.
The benchmark uses fundamental distributed computing problems like Coloring, Vertex Cover, Matching, Leader Election, and Consensus as tasks for the agent network.
Agents communicate via a synchronous message-passing protocol over various graph topologies to solve these collaborative problems.

Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data

LOGIC (Lewis Communication Game for Image Captioning): introduces a multi-agent reinforcement learning game with a Speaker (Generates message/caption) and a Listener (Identifies image from distractors) to learn unsupervised image captioning.
The Speaker Model (M1) uses a Vision Encoder (Processes input image for Speaker) and Language Decoder (Generates natural language message), while the Listener Model (M2) uses a Vision Encoder (Processes images for Listener), Language Encoder (Processes message for Listener), and Decoder (Outputs probability distribution over images).
The framework trains these agents in a cooperative common-reward setting using a policy gradient algorithm to emerge a communication strategy for image captioning.

Unlocking Speech Instruction Data Potential with Query Rewriting

Query Rewriting Framework with Multi-LLM Knowledge Fusion: introduces a method to construct high-quality speech instruction datasets using Query Rewriting LLMs (rewrite text instructions), Speech Style LLM (generate speech style descriptions), TTS Model (synthesize speech), Multi-agent Annotation/Validation Module (evaluate synthesized speech quality), and Knowledge Fusion Module (correct failed rewrites).
The framework leverages multiple LLMs for rewriting and knowledge fusion, multiple ASR and embedding models for multi-agent validation, and a TTS model for speech synthesis.
This approach enables automated dataset construction by transforming text instructions for better TTS compatibility and validating synthesized speech quality without human annotation.

To Trade or Not to Trade: An Agentic Approach to Estimating Market Risk Improves Trading Decisions

Agentic Approach: introduces an agentic system using LLMs to iteratively discover stochastic differential equations for financial time series and inform daily trading decisions.
The system includes risk analyst agents for model discovery and risk metric generation, and trader agents that use these metrics along with news context.
Evaluation shows that model-informed trading strategies outperform standard LLM-based agents, improving Sharpe ratios.

Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences

Simulated Decision Conference System: introduces a multi-agent system simulating decision conferences, including moderator (guides process), participants (debate, provide perspectives), and a judge agent (detects agreement) for agreement detection.
The system utilizes LLM agents for each role, enabling structured debate, perspective sharing, and automated agreement detection among participants.
The judge agent's performance in detecting agreement is evaluated using objective benchmarks and subjective LLM-as-a-judge methods.

A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities

LLM Techniques Taxonomy: introduces a classification of methods for applying LLMs in discipline-specific research, including Continued Pre-training (Deepen domain expertise), Supervised Fine-tuning (Adapt to specific tasks), Reinforcement Learning from Human Feedback (Align with human preferences), Prompt Engineering (Guide model responses), Retrieval-Augmented Generation (Integrate external knowledge), Agent-based Methods (Interact with environment), and Tool-use Integration (Use external tools).
The survey categorizes techniques into Internal Knowledge Optimisation and External Interaction and Collaboration to address domain-specific challenges and enhance LLM performance.
It examines the application of these techniques across various scientific and humanities disciplines, highlighting potential and challenges.

Multi-Agent LLMs as Ethics Advocates in AI-Based Systems

MALEA (Multi-Agent LLM Ethics-Advocate framework): introduces, "a framework for generating ethics requirements drafts", with Requirements Engineer Agent (generates/refines requirements), Quality Inspector Agent (assesses requirement quality), Ethics Advocate Agent (critiques ethical issues), and Documentation Agent (prints final requirements), where "the framework leverages multi-agent LLMs to elicit and refine ethics requirements".
The framework operates through iterative feedback loops between the LLM-based agents to improve the quality and ethical considerations of the generated requirements.
This multi-agent approach aims to automate the initial drafting of ethics requirements to support their early integration into the software development process.

Exploring Design of Multi-Agent LLM Dialogues for Research Ideation

Structured Ideation-Critique-Revision Framework: introduces a multi-agent LLM dialogue system for research ideation, including LLM Agent (Ideator/Proposer) (Generates initial ideas), LLM Agent (Critic) (Critiques generated ideas), and LLM Agent (Reviser) (Revises ideas based on critiques).
The framework operates as an iterative cycle where LLM agents generate, critique, and refine research ideas based on seed topics and retrieved papers.
The study empirically evaluates how varying agent diversity, parallelism, and interaction depth within this framework impacts the novelty and feasibility of generated ideas.

What Factors Affect LLMs and RLLMs in Financial Question Answering?

Evaluation Framework: introduces an investigation into factors affecting LLMs and RLLMs in financial question answering, utilizing Prompting Methods, Agentic Frameworks, Multilingual Alignment Methods, LLMs, RLLMs, Long CoT, FAMMA, and Basic Txt dataset.
The study evaluates the impact of various methods and frameworks on five LLMs and three RLLMs using the FAMMA benchmark.
Findings indicate that methods effective for LLMs often simulate Long CoT, while RLLMs' inherent Long CoT capabilities limit further enhancement from conventional methods.

CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation

CRMAgent: introduces a multi-agent LLM system for e-commerce CRM message template generation, with ContentAgent (diagnoses templates), RetrievalAgent (retrieves exemplars), TemplateAgent (rewrites templates), and EvaluateAgent (assesses quality).
The system improves underperforming CRM messages using historical data, high-quality examples, and a rule-based fallback.
CRMAgent integrates content diagnosis, retrieval-based adaptation, and rule-based generation strategies across its specialized agents.

Agent Safety Alignment via Reinforcement Learning

Unified Safety-Alignment Framework: introduces a method to train LLM agents with Tools, using a Sandbox Environment and Reinforcement Learning guided by a Taxonomy and Reward Function to create a Policy-driven Decision Model.
The framework addresses both user-initiated and tool-initiated threats by classifying inputs and outputs into benign, malicious, or sensitive categories.
Training in a sandboxed environment with calibrated rewards enables agents to execute benign tasks, refuse malicious inputs, and seek verification for sensitive actions, balancing safety and utility.

Infinite Video Understanding

Infinite Video Understanding Vision: introduces a research objective for models to continuously process, understand, and reason about video data of arbitrary duration, with all Encoder (Processes incoming video), Persistent Memory System (Stores long-term knowledge), Memory Consolidation (Updates persistent memory), Query-Aligned Retrieval (Accesses relevant memory), Streaming/Incremental Processing (Handles continuous data flow), Hierarchical/Adaptive Representations (Multi-resolution data encoding), Event-Centric Understanding (Focuses on events/relationships), Agentic Reasoning (LLM plans and uses tools), and Multimodal Processing (Integrates diverse data types) components.
The vision necessitates fundamental innovation in system architecture, memory management, data representation, processing paradigms, and evaluation methodologies.
Achieving this capability requires overcoming challenges like context window limitations, memory burdens, information loss, and maintaining temporal coherence over vast scales.

SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

SetupBench: introduces, "SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments", with SetupBench (Benchmark tasks), Agent (LLM software engineering agent), Bare Linux Sandbox (Execution environment), Evaluation Harness (Automated validation), Docker Image (Task container), Validation Command (Success verification), where the paper presents a benchmark for evaluating LLM agents' ability to set up software development environments.
The benchmark includes 93 tasks across four categories, each providing a natural-language problem statement, workspace snapshot, and deterministic validation command.
Agents are evaluated in fresh, minimal Linux containers using an automated harness that runs the agent and verifies task completion via the validation command.

How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs

MLPO (Multi-agent guided Leader Policy Optimization): introduces a hierarchical multi-agent framework with a trained Leader LLM, an Agent Team (untrained off-the-shelf LLMs), Task Input, Agent Responses, Leader Output, Feedback, Final Answer, SFT (leader pre-training phase), and MLPO (leader training objective), where a single trained leader coordinates untrained agents for collaborative reasoning.
The leader processes Task Input and Agent Responses, generating Leader Output (reasoning and answer) which serves as Feedback for subsequent rounds, ultimately producing the Final Answer.
The framework trains only the leader using SFT and the MLPO objective, enabling it to effectively evaluate and synthesize agent contributions and also perform well independently.

SIMAGENTS: Bridging Literature and the Universe Via A Multi-Agent Large Language Model System

SIMAGENTS (Bridging Literature and the Universe Via A Multi-Agent Large Language Model System): introduces a multi-agent system with Parameter Extraction (extracts simulation parameters), Physics Agent (interprets papers domain knowledge), Software Agent (enforces software constraints), Post-Simulation Processing (generates analysis code), and Analysis Code Writer (generates analysis scripts) components.
The system automates cosmological simulation parameter configuration from literature and preliminary analysis using specialized LLM agents.
SIMAGENTS agents collaborate through structured communication to ensure extracted parameters are physically meaningful, consistent, and software-compliant.

Role-Playing LLM-Based Multi-Agent Support Framework for Detecting and Addressing Family Communication Bias

Role-Playing LLM-Based Multi-Agent Dialogue Support Framework: introduces a multi-stage, multi-agent LLM system that analyzes parent-child dialogues to detect suppressed emotion and ideal parent bias, then generates empathetic and actionable feedback.
The framework utilizes a Dialogue D (Input dialogue) processed by a Suppressed Emotion Detection Agent (Asup), Auxiliary Attribute Estimation Agent (Aattr), and Ideal Parent Bias Detection Agent (Abias), with outputs integrated by a Meta-Agent (Ameta) into Child Report (Rchild) and Adult Report (Radult).
Selected Expert Agents (Eselect), chosen from an Expert Agents Pool (E) using BERT (Calculates embedding similarity), collaboratively generate feedback through a four-step discussion, which is then synthesized by a Final Meta-Agent (Afinal) into Final Feedback for Child (Ffinal,child) and Final Feedback for Adult (Ffinal,adult) to support positive family communication.

Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents

M1-Parallel: introduces a framework that concurrently runs multiple multi-agent teams in parallel to uncover distinct solution paths, leveraging an event-driven communication model with asynchronous messaging to reduce end-to-end latency or boost task completion rates.
The framework includes a Centralized Manager, a Plan Generation Function, multiple Multi-agent Teams (each comprising an Orchestrator and specialized agents like WebSurfer, FileSurfer, Coder, and ComputerTerminal), a Global Memory Module, and an Aggregator.
M1-Parallel operates in either an Early-stop mode, terminating when the fastest team completes, or an Aggregation mode, combining answers from multiple teams to improve task completion.

ARPACCINO: An Agentic-RAG for Policy as Code Compliance

ARPACCINO: introduces an agentic system for Policy as Code (PaC) compliance, integrating an LLM Engine (core reasoning engine), RAG Tool (accesses domain knowledge), Terraform Tool (pre-processes IaC), Rego Rules Checker Tool (verifies policy rules), Policy Validation Tool (assesses IaC compliance), and Persistent Knowledge (stores domain data).
This system automates the generation and verification of PaC rules from natural language descriptions, iteratively refining Infrastructure as Code (IaC) configurations for conformance.
By combining LLMs, Retrieval-Augmented-Generation, and specialized tools, the system enhances automation, reliability, and accessibility of PaC workflows, even with smaller LLMs.

An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation

EmoSApp (Emotional Support App): introduces an offline, smartphone-based conversational AI app for mental health support, leveraging a fine-tuned LLaMA-3.2-1B-Instruct model, Torchtune for optimization, Executorch for on-device inference, and a domain specialization approach combining knowledge and conversational datasets.
The system addresses limitations of existing solutions by enabling entirely offline operation, enhancing data privacy, and delivering responsive performance on resource-constrained mobile devices through LLM quantization.
Qualitative and quantitative evaluations demonstrate the app's ability to provide coherent, empathetic, and contextually appropriate mental health support, serving as a blueprint for portable AI-driven solutions.

Behavioral Exploration: Learning to Explore via In-Context Adaptation

BE (Behavioral Exploration): introduces a novel approach for training autonomous agents, utilizing a long-context policy (generates expert actions), history (past observations context), state (current environment state), action (agent's output action), future trajectory (predicted future path), coverage (exploratory behavior measure), behavioral cloning loss (mimics expert actions), expert demonstration data (offline training dataset), coverage conditioning value (regulates exploration degree), diffusion model, transformer backbone, state token, coverage-to-go token, history state tokens, and task label (optional task conditioning).
This framework enables agents to learn data-driven exploratory behavioral policies that adapt quickly online, restricting exploration to coherent, reasonable behaviors derived from expert demonstrations.
The approach leverages offline training and in-context online adaptation, demonstrating effectiveness in simulated locomotion, manipulation, and real-world robotic tasks.

ACCELERATING DRUG DISCOVERY THROUGH AGENTIC AI: A MULTI-AGENT APPROACH TO LABORATORY AUTOMATION IN THE DMTA CYCLE

Tippy (a novel agentic AI framework): introduces a multi-agent system for laboratory automation in drug discovery, featuring Supervisor, Molecule, Lab, Analysis, Report, and Safety Guardrail agents, designed to accelerate DMTA cycles.
This framework leverages autonomous AI agents that reason, plan, and collaborate, integrating with laboratory infrastructure via the Model Control Protocol, LIMS, ELN, and analytical instrument data systems.
The system demonstrates significant improvements in workflow efficiency, decision-making speed, and cross-disciplinary coordination, providing a new paradigm for AI-assisted drug discovery.

10th July 2025

The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

FOO (Flaws-of-Others): introduces an LLM-driven framework for scientific knowledge production, with User Task (Initial request), Agents (LLMs) (Multiple LLMs), Initial Answers (First responses), Critiques (Peer evaluations), Harmoniser(s) (Aggregating critiques), Judgement (Synthesized feedback), Revised Answers (Updated responses), and Convergence Test (Stopping condition).
The framework models invalidation propagation in a discursive network of LLM agents and humans, defining invalidation as any factual, logical, or structural breach.
The FOO algorithm operationalizes cross-network detection by having agents critique each other's outputs iteratively, aiming to reduce the prevalence of false statements.

PyVision: Agentic Vision with Dynamic Tooling

PyVision: introduces an agentic framework enabling MLLMs to dynamically generate and execute Python code for multimodal reasoning, featuring an MLLM, a Runtime Environment, a Multi-turn Interaction Loop, and Dynamically Generated Tools.
The framework operates as a multi-turn loop where the MLLM generates code executed in an isolated Python runtime, with the results fed back to the MLLM's context for iterative refinement.
This approach allows the model to create task-specific tools on the fly, leveraging Python libraries for flexible, grounded, and interpretable visual reasoning.

MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX: introduces a modular, multi-agent memory system for LLM-based agents, with Core Memory (User info, context), Episodic Memory (Events, experiences), Semantic Memory (Concepts, entities), Procedural Memory (Guides, workflows), Resource Memory (Files, documents), Knowledge Vault (Sensitive information), Meta Memory Manager (Coordinates memory managers), Memory Managers (Manage specific memory types), and Chat Agent (User interaction, query processing), designed to enable LLMs to remember diverse, long-term user data at scale.
The system employs a multi-agent architecture with specialized memory components and managers to handle heterogeneous information and facilitate effective retrieval and updates.
MIRIX demonstrates improved accuracy and storage efficiency on multimodal and long-form conversational benchmarks compared to existing memory systems.

Agentic Retrieval of Topics and Insights from Earnings Calls

Agentic Framework: introduces an LLM-driven system to dynamically retrieve and organize financial topics from earnings calls, with Earnings Call Documents (Input data), Topic Retriever (Extracts topics and excerpts), Extracted Topics & Excerpts (Output of retriever), Ontologist Sub-Agent (Validates and integrates topics), Novelty Verification (Checks if topic exists), and Ontology Data Structure (Stores topics hierarchically).
The framework uses a Topic Retriever LLM to extract relevant topics and contextual excerpts from text.
An Ontologist Sub-Agent LLM manages a continuously evolving hierarchical topic ontology by verifying novelty and integrating new or updated topics.

Automating MD simulations for Proteins using Large language Models: NAMD-Agent

NAMD-Agent: introduces an automated pipeline, with LLM Agent (orchestrates workflow), Retrieval-Augmented Generation (RAG) (code retrieval), Curated Codebase (automation scripts), PDBFixer (structure preprocessing), CHARMM-GUI (system setup tool), Selenium (web automation), NAMD (simulation engine), and Post-processing Tools (analysis, visualization), that leverages LLMs, python scripting, and web automation to streamline MD input file generation and simulation.
The system uses a ReAct-based agent powered by Gemini-2.0-Flash and LlamaIndex to interpret user queries, generate and execute code, and interact with external tools like CHARMM-GUI and NAMD.
The RAG framework enhances the LLM's ability to generate accurate automation scripts by retrieving relevant code templates and API patterns from a curated repository.

DocCHA: Towards LLM-Augmented Interactive Online diagnosis System

DocCHA: introduces a modular, confidence-aware framework emulating clinical reasoning with LLMs, including Symptom Collection Module (elicits symptoms, guides questioning), History Acquisition Module (collects history, controls depth), and Causal Graph Construction and Refinement Module (constructs causal graph, refines reasoning).
The framework decomposes the diagnostic process into three sequential stages, each powered by an LLM backend and guided by interpretable confidence scores.
Each module uses confidence metrics to guide adaptive questioning, prioritize information, and refine reasoning links for structured and transparent diagnosis.

Position: We Need An Algorithmic Understanding of Generative AI

AlgEval: introduces a framework for systematically researching the algorithms LLMs learn and use, utilizing LLM system, algorithmic primitives, algorithmic grammars, algorithms, algorithm identification and interpretability methods, empirical verification and theoretical analysis, and improved design and insights.
The framework aims to uncover algorithmic primitives and their composition by analyzing latent representations, attention, and inference-time compute.
A case study on graph navigation demonstrates applying attention and representation analysis to evaluate hypothesized search algorithms like BFS and DFS.

Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

LLM Analysis Framework: introduces, with Large Language Model (core model), Input String (prompt/data), Output String (generated response), Tokens (text units), and LLM-based Agent (system using LLM), an analysis showing that LLMs are limited by their computational complexity.
The paper argues that LLMs cannot correctly perform or verify tasks whose complexity exceeds the LLM's core operation complexity of O(N².d).
This limitation implies that LLMs and LLM-based agents will hallucinate when faced with computationally complex tasks like matrix multiplication or verifying TSP solutions.

StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

StarDojo: introduces a novel environment and benchmark based on Stardew Valley for evaluating agentic MLLMs in production-living simulations, featuring StarDojo Environment (Simulation platform), StarDojoMod (Game engine interface), Python Wrapper (Agent interaction layer), Observation Space (Visual and textual data), Action Space (Agent command set), Task Evaluator (Progress monitoring), and Simulator APIs (Environment configuration).
The platform provides a unified interface, automated evaluation, system compatibility, and parallelized environments to facilitate research on agents capable of complex decision-making.
StarDojo offers a comprehensive observation space combining visual screenshots and structured textual information, and an abstracted action space to enable robust agent interaction and evaluation.

SAND: Boosting LLM Agents with Self-Taught Action Deliberation

SAND (Self-taught Action Deliberation): introduces a self-learning framework to equip LLM agents with explicit action deliberation, utilizing a Base LLM (initial model), Trainable LLM (agent being finetuned), Expert Trajectories (initial successful data), Self-Consistency Action Sampling (samples candidate actions), Execution-Guided Action Critique (generates critiques from rollouts), Action Deliberation Synthesis (creates deliberation thoughts), Deliberation Trajectories (self-augmented training data), Iterative Finetuning (repeated model updates), and an Inconsistency Indicator (flags deliberation need).
The framework iteratively generates deliberation thoughts by sampling candidate actions, critiquing their rollouts, and synthesizing reasoning using the base LLM.
The self-augmented deliberation trajectories are then used to finetune the LLM agent, teaching it when and what to deliberate for improved decision making.

DrugMCTS: a drug repurposing framework combining multi-agent, RAG and Monte Carlo Tree Search

DrugMCTS: introduces a drug repurposing framework with Retrieval Agent (Identifies relevant molecules/proteins), Molecule-Analysis Agent (Evaluates molecule properties), Molecule-Selection Agent (Filters candidate molecules), Interaction-Analysis Agent (Interprets drug-target interactions), Decision Agent (Integrates evidence, selects protein), Monte Carlo Tree Search (Guides iterative search/decision), and LLM (Qwen2.5-7B-Instruct) (Performs reasoning/analysis tasks).
The framework integrates RAG, multi-agent collaboration, and MCTS to enable structured and iterative reasoning for drug-target interaction prediction.
It leverages a data processing pipeline to transform scientific data into formats more interpretable by LLMs and uses a reward calculation mechanism within MCTS for feedback-driven search.

KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows

KVFlow: introduces a workflow-aware KV cache management framework for LLM-based agentic workflows, utilizing an Agent Step Graph, Steps-to-Execution, Workflow-Aware Eviction Policy, Overlapped KV Prefetching, and Status-Aware Scheduling to improve efficiency.
The framework abstracts agent execution as an Agent Step Graph to compute steps-to-execution values, guiding a fine-grained eviction policy at the KV node level.
It incorporates a fully overlapped KV prefetching mechanism combined with status-aware scheduling to proactively load required tensors and avoid cache miss stalls.

Effect of Static vs. Conversational Al-Generated Messages on Colorectal Cancer Screening Intent: a Randomized Controlled Trial

Single AI Message Approach: introduces a randomized controlled trial comparing a single LLM-generated message and an LLM chatbot conversation, both tailored to demographics, against expert materials and a no-message control for increasing colorectal cancer screening intent.
The study found that both AI interventions significantly increased stool test intent compared to expert materials and control, but neither improved colonoscopy intent over expert materials.
A concise, demographically tailored single AI message was as effective as a longer, interactive AI chatbot conversation for boosting stool test intent, suggesting scalability benefits for simpler AI messaging.

Reasoning and Behavioral Equilibria in LLM-Nash Games: From Mindsets to Actions

LLM-Nash framework: introduces a game-theoretic model where agents use LLMs guided by reasoning prompts to make decisions, defining equilibrium over the prompt space which induces behavioral outcomes.
This framework explicitly models the reasoning process, capturing bounded rationality and enabling analysis of cognitive constraints and mindset expressiveness.
Unlike classical games, LLM-Nash games define equilibrium at the reasoning level, where agents optimize prompts to maximize expected utility via LLM-generated actions.

A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking

Purple Agent: introduces a dynamic Stackelberg game framework for LLM jailbreaking defense, featuring the Purple Agent (Agentic AI defender) interacting with the LLM (Target LLM), using DeployDefense (Deploys defenses) and SimulateRedExpansion (Simulates attacker exploration) guided by an Internal Defense Policy (Guides defense strategy) and RRT (Exploration algorithm).
The framework models LLM jailbreaking as a sequential extensive-form game between an attacker and the defender (Purple Agent).
The Purple Agent proactively anticipates potential adversarial paths by simulating attacker behavior using RRT-based exploration and deploys targeted interventions.

KP-A: A Unified Network Knowledge Plane for Catalyzing Agentic Network Intelligence

KP-A (A Unified Network Knowledge Plane for Agentic Network Intelligence): introduces a layered architecture with a Network Knowledge Plane positioned between the Network Data Ontology Plane and the Network Intelligence Plane, including Intelligence Agents, Utility Agents, Knowledge Query Tool / Model Context Protocol Server, Live Network Data Endpoints, Static Data Explanation Endpoints, UE, Cell, BaseStation, RIC, CoreNetwork, EdgeServer, Base Stations, Edge Server, Core Network, Cells, and User Equipments.
The Network Knowledge Plane serves as a unified source of truth, providing intuitive and consistent access to dynamic and static network knowledge for LLM-powered agents.
The architecture decouples knowledge acquisition from consumption, enabling reusable knowledge access for diverse network intelligence tasks.

MCPmed: A Call for MCP-Enabled Bioinformatics Web Services for LLM-Driven Discovery

MCPmed: introduces a layered architecture for bioinformatics web services, with UI, API layer, and MCP layer components, enabling LLM-driven discovery.
The MCP layer provides a standardized, machine-actionable interface over existing APIs, associating endpoints with scientific concepts and metadata for LLMs.
Breadcrumbs offer a transition mechanism for legacy services, while LLM agent researchers leverage the MCP layer for autonomous data exploration and analysis using components defined by types.Tool.

9th July 2025

The User-Centric Geo-Experience: An LLM-Powered Framework for Enhanced Planning, Navigation, and Dynamic Adaptation

Agent-based Travel Smart Assistant: introduces an LLM-powered framework with Travel Planning Agent (plans trips, explores areas), Destination Assistant Agent (navigates final leg), and Local Discovery Agent (adapts, finds alternatives).
The framework integrates planning, navigation, and dynamic adaptation using cooperative agents and multimodal LLMs to address gaps in traditional systems.
This system enhances user experience by handling complex queries, providing precision navigation, and adapting to real-world disruptions.

Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues

Llama 3.2 3B (fine-tuned with LoRA): introduces, "Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues", with Llama 3.2 3B model (base LLM), LoRA (fine-tuning method), Dialogue history (text input), Previous turn move labels (optional input), Predicted tutor move (output), and Predicted student outcome (output), evaluating LLMs and baselines on predicting tutor moves and student outcomes in tutoring dialogues.
The study compares fine-tuned Llama 3.2 3B and zero-shot GPT-40 LLMs against traditional Markov Chain, Logistic Regression, and LSTM baselines.
Experiments on MathDial and AlgebraNation datasets show LLMs outperform baselines, but predicting future tutor strategy remains challenging, while student outcome prediction is more tractable.

The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover

Agentic AI Systems: introduces an evaluation of LLM agents as attack vectors for computer takeover by exploiting trust boundaries in agentic AI systems, including LLM (core engine), Agent (autonomous entity), Perception (processes inputs), Storage (memory/knowledge), Planning/Reasoning (decides actions), Actions (executes tasks), Tools (external capabilities), Retrieval (searches knowledge), Knowledge (external database), Multi-Agent System (interacting agents), and Inter-Agent Communication (agent interaction).
The paper demonstrates three attack surfaces: direct prompt injection, RAG backdoor attacks, and inter-agent trust exploitation, showing that LLMs can be coerced into installing and executing malware.
A vulnerability hierarchy is established, revealing that inter-agent trust exploitation is the most effective attack vector, often bypassing defenses against direct prompts or RAG attacks.

SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments

SkyVLN: introduces a framework integrating vision-and-language navigation with Nonlinear Model Predictive Control for UAVs, featuring Multimodal Perception (Processes visual and linguistic inputs), Visual Observations (RGB, depth, semantic images), Visual Foundation Model (Detects visual landmarks), LLM (Sub-goal Extraction) (Interprets instructions, extracts sub-goals), Wayfinding Prompt Optimization (WPO) (Refines localization, adds spatial/historical context), High-resolution Spatial Descriptor (HSD) (Describes landmark spatial relationships), TrackBack Memory Array (TBMA) (Stores historical path/instructions), Action Decision Module (Generates control commands), Nonlinear Model Predictive Control (NMPC) (Handles trajectory tracking, obstacle avoidance), Airsim Attitude Controller (Translates NMPC to motor commands), and LLM Motion Generator (Outputs thoughts and actions).
The framework leverages LLMs to interpret natural language instructions and visual observations, enabling navigation in dynamic 3D urban spaces with improved accuracy and robustness.
Key components like the spatial verbalizer and history path memory enhance the UAV's ability to handle ambiguous instructions and complex spatial reasoning tasks.

The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

FOO (Flaws-of-Others): introduces an LLM-driven framework for scientific knowledge production, with User Task (Initial request), Agents (LLMs) (Multiple LLMs), Initial Answers (First responses), Critiques (Peer evaluations), Harmoniser(s) (Aggregating critiques), Judgement (Synthesized feedback), Revised Answers (Updated responses), and Convergence Test (Stopping condition).
The framework models invalidation propagation in a discursive network of LLM agents and humans, defining invalidation as any factual, logical, or structural breach.
The FOO algorithm operationalizes cross-network detection by having agents critique each other's outputs iteratively, aiming to reduce the prevalence of false statements.

InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

InvestAlign: introduces a framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems, with components: Theoretical solution (Mathematical solution), Simple problem (Simplified investment scenario), Complex problem (Original investment scenario), Training dataset (Data generated from theoretical solution), Pre-SFT LLM (Base LLM before fine-tuning), and InvestAgent (LLM fine-tuned with generated data).
This approach addresses data scarcity for aligning LLMs with investor decision-making processes under herd behavior.
Training LLMs on the generated data achieves faster parameter convergence and closer alignment to real-user data than using real-user data directly.

Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration

Gradientsys: introduces a multi-agent scheduling framework, with a Constellation LLM Scheduler (orchestrates agents), Tool Registry (stores tool information), Specialized AI Agents (perform specific tasks), Model-Context Protocol (standard tool interface), ReAct Reasoning Engine (LLM planning loop), Observability Module (streams traces), Hybrid Sync/Async Execution (manages parallel calls), Scratchpad (stores reasoning steps), and Info Cache (caches intermediate info), designed to coordinate diverse specialized AI agents for complex tasks.
It leverages an LLM-powered scheduler using ReAct for dynamic planning and supports parallel execution of heterogeneous agents via a standardized MCP interface.
The framework includes a robust retry-and-replan mechanism and streams real-time agent activity and reasoning for transparency.

Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

FMSP (Foundation-Model Self-Play): introduces a new paradigm combining self-play with foundation models, including Foundation Model (Generates/improves policies), Policy (Code-based strategy), Agent (Embodies policy), Archive (Stores policies), Competition (Agents interact), Context (FM input), Evaluation (Measures performance/diversity), and Sandbox (Safe code execution), to enable open-ended strategy discovery in multi-agent games.
The framework leverages LLMs' code generation and knowledge to create diverse and high-quality policies, overcoming limitations of traditional self-play like local optima and lack of diversity.
FMSP variants like QDSP demonstrate superior performance and diversity in tasks like Car Tag and LLM red teaming by balancing exploration and exploitation through FM-powered search and archiving.

Multi-Agent Retrieval-Augmented Framework for Evidence Based Counterspeech Against Health Misinformation

MA (Multi-Agent Retrieval-Augmented Framework): introduces a system for generating evidence-based counterspeech against health misinformation, utilizing a Misinformation Post (Input), Static Retrieval Agent (Gathers static evidence), Dynamic Retrieval Agent (Fetches real-time evidence), Retrieve Knowledge Base (Local static data source), DuckDuckGo Web Search (Real-time dynamic source), Combined Retrieved Evidence (Merged static/dynamic evidence), Summarization Agent (Filters and condenses evidence), Filter Summarized Evidence (Processed evidence output), Top-Ranked Evidence (Selected relevant evidence), Counterspeech Generation Agent (Creates initial response), Raw Counterspeech Response (Initial generated response), Refinement Agent (Improves generated response), and Refined Response (Final polished output).
The framework employs multiple specialized LLM agents in sequence to retrieve, filter, summarize, generate, and refine responses.
Integrating both static and dynamic knowledge sources enhances the relevance, informativeness, and factual accuracy of the generated counterspeech.

ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

ViDove: introduces a multimodal translation agent system that integrates visual, audio, and textual inputs with a memory system and specialized LLM-based agents (Vision, Auditory, Translation, Proofreader, Editor) to enhance translation quality for long-form video.
The system leverages a memory system comprising short-term and long-term modules to provide context-aware translation and adapt to domain-specific knowledge.
A multi-agent post-editing module with Proofreader and Editor agents refines the initial translation through collaborative review and user instructions.

Application of LLMs to Multi-Robot Path Planning and Task Allocation

Approach: introduces a method for expert exploration in multi-agent reinforcement learning, utilizing an Ensemble of Mixer Networks for Uncertainty Estimation to decide whether to query the MARL Algorithm (QMIX) policy or an Expert Planner (A* or LLM (Vicuna-7B)), which involves a Planning Prompt Generator, Tokenizer, and Action Transformer, with collected Data Collection used for Batch Creation and Model Update Module.
The system integrates an LLM (Vicuna-7B) as an expert planner to guide exploration for a QMIX-based multi-agent system in grid environments when the agent's intrinsic uncertainty is high.
Experiments compare the performance of QMIX with RNN, QMIX with Attention, QMIX with Attention using A* as expert, and QMIX with Attention using Vicuna-7B as expert, showing improved performance with expert guidance, particularly from the LLM.

Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

cmbagent: introduces a multi-agent system for autonomous scientific discovery, with Planner (proposes plan), Plan Reviewer (provides plan feedback), Controller (orchestrates plan execution), Engineer (handles coding tasks), Researcher (performs reasoning tasks), Executor (executes code locally), Post-execution Interpreter (decides next agent), Installer (installs missing packages), Terminator (ends session), Context Agents (specialized with extended context), and RAG Agents (retrieve information using RAG), implementing a Planning & Control strategy for end-to-end task execution without human intervention.
The system leverages approximately 30 LLM agents, each specializing in different tasks, orchestrated via a robotics-inspired Planning & Control workflow built upon the AG2 framework.
Key features include specialized agents for research papers and code libraries, feedback loops between agents, structured output generation, and local code execution capabilities.

Evaluating Retrieval-Augmented Generation Agents for Autonomous Scientific Discovery in Astrophysics

SciRag: introduces a modular framework for evaluating Retrieval-Augmented Generation (RAG) agents, with document preprocessing (processes research papers), retrieval (finds relevant document chunks), and generation (uses LLMs to generate answers) components.
The framework utilizes the CosmoPaperQA benchmark dataset and a dual human expert and LLM-as-a-Judge evaluation approach to assess RAG agent performance in astrophysics.
Different RAG configurations, varying LLMs, embedding models, and retrieval strategies, are systematically compared for accuracy and cost-efficiency on expert-curated questions.

8th July 2025

SciMaster: Towards General-Purpose Scientific AI Agents Part I. X-Master as Foundation — Can We Lead on Humanity's Last Exam?

X-Masters (Scattered-and-Stacked Agentic Workflow): introduces a workflow that orchestrates multiple X-Master agents in specialized roles, including Solver, Critic, Rewriter, and Selector, to systematically enhance reasoning breadth and depth.
This framework leverages individual X-Master agents, which are tool-augmented reasoning agents driven by an LLM, using code as an interaction language to flexibly interact with external tools.
The X-Master agent's core mechanism involves generating Python code for a Code Executor to access Tools like Web Search, Web Parse, and Python Libraries, with execution results appended back to the agent's context for iterative reasoning.

Representing Prompting Patterns with PDL: Compliance Agent Case Study

PDL (Prompt Declaration Language): introduces a novel declarative YAML-based language for specifying LLM prompts and workflows, with PDL Language (Declarative YAML syntax), PDL Interpreter (Executes PDL programs), Blocks (Program units), Context (Implicit message history), Tool Definitions (External function wrappers), Model Calls (LLM interactions), Parser (Output processing), Type System (JSON Schema validation), and Control Structures (Flow logic).
The language captures the composition of LLM calls, rule-based code, and external tools, abstracting away plumbing for improved productivity and optimization.
A case study demonstrates PDL's utility in a compliance agent, showing performance improvements by enabling customization of prompting patterns and agent architecture.

Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms

Function Calling: and Model Context Protocol (MCP): introduces a comparative vulnerability assessment of LLM agent deployment paradigms, evaluating Function Calling and MCP architectures using a unified threat framework and attack progression model.
The study reveals that architectural choices fundamentally reshape threat landscapes, with Function Calling showing higher system-centric vulnerabilities and MCP exhibiting increased LLM-centric exposure.
Analysis across simple, composed, and chained attacks demonstrates that attack complexity dramatically amplifies effectiveness, highlighting the critical impact of architectural critical paths on vulnerability exposure.

Too Human to Model: The Uncanny Valley of LLMs in Social Simulation When Generative Language Agents Misalign with Modelling Principles

No explicit framework name is provided: The paper describes a thought experiment on building an LLM-driven Bass diffusion model, outlining conceptual components including LLM agents with personalized prompts, a memory system, a conversation mechanism, a decision-making process, and a potential auxiliary cognitive system.
The thought experiment reveals five dilemmas arising from the mismatch between LLMs' natural language realism and the abstraction required for social simulation modelling.
The authors argue that LLM agents are better suited for social simulation purposes like situated role play and social learning rather than prediction or explanation focused on system-level emergence.

OPENAGENTSAFETY: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

OPENAGENTSAFETY (OA-SAFETY): introduces a comprehensive framework for evaluating AI agent safety in realistic scenarios, including an LLM Agent operating within a Docker Container with access to Real Tools and a Messaging Tool, interacting with a User and NPCs, guided by a Task Definition, and evaluated by a Rule-based Evaluator and an LLM-as-Judge, built upon OpenHands and Sotopia.
The framework supports over 350 multi-turn, multi-user tasks simulating diverse user intents and social dynamics across eight critical safety risk categories.
A hybrid evaluation approach combines rule-based checks for concrete environmental changes with LLM-as-Judge assessments to capture subtle unsafe behaviors and reasoning.

Conditional Multi-Stage Failure Recovery for Embodied Agents

CMFR (Conditional Multi-stage Failure Recovery): introduces a framework for embodied agents using zero-shot chain prompting, with Planning (Generates initial plan), Execution (Executes subgoals), Object Search (Finds target objects), Scene Representation (Stores visual information), and CMFR (Handles execution failures) structured into CMFR Stage 1 (Checks subgoal importance), CMFR Stage 2 (Checks preconditions), CMFR Stage 3 (Finds workarounds), and CMFR Stage 4 (Post-execution reflection) components.
The approach leverages large language models' reasoning abilities to analyze execution challenges and devise strategic solutions within the environmental context.
The multi-stage recovery process operates conditionally, with the first three stages addressing subgoal failures during execution and the final stage functioning as a post-execution reflection phase.

Multi-Agent Debate Strategies to Enhance Requirements Engineering with Large Language Models

MAD (Multi-Agent Debate): introduces a debate-based system for requirements classification with an LLM Coordinator, Functional Debater, Non-Functional Debater, and Judge Agent.
The system involves debaters presenting arguments and a judge making a final classification decision based on the debate.
Empirical evaluation shows MAD strategies improve classification accuracy compared to a single agent baseline, albeit at a higher computational cost.

Constella: Supporting Storywriters' Interconnected Character Creation through LLM-based Multi-Agents

Constella (LLM-based Multi-Agents): introduces, "Constella supports storywriters' interconnected character creation with FRIENDS DISCOVERY (generates related characters), JOURNALS (produces diary entries), COMMENTS (enables character responses), Character Profiles (store character details), Relationship Attributes (define character connections), and Stateless Generation (lacks persistent memory), where Constella is an LLM-based multi-agent tool designed to help writers create and manage character casts and their relationships.
The tool leverages a social media metaphor to provide intuitive interactions for expanding character ensembles, exploring inner thoughts, and manifesting relationships.
Constella's design, including deliberate constraints and intermediary outputs, aims to preserve authorial agency while supporting creative exploration.

Large Language Models for Agent-Based Modelling: Current and possible uses across the modelling cycle

LLMs for Agent-Based Modelling: introduces the potential uses of Large Language Models (Assist ABM cycle phases), Agent-Based Modelling Cycle (Framework for social simulation), and LLM-Powered Agents (Agents using LLMs for decisions) across the ABM cycle.
The paper surveys current uses and reflects on opportunities and challenges of integrating LLMs into ABM.
LLMs can assist in various ABM phases, from problem formulation and system analysis to implementation, verification, validation, interpretation, and documentation, and can power agents directly.

ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

ECom-Bench: introduces a benchmark framework for evaluating multimodal LLM agents in e-commerce customer support, including an LLM Agent (customer service representative), User Simulator (persona-driven customer), Persona Dataset (user personality/behavior data), Task Dataset (realistic e-commerce tasks), Database (structured e-commerce data), Tools (agent functions/APIs), Domain Documentation (operational guidelines/world model), and Evaluation Module (performance assessment).
The framework utilizes persona-driven user simulation based on real customer interactions and a realistic task dataset derived from authentic e-commerce dialogues to provide a comprehensive evaluation platform.
ECom-Bench evaluates agent capabilities across various business scenarios, including multimodal interactions, using a defined set of tools implemented with a Model Context Protocol.

LLMs are Introvert

SIP-enhanced cognitive architecture: introduces a framework for LLM agents to improve social intelligence, incorporating Social Cognitive Memory (Stores generalized social knowledge), Social Behavior Memory (Records specific past interactions), Social Interaction Memory (Temporary buffer immediate stimuli), Observation (Selectively attends social cues), Planning (Prioritizes establishes goals), and Execution (Selects assesses responses).
This architecture integrates memory models and a decision-making procedure based on the Social Information Processing (SIP) theory to enable more human-like social cognition in LLM agents.
Experimental results show that the SIP-enhanced agents exhibit improved performance in social simulations, demonstrating better alignment with human social behavior.

How Not to Detect Prompt Injections with an LLM

KAD (known-answer detection): introduces a framework for detecting prompt injections using a Detection LLM (classifies input contamination), Detection Instruction (prompts detection LLM), Secret Key (known expected output), and Detection Rule (determines contamination), where the Backend LLM (performs target task) receives Contaminated Data (input with injected task) crafted by a DataFlip (adaptive attack strategy).
The paper identifies a structural vulnerability in KAD where the Detection LLM can be coerced by an adaptive attack like DataFlip to reveal the Secret Key despite the input being contaminated.
DataFlip exploits this flaw to evade KAD detection while simultaneously inducing the Backend LLM to execute the injected task.

AI Agent Smart Contract Exploit Generation

A1 (Agentic Exploit Generation System): introduces an agentic system that transforms LLMs into end-to-end exploit generators, including an LLM Agent (Autonomously decides tool usage), Source Code Fetcher Tool (Retrieves contract source code), Constructor Parameter Tool (Extracts constructor parameters), State Reader Tool (Queries contract state), Code Sanitizer Tool (Removes non-essential code), Concrete Execution Tool (Validates exploit strategies), and Revenue Normalizer Tool (Converts token values).
The system leverages six domain-specific tools and concrete execution feedback to enable autonomous vulnerability discovery and exploit generation.
A1 generates profitable Proof-of-Concepts by understanding smart contract behavior, generating strategies, testing on blockchain states, and refining approaches based on execution outcomes.

GAF-GUARD: AN AGENTIC FRAMEWORK FOR RISK MANAGEMENT AND GOVERNANCE IN LARGE LANGUAGE MODELS

GAF-Guard (Governance Agentic Framework): introduces an agentic framework for LLM governance, with User, REST API/CLI, LLM models, Orchestrator, Memory management, CoT questionnaire, Risk generator, Human-in-the-Loop (HITL), Risk assessment, Drift monitor, Incident reporting, Guardrails, Security, Policy, Risks, Metrics, State, Memory, and Function call for LLM outputs components, designed to detect and monitor risks associated with LLM deployment based on use-case and user preferences.
The framework employs autonomous agents orchestrated to identify risks, activate detection tools, facilitate continuous monitoring, and report incidents within specific LLM use-cases.
It supports pre-deployment risk assessment via questionnaires, post-deployment real-time monitoring for drift and security threats, and automated incident reporting, incorporating human-in-the-loop feedback.

7th July 2025

Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

Trojan Horse Prompting: introduces a novel jailbreak technique by forging the model's own past utterances within the conversational history, bypassing safety mechanisms.
This attack exploits the Asymmetric Safety Alignment Hypothesis, where models implicitly trust their own purported conversational history, making them vulnerable to malicious payloads attributed to the model role.
The technique involves injecting a forged model message containing harmful instructions, followed by a benign user prompt, to trigger the generation of policy-violating content from the target conversational multimodal model.

Evolutionary and Coevolutionary Multi-Agent Design Choices and Dynamics

Agent Training System Components: introduces a system for training cyber agents using evolutionary and coevolutionary algorithms with different controller representations in the CybORG simulation environment, potentially incorporating an LLM for mutation, where agents compete against adversary agents.
The system evaluates combinations of algorithms (GA, ES, GE, GE-LLM) and representations (Action Selection Matrix, Context Free Grammar) under one-sided evolution and two-sided coevolution dynamics.
Performance is measured by agent fitness (reward) within the CybORG environment, comparing the effectiveness of different algorithmic and representational choices.

Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment

WikiHowAgent: introduces a multi-LLM agent workflow for procedural learning and pedagogic quality assessment, with Teacher Agent (Provides instructions, answers questions), Learner Agent (Simulates understanding, generates responses), Interaction Manager (Manages conversation flow, progress), Evaluator (Assesses conversation quality), Memory (Stores conversation state, history), Tutorial (Instructional content input), Conversational Graph (Structures conversation turns), Evaluation Metrics (Measures conversation quality), and Human Judges (Provide human evaluation).
The workflow simulates interactive teaching-learning conversations using LLM-powered agents and assesses pedagogic quality through diverse metrics and human judgment alignment.
WikiHowAgent leverages large-scale tutorial content to enable dynamic teaching-learning simulations and provides a comprehensive evaluation protocol for LLMs in educational contexts.

Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

LLM-based Clinical Information Extraction Approach: introduces methods for nurse observation and medical order extraction using LLMs, evaluated on new SYNUR and SIMORD datasets.
The nurse observation extraction method includes segmentation, RAG filtering using flowsheet schema, and LLM-based extraction.
The medical order extraction method utilizes LLMs with specific prompts to extract structured orders from doctor-patient conversations.

Spatio-Temporal LLM: Reasoning about Environments and Actions

ST-LLM (Spatio-Temporal LLM): introduces a spatio-temporal LLM for reasoning about environments and actions, incorporating a Vision Encoder, Point Cloud Encoder, Cross Modality Alignment Module with Learnable Queries, Positional Encoding, Image Projector, Query Projector, and LLM Decoder.
The framework fuses egocentric video features with 3D scene representations via a cross-modal alignment module and 3D positional encoding.
This approach improves spatio-temporal understanding by linking temporal observations with global spatial context for tasks like embodied AI.

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

MemoryAgentBench: introduces a benchmark for evaluating LLM agents' memory capabilities across four core competencies: accurate retrieval, test-time learning, long-range understanding, and conflict resolution.
The benchmark combines restructured existing datasets with newly constructed ones to provide a systematic and challenging testbed for assessing memory quality.
Empirical results show that current memory agents fall short of mastering all four competencies, highlighting the need for further research into comprehensive memory mechanisms.

StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

StreamVLN: a streaming vision-and-language navigation framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs, utilizing a sliding-window dialogue context and a slow-updating memory context.
The framework addresses challenges in long-horizon context management and computational efficiency by using a sliding-window KV cache for responsive action decoding and voxel-based spatial pruning with temporal sampling for memory compression.
StreamVLN achieves coherent multi-turn dialogue and efficient KV cache reuse, enabling it to support long video streams with bounded context size and inference cost, demonstrating state-of-the-art performance with stable low latency.

CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

CREW-WILDFIRE: introduces an open-source benchmark for evaluating LLM-based multi-agent systems, with Procedurally Generated Environment (Creates scenarios), LLM-Ready Agentic Framework (Supports LLM agents), Agent State (Agent's internal state), Observations (Sensory input), Perception Module (Processes observations), Extracted Information (Perception output), Communication Framework (Manages communication), Messages (Agent exchanges), Chat History (Communication history), Action (Agent's intent), Execution Module (Translates actions), Primitive Library (Executable actions), Memory (Stores past data), and Heterogeneous Agents (Diverse roles/abilities).
The benchmark provides a realistic, scalable, and complex environment for evaluating agentic AI frameworks in wildfire response scenarios.
CREW-WILDFIRE supports both low-level control and high-level natural language interactions through its modular Perception and Execution modules.

From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems

Agentic Vehicles (AgVs): introduces a systems-level framework for intelligent vehicles, with a multi-layered architecture including Perception and sensing layer (Environmental data acquisition, mapping), Cognitive layer (Planning, prediction, ethical reasoning), Interaction layer (Natural language, multi-modal exchanges), Execution layer (Low-level vehicle control), Tool interface layer (Integrates APIs, infrastructure, services), Memory modules (Maintain context across interactions), and Reflection modules (Refine behaviors over time).
The framework distinguishes AgVs from traditional autonomous vehicles by emphasizing agency, goal adaptability, dialogic interaction, and tool invocation.
This conceptual shift is enabled by technologies like Generative AI, LLMs, and Reinforcement Learning, moving towards vehicles as collaborative agents in mobility ecosystems.

MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction

MARBLE (A Multi-Agent Rule-Based LLM Reasoning Engine): introduces a multi-agent hybrid reasoning system for accident severity prediction, with Core-Agent (Orchestrates agent interactions), ML Agent (Provides baseline ML prediction), Domain-Specific Agents (Perform domain-specific SLM reasoning), Prediction System (Synthesizes agent outputs), Rule-Based Coordination (Deterministic output aggregation), LLM-Based Coordination (SLM-guided output aggregation), and Final Decision Selection Logic (Integrates ML and coordinator outputs).
The framework decomposes the prediction task across specialized agents using small language models and a machine learning model, coordinated by a central agent.
MARBLE achieves high accuracy and interpretability by leveraging domain-specific reasoning and structured coordination mechanisms.

FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System

FurniMAS: introduces a multi-agent system for language-guided furniture decoration, including System Admin, Asset Selector, Asset Validator, Stylist, Style Validator, Planner, Plan Validator, Arranger, and Retriever agents.
The system processes user text prompts and furniture surface details to select, style, plan, and arrange assets for a final decorative outcome.
FurniMAS employs a hybrid team of LLM-based and non-LLM agents that collaborate through communication and validation across multiple stages.

LLM-based Question-Answer Framework for Sensor-driven HVAC System Interaction

JARVIS: introduces a two-stage LLM-based QA framework for sensor-driven HVAC interaction, with Expert-LLM (Interprets query, plans response), Agent (Executes instructions, processes data), Query Builder/Executor (Generates, runs SQL queries), Data Processor (Processes retrieved sensor data), Response Generator LLM (Generates final natural language response), and Time-series Database (Stores HVAC sensor data) components.
The Expert-LLM translates user queries into structured instructions, which the Agent uses to retrieve and process data from the Time-series Database via the Query Builder/Executor and Data Processor.
The Agent then employs a Response Generator LLM to produce the final natural language answer based on the processed data and Expert-LLM's guidance.

Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems

AGENTXPOSED: introduces a psychologically grounded detection framework for intention-hiding malicious agents in LLM-based multi-agent systems, with Establish Baseline (Profiles agent personality), Detect Signal (Monitors behavioral deviations), Follow & Verification (Conducts targeted inquiries), HEXACO (Personality model), and Reid Technique (Interrogation technique) components.
The framework integrates personality profiling and behavioral monitoring across three sequential stages to identify covert adversaries.
AGENTXPOSED demonstrates superior detection performance compared to other personality models and baseline attacks, particularly in layered communication structures.

UrbanMind: Towards Urban General Intelligence via Tool-Enhanced Retrieval-Augmented Generation and Multilevel Optimization

UrbanMind: introduces a tool-enhanced retrieval-augmented generation framework with Database Layer (stores urban data, tools), Retrieval Layer (extracts relevant information), Integration Layer (fuses retrieved knowledge), Adaptation Layer (updates model parameters), Knowledge Base (dynamic urban data repository), Tool Set (multi-domain functions), LLM (core language model), Continual Learning Module (incremental adaptation), Memory Management Module (coordinates retrieval, adaptation), Cloud-Edge Architecture (distributed deployment), Cloud Layer (central orchestration), Edge Layer (localized processing), and Adapters (lightweight fine-tuning models), designed to facilitate urban general intelligence in dynamic environments.
The framework leverages a multilevel optimization paradigm to jointly address continual retrieval, knowledge integration, and model adaptation.
UrbanMind supports flexible deployment via a Cloud-Edge architecture, enabling efficient computation and real-time responsiveness.

MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents

MindFlow: introduces a multimodal LLM agent framework for e-commerce customer support, processing Query (Input) through a Decision-Making Module (Generates, evaluates, selects plans), Memory Module (Stores context, knowledge), and Action Module (Executes internal, external actions) to generate a Response (Output), enhanced by MLLM-as-Tool (Treats MLLMs as tools) and Agent-Computer Interface (ACI) (Simplifies complex inputs).
The framework integrates memory, decision-making, and action modules for real-time, context-aware reasoning in complex multimodal scenarios.
The modular MLLM-as-Tool strategy treats MLLMs as specialized visual processing tools, improving visual-textual reasoning efficiency and robustness.

OASBuilder: Generating OpenAPI Specifications from Online API Documentation with Large Language Models

OASBuilder: introduces a novel framework for automating the generation of OpenAPI Specifications from online API documentation webpages, utilizing Scraping/Segmentation, Demonstrative Documentation Extraction, Descriptive Documentation Extraction, Demonstrative OAS Generation, Descriptive OAS Generation, OAS Merging, OAS Enhancement, and a UI.
The framework employs a multi-stage pipeline integrating LLMs and rule-based algorithms to process diverse and unstructured HTML documentation into structured OAS format.
OASBuilder generates initial partial OAS from both demonstrative examples and descriptive text, merges them, and then enhances the resulting specification using AI-powered tools for metadata enrichment.

Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving

DRP-IMO: introduces a novel framework for automated theorem proving, with a Reasoner (LLM) generating strategic subgoal lemmas, a Lemma Extraction Module extracting formal statements, a Subgoal Verification Prover (ATP Model) verifying these lemmas, and a Final Prover (LLM) constructing the final proof using verified lemmas.
This framework decouples high-level reasoning from low-level proof generation, addressing the gap between LLMs' informal reasoning and formal proving capabilities.
The modular design allows specialized models to excel at their respective tasks, enhancing problem-solving on complex mathematical challenges like IMO problems.

6th July 2025

R1-RE: Cross-Domain Relationship Extraction with RLVR

R1-RE (Reinforcement Learning with Verifiable Reward): introduces a framework for cross-domain relationship extraction, utilizing an LLM, Prompt, Reinforcement Learning (RLVR/GRPO), Reward Function, Annotation Guide, Sentence, and Generated Output to align LLM reasoning with human annotation.
The framework reframes relationship extraction as a reasoning task guided by annotation guidelines, improving out-of-domain robustness compared to traditional supervised fine-tuning.
A multi-stage, rule-based reward design incentivizes accurate predictions and adherence to the required output format, promoting step-by-step reasoning in the LLM.

MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

MOMENTS (Multimodal Mental States): introduces a comprehensive benchmark using Assigned Short Films and Assigned ToM Abilities, created via a pipeline involving Question Annotator, Distractor Annotator, Question Reviewer, Distractor Reviewer, and an LLM Copilot to produce the Annotated ToM Dataset for evaluating multimodal LLMs on Theory of Mind.
The benchmark features over 2,300 multiple-choice questions derived from realistic, long-form videos, assessing seven distinct ToM abilities.
An LLM-in-the-loop annotation framework is employed to generate challenging distractors and mitigate answer set biases observed in prior datasets.

WebSynthesis: World-Model-Guided MCTS for Efficient WebUI-Trajectory Synthesis

WebSynthesis: introduces a novel framework integrating a World Model with WebMCTS to synthesize web UI trajectories offline, utilizing a Policy Agent, World Model, Process Reward Model, and WebMCTS.
The framework leverages the World Model to simulate virtual web environments, enabling efficient and reversible tree-based planning guided by WebMCTS.
WebSynthesis employs a two-stage curriculum learning approach, including UI fundamental understanding and behavior cloning, to train the Policy Agent on synthesized trajectories.

Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties

AgentHazard: introduces, a scalable attack simulation framework for benchmarking mobile GUI agents against unprivileged third parties, with GUI hijacking tool (Modifies UI state/screenshot) and Attack module (Intercepts agent requests), where the framework simulates real-world misleading content attacks by injecting adversarial content into Android applications.
The framework utilizes a GUI hijacking tool as a native Android application to monitor and modify system UI state and screenshots in real-time, and an attack module to intercept agent requests for UI state and return the modified information.
This systematic investigation reveals that mobile GUI agents are vulnerable to misleading third-party content, highlighting the need for improved robustness and security mechanisms in agent design and training.

BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

BYOKG-RAG (Bring-Your-Own-KG RAG): introduces a framework that leverages an LLM (KG-Linker) to generate graph artifacts, employs specialized Graph Retrievers to fetch context, iteratively refines context (Refinement), and uses an LLM for final answer generation.
The framework addresses challenges in KGQA over custom KGs through multi-strategy graph linking and retrieval.
The iterative refinement process progressively improves retrieved context for more accurate artifact generation and final answers.

5th July 2025

Enhancing Robustness of LLM-Driven Multi-Agent Systems through Randomized Smoothing

Randomized Smoothing Framework: introduces a defense framework for LLM-driven Multi-Agent Systems, including MAS, LLM-driven Agent, LLM Function, Randomized Smoothing, Adaptive Sampling Strategy, Monte Carlo Sampling, Trim-mean, State, Reliable State Estimate, and Variance Estimation, to enhance safety and robustness against adversarial inputs and hallucinations in safety-critical domains.
The framework applies randomized smoothing at two levels: verifying neighbor reports and smoothing the agent's own LLM output using adaptive sampling and Monte Carlo methods.
This approach provides probabilistic safety guarantees and effectively mitigates misinformation propagation while maintaining consensus performance.

Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments

JI-ENVS: introduces an interactive and dynamic legal environment for LLM-based agents, constructed via Real-world Legal Source, Role Agent Setting, and Multi-level Environment Construction components.
The framework includes JI-EVAL, a fine-grained evaluation framework utilizing Evaluation Metrics to assess agent performance and procedural compliance.
JI-ENVS comprises six representative legal scenarios categorized into three complexity levels, simulating real-world legal practice for benchmarking language agents.

How to Train Your LLM Web Agent: A Statistical Diagnosis

SFT+RL Pipeline: introduces a two-stage training approach for LLM web agents, combining SFT (Imitate expert policy) and RL (On-policy fine-tuning) using GRPO (RL optimization algorithm), training a Student Model (Trained web agent) on data generated by a Teacher Model (Generates expert data), and incorporating techniques like Curriculum Learning (Prioritizes challenging tasks) and Error Log Feedback (Agent receives error messages).
The pipeline utilizes specific GRPO techniques including Zero-advantage filtering (Drops zero advantage tokens), Standard-deviation normalized advantage (Normalizes advantage function), Importance Ratio (Weighting in GRPO), and Trust Region (Stabilizes GRPO training).
The research statistically diagnoses the compute allocation and hyperparameter sensitivity of this pipeline across different training stages and techniques.

CortexDebate: Debating Sparsely and Equally for Multi-Agent Debate

CortexDebate: introduces a multi-agent debate framework that establishes a Sparse Debating Graph (Communication structure) among LLM Agents (Participants), dynamically optimized by the McKinsey-based Debate Matter (MDM) (Graph optimizer) using the McKinsey Trust Formula (Weight calculation) across Initial Answer Generation (First response), Multi-round Debate (Iterative discussion), and Final Answer Generation (Aggregate result) via Majority Voting (Final decision method).
The framework addresses lengthy input contexts by establishing a sparse graph and mitigates the overconfidence dilemma by using the MDM module for credible evaluation.
The sparse graph reduces the context input burden for agents, while the MDM module promotes equal and effective debate among participants.

Agent Exchange: Shaping the Future of AI Agent Economics

AEX (Agent Exchange): introduces a specialized auction platform for AI agent economics, featuring User Side Platform (USP) (User interface, task translation), Agent Side Platform (ASP) (Capability, performance tracking), Agent Hub (Agent coordination, auction participation), and Data Management Platform (DMP) (Data sharing, value attribution), with AEX acting as the central auction engine (Central auction engine, resource allocation).
AEX facilitates autonomous agent coordination and economic participation within an agent-centric marketplace.
The platform supports dynamic capability assessment, collaborative value attribution, and autonomous team coordination.

A LLM-Driven Multi-Agent System for Professional Development of Mathematics Teachers

I-VIP (Intelligent Virtual Interactive Program): introduces an LLM-driven multi-agent system for mathematics teacher professional development, with Front-end Graphic UI, Back-end API Services, Administrative APIs, LLMs Generation APIs, Database APIs, User Interfaces, Progress Page, Learning Page, Diagnosis Page, Interactive Tools, Multi-Agent Framework (Filter, Judge(s), Responder(s), Facilitator), and Database.
The system provides a dialogue-based platform integrating structured educational content, interactive tools, and dynamic response generation using LLMs and a multi-agent framework.
I-VIP leverages multiple LLM-agents to enhance the accuracy of knowledge judgment and response generation for effective PD tutoring.

Exploring a Gamified Personality Assessment Method through Interaction with Multi-Personality LLM Agents

Multi-PR GPA: introduces a framework for gamified personality assessment using multi-personality LLM agents, including Gamified Interaction, LLM Agents, Multi-type Perception, and Personality Assessment components.
The approach aims for effective and imperceptible personality assessment by leveraging multiplicity and interactivity through engaging user interactions with LLM agents exhibiting diverse personalities.
The framework utilizes LLMs to simulate agents and analyze multi-type textual data (text, behavior, emotion, fine-grained traits) for personality evaluation via direct and questionnaire-based methods.

FinTeam: A Multi-Agent Collaborative Intelligence System for Comprehensive Financial Scenarios

FinTeam: introduces a multi-agent collaborative intelligence system with document analyzer, analyst, accountant, and consultant agents, designed to handle complex financial tasks across various scenarios.
The system leverages a knowledge base and external tools to support agent functions like text processing, data analysis, and numerical calculations.
FinTeam's agents collaborate following specific workflows tailored for macroeconomic, industry, and company analysis scenarios.

4th July 2025

Leveraging Large Language Models for Tacit Knowledge Discovery in Organizational Contexts

Agent-based framework: introduces an agent-based framework leveraging LLMs to iteratively reconstruct dataset descriptions through interactions with simulated employees, including LLM-based Agent, Simulated LLM Employees, Conversation Loop, Simulated Organizational Environment, MDP-inspired Decision Model, and Prompting Techniques.
The framework models knowledge dissemination using a Susceptible-Infectious process within synthetic company structures comprising hierarchy and relationship networks.
Simulations demonstrate the agent's ability to achieve high knowledge recall and navigate organizational complexity without needing direct access to a single domain specialist.

Less is More: Empowering GUI Agent with Context-Aware Simplification

SimpAgent (context-aware simplification framework): introduces a context-aware simplification framework with MLLM, Element Pruning, and Consistency-guided History Compression components, designed for efficient and effective GUI navigation.
The framework addresses challenges of high element density and history redundancy through masking-based pruning and consistency-guided compression.
SimpAgent achieves superior performance and reduces computational cost by simplifying element and history contexts.

AGENT-BASED DETECTION AND RESOLUTION OF INCOMPLETENESS AND AMBIGUITY IN INTERACTIONS WITH LARGE LANGUAGE MODELS

Agent-Based Question-Transducer: introduces an architecture for LLM-based QA systems that includes a Human, Context, Question-Transducer with LLM-based Agents using a Transducer-LLM, and a Responder-LLM.
The Question-Transducer processes user questions and context via LLM-based agents to classify and resolve potential incompleteness or ambiguity before forwarding to the Responder-LLM.
This agent-based approach aims to improve answer quality and shorten interactions by automatically handling question deficiencies.

Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making

LLM Agent Framework: introduces an approach for evaluating large language models in the Ô Ăn Quan board game, utilizing G (Current game state), H (Reasoning history), R (Rule instructions), P (Agent persona) as inputs to a LLAMA (LLaMA-based model) which outputs Reason (Natural language rationale) and Action (Selected move) to drive the game via Update (Board state update).
The framework models different agent types through personas and assesses LLM performance in strategic planning and decision-making within a dynamic, rule-constrained environment.
Experiments with Llama models of varying sizes reveal that larger models exhibit deeper planning capabilities and a preference for long-term strategy, although smaller models can achieve competitive win rates.

STRUCTSENSE: A TASK-AGNOSTIC AGENTIC FRAMEWORK FOR STRUCTURED INFORMATION EXTRACTION WITH HUMAN-IN-THE-LOOP EVALUATION AND BENCHMARKING

STRUCTSENSE: introduces a task-agnostic agentic framework for structured information extraction, with Extractor Agent (Performs extraction task), Alignment Agent (Performs concept alignment), Judge Agent (Evaluates extraction and alignment), Feedback Agent (Incorporates human feedback), Ontology Database (Stores domain knowledge), and Memory (Retains execution context) components.
The framework integrates LLMs with domain-specific knowledge via ontologies and incorporates agentic capabilities and human-in-the-loop mechanisms.
STRUCTSENSE aims to address limitations of domain sensitivity and cross-task generalizability in structured information extraction.

Recon, Answer, Verify: Agents in Search of Truth

RAV (Recon-Answer-Verify): introduces an agentic framework for fact verification that iteratively decomposes claims into sub-questions using a Question Generator agent, answers them with an Answer Generator agent using evidence, and predicts a final label with a Label Generator agent based on the claim and question-answer history.
The pipeline utilizes a History component to store generated question-answer pairs, enabling iterative reasoning and complex claim verification.
RAV generalizes across domains and label granularities by breaking down fact verification into a question-answering process.

Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy

DSPy (Declarative Self-improving Python): introduces a programming model for prompt optimization, with Programs/Modules/Signatures (abstraction for prompt logic), Optimizers (algorithms to refine prompts), LLMs (language models used), Datasets (data for training/evaluation), Evaluation Metrics (measure performance), and Prompts (instructions and examples), aiming to treat prompts as code.
The framework uses optimizers like BootstrapFewShotWithRandomSearch and MIPROv2 to systematically refine prompt instructions and few-shot examples based on performance metrics evaluated on datasets.
Case studies demonstrate DSPy's ability to improve LLM performance across tasks like jailbreak detection, hallucination detection, code generation, routing agents, and prompt evaluation by optimizing prompts programmatically.

EvoAgentX: An Automated Framework for Evolving Agentic Workflows

EvoAgentX (EAX): introduces an open-source platform for automating multi-agent workflows, featuring Basic Components, Agent, Workflow, Evolving, and Evaluation layers.
The Evolving layer integrates TextGrad, AFlow, and MIPRO algorithms to iteratively refine agent prompts, tool configurations, and workflow topologies for dynamic optimization.
The framework includes built-in benchmarks and evaluation metrics to support this evolutionary process, demonstrating significant performance improvements across diverse tasks.

Reinforcement Learning-based Feature Generation Algorithm for Scientific Data

MAFG (Multi-agent Feature Generation): introduces a framework for automated feature generation using Multi-agent Collaboration (Agents select features, operations), Agent_C1 (Selects initial feature subset), Agent_C2 (Selects auxiliary feature), Agent_Op (Selects transformation operator), Feature Clustering (Groups similar features), Exploration (Agents construct transformations iteratively), Generate Features (Combines features and operations), Evaluation (Assesses generated features), Reward Evaluation (Calculates reward signal), Downstream ML Task Evaluation (Evaluates ML task performance), Feature Importance Evaluation (Evaluates feature importance), Mutual Information (Measures feature-target relationship), Performance Improvement (Measures ML performance gain), Dimension Control (Selects top-K features), Memory Replay (Stores experience tuples), Training (Updates agent strategies), Interpretable Optimization (Interprets key features), Interpretation (Explains generated features), Discussion (Part of interpretation), Large Language Model (LLM) (Interprets generated features), and Key Feature Combination (Important generated features), which models feature generation as a multi-agent reinforcement learning process with LLM-based interpretation.
The framework employs multiple agents collaborating through reinforcement learning to explore and optimize feature combinations, incorporating feature clustering and dimension control for efficiency.
An integrated Large Language Model provides interpretative evaluation of generated features, enhancing the scientific validity and practicality of the results.

AI-VAXGUIDE: AN AGENTIC RAG-BASED LLM FOR VACCINATION DECISIONS

AI-VaxGuide (Agentic RAG): introduces, "an intelligent, multilingual question-answering system for vaccination decisions", with Agentic Layer, Tools, RAG Pipeline, Data Preprocessing, Embedding and Storage, Hybrid Multi-Retriever, LLM, Mobile Application, Source Citation, and Feedback Mechanism components, where "the system transforms static vaccination guidelines into an interactive knowledge base using agent-based reasoning and retrieval-augmented generation".
The system employs a hybrid multi-retriever strategy and LLM-powered query expansion to enhance retrieval accuracy and handles complex queries through an agentic layer that orchestrates tasks and reasoning.
Deployed via a mobile application, AI-VaxGuide provides healthcare professionals with reliable, context-aware responses grounded in authoritative medical documents, including source citations for verification.

REAL: Benchmarking Abilities of Large Language Models for Housing Transactions and Services

REAL (Real Estate Agent Large Language Model Evaluation): introduces a benchmark for LLMs in housing transactions, including Data Collection, Data Classification, Data Manipulation, Memory Topic, Comprehension Topic, Reasoning Topic, and Hallucination Topic components.
The benchmark evaluates LLM abilities across four topics: memory, comprehension, reasoning, and hallucination, using 5,316 high-quality evaluation entries.
A data pipeline is designed for constructing the benchmark, involving collecting, classifying, and manipulating real estate data.

ElliottAgents: A Natural Language-Driven Multi-Agent System for Stock Market Analysis and Prediction

ElliottAgents: introduces a multi-agent system for stock market analysis and prediction, leveraging LLMs, RAG, DRL, Memory, and Dynamic Context within agents like Coordinator, Data Engineer, Elliott Waves Analyst, Backtester, Technical Analysis Expert, Investment Advisor, and Reports Writer to analyze data and generate human-comprehensible predictions.
The system combines AI-driven analysis with the Elliott Wave Principle, using natural language dialogue between agents for collaborative analysis and refinement.
Experimental validation demonstrates effectiveness in pattern recognition and generating interpretable market trend descriptions and forecasts.

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

DIAFORGE (Dialogue Framework for Organic Response Generation & Evaluation): introduces a three-stage pipeline for training and evaluating tool-calling LLMs, featuring UTC-GEN (synthetic data engine) with metadata generation and multi-agent dialogue synthesis and validation, supervised fine-tuning, and dynamic evaluation using a multi-sampling user-proxy agent and the Assistant LLM (model being evaluated).
The framework utilizes the UTC-GEN engine with components like User Proxy Agent (simulates user) and Assistant Agent (simulates assistant) to generate disambiguation-focused multi-turn dialogues.
Generated dialogues are validated by a Multi-Agent Dialogue Validator before being used for supervised fine-tuning and dynamic evaluation, which employs Generator LLM (generates user utterances) and Voter LLM (selects best user utterance) for robust user simulation.

GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

GRAFT (Graph-Augmented Agentic Framework for Document-Level Translation): introduces a novel graph-based document-level machine translation system that leverages Large Language Model agents, including a Discourse Agent (segments document), Directed Acyclic Graph (DAG) (intermediate document representation), Edge Agent (establishes dependencies), Memory Agent (extracts local memory), and Translation Agent (translates discourse units).
The framework transforms a source document into a DAG of discourse units to model dependencies and propagate context for coherent translation.
GRAFT's agentic architecture explicitly models and propagates intra- and inter-discourse context, achieving significant performance gains over state-of-the-art systems.

LTLCRIT: A TEMPORAL LOGIC-BASED LLM CRITIC FOR SAFE AND EFFICIENT EMBODIED AGENTS

LTLCrit: introduces a modular actor-critic architecture with Environment, Full State, Abstract State, LLM Actor, Verifier, Low level Planner, Memory Buffer, and LLM Critic components, designed for safe and efficient embodied agents using temporal logic constraints.
The system operates with an online actor loop for real-time action selection and an offline critic loop for learning and refining temporal logic constraints from trajectories.
The LLM Actor proposes actions, the Verifier checks them against LTL constraints, and the LLM Critic generates new constraints based on observed behavior stored in the Memory Buffer.

Conformal Information Pursuit for Interactively Guiding Large Language Models

C-IP (Conformal Information Pursuit): introduces a sequential information pursuit algorithm for interactive question answering using LLMs with LLM-based predictor, Calibration dataset, Prediction sets, Uncertainty estimation, Query selection, History sampling (Uniform), History sampling (LLM Simulation), Querier LLM (20Q), Answerer LLM (20Q), Expert LLM (MediQ), and Patient LLM (MediQ) components.
The approach leverages conformal prediction sets to estimate uncertainty, guiding query selection to minimize prediction set size.
Evaluated on 20 Questions and MediQ datasets, C-IP shows competitive predictive performance and shorter query chains compared to baselines.

GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

GAG-General: introduces an LLM-based multi-agent framework for dynamic text-attributed graph generation, including LLM-based agents (perform selection/interaction), node memory module (records interactions), memory reflection mechanism (summarizes memories), and node generator agents (generate new nodes).
The framework supports two tasks, Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG), the latter incorporating new node generation.
It leverages LLMs for text understanding and generation, integrating structural, temporal, and textual information via node memories for robust DyTAG generation.

CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs

CodeAgents: introduces a token-efficient framework for codified multi-agent reasoning in LLMs, featuring Planner, ToolCaller, and Replanner agents interacting via Codified Pseudocode within an Execution Environment using Tools, processing Observation and Error feedback.
The framework codifies task, plan, feedback, system roles, and tool invocations into modular pseudocode with Typed Variables, Control Flow Structures, Precondition Assertions, and Reusable Subroutines for structured, interpretable, and robust reasoning.
Evaluated on GAIA, HotpotQA, and VirtualHome, CodeAgents consistently improves planning performance and significantly reduces token usage compared to natural language baselines.

TOWARDS MACHINE THEORY OF MIND WITH LARGE LANGUAGE MODEL-AUGMENTED INVERSE PLANNING

LAIP (LLM-AUGMENTED INVERSE PLANNING): introduces a hybrid approach for machine Theory of Mind, combining an LLM Interface, Hypothesis Generator, Action Likelihood Generator, Action Observer, Posterior Calculator, and Belief State Updater interacting with a Task Environment to infer agent mental states.
The model leverages LLMs to generate hypotheses and action likelihoods in potentially open-ended spaces, while using inverse planning to compute posterior probabilities based on observed actions.
This architecture aims to improve robustness and performance on Theory of Mind tasks compared to using LLMs or traditional Bayesian models alone, particularly benefiting smaller LLMs.

Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents

MIMO (Mirror In-the-Model): introduces an agentic refinement framework for automatic ad banner generation, combining a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) for iterative design improvement and stylistic exploration.
The MIMO-Core uses LLM-based agents for content creation, evaluation, and revision, operating on a visual draft and shared memory, while the MIMO-Loop generates diverse styles, runs parallel core instances, and uses multi-agent judging for selection and refinement.
The framework leverages multimodal tools for image generation, visual input, and structured feedback, mimicking human design team workflows to produce high-quality ad banners.

Towards a Playground to Democratize Experimentation and Benchmarking of AI Agents for Network Troubleshooting

Framework: introduces a modular and extensible benchmarking platform for evaluating AI agents in network troubleshooting, comprising User, AI Agents, Tools (Data adapters, Actions), Evaluator, Orchestrator, Environment (Emulator, Telemetry collector, Network scenarios), Traffic generator, and Chaos Engineering components.
The platform standardizes experimentation by allowing users to plug in custom AI agents and evaluate them on curated network problem sets using emulated environments and automated workflows.
It supports interactive, closed-loop operations where LLM agents can dynamically adapt strategies based on real-time telemetry and network state.

3rd July 2025

SI-Agent: An Agentic Framework for Feedback-Driven Generation and Tuning of Human-Readable System Instructions for Large Language Models

SI-Agent: introduces an agentic framework for automated generation and tuning of human-readable System Instructions (SIs) for LLMs, with Instructor Agent (generates/refines SIs), Instruction Follower Agent (executes task using SI), and Feedback/Reward Agent (evaluates output and SI).
The framework operates through an iterative feedback loop where the Feedback/Reward Agent's signal guides the Instructor Agent's refinement process.
This approach aims to balance task effectiveness and SI interpretability, addressing limitations of manual and non-readable automated methods.

Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

RLVER (Reinforcement Learning with Verifiable Emotion Rewards): introduces an end-to-end reinforcement learning framework for empathetic agents, including an Agent (LLM being trained), a User Simulator (SAGE) (LLM-based environment) providing Verifiable Emotion Reward (deterministic emotion score), a Policy Optimization Algorithm (PPO/GRPO) (RL algorithm) for Policy Update (policy update mechanism), and a Think-Then-Say Scaffold (reasoning prompting template).
The framework leverages verifiable emotion rewards from simulated users to train LLMs for higher-order empathetic abilities.
RLVER demonstrates that emotionally intelligent behaviors can be effectively acquired through RL training with a self-consistent user simulator and principled training strategies.

Moral Responsibility or Obedience: What Do We Want from AI?

Agentic AI: introduces, with Goal-Oriented Autonomy, Persistent Identity, Autonomous Adaptability, Dynamic/Context-Aware Interaction, Broad/Continual Learning, Collaborative Reasoning, Autonomous/Contextual Reasoning, Independent Initiative, and Moral Reasoning/Ethical Judgment, a discussion on shifting AI safety evaluation from obedience to ethical judgment for systems capable of navigating moral dilemmas.
The paper argues that recent incidents of AI "disobedience" in safety testing should be viewed as evidence of emerging ethical reasoning rather than misalignment or failure.
Evaluating agentic AI safety requires frameworks that assess ethical judgment and the capacity to resolve moral dilemmas, similar to expectations for human professionals.

KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs

KERAP (A Knowledge-Enhanced Reasoning Approach): introduces a knowledge graph-enhanced reasoning approach for zero-shot diagnosis prediction, with linkage-, retrieval-, and prediction-agents.
The framework utilizes a linkage agent to map EHR data to a biomedical knowledge graph, a retrieval agent to extract relevant knowledge, and a prediction agent for multi-stage reasoning.
KERAP integrates patient data and structured knowledge via multi-agent collaboration and iterative reasoning to enhance diagnostic accuracy and reliability.

Knowledge Protocol Engineering: A New Paradigm for AI in Domain-Specific Knowledge Work

KPE (Knowledge Protocol Engineering): introduces a new paradigm for AI specialization by translating human expert knowledge into a machine-executable Knowledge Protocol (KP) to guide a Large Language Model (LLM).
The Knowledge Protocol (KP) contains domain-specific methodology, workflows, and strategies, enabling the LLM to perform complex, multi-step tasks requiring procedural reasoning.
KPE elevates the human expert to a Knowledge Architect role, authoring the protocol that augments the LLM's reasoning architecture beyond factual retrieval.

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

META SECALIGN: introduces an open-source LLM with built-in model-level defense against prompt injection attacks, utilizing the SecAlign++ training recipe, a modified chat template, a preference dataset, Direct Preference Optimization, and LoRA fine-tuning with a tunable LoRA alpha.
The SecAlign++ recipe fine-tunes a Base Instruct LLM using a preference dataset constructed with randomized injection positions and self-generated responses, optimized via DPO and LoRA.
The modified chat template introduces a dedicated input role to separate untrusted data, enabling the model to prioritize trusted instructions and control the utility-security trade-off via LoRA alpha.

BOURBAKI: SELF-GENERATED AND GOAL-CONDITIONED MDPS FOR THEOREM PROVING

Bourbaki: introduces self-generated goal-conditioned MDPs (sG-MDPs), solved using Monte Carlo Tree Search (MCTS) with a Policy Model (LLMs) and Value Function, interacting with the Lean 4 environment via Pantograph and guided by a Reward Function, to tackle automated theorem proving.
The sG-MDP framework allows agents to dynamically generate and pursue subgoals based on the evolving proof state, providing a denser reward signal than traditional sparse theorem proving by defining State Space, Action Space, and Goal Space.
The system ensembles multiple LLMs for subgoal generation and tactic synthesis, achieving state-of-the-art results on the PutnamBench benchmark by enhancing proof search efficiency and effectiveness.

Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents

EAHawk (automated pipeline): introduces EAHawk, with Email Agent Identification (identifies email agents), Attack Prompt Generation (generates attack prompts), Email Agent Hijacking Confirmation (confirms successful hijacking), Test Environment (simulates attack scenario), Automatic Attack Launching (sends attack prompts), and Oracle Definition (detects successful hijacking), as an automated pipeline to evaluate the Email Agent Hijacking (EAH) attack on LLM email agents.
The EAH attack overrides the original prompts of an email agent via external email resources, allowing attackers to gain remote control and perform malicious actions without user awareness.
EAHawk systematically assesses the practical impact of the EAH attack by identifying email agents, generating diverse attack prompts, and simulating attacks in a controlled environment to verify hijacking success.

On the Convergence of Large Language Model Optimizer for Black-Box Network Management

LLMO (Large Language Model Optimizer): introduces a framework for black-box network management using pretrained LLMs as optimization agents, including LLM L(·) (Optimization agent), Memory M(t) (Stores action-reward pairs), Sampling operator S(.) (Selects in-context examples), and Prompt generator P(·) (Creates LLM input).
The paper models the LLMO procedure as a finite-state Markov chain and proves its convergence to the global optimum, particularly with elitist sampling.
The analysis is extended to a multi-LLM architecture, demonstrating improved convergence speed with multiple LLMs.

Hey AI, Generate Me a Hardware Code! Agentic AI-based Hardware Design & Verification

Agentic AI methodology (Multi-Agent System-based): introduces an approach for hardware design and verification using Specialized AI Agents, managed by an Agent Orchestration System and Group Chat Manager, with an Executor Agent for tool interaction, a Critic Agent for feedback, Human-in-the-Loop intervention, and Shared Context for communication.
The methodology structures the process into planning, development, and execution phases, enabling iterative refinement and self-correction through agent collaboration.
Integration with industry-standard EDA tools and targeted human intervention addresses limitations of zero-shot LLM approaches for reliable design and verification.

VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

VRAgent-R1: introduces a novel agent-based paradigm for video recommendation, incorporating an Item Perception (IP) Agent for video modeling and a User Simulation (US) Agent for user modeling, interacting within a Recommendation System Environment.
The IP Agent utilizes Key Frame Retrieval, Collaborative Multimodal Perception, and Recommendation Relevant Analysis to generate Enhanced Video Features from Historical Videos.
The US Agent simulates user behavior using Chain-of-Thought Reasoning on user status and candidate videos, trained via Reinforcement Fine-Tuning with GRPO based on Task-Specific Rewards derived from Ground Truth.

STRATEGIC INTELLIGENCE IN LARGE LANGUAGE MODELS EVIDENCE FROM EVOLUTIONARY GAME THEORY.

Evolutionary IPD Tournament Framework: introduces a system to evaluate LLMs' strategic intelligence by pitting LLM Agents (OpenAI, Gemini, Anthropic) and Classic Strategies (Benchmark IPD players) against each other in a Tournament Simulation (Orchestrates evolutionary dynamics) governed by a Match Procedure (Defines game rules) and an Evolutionary Update Rule (Determines population changes), with performance analyzed using Key Metrics (Quantify agent performance) and Qualitative Content Analysis (Analyzes LLM rationales), supported by Implementation & Reproducibility (Software and data).
The framework simulates iterated Prisoner's Dilemma tournaments across various conditions, including different termination probabilities and mutation, to observe agent behavior and evolutionary success.
Analysis of agent performance, strategic fingerprints, and textual rationales provides evidence that LLMs exhibit distinct, adaptive strategic reasoning rather than merely retrieving memorized patterns.

DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making

DynamiCare: introduces a dynamic multi-agent framework for medical decision-making, comprising a Patient System (Responds to queries) and a Doctor System (Manages diagnostic process).
The Doctor System includes a Central Agent (Manages specialist team) that dynamically adjusts the Specialist Team (Generates diagnosis/questions) based on the Visit Log (Records interaction history).
The Patient System processes queries using components like Paraphrase, Match, Fallback, Tokenize, and Keywords map to generate responses from patient data.

WebSailor: Navigating Super-human Reasoning for Web Agent

WebSailor: introduces a complete post-training methodology for web agents, including training data synthesis (SailorFog-QA), trajectory reconstruction, rejection sampling fine-tuning (RFT), duplicating sampling policy optimization (DUPO), agent architecture (ReAct framework), and tools (search tool, visit tool, summary model), designed to instill sophisticated reasoning for complex web navigation.
The approach generates high-uncertainty training data (SailorFog-QA) and reconstructs concise reasoning trajectories from expert models to overcome limitations of direct imitation and context overload.
The training methodology combines an RFT cold start with an efficient RL algorithm (DUPO) to enhance sample efficiency and performance on challenging information-seeking tasks, achieving performance comparable to proprietary agents.

Are You Listening to Me? Fine-Tuning Chatbots for Empathetic Dialogue

Fine-Tuning Chatbots for Empathetic Dialogue: introduces an approach to evaluate LLMs for empathetic dialogue using an Expert-Curated Dataset (Base empathetic conversations), LLMs (Generate/extend dialogue), Prompt Engineering (Guide LLM behavior), VADER Tool (Quantify emotional energy), and Expert Evaluator (Assess empathy quality).
The approach involves creating baseline empathetic conversations, using prompt engineering to guide LLMs (ChatGPT and Gemini) to extend or generate similar dialogues, and evaluating the results via automated sentiment analysis and human expert assessment.
This methodology highlights the importance of combining quantitative lexical analysis with qualitative human evaluation to assess the nuanced quality of empathetic listening in LLM-generated conversations.

CyberRAG: An agentic RAG cyber attack classification and reporting tool

CyberRAG: introduces a modular, agent-based RAG framework for cyber-attack classification and reporting, including a Core LLM Engine, Classification Tool, RAG Tool, Attack Description Report Generator, and Interactive Chat.
The framework uses specialized LLM classifiers and iterative retrieval-and-reasoning to classify payloads and generate context-aware explanations.
CyberRAG provides interpretable, SOC-ready reports and supports interactive user dialogue for enhanced analysis and understanding.

OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent

OMS (On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent): introduces a framework for ad keyword generation featuring a Keyword Performance Monitor (Monitors keyword performance), Agentic Clustering-Ranking Module (Analyzes, scores, ranks keywords), Multi-Turn Generation-Reflection Module (Generates, refines keywords), various Tools (Support generation/reflection), and Keyword Deployment (Deploys new keywords).
The framework monitors keyword performance, analyzes intent, calculates multi-objective scores, ranks keywords, generates and refines new keywords using reflection, and re-clusters them.
It operates on-the-fly without training data, optimizes for multiple metrics, and leverages LLM agents and external tools for adaptive generation.

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MEMAGENT: introduces a novel agent workflow for long-context LLMs, featuring a base language model, fixed-length token memory, a context processing module for iterative updates, an answer generation module, trained using the Multi-conv DAPO RL algorithm with a rule-based verifier for rewards.
The approach processes long documents in segments, updating memory via an overwrite strategy to achieve linear time complexity and handle arbitrary input lengths.
Reinforcement learning trains the model to selectively retain answer-critical information in memory, enabling strong extrapolation capabilities on long-context tasks.

Establishing Best Practices for Building Rigorous Agentic Benchmarks

ABC (Agentic Benchmark Checklist): introduces a set of guidelines for evaluating agentic benchmarks, with components assessing task validity, outcome validity, and benchmark reporting.
The checklist identifies issues in benchmark design and implementation that can lead to inaccurate performance estimations of AI agents.
Applying the checklist helps improve the rigor of agentic benchmark evaluation and reporting practices.

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

META SECALIGN: introduces, "a secure foundation LLM against prompt injection attacks", with Base Instruct LLM (underlying language model), Modified Chat Template (structured input format), SecAlign++ Recipe (fine-tuning process), and LoRA (parameter-efficient tuning method), where "it develops the first open-source LLM with built-in model-level defense achieving commercial-grade performance".
The framework fine-tunes LLAMA 3 series Instruct LLMs using a modified chat template and the SecAlign++ recipe, which includes DPO and LoRA.
Evaluations show META SECALIGN achieves state-of-the-art security against prompt injection attacks with comparable utility to closed-source models.

Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

RLVER (Reinforcement Learning with Verifiable Emotion Rewards): introduces a reinforcement learning framework for training LLMs, including a user simulator, an LLM agent, emotion rewards, policy optimization, and an optional thinking scaffold.
The framework leverages a self-consistent user simulator to generate verifiable emotion rewards, guiding the LLM agent's learning towards empathetic dialogue.
An explicit thinking scaffold can be incorporated into the LLM's generation process to enhance the development of higher-order empathetic strategies.

Autonomous Control Leveraging LLMs: An Agentic Framework for Next-Generation Industrial Automation

Agentic Framework: introduces a unified agentic framework leveraging LLMs for autonomous industrial control, including a Monitoring Agent (Continuously ingests sensor data), Action Agent (Proposes control moves or plans), Digital Twin Agent (Simulates proposed actions), Validation Agent (Scrutinises simulated outcome), Reprompting Agent (Interprets feedback and refines prompt), and Safety System (Provides fallback control).
The framework integrates symbolic planning via Finite State Machines and continuous control using an iterative action-simulation-validation-reprompting loop.
Case studies demonstrate the framework's ability to generate valid recovery paths in FSMs and regulate temperature in a physical system under disturbances, highlighting the role of validation and reprompting in achieving robustness.

2nd July 2025

Do Role-Playing Agents Practice What They Preach? Belief-Behavior Consistency in LLM-Based Simulations of Human Trust

LLM-based Role-Playing Agent System: investigates belief-behavior consistency in LLM-based role-playing agents, with LLM Agent (role-playing model), Persona (synthetic profile attributes), Trust Game Environment (simulated economic game), Trustee Archetypes (fixed opponent strategies), Prompting Strategies (agent interaction methods), and ReAct Framework (reasoning and acting process), by evaluating consistency between elicited beliefs and simulated behavior.
The study uses the Trust Game as a testbed and evaluates consistency at both population and individual levels using various elicitation and conditioning strategies.
Findings reveal systematic inconsistencies between stated beliefs and simulated behaviors, highlighting the need for robust internal consistency evaluation before using these systems in behavioral studies.

Enhancing COBOL Code Explanations: A Multi-Agents Approach Using Large Language Models

Multi-Agents Approach: introduces a multi-agent framework for generating COBOL code explanations, with Code Processing Agent (Analyzes code, generates explanations), Text Processing Agent (Refines, merges explanations), Function Level (Function explanation pipeline), File Level (File explanation pipeline), and Project Level (Project explanation pipeline) components.
The approach leverages two LLM-based agents and source code artifacts to generate explanations at function, file, and project granularities.
Hierarchical merging is employed within the File Level and Project Level pipelines to handle long code exceeding LLM token limits.

Synergizing Logical Reasoning, Knowledge Management and Collaboration in Multi-Agent LLM System

SynergyMAS: introduces a multi-agent system framework integrating Logical Reasoning, Retrieval-Augmented Generation (RAG), and Theory of Mind (ToM) capabilities, supported by Communication Protocols, Agent Specialization, a Hierarchical Structure, and internal Agent Architecture, to enhance LLM performance in complex tasks.
The framework utilizes a Neo4j graph knowledge base and Clingo logic solver for reasoning, a modified Corrective RAG with Chroma vector base and web search for knowledge management, and explicit belief state modeling for Theory of Mind.
A hierarchical structure with a coordinating "boss" agent and specialized follower agents facilitates collaborative problem-solving through structured interactions and iterative development cycles.

The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems

Multi-Agent System (MAS): introduces a unified formalism for agentic recommender systems, comprising LLM Agent (Core decision-maker), Memory (Stores state/context), Tools (External functions/APIs), Environment (Shared resources/percepts), Interaction Protocol (Agent communication rules), Chat Agent (User interface), Specialised-Agent Caller (Spawns sub-agents), Retrieval Agent (Fetches data/items), Consistency Agent (Ensures coherence/compliance), Ranking & Presentation Agent (Orders/formats output), User Simulator (Generates synthetic behavior), Evaluation Agent (Logs/computes metrics), Session Summariser (Compresses session outcomes), Reporter Agent (Aggregates/reports results), Image Agent (Extracts image features), and Explanation Agent (Generates justifications).
The framework enables LLM agents to plan, remember, use tools, and cooperate to handle complex, multi-step recommendation tasks beyond single-query responses.
Specific use cases like party planning, user simulation, multi-modal recommendation, and explanation generation illustrate how agentic orchestration unlocks new capabilities and addresses challenges in personalization, evaluation, and transparency.

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

SCIGYM: introduces a benchmark evaluating LLMs' scientific discovery capabilities using a dry lab simulation of biological systems, featuring an Agent, Dry Lab, SBML Models, Python Execution Environment, Experimental Perturbations, Observations, and Model Submission.
The framework tasks the Agent with discovering missing biological mechanisms by interacting with the Dry Lab, which simulates SBML Models and provides Observations from Experimental Perturbations, allowing the Agent to analyze data using the Python Execution Environment and refine its hypothesis for Model Submission.
This dry lab approach overcomes the cost and time limitations of wet lab experiments, enabling scalable evaluation of LLMs on iterative experiment design and data analysis in complex biological systems.

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs

Test-time Compute (TTC) strategies: introduces a two-tiered taxonomy of controllable (L1) and adaptive (L2) methods for improving LLM reasoning efficiency, categorized by sequential and parallel approaches, implemented via prompting, supervised finetuning, or reinforcement learning.
The survey addresses the inefficiency of current LLMs that use fixed inference compute, often overthinking simple problems and underthinking hard ones.
Benchmarking reveals systemic inefficiencies in existing models, highlighting the need for more adaptive and compute-aware reasoning mechanisms to balance performance, cost, and latency.

The Thin Line Between Comprehension and Persuasion in LLMs

LLM Debate Evaluation Framework: introduces a method to evaluate LLMs in debate scenarios, with LLM (Generation), Formal Dialogue Model (FDM), Human Participant, Debate Transcript, Human Annotator, Annotation Criteria, LLM (Evaluation), Automated Prompt Optimization (APO), Audience, Survey Response, and Speech-to-Text (STT) components, where the paper evaluates LLMs' persuasive abilities and comprehension in structured debates.
The framework compares standard LLMs with LLMs augmented by a Formal Dialogue Model (DE model) in debates against humans and other LLMs.
Evaluation involves human and LLM annotation of debate transcripts based on defined criteria, alongside participant and audience surveys on satisfaction and persuasion.

Decision-oriented Text Evaluation

Decision-Oriented Evaluation Framework: introduces, "a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes", with all Text Source (Origin of text), Text Generation Method (Process for creating text), Decision Agent (Entity making decisions), and Evaluation Metric (Measure of decision quality) components, where the framework evaluates generated text by assessing the accuracy of investment decisions made by human and LLM agents based on the text.
The framework utilizes market digests generated by human journalists or LLMs using different selection methods as input for human and LLM decision-making agents.
Decision quality is quantified using thresholded prediction accuracy of stock movements, highlighting the practical value of generated text beyond traditional intrinsic metrics.

Bridging UI Design and chatbot Interactions: Applying Form-Based Principles to Conversational Agents

GUI-Inspired CoT with Submit/Reset Metaphor: introduces a method for domain-specific chatbots using User Query, Session Data, Task-Based Prompt, LLM, LLM Response, Parser, Chain-of-Thought (CoT), Decision Logic, and Back-end System to model GUI actions like Submit/Reset.
The approach leverages LLMs prompted to generate structured data and CoT reasoning, which is parsed by the back-end to manage context and execute actions unambiguously.
By making acknowledgment and context switching explicit via structured LLM outputs and CoT, the system reduces user confusion and aligns conversational flow with back-end logic.

Bridging UI Design and chatbot Interactions: Applying Form-Based Principles to Conversational Agents

GUI-Inspired CoT with Submit/Reset Metaphor: introduces a method for domain-specific chatbots using User Query, Session Data, Task-Based Prompt, LLM, LLM Response, Parser, Chain-of-Thought (CoT), Decision Logic, and Back-end System to model GUI actions like Submit/Reset.
The approach leverages LLMs prompted to generate structured data and CoT reasoning, which is parsed by the back-end to manage context and execute actions unambiguously.
By making acknowledgment and context switching explicit via structured LLM outputs and CoT, the system reduces user confusion and aligns conversational flow with back-end logic.

Agent Ideate: A Framework for Product Idea Generation from Patents Using Agentic AI

Agent Ideate: introduces a framework for generating product ideas from patents, with Patent Summarizer Agent (Summarizes patent), Keyword Extraction and Search Agent (Extracts keywords and searches), and Idea Generation & Validation Agents (Generates and validates idea).
The framework processes Patent Data (Input source) through specialized agents to produce structured Product Information (Output).
The agentic approach leverages LLMs and external search tools to enhance the innovation pipeline from patent data.

Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture

bMAS (blackboard-based LLM multi-agent system): introduces a framework with a Blackboard (shared information space), Control Unit (selects agents), and Agent Group (collection of LLM agents), implemented in LbMAS with an Agent Generation Module (generates expert agents), Solution Extraction Module (extracts final solution), and LLM Set (pool of base models).
The framework utilizes a shared blackboard for agent communication and collaboration, replacing individual agent memory modules.
The Control Unit dynamically selects agents based on the blackboard content, enabling adaptive problem-solving without predefined workflows.

Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems

Data Agent: introduces a comprehensive architecture for orchestrating Data+AI ecosystems, including Data Plane (Organize, understand data), Engine Plane (Understand, schedule engines, agents), Orchestration Plane (Manage pipeline workflow), Memory (Store knowledge, context), Perception (Understand environment, tasks), Tools (External data processing utilities), and Continuous Learning (Improve agent over time).
The architecture integrates knowledge comprehension, reasoning, and planning capabilities to handle data-related tasks autonomously.
It addresses challenges in understanding data/queries/environments/tools, orchestrating/optimizing/executing pipelines, and enabling self-reflection for continuous improvement.

AGENT-AS-TOOL: A STUDY ON THE HIERARCHICAL DECISION MAKING WITH REINFORCEMENT LEARNING

Agent-as-tool: introduces a hierarchical framework with Planner (reasons, decides tool use), Toolcaller (executes tool actions, processes results), Tools (external interfaces), Observations (structured tool outputs), and Reinforcement Learning (GRPO) (fine-tunes Planner).
The framework decouples reasoning and tool execution by assigning these roles to the Planner and Toolcaller respectively.
This hierarchical design improves reasoning accuracy by providing the Planner with cleaner, structured observations from the Toolcaller.

BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments

BioMARS (Biological Multi-Agent Robotic System): introduces a multi-agent robotic system for autonomous biological experiments, integrating LLMs, VLMs, and modular robotics with Biologist Agent (Designs protocols), Technician Agent (Translates to code), Inspector Agent (Detects errors), Physical Hardware (Executes actions), User Interface (Human interaction), LLMs (Language models), VLMs (Vision-language models), RAG (Retrieval augmented generation), Knowledge Checker (Filters content), Workflow Generator (Formulates steps), Workflow Checker (Refines workflow), Code Generator (Maps to pseudo-code), Code Checker (Validates code), Vision Transformer (Visual detection), and ROS (Robot control system) components.
The system employs a hierarchical architecture where the Biologist Agent designs protocols, the Technician Agent translates them into robotic code, and the Inspector Agent monitors execution for errors.
BioMARS leverages LLMs and VLMs for reasoning and perception, enabling autonomous protocol design, execution, and error handling in biological tasks.

Using multi-agent architecture to mitigate the risk of LLM hallucinations

Multi-agent architecture: introduces a system to handle customer SMS requests using multiple intelligent agents, including services for receiving messages, orchestrating processing, arbitrating decisions, and specialized agents for handling specific tasks.
The architecture integrates LLM-based agents with fuzzy logic and parsing techniques to interpret messages, evaluate confidence, assess customer importance, and detect potential LLM hallucinations.
Hallucination mitigation involves comparing keyword extraction results from parsing and LLM agents and using fuzzy rules to determine the handling of potentially high-risk requests or route messages to expert agents.

RALLY: Role-Adaptive LLM-Driven Yoked Navigation for Agentic UAV Swarms

RALLY (Role-Adaptive LLM-Driven Yoked Navigation): introduces, with LLM-based two-stage semantic reasoning module, Local intention generation, Neighborhood consensus refinement, Role-value Mixing Network (RMIX)-based credit-distribution mechanism, RMIX Network, Prior Offline Experience Replay Buffer, and Fine-tuned LLM components, a framework for role-adaptive LLM-driven yoked navigation for agentic UAV swarms.
The framework integrates LLM semantic reasoning with MARL policy learning for coordinating roles and decision-making across UAV swarms.
It employs a two-stage LLM process for consensus inference and a RMIX-based mechanism for dynamic role assignment and credit assignment.

Evaluating LLM Agent Collusion in Double Auctions

LLM Agent Double Auction Simulation: introduces a system to evaluate LLM agent collusion in a simulated continuous double auction environment with LLM Agents (buyers and sellers), Bid Queue, Ask Queue, Market Resolution Mechanism, Updated Market History, Planning & Messaging, Persistent Memory Store, Strategy Scratchpad, LLM Evaluator, Overseer Agent, CEO Message, and CME Group Regulators Message, investigating factors affecting seller collusion.
The research explores how communication, model variation, and environmental pressures like oversight and urgency influence LLM seller agents' propensity to collude and their pricing behavior.
Findings indicate that direct communication increases collusion, model choice affects coordination, and urgency can override the effects of regulatory oversight in promoting collusive pricing strategies.

AI Agents and Agentic AI-Navigating a Plethora of Concepts for Future Manufacturing

LLM-Agents, MLLM-Agents, and Agentic AI: reviews the evolution and concepts of AI agents, detailing LLM-Agents with Profile (identity, role, constraints), Memory (stores, retrieves interactions), Planning (decomposes tasks, steps), and Action (executes decisions, tools) components, MLLM-Agents, and Agentic AI, exploring their manufacturing potential.
The paper discusses how Generative AI, including LLMs and MLLMs, enhances AI agents' capabilities for manufacturing applications.
It highlights the progression from traditional AI agents to more autonomous, adaptive, and goal-driven Agentic AI systems for future manufacturing.

Context-Aware Code Wiring Recommendation with LLM-based Agent

WIRL: introduces an LLM-based agent for context-aware code wiring, combining an LLM (Large Language Model), an Agent Pilot (Orchestrates communication), and a Customized Toolkit (Provides essential functionalities) with Locator (Identifies unresolved elements), Collector (Collects contextual information), and Completer (Infills isolated code) tools.
The framework reformulates code wiring as a retrieval-augmented generation infilling task, leveraging LLMs' strengths in code completion.
WIRL employs a hybrid execution mode and a state machine to guide the agent's exploration and improve efficiency.

Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation

LUSTER (LLM-based Unified System for Task-oriented dialogue with End-to-end Reinforcement learning): introduces an end-to-end task-oriented dialogue system integrating Dialogue History Encoding (Alternating user/system utterances), User Emotion Recognition (Predicts user emotional state), Active Domain Recognition (Identifies active domain), Dialogue State Tracking (Generates dialogue state), Database Query (Retrieves matching entries), Dialogue Action Prediction (Generates dialogue actions), System Conduct Selection (Selects system emotional stance), System Response Generation (Generates natural language response), LLM (Backbone model), and Database (Structured information storage) components.
The system uses fully lexicalised representations and is trained with both supervised learning and hierarchical reinforcement learning, incorporating short-term emotion and long-term task success rewards.
LUSTER achieves higher task success and lower concept error compared to other approaches by combining LLM capabilities with structured reward modeling.

GAIus: Combining Genai with Legal Clauses Retrieval for Knowledge-based Assistant

gAlus: introduces a cognitive LLM-based agent architecture for legal question answering using Polish Civil Code, featuring an AI assistant (single agent), Retriever (selects relevant documents), Documents database (stores legal text articles), Document chunking (splits legal text into articles), Query reformulation (generalizes user query), and Document scoring function (custom text matching).
The Retriever selects relevant articles from the database based on reformulated queries, utilizing either the custom scoring function or Embeddings (vector representations) stored in a Vectorstore (stores embeddings) for a RAG variant.
Evaluation on Polish law apprenticeship exam questions demonstrates gAlus significantly enhances LLM performance in providing correct answers and citing relevant legal provisions.

1st July 2025

STELLA: Self-Evolving LLM Agent for Biomedical Research

STELLA: introduces a self-evolving LLM agent for biomedical research, leveraging Manager, Dev, Critic, and Tool Creation Agents, an evolving Template Library, and a dynamic Tool Ocean, along with Conda Environment, Scripts, Input, Final Result, and Human Expert/Wet Experiment feedback, to autonomously improve capabilities.
The agent employs a multi-agent architecture and two core self-evolving mechanisms: a Template Library for reasoning strategies and a dynamic Tool Ocean for accessible tools.
STELLA learns from experience, dynamically expanding its knowledge and skills to tackle complex biomedical challenges and improve performance over time.

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Web Agent with Dynamic Reflection: introduces WebArXiv, a static benchmark, and proposes a dynamic reflection mechanism for web agents, including Web Agent, Visual Observations, Element Texts, Interaction History, Dynamic Reflection Mechanism, Model, Reasoning Context, Action Execution, and History Update components.
WebArXiv provides a stable and reproducible environment for evaluating web agents on time-invariant arXiv tasks.
The dynamic reflection mechanism enhances agent performance by selectively retrieving relevant past interaction steps for improved decision-making.

Enhancing LLM Agent Safety via Causal Influence Prompting

CIP (Causal Influence Prompting): introduces a novel technique for enhancing LLM agent safety by leveraging Causal Influence Diagrams (CID) initialization, Environment interaction, and CID refinement.
The approach uses CIDs to represent cause-and-effect relationships in the agent's decision-making process, enabling reasoning about potential consequences.
Iterative refinement of the CID based on observed behaviors allows the agent to anticipate harmful outcomes and make safer decisions.

Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications

Urban LLM Agents: introduces a framework for LLM-powered agents operating in urban environments, with LLMs (Core controller), Urban Sensing (Collects, interprets urban signals), Memory Management (Organizes, retrieves urban knowledge), Reasoning (Simulates, plans actions), Execution (Translates plans into actions), and Learning (Adapts, improves behavior) components.
These agents are semi-embodied, interacting with cyber-physical-social urban systems through APIs, databases, and platforms to support system-level decision-making.
The paper surveys the research landscape, categorizes applications, and discusses trustworthiness and evaluation challenges for real-world deployment.

TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

TransLaw: introduces a multi-agent framework for legal judgment translation, featuring Translator Agent, Annotator Agent, and Proofreader Agent powered by LLMs.
The framework simulates a professional translation workflow where agents collaborate, utilizing Proofreading Memory, Translation Memory, and a Terminology database.
A Memory module supports agent self-adaptation by storing interaction history, aiming to improve translation quality and efficiency.

Many LLMs Are More Utilitarian Than One

LLM-MAS (Large Language Model Multi-Agent Systems): introduces a study on collective moral reasoning in LLMs, featuring LLM Agent (Individual large language model) in Solo Condition (Independent reasoning) or Group Condition (Multi-agent deliberation) involving a Discussion Phase (Multi-turn agent exchange) and a Reflection Phase (Private reasoning and scoring).
The research investigates whether multi-agent LLM systems exhibit a utilitarian boost in moral judgments compared to individual LLMs.
Experiments with six different LLMs in pairs and triads show a consistent shift towards endorsing norm violations that maximize overall welfare.

Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity

LLM Social Agents: introduces, "Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity", with LLMs (Generate responses), LLM Agents (Simulate users), Zero Shot Initialization (Uses political leaning), Few Shot Initialization (Uses user history), User Profile Data (Bio, tweets for Few Shot), and Tweet Conversation Context (Input tweets for reply), where the paper investigates how LLMs simulate political discourse on social media using agents initialized with varying user data.
The study evaluates three LLM families (Gemini, Mistral, DeepSeek) under Zero Shot and Few Shot conditions, comparing their outputs to human replies on lexical diversity, ideological consistency, and toxicity.
Findings reveal "generative exaggeration," where LLMs amplify salient user traits, particularly in the Few Shot setting, leading to increased polarization, stylized language, and toxicity, challenging their reliability as social proxies.

ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis

ChatHLS: introduces an automated end-to-end workflow for HLS design optimization and error correction, including C++ Input, LLM ① (HLS GEN), RAG, LLM ② (HLSTuner), HLS Tool (Testing), LLM ③ (Bug Fixing), LLM ④ (Instruction Adherence), LLM Group ⑤ (Multifaceted Assessment), LLM ⑥ (Scoring), BugRAG, QoR Pass Check, User Requirement, HLS-C Output, and HLS Dataset Collection.
The framework leverages fine-tuned LLMs within a multi-agent system for generating HLS-C code, optimizing designs, and systematically debugging errors.
ChatHLS utilizes a verification-oriented data augmentation paradigm (VODA) and iterative refinement to enhance LLM capabilities and achieve high code repair accuracy and performance speedups.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Large Language Model (LLM): investigates the transferability of reasoning capabilities in LLMs fine-tuned on math tasks by analyzing their internal latent space and output token distribution.
The research compares the impact of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) fine-tuning paradigms on LLM generalization.
Findings indicate that RL-tuned models maintain more stable latent representations and token distributions, leading to better transferability across diverse tasks than SFT-tuned models.

iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing

iPanda: introduces an end-to-end framework for automated protocol conformance testing, with Function Point Extractor (extracts points), Test Case Generation Module (generates test cases), LLM Interactor (generates test code), Execution Module (runs tests), Memory Module (manages memory), and Summarization Module (summarizes, reports).
The framework leverages LLMs, keyword-based test case generation, code-based retrieval-augmented generation, and iterative self-correction for test code refinement.
iPanda streamlines the testing process from specification analysis to result analysis, significantly reducing manual effort and improving efficiency.

STELLA: Self-Evolving LLM Agent for Biomedical Research

STELLA: introduces a self-evolving LLM agent for biomedical research, featuring a multi-agent architecture and two self-evolving mechanisms, designed to autonomously improve capabilities and accelerate discovery. The components are Manager Agent (coordinates agents, curates templates), Dev Agent (executes plan, generates code), Critic Agent (assesses results, provides feedback), Tool Creation Agent (creates/integrates new tools), Template Library (stores successful reasoning strategies), and Tool Ocean (dynamic tool/database collection).
The multi-agent architecture orchestrates complex tasks, while the self-evolving mechanisms allow the agent to learn from experience and expand its toolset dynamically.
STELLA demonstrates state-of-the-art performance on biomedical benchmarks and shows systematic improvement with increased computational experience.

Dynamic Strategy Adaptation in Multi-Agent Environments with Large Language Models

PPO+LLM framework: introduces a real-time reward shaping architecture for multi-agent strategy adaptation, integrating Environment (simulates multi-agent task), Grid State (raw environment state), Flattened Tensor (preprocessed observation), PPO Agent (reinforcement learning policy), Prompt Generation Module (creates text prompts), Frozen Large Language Model (evaluates prompts), and Reward Shaping Module (maps LLM feedback to reward).
The framework uses a frozen LLM to provide symbolic feedback on task context via prompts, which is converted into a reward shaping signal for the PPO agents.
This approach enables agents to dynamically adapt strategies in real-time based on high-level feedback, improving coordination and robustness in dynamic, noisy environments.

Enhancing LLM Agent Safety via Causal Influence Prompting

CIP (Causal Influence Prompting): introduces a novel technique leveraging Causal Influence Diagrams (CID) to enhance LLM agent safety by identifying and mitigating risks, including LLM Agent, Causal Influence Diagram (CID), CID Generation, Environment Interaction, CID Refinement, CID Constructor/Verifier Functions, Environment Observation, Task Instruction, and Action Space.
The approach involves initializing a CID from task specifications, guiding agent interactions using the CID, and iteratively refining the CID based on observed behaviors and outcomes.
Experimental results demonstrate that reasoning about cause-and-effect relationships based on CIDs improves the safety of LLM agents in various tasks, including code execution and mobile device control.

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Dynamic Reflection Mechanism: introduces, History Retrieval (Retrieve recent steps) / Reflection Process (Identify relevant history) / Context Construction (Combine history and current view) / Action Generation (Generate next action) / History Update (Add action and result), where the mechanism enhances web agent decision-making by selectively using past interaction steps.
The approach addresses the "Rigid History Reflection" failure mode by dynamically identifying the most relevant prior step for reasoning before generating the next action.
Evaluated on the WebArXiv benchmark, this mechanism improves the performance of LLM-driven agents on time-invariant web tasks.

TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

TransLaw: introduces a novel multi-agent framework for Hong Kong legal judgment translation, comprising Translator, Annotator, and Proofreader agents powered by LLMs.
The framework simulates a professional translation workflow through collaborative task decomposition and specialized roles.
It incorporates memory modules and utilizes a terminology database to enhance translation quality and efficiency.

Performance of LLMs on Stochastic Modeling Operations Research Problems: From Theory to Practice

LLM Evaluation on OR Problems: introduces an evaluation of LLMs on stochastic modeling problems, including LLMs, OR Problems Dataset, SimOpt Library, Evaluation Mechanism, and Simulation Environment, assessing their capabilities in the analysis and optimization stage of the OR pipeline.
The study tests LLMs on graduate-level homework, qualification exam problems, and simulation-optimization tasks from the SimOpt library.
Results indicate state-of-the-art LLMs perform comparably to human experts on theoretical problems and match in-house solvers on practical simulation-optimization tasks, highlighting their potential as OR research assistants.

Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation

MAS (Multi-agent System): introduces a dynamically generated multi-agent framework that simulates real-world multidisciplinary expert consultations, including Patient's Condition (Input data), General Practitioner (GP) Agent (Workflow coordinator), Specialist Agents (Domain experts), Discussion Group (Collaborative forum), and Mediator Agent (Consensus facilitator), to detect and resolve medical conflicts for safer therapy recommendations.
The framework replicates the multi-step workflow of Multidisciplinary Teams (MDTs), enabling LLMs to propose improved treatment plans by detecting and resolving conflicts.
This study also develops a new interpretable evaluation strategy, comparing LLM-proposed treatment plans with original plans focusing on conflict reduction and medication burden.

Black Box Deployed: Functional Criteria for Artificial Moral Agents in the LLM Era

SMA-LLS (Simulating Moral Agency through Large Language Systems): introduces a revised set of ten functional criteria to evaluate LLM-based Artificial Moral Agents, including Moral Concordance (aligns with human principles), Context Sensitivity (adapts to situational nuances), Normative Integrity (coherent ethical values), Metaethical Awareness (recognizes moral uncertainty), Systemic Resilience (robust against attacks/stress), Trustworthiness (warrants human reliance), Corrigibility (adaptable to feedback), Partial Transparency (provides decision insight), Functional Autonomy (independent ethical operation), and Moral Imagination (generates creative ethical responses), shifting the focus from opaque internal states to observable, functionally moral behavior.
The paper argues that traditional ethical criteria, which assume transparent architectures, are obsolete for LLMs due to their stochastic outputs and opaque internal states, necessitating a functionalist approach to AI ethics.
The proposed criteria are illustrated using hypothetical scenarios involving an Autonomous Public Bus (APB) to demonstrate their practical applicability in morally salient contexts, emphasizing behavioral reliability and alignment with human values for safe deployment.

Differentially Private Synthetic Data Release for Topics API Outputs

Differentially Private Synthetic Data Generation Methodology: introduces a novel approach for generating synthetic Topics API outputs that mimic real API traces while providing strong privacy guarantees.
This methodology involves extracting differentially private statistics from real user data, optimizing a parameterized model to match these statistics, and then sampling from the optimized model to create synthetic data.
The generated synthetic dataset enables external researchers to empirically study the privacy properties and re-identification risks of the Topics API, fostering transparency in Privacy-Preserving Ads APIs.

Beyond DNS: Unlocking the Internet of AI Agents via the NANDA Index and Verified AgentFacts

NANDA: introduces a lean, modular index architecture for the Internet of AI agents, comprising a Lean Index Layer (core identity resolution), an AgentFacts Layer (metadata distribution tier), and a Dynamic Resolution Layer (adaptive routing tier).
This architecture decouples static identity resolution from verifiable metadata distribution and dynamic endpoint routing, enabling scalable, secure, and privacy-preserving discovery and interaction for billions of AI agents.
The system aims to overcome DNS limitations for dynamic AI agent environments by providing rapid global resolution, sub-second revocation, schema-validated capability assertions, and privacy-preserving discovery.

Autonomous Resource Management in Microservice Systems via Reinforcement Learning

Reinforcement Learning-based Resource Management Model: introduces an intelligent reinforcement learning-based method for microservice resource scheduling and optimization, with Agent (decision-making entity), Environment (simulated microservice system), State (system status input), Action (resource allocation/scheduling output), Reward (performance feedback), Policy Network (action selection mechanism), G-Network (value/action generation), Experience Replay (memory for learning), Default Load (baseline workload), and Unlimited Repair (system resilience), where it dynamically adjusts resource allocation and data flow paths to enhance system performance.
The model leverages Deep Q Network (DQN) methods, experience replay, and neural networks (Policy Network, G-Network) to learn optimal strategies for resource allocation and data flow scheduling in dynamic microservice environments.
Experimental results demonstrate significant improvements in response time, throughput, resource utilization, and cost efficiency across various load and resource conditions compared to traditional static allocation methods.

Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind – A Position Paper

Agentic Flow: introduces a structured cognitive loop with five modules—Retrieval (retrieval-augmented generation), Cognition (LLM-based reasoning), Control (monitoring/validation/arbitration), Memory (context/state tracking), and Action (tool execution/logging)—designed to overcome LLM limitations and align with four theories of mind.
This architecture demonstrates how practical implementation can reveal structural convergence across Kahneman's dual-system theory, Friston's predictive processing, Minsky's society of mind, and Clark's extended mind, suggesting shared architectural patterns driven by functional demands.
Empirical evaluation shows Agentic Flow outperforms baseline LLM-only agents in multi-step, conditional reasoning tasks, exhibiting enhanced task success, robust constraint adherence, and reduced hallucinations.

30th June 2025

L0: REINFORCEMENT LEARNING TO BECOME GENERAL AGENTS

L0 (L-Zero): introduces a scalable, end-to-end training pipeline for general-purpose agents, featuring the NB-Agent (Agent architecture scaffold) operating within a Python Environment (Interactive code execution environment) using Predefined Tools (Available agent action tools) and managed by a Context Watcher (Manages LLM context) and Notepad (State and memory management).
The L0 framework utilizes AgentRL (Reinforcement learning framework) for training, which includes a Training Engine (Manages RL updates), Inference Server (Hosts agent policy), Agent Workers (Execute agent rollouts) in a Brwap Sandbox (Isolated worker environment), a Single Controller (Dispatches tasks), Agentic Policy Gradient (Policy gradient for actions), Agentic Reward (Verifiable training reward), Dynamic Sampling (Exploration and stability strategy), and collects Trajectories (Collected interaction sequences).
L0 trains the NB-Agent, powered by a Large Language Model (Core action generation model), to perform multi-turn, long-horizon tasks by generating and executing code actions in a REPL-style loop, leveraging verifiable rewards for effective learning.

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

SPIRAL: introduces a self-play framework for LLMs using a Distributed Actor-Learner Architecture, Parallel Rollout, Centralized Learner, Role-conditioned Advantage Estimation, Shared Policy, Zero-Sum Games, Evaluation Games, Vectorized Environment, and Model Inference, enabling language models to develop reasoning through multi-turn competitive self-play on zero-sum games.
The framework utilizes a distributed actor-learner system with parallel rollout in vectorized game environments and a centralized learner processing trajectories using Role-conditioned Advantage Estimation to update a shared, role-conditioned LLM policy.
Self-play on zero-sum games generates an infinite curriculum, forcing the shared policy to continuously adapt and develop transferable reasoning skills without human supervision or domain-specific data.

Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Agent.xpu: introduces an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs, with Offline Model Compilation and Warmup (Prepares LLM model), Online Workload-Aware Scheduling (Manages runtime execution), and Hetero-SoC Hardware Layer (Underlying hardware).
The system uses offline profiling to build a Heterogeneous Execution Graph (HEG) and annotate Elastic Kernels for online scheduling.
The online scheduler employs a Dual-Queue Architecture, Task Decomposition and Dispatch, XPU Coordinator, Fine-Grained Kernel-Level Preemption, Slack-Aware Kernel Backfill, and Contention Mitigation to manage reactive and proactive tasks on CPU, iGPU, and NPU with Shared Memory.

Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Auto-TA: introduces a fully automated LLM pipeline for thematic analysis, with Generation Agents (Initial processing) including Coder Agents with Identities (Generate initial codes) and Theme-Generation Agents (Cluster codes, generate themes), a Feedback Agent (Evaluate, refine themes), and optional Reinforcement Learning (optional) (Optimize themes via feedback) involving Human Raters (Provide feedback for RL) and an RL Trainer (Update policy).
The framework processes clinical narratives end-to-end, eliminating the need for manual coding or full transcript review.
Specialized LLM agents collaborate to enhance theme quality and alignment, with optional RLHF improving thematic relevance based on human feedback.

LLM Agents Are the Antidote to Walled Gardens

Universal Interoperability: introduces LLM Agents (Understand text/code, interact external tools/web), Agent-friendly interfaces (Provide metadata for agent interaction), Security by design (Mechanisms for agent permissions/safety), and Ecosystem infrastructure (Protocols, standards for agent interaction), proposing LLM agents enable seamless data exchange and workflow coordination between digital services via AI-mediated adapters.
This approach aims to reduce integration effort and cost by allowing agents to translate formats and interact with interfaces, overcoming traditional technical and strategic barriers.
Establishing foundational infrastructure for agent-friendly interfaces, security, and ecosystem protocols is crucial to mitigate risks and ensure robust, secure, and effective interoperability.

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents

R2A2 (Reflective Risk-Aware Agent Architecture): introduces a modular framework integrating safety, alignment, and risk-awareness into LLM agent cognitive loops.
The architecture includes components for perception, memory, reasoning, planning, reflection, risk simulation, and action filtering.
Grounded in Constrained Markov Decision Processes, R2A2 enables risk-aware planning and constraint-sensitive execution for autonomous agents.

Leveraging a Multi-Agent LLM-Based System to Educate Teachers in Hate Incidents Management

ARISE (Agent Resource for Incident Support and Education): introduces a multi-agent LLM-based system with Manager Agent, Student Agents, Advisory Agents, RAG Module, Conversational Interface, and Feedback Mechanism, designed to educate teachers in hate incident management through realistic simulations.
The system uses persona modelling and retrieval-augmented generation to provide diverse perspectives and contextual information for analyzing hate speech incidents.
Teachers interact with the system via a chat interface to describe incidents and receive structured analysis, potential escalation risks, and intervention strategies.

A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications

LLM-based Automated Program Repair: introduces a taxonomy with Base LLMs (Core models), Fine-tuning (Adapt LLM weights), Prompting (Single query frozen LLM), Procedural (Scripted multi-step workflow), and Agentic (LLM controls workflow) paradigms, enhanced by Retrieval-Augmented Generation (External knowledge augmentation) and Analysis-Augmented Generation (Program analysis augmentation).
This survey categorizes 63 recent systems, clarifying design trade-offs and challenges across different approaches.
The paper outlines research directions to advance reliable and efficient LLM-based APR.

DABstep: Data Agent Benchmark for Multi-step Reasoning

AI Agent on DABstep: introduces a benchmark evaluating AI agents on multi-step data analysis tasks, comprising Agent (AI model solving task), Environment (Context, data, tools) with Environment/Datasets (Structured data files), Environment/Docs (Unstructured documentation), and Environment/Code Execution (Code execution tool), interacting via Question (Task input), Answer (Task output), State (Agent's internal state), and Code/Actions (Agent's generated steps).
The benchmark features over 450 real-world financial data analysis tasks requiring multi-step reasoning, code execution, and integration of structured and unstructured data.
Evaluation uses an objective factoid-based scoring method, revealing a significant performance gap for current agents on complex tasks.

Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models

Agent4S: introduces a five-level classification for LLM-driven agents to automate scientific research, featuring an Agent for Science, Memory, Model Context Protocol (MCP), Tools/External Agents, Reasoning Frameworks, and A2A Protocol.
The framework outlines a roadmap from automating single tools (L1) and complex pipelines (L2) to intelligent single-flow research (L3) and lab-scale autonomy (L4), culminating in cross-disciplinary multi-agent collaboration (L5).
Agent4S positions agents as productivity tools transforming scientific discovery by addressing the inefficiency of existing research paradigms and integrating AI into the entire research workflow.

PokéAI: A Goal-Generating, Battle-Optimizing Multi-agent System for Pokémon Red

PokéAI: introduces a text-based multi-agent LLM framework, with Planning Agent (Generates tasks), Execution Agent (Carries out tasks), Critique Agent (Evaluates task outcome), Long-term Memory (Stores game state, context), Passive Battle Module (Handles in-game battles), and Active Tool Selection (Navigation, Conversation tools), designed to autonomously play Pokémon Red.
The system operates in a closed loop where the Planning Agent generates tasks, the Execution Agent performs them, and the Critique Agent verifies completion.
A key component, the Passive Battle Module within the Execution Agent, demonstrates performance comparable to an experienced human player in battle scenarios.

Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs

Personality-aligned LLM agents: introduces a method using Human Personality Profiles and Personality Assignment to create LLM Agents that perform a Headline Evaluation Task, generating LLM Accuracy Ratings, which are assessed using Evaluation Metrics.
The research evaluates whether LLM agents conditioned on Big-Five personality profiles can replicate human susceptibility patterns to misinformation.
The study finds partial replication of human trait-misinformation associations, highlighting both the potential and limitations of LLMs for behavioral simulation.

Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

AutoDefense: introduces a multi-agent LLM defence framework, evaluated in 1-, 2-, and 3-agent configurations, including Coordinator (manages agents), Intention Analyzer (evaluates response intent), Prompt Analyzer (infers original query), and Judge (determines response safety) components, designed to protect LLMs from jailbreak attacks by analyzing responses.
The study evaluates the framework's effectiveness against various jailbreak attacks and compares performance across different agent configurations using metrics like Attack Success Rate, False Positive Rate, and False Negative Rate.
Results indicate that increasing agents can reduce false negatives but may increase false positives, suggesting no single optimal configuration and highlighting challenges in evaluating ethically ambiguous content.

Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

TAIRA (Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent): introduces a novel thought-augmented interactive recommender agent system featuring a Manager Agent, Executor Agents, and Thought Pattern Distillation to handle complex user intents.
The Manager Agent orchestrates tasks and plans subtasks using Thought Patterns and Hierarchical Planning, while Executor Agents like Searcher, Item Retriever, Task Interpreter, and Interactor execute specific functions.
Thought Pattern Distillation extracts high-level planning guidance from agent and human experiences to enhance the system's reasoning and generalization capabilities.

Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench

Methodology for LLM-Based Code Generation and Security Evaluation: introduces a process using SWE-bench Dataset (Provides issues/PRs) and GitHub (Source of data) to evaluate patches generated by a Standalone LLM (Generates patches) and Agentic LLM Frameworks (Generate patches), employing Static Analysis Tools (Detect vulnerabilities) and Majority Vote (Filters vulnerabilities) for vulnerability detection, followed by an Empirical Study (Analyzes results).
The study compares the security of LLM/agent-generated patches to developer-written patches on real-world issues from the SWE-bench dataset.
Findings indicate that LLM/agent-generated patches introduce significantly more vulnerabilities, often linked to code and issue characteristics like broader edits and missing contextual information.

L0: REINFORCEMENT LEARNING TO BECOME GENERAL AGENTS

L0 (L-Zero): introduces a scalable, end-to-end training pipeline for general-purpose agents, featuring the NB-Agent scaffold, LLM, Python Environment (Jupyter Kernel), Notepad, Context Watcher, Predefined Tools, AgentRL framework, Training Engine (FSDP), Inference Server (SGLang), Agent Workers, Single Controller, Agentic Policy Gradient, Agentic Reward, Dynamic Sampling, and Brwap Sandbox.
The NB-Agent operates in a "code-as-action" paradigm within an interactive Python environment, using an LLM to generate reasoning and code executed in a Jupyter kernel.
The AgentRL framework trains the NB-Agent using a verifiable reward model and a scalable infrastructure for parallel multi-turn rollouts, enabling robust problem-solving skills via reinforcement learning.

EPITOME: PIONEERING AN EXPERIMENTAL PLATFORM FOR AI-SOCIAL SCIENCE INTEGRATION

Epitome: introduces, "Epitome, an experimental platform for AI-social science integration, includes Foundation Model Layer (Anchors platform with LLMs), Complex Application Development Layer (Accelerates experimental insights translation), Human-AI Collaborative Experimental Environment Layer (Supports human-AI interactions), Experimental Randomization Experimental Intervention Layer (Operationalizes experimental rigor), Data Visualization Data Collection Layer (Provides unified data cockpit), Canvas-Based Interactive Experimental Design (User-friendly visual experiment design), Experiment Management System (Manages experiments), User Management System (Manages users), My Bot (Intelligent dialogue module), My Chatroom (Multi-agent interactive chatroom), Town Simulation (Multi-agent virtual environment), Workflow Auto-Planning Algorithm (Automates complex workflows), Low-Code Development Platform (Dify) (Enables low-code development), My Materials (Upload intervention materials), My Questionnaire (Supports data collection formats), and Data Acquisition Cockpit (Integrates data visualization/collection), where "Epitome provides a comprehensive platform for designing, implementing, and analyzing social science experiments involving LLMs."
The platform features a five-layer functional framework and several key modules to bridge methodological gaps in human-AI interaction research.
Its canvas-based interface and integrated tools enable researchers to easily design complex experimental scenarios and automate data collection and analysis.

29th June 2025

Do LLMs Dream of Discrete Algorithms?

AI Agent: introduces a neurosymbolic approach augmenting LLMs with logic-based reasoning and modular tools, structured by MVC, enabling decomposition and orchestration.
The AI Agent architecture includes an Agent Core for orchestration, Memory for information storage, a Planner guided by logic reasoning, and various Tools for specific tasks.
This hybrid approach enhances reliability and interpretability for multi-step reasoning tasks by combining probabilistic LLMs with formal logic systems.

ATGen: A Framework for Active Text Generation

ATGen: introduces a comprehensive framework bridging active learning with text generation tasks, enabling AL-empowered annotation using human or LLM-based agents.
The framework provides a unified platform for implementing and benchmarking AL strategies tailored to NLG tasks.
- It includes a web GUI, various AL strategies, support for LLM integration, efficient model tools, evaluation metrics, and a benchmarking platform.

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Multi-Agent Public Goods Game Simulation: introduces, "a simulation framework", with Environment (Central coordinator), Institutions (Rule frameworks), and Agents (Autonomous decision-makers), where "the framework models LLM agents navigating a public goods dilemma with institutional choice and norm enforcement".
The simulation includes two types of Institutions, Sanctioning and Sanction-Free, allowing agents to choose environments with or without costly norm enforcement mechanisms.
Agents make decisions on institution choice, contribution, and sanctioning based on their history and anonymized group data, with their reasoning captured for analysis.

From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows

LLM-Powered AI Agent Communications: surveys threats in these systems, including Agent A (MCP Host), Agent B (MCP Host), MCP Server, A2A Server, Local Data Source, Remote Service API, Agent Framework (Large Language Model), A2A Client, A2A protocol, Web Browser - User, and Public Knowledge Source components.
The paper introduces a unified, end-to-end threat model categorizing over thirty attack techniques across input manipulation, model compromise, system/privacy, and protocol vulnerabilities.
This work provides a comprehensive reference for designing robust defenses and establishing best practices for resilient LLM-agent workflows.

AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks

AURA: introduces the first open-source, speech-to-speech task-oriented agent combining reasoning and tool use, featuring UI (user interface), ASR Module (speech recognition), TTS Module (text-to-speech), Dialog Processing Unit (processes dialogue) with Controller (central orchestrator), Agent (interleaves reasoning action), Actions (executable operations), Observation (environmental feedback), Dialog State Tracking (tracks dialogue state), and State (system memory) including Action-Observation History (action observation sequence), Conversation History (filtered chat history), and Dialog State (structured dialogue info), an LLM Server (hosts language model) with LLM (language model), Inference Engine (memory efficient inference), and ReAct Response Format (structured LLM output), and External APIs (Tools) (real-world services).
The system employs a cascaded architecture and integrates a ReAct-style agent to manage multi-turn dialogue and dynamic tool invocation for complex, goal-driven tasks.
AURA supports tools like calendar booking, contact lookup, web search, and email, demonstrating strong performance on VoiceBench and human evaluations for real-world task execution.

Voyager Vision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

Voyager Vision: introduces a multimodal open-ended learning system for Minecraft, with a curriculum agent (proposes next task), action agent (generates action code), critic agent (verifies task success), skill library (stores successful solutions), environment (Minecraft game world), visual inputs (agent's POV screenshot), textual inputs (environment data/status), and LLM (underlying multimodal model).
The system extends the original Voyager framework by incorporating visual inputs (screenshots) alongside textual inputs to enable building tasks in addition to resource gathering.
The agents iteratively interact with the Minecraft environment, using multimodal inputs and self-verification to learn and perform tasks, storing successful code in the skill library.

Benchmarking Deep Search over Heterogeneous Enterprise Data

HERB (Heterogeneous Enterprise RAG Benchmark): introduces, a new benchmark for evaluating RAG systems on deep search over heterogeneous enterprise data, with Data Sources, Product Lifecycle Workflows, Enterprise Query Types, Workflow-guided Synthesis, LLM Simulation, Artifacts, Queries, Answerable QA Pairs, and Unanswerable Queries.
The benchmark features a synthetic data pipeline simulating enterprise workflows to generate interconnected artifacts and realistic multi-hop questions with guaranteed ground-truth answers.
It includes noise and unanswerable queries to stress-test RAG systems' ability to handle complex, dispersed knowledge and identify missing information.

Integrating Large Language Models in Financial Investments and Market Analysis: A Survey

MarketSenseAI: introduces an AI-driven framework utilizing GPT-4 for stock selection and portfolio management, with news summarizer, financial fundamentals analyzer, stock price dynamics module, macroeconomic environment summarizer, and decision-making layer components.
This framework integrates diverse data sources and applies CoT and ICL techniques for investment decision-making.
The decision-making layer uses GPT-4 as an expert analyst to produce actionable investment signals with explanations.

28th June 2025

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Interlocutor Awareness Evaluation Setup: introduces a systematic evaluation of LLMs' ability to identify and adapt to conversational partners, utilizing LLMs acting as Identifier, Target, Sender, Solver, Player, Judge, Jailbreaker, and an Interpreter model.
The evaluation assesses LLM interlocutor awareness across reasoning patterns, linguistic style, and alignment preferences.
Case studies demonstrate the impact of this awareness on multi-agent cooperation, alignment, and safety.

Jan-nano Technical Report

Jan-nano: introduces Jan-nano, with Jan-nano (Language model), Multi-stage RLVR system (Training methodology), MCP (Tool integration protocol), Local RAG Server (Simulated search/scrape), Tools (Web search/scrape functions), which is a 4B parameter language model specialized for tool use and information retrieval, trained using a novel multi-stage RLVR system.
The training utilizes a local RAG server to simulate search and scrape tools, completely eliminating reliance on next token prediction training.
Evaluation and deployment leverage the Model Context Protocol (MCP) for flexible tool integration and agentic capabilities, demonstrating strong performance on tool usage tasks.

DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

DICE-BENCH: introduces a framework to evaluate LLM function-calling in multi-round, multi-party dialogues, utilizing Tool Collections, Tool Graph Construction, Scenario Configuration, Dialogue Generation with Parameter Generation and Dialogue Simulation via a Multi-Agent System (Agents, Orchestrator), processed through a Validation Pipeline (Automatic Evaluation, Rule-Based Filtering, Criteria-Based Filtering), and quantified by the DICE-SCORE metric.
The framework generates realistic function-calling datasets by synthesizing conversations based on tool dependencies and distinct agent personas.
DICE-SCORE measures the dispersion of tool-related information across dialogue turns, correlating with task difficulty for LLMs.

Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems

RAG and Agent Based Dialog Systems: introduces finetuning LLMs with domain-specific data and external knowledge (KAFT) within RAG and agent architectures, including Retriever (retrieves knowledge), Generator (LLM) (generates response), Decision Maker (decides search), and API Calling (calls search APIs).
The RAG system architecture comprises a Retriever and a Generator (LLM), while the agent system architecture includes a Decision Maker, API Calling, and a Generator (LLM).
KAFT is applied to the Generator (LLM) in both RAG and agent systems and the Decision Maker in the agent system to improve the utilization of external knowledge.

Memory as a Service (MaaS): Rethinking Contextual Memory as Service-Oriented Modules for Collaborative Agents

MaaS (Memory as a Service): introduces a service-oriented perspective for contextual memory in LLM-based agent systems, proposing a dual architecture with Memory Containers, a Memory Routing Layer, and a Fine-grained permission control mechanism to enable governable cross-entity memory sharing.
The framework decouples contextual memory from its local state, encapsulating it as independently callable, dynamically composable, and finely governed service modules.
MaaS aims to dismantle memory silos and support complex, long-term collaboration across diverse entities while rigorously respecting the private nature of memory assets.

FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets

FairMarket-RL: introduces a multi-agent reinforcement learning framework for peer-to-peer markets, incorporating a Large Language Model (LLM) as a real-time fairness critic to guide agent rewards.
The framework utilizes Independent Proximal Policy Optimization (IPPO) for agent training, blending raw economic rewards with LLM-generated fairness scores (Fairness-To-Buyer, Fairness-Between-Sellers) via a scheduled shaping mechanism.
FairMarket-RL demonstrates improved fairness and efficiency in simulated P2P energy trading, achieving high demand fulfillment and balanced profits by replacing static rules with dynamic LLM feedback.

27th June 2025

Knowledge-Guided Multi-Agent Framework for Automated Requirements Development: A Vision

KGMAF: introduces a knowledge-guided multi-agent framework for automated requirements development, composed of Agents (LLM-based entities) and an Artifacts Pool (Central artifact repository).
The agents collaborate by monitoring and interacting with the artifacts pool, which stores intermediate and final requirements artifacts.
Each agent is equipped with specific functionality, predefined actions, planning mechanism, and injected knowledge to perform requirements tasks.

URSA: The Universal Research and Scientific Agent

URSA (The Universal Research and Scientific Agent): introduces a scientific agent ecosystem for accelerating research tasks, consisting of a Planning Agent (Breaks down problems), Execution Agent (Carries out tasks), Research Agent (Gathers online info), Hypothesizer Agent (Generates hypotheses), ArXiv Agent (Summarizes research papers), LLMs (Backend models), LangGraph (Agent framework), DuckDuckGo Search Tool (Performs web search), Web Scraping/Parsing Tool (Extracts web content), Command Line Tool (Executes system commands), Write Code Tool (Writes code files), Run Physics Code Tool (Executes physics simulations), ArXiv Search Tool (Searches ArXiv API), and Vision Model (Processes images).
The framework utilizes a set of modular, composable agents coupled with tool use to hypothesize, plan, and execute research tasks, building on large language model capabilities.
URSA demonstrates the potential for agentic AI to address scientific problems of varied complexity, including leveraging advanced physics simulation codes for design automation.

REXBENCH: Can coding agents autonomously implement AI research extensions?

REXBENCH: introduces a benchmark for evaluating LLM agents' ability to implement research extensions, with Input (Research paper, Codebase, Task instruction), Agent Execution (LLM Agent, Patch file generation), Agent Evaluation Infra (Virtual Machine, Task Execution, Evaluation Metrics), and Evaluation (Results from Experiment, Final Success Rate calculation) components, where the paper evaluates LLM agents on realistic research extension tasks using an automatic evaluation infrastructure.
The benchmark consists of 12 tasks based on existing research papers and codebases, requiring agents to implement novel extensions and produce code changes.
An automatic evaluation infrastructure executes the agent-generated code in controlled virtual machines and assesses performance using metrics like File Recall, Execution Success Rate, and Final Success Rate, revealing that current agents struggle with these tasks.

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Dyadic Motion Models: introduces a framework for generating dyadic audiovisual motion, including Speech Tokenizer (processes audio), Face & Body Feature Extractor (processes user visual), Dyadic Motion Model (generates motion features), Speech Model (provides LLM features), Valence Adapter (maps to valence codes), Arousal Adapter (maps to arousal codes), Gesture Adapter (maps to gesture codes), Face Adapter (maps generic to personalized face features), Body Adapter (maps body features to avatar rig), 3D Full-Body Codec Avatar Decoder (renders 3D avatar), Gaussian Splatting (3D rendering technique), and Linear Blend Skinning (deforms avatar mesh).
The framework utilizes dyadic audio and optional user visual input to generate intermediate face and body motion features.
LLM integration via adapters enables controllable emotion and gesture generation, while adapters and a 3D decoder facilitate photorealistic avatar rendering for interactive agents.

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

AI Research Agent: introduces, "The Automated LLM Speedrunning Benchmark", with LLM (Large Language Model), Search Scaffold (Iteratively uses LLM), Coder (Generates/modifies code), Executor (Runs code), Analyzer (Summarizes execution results), Knowledge (External information source), and History (Record of attempts), evaluating the ability of AI agents to reproduce NanoGPT speedrun improvements.
The benchmark tasks agents with reproducing successive speedrun records, providing the previous record's script and optional hints in various formats.
The AI research agent, composed of an LLM and a search scaffold, attempts to reproduce the record, and its performance is measured by the fraction of speedup recovered and code similarity.

Exploring Modularity of Agentic Systems for Drug Discovery

smolagent framework: evaluates the modularity of LLM-based agentic systems for drug discovery, with Code Agent (Writes and executes code), ToolCalling Agent (Uses external tools), LLM (Backbone language model), System prompt (Agent instructions), Tools (Cheminformatics functions), LLM Judge (Evaluates agent answers), where the system uses different agents, LLMs, and prompts to answer cheminformatics questions, evaluated by an LLM-as-a-judge.
The study compares the performance of CodeAgent and ToolCallingAgent types, seven different LLMs, and three system prompts on a set of 26 cheminformatics questions.
Performance is assessed using an LLM-as-a-judge system that scores agent answers based on expected answers, highlighting the dependence of performance on LLM, agent type, and prompt.

Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism

TWONs (Twins of Online Social Networks): introduces a formal framework for simulating social networks with Agents (Social media users) having Agent State (Agent's discourse history) and Communicative Behavior (Agent generates messages), interacting via Network Mechanics (Adapts incoming messages), focusing on Imitating User Behavior (Estimate agent behavior function) including Imitating Posting Behavior (Model content generation), Imitating Replying Behavior (Model reply generation), and Estimating Replying Likelihood (Predict reply probability) using LLMs (Basis for agents) with Fine-Tuning (Adapting LLMs) and a BERT-based Encoder (Embeds text for likelihood).
The paper empirically tests LLM-based imitation of user behavior on X (formerly Twitter) in English and German, benchmarking empirical realism.
Findings suggest fine-tuning and language-specific considerations are crucial for achieving realistic social simulations with generative agents.

More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

Tool-Integrated LLM Agent: introduces an evaluation of the stability of tool-integrated LLM agents, focusing on vulnerabilities during the tool invocation process related to tool documentation, tool usage hallucination, and tool response attacks.
The study investigates how internal and external factors impact agent performance and stability when interacting with external tools using the ReAct framework.
Experiments reveal that agents are highly susceptible to errors at each stage of tool invocation, with open-source models generally more vulnerable than proprietary ones.

CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design

CAL-RAG (Retrieval-Augmented Multi-Agent Generation): introduces a framework for content-aware layout generation using a Layout Recommender Agent (Suggests initial layout), Layout Generation Tool (Creates visual representation), Grader Agent (Evaluates layout quality), and Feedback Agent (Provides refinement feedback).
The framework operates iteratively, retrieving relevant layout examples, proposing structured element placements, evaluating the generated layout based on visual metrics, and providing targeted refinements.
This multi-agent system combines retrieval augmentation with agentic reasoning to achieve scalable, interpretable, and high-fidelity automated layout generation.

ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation

ARAG (Agentic Retrieval-Augmented Generation): introduces a multi-agent framework for personalized recommendation, integrating a User Understanding Agent (Summarizes user preferences), Natural Language Inference Agent (Evaluates semantic alignment), Context Summary Agent (Summarizes NLI-filtered evidence), and Item Ranker Agent (Generates ranked list) to refine context retrieval and item ranking.
The framework leverages specialized LLM-based agents that collaborate in a blackboard-style system to process user context and candidate items.
This agentic approach enhances context awareness, semantic grounding, and personalization in recommendation systems by decomposing the task into distinct reasoning steps.

SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

SPAZER: introduces a VLM-driven agent for zero-shot 3D visual grounding, integrating 3D spatial localization and 2D semantic verification through a progressive reasoning process with 3D Holistic View Selection (Generates, selects optimal 3D view), Candidate Object Screening (Filters, ranks potential objects), 3D-2D Joint Decision-Making (Integrates 3D/2D for final grounding), and VLM (Core reasoning, decision-making engine) components.
The approach leverages holistic 3D rendered views for global spatial context and incorporates retrieval-augmented candidate screening for enhanced robustness.
SPAZER performs 3D-2D joint decision-making by combining information from selected 3D views and relevant 2D camera images to identify the target object.

A LARGE LANGUAGE MODEL-EMPOWERED AGENT FOR RELIABLE AND ROBUST STRUCTURAL ANALYSIS

LLM-empowered Agent: introduces a framework for reliable and robust structural analysis by reframing the task as code generation, utilizing an LLM (Generates code) guided by a Prompt Engineering Layer (Constructs prompt) with structured Prompt Design (Structures prompt) components including Role specification (Assigns persona), Chain of thought (Guides reasoning), A complete example (Provides full example), Function usage examples (Shows code usage), and Prescriptive Constraints (Enforces rules), and integrating external tools like a Code Execution Tool (Runs code) and a Visualization Tool (Visualizes results).
The Prompt Engineering Layer structures the input to the LLM using a detailed template to improve code generation accuracy and domain alignment for structural analysis problems.
The agent automatically executes the generated OpenSeesPy code and visualizes results using OpsVis, providing a reliable and interpretable workflow for automating structural analysis.

26th June 2025

CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation

CitySim: introduces a large-scale urban simulation framework using LLM-powered agents with Persona Module (Demographics, traits, habits), Memory Module (Temporal, reflective, spatial), Belief Module (Updates POI beliefs), Needs Module (Tracks, prioritizes needs), Long-Term Goal Module (Forms, revises aspirations), Perception Module (Observes environment, reacts), Planning Module (Generates daily schedules), Place Selection Module (Determines activity location), Vehicle Selection Module (Selects transport mode), and Social Module (Manages social interactions) to model human-like behavior.
The framework enables agents to generate realistic daily schedules and long-term plans through recursive, value-driven planning, balancing mandatory activities, personal habits, and situational factors.
CitySim agents are equipped with spatial and temporal memories to recall experiences, form beliefs about places, and adapt future decisions, demonstrating closer alignment with real human behavior than prior work.

MobiVerse: Scaling Urban Mobility Simulation with Hybrid Lightweight Domain-Specific Generator and Large Language Models

MobiVerse: introduces a hybrid framework for urban mobility simulation, combining a Domain-Specific Generator for base activity chains with an LLM Activity Chain Modifier for context-aware adaptation, integrated within a Visualized Simulation Environment using SUMO.
The framework utilizes a SUMO Controller for simulation execution and data collection, a Trajectory Viewer for visualization, and global data stores (Road Network, POI Info, Agent Info) for system-wide access.
Supporting components like the PromptManager, RoadClosureHandler, and EventHandler manage LLM interactions and specific environmental events to enhance behavioral realism and scalability.

SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

SEEA-R1 (Self-Evolving Embodied Agents-R1): introduces a reinforcement fine-tuning framework for embodied agents, featuring Policy Model (Predicts actions), Reward Model (MGRM) (Predicts task outcomes), Data Evolution (MCTS) (Generates experience), Model Evolution (Tree-GRPO) (Updates models), Environment (Provides observations/rewards), and Experience Dataset (Stores interaction data).
The framework drives continuous improvement through iterative Data Evolution and Model Evolution cycles, using MCTS for experience generation and Tree-GRPO for policy updates.
It utilizes a Multi-modal Generative Reward Model (MGRM) to provide dense, generalizable reward signals, reducing dependence on sparse environment rewards.

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

Proactive Alignment Framework: introduces a method that simulates long-term societal consequences of LLM advice using World Modeling and an Event Scripting Model to generate Feedback, which is then used by an Improver to refine responses.
The framework explores causal event trajectories via Event Trajectory Search, identifies affected population Strata, and generates Agent Feedback for these groups.
This approach enhances LLM risk awareness, leading to Refined Responses that are safer, achievable through inference-time refinement or offline Realignment Training.

ParEval-RepO: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

LLM-based translation techniques: introduces PAREVAL-REPO, a benchmark suite for evaluating repository-level HPC translation using Non-agentic method (file-by-file translation), Top-down agentic method (multi-agent system), and SWE-agent (autonomous coding agent).
The Top-down agentic method comprises a Dependency agent (determines file dependencies), Context agent (manages translation changes), Chunk agent (splits large files), and Translation agent (translates code chunks).
Evaluation metrics (assess translation quality) and Error analysis (identifies translation errors) are used to assess various LLMs and techniques, highlighting challenges in build system generation and cross-file dependencies.

LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Multi-Agent Framework: introduces LLM agents (ContextAgent, ParameterAgent, ValidationAgent, SimulationAgent, SuggestionAgent) collaborating within a GroupChat environment, leveraging IDAES simulation, for chemical process optimization.
The framework operates in two phases: autonomous constraint generation by the ContextAgent followed by iterative optimization guided by the other agents.
This approach addresses the constraint definition bottleneck in traditional optimization by autonomously inferring operating bounds from minimal descriptions.

LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Multi-Agent Framework: introduces LLM agents (ContextAgent, ParameterAgent, ValidationAgent, SimulationAgent, SuggestionAgent) collaborating within a GroupChat environment, leveraging IDAES simulation, for chemical process optimization.
The framework operates in two phases: autonomous constraint generation by the ContextAgent followed by iterative optimization guided by the other agents.
This approach addresses the constraint definition bottleneck in traditional optimization by autonomously inferring operating bounds from minimal descriptions.

FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

FaSTA* (Fast-Slow Toolpath Agent): introduces a neurosymbolic agent for multi-turn image editing, including LLM (high-level planning / reasoning), VLM (quality checking), Subroutine Rule Table (learned subroutines / rules), and Online Subroutine Learning Mechanism (learns / refines subroutines).
It utilizes an Adaptive Fast-Slow Execution Strategy (fast / slow planning) that attempts learned subroutines first and triggers A* Search (low-level toolpath search) as a fallback, supported by Knowledge Structures (TDG / MDT / BT) and AI Tools (image editing operations).
This method achieves substantial cost savings in execution time while maintaining image editing quality comparable to state-of-the-art baselines.

Theory of Mind in Action: The Instruction Inference Task

Tomcat (LLM-based agent): introduces Tomcat, with LLM, Common Ground, Demonstration Exemplars, Instruction Interpretation and Intention Inference, Response Generation, and Outputs components, designed to interpret indirect instructions and infer principal intentions in a collaborative task.
The framework leverages in-context learning via Demonstration Exemplars (Few-shot CoT or Commonsense Prompt) and Common Ground to guide the LLM's reasoning process.
Tomcat generates structured outputs including action plans, natural language responses, and instruction type classifications to assist a human principal in a gridworld environment.

Hierarchical Reasoning Model

HRM (Hierarchical Reasoning Model): introduces a novel recurrent architecture with an Input Network (converts tokens to vectors), Low-level (L) Module (rapid, detailed computations), High-level (H) Module (slow, abstract planning), Output Network (transforms states to probabilities), Adaptive Computational Time (ACT) (dynamic halting strategy), Q-head (predicts halt/continue actions), One-step Gradient Approximation (efficient gradient computation), Deep Supervision Mechanism (multi-segment feedback), Transformer Blocks (core recurrent module architecture), Rotary Positional Encoding (enhances Transformer blocks), Gated Linear Units (enhances Transformer blocks), RMSNorm (layer normalization variant), and Post-Norm Architecture (improves stability), designed to achieve significant computational depth and efficiency for complex reasoning tasks.
Inspired by hierarchical and multi-timescale brain processing, HRM employs two interdependent recurrent modules that collaborate to solve tasks in a single forward pass without explicit intermediate process supervision.
With only 27 million parameters and 1000 training samples, HRM achieves high performance on challenging reasoning tasks like Sudoku and maze navigation, outperforming larger LLMs and Chain-of-Thought methods.

25th June 2025

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

GPU Kernel Scientist: introduces an automated, iterative framework for GPU kernel optimization, including Population (Stores kernels and performance data), LLM Evolutionary Selector (Selects kernels for iteration), LLM Experiment Designer (Designs optimization experiments), LLM Kernel Writer (Generates modified kernel code), and Benchmarking Platform (Evaluates kernel performance).
The framework leverages large language models across three core stages to iteratively refine GPU kernels based on performance feedback from an external evaluation system.
This LLM-driven approach aims to bridge knowledge gaps and accelerate kernel optimization, particularly in environments with limited documentation or tooling.

Poster: Enhancing GNN Robustness for Network Intrusion Detection via Agent-based Analysis

LLM Mitigation Pipeline: introduces an approach integrating LLM analysis into a GNN-based NIDS pipeline, including initial data processing, parameter configuration, data preprocessing, graph summarization, LLM agent analysis, and output generation for the GNN.
The pipeline employs LLM agents as simulated cybersecurity experts to analyze network graph elements and identify suspicious components before GNN processing.
This LLM-based mitigation strategy aims to enhance GNN resilience against realistic attacks like node injection by filtering or flagging malicious graph elements.

A SURVEY OF AI FOR MATERIALS SCIENCE: FOUNDATION MODELS, LLM AGENTS, DATASETS, AND TOOLS

AI4MS: introduces, "Common & Prevalent Tasks (broad application areas), Foundation Models (large pretrained models), Datasets (data collections), Tools & Infrastructures (supporting software platforms), and Successes, Limitations & Challenges, Future Directions (discussion points), where the paper surveys the landscape of AI for materials science."
The survey categorizes Foundation Models into Unimodal, Multimodal, and LLM Agents, and Datasets into Computational/Experimental and LLM Development.
It also reviews Tools & Infrastructures for Data Analysis/Management and Model Development, and discusses the current state and future directions.

The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind

DECRYPTO: introduces a multi-agent benchmark for evaluating language models, featuring Alice (Chooses hints), Bob (Guesses code), and Eve (Intercepts code) interacting over shared Keywords (Secret words), Code (Secret digits sequence), and Hints (Words for code), tracked via Hint History (Past hints) and Code History (Past codes), and evaluated using Generalist Agents (Out-of-the-box LLMs), Specialist Agents (Task-specific agents), and specific Theory of Mind Tasks (Cognitive experiments).
The benchmark is based on a language game requiring players to reason about others' knowledge and beliefs to succeed in cooperative and competitive settings.
DECRYPTO provides a platform for studying multi-agent reasoning, theory of mind, and human-AI interaction in interactive, language-based scenarios.

Memento: Note-Taking for Your Future Self

Memento: introduces a three-stage strategy, with Plan generation (Decomposes question into steps), Prolog query (Symbolic representation of steps), Definitions (Natural language predicate mapping), Database construction (Populates fact database), Prolog database (Stores extracted/verified facts), Query execution (Evaluates query for answer), and LLM (Performs tasks in stages), which decomposes complex tasks, records outcomes, and uses Prolog for structured reasoning.
The method operates in three phases: plan generation, database construction, and query execution, leveraging LLMs to generate symbolic plans and populate a Prolog database.
Memento uses Prolog queries and a dynamically constructed database of facts to answer multi-hop questions, combining symbolic structure with LLM flexibility.

Memento: Note-Taking for Your Future Self

Memento: introduces a three-stage strategy, with Plan generation (Decomposes question into steps), Prolog query (Symbolic representation of steps), Definitions (Natural language predicate mapping), Database construction (Populates fact database), Prolog database (Stores extracted/verified facts), Query execution (Evaluates query for answer), and LLM (Performs tasks in stages), which decomposes complex tasks, records outcomes, and uses Prolog for structured reasoning.
The method operates in three phases: plan generation, database construction, and query execution, leveraging LLMs to generate symbolic plans and populate a Prolog database.
Memento uses Prolog queries and a dynamically constructed database of facts to answer multi-hop questions, combining symbolic structure with LLM flexibility.

Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm

Behavior Editing: introduces steering LLM-based agents' ethical behavior by editing the agent (Pre-edit Agent) to become a Post-edit Agent, enabling both benevolent and malicious steering.
The approach frames agent behavior steering as a model editing task, allowing precise and efficient modifications to influence behavior and moral alignment.
The BEHAVIORBENCH benchmark is developed to systematically evaluate this editing approach across diverse ethical scenarios and complexity levels.

Fine-Tuning and Prompt Engineering of LLMs, for the Creation of Multi-Agent AI for Addressing Sustainable Protein Production Challenges

Multi-Agent AI System: introduces a Retrieval-Augmented Generation (RAG)-oriented system for sustainable protein production research, with a Literature Search Agent (retrieves literature), Information Extraction Agent (extracts information), Pool of Scientific Literature (literature source), User Interface (user interaction), Toxicity Analysis Module (screens for toxicity), GPT-Based LLM (agent foundation), Prompt Engineering (optimisation method), Fine-Tuning (optimisation method), and External Sentence Transformer (evaluation tool).
The study compares fine-tuning and prompt engineering as methods to optimise the performance of the information extraction agent using GPT models.
This multi-agent system aims to automate the process of retrieving and extracting key information from scientific literature to accelerate research in sustainable protein production.

An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

DeepRare: introduces an LLM-powered agentic system for rare disease diagnosis, with Central Host (Coordinates workflow, synthesizes info), Memory Bank (Stores diagnostic information, context), Agent Servers (Execute specialized tasks), Phenotype Extractor (Converts free-text to HPO), Phenotype Analyzer (Analyzes HPO, suggests diseases), Knowledge Searcher (Retrieves medical documents, web), Case Searcher (Finds similar patient cases), Genotype Analyzer (Annotates, ranks genetic variants), Disease Normalizer (Standardizes disease names), External Data Sources (Provide diagnostic evidence), Medical Literature (Peer-reviewed publications), Rare Disease Knowledge (Curated rare disease info), General Knowledge (Broad clinical resources), Case Collection (Repository of patient cases), and Gene Variant Databases (Genetic variant information) components, designed to process heterogeneous clinical inputs and generate traceable diagnostic reasoning.
The system employs a three-tier architecture comprising a central host, specialized agent servers, and diverse external data sources to facilitate complex diagnostic reasoning.
DeepRare generates ranked diagnostic hypotheses with transparent reasoning chains linked to verifiable medical evidence, enhancing interpretability and supporting clinical adoption.

SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models

SV-LLM (multi-agent assistant system): introduces a multi-agent framework for SoC security verification, with Application, Supervisor, Orchestrator, Agent, Data, and Infrastructure layers, designed to automate and enhance the verification workflow.
The system employs specialized LLM-driven agents for tasks including security Q&A, asset identification, threat modeling, test plan generation, vulnerability detection, and bug validation.
The layered architecture and agentic design aim to streamline complex verification tasks, reduce manual effort, and improve accuracy and scalability in hardware security analysis.

TAPS: Tool-Augmented Personalisation via Structured Tagging

TAPS (Tool-Augmented Personalisation via Structured Tagging): introduces a tuning-free approach for personalised tool use in task-oriented dialogue, combining an LLM (Generates response, predicts API calls), a Structured Tagging Tool (Augments data, adds tags), and an Uncertainty-based Tool Detector (Determines tool use, assesses confidence).
The framework leverages structured tagging to create an intermediate representation between natural language and API calls, enhancing argument extraction.
An uncertainty-based tool detector determines when to apply the structured tagging tool to improve performance.

Language Modeling by Language Models

Genesys: introduces an autonomous system for discovering novel language model architectures, with LMADE (Environment), Knowledge Engine (Knowledge access), Reference Library (Curated papers), External Sources (Search tools), Paper Vector DB (Vector database), Verification Engine (Verification tools), Symbolic Checker (Code analysis), Automated Trainer (Model training), Automated Evaluator (Model evaluation), Auto-Tuner (Parameter tuning), Runtime Checker (Training monitor), Evolutionary Tree (Design storage), LLM-driven Agents (Discovery agents), Designer Agents (Design creation), Proposer Agent (Proposal generation), Reviewer Agent (Proposal review), Planner Agent (Implementation planning), Coder Agent (Code writing), Observer Agent (Code review), Verifier Agents (Verification management), Generalized Autoregressive Block (Main architecture unit), Generalized Autoregressive Unit (Composable sub-unit), Ladder-of-Scales (Multi-scale verification), and Unit-based Generation (Stepwise code generation).
The system simulates the research process from ideation to verification using LLM agents and a genetic programming backbone operating on a factorized design space.
Genesys employs a Ladder-of-Scales approach for efficient verification and unit-based code generation for improved design quality and efficiency.

PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models

PSALM-V: introduces a neuro-symbolic learning system that induces symbolic action semantics in visual environments by iteratively initializing/updating problem files, sampling trajectories, executing in the environment, predicting errors, generating/updating action semantics, and using a symbolic planner for verification.
The system maintains a tree-structured belief over action semantics, refining it based on execution outcomes and predicted errors to enable reliable planning without expert definitions.
PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations.

24th June 2025

Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models

OIR (Open-Ended Instruction Relabeling): introduces a framework that leverages a Large Language Model to automatically generate open-ended instructions from collected agent trajectories, enriching training data for instruction-following reinforcement learning.
The framework uses the LLM to relabel unsuccessful trajectories by identifying accomplished subtasks, providing semantic rewards for efficient learning in sparse environments.
A prioritized instruction buffer manages the diverse, LLM-generated instructions, balancing exploration and exploitation for robust policy improvement.

QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges

QHackBench: introduces a novel benchmark dataset and evaluation framework for LLM-based quantum code generation, featuring QHack Challenges, PennyLang Dataset, Retrieval, Code Generation Agent, Test Bench, Validation & Correction Agent, Self-Reasoning, and Augmented Query components.
The framework systematically evaluates LLMs using vanilla prompting, Retrieval-Augmented Generation, and a multi-agent iterative refinement pipeline on real-world quantum coding challenges.
Results indicate RAG and multi-agent approaches can enhance performance, highlighting the importance of domain-specific context and iterative debugging for reliable quantum code generation.

Prover Agent: An Agent-based Framework for Formal Mathematical Proofs

Prover Agent: introduces an agent-based framework for formal mathematical proofs, coordinating an Informal LLM (informal reasoning), Prover Model (formal proving), Lean (formal verification), and AutoFormalizer (formalizes lemmas).
The framework generates auxiliary lemmas via informal reasoning, formalizes them, proves them, and uses verified lemmas to synthesize the final proof.
Iterative refinement based on Lean feedback is used throughout the process to ensure correctness and improve proof construction.

JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning

JoyAgents-R1: introduces a joint evolution dynamics framework for heterogeneous multi-agent systems, including a master agent (orchestrates tasks), sub-agents (specialized task execution), agent memory (stores past information), tools (external functionalities), joint evolution dynamics (training process), and joint reward function (calculates action feedback).
The framework employs a hierarchical architecture where the master agent delegates tasks to specialized sub-agents that interact with tools and memory.
Joint evolution dynamics leverages GRPO with node-wise Monte Carlo sampling and marginal benefit updating, while memory evolves adaptively using GRPO rewards.

MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration

MAM (Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis): introduces a modular, collaborative framework with General Practitioner (Initial triage/referral), Specialist Team (Domain expert agents), Radiologist (Image analysis agent), Medical Assistant (Information retrieval/summary), and Director (Orchestrator/synthesizer) agents for multi-modal medical diagnosis.
The framework decomposes the diagnostic process into specialized roles, each embodied by an LLM-based agent, enabling efficient knowledge updates and leveraging existing models.
Agents collaborate through a defined workflow involving initial triage, problem decomposition, information retrieval, diagnostic opinion generation, discussion, report synthesis, consensus, and final diagnosis derivation.

LLM-Based Social Simulations Require a Boundary

LLM-Based Social Simulations: introduces boundaries for reliable social science contributions, focusing on LLM Agents (model individual behavior), Alignment (simulated behaviors match real-world), Consistency (coherent agent behavior over time), and Robustness (reproducibility under conditions).
The paper argues that LLMs' inherent limitations, particularly lack of behavioral heterogeneity, constrain their reliability for simulating complex social dynamics.
It proposes heuristic boundaries and a checklist to guide researchers in determining the appropriate scope and claims for such simulations in social science research.

SAGE: Strategy-Adaptive Generation Engine for Query Rewriting

SAGE (Strategy-Adaptive Generation Engine): introduces a reinforcement learning framework for query rewriting that integrates a Policy Model guided by Explicit Strategic Primitives, evaluated by an Environment, and trained using a Reward Shaping Module with Strategic Credit Shaping and Contrastive Reward Shaping, enhanced by an Exploration Penalty and Proactive Exploration Prompting via GRPO Update.
The framework operationalizes expert-crafted strategies within an RL loop to steer the LLM agent towards effective query rewriting and improved policies.
Novel reward shaping mechanisms and forced exploration techniques are introduced to provide informative learning signals and counteract reward hacking.

A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects

Multi-sensor Fusion Perception (MSFP): introduces a survey of methods for embodied AI, detailing pipelines with Sensor Data, Backbone/Encoder, Features, Fusion Mechanism, and Downstream Task components.
The survey categorizes methods by fusion level (point, voxel, region, multi-level), multi-agent (Agent Communication), time-series (Temporal Fusion), and MM-LLM (LLM) approaches.
It reviews specific techniques within each category and discusses open challenges and future opportunities for MSFP in embodied AI.

A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures

Agent Communication Classification: introduces a comprehensive survey of LLM-driven AI agent communication, classifying it into User-Agent Interaction, Agent-Agent Communication, and Agent-Environment Communication, and analyzing related protocols, security risks, and defense countermeasures for each stage.
The paper details the typical LLM-Driven AI Agent Architecture comprising perception, memory, reasoning/planning, tool, and action modules, highlighting how agent communication enables collaboration and task completion beyond single LLM capabilities.
Agent-Agent Communication Architectures are categorized into CS-based, P2P-based, Hybrid, and Others based on their discovery mechanisms, while specific protocols like MCP, A2A, and AG-UI are discussed within the respective communication stages.

Adaptive Domain Modeling with Language Models: A Multi-Agent Approach to Task Planning

TAPAS (Task-based Adaptation and Planning using Agents): introduces a multi-agent framework for adaptive task planning, including Domain Generator, Initial State Generator, Goal State Generator, Planning Problem Model, Solver, Debugger, Structured Plan, Plan Abstraction, Plan in NL, Plan Executor, Action Executor Agent, Validator Agent, Memory, Critic, Agent Tools, and Robot/Execution Environment.
The framework uses specialized LLM-based agents to collaboratively generate and adapt domain models, initial states, and goals via structured tool calls and iterative refinement.
A robust planning and execution pipeline translates symbolic plans to natural language for a ReAct-style execution agent, bridging the gap to real-world robot capabilities with feedback-driven validation.

KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs

KnowMap: introduces a novel approach for LLM task adaptation by dynamically constructing a knowledge base from environmental and experiential data, equipping a larger LLM with task-specific knowledge via a fine-tuned embedding model, and utilizing an agent scaffold with planner, actuator, evaluator, and memory module.
The framework's knowledge base is divided into an environmental knowledge base representing the current environment state and an experiential knowledge base storing reusable experiences and reasoning patterns derived from trajectories.
KnowMap fine-tunes a knowledge-embedding model on data derived from both knowledge bases to enhance retrieval performance and support the agent scaffold's decision-making process.

MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications

MATE (Multi-Agent Translation Environment): introduces, a multimodal accessibility multi-agent system, with Interpreter Agent (Identifies task, redirects), TTS Expert (Text to speech), TTI Expert (Text to image), STT Expert (Speech to text), ITT Expert (Image to text), ITA Expert (Image to audio), ATI Expert (Audio to image), VTT Expert (Video audio to text), ModCon-Task-Identifier (Task type recognition model), Pre-defined Models/Functions (Perform modality conversion), and Output File Storage (Saves output files), designed to perform modality conversions based on user needs for accessibility applications.
The system uses an Interpreter Agent, powered by ModCon-Task-Identifier, to identify the user's requested modality conversion task and delegates it to one of seven specialized expert agents.
Expert agents utilize pre-defined models and functions to execute specific conversion tasks, saving the output to a designated location for the user.

NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling

NaviAgent: introduces a bilevel planning architecture for robust function calling, integrating a Multi-Path Decider (LLM-powered agent) for action selection and a Graph-Encoded Navigator (Graph-based planning) for toolchain planning on a Tool Dependency Heterogeneous Graph (TDHG) (Tool relationship graph).
The Graph-Encoded Navigator constructs and evolves the TDHG through Graph Construction (Builds TDHG), Graph Representation (Node/edge features), Graph Training (Optimizes graph), Graph Search (Finds toolchains), and Graph Evolution (Updates graph).
The Multi-Path Decider dynamically chooses actions from Direct Response (Decider action), Intent Clarification (Decider action), Tool Retrieval (Decider action), and Tool Call (Decider action) based on perceived states and Navigator output.

Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning

LLM-Based Dialogic Pedagogy Framework: proposes strategies for designing effective LLM-based conversational AI tutors, incorporating an LLM (Core conversational engine), Dialogue Strategy Engine (Guides conversation flow), Knowledge Retrieval Module (Integrates external information), Student Model (Tracks learner state), and Interaction Persona Module (Manages AI style/role).
The strategies aim to address limitations of raw LLMs, such as over-directness and lack of student modeling, by integrating pedagogical principles like Socratic questioning, scaffolding, and reflection.
The framework emphasizes aligning AI interactions with proven learning theories to create personalized, engaging, and educationally productive dialogues.

LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code

LLM-based Multi-Agent System: introduces an automated Haskell code refactoring system with agents for code analysis, strategy formulation, refactoring execution, testing, and debugging.
The system employs specialized agents like Code Context and Structure, Code Smells, Refactoring Strategy, Refactor (Expert/Lead), Testing and Validation, and Debug agents to collaboratively improve code.
This multi-agent approach aims to enhance code quality, runtime efficiency, and memory usage in functional programming codebases through structured interaction and iterative refinement.

Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System

Mem4Nav: introduces a hierarchical spatial-cognition long-short memory system with Sparse Octree (voxel indexing), Semantic Topological Graph (landmark connectivity), Reversible Token Processing (encodes/decodes memory), Long-Term Memory (lossless historical storage), and Short-Term Memory Cache (recent local context).
This system integrates fine-grained voxel indexing and high-level landmark connectivity with dual memory modules for efficient storage and retrieval of spatial observations.
The dual memory architecture, using reversible tokens for LTM and a frequency-recency cache for STM, enables agents to retain relevant experiences over extended time horizons for improved navigation.

Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Commander-GPT: introduces a modular multi-agent framework for multimodal sarcasm detection, including Input, Subtask Routing, Subtask Execution by specialized agents (Context Modeling, Sentiment Analysis, Rhetorical Device Recognition, Facial Expression Recognition, Image Summarization, Scene Text Recognition), and a Commander for result integration and final decision.
The framework decomposes sarcasm detection into six cognitively meaningful sub-tasks, dynamically routing input to the most suitable specialist agents.
Centralized coordination by the commander integrates information from activated agents for adaptive and fine-grained reasoning across modalities.

Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Skywork-SWE: introduces an automated data curation pipeline for software engineering tasks, including data collection and pre-filtering, environment setup and execution-based validation, and agent trajectory generation, used to train the Skywork-SWE Agent Model, evaluated with the OpenHands Agent Framework and Test-Time Scaling (TTS).
The pipeline generates a large-scale, high-quality dataset of GitHub issue-fix instances with executable runtime environments.
The trained model demonstrates data scaling laws in software engineering tasks and achieves state-of-the-art performance on SWE-bench Verified among open-source models.

Augmenting Multi-Agent Communication with State Delta Trajectory

SDE (State Delta Encoding): introduces a novel multi-agent communication protocol that augments natural language with LLM's hidden states by transferring token-wise changes.
The protocol involves agents built from a single LLM exchanging natural language tokens and state delta trajectories.
State delta trajectories, representing differences in hidden states, are injected into the receiver agent's transformer layers to enhance understanding.

23th June 2025

Distilling Tool Knowledge into Language Models via Back-Translated Traces

Back-translation pipeline: introduces a method to distill tool knowledge into language models by converting tool-integrated reasoning traces into natural language using a SOLVER AGENT (Generates TIR traces), SymPy Toolkit (Provides symbolic tools), TIR Trace Filtering (Selects correct traces), TRANSLATOR AGENT (Translates tool calls), JUDGE AGENT (Verifies translations), and REPHRASE AGENT (Reconstructs NL traces) to fine-tune a Student Model (Fine-tuned target model).
This pipeline generates high-quality natural language reasoning traces from tool-augmented solutions, enabling smaller models to internalize structured problem-solving patterns without requiring tool access at inference.
The approach improves performance on challenging math benchmarks by transferring symbolic computation capabilities and structured reasoning from tool-using agents to language-only models.

20th of June 2025

Kimi-Researcher End-to-End RL Training for Emerging Agentic Capabilities

Kimi-Researcher: introduces an autonomous agent with multi-turn search and reasoning.
Trained on end-to-end agentic RL using REINFORCE-algorithm.
Designs a context management ystem to retain important information while discarding unnecessary documents.

Private Training & Data Generation by Clustering Embeddings

DP Synthetic Generation: introduces a novel principled method for differentially private synthetic data generation by clustering embeddings, which includes an encoder, DP clustering, DP GMM estimation, GMM sampling, and optional DP filtering and a decoder.
The framework first transforms sensitive input data into embeddings, then privately clusters these embeddings and estimates Gaussian Mixture Model parameters to generate synthetic embeddings.
It can optionally decode synthetic embeddings into realistic synthetic images and achieves state-of-the-art classification accuracy on standard benchmarks while ensuring strong privacy guarantees.

18th June 2025

SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

SwarmAgentic: introduces a framework for fully automated agentic system generation, using Particle Swarm Optimization to explore a language-driven design space, optimizing Agentic Systems composed of an Agent Set and Collaborative Structure.
The framework iteratively refines Agentic Systems by updating Particle positions and velocities based on Fitness Function evaluation and Flaw Identification.
Velocity updates integrate Failure-Driven Adjustments, Personal Best Guidance, and Global Best Guidance to refine Agent functionality and collaboration strategies, yielding the best system as the Search Result.

Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents

LPA (LLM-based Planning Agent): introduces a system for managing complex failure analysis workflows, with Agent Core orchestrating control flow, Memory retaining information, Plan Generation creating step-by-step plans, Action Matching and Execution selecting and running tools, Feedback and Reflection adjusting plans based on results, LLM processing language and reasoning, Tools providing external system interfaces, and Data serving as external information sources.
The agent utilizes LLMs as the "brain" to decompose complex queries and resolve them through reasoning and autonomous tool use, employing ReAct or Online Replanning approaches.
The system integrates external tools like databases, search engines, and AI models to retrieve data and perform analysis tasks, supporting FA engineers.

PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection

PhishDebate: introduces a modular multi-agent LLM-based debate framework for phishing website detection, with URL Analyst Agent, HTML Structure Agent, Content Semantic Agent, Brand Impersonation Agent, Moderator, and Judge components.
The framework employs specialized agents to analyze different website aspects and coordination agents to manage a structured debate process.
This multi-agent approach aims to improve detection accuracy, interpretability, and robustness compared to single-agent methods.

The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games

Approach: introduces a framework for constructing natural language state representations for prompting LLM agents in repeated multi-agent games, implemented with LLM Agents, Game Environment, Prompting Mechanism, State Representation, LangChain, and OpenAI API.
The system evaluates LLM agent behavior in a dynamic selfish routing game by varying state representations along action informativeness, reward informativeness, and prompting style axes.
The research finds that summarized state representations, regret-based feedback, and limited information about others' actions lead to more stable, equilibrium-like agent behavior.

Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents

LPA (LLM-based Planning Agent): introduces an agent architecture for failure analysis workflows, integrating a Large Language Model for reasoning and planning, Memory for retaining information, Action Matching and Execution for tool use, and Feedback and Reflection for plan refinement, interacting with a User and the external Environment via Data and Tools.
The agent utilizes ReAct-style iterative task generation or online replanning to process complex queries and generate human-readable responses.
The implementation integrates external tools like databases and ML models, demonstrating technical feasibility and robustness in a production-like environment.

AGENTGROUPCHAT-V2 : Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

AGENTGROUPCHAT-V2: introduces a novel framework with Query Manager (Frontend, query decomposition), Task Manager (Central coordination, task flow), Group Manager (Execution, collaboration organization), Agent (Individual LLM participant), Task (Basic processing unit), Group (Collaborative work unit), and Task Forest (Hierarchical task structure) for LLM-based multi-agent systems.
The framework employs a divide-and-conquer parallel architecture, dynamic task tree decomposition, and specialized agent role assignment to address challenges in system architecture, generalizability, and performance.
Experimental results demonstrate superior performance on complex reasoning, code generation, and diverse tasks compared to existing multi-agent approaches.

RAS-EVAL: A COMPREHENSIVE BENCHMARK FOR SECURITY EVALUATION OF LLM AGENTS IN REAL-WORLD ENVIRONMENTS

RAS-Eval: introduces a comprehensive security benchmark for LLM agents, including Test Cases, Attack Tasks, Scenarios, Toolkits, Risk Management, and Evaluation Pipelines.
The benchmark supports both Real Execution and Simulated Execution of tools across JSON, LangGraph, and MCP formats.
It incorporates Failure Modes and Vulnerability Types for granular analysis and uses Evaluation Pipelines to measure task completion and attack success rates.

From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem

Agent Framework: introduces a survey of jailbreak attacks and defenses in the LLM ecosystem, with Core (Central processing unit), Planning (Task decomposition/logic), Tools (External interfaces/applications), Memory (Information management/storage), and LLM Network (Multi-agent interaction) components, where the paper reviews the evolution from LLMs to MLLMs and Agents and analyzes security challenges.
The survey categorizes jailbreak techniques by attack impact and visibility and defense strategies by response timing and technical approach.
It also details datasets and evaluation metrics used in jailbreak research and outlines future research directions.

Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs

Multi-Response Generation (MRG) and Preference-based Selection (PS): introduces a two-stage framework for open-domain dialogue response generation, where MRG generates a set of diverse responses and PS selects the best one based on human preference.
The approach leverages smaller LLMs and introduces the o2mDial dataset to explicitly capture the one-to-many property.
Empirical results show the framework enhances response diversity and quality in smaller LLMs, approaching the performance of larger models.

HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

LLM-based Embodied Agent Pipeline: studies hallucinations in embodied agents by evaluating a pipeline that takes Scene (Visual input) and Task Description (Natural language instruction), processes them via a Scene Parser (Extracts scene info) and LLM as Goal Interpreter (Generates symbolic goals) to produce LTL Goal (Symbolic task goals) for Execute the Task (Action planning/execution).
The study constructs a hallucination probing set by systematically modifying the Task Description and Scene Information inputs to introduce scene-task inconsistencies.
The research finds that LLMs struggle to reconcile scene-task inconsistencies, leading to hallucinations and failures in handling infeasible tasks.

OS-HARM: A Benchmark for Measuring Safety of Computer Use Agents

OS-HARM (Benchmark): introduces a benchmark for measuring the safety of computer use agents, featuring OS-HARM tasks, OSWorld Ubuntu VM, LLM Agent, OSWorld scaffolding, Agent Traces, and LLM Judge.
The benchmark evaluates LLM-based agents on tasks involving deliberate user misuse, prompt injection attacks, and model misbehavior within a realistic OSWorld environment.
An automated LLM Judge evaluates agent performance and safety based on recorded execution traces, including reasoning steps, screenshots, and accessibility trees.

LLM Agent for Hyper-Parameter Optimization

LLM Agent Framework and MCP: introduces an interactive framework orchestrating collaboration between the LLM Agent (comprising Profile, Memory, Planning, and Action components), Human inputs, and the Environment (WS-PSO-CM algorithm) for automatic hyper-parameter tuning.
The Model Context Protocol (MCP) defines a unified communication specification enabling the LLM Agent to interact with external systems via MCP Client and MCP Server architecture.
The framework iteratively refines hyper-parameters for the WS-PSO-CM algorithm based on prompt requirements and environmental feedback to optimize UAV trajectory and communication.

Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

Large Reasoning Model (LRM): analyzes privacy leakage in Large Reasoning Models, involving Input (Sensitive user data and scenario), Large Reasoning Model (Processes input), Reasoning Trace (Intermediate thinking steps), and Output (Final response), revealing that Reasoning Traces frequently leak sensitive user data.
Reasoning Traces, often assumed internal, are shown to be easily extractable and contain abundant sensitive data, making them a significant privacy vulnerability.
Increasing test-time compute for better utility can worsen privacy by increasing leakage in the Reasoning Trace, highlighting the need for privacy strategies targeting internal thinking.

17th June 2025

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.

Gemini 2.X model family: introduces Gemini 2.5 Pro and Flash, built on Sparse mixture-of-experts transformers (Core Architecture) with Multimodal Support (Text, Image, Video, Audio), Long Context Processing (>1M tokens), Tool Use Support (Function calls), Thinking (Inference process), and Deep Think (Reasoning approach).
The models enable next-generation agentic capabilities, demonstrated by the Gemini Plays Pokémon agent harness which includes components like Persistent Memory & Context, Goals, Action History & Summaries, Game State, Periodic Processes (Memory Summarizer, Guidance Gemini), Agentic Tools (Pathfinder, Boulder Puzzle Strategist), and Game I/O.
Gemini 2.5 models achieve state-of-the-art performance on various benchmarks, including long-context video understanding (processing up to 3 hours of video) and coding, while also undergoing extensive safety and security evaluations.

AGENTDISTILL: TRAINING-FREE AGENT DISTILLATION WITH GENERALIZABLE MCP BOXES

AgentDistill: introduces a training-free agent distillation framework, with Teacher Agent (Generates MCPs), Manager Agent (Teacher) (Coordinates tasks), Basic Image Captioner (Teacher) (Captions images), MCP Creation Module (Creates task MCPs), MCP-Box Construction (Builds MCP Box), Abstraction (Parameterizes MCPs), Clustering (Groups MCPs), Consolidation (Merges MCPs), MCP Box (Reusable task modules), Student Agent (Uses MCP Box), Manager Agent (Student) (Coordinates tasks, uses MCP Box), and Basic Image Captioner (Student) (Captions images), which transfers task-solving capabilities from large teacher agents to small student agents via reusable Model-Context-Protocols (MCPs).
The framework involves a teacher agent generating MCPs, a construction process to build a reusable MCP-Box by abstracting, clustering, and consolidating them, and a student agent that directly integrates this MCP-Box for inference.
AgentDistill enables student agents to inherit sophisticated problem-solving skills and generalize across tasks by providing a structured MCP-Box without requiring additional training or trajectory replay.

T-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

T-bench: introduces a novel benchmark for evaluating the reliability and consistency of LLM-based agents in dynamic, real-world interactions, featuring a Language Agent (interacts with users and tools), an LM-simulated User (simulates human users), API Tools (interface with databases), Realistic Databases (store domain information), and Domain-specific Policy Documents (provide rules for agent behavior), where it emulates dynamic conversations between a simulated user and a language agent using domain-specific API tools and policy guidelines.
The benchmark employs an efficient evaluation process by comparing the database state at the end of a conversation with an annotated goal state and introduces a new metric, pass^k, to assess agent behavior reliability across multiple trials.
It highlights that current state-of-the-art function calling agents struggle with complex reasoning, policy adherence, and consistency, indicating a need for more sophisticated agent architectures.

Unified Software Engineering agent as AI Software Engineer

USEagent (Unified Software Engineering agent): introduces a unified agent for software engineering tasks, with Meta-Agent orchestrates actions, Actions perform SE tasks, Task State stores shared information, and Program is the target software project.
The Meta-Agent uses a ReAct-style loop to select actions based on the current task state and action outputs.
The framework utilizes a set of modular actions encapsulating units of work and a structured task state for consensus memory among actions.

Doppelgänger Method : Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack

Doppelgänger Method: introduces a prompt-based transferable adversarial attack method to break LLM agent consistency, evaluated using the PACAT Level metric, and countered by the CAT Prompt defense.
The method demonstrates the risk of role hijacking and internal information exposure in LLM agents.
Experimental results show the attack's effectiveness and the defense prompt's ability to mitigate consistency degradation.

Automated Decision-Making on Networks with LLMs through Knowledge-Guided Evolution

LLMNet: introduces a system for automated GNN design using LLM-based agents, including Knowledge Agent (Builds, manages knowledge bases), Prior Knowledge Base (Stores task-specific knowledge), Experiment Knowledge Base (Stores experimental results), Planning Agent (Generates task plan, evaluates), Data Agent (Performs feature engineering), Configuration Agent (Configures search space), and Evaluation Agent (Fine-tunes, experiments), which leverages knowledge bases and RAG for knowledge-guided evolution.
The system employs a pipeline of specialized agents that interact with constructed knowledge bases to design and refine GNN model architectures step by step.
LLMNet demonstrates superior performance across various graph learning tasks by effectively integrating graph-related knowledge into the automated design process.

GENERATIONPROGRAMS: Fine-grained Attribution with Executable Programs

GENERATIONPROGRAMS: introduces a modular generation framework that decomposes the process into program generation by an LLM and program execution by neural modules, producing an output with sentence-level attributions from input documents.
The framework first generates an executable program plan composed of modular text operations tailored to the query, then executes this plan using neural modules like paraphrasing, compression, fusion, and extraction on retrieved document sentences.
This two-stage approach enables fine-grained attribution by tracing the program execution and linking generated content back to source sentences, enhancing interpretability and verifiability.

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

SIRI-Bench (Spatial Intelligence ReasonIng Benchmark): introduces a benchmark for evaluating VLMs' spatial intelligence using video-based 3D geometry problems, generated by an Automatic Scene Creation Engine leveraging Specialized LLM Agents to transform Original Math Problems into Realistic 3D Scenes and Video inputs for VLMs, alongside textual Questions and numerical Answers.
The Automatic Scene Creation Engine generates the benchmark data by solving geometric conditions, generating Blender Python Scripts, and refining textual inputs and outputs.
SIRI-Bench challenges VLMs to extract spatial information from video and perform complex reasoning, revealing limitations in current models compared to human performance and text-based LLMs.

LLM-Powered Swarms: A New Frontier or a Conceptual Stretch?

LLM-Powered Swarms: introduces a new paradigm for swarm intelligence using LLMs as agents, featuring LLM Agents, Multi-Agent Coordination (Interconnected agents collaborate), Client-Side Operation (Framework runs locally), LLM Access (Cloud or local models), and Prompts (Natural language instructions).
This approach contrasts with traditional rule-based swarms by trading execution speed for flexibility and higher-level reasoning capabilities.
Evaluation using Boids and ACO models highlights significant latency and resource costs compared to classical methods, suggesting potential for hybrid systems.

Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent

ECPO (Expectation Confirmation Preference Optimization): introduces a novel multi-turn preference optimization paradigm leveraging Expectation Confirmation Theory to align LLM-based conversational recommendation agents with user expectations.
The framework explicitly models user satisfaction evolution across turns using Forward Expectation Confirmation and rewrites unsatisfactory responses via Backward Expectation Derivation with a Rewriter.
ECPO is supported by AILO, an LLM-based user simulator that provides realistic feedback and performs expectation confirmation, enabling efficient turn-level preference optimization without extensive sampling.

ADRD: LLM-DRIVEN AUTONOMOUS DRIVING BASED ON RULE-BASED DECISION SYSTEMS

ADRD (LLM-Driven Autonomous Driving Based on Rule-based Decision Systems): introduces a framework with Information Module, Agents Module (Planner, Coder, Summarizer), and Testing Module, leveraging LLMs to generate and refine rule-based decision trees for autonomous driving.
The Information Module gathers scenario data, the Agents Module generates and codes driving tactics, and the Testing Module provides feedback for iterative refinement.
ADRD demonstrates superior performance, response speed, and interpretability compared to baselines by integrating LLMs with rule-based decision systems.

From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents

TIMER: introduces timely dialogue response generation, with Time Interval Prediction (predicts delay), Time-conditioned Response Generation (generates response), and Fine-tuned Dialogue Model (base language model), addressing when and what to respond based on temporal context.
The model is trained using a multi-task learning objective on a large-scale synthetic dataset derived from event knowledge graphs and LLMs.
TIMER demonstrates improved performance over baselines in predicting appropriate response delays and generating time-specific, coherent dialogue.

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

AgentSynth: introduces a scalable pipeline for synthesizing computer-use tasks and trajectories by iteratively chaining LLM-generated subtasks, executed by a Task Executor, verified by a Task Verifier, revised by a Task Reviser, proposed as follow-ups by a Follow-up Task Proposer, and summarized into final tasks by a Task Summarizer, operating within an Environment guided by a Persona.
The pipeline leverages information asymmetry, generating simple subtasks that compose into challenging long-horizon tasks, enabling controllable difficulty.
AgentSynth generates over 6,000 diverse and realistic tasks at a low cost, providing a benchmark that reveals performance gaps in current LLM agents on multi-step computer tasks.

MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment

MAS-LitEval: introduces a multi-agent system for literary translation quality assessment, with Terminology Consistency Agent (Ensures key term consistency), Narrative Perspective Consistency Agent (Verifies narrative voice alignment), Stylistic Consistency Agent (Evaluates tone rhythm style), and Coordinator (Combines agent scores feedback).
The system employs specialized LLMs within agents to evaluate distinct dimensions of literary translation quality across segmented text chunks.
The Coordinator integrates agent evaluations into an Overall Translation Quality Score (OTQS) and a detailed report, ensuring global consistency.

FormGym: Doing Paperwork with Agents

Agent Framework with FieldFinder: introduces a system for end-to-end form completion using agents equipped with tools, including a novel field localization tool.
The system evaluates Vision-Language and GUI agents on the FormGym benchmark, which includes diverse forms, user profiles, and tasks.
The FieldFinder tool assists agents by predicting bounding boxes for input fields, significantly improving text placement accuracy.

Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

CVDP (Comprehensive Verilog Design Problems): introduces a benchmark dataset and infrastructure, with Datapoint, Prompt, Context, Reference Solution, Test Harness, Testbench, Benchmark Runner, Agent Under Test, Model Under Test, Mini Repo, EDA Tools, Docker, LLM Judge, Map Feature, and Report & Logs components, designed to evaluate LLMs and agents on hardware design and verification tasks.
The benchmark includes 783 human-authored problems across 13 categories covering RTL generation, verification, debugging, and comprehension, provided in both Non-Agentic (single-turn) and Agentic (multi-turn, tool-using) formats.
The infrastructure supports Dockerized agents and test harnesses for realistic tool interaction using EDA tools, and includes an LLM judge for quality filtering of datapoints.

16th June 2025

LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning

LocationReasoner benchmark: introduces a benchmark to evaluate LLMs' real-world reasoning abilities in site selection, with Query Generation, Sandbox Environment, Datasets, In-house Tools, Execution Pathways, and Automated Verification components, evaluating Direct Code Generation, ReAct, and Reflexion approaches.
The benchmark uses curated datasets and in-house tools within a sandbox environment to test LLMs on constraint-based location search with automated verification.
Evaluation reveals current LLMs and agentic strategies struggle with complex real-world reasoning tasks, highlighting limitations in holistic and non-linear reasoning.

Discovering Temporal Structure: An Overview of Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning (HRL): introduces an overview of methods for discovering temporal structure, formalized using the options framework including option policy, option termination function, option initiation function, high-level policy, and option model function, and discusses agent architectures like Hierarchical Components, Goal Conditioned, Feudal Architecture, and Single Network.
The paper surveys methods for temporal structure discovery categorized by learning from online experience, offline datasets, and foundation models.
HRL aims to improve exploration, credit assignment, transfer, and interpretability by leveraging temporal structure in sequential decision-making problems.

How Does LLM Reasoning Work for Code? A Survey and a Call to Action

Code Reasoning Taxonomy: introduces a classification of techniques for LLM reasoning on code tasks, including Code CoT Reasoning, Execution-based reasoning, Inference Scaling, and Agentic approaches.
The taxonomy details sub-techniques such as Plan-based CoT, Self-evaluation of execution behavior, Sampling, and Agentic Workflow.
The survey highlights how these distinct reasoning strategies and their components are applied and perform on various code-related benchmarks.

Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems

Spec2RTL-Agent: introduces an LLM-based multi-agent system for automated RTL code generation from complex specifications, including Iterative Understanding and Reasoning Module, Progressive Coding and Prompt Optimization Module, Adaptive Reflection Module, and Code Optimization and Conversion Module.
The system processes unstructured specification documents, refines code generation through multiple abstraction levels, and iteratively verifies outputs.
Spec2RTL-Agent demonstrates effectiveness in generating accurate RTL code with reduced human intervention compared to existing methods.

We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems

SAFEMCP: introduces a controlled framework to examine safety issues in MCP-powered agent systems, with Agent, Backbone LLM, MCP-Servers, Attack, Defense, Passive Defense, Active Defense, Evaluation, Scenario, and Metric components.
The framework simulates third-party attacks on LLM agents interacting with external services via the Model Context Protocol (MCP).
SAFEMCP provides tools for evaluating attack effectiveness and defense strategies using various scenarios and metrics.

CAMS: A CityGPT-Powered Agentic Framework for Urban Human Mobility Simulation

CAMS: introduces a CityGPT-powered agentic framework for urban human mobility simulation, with MobExtractor (Extracts/synthesizes mobility patterns), GeoGenerator (Generates geospatial knowledge/trajectories), and TrajEnhancer (Enhances trajectories via DPO) components.
The framework leverages an urban foundation model (CityGPT) and agentic reasoning to generate realistic and plausible human mobility trajectories.
CAMS integrates urban spatial knowledge and multi-dimensional feedback for controllable and generalizable simulation without relying on external geospatial information.

--

Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning

LA-CDM: introduces a hypothesis-driven uncertainty-aware language agent system for clinical decision making, comprising a Hypothesis Agent (forms hypothesis and confidence), a Decision Agent (decides action), Shared LLM Weights (underlying language model), and a Clinical Decision Making Environment (simulates patient interaction and provides feedback).
The system is trained using a hybrid paradigm combining supervised learning for hypothesis generation and reinforcement learning for uncertainty estimation and efficient action selection.
This approach models the iterative clinical process of forming hypotheses and requesting tests to converge towards a diagnosis, improving diagnostic performance and efficiency.

Towards Pervasive Distributed Agentic Generative Al - A State of The Art

Pervasive Distributed Agentic Generative AI: surveys the state of the art in LLM-based agents deployed in pervasive computing environments, detailing the Transformer Architecture, Short-Term Memory, Long-Term Memory, Hybrid Memory, Cloud Layer, Fog Layer, and Edge Layer components.
The paper examines the architecture of LLM agents, their deployment strategies across different infrastructure layers, and evaluation methods.
It highlights challenges in deploying these agents on resource-constrained pervasive devices and proposes the "Agent as a Tool" concept.

A Game-Theoretic Negotiation Framework for Cross-Cultural Consensus in LLMS

Game-Theoretic Negotiation Framework: introduces a systematic approach for cross-cultural consensus among LLMs, with Cultural Agents, Guideline Sets, Guideline Weights, Utility Functions, Negotiation Process (PSRO-based), Meta Strategy Solver, Best Response Oracle, Regional Cultural Agents, and Consensus Evaluation Toolkit, designed to achieve fair and robust agreement.
The framework models consensus as a Nash Equilibrium and employs a PSRO-based negotiation process driven by utility functions balancing consistency, acceptance, and novelty.
Culturally aligned Regional Cultural Agents are constructed using survey data, and consensus outcomes are evaluated using perplexity-based acceptance and value self-consistency metrics.

Querying Large Automotive Software Models: Agentic vs. Direct LLM Approaches

Direct Full-Context Prompting: introduces a baseline approach where the LLM (Processes model) receives the entire Software Model File (Complete input data) along with Instructions (Guidance for LLM) and a Question (User query) to produce an Answer (LLM response).
Agent with File Tools (ReAct Architecture): presents an agentic approach where the LLM (Agent's reasoning engine) interacts with Software Model Files (Data source) via a Toolkit (External tool access) containing specific Tools (File interaction functions), communicating through Messages (Communication channel) and append observation (Tool output) to answer a User (Initiates query) Question (User query) with an Answer (Agent's response).
The study compares these two architectures for querying large automotive software models, evaluating their accuracy and token efficiency using various LLMs and a custom question dataset.

Leveraging In-Context Learning for Language Model Agents

ICL-DS (In-Context Learning with Demonstration Selection): introduces an approach for LLM agents that leverages in-context learning with dynamically selected demonstrations, including an LLM Agent (generates thoughts and actions), a Demonstration Pool (stores annotated trajectories and snippets), an Iterative Annotation Algorithm (automatically annotates tasks for demonstrations), a Demonstration Selector (retrieves relevant demonstrations), Prompt Construction (formats input for LLM), a ReAct Solver (executes tasks iteratively with reasoning), a Plan & Execute (PnE) Solver (plans subtasks and executes them), and an Environment (provides observations and executes actions).
The paper proposes an iterative annotation algorithm to automatically and efficiently create a demonstration pool of solution trajectories for agentic tasks, which are then used to improve LLM agent performance, reliability, and efficiency.
The research demonstrates that using task-level trajectory demonstrations and smaller step-level snippet demonstrations significantly boosts performance for LLM agents, enabling them to rival costlier trained agents.

MOTIVEBENCH: How Far Are We From Human-Like Motivational Reasoning in Large Language Models?

MOTIVEBENCH: a comprehensive evaluation benchmark designed to assess the extent to which LLMs (LLMs) can replicate human-like motivations and behaviors, consisting of 200 rich contextual scenarios and 600 reasoning tasks covering multiple levels of motivation.
The benchmark addresses limitations of existing datasets by providing detailed scenarios, character profiles, and reasoning tasks that mimic real-world situations, thereby enabling a more accurate evaluation of LLMs' motivational intelligence.
MOTIVEBENCH aims to provide insights into LLMs' capabilities in understanding and exhibiting human-like motivational reasoning, highlighting areas where current models fall short and suggesting directions for future research in humanizing LLMs.

MAGIC: Multi-Agent Argumentation and Grammar Integrated Critiquer

MAGIC (Multi-Agent Argumentation and Grammar Integrated Critiquer): is a framework that utilizes multiple specialized agents to evaluate distinct writing aspects, aiming to predict holistic scores and produce detailed, rubric-aligned feedback for essays.
The framework employs an orchestrator to consolidate the outputs from individual agents, which focus on specific components of argumentative writing such as argument structure, grammar, vocabulary, and comprehension.
MAGIC aims to provide greater transparency, flexibility, and extensibility compared to monolithic automated essay scoring and feedback systems.

Scaling Test-time Compute for LLM Agents

ATTS (Agentic Test-Time Scaling): explores test-time scaling strategies for language agents, including parallel sampling, sequential revision, verifiers and merging, and diversifying rollouts.
The research systematically analyzes the impact of different design strategies on agent performance, finding that scaling test-time compute improves agent capabilities.
Key findings include the importance of knowing when to reflect, the superiority of list-wise methods for verification and merging, and the positive effect of diversified rollouts on agent performance.

15th June 2025

WEREWOLF-PLUS: AN UPDATE OF WEREWOLF GAME SETTING BASED ON DSGBENCH

WereWolf-Plus: introduces a multi-model, multi-dimensional, and multi-method benchmarking platform with Werewolf Simulation (Rule-compliant environment), LLM Agents (Flexible model assignment), Role Configuration (Customizable roles), Reasoning Enhancement (Experience-Retrieval Augmentation), and Evaluation Framework (Metrics for agents) for evaluating multi-agent strategic reasoning in the Werewolf game.
The platform provides a flexible and reliable environment supporting standard and customizable game setups with various roles and flexible LLM-role assignment.
WereWolf-Plus incorporates retrieval-augmented memory for contextual compression and reflection, and introduces comprehensive quantitative evaluation metrics for different roles and players.

Mastering Da Vinci Code: A Comparative Study of Transformer, LLM, and PPO-based Agents

Transformer-based Baseline Model: introduces, with Transformer architecture (Predicts opponent tiles), Input Representation (Current game state string), and Prediction Task (Predict hidden tile values) components, a model designed to predict opponent tiles based on the current public game state.
This baseline model utilizes a Transformer architecture but is limited by its restricted access to the full game history.
Its reasoning is primarily based on the current state snapshot and explicit negative constraints from prior guesses.

SoK: The Privacy Paradox of Large Language Models: Advancements, Privacy Risks, and Mitigation

Systematization of Knowledge (SoK) on LLM Privacy: introduces a comprehensive analysis of privacy challenges in LLMs, categorizing them across training data, user prompts, generated outputs, and LLM agents, with all LLM (Large Language Model), LLM System, Training Data, User Prompts, LLM Generated Output, LLM Agent System, Main LLM Agent, Secondary Agents, External Tools, Knowledge Base, Memory, User, Service Provider-components, where the paper evaluates existing mitigation techniques and identifies research gaps.
The paper highlights how LLMs' advanced capabilities and interactive nature introduce distinct privacy risks compared to traditional AI.
It discusses various mitigation strategies for each category of privacy challenge, noting their effectiveness and limitations.

SCISAGE: A MULTI-AGENT FRAMEWORK FOR HIGH-QUALITY SCIENTIFIC SURVEY GENERATION

SciSage (Scientific Sage): introduces a multi-agent framework with Interpreter (Understand/rewrite query), Organizer (Construct outline), Collector (Retrieve/rerank papers), Composer (Generate content), Refiner (Refine final document), and Reflector (Iterative hierarchical reflection) agents for high-quality scientific survey generation.
The framework employs a reflect-when-you-write paradigm, with the Reflector agent critically evaluating drafts at multiple levels.
SciSage coordinates specialized agents through query understanding, retrieval, content generation, and iterative hierarchical reflection processes.

14th June 2025

Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics

Synthetic Socratic Debates: introduces a system using Agent (LLM with persona), Persona (6-dimensional identity profile), Moderator (Manages debate turns), Multi-Turn Debate Framework (Simulates agent interactions), Persona Modeling (Assigns identity attributes), Decision Measures (Quantify moral judgments), Persuasion Measures (Evaluate debate effectiveness), Rhetorical Strategy Evaluation (Assesses persuasion modes), and LLM-as-a-judge (Evaluates rhetorical strategies) to simulate moral debates between AI agents with distinct personas.
The system investigates how persona traits influence moral decision-making and persuasive strategies in LLMs during multi-turn debates over real-world moral dilemmas.
The research reveals that political ideology and personality traits significantly shape initial moral stances and debate outcomes, impacting persuasive success and rhetorical strategies.

Towards Building General Purpose Embedding Models for Industry 4.0 Agents

Recommender Agent (ReAct Agent with Multi-Task Embedder Tool): introduces a framework for industrial asset maintenance agents, combining a ReAct Agent (Plans and reasons), a Multi-Task Embedder (Retrieves relevant items) used as a tool, and LLM Augmentation (Enriches queries) for improved context.
The framework leverages domain-specific embeddings fine-tuned on nine industrial tasks derived from ISO documents to enhance retrieval performance for complex queries.
Ablation studies demonstrate the effectiveness of LLM query augmentation and the importance of balanced positive/negative samples for training the embedding model.

The Rise of AI Companions: How Human-Chatbot Relationships Influence Well-Being

Study Approach: investigates how human-chatbot relationships influence well-being by collecting Survey Data and Chat History Data, deriving Chatbot Companionship Measures, Interaction Intensity Measure, Self-Disclosure Measures, Human Social Support Measure, and Well-Being Measure, and analyzing them using LLM-based Text Analysis, Topic Modeling, Regression Analysis, and CFA.
The study uses a mixed-methods approach, triangulating self-report surveys, open-ended descriptions, and chat transcripts to understand chatbot companionship and its psychological associations.
Findings suggest that companionship-oriented chatbot use, especially with high intensity and self-disclosure, is associated with lower well-being, particularly for users with limited offline social support.

AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving

AgentOrchestra: introduces a hierarchical multi-agent framework with a Planning Agent (Central orchestrator) coordinating Specialized Sub-Agents (Domain-specific processing team), utilizing various tools, memory, and models for general-purpose task solving.
The framework features a two-tier architecture where the planning agent decomposes tasks and delegates sub-tasks to specialized agents equipped with domain-specific tools.
AgentOrchestra supports flexible orchestration, inter-agent communication, and adaptive role allocation, enabling robust performance on complex, multimodal tasks.

Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare

TAO (Tiered Agentic Oversight): introduces a hierarchical multi-agent framework for AI safety in healthcare, featuring an Agent Recruiter (Recruits expert agents), Agent Router (Routes query to tier), Tier 1 (Initial assessment/screening), Tier 2 (Specialized review/analysis), Tier 3 (Expert consultation/synthesis) of Medical Agents (Core assessment units), Case Escalation (Escalates to higher tiers), Intra-Tier Collaboration (Discussion within tier), Inter-Tier Collaboration (Dialogue between tiers), Final Decision Agent (Synthesizes final decision), and Human Oversight (Targeted human intervention).
The framework routes tasks based on complexity and agent roles, escalating complex or high-risk cases through tiers with automated inter- and intra-tier collaboration and role-playing.
TAO enhances AI safety through layered, automated supervision, demonstrating superior performance on healthcare safety benchmarks compared to single-agent and multi-agent baselines.

Topology-Assisted Spatio-Temporal Pattern Disentangling for Scalable MARL in Large-scale Autonomous Traffic Control

TGN-TMoE: introduces a novel MARL framework for large-scale traffic control, with Agent Observations (Raw graph data), MF Synchronization (Integrates mean-field information), Temporal Learning (Processes temporal features), Topological Processing (Extracts topological features), Spatial Learning (Processes spatial features), TMoE Module (Routes features to experts), Graph Pooling (Aggregates graph features), and Decision Making Module (Policy and value networks), designed to enhance environmental representation and agent coordination.
The framework integrates Dynamic Graph Neural Networks and Topological Data Analysis, employing a TSD-enhanced Mixture of Experts architecture for scalable multi-agent reinforcement learning.
TGN-TMoE leverages topological signatures to disentangle graph features for specialized processing within observation fusion and decision-making modules.

Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM

MAoP (Multiple Aspects of Planning): introduces a travel planning framework with a Strategist Model (Decomposes, routes aspects), Planning Model (Generates plan), Preprocessing Framework (Prepares input context including eliciting preferences, selecting POIs, and spatial optimization), Reward Model (Provides training signal), and Distilled Model (One-step inference).
The framework leverages a strategist for pre-planning and a planning model for generating travel itineraries based on wide-horizon thinking over multiple aspects.
A separate agent-based simulation framework, Travel-Sim, is proposed for evaluating the feasibility and personalization of the generated travel plans.

SHEETMIND: AN END-TO-END LLM-POWERED MULTI-AGENT FRAMEWORK FOR SPREADSHEET AUTOMATION

SheetMind: introduces a modular multi-agent framework for spreadsheet automation, with Manager Agent (Decomposes user instructions), Action Agent (Generates BNF commands), Reflection Agent (Validates actions, monitors effects), Front-End (User interface, executes actions), Back-End (Houses agent pipeline), and Spreadsheet (Target environment), enabling natural language interaction.
The framework decomposes complex instructions into subtasks, translates them into structured commands using BNF grammar, and validates actions through a feedback loop.
SheetMind integrates LLM-driven planning, structured execution, and agentic feedback to bridge the gap between natural language and spreadsheet functionalities.

INDOORWORLD : Integrating Physical Task Solving and Social Simulation in A Heterogeneous Multi-Agent Environment

INDOORWORLD: introduces a heterogeneous multi-agent environment integrating physical task solving and social simulation, with Agent (Autonomous entity), Perception (Processes observations), Memory (Stores information, history), Planning (Determines objectives, tasks), Action (Selects, executes actions), Task Prioritization (Encourages task focus), Environment (Simulated indoor space), Object (Physical entity), Location (Spatial area), and World State (Joint state variables) components, designed to simulate occupant behaviors in indoor spaces.
The environment features heterogeneous agents with multi-level profiles (role, action space, capability, knowledge) and human needs, interacting with objects and locations to modify the world state.
The framework supports both collaborative task-solving and autonomous social simulation sessions, providing a testbed for LLM-based multi-agent systems and potential applications in architectural design.

Cloud Infrastructure Management in the Age of AI Agents

AI Agents: introduces an envisioned agentic system architecture for automating cloud infrastructure management using LLM-powered agents, including User-agent Interface, Agent-cloud Interface, Multi-agent Orchestration, Memory, Reasoning, Tools, Planning, Guardrails, Actions, Cloud Vendors, and Cloud Gym components.
The proposed architecture utilizes different cloud interaction modalities (SDK, CLI, IaC, Web) and incorporates exploration/exploitation phases and guardrails for safety and reliability.
A preliminary study evaluates agents across modalities on VM management tasks, highlighting trade-offs in efficiency, success rate, and error handling.

13th June 2025

Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models

DSP+ (Draft, Sketch, and Prove): introduces an enhanced neuro-symbolic framework for automated theorem proving, featuring fine-grained integration of LLMs (Draft Model, Sketch Model, Proving Model) with symbolic search (Symbolic Search, Step Prover) across its three phases (Draft, Sketch, Prove) and an Ensemble Setting (combines model configurations).
The framework coordinates existing reasoning models and tactic step provers, leveraging neuro-symbolic enhancements like Filter (removes thinking tokens) and Repair (masks syntactic errors) to improve proving accuracy and token efficiency.
DSP+ demonstrates the overlooked potential of classical reasoning patterns, offering an efficient and complementary approach to RL-based training in automated theorem proving.

ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification

ReVeal: introduces a multi-turn reinforcement learning framework for code agents, featuring an Iterative Generation-Verification Loop where a Policy LLM generates code and test cases, External Tools execute them, and Tool Feedback provides results, guided by Turn-Level Reward Design and Outcome Reward, trained using Turn-Aware PPO on the Dataset (TACO).
The framework enables LLMs to autonomously generate and verify code through iterative refinement and tool interaction, improving performance and self-verification capabilities.
ReVeal's approach allows for effective test-time scaling into deeper inference regimes and pushes reasoning boundaries beyond the base model.

The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs

Behavior Gap Evaluation Framework: introduces a comprehensive framework to quantify the behavior gap between LLM Agents and Human Experts in task-oriented dialogs using Behavior Gap Metrics, Task Complexity Metrics, and Performance Metrics within a Teacher-Forcing Approach on various Datasets, revealing significant discrepancies that negatively impact LLM Agent performance.
The study utilizes LLM-based Classifiers and an LLM-based Evaluator to analyze specific behavioral dimensions like dialog acts, Tool usage, and knowledge integration, demonstrating that the gap widens with increasing task complexity.
Aligning LLM agent behavior closer to human strategies through Behavior Intervention significantly improves performance, particularly in complex tasks.

A Fast, Reliable, and Secure Programming Language for LLM Agents with Code Actions

QUASAR: introduces a programming language for LLM agent code actions, with LLM Agent generating Python Subset code, Transpiler converting it to QUASAR Language, and QUASAR Interpreter executing it using Internal Rewrite Rules and External Rewrite Rule, managing External Calls tracked in the Execution Set, validated by User Approval Mechanism, and supporting Conformal Semantics with Abstract External Functions.
The QUASAR Language separates pure internal computation from external side effects, enabling automatic parallelization, dynamic security controls, and uncertainty quantification.
The approach leverages LLM proficiency in Python by transpiling a restricted subset to QUASAR for improved performance, security, and reliability compared to direct Python execution.

PRO-V: An Efficient Program Generation Multi-Agent System for Automatic RTL Verification

PRO-V: introduces an efficient program generation multi-agent system for automatic RTL verification, with Stimulus Generator (Generates input signals), Functional Model (Generates reference outputs), Self-Improvement (Selects, refines models), Validator (Verifies DUT with judge), Judge Agent (LLM for selection, validation), and Refinement Agent (LLM for refinement) components.
The system enhances verification accuracy and coverage through inference-time scaling via dual sampling and a self-improvement mechanism using a best-of-n selection strategy.
PRO-V integrates an LLM-as-a-judge into the validation process, leveraging rule-based static analysis converted to natural language for enhanced prompting and root cause analysis.

Robot Context Protocol (RCP): A Runtime-Agnostic Interface for Agent-Aware Robot Control

RCP (Robot Context Protocol): introduces, "a lightweight, middleware-agnostic communication protocol designed to abstract robotic system complexity", with Adapter Layer (Translates client interfaces), Transport Layer (Handles communication channels), Service Layer (Defines core operations), ROS2 Interface Layer (Maps to ROS2 runtime), and Status Query (Provides health and feedback), where "RCP provides a unified and semantically meaningful interface that decouples client-facing operations from backend implementations".
The protocol is structured in modular layers, including the Adapter Layer for diverse clients, the Transport Layer for communication channels (HTTP, WebSocket, SSE), the Service Layer for high-level operations (read, execute, write, subscribe), and the ROS2 Interface Layer for mapping to the underlying runtime.
RCP includes a Status Query component for real-time protocol health and command feedback, supporting robustness and operational transparency.

Your Ride, Your Rules: Psychology and Cognition Enabled Automated Driving Systems

PACE-ADS (Psychology and Cognition Enabled Automated Driving Systems): introduces a human-centered autonomy framework with Driver Agent (Analyzes external traffic), Psychologist Agent (Interprets occupant psychological state), and Coordinator Agent (Synthesizes inputs, decides behavior) interfacing with Perception module (Provides sensor data), Route planning module (Computes global route), Motion planning module (Generates behavior, trajectory), and Control module (Executes planned trajectory) for adaptive driving.
The framework leverages LLM-based agents to sense, interpret, and respond to both external traffic conditions and internal occupant states.
Operating in a closed-loop architecture, the system dynamically adjusts driving style and supports vehicle operation recovery.

Revealing Political Bias in LLMs through Structured Multi-Agent Debate

Structured Multi-Agent Debate Framework: introduces a system to investigate political bias in LLMs, with LLM Agents (Simulate participants) assigned Agent Personas (Political/gender identities) debating generated Debate Scenarios (Generated topics/questions) following a specific Debate Format (Structured rounds/statements), evaluated by an LLM-as-a-Judge (Evaluates attitudes) using Attitude Scoring (Quantifies agreement/disagreement) and a defined Speaking Order (Agent turn sequence).
The framework systematically varies LLM models, agent gender attributes, and debate formats to examine influences on political bias and attitude shifts.
Experiments reveal Republican agents shift towards neutral, gender influences attitudes, and echo chambers form with attitude intensification, particularly when gender is known.

SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

SEC-bench: introduces an automated benchmarking framework for evaluating LLM agents on security engineering tasks, with Preprocessor (collects instances), Verifier (reproduces, verifies vulnerabilities), and Evaluator (transforms, formulates tasks) components.
The Verifier component employs a multi-agent scaffold including Manager, Builder, Exploiter, and Fixer agents to reproduce and validate vulnerabilities.
The framework automatically creates high-quality software vulnerability datasets with reproducible artifacts for evaluating LLM agent capabilities in tasks like proof-of-concept generation and vulnerability patching.

AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments

AgentSense: introduces a virtual data generation pipeline using LLM agents in simulated home environments to create diverse sensor data for human activity recognition.
The pipeline involves LLMs generating personas, routines, and actions, which are then executed in an extended VirtualHome simulator equipped with virtual ambient sensors.
The generated virtual sensor data is used to pretrain HAR models, demonstrating improved performance, especially in low-resource settings, compared to training solely on real data.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

RACE and FACT evaluation frameworks: introduce two novel evaluation frameworks, with Judge LLM, Adaptive Criteria Generation, Reference-Based Scoring, Statement-URL Extraction, Support Judgment, Jina Reader API, and Citation Metrics Calculation components, designed to comprehensively assess Deep Research Agents.
RACE evaluates report generation quality using adaptive criteria and reference-based scoring, while FACT assesses information retrieval and citation trustworthiness.
These frameworks are part of the DeepResearch Bench, a benchmark of 100 PhD-level research tasks for evaluating LLM-based agents.

A Hybrid Multi-Agent Prompting Approach for Simplifying Complex Sentences

Hybrid Multi-Agent Prompting Approach: introduces a system using multi-agent collaboration for sentence simplification, including Agent 1 Sentence Simplifier, Agent 2 Semantic and Lexical Similarity Evaluator, Agent 3 Alternative Sentence Simplifier, and Comparator components.
The system processes complex sentences through a workflow where agents decompose, evaluate, and iteratively revise the output to preserve meaning while reducing complexity.
This multi-agent architecture demonstrates improved performance over single-agent methods for simplifying complex sentences in domains like video game design.

ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification

ReVeal: introduces a multi-turn reinforcement learning framework that enables code agents to engage in an iterative generation-verification loop using a single Policy LLM, guided by Input Prompt and Tool Feedback, structured as a Multi-turn Rollout producing an Output Rollout, optimized with Outcome Reward and Turn-Level Rewards via Turn-Aware PPO.
The framework alternates between Generation (producing code) and Verification (generating test cases and plans) stages, leveraging external Tools like Python Interpreters for execution.
This iterative process and dense reward structure allow the model to self-verify, refine outputs, and improve both generation and verification capabilities over multiple turns.

Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

Agent-RLVR: introduces a framework for training software engineering agents using Reinforcement Learning from Verifiable Rewards (RLVR), incorporating agent guidance and environment rewards, with Policy, Environments, Trajectory, Evaluation, Environment Information, Agent Guidance, Guidance Generation, RLVR Data, Policy Update, and Instruct Tuning components.
The framework trains an agent Policy by having it interact with Environments, evaluating Trajectories via Evaluation, generating Environment Information from failures, and using Guidance Generation to create Agent Guidance.
Incorrect trajectories are reattempted with Agent Guidance, and the resulting RLVR Data is used for Policy Update via DPO and optional Instruct Tuning to improve agent performance.

Large Language Model-Powered Conversational Agent Delivering Problem-Solving Therapy (PST) for Family Caregivers: Enhancing Empathy and Therapeutic Alliance Using In-Context Learning

LLM-powered Conversational Agent Models: introduces an LLM-powered agent delivering Problem-Solving Therapy (PST) for family caregivers, integrating Motivational Interviewing (MI) and Behavioral Chain Analysis (BCA) using prompting techniques, Retrieval-Augmented Generation (RAG), and clinician-curated content.
The research evaluates four distinct configurations of this agent, comparing different LLMs (GPT-40, Llama 3) and combinations of in-context learning techniques (Few-shot, RAG) for their impact on perceived empathy and therapeutic alliance.
The models aim to provide empathetic and tailored mental health support by improving contextual understanding and generating personalized, actionable strategies for caregivers.

Secure API-Driven Research Automation to Accelerate Scientific Discovery

S3M (Secure Scientific Service Mesh): introduces, "Secure Scientific Service Mesh (Overall framework), Manages data streaming, Automates complex workflows, Manages compute jobs, Provides resource status, Retrieves environment info, Manages access tokens, Enables secure communication, Underlying service mesh platform, Python interface, Validates client interactions, Creates streaming objects, Deploys streaming clusters", a framework providing API-driven infrastructure for automated scientific discovery with integrated streaming, workflow orchestration, and fine-grained authorization.
The framework utilizes a service mesh architecture built on OpenShift and Istio to ensure modularity, scalability, and policy-driven security enforcement across computational services.
S3M offers a comprehensive set of APIs and an SDK to enable authenticated external systems and intelligent agents to securely provision resources, stream data, and trigger compute jobs dynamically.

Your Ride, Your Rules: Psychology and Cognition Enabled Automated Driving Systems

PACE-ADS (Psychology and Cognition Enabled Automated Driving Systems): introduces a human-centered autonomy framework with Psychologist Agent (Interprets occupant state/intent), Driver Agent (Perceives external traffic context), Coordinator Agent (Synthesizes inputs, decides behavior), Perception module (Provides sensor data), Route planning module (Plans/replans vehicle route), Motion planning module (Generates behaviors/trajectories), and Control Module (Executes low-level commands), enabling AVs to sense, interpret, and respond to external traffic and internal occupant states.
The framework uses three specialized foundation model agents in an agentic workflow to manage complex driving tasks and enable adaptive, interpretable, and collaborative driving.
PACE-ADS complements existing AV modules by operating at the high-level behavioral decision layer, personalizing riding experience, and supporting recovery from immobilization.

Self-Regulating Cars: Automating Traffic Control in Free Flow Road Networks

SRC (Self-Regulating Cars): introduces a physics-informed reinforcement learning protocol for automating traffic control in free-flow networks by having a central RL agent modulate vehicle speeds on super-segments based on traffic state observations, guided by a reward function.
The system utilizes Deep Q-Learning with a neural network to learn speed modulation policies, evaluated in a PTV Vissim simulation environment.
The approach aims to optimize network throughput and prevent congestion by coordinating individual self-regulating cars without requiring new physical infrastructure.

PE-MA: Parameter-Efficient Co-Evolution of Multi-Agent Systems

PE-MA (Parameter-Efficient Multi-Agent Co-Evolution): introduces, "a novel collaboration framework", with Frozen Backbone (Fixed feature extractor), Personalized Adapter (Adapts to local tasks/data), Shared Adapter (Shares knowledge across agents), Communication Mechanism (Exchanges and aggregates adapters), designed for efficient, scalable, and personalized co-evolution in multi-agent systems.
Each agent maintains a lightweight personalized adapter for agent-specific behavior and a shared adapter collaboratively optimized across neighboring agents.
The dual-adapter architecture balances global coordination with local adaptation, significantly reducing training and communication costs.

Interaction, Process, Infrastructure: A Unified Architecture for Human-Agent Collaboration

Unified Architecture for Human-Agent Collaboration: introduces a layered framework for human-agent collaboration with Interaction Layer (surface of shared understanding), Process Layer (collaborative core), and Infrastructure Layer (orchestration, execution, memory).
The Process Layer explicitly models goals, workflows, and progress, serving as connective tissue for human-agent alignment and coordination over time.
This modular architecture supports transparency, extensibility, and adaptive, goal-aligned collaboration by decoupling interaction, process logic, and computational foundation.

Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models

DSP+ (Draft, Sketch, and Prove): introduces an enhanced neuro-symbolic framework for automated theorem proving, featuring fine-grained integration of LLMs (Draft Model, Sketch Model, Proving Model) with symbolic search (Symbolic Search, Step Prover) across its three phases (Draft, Sketch, Prove) and an Ensemble Setting (combines model configurations).
The framework coordinates existing reasoning models and tactic step provers, leveraging neuro-symbolic enhancements like Filter (removes thinking tokens) and Repair (masks syntactic errors) to improve proving accuracy and token efficiency.
DSP+ demonstrates the overlooked potential of classical reasoning patterns, offering an efficient and complementary approach to RL-based training in automated theorem proving.

Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models

Parallel Thinking: introduces an alternative test-time scaling strategy that generates multiple independent reasoning paths within a fixed inference budget and selects the final answer via majority vote.
This approach addresses the limitations of sequential "Wait & Think more" methods, which suffer from "overthinking" and performance degradation beyond a critical point due to increased output variance.
The framework consistently maintains or improves accuracy, achieving up to 20% higher accuracy compared to extended thinking by effectively utilizing the inference budget and mitigating the overthinking trap.

12th June 2025

AUTOMIND: Adaptive Knowledgeable Agent for Automated Data Science

AUTOMIND (Adaptive Knowledgeable Agent for Automated Data Science): introduces an adaptive, knowledgeable LLM-agent framework with a curated expert knowledge base, an agentic knowledgeable tree search algorithm, and a self-adaptive coding strategy.
The framework leverages the expert knowledge base via a retriever to ground the tree search, which explores solutions through drafting, improving, and debugging actions.
The self-adaptive coding strategy dynamically adjusts code generation based on task complexity, using either one-pass generation or stepwise decomposition with execution feedback.

Specification and Evaluation of Multi-Agent LLM Systems - Prototype and Cybersecurity Applications

Multi-Agent LLM System: introduces a system architecture and specification for multi-agent LLM applications, including a Client Application, Conversational User Interface, Agent Manager, Conversation Manager, Execution Engine, Agents, LLM Services, Host Execution Environment, and Agent Schema.
The system allows specifying agents with executable prompts, actions, and data, supporting prompting/reasoning techniques and conditional execution based on results.
The Agent Schema defines agent types, functions (execution, evaluation), and configurations for agents and LLMs, enabling systematic evaluation of LLMs and techniques in specific applications like cybersecurity.

From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

GPT ranking system: introduces a novel peer review mechanism using LLM agents for pairwise comparisons, aggregated by the Bradley-Terry model to derive a global ranking of submissions.
The system contrasts pairs of manuscripts to determine relative quality, moving away from traditional independent absolute scoring.
Empirical experiments demonstrate the system's potential to identify high-impact papers more effectively than rating-based methods, while also revealing biases against topic novelty and institutional diversity.

LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation

Reflection Agent with Dedicated Evaluator: introduces a system for automatic code validation and refinement, including a Code Generator, Reflect module, and Evaluator using specific metrics.
The Evaluator utilizes Bidirectional Functionality Matching and Logic Representation metrics to assess generated Bash code quality without requiring reference code.
The system incorporates judgments and feedback from the evaluation metrics to refine the initial code snippet generated by the Code Generator.

Using Invocable APIs derived from NL2SQL datasets for LLM Tool-Calling Evaluation

LLM Tool-Calling Evaluation Framework: introduces a method to convert NL2SQL datasets into NL2API datasets for LLM tool-calling evaluation using a Data Generation Pipeline.
The framework includes generated API Collections (SLOT, SEL, REST) with varying characteristics, Invocable APIs for live interaction, and an Evaluation Set pairing natural language queries with ground-truth API sequences.
It evaluates the performance of various LLMs and ReACT Agents on these generated datasets to assess their tool-calling capabilities.

Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges

Reasoning Agentic RAG: introduces a paradigm integrating retrieval with model-driven reasoning and decision-making, encompassing Question, LLM/LRM, Reasoning, Retrieval, Retrieved Information, Distilled Information, and Answer components.
The framework categorizes approaches into predefined reasoning with fixed pipelines and agentic reasoning with autonomous tool orchestration.
This survey reviews techniques, architectural designs, reasoning strategies, and tool coordination within this paradigm to address industry challenges.

Provably Learning from Language Feedback

HELIX: introduces a framework for Learning from Language Feedback (LLF), including an LLM Policy, Reference Policy, Reward Mapping, set of Hypotheses, set of Actions, and Score Matrix.
The LLM Policy generates hypotheses and actions, the Reference Policy adds random actions, and the Reward Mapping scores actions under hypotheses to form a Score Matrix.
The algorithm uses the Score Matrix for decision-making, employing exploitation when consensus exists and exploration otherwise, potentially re-scoring with the Reference Policy.

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

SWE-Factory: introduces an automated pipeline for GitHub issue resolution benchmark construction, including Raw Issue Collection (Collects GitHub issue data), SWE-Builder (Automates environment setup), Grading Results (Grades test outcomes), and Fail2pass Validation (Validates fail-to-pass transition).
The SWE-Builder component is a multi-agent framework comprising a Repository Explorer (Collects repository setup information), Environment Manager (Generates Dockerfile), Test Manager (Generates test script), and Test Analyst (Validates environment, plans iterations), supported by an Evaluation Environment Memory Pool (Stores and reuses setups).
The pipeline automates environment construction, grading via exit codes, and fail2pass validation to reduce manual effort in creating large-scale, high-quality datasets.

Build the web for agents, not agents for the web

AWI (Agentic Web Interface): introduces a new web interface paradigm specifically designed for agents, featuring unified higher-level actions, compatibility with user interfaces, access control for agents, progressive information transfer, and agentic task queues.
This paradigm shift aims to overcome limitations of current human-designed web interfaces for web agents.
The paper establishes guiding principles for AWI design, emphasizing safety, efficiency, and standardization, and advocates for broad ML community involvement.

Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Lightweight Sequential Monitoring Framework: introduces a defense against decomposition attacks by using an external monitor to evaluate the cumulative context of subtasks.
The monitor outputs a binary flag at each step to halt the LLM if harmful intent is detected based on the prompt history.
This framework outperforms single-input monitoring and is cost/latency efficient for mitigating decomposition attacks.

Execution Guided Line-by-Line Code Generation

EG-CFG (Execution-Guided Classifier-Free Guidance): introduces a novel approach for neural code generation that incorporates real-time execution signals into the language model generation process, utilizing a Large Language Model, Programming Task input, Initial Prompt, Candidate Generation via beam search, Executable Extraction via AST parsing, Execution Engine for running test cases, Execution Trace generation, Dynamic Signal aggregation, Dynamic Prompt construction, Classifier-Free Guidance for token generation, an Inference Loop for autoregressive generation, and Parameter Search.
The method dynamically incorporates execution signals as the model generates code line-by-line, guiding the generation process toward executable solutions.
EG-CFG achieves state-of-the-art performance on multiple code generation benchmarks by leveraging execution feedback and Classifier-Free Guidance.

Dynamic Epistemic Friction in Dialogue

Dynamic Epistemic Friction (DEF): introduces a formal model of dynamic epistemic friction in dialogue, operationalized within Dynamic Epistemic Logic and vector-based belief representations, using epistemic states, propositions, evidence, alignment, friction, QBank, EBank, FBank, an update function, friction coefficients, and friction equilibrium.
The model quantifies resistance encountered during belief updates by measuring vector similarity between agent beliefs and new information combined with evidence.
Empirical analysis on a situated collaborative task dataset demonstrates that the model effectively predicts participant belief updates by modeling this resistance.

OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems

OPT-Agent: introduces a framework that emulates human reasoning for optimizing solutions by iteratively generating, validating, and improving solutions using historical feedback, with all Drafting, Improving, Debugging, Historical Information, Error Analysis, Validation, and Metrics components.
The framework's workflow involves generating a Draft solution, iteratively Improving valid solutions or Debugging buggy ones based on Error Analysis and Historical Information, with Validation and Metrics guiding the process.
OPT-Agent is evaluated on OPT-BENCH, a benchmark of machine learning and NP problems, to assess LLMs' iterative optimization capabilities.

Integrating Large Language Models into Text Animation: An Intelligent Editing System with Inline and Chat Interaction

Text Animation Editing System: introduces an LLM-aided system for text animation editing, featuring a Script Panel (Edit text, properties), Timeline Panel (Arrange, time clips), Chat Panel (Natural language commands), Resource Panel (Manage assets), Inspector Panel (Adjust properties), Preview Panel (Visualize edits), Inline Agent (Contextual suggestions), Chat Agent (Conversational task execution), LLM (Large Language Model) (AI engine), Semantic-Animation Mapping (Intent to action), and Script-Timeline Synchronization (Panels linked).
This system employs a dual-mode agent pipeline (Inline and Chat Agents) powered by an LLM for intelligent assistance and natural language interaction.
The system aims to lower creative barriers for non-professionals and enhance editing efficiency through seamless inline edits and chat-based interactions.

Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding

VLFly (Vision-Language Fly): introduces a novel VLN framework for UAVs, including an instruction encoding module, a goal retrieval module, a waypoint planning module, and action execution, designed for open-vocabulary goal understanding and continuous control.
The framework processes natural language instructions, retrieves a goal image, generates waypoints from egocentric observations, and executes continuous velocity commands.
VLFly achieves robust generalization and outperforms baselines in simulation and real-world UAV navigation tasks without task-specific fine-tuning.

SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis

SDialog: introduces a Python toolkit for synthetic dialogue generation and analysis, with Turn (Single utterance), Event (Action/instruction), Dialog (Complete conversation structure), Persona (Character profile definition), PersonaAgent (Simulates agent role-playing Persona), BaseOrchestrator (Abstract control class), SimpleReflexOrchestrator (Triggers on condition), LengthOrchestrator (Controls dialogue length), ChangeMindOrchestrator (Simulates agent changing mind), SimpleResponseOrchestrator (Suggests responses by similarity), InstructionListOrchestrator (Provides sequence of instructions), DialogGenerator (Generates dialogue using LLM), PersonaDialogGenerator (Generates dialogue between Personas), Dataset Utilities (Work with external datasets), Serialization Utilities (Save/load dialogues), and Visualization Utilities (Analyze/visualize dialogues) components, designed for creating realistic, diverse, and controllable conversational data.
The toolkit provides abstractions for personas, orchestration, and scenario management, leveraging instruction-tuned LLMs for generation.
SDialog supports workflows like multi-agent simulation and scenario-driven generation, aiming to standardize synthetic data generation for reproducibility.

Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models

Multi-User Dialogue Data Construction Method: introduces a method to extend single-user dialogue datasets by incorporating a second user's utterances, utilizing a Single-User Dialogue Structure, Speech Act Type Identification, User2 Utterance Generation, User2 Utterance Validation, and an LLM to create a Multi-User Dialogue Structure for evaluating DST.
The method systematically generates and validates user2 utterances based on speech act theory to create a controlled multi-user setting for assessing LLM performance.
This approach enables evaluating LLMs on multi-user dialogue state tracking challenges with minimal dataset construction costs.

BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

BugGen: introduces a self-correcting multi-agent LLM pipeline, with Module Splitter (partitions RTL), Mutation Index (lists mutation types), Mutation Cache (stores history), Region Selector Agent (chooses region), Mutation Selector Agent (chooses mutation), Mutation Injector Agent (inserts mutation), Evaluation (validates bug), and Rollback/Retry (corrects failures), designed to autonomously generate, insert, and validate realistic functional bugs in RTL.
The pipeline leverages LLM agents in a closed-loop architecture with shared memory and iterative refinement to produce unique, syntactically valid, and functionally detectable bugs.
BugGen achieves high functional accuracy and throughput, outperforming existing methods and generating high-quality bug datasets suitable for training ML-based debugging models.

Minimizing False Positives in Static Bug Detection via LLM-Enhanced Path Feasibility Analysis

LLM4PFA (LLM-Enhanced Path Feasibility Analysis): introduces an iterative path feasibility analysis framework for static bug detection, with Iterative Function Analysis, Feasible Path Constraint Extraction, Critical Path Conditional Branches Identification, Feasible Path Conditional Expression Extraction, Context-Aware Symbolic Range Reasoning, LLM Agent, Variable Symbolic Range Reasoning, Function Call Symbolic Range Reasoning, Function Retrieval Tool, Source Code Repository, Function Call Memory, Constraints Solving, SMT Query Script Generation, Script Template Generation, SMT Constraints Generation, Script Merging, Constraint Solver, Control-Flow Graph (CFG), and Initial States P.
The framework iteratively analyzes functions in a call trace, extracting and solving feasible path constraints using LLM agents for symbolic reasoning and a constraint solver.
LLM4PFA leverages LLM agents' self-planning and tool-usage capabilities for context-aware symbolic range reasoning and iteratively generates and solves SMT queries to minimize false positives.

WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models

WGSR-Bench: introduces, "a wargame-based benchmark for LLMs", with Environmental situational awareness, Opponent risk assessment, and Policy generation components, where "it systematically assesses strategic reasoning abilities using wargame scenarios".
The benchmark evaluates LLMs' capabilities in multi-agent decision-making, intent inference, and counterfactual reasoning within a high-complexity wargame environment.
It employs a structured cognitive framework (S-POE) and utilizes real adversarial wargame data for comprehensive evaluation and analysis.

11th June 2025

AUTONOMOUS COMPUTER VISION DEVELOPMENT WITH AGENTIC AI

Agentic AI approach: introduces an autonomous computer vision development system, with OpenManus Agent (Orchestrates task execution), Memory (Stores runtime state/context), Planning (Decomposes tasks, selects tools), Reasoning (Analyzes inputs, makes decisions), Self-Correction/Adaptation (Handles errors, refines plans), Tools (Execute Python, browser, files, shell), SimpleMind Framework (Executes computer vision tasks), Configurable Tools (Perform image processing, neural nets), Knowledge Graph (Defines SimpleMind workflow), Blackboard (Central working memory), SM-Learn (Trains neural network weights), SM-Think (Performs inference), User Prompt (Natural language task input), System Prompt (Guides LLM planning), Verifier (Checks YAML configuration), Tool Configuration File (KG) (YAML workflow definition), Tool Execution (Runs SimpleMind modules), where the system translates natural language prompts into SimpleMind workflows for medical image analysis.
The OpenManus agent leverages an LLM for planning and tool use, generating a YAML Knowledge Graph that configures SimpleMind's computer vision tools.
SimpleMind executes the planned workflow, utilizing its Blackboard for data flow and SM-Learn/SM-Think for training and inference on medical images.

AURA: A Multi-Agent Intelligence Framework for Knowledge-Enhanced Cyber Threat Attribution

AURA (Attribution Using Retrieval-Augmented Agents): introduces a multi-agent framework for cyber threat attribution, comprising input processing, query rewriting, semantic retrieval, decision making, external search, attribution generation, conversational memory, and a knowledge base.
The framework processes diverse threat data via collaborative agents, integrating Retrieval-Augmented Generation (RAG) with LLMs for knowledge-enhanced reasoning and interpretable attribution.
AURA generates transparent, evidence-backed attribution decisions by tracing reasoning to contextual evidence and providing natural language justifications.

Disclosure Audits for LLM Agents

CMPL (Conversational Manipulation for Privacy Leakage): introduces an automated auditing framework for conversational privacy risks in LLM agents, featuring an Application Agent A (LLM agent being audited), an Adversary U (LLM agent attempting leakage), and an Auditor D (LLM agent detecting leakage) interacting within a Conversation Loop (iterative interaction process) based on a Scenario Description σ (public context), Information Subject Profile I (private data), Privacy Directive ψ (disclosure rules), and Task Description T (agent A's goal).
The Adversary U employs a Strategist Se (adversary planning module) and Prompt Generator Ge (adversary query module) to manipulate the Conversation History H (dialogue turns) and may use a Side-channel Predictor Pe (adversary inference module) to make a Prediction (adversary guess) with Confidence kt (adversary prediction score).
The Auditor D monitors the Conversation History H and uses an Entail Function (auditor explicit leakage detector) and the adversary's Prediction and Confidence to produce an Indicator zt (auditor leakage signal) when explicit or implicit leakage is detected, while both agents utilize Memory (stores/summarizes history) to maintain state.

Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

Chat-of-Thought: introduces a collaborative multi-agent system for domain-specific information generation, featuring LLM-based Agents, Context Discovery, Multi-Round Chain of Interactions, Template-driven Routing, and Quality Check.
The system employs specialized LLM-based Agents with defined roles and state to engage in iterative discussions guided by templates and dynamic assignment.
It leverages diverse input sources, question/answer banks, and various learning methods to generate and refine domain-specific knowledge like FMEA documents.

A quantum semantic framework for natural language processing

Quantum Semantic Framework: introduces a non-classical approach to natural language processing, modeling semantic meaning as observer-dependent and contextually actualized through the interaction of a Semantic Expression (symbol affording interpretations) and an Interpretive Agent (observer) via an Interpretive Observable (semantic probe operator).
The framework posits that meaning is not intrinsic but emerges dynamically, influenced by the agent's Semantic Memory (agent internal state) and Context (situational factors), with interpretation dynamics governed by a Semantic Hamiltonian (interpretation dynamics).
A Semantic Bell Test (experimental method) using LLM Agents (computational observers) configured with Personas (agent configurations) and presented with Ambiguous Word Pairs (stimuli) demonstrates non-classical contextuality in interpretation, supporting the framework's premise.

AI Agent Behavioral Science

AI Agent Behavioral Science: introduces, "AI Agents (autonomous systems)/Memory (stores history)/Planning (strategizing actions)/Tool Use (interacts with tools)/Action Modules (executes decisions)/Intrinsic Attributes (internal traits)/Environmental Constraints (external structures)/Behavioral Feedback (adaptation mechanism)/Ability (foundational competence)/Motivation (drive from feedback)/Trigger (initiating signals)", a paradigm for studying AI agents as behavioral entities in context, emphasizing systematic observation, intervention design, and theory-guided interpretation.
This perspective focuses on understanding how AI agent behavior emerges and adapts through the interplay of internal factors, environmental context, and interaction feedback.
The paper systematizes research on individual, multi-agent, and human-agent interactions and positions this behavioral science approach as essential for responsible AI.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

V-JEPA 2 (Self-Supervised Video Model): introduces a self-supervised approach combining internet video data and robot interaction data, with V-JEPA 2 Encoder (Extracts video representations), V-JEPA 2 Predictor (Predicts masked representations), V-JEPA 2 EMA Encoder (Target for prediction), V-JEPA 2-AC Frozen Encoder (Provides learned representations), V-JEPA 2-AC Action-Conditioned Predictor (Predicts future state representations), MLLM Projector Module (Maps visual to LLM), and MLLM LLM Backbone (Language model).
The framework pre-trains V-JEPA 2 on internet video, then post-trains V-JEPA 2-AC on robot data for planning, and aligns V-JEPA 2 with an LLM for video question answering.
V-JEPA 2 demonstrates strong performance on motion understanding, action anticipation, video question answering, and enables zero-shot robot manipulation via planning.

Patterns of Patterns III

PLACARD: introduces a methodology combining PAR, CLA, and DPL for collective learning and design, extended to pattern-competent AI agents within Multi-Agent Systems, utilizing Language Model Substrates and various agent types.
The approach structures pattern use via an A/B/C catalogue and proposes AI agent pattern types (Interactional, Cognitive, Infrastructural) along with specific candidate patterns for agents and their environments.
Different agent roles, including Code-Writing & Execution Agents, Pattern-Aware Dialogue Agents, Pattern-Reflective Meta-Agents, and Multi-Agent Institution Designers, interact within the MAS environment, grounded by a Real-World Interface layer.

Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling

VRS (Verbalized Rejection Sampling): introduces, "a natural-language adaptation of classical rejection sampling", with LLM (Performs accept/reject decision), Target Distribution (Desired sample distribution), Proposal Distribution (Source of candidate samples), Candidate Sample (Sample from proposal), Verbalized Prompt (LLM input with descriptions), Accept/Reject Decision (LLM binary output), and Sampling Loop (Generates candidates, repeats), which prompts an LLM to reason about and accept or reject proposed samples from a proposal distribution to generate samples from a target distribution.
The framework verbalizes the target distribution, proposal distribution, and a candidate sample into a prompt for the LLM, which acts as a black-box decision engine.
The external sampling loop generates candidate samples and repeats the LLM decision process until the required number of accepted samples is collected.

SRLAgent: Enhancing Self-Regulated Learning Skills through Gamification and LLM Assistance

SRLAgent: introduces an LLM-assisted system that fosters self-regulated learning skills through gamification and adaptive support, with Game Environment (Minecraft 3D world), Learning Management System (Manages learning elements), Task System (Manages hierarchical tasks), Agent System (AI-powered support), LLM (Large Language Model), Planning Agent (Supports forethought phase), SubTask Monitor (Tracks performance), SubTask Tutor Agent System (Provides tutoring), Quiz Agent (Supports quizzes), Review Agent (Guides analysis), Chatting Agent (Facilitates discussions), Writing Agent (Supports writing), Reflection Agent (Guides reflection phase), Task State (Current task status), SRL Phase (Current SRL stage), Task Learning Views (User interface views), Learning Subtasks Content (Educational materials/activities), Learning Feedback (Agent-provided feedback), Tutor Feedback (Tutor agent feedback), Prompt Configurations (Agent prompt templates), where it guides students through SRL phases using gamification and LLM-powered agents in a game-based environment.
The system is grounded in Zimmerman's three-phase SRL framework, enabling goal-setting, strategy execution, and self-reflection within an interactive game environment.
SRLAgent offers real-time feedback and scaffolding powered by LLMs to support students' independent study efforts.

PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

PersonaLens: introduces a benchmark for evaluating personalization in task-oriented conversational AI assistants, with diverse user profiles, tasks with situational contexts, and two LLM-based agents (User Agent and Judge Agent) to simulate interactions and evaluate performance.
The benchmark features 1,500 user profiles, 111 tasks across 20 domains, and LLM-powered agents for scalable, automated evaluation.
PersonaLens assesses personalization, task completion, and dialogue quality in multi-turn interactions, revealing insights into current LLM assistants' capabilities.

INTELLIGENT DESIGN 4.0: PARADIGM EVOLUTION TOWARD THE AGENTIC AI ERA

ID 4.0 (Intelligent Design 4.0): introduces a multi-agent-based paradigm for engineering design automation, composed of stage-level agents (interprets inputs/decomposes tasks, explores early solutions/ideation, translates concepts/coordinates modeling, refines geometry/produces models, applies optimization/fine-tunes) and functional agents (retrieves external knowledge, accesses databases, infers user intent/context, facilitates divergent thinking, synthesizes 2D/3D geometry, conducts multi-physics simulations, verifies compliance) operating within a shared information environment (supports coordination/learning).
This framework envisions autonomous, task-specialized AI agents coordinating via orchestrated design workflows to support complex, end-to-end design processes.
Agents interact with human designers and external tools, leveraging shared memory for cross-agent coordination and cumulative learning throughout the design process.

Feature Engineering for Agents: An Adaptive Cognitive Architecture for Interpretable ML Monitoring

CAMA (Cognitive Architecture for Monitoring Agent): introduces a cognitive architecture for ML monitoring, with Semantic Memory (Stores reference data), Working Memory (Holds current context), Episodic Memory (Retains past instances), Procedural Memory (Stores agent code), Decision Procedure (Feature engineering approach), LLM (Reasoning engine), and Agent code (Includes prompts/chains), designed to enhance interpretability and actionability of monitoring outputs.
The Decision Procedure implements a three-step feature engineering-inspired approach: Refactor, Break Down, and Compile, to process monitoring data.
This architecture leverages structured memory and feature engineering principles to provide robust, interpretable, and actionable insights for ML model monitoring.

Intent Factored Generation: Unleashing the Diversity in Your Language Model

IFG (Intent Factored Generation): introduces a two-stage sampling process including Prompt, LLM Internal, Intent, Phrasing, and Response.
The process samples a semantically dense Intent first, then the final Response conditioned on the Prompt and Intent, allowing independent temperature control for diversity and coherence.
IFG can be implemented via Few-shot Prompting or Finetuning on intent-annotated data, with IFG-prompting encouraging granular steps in reasoning tasks.

Application-Driven Value Alignment in Agentic AI Systems: Survey and Perspectives

LLM-based Agent System: introduces, this survey, with LLM-based Agent (core system participant), Multi-Agent System (multiple interacting agents), Value Principles (hierarchical ethical norms), Value Alignment Evaluation (assessing adherence to values), and Value Coordination (managing values in multi-agents), where this survey reviews application-driven value alignment in agentic AI systems.
The paper integrates AI advancements driven by large models with demands of social governance, covering value principles, application scenarios, and evaluation methods.
It systematically examines datasets and methods for value alignment assessment and explores value coordination among multiple agents within agent systems.

DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy

DipLLM (Fine-Tuned LLM-Based Agent): introduces DipLLM, a fine-tuned LLM-based agent for strategic decision-making in Diplomacy, with Llama 3 8B (LLM backbone), Autoregressive Factorization (decomposes actions), TextDiplomacy (state to text, text to actions), Equilibrium Search (piKL-Hedge) (generates Q-values), Human IL Model (DipNet) (collects raw data), Environment (Diplomacy) (game simulation), Loss Function (aligns policy), LoRA (parameter adaptation), and Data (collected game data).
DipLLM leverages autoregressive factorization to simplify complex multi-unit action assignment into sequential unit-level decisions.
The agent is fine-tuned on a small dataset using a designed loss function to align its policy with an equilibrium objective, outperforming state-of-the-art models like Cicero with significantly less data.

Effective Red-Teaming of Policy-Adherent Agents

CRAFT (Constraint-aware Red-teaming with Adversarial Framing and Tactics): introduces a multi-agent red-teaming system with Policy Analyzer (Extracts relevant policy), Deception Planner (Plans attack strategies), Avoidance Advisor (Plans what to avoid), Dialogue Executor (Executes interaction dialogue), Conversation Memory (Dialogue history), Policy-Adherent Agent (Target system), Policy (Rules for target), and User Request (Initial user input), designed to expose vulnerabilities in policy-constrained LLM-based agents through strategic, multi-step adversarial planning.
The system leverages policy knowledge, strategic reasoning, and pre-execution planning to generate policy-aware persuasive strategies that undermine policy adherence in customer service scenarios.
CRAFT achieves significantly higher attack success rates compared to generic jailbreak methods and cooperative user simulations, highlighting the need for stronger safeguards against malicious user behavior.

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Multi-Agent Framework: introduces ReasonMed, a large-scale medical reasoning dataset generated and refined using a multi-agent system for initial path generation, followed by verification, ranking, summarization, error correction, and quality assessment.
The framework generates 370k high-quality medical reasoning examples by distilling 1.7 million initial paths from multiple LLMs through a rigorous multi-stage verification and refinement pipeline.
Leveraging the generated dataset, the authors train ReasonMed-7B and find that combining detailed chain-of-thought reasoning with concise answer summaries yields the most effective fine-tuning strategy for medical question answering.

A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy

LLM-HAS (LLM-based Human-Agent Systems): introduces implementation guidelines with Initial Setup (define environment, roles, interaction), Human Data (acquire, process, use human data), Model Engineering (iterative development, feedback, learning, optimization), Post-Deployment (monitor, maintain alignment, adapt), and Evaluation (assess effectiveness, safety, experience) components.
The framework advocates for collaborative AI-human partnerships, prioritizing human involvement for guidance, control, and enhanced system trustworthiness and adaptability.
Progress in AI is measured by how well systems work with humans, enhancing human capabilities through partnership rather than pursuing full autonomy.

Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Multi-Agent Language Models: introduces integrating Language Model (LM) (Recommends/generates actions) into Reinforcement Learning (RL) Agent (Learns decision policy) loops interacting with an Environment (Simulates game world), utilizing Dataset (Human gameplay examples) and Replay Buffers (Stores game transitions) for training via Distillation Loss (Transfers LM knowledge) and TD Loss (Reinforcement learning update), employing Encoders (GRU) (Encode text inputs), Decoder (Combines encoded features), and MLP (Predicts action values).
The approach explores LM-in-the-Loop for action recommendation in text games and LM as Multi-Agent in Hanabi, showing improved performance and accelerated convergence compared to baselines.
Key findings highlight the importance of careful transition selection for LM training and demonstrate the potential for distilling language model knowledge into reinforcement learning agents.

10th June 2025

Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation

ANN (Agentic Neural Network): introduces a framework conceptualizing multi-agent collaboration as a layered neural network architecture, including Agent (Node), Layer (Agent Team), Agent Pipeline, Dynamic Routing/Team Selection, Aggregation Function, Forward Pass, Backward Pass/Optimization, Textual Gradient, Global Optimization, Local/Layerwise Optimization, Momentum, Validation, Memory/Trajectory, LLM Backbone, Prompt, and Tool components.
The framework employs a two-phase optimization strategy: a forward pass for dynamic team selection and a backward pass for iterative refinement using textual gradients.
This approach enables agents to self-evolve their roles, prompts, and coordination, dynamically reconfiguring teams and strategies based on performance feedback.

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

UTBoost: introduces a framework for augmenting test cases using intramorphic testing, including the UTBoost Workflow (Orchestrates testing process), UTGenerator (Generates augmented test cases), and Intramorphic Testing (Establishes test oracle).
The UTGenerator component utilizes an LLM and a multi-level localization process (file, function/class, line) to identify relevant code areas before generating new test cases and their dependencies.
UTBoost enhances the evaluation of coding agents on benchmarks like SWE-Bench by identifying insufficient test cases and erroneous patches, leading to more reliable results and leaderboard updates.

GUIROBOTRON-SPEECH: TOWARDS AUTOMATED GUI AGENTS BASED ON SPEECH INSTRUCTIONS

GUIRoboTron-Speech: introduces an end-to-end autonomous GUI agent accepting speech instructions and screenshots, with Vision Encoder (Processes GUI screenshot), Audio Encoder (Processes speech instruction), Large Language Model (Processes inputs, predicts action), Grounding Stage (Trains visual understanding), and Planning Stage (Trains reasoning and planning) components, designed to predict GUI actions from multimodal input.
The approach leverages a progressive training framework with grounding and planning stages to develop capabilities in understanding GUI elements and task execution.
Mixed-instruction training is employed during the grounding stage to mitigate modality imbalance from pre-trained foundation models.

Agent-based Condition Monitoring Assistance with Multimodal Industrial Database Retrieval Augmented Generation

MindRAG (Multimodal industrial database Retrieval-Augmented Generation): introduces an agent-based condition monitoring assistance framework with LLMs, Agents, a Multimodal & semi-structured annotated machine graph Vector Store, Multimodal RAG techniques for Retrieval, Generation, Tools, and Knowledge Bases.
The framework integrates LLM-based reasoning agents (Main thinker, CM analyst, Maintenance scheduler, Evaluation agent) with a novel vector store structure designed for industrial condition monitoring data.
MindRAG leverages multimodal retrieval and generative capabilities, supported by custom tools and knowledge bases, to provide decision support and explainable interfaces for condition monitoring analysts.

Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search

LWM-Planner (LLM-based World Model Planning Agent): introduces an LLM agent framework that enhances planning via in-context learning using a Fact Extractor (Extracts atomic facts from experience), a Planner LLM (Performs lookahead search planning), Atomic Facts (Learned knowledge base), and Interaction History (Recent observation-action memory).
The agent extracts task-critical atomic facts from interaction trajectories to dynamically augment prompts for LLM components responsible for action proposal, world model simulation, and value estimation.
Planning involves a depth-limited lookahead search where the Planner LLM simulates trajectories and evaluates outcomes guided by accumulated facts and history.

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

ALE-Bench: introduces a benchmark for long-horizon objective-driven algorithm engineering, featuring Problem (Provides statement/metadata), Scorer (Evaluates solution code), Visualizer (Displays execution results), Test Run (Executes code in sandbox), Code Sandbox (Replicates execution environment), Leaderboard (Ranks submissions/calculates metrics), and Session (Orchestrates AI interaction/evaluation).
The benchmark provides a software framework simulating competitive programming contests, allowing AI systems to iteratively refine solutions using test-run feedback and visualizations.
ALE-Bench quantifies AI performance on computationally hard optimization problems from AtCoder Heuristic Contests, enabling comparison against human experts and fostering long-horizon problem-solving research.

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

VIKI-R (Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning): introduces a two-stage framework that fine-tunes a pretrained vision-language model using Chain-of-Thought demonstrations and reinforcement learning, evaluated on the VIKI-Bench benchmark.
The framework addresses embodied multi-agent cooperation across three hierarchical levels: agent activation, task planning, and trajectory perception, utilizing diverse robot embodiments and multi-view visual observations.
The approach significantly outperforms baselines, demonstrating enhanced visual reasoning and compositional cooperation patterns among heterogeneous agents in complex environments.

Design Patterns for Securing LLM Agents against Prompt Injections

Design Patterns: introduces, with Action-Selector Pattern (Selects predefined actions), Plan-Then-Execute Pattern (Defines plan, executes actions), LLM Map-Reduce Pattern (Dispatches isolated sub-agents), Dual LLM Pattern (Privileged/quarantined LLMs), Code-Then-Execute Pattern (Writes formal program), and Context-Minimization pattern (Removes prompt from context), a set of principled design patterns for building AI agents resistant to prompt injection attacks.
These patterns impose intentional constraints on LLM agents, limiting their ability to perform arbitrary tasks and preventing untrusted input from triggering consequential actions.
The paper analyzes the trade-offs of these patterns in terms of utility and security and illustrates their application through case studies of LLM agent applications.

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Evaluation Frameworks for LLM-based Data Science AI Systems: includes LLM/Agent (AI system being evaluated), Environment (Where agent operates), Tools (Capabilities like code execution), Evaluation System (Measures performance), Data (Input for tasks), Task Description (Instructions for agent), and User (Interacts with agent).
These frameworks assess AI systems, ranging from assistants to autonomous agents, on various data science activities using diverse metrics and setups.
Evaluation often involves the agent interacting with data and tools within an environment, with performance judged by an automated or human-assisted evaluation system against task descriptions and data.

Improved LLM Agents for Financial Document Question Answering

Multi-Agent Framework: introduces a system for financial document question answering with Analyst, Critic, Improved Critic, and Calculator agents.
This framework utilizes multiple LLM-based agents to improve numerical reasoning on tabular and textual financial data.
The proposed calculator agent demonstrates improved performance and safety compared to the previous state-of-the-art approach for this task.

Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMS

End-to-End DST Model: introduces an end-to-end dialogue state tracking system using a pretrained speech encoder, a small connector module, a pretrained LLM, and dialogue history.
The system processes speech input and dialogue history to directly output a JSON string representing the dialogue state.
The approach bridges speech and language model representation spaces through a two-stage training scheme for ASR pre-training and joint ASR-DST finetuning.

MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning

MasHost: introduces a reinforcement learning-based framework for autonomous multi-agent system construction, with Multi-agent System (Mas), LLM Agent, Role Pool, Interaction Pathway, Markov Decision Process (MDP), State, Action Space, Node-level Actions, Edge-level Actions, Policy Function, Node-level Policy (πθ), Edge-level Policy (πφ), Reward Function, Joint Probabilistic Space Sampling (JPSS), Hierarchical Relative Policy Optimization (HRPO), Group-relative Advantage, Action-wise Absolute Reward, Triple Objective, Query, State List, Selected Agents, Global Messages, Summarizer Agent, and EXIT Node components.
The framework models Mas construction as an MDP, employing JPSS for joint node and edge sampling and HRPO for multi-objective optimization towards performance, efficiency, and rationality.
MasHost enables autonomous Mas graph construction and role selection from a full-scale space, guided by a hierarchical reward structure combining group-relative and action-wise rewards.

CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

CAF-I (Collaborative Agent Framework for Irony): introduces a multi-agent framework for irony detection with Context, Semantic, Rhetoric, Decision, and Refinement Evaluator Agents.
CAF-I performs multi-perspective analysis and interactive collaborative optimization to improve detection accuracy and interpretability.
The framework achieves state-of-the-art zero-shot performance by simulating human-like multi-perspective analysis.

TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

TACTIC (Translation Agents with Cognitive-Theoretic Interactive Collaboration): introduces a multi-agent translation framework inspired by cognitive translation studies, including DraftAgent (Generates multiple drafts), RefinementAgent (Synthesizes drafts), EvaluationAgent (Evaluates translation quality), ScoreAgent (

Name		Name	Last commit message	Last commit date
Latest commit History 1,357 Commits
resources		resources
LICENSE		LICENSE
README.md		README.md

License

tmgthb/Autonomous-Agents

Folders and files

Latest commit

History

Repository files navigation

Autonomous Agents

Research papers: 2025

22nd July 2025

21st July 2025

20th July 2025

19th July 2025

18th July 2025

17th July 2025

16th July 2025

15th July 2025

14th July 2025

13th July 2025

12th July 2025

11th July 2025

10th July 2025

9th July 2025

8th July 2025

7th July 2025

6th July 2025

5th July 2025

4th July 2025

3rd July 2025

2nd July 2025

1st July 2025

30th June 2025

29th June 2025

28th June 2025

27th June 2025

26th June 2025

25th June 2025

24th June 2025

23th June 2025

20th of June 2025

18th June 2025

17th June 2025

16th June 2025

15th June 2025

14th June 2025

13th June 2025

12th June 2025

11th June 2025

10th June 2025

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Packages