This repository contains the source code for AgentSim, a platform for comparative analysis of agentic retrieval workflows.
Training trustworthy agentic LLMs requires data that shows the grounded reasoning process, not just the final answer. Existing datasets fall short: question-answering data is outcome-only, chain-of-thought data is not tied to specific documents, and web-agent datasets track interface actions rather than the core retrieval and synthesis steps of a RAG workflow. We introduce AgentSim, an open-source platform for simulating RAG agents that generates verifiable, stepwise traces of agent reasoning over any document collection. AgentSim uses an exploration policy to ensure the agent covers the document set broadly, and it combines a multi-model validation pipeline with an active human-in-the-loop process that focuses human effort on difficult steps where models disagree. Using AgentSim, we built the Agent-Trace Corpus (ATC), a large collection of grounded RAG trajectories over the MSMARCO, Quasar-T, and CausalQA-22 corpora, and we use ATC to compare how different LLMs manage exploration, repeated information, and uncertainty. AgentSim is thus a new tool for analyzing and training verifiable RAG agents.
Keywords: Agentic AI, Simulation Platform, Retrieval-Augmented Generation (RAG), Data Generation
AgentSim is a controlled environment for running multi-agent simulations over document collections. It:
- Compares Models: Run identical tasks across GPT-4o, Mistral-Large, DeepSeek-V3, and others
- Captures Reasoning: Log complete thought processes, not just final answers
- Generates Training Data: Produce query-document-answer triples with reasoning chains
- Enables Analysis: Quantify exploration strategies, retrieval patterns, and synthesis behaviors
AgentSim has generated a unified training corpus from 3,000 exploratory simulations:
Location: data/corpus/
Format: Compressed JSONL, ready for LLM training
Documentation: See data/corpus/README.md
```bash
cd agentsim
poetry install
cp .env.example .env
# Add your API keys: OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.
poetry run agentsim info

# Standard mode: Fixed workflow, single query
poetry run agentsim simulate standard_gpt4o

# Exploratory mode: Multi-hop knowledge expansion
poetry run agentsim simulate exploratory_seeds

ls data/simulation_output/standard_gpt4o/
```

Each run produces:
- traces.jsonl – Step-by-step reasoning with LLM I/O
- trajectories.jsonl – High-level action sequences
- supervised.jsonl – Query-document-answer training pairs
- config.json – Run configuration
- stats.json – Execution statistics
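These files can be streamed line by line for quick inspection. A minimal sketch, assuming each line is a self-contained JSON object; the "step" and "action" keys are illustrative, so consult the files themselves for the exact schema:

```python
import json

# Stream steps from a run's trace log. The "step" and "action" keys
# are assumptions for illustration; inspect the file for the real schema.
def iter_trace_steps(path="data/simulation_output/standard_gpt4o/traces.jsonl"):
    with open(path) as f:
        for line in f:
            yield json.loads(line)

for step in iter_trace_steps():
    print(step.get("step"), step.get("action"))
```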
```
agentsim/
├── agentsim/ # Core platform code
│ ├── components/ # Retrieval, planning, synthesis components
│ ├── workflow/ # Workflow execution engine
│ ├── simulation/ # Simulation modes (standard, adaptive, exploratory)
│ ├── clients/ # LLM and retrieval clients
│ └── cli.py # Command-line interface
│
├── templates/
│ ├── workflows/ # Reusable workflow definitions
│ └── simulations/ # End-to-end experiment configs
│
├── data/
│ ├── seeds/ # Seed queries for simulations
│ ├── datasets/ # Raw datasets (gitignored)
│ ├── simulation_output/ # Run outputs (gitignored)
│ └── corpus/ # Training-ready corpus
│ ├── traces/ # Reasoning traces
│ ├── supervised/ # Training pairs
│ ├── trajectories/ # High-level actions
│ ├── retrievals/ # Document logs
│ └── queries/ # All queries
│
└── README.md               # This file
```
Standard mode: Fixed workflow execution with similarity-based stopping.
Use Case: Reproducible benchmarking
Example: poetry run agentsim simulate standard_gpt4o
Adaptive mode: The teacher model selects components dynamically (HATEOAS).
Use Case: Studying decision-making strategies
Example: poetry run agentsim simulate adaptive_gpt4o
Exploratory mode: Multi-hop knowledge expansion with Active Validation.
Use Case: Generating diverse training data
Example: poetry run agentsim simulate exploratory_seeds
Active Validation: The teacher and a consultant model run in parallel. When their outputs disagree, the step is flagged as high-uncertainty, so human review can focus on the cases where models diverge.
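A minimal sketch of this pattern, assuming hypothetical `teacher` and `consultant` callables and an `agree` predicate (e.g., exact match or an embedding-similarity threshold); the platform's actual implementation may differ:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative Active Validation step: query both models in parallel
# and flag the step when the agreement predicate fails.
def validate_step(prompt, teacher, consultant, agree):
    with ThreadPoolExecutor(max_workers=2) as pool:
        teacher_future = pool.submit(teacher, prompt)
        consultant_future = pool.submit(consultant, prompt)
        teacher_out = teacher_future.result()
        consultant_out = consultant_future.result()
    return {
        "output": teacher_out,
        "high_uncertainty": not agree(teacher_out, consultant_out),
    }
```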
- OpenSearch: Full-text search over local indices
- ChatNoir: Web-scale retrieval (MS MARCO, ClueWeb)
- Vector Search: Dense retrieval with embeddings
- Query Decomposer: Break complex queries into sub-questions
- Planner: Generate multi-step reasoning plans
- Reflector: Analyze knowledge gaps and suggest next queries
- Answer Drafter: Generate candidate answers from evidence
- Finalizer: Produce final answer with citations
- Verifier: Check claims against retrieved evidence
- Filter: Remove low-quality or irrelevant documents
- Deduplicator: Remove duplicate content
- Reranker: Reorder documents by relevance
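Processing components share a simple contract: documents in, documents out. As an illustration (the function names below are not the platform's actual classes), a list of stages can be applied as a left fold:

```python
# Illustrative document-processing stages: each takes and returns a list
# of {"text", "score"} dicts, so a workflow is a simple left fold.
def filter_docs(docs, min_score=0.5):
    return [d for d in docs if d["score"] >= min_score]

def deduplicate(docs):
    seen, out = set(), []
    for d in docs:
        if d["text"] not in seen:
            seen.add(d["text"])
            out.append(d)
    return out

def run_pipeline(docs, stages):
    for stage in stages:
        docs = stage(docs)
    return docs

docs = [
    {"text": "turbine blade fatigue", "score": 0.9},
    {"text": "turbine blade fatigue", "score": 0.8},  # duplicate
    {"text": "unrelated snippet", "score": 0.2},      # filtered out
]
print(run_pipeline(docs, [filter_docs, deduplicate]))
```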
```bash
cd data/simulation_output/exploratory_seeds
python evaluate_runs.py --runs 6601a11b b1b576a8 fb969481
```

Computes:
- Exploration breadth (unique documents)
- Retrieval redundancy (repeated retrievals)
- Query statistics (count, length, diversity)
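A rough sketch of how breadth and redundancy can be derived from the trajectory logs; the retrieved_doc_ids field is an assumption about the schema, not a documented name:

```python
import gzip
import json
from collections import Counter

# Count how often each document is retrieved across a run's trajectories.
# Breadth = number of unique documents; redundancy = share of repeats.
def exploration_metrics(path):
    open_fn = gzip.open if str(path).endswith(".gz") else open
    doc_counts = Counter()
    with open_fn(path, "rt") as f:
        for line in f:
            record = json.loads(line)
            doc_counts.update(record.get("retrieved_doc_ids", []))
    total = sum(doc_counts.values())
    unique = len(doc_counts)
    return {
        "breadth": unique,
        "redundancy": 1 - unique / total if total else 0.0,
    }
```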
```bash
python analyze_reformulation.py --runs 6601a11b b1b576a8 fb969481
```

Classifies query reformulations:
- Conceptual: Semantic expansion (e.g., "bridge collapse" → "structural failure aerodynamics")
- Procedural: Step-by-step planning (e.g., "I will first retrieve...")
- Syntactic: Keyword simplification (e.g., "why did X happen?" → "X cause")
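As a toy illustration of these three categories (the released script may classify quite differently, e.g., with an LLM judge):

```python
# Toy heuristic for the three reformulation classes; the actual
# analyze_reformulation.py may use a very different classifier.
PROCEDURAL_MARKERS = ("i will", "first,", "then", "next,")

def classify_reformulation(original: str, reformulated: str) -> str:
    r = reformulated.lower()
    if any(marker in r for marker in PROCEDURAL_MARKERS):
        return "procedural"  # step-by-step planning language
    orig_terms = set(original.lower().split())
    if set(r.split()) - orig_terms:
        return "conceptual"  # introduces new semantic content
    return "syntactic"       # same terms, simplified surface form

print(classify_reformulation(
    "bridge collapse", "structural failure aerodynamics"))  # conceptual
```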
```bash
python generate_multi_dataset_figure.py --output-dir figures/ --format pdf
```

Creates publication-ready visualizations comparing models across datasets.
Use corpus/traces/all_traces.jsonl.gz:
- Step-by-step reasoning with thought → action → observation
- LLM input/output for each step
- Multi-model outputs for comparison
Use corpus/supervised/all_supervised.jsonl.gz:
- Query-document-answer triples
- Reasoning chains showing derivation
- Multi-hop reasoning examples
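A minimal sketch for turning these records into prompt/completion fine-tuning pairs; the "query", "documents", and "answer" keys are assumptions, so check data/corpus/README.md for the actual schema:

```python
import gzip
import json

# Stream corpus records and format them as prompt/completion pairs.
# The "query"/"documents"/"answer" keys are assumed for illustration.
def to_training_pairs(path="data/corpus/supervised/all_supervised.jsonl.gz"):
    with gzip.open(path, "rt") as f:
        for line in f:
            ex = json.loads(line)
            evidence = "\n\n".join(ex.get("documents", []))
            prompt = f"Question: {ex['query']}\n\nEvidence:\n{evidence}\n\nAnswer:"
            yield {"prompt": prompt, "completion": ex["answer"]}
```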
Use corpus/trajectories/all_trajectories.jsonl.gz:
- High-level decision sequences
- Document retrieval patterns
- Query reformulation strategies
Use corpus/queries/all_queries.json.gz:
- Original queries and reformulations
- Semantic vs syntactic patterns
- Model-specific strategies
Create templates/workflows/my_workflow.yaml:
```yaml
name: my_workflow
components:
  - type: retriever
    config:
      top_k: 20
  - type: filter
    config:
      min_score: 0.5
  - type: answer_drafter
    config:
      max_tokens: 500
  - type: finalizer
```

Create templates/simulations/my_experiment.yaml:
```yaml
name: my_experiment
mode: standard
workflow: my_workflow
teacher_model:
  name: gpt4o
  model_id: gpt-4o
  temperature: 0.7
dataset:
  name: msmarco
  split: train
  sample_size: 100
retrieval:
  backend: opensearch
  index: msmarco-v2.1-segmented
output_dir: data/simulation_output/my_experiment
```

Then run it:

```bash
poetry run agentsim simulate my_experiment
```

```bash
# Show platform info
agentsim info
# List available templates
agentsim list
# Run simulation
agentsim simulate <template_name>
# Generate seed queries
agentsim seed-select \
--dataset msmarco \
--split train \
--num-seeds 1000 \
--retrieval opensearch \
--output seeds_msmarco_1k.jsonl
# Validate configuration
agentsim validate <template_name>
```

This guide demonstrates the core capabilities of AgentSim for reviewers and new users.
```bash
# Python 3.10+, Poetry
git clone https://github.com/searchsim-org/agentsim.git
cd agentsim
poetry install
cp .env.example .env
# Add your API keys to .env
```

Execute a single-query RAG workflow with GPT-4o:
```bash
poetry run agentsim simulate standard_gpt4o
```

Expected Output: data/simulation_output/standard_gpt4o/ containing traces, trajectories, and training pairs.
Run exploratory mode to generate multi-hop reasoning traces:
```bash
poetry run agentsim simulate exploratory_seeds
```

This executes 1,000 seed queries from the MSMARCO, Quasar-T, and CausalQA datasets.
Expected Output: Step-by-step reasoning logs with Active Validation flags for high-uncertainty steps.
The pre-generated corpus is available in data/corpus/:
```bash
# View corpus statistics
cat data/corpus/corpus_stats.json

# Examine a trace sample
zcat data/corpus/traces/all_traces.jsonl.gz | head -n 1 | jq .
```

The interactive workflow designer is available at agentsim.searchsim.org for real-time simulation inspection and configuration.
- Visual Platform: agentsim.searchsim.org
- Corpus Documentation: data/corpus/README.md
- Seed Documentation: data/seeds/README.md
- Simulation Output Documentation: data/simulation_output/README.md
