✨ MARS: Toward More Efficient Multi-Agent Collaboration for LLM Reasoning

This repository provides the necessary scripts and examples to run the MARS pipeline and reproduce the experimental results from our paper.

📘 Introduction

Coming soon... Stay tuned for an overview of our framework, key ideas, and applications.

🚀 Usage

This section walks through how to run the core functionalities of MARS.

🧰 Prerequisites

Clone the repo and install dependencies:

git clone https://github.com/xwang97/MARS.git
cd MARS
pip install -r requirements.txt

Configure the backend LLMs by editing config.yml (default: all use GPT-3.5 Turbo):

author_llm: "gpt-3.5-turbo"
reviewer_llms:
  - "gpt-3.5-turbo"
  - "gpt-3.5-turbo"
  - "gpt-3.5-turbo"
meta_llm: "gpt-3.5-turbo"
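
Before launching a run, it can help to sanity-check the configuration. The sketch below assumes config.yml has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`). `check_config` is a hypothetical helper, not part of the MARS codebase, and the required keys are simply the three shown in the example above.

```python
# Hypothetical helper: the key names come from the config.yml example above.
REQUIRED_KEYS = ("author_llm", "reviewer_llms", "meta_llm")


def check_config(cfg):
    """Return a list of problems found in a parsed config dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in cfg:
            problems.append(f"missing key: {key}")
    # reviewer_llms is shown as a YAML list with one model name per reviewer.
    reviewers = cfg.get("reviewer_llms", [])
    if not isinstance(reviewers, list) or not reviewers:
        problems.append("reviewer_llms should be a non-empty list of model names")
    return problems
```

An empty return value means the three expected keys are present and `reviewer_llms` is a non-empty list; it does not verify that the model names are valid for your provider.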

🔐 API Keys: Store each API key in its own .txt file outside the repo:

- For OpenAI: openai_api_key.txt
- For NVIDIA NIM: nvidia_api_key.txt (see NVIDIA NIM API)
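
The README does not say how MARS itself reads these files, so the following is only a minimal sketch of reading a key from such a file in your own scripts. `load_api_key` is a hypothetical helper, and the exact directory where the key files live is an assumption you would adapt.

```python
from pathlib import Path


def load_api_key(path):
    """Read an API key from a text file, stripping surrounding whitespace."""
    key_file = Path(path).expanduser()
    if not key_file.is_file():
        raise FileNotFoundError(f"API key file not found: {key_file}")
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"API key file is empty: {key_file}")
    return key
```

For example, `load_api_key("../openai_api_key.txt")` would read a key stored one directory above the repo; that relative path is illustrative only.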

🧪 Quick Example

Run the full MARS pipeline from a Python terminal:

from pipelines import PipelineRunner

runner = PipelineRunner(task="gpqa")
review_history = runner.run_mars_pipeline(user_query="What is 9 × 7?", n_reviewers=3, verbosity=1)

# Use the rebuttal as the final answer when one exists; otherwise fall back to the initial response.
response = review_history['author_rebuttal'] if 'author_rebuttal' in review_history else review_history['author_response']

📌 Parameters

Name	Description
`task`	Dataset/task name. Choose from `"gsm"`, `"gsm_hard"`, `"math"`, `"ciar"` (🧮 math data) or `"mmlu"`, `"gpqa"` (📚 multiple-choice QA).
`user_query`	The input question (just the raw question text; no prompt formatting needed).
`n_reviewers`	Number of reviewers (recommended: 2 or 3; default: 3).
`verbosity`	Set to `1` to print step-by-step output; default is `0`.

📤 Output

response: The final answer (initial author response or rebuttal).
review_history: A dictionary containing all intermediate steps:
- author_response, review1, review2, ..., meta_review, author_rebuttal (if applicable).
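
To inspect a run, the dictionary can be walked in pipeline order. The sketch below assumes only the key names listed above (`author_response`, `review1`, `review2`, ..., `meta_review`, `author_rebuttal`); `ordered_steps` and `final_response` are hypothetical helpers, not functions shipped with MARS.

```python
def ordered_steps(review_history):
    """Return (step_name, text) pairs in pipeline order, skipping absent steps."""
    prefix = "review"
    # Collect review1, review2, ... and sort them by their numeric suffix.
    review_keys = sorted(
        (k for k in review_history
         if k.startswith(prefix) and k[len(prefix):].isdigit()),
        key=lambda k: int(k[len(prefix):]),
    )
    order = ["author_response", *review_keys, "meta_review", "author_rebuttal"]
    return [(k, review_history[k]) for k in order if k in review_history]


def final_response(review_history):
    """Pick the rebuttal if the author revised the answer, else the first response."""
    if "author_rebuttal" in review_history:
        return review_history["author_rebuttal"]
    return review_history["author_response"]
```

With `verbosity=0`, something like `for name, text in ordered_steps(review_history): print(name, text)` gives a compact trace of the run after the fact.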

📈 Evaluation

You can reproduce all experiments from the paper using evaluation.py. For example:

from evaluation import eval_mars

multi_score, _, avg_tokens, avg_time = eval_mars(
    task="gpqa",
    n_problems=100,
    n_reviewers=2,
    selected=True
)

This evaluates MARS on 100 questions from the GPQA dataset using 2 reviewers.

📌 Parameters

Name	Description
`task`	Same as in `PipelineRunner`.
`n_problems`	Number of test questions (due to cost, we recommend a subset).
`n_reviewers`	Number of reviewers (2 or 3).
`selected`	If `True`, uses a saved question list for reproducibility. Set `False` on first run to generate and save one automatically.

📤 Output

multi_score: Number of correct final answers.
avg_tokens: Average tokens consumed per question.
avg_time: Average inference time per question.
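
Because `multi_score` is described as a count rather than a rate, a small post-processing step is needed to report accuracy. The sketch below is an illustration only: `summarize` is a hypothetical helper, and the totals assume that `avg_tokens` and `avg_time` are plain per-question means over the `n_problems` evaluated questions (the time unit is whatever `evaluation.py` reports).

```python
def summarize(multi_score, n_problems, avg_tokens, avg_time):
    """Turn raw evaluation outputs into accuracy and run-level totals."""
    if n_problems <= 0:
        raise ValueError("n_problems must be positive")
    return {
        # Fraction of questions answered correctly.
        "accuracy": multi_score / n_problems,
        # Per-question averages scaled back up to the whole run.
        "total_tokens": avg_tokens * n_problems,
        "total_time": avg_time * n_problems,
    }
```

Calling `summarize(multi_score, 100, avg_tokens, avg_time)` after the evaluation example above would give the accuracy over the 100 GPQA questions.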
