InnoEval is an automated evaluation framework designed for assessing research ideas and innovation proposals. It leverages multi-agent systems and LLMs to comprehensively evaluate the novelty, feasibility, and significance of research contributions.
- **Multi-Agent Pipeline**: A chain of specialized agents (Extraction, Research, Grounding, Evaluation, Report) working together
- **Multi-Source Grounding**: Gathers evidence from web pages, code repositories, and academic papers to validate claims
- **Persona-Based Evaluation**: Simulates multiple reviewer perspectives for balanced and comprehensive assessment
- **Flexible Input Modes**: Supports both PDF URLs and direct text input for research ideas
- **Batch Processing**: Point-wise and group-wise evaluation for large-scale dataset analysis
- Installation
- Quick Start
- Architecture
- Examples
- Configuration
- Acknowledgement
- Citation
git clone https://github.com/your-org/InnoEval.git
cd InnoEval
conda create -n innoeval python=3.10 -y
conda activate innoeval
pip install -r requirements.txt
Copy the example configuration file and fill in your API keys:
cd config/
cp LLM.env.example LLM.env
# Edit LLM.env with your API keys
Required API keys (a quick check of these is sketched after the table):
| Key | Description |
|---|---|
| `DS_API_KEY` | DeepSeek API key (primary LLM) |
| `DS_API_BASE_URL` | DeepSeek API base URL |
| `OPENAI_API_KEY` | OpenAI API key (optional) |
| `GOOGLE_API_KEY` | Google Search API key |
| `SERPER_API_KEY` | Serper API key for web search |
| `JINA_API_KEY` | Jina API key for content extraction |
| `S2_API_KEY` | Semantic Scholar API key |
| `GH_TOKEN` | GitHub token for repository analysis |
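Before running any pipeline, it can help to confirm that the keys are actually picked up from config/LLM.env. The snippet below is a minimal sketch, assuming the `python-dotenv` package is installed; InnoEval's own environment loading may differ.

```python
# Minimal sanity check for config/LLM.env (assumes python-dotenv is installed;
# InnoEval's own loading mechanism may differ).
import os
from dotenv import load_dotenv

load_dotenv("config/LLM.env")

# OPENAI_API_KEY is optional and therefore not checked here.
required = [
    "DS_API_KEY", "DS_API_BASE_URL", "GOOGLE_API_KEY", "SERPER_API_KEY",
    "JINA_API_KEY", "S2_API_KEY", "GH_TOKEN",
]
missing = [key for key in required if not os.getenv(key)]
print("Missing keys:", ", ".join(missing) if missing else "none")
```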
Run the complete pipeline for a single research idea:
cd InnoEval
python3 -m innoeval.pipeline.single_idea_pipeline
This executes the full 6-step pipeline (a sketch of how each step maps to the cached output follows the list):
- ExtractionAgent: Extract structured idea from PDF/text
- ResearchAgent: Search for related works (web, code, papers)
- Report Extraction: Build evidence reports from search results
- GroundingAgent: Map claims to supporting evidence
- EvaluationAgent: Multi-perspective quality assessment
- ReportAgent: Generate final evaluation report
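Each step writes its output into the pipeline's JSON cache (see the result cache format under Configuration). The snippet below is a minimal sketch of checking that cache, assuming a run that cached to `cache/my_paper.json` as in the programmatic example in the Examples section; the step-to-key mapping is inferred from the documented cache fields.

```python
# Sketch: map the six pipeline steps to their cached outputs.
# Assumes the run cached to cache/my_paper.json (the cache_path used in the
# Examples section); the key-to-step mapping follows the documented cache format.
import json
from pathlib import Path

cache = json.loads(Path("cache/my_paper.json").read_text())

step_to_key = {
    "ExtractionAgent": "extraction_result",
    "ResearchAgent": "search_results_dict",
    "Report Extraction": "reports_data",
    "GroundingAgent": "grounding_result",
    "EvaluationAgent": "evaluation_result",
    "ReportAgent": "final_report",
}
for step, key in step_to_key.items():
    status = "present" if key in cache else "missing"
    print(f"{step:<18} -> {key}: {status}")
```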
Evaluate an entire dataset of research papers:
python3 -m innoeval.pipeline.batch_pipeline
Results are saved to cache/dataset_conference_points/.
Process papers organized in groups:
python3 -m innoeval.pipeline.group_pipeline
Results are saved to cache/dataset_conference_groups/.
Run comparison evaluation on cached group results:
# Group-wise comparison and ranking
python3 -m innoeval.pipeline.group_evaluation
# Pair-wise comparison
python3 -m innoeval.pipeline.pair_evaluation
These scripts read from cache/dataset_conference_groups/ and do not re-run the pipeline.
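The point-wise batch run above likewise leaves one cached result per paper, so decisions can be summarized without re-running anything. The snippet below is a minimal sketch, assuming one JSON file per paper under `cache/dataset_conference_points/` with the top-level fields shown in the result cache format under Configuration; the exact file layout is an assumption, so adjust the glob to your run.

```python
# Sketch: tally final decisions from cached point-wise batch results.
# Assumes one JSON file per paper under cache/dataset_conference_points/.
import json
from collections import Counter
from pathlib import Path

decisions = Counter()
for path in Path("cache/dataset_conference_points").glob("*.json"):
    result = json.loads(path.read_text())
    decisions[result.get("final_decision", "unknown")] += 1

print(dict(decisions))  # e.g. {"accept": 12, "reject": 8}
```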
InnoEval/
├── config/                          # Configuration files
│   ├── LLM.env                      # API keys (not tracked)
│   ├── LLM.env.example              # Example configuration
│   └── kaggle.json                  # Kaggle API config
├── dataset/                         # Evaluation datasets
│   ├── conference_points.jsonl      # Point-wise dataset
│   ├── conference_groups.json       # Group-wise dataset
│   └── conference_pairs_*.json      # Pair datasets
├── cache/                           # Pipeline results cache
│   └── reviewer_personas.json       # Reviewer personas
└── innoeval/                        # Main package
    ├── mas/                         # Multi-Agent System
    │   ├── agents/                  # Agent implementations
    │   │   ├── extraction_agent.py
    │   │   ├── research_agent.py
    │   │   ├── grounding_agent.py
    │   │   ├── evaluation_agent.py
    │   │   └── report_agent.py
    │   ├── models/                  # LLM and model interfaces
    │   │   ├── model_factory.py
    │   │   └── bge_singleton.py
    │   └── tools/                   # Utility tools
    │       ├── searchers/           # Web/code/paper search
    │       ├── querygen/            # Query generation
    │       ├── enricher/            # Content enrichment
    │       ├── grobid_refs/         # Reference extraction
    │       └── repo_analysis/       # GitHub repo analysis
    └── pipeline/                    # Pipeline implementations
        ├── single_idea_pipeline.py
        ├── batch_pipeline.py
        ├── group_pipeline.py
        ├── group_evaluation.py
        └── pair_evaluation.py
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Input: PDF URL  │────▶│ ExtractionAgent │────▶│   Idea Object   │
│  or Text Input  │     │    (Extract)    │     │  (structured)   │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Web Pages    │     │  ResearchAgent  │────▶│  SearchResults  │
│    Code Repos   │◀────│    (Search)     │     │   (enriched)    │
│     Papers      │     └─────────────────┘     └────────┬────────┘
└─────────────────┘                                      │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Claims Map   │◀────│ GroundingAgent  │◀────│  Reports Data   │
│    (evidence)   │     │   (Grounding)   │     │   (extracted)   │
└────────┬────────┘     └─────────────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Personas    │────▶│ EvaluationAgent │────▶│ EvaluationResult│
│   (reviewers)   │     │   (Evaluate)    │     │  (per-persona)  │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │   ReportAgent   │
                                                │  (Synthesize)   │
                                                └────────┬────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │  Final Report   │
                                                │   (Markdown)    │
                                                └─────────────────┘
The framework evaluates research ideas across five core dimensions:
| Dimension | Description |
|---|---|
| Clarity | How clearly the idea is presented and explained |
| Novelty | Originality and innovation compared to existing work |
| Validity | Soundness of methodology and theoretical foundations |
| Feasibility | Practical implementability with available resources |
| Significance | Potential impact and contribution to the field |
Custom evaluation metrics can be added through the user_metric parameter.
import asyncio
from pathlib import Path
from innoeval.pipeline.single_idea_pipeline import SingleIdeaPipeline
async def evaluate_paper():
pipeline = SingleIdeaPipeline(
input_type="pdf",
pdf_url="https://openreview.net/pdf?id=YOUR_PAPER_ID",
cache_path=Path("cache/my_paper.json"),
persona_path=Path("cache/reviewer_personas.json"),
research_params={
"title": "Your Paper Title",
"after": "2022-01-01",
"before": "2024-01-01",
"depth": 3,
},
num_personas=5,
get_future_paper=True,
)
result = await pipeline.run()
print(result["final_report"])
asyncio.run(evaluate_paper())
import asyncio
from pathlib import Path
from innoeval.pipeline.single_idea_pipeline import SingleIdeaPipeline
async def evaluate_idea():
idea_text = """
This paper introduces a novel approach to automated code review
using large language models with retrieval-augmented generation...
"""
pipeline = SingleIdeaPipeline(
input_type="text",
idea_text=idea_text,
cache_path=Path("cache/my_idea.json"),
research_params={
"title": "LLM-based Code Review",
"after": "2023-01-01",
"before": "2024-12-01",
},
num_personas=3,
)
result = await pipeline.run()
print(result["final_decision"])
asyncio.run(evaluate_idea())
# The evaluation agent supports custom metrics
eval_params = {
"temperature": 0.7,
"user_metric": [
{
"metric": "Reproducibility",
"description": "Evaluate whether sufficient detail is provided for reproduction"
},
{
"metric": "EthicalConsiderations",
"description": "Assess potential ethical implications and mitigation strategies"
}
]
}
# Create a JSONL file with format:
# {"paper_id": "xxx", "title": "...", "decision": "accept"}
# Then run:
# python3 -m innoeval.pipeline.batch_pipeline
# Or programmatically:
from innoeval.pipeline.batch_pipeline import load_dataset, process_paper
items = load_dataset(Path("dataset/my_papers.jsonl"), num=10)
for item in items:
    print(f"Processing: {item.title}")
The config/LLM.env file controls all API settings:
# Primary LLM (DeepSeek)
DS_API_KEY=your_deepseek_key
DS_API_BASE_URL=https://api.deepseek.com/v1
# OpenAI (alternative)
OPENAI_API_KEY=your_openai_key
OPENAI_API_BASE_URL=https://api.openai.com/v1
# Search APIs
GOOGLE_API_KEY=your_google_key
SERPER_API_KEY=your_serper_key
JINA_API_KEY=your_jina_key
S2_API_KEY=your_semantic_scholar_key
# GitHub
GH_TOKEN=your_github_token
# Kaggle (optional)
KAGGLE_CONFIG_DIR=./config
The default model configuration in SingleIdeaPipeline:
model_config = {
"models": {
"default_provider": "dsr1",
"dsr1": {
"model_name": "deepseek-v3.2",
"api_key": os.getenv("DS_API_KEY"),
"base_url": os.getenv("DS_API_BASE_URL"),
"max_tokens": 4096,
"temperature": 0.7,
},
}
}
| Agent | Key Parameters |
|---|---|
| ExtractionAgent | extract_temperature: 0.3 |
| ResearchAgent | top_k: 10, max_results_per_query: 5, web_max_results: 5, github_max_results: 5 |
| GroundingAgent | extract_temperature: 0.0 |
| EvaluationAgent | temperature: 0.7, num_personas: 5 |
| ReportAgent | temperature: 0.4 |
| Parameter | Type | Description |
|---|---|---|
| `title` | str | Paper title for search optimization |
| `after` | str | Search papers after this date (YYYY-MM-DD) |
| `before` | str | Search papers before this date (YYYY-MM-DD) |
| `depth` | int | Search depth (1-5) |
| `web_temperature` | float | Temperature for web search queries |
| `code_temperature` | float | Temperature for code search queries |
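Putting the table together, a fully populated `research_params` dict could look like the sketch below. The `title`, `after`, `before`, and `depth` values mirror the examples above; the two temperature values are illustrative placeholders.

```python
# Illustrative research_params covering all documented fields
# (temperature values are placeholders, not recommended defaults).
research_params = {
    "title": "LLM-based Code Review",  # paper title used to steer search queries
    "after": "2022-01-01",             # only consider papers after this date
    "before": "2024-01-01",            # ... and before this date
    "depth": 3,                        # search depth (1-5)
    "web_temperature": 0.7,            # temperature for web search query generation
    "code_temperature": 0.3,           # temperature for code search query generation
}
```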
Pipeline results are cached in JSON format:
{
"extraction_result": {...},
"search_results_dict": {...},
"reports_data": {...},
"grounding_result": {...},
"evaluation_result": {...},
"final_report": "...",
"final_decision": "accept/reject",
"total_time": 123.45,
"total_token": 50000
}
This project builds upon and draws inspiration from the following open-source projects:
We thank the InternAgent project for providing foundational multi-agent architecture patterns and evaluation methodologies that influenced our pipeline design.
We thank RepoMaster for the repository analysis toolkit that enables comprehensive code repository evaluation in our grounding process.
If you find our work helpful, please use the following citation.
@misc{qiao2026innoevalresearchideaevaluation,
title={InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem},
author={Shuofei Qiao and Yunxiang Wei and Xuehai Wang and Bin Wu and Boyang Xue and Ningyu Zhang and Hossein A. Rahmani and Yanshan Wang and Qiang Zhang and Keyan Ding and Jeff Z. Pan and Huajun Chen and Emine Yilmaz},
year={2026},
eprint={2602.14367},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.14367},
}
This project is licensed under the MIT License - see the LICENSE file for details.
