InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

📄 arXiv • 🌐 Demo


If you like our project, please give us a star on GitHub for the latest updates!

[Figure: overview of the InnoEval method]

InnoEval is an automated evaluation framework designed for assessing research ideas and innovation proposals. It leverages a multi-agent system and large language models (LLMs) to comprehensively evaluate the novelty, feasibility, and significance of research contributions.

  • Multi-Agent Pipeline
    A chain of specialized agents (Extraction, Research, Grounding, Evaluation, Report) working together

  • Multi-Source Grounding
    Gathers evidence from web pages, code repositories, and academic papers to validate claims

  • Persona-Based Evaluation
    Simulates multiple reviewer perspectives for balanced and comprehensive assessment

  • Flexible Input Modes
    Supports both PDF URLs and direct text input for research ideas

  • Batch Processing
    Point-wise and group-wise evaluation for large-scale dataset analysis

Table of Contents

  • 📥 Installation
  • 🎬 Quick Start
  • 📂 Architecture
  • 🔬 Examples
  • 🛠 Configuration
  • 📄 Acknowledgement
  • ✍️ Citation
  • License

📥 Installation

1. Clone the Repository

git clone https://github.com/zjunlp/InnoEval.git
cd InnoEval

2. Create Virtual Environment

conda create -n innoeval python=3.10 -y
conda activate innoeval

3. Install Dependencies

pip install -r requirements.txt

4. Configure API Keys

Copy the example configuration file and fill in your API keys:

cd config/
cp LLM.env.example LLM.env
# Edit LLM.env with your API keys

Required API keys:

Key              Description
DS_API_KEY       DeepSeek API key (primary LLM)
DS_API_BASE_URL  DeepSeek API base URL
OPENAI_API_KEY   OpenAI API key (optional)
GOOGLE_API_KEY   Google Search API key
SERPER_API_KEY   Serper API key for web search
JINA_API_KEY     Jina API key for content extraction
S2_API_KEY       Semantic Scholar API key
GH_TOKEN         GitHub token for repository analysis

🎬 Quick Start

1. Single Idea Evaluation

Run the complete pipeline for a single research idea:

cd InnoEval
python3 -m innoeval.pipeline.single_idea_pipeline

This executes the full 6-step pipeline:

  1. ExtractionAgent: Extract structured idea from PDF/text
  2. ResearchAgent: Search for related works (web, code, papers)
  3. Report Extraction: Build evidence reports from search results
  4. GroundingAgent: Map claims to supporting evidence
  5. EvaluationAgent: Multi-perspective quality assessment
  6. ReportAgent: Generate final evaluation report
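Each step's output is stored in the run's cache JSON under a matching key (extraction_result, search_results_dict, reports_data, grounding_result, evaluation_result, final_report; see Cache Structure below), so intermediate results can be inspected after the run.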

2. Point-wise Dataset Evaluation

Evaluate an entire dataset of research papers:

python3 -m innoeval.pipeline.batch_pipeline

Results are saved to cache/dataset_conference_points/.
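To get a quick summary of a finished point-wise run, a small script like the sketch below can tally the cached decisions. It assumes one JSON result file per paper in that directory (adjust the glob if the pipeline names files differently) and relies only on the final_decision field documented under Cache Structure below.

import json
from pathlib import Path

# Tally final decisions across cached point-wise results.
# Assumes one JSON file per paper; adjust the glob pattern if needed.
cache_dir = Path("cache/dataset_conference_points")
decisions = {}
for result_file in sorted(cache_dir.glob("*.json")):
    result = json.loads(result_file.read_text())
    decision = result.get("final_decision", "unknown")
    decisions[decision] = decisions.get(decision, 0) + 1

print(decisions)  # e.g. {"accept": 12, "reject": 8}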

3. Group Dataset Evaluation

Process papers organized in groups:

python3 -m innoeval.pipeline.group_pipeline

Results are saved to cache/dataset_conference_groups/.

4. Group/Pair Evaluation

Run comparison evaluation on cached group results:

# Group-wise comparison and ranking
python3 -m innoeval.pipeline.group_evaluation

# Pair-wise comparison
python3 -m innoeval.pipeline.pair_evaluation

These scripts read from cache/dataset_conference_groups/ and do not re-run the pipeline.

📂 Architecture

Directory Structure

InnoEval/
├── config/                     # Configuration files
│   ├── LLM.env                 # API keys (not tracked)
│   ├── LLM.env.example         # Example configuration
│   └── kaggle.json             # Kaggle API config
├── dataset/                    # Evaluation datasets
│   ├── conference_points.jsonl # Point-wise dataset
│   ├── conference_groups.json  # Group-wise dataset
│   └── conference_pairs_*.json # Pair datasets
├── cache/                      # Pipeline results cache
│   └── reviewer_personas.json  # Reviewer personas
└── innoeval/                   # Main package
    ├── mas/                    # Multi-Agent System
    │   ├── agents/             # Agent implementations
    │   │   ├── extraction_agent.py
    │   │   ├── research_agent.py
    │   │   ├── grounding_agent.py
    │   │   ├── evaluation_agent.py
    │   │   └── report_agent.py
    │   ├── models/             # LLM and model interfaces
    │   │   ├── model_factory.py
    │   │   └── bge_singleton.py
    │   └── tools/              # Utility tools
    │       ├── searchers/      # Web/code/paper search
    │       ├── querygen/       # Query generation
    │       ├── enricher/       # Content enrichment
    │       ├── grobid_refs/    # Reference extraction
    │       └── repo_analysis/  # GitHub repo analysis
    └── pipeline/               # Pipeline implementations
        ├── single_idea_pipeline.py
        ├── batch_pipeline.py
        ├── group_pipeline.py
        ├── group_evaluation.py
        └── pair_evaluation.py

Pipeline Workflow

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Input: PDF URL │───▶│ ExtractionAgent │───▶│   Idea Object   │
│  or Text Input  │    │   (Extract)     │    │  (structured)   │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web Pages     │    │  ResearchAgent  │───▶│  SearchResults  │
│   Code Repos    │◀───│    (Search)     │    │   (enriched)    │
│   Papers        │    └─────────────────┘    └────────┬────────┘
└─────────────────┘                                    │
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Claims Map    │◀───│ GroundingAgent  │◀───│  Reports Data   │
│  (evidence)     │    │   (Grounding)   │    │  (extracted)    │
└────────┬────────┘    └─────────────────┘    └─────────────────┘
         │
         ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Personas      │───▶│EvaluationAgent  │───▶│ EvaluationResult│
│  (reviewers)    │    │   (Evaluate)    │    │   (per-persona) │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │  ReportAgent    │
                                              │  (Synthesize)   │
                                              └────────┬────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │  Final Report   │
                                              │  (Markdown)     │
                                              └─────────────────┘

Evaluation Dimensions

The framework evaluates research ideas across five core dimensions:

Dimension     Description
Clarity       How clearly the idea is presented and explained
Novelty       Originality and innovation compared to existing work
Validity      Soundness of methodology and theoretical foundations
Feasibility   Practical implementability with available resources
Significance  Potential impact and contribution to the field

Custom evaluation metrics can be added through the user_metric parameter.
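Each entry in user_metric is a dict with a metric name and a description, in the same format shown in Example 3 below, e.g.:

user_metric = [
    {
        "metric": "Scalability",
        "description": "Assess whether the approach remains practical as data or model size grows"
    }
]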

🔬 Examples

Example 1: Evaluate from PDF URL

import asyncio
from pathlib import Path
from innoeval.pipeline.single_idea_pipeline import SingleIdeaPipeline

async def evaluate_paper():
    pipeline = SingleIdeaPipeline(
        input_type="pdf",
        pdf_url="https://openreview.net/pdf?id=YOUR_PAPER_ID",
        cache_path=Path("cache/my_paper.json"),
        persona_path=Path("cache/reviewer_personas.json"),
        research_params={
            "title": "Your Paper Title",
            "after": "2022-01-01",
            "before": "2024-01-01",
            "depth": 3,
        },
        num_personas=5,
        get_future_paper=True,
    )
    result = await pipeline.run()
    print(result["final_report"])

asyncio.run(evaluate_paper())

Example 2: Evaluate from Text

import asyncio
from pathlib import Path
from innoeval.pipeline.single_idea_pipeline import SingleIdeaPipeline

async def evaluate_idea():
    idea_text = """
    This paper introduces a novel approach to automated code review
    using large language models with retrieval-augmented generation...
    """

    pipeline = SingleIdeaPipeline(
        input_type="text",
        idea_text=idea_text,
        cache_path=Path("cache/my_idea.json"),
        research_params={
            "title": "LLM-based Code Review",
            "after": "2023-01-01",
            "before": "2024-12-01",
        },
        num_personas=3,
    )
    result = await pipeline.run()
    print(result["final_decision"])

asyncio.run(evaluate_idea())

Example 3: Custom Evaluation Metrics

# The evaluation agent supports custom metrics
eval_params = {
    "temperature": 0.7,
    "user_metric": [
        {
            "metric": "Reproducibility",
            "description": "Evaluate whether sufficient detail is provided for reproduction"
        },
        {
            "metric": "EthicalConsiderations",
            "description": "Assess potential ethical implications and mitigation strategies"
        }
    ]
}

Example 4: Batch Processing with Custom Dataset

# Create a JSONL file with format:
# {"paper_id": "xxx", "title": "...", "decision": "accept"}
# Then run:
# python3 -m innoeval.pipeline.batch_pipeline

# Or programmatically:
from pathlib import Path
from innoeval.pipeline.batch_pipeline import load_dataset, process_paper

items = load_dataset(Path("dataset/my_papers.jsonl"), num=10)
for item in items:
    print(f"Processing: {item.title}")

🛠 Configuration

LLM Configuration

The config/LLM.env file controls all API settings:

# Primary LLM (DeepSeek)
DS_API_KEY=your_deepseek_key
DS_API_BASE_URL=https://api.deepseek.com/v1

# OpenAI (alternative)
OPENAI_API_KEY=your_openai_key
OPENAI_API_BASE_URL=https://api.openai.com/v1

# Search APIs
GOOGLE_API_KEY=your_google_key
SERPER_API_KEY=your_serper_key
JINA_API_KEY=your_jina_key
S2_API_KEY=your_semantic_scholar_key

# GitHub
GH_TOKEN=your_github_token

# Kaggle (optional)
KAGGLE_CONFIG_DIR=./config
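Before launching a long run, it can be worth checking that the required variables are actually set. The sketch below only parses config/LLM.env with the python-dotenv package (install it separately if it is not already pulled in by requirements.txt); it is independent of how InnoEval itself loads the file.

from dotenv import dotenv_values

# Parse config/LLM.env without touching os.environ and report missing keys.
config = dotenv_values("config/LLM.env")
required = ["DS_API_KEY", "DS_API_BASE_URL", "GOOGLE_API_KEY",
            "SERPER_API_KEY", "JINA_API_KEY", "S2_API_KEY", "GH_TOKEN"]

missing = [key for key in required if not config.get(key)]
if missing:
    print("Missing or empty keys:", ", ".join(missing))
else:
    print("All required keys are set.")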

Model Configuration

The default model configuration in SingleIdeaPipeline:

model_config = {
    "models": {
        "default_provider": "dsr1",
        "dsr1": {
            "model_name": "deepseek-v3.2",
            "api_key": os.getenv("DS_API_KEY"),
            "base_url": os.getenv("DS_API_BASE_URL"),
            "max_tokens": 4096,
            "temperature": 0.7,
        },
    }
}
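To point the pipeline at OpenAI instead, the same structure should take an extra provider entry such as the sketch below. The provider key and model name here are illustrative assumptions; check innoeval/mas/models/model_factory.py for the provider names and fields it actually recognizes.

import os

# Hypothetical OpenAI-backed configuration mirroring the structure above.
# Provider key ("openai") and model name are illustrative only.
model_config = {
    "models": {
        "default_provider": "openai",
        "openai": {
            "model_name": "gpt-4o",
            "api_key": os.getenv("OPENAI_API_KEY"),
            "base_url": os.getenv("OPENAI_API_BASE_URL"),
            "max_tokens": 4096,
            "temperature": 0.7,
        },
    }
}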

Agent Parameters

Agent            Key Parameters
ExtractionAgent  extract_temperature: 0.3
ResearchAgent    top_k: 10, max_results_per_query: 5, web_max_results: 5, github_max_results: 5
GroundingAgent   extract_temperature: 0.0
EvaluationAgent  temperature: 0.7, num_personas: 5
ReportAgent      temperature: 0.4

Research Parameters

Parameter         Type   Description
title             str    Paper title for search optimization
after             str    Search papers after this date (YYYY-MM-DD)
before            str    Search papers before this date (YYYY-MM-DD)
depth             int    Search depth (1-5)
web_temperature   float  Temperature for web search queries
code_temperature  float  Temperature for code search queries
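Putting these together, a fully specified research_params dict looks like the following; the values are illustrative, and only the fields you need have to be set (Examples 1 and 2 use subsets).

research_params = {
    "title": "LLM-based Code Review",   # paper title used to steer search
    "after": "2023-01-01",              # only consider sources after this date
    "before": "2024-12-01",             # only consider sources before this date
    "depth": 3,                         # search depth, 1-5
    "web_temperature": 0.7,             # temperature for web search query generation
    "code_temperature": 0.7,            # temperature for code search query generation
}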

Cache Structure

Pipeline results are cached in JSON format:

{
  "extraction_result": {...},
  "search_results_dict": {...},
  "reports_data": {...},
  "grounding_result": {...},
  "evaluation_result": {...},
  "final_report": "...",
  "final_decision": "accept/reject",
  "total_time": 123.45,
  "total_token": 50000
}
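Because each run is a single JSON file, the cached fields can be inspected directly; for example (the path is whatever cache_path you passed to the pipeline):

import json
from pathlib import Path

# Load a cached pipeline result and print a short summary.
result = json.loads(Path("cache/my_paper.json").read_text())

print("Decision:", result["final_decision"])
print("Total time (s):", result["total_time"])
print("Total tokens:", result["total_token"])
print(result["final_report"][:500])  # beginning of the Markdown report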

📄 Acknowledgement

This project builds upon and draws inspiration from the following open-source projects:

InternAgent

We thank the InternAgent project for providing foundational multi-agent architecture patterns and evaluation methodologies that influenced our pipeline design.

RepoMaster

We thank RepoMaster for the repository analysis toolkit that enables comprehensive code repository evaluation in our grounding process.


✍️ Citation

If you find our work helpful, please use the following citation.

@misc{qiao2026innoevalresearchideaevaluation,
      title={InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem}, 
      author={Shuofei Qiao and Yunxiang Wei and Xuehai Wang and Bin Wu and Boyang Xue and Ningyu Zhang and Hossein A. Rahmani and Yanshan Wang and Qiang Zhang and Keyan Ding and Jeff Z. Pan and Huajun Chen and Emine Yilmaz},
      year={2026},
      eprint={2602.14367},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.14367}, 
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
