# DeepReviewer Tutorial: Automated Peer Review with Large Language Models

Welcome to the DeepReviewer tutorial! This notebook will guide you through using DeepReviewer, a powerful tool for generating automated peer reviews of academic papers. DeepReviewer leverages large language models (LLMs) to provide structured, comprehensive feedback, simulating the insights of multiple human reviewers.

**Key Features:**

*   **Structured Reviews:** Generates reviews with sections like Summary, Soundness, Presentation, Contribution, Strengths, Weaknesses, Suggestions, Questions, Rating, and Confidence.
*   **Multiple Review Modes:** Offers "Fast", "Standard", and "Best" modes to balance speed and detail.
*   **Simulated Reviewers:** "Standard" and "Best" modes simulate multiple reviewers for diverse perspectives.
*   **Meta-Review Generation:** Combines individual reviewer feedback into a concise meta-review.
*   **Structured Output:** Provides results in an easily parsable format.
*   **Customizable Model:** Lets you use different pre-trained DeepReviewer models or your own fine-tuned models.

**This tutorial will cover:**

1.  Setting up the environment and installing dependencies.
2.  Loading a paper for review.
3.  Generating reviews in different modes (Fast, Standard).
4.  Parsing the review results to extract structured feedback.

## 1. Setup and Installation

First, install the required libraries. DeepReviewer relies on `transformers` and `vllm` for efficient model loading and inference. Run the following commands in your terminal (in a notebook cell, prefix each command with `!`):

```bash
pip install transformers
pip install vllm
```

Now, let's import the `DeepReviewer` class and initialize the model.

In [2]:
from ai_researcher.deep_reviewer import DeepReviewer

# Initialize DeepReviewer. This tutorial loads a 14B checkpoint from a local path
# via 'custom_model_name' (the logs below show this model being loaded).
# For faster inference with fewer resources, you can set 'model_size' instead (e.g. "7B").
# Setting device="cpu" is possible but much slower; a GPU is highly recommended.

reviewer = DeepReviewer(custom_model_name="/zhuminjun/llama70/DeepReviewer_14B", device="cuda", tensor_parallel_size=1, gpu_memory_utilization=0.95)

# Other parameters you can customize:
# - custom_model_name:  Path to a custom DeepReviewer model (overrides model_size).
# - tensor_parallel_size:  Number of GPUs to use for parallel processing (for larger models).
# - gpu_memory_utilization:  Fraction of GPU memory to use.

INFO 03-11 13:57:29 config.py:135] Replacing legacy 'type' key with 'rope_type'
INFO 03-11 13:57:36 config.py:526] This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 03-11 13:57:36 config.py:1538] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-11 13:57:36 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/zhuminjun/llama70/DeepReviewer_14B', speculative_config=None, tokenizer='/zhuminjun/llama70/DeepReviewer_14B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=70000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observabil

Loading safetensors checkpoint shards:   0% Completed | 0/6 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  17% Completed | 1/6 [00:07<00:37,  7.50s/it]
Loading safetensors checkpoint shards:  33% Completed | 2/6 [00:13<00:27,  6.80s/it]
Loading safetensors checkpoint shards:  50% Completed | 3/6 [00:19<00:18,  6.23s/it]
Loading safetensors checkpoint shards:  67% Completed | 4/6 [00:24<00:11,  5.96s/it]
Loading safetensors checkpoint shards:  83% Completed | 5/6 [00:29<00:05,  5.57s/it]
Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:36<00:00,  5.93s/it]
Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:36<00:00,  6.07s/it]



INFO 03-11 13:58:15 model_runner.py:1116] Loading model weights took 27.4462 GB
INFO 03-11 13:58:16 worker.py:266] Memory profiling takes 0.71 seconds
INFO 03-11 13:58:16 worker.py:266] the current vLLM instance can use total_gpu_memory (79.14GiB) x gpu_memory_utilization (0.95) = 75.18GiB
INFO 03-11 13:58:16 worker.py:266] model weights take 27.45GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 0.94GiB; the rest of the memory reserved for KV Cache is 46.71GiB.
INFO 03-11 13:58:16 executor_base.py:108] # CUDA blocks: 15304, # CPU blocks: 1310
INFO 03-11 13:58:16 executor_base.py:113] Maximum concurrency for 70000 tokens per request: 3.50x
INFO 03-11 13:58:20 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_

Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████| 35/35 [00:16<00:00,  2.12it/s]

INFO 03-11 13:58:36 model_runner.py:1563] Graph capturing finished in 16 secs, took 0.48 GiB
INFO 03-11 13:58:36 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 21.59 seconds
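
The memory figures in the log above can be reproduced with a little arithmetic, which helps when choosing `gpu_memory_utilization` for a different GPU. The following is a minimal sketch using only numbers copied from the log. The block size of 16 tokens is an assumption (it does not appear in the log), chosen because it reproduces the reported concurrency.

```python
# Values copied from the vLLM log lines above (GiB).
total_gpu_memory = 79.14
gpu_memory_utilization = 0.95
model_weights = 27.45
non_torch_memory = 0.09
activation_peak = 0.94

# Memory budget this vLLM instance is allowed to use.
budget = total_gpu_memory * gpu_memory_utilization

# Whatever remains after weights and overheads is reserved for the KV cache.
kv_cache = budget - model_weights - non_torch_memory - activation_peak

# Assumption: 16 tokens per KV-cache block (not shown in the log).
block_size = 16
cuda_blocks = 15304
max_seq_len = 70000
concurrency = cuda_blocks * block_size / max_seq_len

print(f"budget:      {budget:.2f} GiB")
print(f"KV cache:    {kv_cache:.2f} GiB")
print(f"concurrency: {concurrency:.2f}x")
```

The results agree with the logged values up to rounding of the displayed inputs. Lowering `gpu_memory_utilization` shrinks the KV cache first, since the model weights are a fixed cost.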





## 2. Loading a Paper for Review

We'll load the papers from a JSON file. The file should contain a list of papers, where each paper is a dictionary with at least `title` and `latex` keys. We've provided a sample file named `generated_paper.json` for this tutorial.
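
If you want to prepare your own input file, the sketch below writes a file in the format described above and reads it back. The paper entry, the file name `sample_papers.json`, and the key check are hypothetical illustrations, not part of DeepReviewer.

```python
import json
import os
import tempfile

# Hypothetical placeholder paper; only the `title` and `latex` keys are
# required by the format described above.
sample_papers = [
    {
        "title": "A Placeholder Title",
        "latex": "\\documentclass{article}\n\\begin{document}\nHello.\n\\end{document}\n",
    },
]

# Write to a temporary directory so the tutorial's own file is left untouched.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_papers.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample_papers, f, ensure_ascii=False, indent=2)

# Read the file back and check that every entry has the required keys.
with open(sample_path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

missing = [i for i, p in enumerate(loaded) if not {"title", "latex"} <= set(p)]
print(f"{len(loaded)} paper(s) loaded, entries missing keys: {missing}")
```

Checking the keys up front avoids a `KeyError` later, when the review loop accesses `paper['latex']`.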

In [5]:
import json

# Load the paper(s) from the JSON file
with open('generated_paper.json', 'r', encoding='utf-8') as f:
    papers = json.load(f)

# Print some basic information about the loaded papers
for i, paper in enumerate(papers):
    print(f"Paper {i+1}: Title: {paper['title']}")
    print(f"Paper {i+1}: LaTeX content length: {len(paper['latex'])} characters")
    print('\n')

Paper 1: Title: Scientific Peer Reviewer Agents}


Paper 1: LaTeX content length: 41664 characters


Paper 2: Title: Multi-Agent Review: Simulating Human Reviewers for Scientific Peer Review with Large Language Models}


Paper 2: LaTeX content length: 29971 characters


Paper 3: Title: Referees in AI for Scientific Peer Review}


Paper 3: LaTeX content length: 35620 characters


Paper 4: Title: Evaluating LLM-based AI Reviewer Agent for Scientific Peer Review}


Paper 4: LaTeX content length: 47163 characters


Paper 5: Title: AI-Powered Peer Review Can Help Scientific Progress}


Paper 5: LaTeX content length: 32091 characters


Paper 6: Title: Optimizing AI Agents for Simulated Peer Review}


Paper 6: LaTeX content length: 32312 characters


Paper 7: Title: Multi-Paper Benchmarking Review: Enhancing Scientific Peer Review with Multi-Agent Competition}


Paper 7: LaTeX content length: 43401 characters


Paper 8: Title: Scientific Review Protocol as Multi-agent Competition Game}


Pape
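
The titles printed above end with a stray `}` and appear to carry trailing newlines, probably left over from the LaTeX source. The helper below is a small sketch for tidying such titles before display. `clean_title` is a hypothetical name, and the assumption that the raw title ends in `}` followed by newlines is inferred from the printed output only.

```python
def clean_title(raw_title: str) -> str:
    """Strip surrounding whitespace and one unmatched trailing brace."""
    title = raw_title.strip()
    # Drop a closing brace only when it has no matching opening brace,
    # so titles that legitimately contain {...} groups are left alone.
    if title.endswith("}") and title.count("}") > title.count("{"):
        title = title[:-1].rstrip()
    return title


# Assumed raw form of the first title shown above.
example = "Scientific Peer Reviewer Agents}\n\n"
print(repr(clean_title(example)))
```

Applying it when printing (for example `clean_title(paper['title'])`) leaves the loaded data itself unchanged.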

## 3. Generating Reviews

Now, let's use DeepReviewer to generate reviews. We'll demonstrate both "Fast Mode" and "Standard Mode."

### 3.1 Fast Mode

In [6]:
# Generate reviews in Fast Mode
fast_review_results = reviewer.evaluate([paper['latex'] for paper in papers], mode="Fast Mode")

# Process and print the results for each paper
for i, review_result in enumerate(fast_review_results):
    print(f"\n--- Fast Mode Review for Paper: {papers[i]['title']} ---")
    if review_result:
        print(f"Raw text: {review_result.get('raw_text', 'N/A')}")  # Show the raw output
    else:
        print("Review generation failed.")

Processed prompts: 100%|█████████████████████████████| 10/10 [02:02<00:00, 12.29s/it, est. speed input: 728.04 toks/s, output: 201.14 toks/s]

Of course. I will use Fast Mode for quick thinking. As a professional reviewer, I will directly output a detailed evaluation of this paper. Let me think - Fast Mode means I will directly output a Summary, followed by scores for Soundness, Presentation and Contribution, then provide analysis of Strengths, Weaknesses, Suggestions, and Questions. Finally, I will output the Rating, Confidence and Decision:

\boxed_review{
## Summary:

This paper introduces a novel framework for enhancing scientific peer review using Large Language Models (LLMs). The authors propose a system comprising two key components: the Scientific Peer Reviewer Agent (SPRA) and the Human-LLM Reviewer Agent (HLL-RA). SPRA is designed to simulate the peer review process, generating reviews based on a structured JSON configuration and a detailed prompt. HLL-RA, on the other hand, iteratively refines these simulated reviews by incorporating human feedback. This iterative process involves multiple LLMs generating reviews, 
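
The raw Fast Mode text wraps the review in `\boxed_review{` and uses `## Name:` headings. If you ever need to split that raw text yourself, the sketch below shows one way to do it with a regular expression. `split_review_sections` is a hypothetical helper, the closing `}` is an assumption (the output above is truncated before the end), and the sample string is a made-up miniature, not real model output.

```python
import re


def split_review_sections(raw_text: str) -> dict:
    """Map each '## Name:' heading inside \\boxed_review{...} to its text."""
    marker = "\\boxed_review{"
    start = raw_text.find(marker)
    if start == -1:
        return {}
    body = raw_text[start + len(marker):].rstrip()
    # Assumption: the block is closed by a final '}'.
    if body.endswith("}"):
        body = body[:-1]
    # With one capture group, re.split yields
    # [preamble, name1, text1, name2, text2, ...].
    parts = re.split(r"^## (.+?):[ \t]*$", body, flags=re.MULTILINE)
    return {name.strip(): text.strip() for name, text in zip(parts[1::2], parts[2::2])}


# Hypothetical miniature output, shaped like the Fast Mode text above.
sample = "Intro text\n\\boxed_review{\n## Summary:\n\nA short summary.\n\n## Rating:\n\n6\n}"
sections = split_review_sections(sample)
print(sections)
```

In practice the dictionary returned by `reviewer.evaluate` already exposes parsed fields, so a helper like this is mainly useful for debugging the raw text.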




### 3.2 Standard Mode

In [7]:
# Generate reviews in Standard Mode with 3 simulated reviewers
standard_review_results = reviewer.evaluate([paper['latex'] for paper in papers], mode="Standard Mode", reviewer_num=3)

# Process and print the results for each paper
for i, review_result in enumerate(standard_review_results):
    print(f"\n--- Standard Mode Review for Paper: {papers[i]['title']} ---")
    if review_result:
        print(f"Raw text: {review_result.get('raw_text', 'N/A')}")  # Show the raw output
    else:
        print('Review generation failed')

Processed prompts: 100%|█████████████████████████████| 10/10 [08:51<00:00, 53.16s/it, est. speed input: 169.43 toks/s, output: 194.81 toks/s]

I will use Standard Mode for comprehensive thinking. As a professional reviewer, I will simulate 3 different reviewers, followed by the verification thinking. Then I will output the Finally Review Output. Let me think - Standard Mode means I will output the original review, followed by the verification thinking. Considering that I am currently in standard mode, I should think from my existing knowledge and consider some related work content when writing about weaknesses. Then I will output the Finally Review Output and Meta Review Output:

\boxed_simreviewers{
## Reviewer 1

### Summary

This paper proposes a framework for simulating and optimizing scientific peer review using LLMs. The authors introduce two main components: the Scientific Peer Reviewer Agent (SPRA), which simulates peer reviews based on a JSON configuration, and the Human-LLM Reviewer Agent (HLL-RA), which iteratively optimizes these simulations by incorporating human feedback to better align with human judgment. The 
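
In Standard Mode the raw text contains one `## Reviewer N` heading per simulated reviewer. The sketch below splits such text into per-reviewer chunks. `split_reviewers` is a hypothetical helper and the sample string is a made-up miniature. Because the output above is truncated, what follows the last reviewer in real output is unknown, so the final chunk may include extra trailing text.

```python
import re


def split_reviewers(raw_text: str) -> dict:
    """Map reviewer number to that reviewer's text, using '## Reviewer N' headings."""
    # With one capture group, re.split yields
    # [preamble, num1, text1, num2, text2, ...].
    parts = re.split(r"^## Reviewer (\d+)[ \t]*$", raw_text, flags=re.MULTILINE)
    return {int(num): text.strip() for num, text in zip(parts[1::2], parts[2::2])}


# Hypothetical miniature output, shaped like the Standard Mode text above.
sample = (
    "\\boxed_simreviewers{\n"
    "## Reviewer 1\n\n### Summary\n\nFirst view.\n\n"
    "## Reviewer 2\n\n### Summary\n\nSecond view.\n"
)
by_reviewer = split_reviewers(sample)
print(sorted(by_reviewer))
```

The `### Summary` sub-headings are not matched by the pattern, so each reviewer's sections stay together in one chunk.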




## 4. Parsing Review Results
Let's parse and examine the structured information from the Standard Mode reviews. For each paper, we'll print the meta-review (summary, rating, soundness, presentation, and contribution) and the overall decision.

In [14]:
# Process and print the parsed results for each paper
for i, review_result in enumerate(standard_review_results):
    print(f"\n************\n------\n************\n--- Parsed Review for Paper: {papers[i]['title']}")
    if review_result:
        # --- Meta-Review (if available) ---
        if review_result.get('meta_review'):
            print("\n--- Meta-Review ---")
            print(f"  Summary: {review_result['meta_review'].get('summary', 'N/A')}")
            print(f"  Rating: {review_result['meta_review'].get('rating', 'N/A')}")
            print(f"  Soundness: {review_result['meta_review'].get('soundness', 'N/A')}")
            print(f"  Presentation: {review_result['meta_review'].get('presentation', 'N/A')}")
            print(f"  Contribution: {review_result['meta_review'].get('contribution', 'N/A')}")

        # --- Overall Decision (if available) ---
        if review_result.get('decision'):
            print("\n--- Overall Decision ---")
            print(f"  Decision: {review_result['decision']}")

    else:
        print("Review generation failed.")


************
------
************
--- Parsed Review for Paper: Scientific Peer Reviewer Agents}



--- Meta-Review ---
  Summary: This paper introduces a novel framework for simulating and optimizing scientific peer review using large language models (LLMs). The authors propose two main components: the Scientific Peer Reviewer Agent (SPRA), which simulates peer reviews based on a JSON configuration, and the Human-LLM Reviewer Agent (HLL-RA), which iteratively refines these simulations by incorporating human feedback to better align with human judgment. The SPRA is configured to analyze various components of a scientific paper, such as the abstract, introduction, and methodology, and to identify potential issues like statistical errors or unethical research practices. The HLL-RA then uses human feedback to select the most appropriate reviews, which are subsequently used to fine-tune the LLMs using Direct Preference Optimization (DPO). This iterative process aims to improve the quality a

Next, let's use the parsed results to identify the best paper. We'll extract each paper's average rating, decision, and number of reviewers, then select the paper with the highest average rating.

In [16]:
import numpy as np

# Function to extract average rating from a review
def extract_average_rating(review_result):
    if not review_result:
        return 0.0
    
    # Method 1: Extract from meta_review if available
    if review_result.get('meta_review') and 'rating' in review_result['meta_review']:
        try:
            return float(review_result['meta_review']['rating'])
        except (ValueError, TypeError):
            pass
    
    # Method 2: Calculate average from individual reviewer ratings
    ratings = []
    for review in review_result.get('reviews', []):
        if 'rating' in review:
            try:
                ratings.append(float(review['rating']))
            except (ValueError, TypeError):
                pass
    
    return np.mean(ratings) if ratings else 0.0

# Find the best paper based on average rating
paper_ratings = []
for i, (paper, review) in enumerate(zip(papers, standard_review_results)):
    review = review or {}  # Treat a failed review (None) as an empty result
    avg_rating = extract_average_rating(review)
    decision = review.get('decision', 'No decision')
    reviewer_count = len(review.get('reviews', []))
    
    paper_ratings.append({
        'index': i,
        'title': paper['title'],
        'avg_rating': avg_rating,
        'decision': decision,
        'reviewer_count': reviewer_count
    })
    
    print(f"Paper {i+1}: {paper['title']}")
    print(f"  Average Rating: {avg_rating:.2f}/10")
    print(f"  Decision: {decision}")
    print(f"  Number of Reviewers: {reviewer_count}")
    print()

# Identify the best paper
if paper_ratings:
    best_paper = max(paper_ratings, key=lambda x: x['avg_rating'])
    print("=" * 50)
    print(f"BEST PAPER: {best_paper['title']}")
    print(f"Rating: {best_paper['avg_rating']:.2f}/10")
    print(f"Decision: {best_paper['decision']}")
    print("=" * 50)
else:
    print("No valid paper ratings found.")

Paper 1: Scientific Peer Reviewer Agents}


  Average Rating: 5.67/10
  Decision: Reject
  Number of Reviewers: 3

Paper 2: Multi-Agent Review: Simulating Human Reviewers for Scientific Peer Review with Large Language Models}


  Average Rating: 4.33/10
  Decision: Reject
  Number of Reviewers: 3

Paper 3: Referees in AI for Scientific Peer Review}


  Average Rating: 5.67/10
  Decision: Reject
  Number of Reviewers: 3

Paper 4: Evaluating LLM-based AI Reviewer Agent for Scientific Peer Review}


  Average Rating: 4.67/10
  Decision: Reject
  Number of Reviewers: 3

Paper 5: AI-Powered Peer Review Can Help Scientific Progress}


  Average Rating: 5.00/10
  Decision: Reject
  Number of Reviewers: 3

Paper 6: Optimizing AI Agents for Simulated Peer Review}


  Average Rating: 6.00/10
  Decision: Reject
  Number of Reviewers: 3

Paper 7: Multi-Paper Benchmarking Review: Enhancing Scientific Peer Review with Multi-Agent Competition}


  Average Rating: 4.67/10
  Decision: Reject
  Number o
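
Taking only the maximum hides ties and the overall ordering (Papers 1 and 3 share the same score above). The sketch below ranks all papers instead. It uses only the seven ratings visible in the output above, which is truncated, so the remaining papers are not included and the result is illustrative rather than the final ranking.

```python
# (label, average rating) pairs copied from the visible output above; the
# output is truncated after Paper 7, so later papers are not included.
visible_ratings = [
    ("Paper 1", 5.67), ("Paper 2", 4.33), ("Paper 3", 5.67), ("Paper 4", 4.67),
    ("Paper 5", 5.00), ("Paper 6", 6.00), ("Paper 7", 4.67),
]

# sorted() is stable, so tied papers keep their original relative order.
ranked = sorted(visible_ratings, key=lambda item: item[1], reverse=True)

# Collect every paper that shares the top score, not just the first one.
top_score = ranked[0][1]
top_papers = [label for label, score in ranked if score == top_score]

for position, (label, score) in enumerate(ranked, start=1):
    print(f"{position}. {label}: {score:.2f}/10")
print(f"Top-rated among visible papers: {top_papers}")
```

The same pattern applies to the `paper_ratings` list built in the cell above by sorting on `x['avg_rating']`.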

## 5. Conclusion and Further Exploration

In this tutorial, you've learned how to use DeepReviewer to generate automated peer reviews, explore different review modes, and parse the structured output. DeepReviewer offers a powerful and flexible way to leverage the capabilities of large language models for scientific paper evaluation.

**Limitations and Ethical Considerations:**

*   **Not a Replacement for Human Review:** DeepReviewer is a tool to *assist* with peer review, not to replace the critical thinking and expertise of human reviewers.
*   **Potential for Bias:**  Like all LLMs, DeepReviewer can inherit biases from its training data.  Be mindful of this when interpreting the results.
*   **Over-Reliance:**  Avoid over-reliance on automated feedback.  Always critically evaluate the generated reviews.
*   **Transparency:** If using DeepReviewer in your work, disclose its use transparently.

**Further Exploration:**

*   **Experiment with different papers:**  Try DeepReviewer with your own research papers or papers from various fields.
*   **Adjust parameters:**  Explore the effects of `reviewer_num` and `max_tokens`.
*   **Explore the vLLM documentation:**  Learn more about the underlying inference engine for advanced usage.
*   **Contribute to the project:**  If you find issues or have suggestions, consider contributing to the DeepReviewer project (if it's open-source).

Thank you for exploring DeepReviewer! We hope this tutorial has been helpful.