# DeepReviewer Tutorial: Automated Peer Review with Large Language Models

Welcome to the DeepReviewer tutorial! This notebook will guide you through using DeepReviewer, a powerful tool for generating automated peer reviews of academic papers. DeepReviewer leverages large language models (LLMs) to provide structured, comprehensive feedback, simulating the insights of multiple human reviewers.

**Key Features:**

*   **Structured Reviews:** Generates reviews with sections like Summary, Soundness, Presentation, Contribution, Strengths, Weaknesses, Suggestions, Questions, Rating, and Confidence.
*   **Multiple Review Modes:** Offers "Fast", "Standard", and "Best" modes to balance speed and detail.
*   **Simulated Reviewers:** "Standard" and "Best" modes simulate multiple reviewers for diverse perspectives.
*   **Meta-Review Generation:** Combines individual reviewer feedback into a concise meta-review.
*   **Structured Output:** Provides results in an easily parsable format.
*   **Customizable Model:** Allows you to use different pre-trained DeepReviewer models or your own fine-tuned models.

**This tutorial will cover:**

1.  Setting up the environment and installing dependencies.
2.  Loading a paper for review.
3.  Generating reviews in different modes (Fast, Standard).
4.  Parsing the review results to extract structured feedback.
5.  (Optional) Advanced usage, including custom models and batch processing.

## 1. Setup and Installation

First, install the required libraries. DeepReviewer relies on `transformers` and `vllm` for efficient model loading and inference. Run the following commands in your terminal (in a notebook cell, prefix each command with `!`):

```bash
pip install transformers
pip install vllm
```

Now, let's import the `DeepReviewer` class and initialize the model.

In [1]:
from ai_researcher.deep_reviewer import DeepReviewer

# Initialize DeepReviewer. This run loads a local checkpoint through custom_model_name,
# which overrides model_size. To load a released model instead, pass model_size
# (for example "7B" for faster inference, or "14B" for a larger model that requires more resources).
# You can set device="cpu", but inference will be much slower; a GPU is highly recommended.

reviewer = DeepReviewer(custom_model_name="/zhuminjun/model/23250", device="cuda", tensor_parallel_size=1, gpu_memory_utilization=0.95)

# Other parameters you can customize:
# - custom_model_name:  Path to a custom DeepReviewer model (overrides model_size).
# - tensor_parallel_size:  Number of GPUs to use for parallel processing (for larger models).
# - gpu_memory_utilization:  Fraction of GPU memory to use.

INFO 03-10 21:35:04 config.py:135] Replacing legacy 'type' key with 'rope_type'
INFO 03-10 21:35:11 config.py:526] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
INFO 03-10 21:35:11 config.py:1538] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-10 21:35:11 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/zhuminjun/model/23250', speculative_config=None, tokenizer='/zhuminjun/model/23250', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=70000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityCo

Loading safetensors checkpoint shards:   0% Completed | 0/13 [00:00<?, ?it/s]


INFO 03-10 21:35:21 model_runner.py:1116] Loading model weights took 27.4462 GB
INFO 03-10 21:35:22 worker.py:266] Memory profiling takes 0.82 seconds
INFO 03-10 21:35:22 worker.py:266] the current vLLM instance can use total_gpu_memory (79.14GiB) x gpu_memory_utilization (0.95) = 75.18GiB
INFO 03-10 21:35:22 worker.py:266] model weights take 27.45GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 0.94GiB; the rest of the memory reserved for KV Cache is 46.71GiB.
INFO 03-10 21:35:22 executor_base.py:108] # CUDA blocks: 15304, # CPU blocks: 1310
INFO 03-10 21:35:22 executor_base.py:113] Maximum concurrency for 70000 tokens per request: 3.50x
INFO 03-10 21:35:24 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:16<00:00,  2.13it/s]

INFO 03-10 21:35:40 model_runner.py:1563] Graph capturing finished in 16 secs, took 0.48 GiB
INFO 03-10 21:35:40 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 19.61 seconds





## 2. Loading a Paper for Review

We'll load a paper from a JSON file. The file should contain a list of papers, where each paper is a dictionary with at least the keys `title` and `latex`. We've provided a sample file named `generated_paper.json` for this tutorial.

In [2]:
import json

# Load the paper(s) from the JSON file
with open('generated_paper.json', 'r', encoding='utf-8') as f:
    papers = json.load(f)

# Print some basic information about the loaded papers
for i, paper in enumerate(papers):
    print(f"Paper {i+1}: Title: {paper['title']}")
    print(f"Paper {i+1}: LaTeX content length: {len(paper['latex'])} characters")

Paper 1: Title: A Novel Method for Improving LLM Performance
Paper 1: LaTeX content length: 1196 characters
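
The format described above (a list of dictionaries, each with `title` and `latex`) can be checked before anything is sent to the model. The following is a minimal sketch; `validate_papers` and the example data are hypothetical helpers written for this tutorial, not part of DeepReviewer.

```python
REQUIRED_KEYS = ("title", "latex")  # keys the tutorial says each paper needs


def validate_papers(papers):
    """Return a list of problems found; an empty list means the data looks usable."""
    if not isinstance(papers, list):
        return ["top-level JSON value must be a list of papers"]
    problems = []
    for index, paper in enumerate(papers):
        if not isinstance(paper, dict):
            problems.append(f"paper {index}: expected a dictionary")
            continue
        for key in REQUIRED_KEYS:
            value = paper.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"paper {index}: missing or empty '{key}'")
    return problems


# Hypothetical in-memory data with the same shape as generated_paper.json.
example_papers = [
    {"title": "A Toy Paper", "latex": r"\section{Introduction} Hello."},
    {"title": "No Body"},  # deliberately missing the 'latex' key
]
for problem in validate_papers(example_papers):
    print(problem)
```

Running the same check on the `papers` list loaded from your own file catches missing keys early, before a long generation run starts.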


## 3. Generating Reviews

Now, let's use DeepReviewer to generate reviews. We'll demonstrate both "Fast Mode" and "Standard Mode."

### 3.1 Fast Mode

In [3]:
# Generate reviews in Fast Mode
fast_review_results = reviewer.evaluate([paper['latex'] for paper in papers], mode="Fast Mode")

# Process and print the results for each paper
for i, review_result in enumerate(fast_review_results):
    print(f"\n--- Fast Mode Review for Paper: {papers[i]['title']} ---")
    if review_result:
        print(f"Raw text: {review_result.get('raw_text', 'N/A')}")  # Show the raw output
    else:
        print("Review generation failed.")

Processed prompts: 100%|██████████| 1/1 [00:39<00:00, 39.47s/it, est. speed input: 6.84 toks/s, output: 44.47 toks/s]

Of course. I will use Fast Mode for quick thinking. As a professional reviewer, I will directly output a detailed evaluation of this paper. Let me think - Fast Mode means I will directly output a Summary, followed by scores for Soundness, Presentation and Contribution, then provide analysis of Strengths, Weaknesses, Suggestions, and Questions. Finally, I will output the Rating, Confidence and Decision:

\boxed_review{
## Summary:

This paper introduces a two-stage training method aimed at enhancing the performance of large language models (LLMs). The proposed approach begins with supervised fine-tuning (SFT) on a domain-specific dataset, followed by reinforcement learning from human feedback (RLHF). The authors claim that this combination leverages the strengths of both techniques, resulting in significant performance improvements on several benchmark datasets. However, the paper's high-level description and lack of specific details make it challenging to fully assess the novelty and r
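
The raw output above wraps the review in `\boxed_review{ ... }`. If you want to inspect only that part of `raw_text`, a brace-counting helper is one option. This is a sketch under assumptions: `extract_boxed` is a hypothetical helper, not a DeepReviewer function, and it does not handle escaped braces such as `\{`.

```python
def extract_boxed(raw_text, tag="boxed_review"):
    """Return the text inside \\<tag>{...}, or None if the tag is absent.

    Braces are counted so nested LaTeX such as \\textbf{...} stays intact.
    If the output was cut off before the closing brace, everything after
    the opening brace is returned.
    """
    marker = "\\" + tag + "{"
    start = raw_text.find(marker)
    if start == -1:
        return None
    body_start = start + len(marker)
    depth = 1  # we are already inside the opening brace of the marker
    for position in range(body_start, len(raw_text)):
        char = raw_text[position]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return raw_text[body_start:position].strip()
    return raw_text[body_start:].strip()


# Hypothetical raw text shaped like the Fast Mode output shown above.
sample = (
    "Preamble text\n\\boxed_review{\n## Summary:\n\n"
    "Uses \\textbf{SFT} then RLHF.\n}\ntrailing"
)
print(extract_boxed(sample))
```

The same helper works for other wrappers by changing `tag`, for example `tag="boxed_simreviewers"` for Standard Mode output.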




### 3.2 Standard Mode

In [4]:
# Generate reviews in Standard Mode with 3 simulated reviewers
standard_review_results = reviewer.evaluate([paper['latex'] for paper in papers], mode="Standard Mode", reviewer_num=3)

# Process and print the results for each paper
for i, review_result in enumerate(standard_review_results):
    print(f"\n--- Standard Mode Review for Paper: {papers[i]['title']} ---")
    if review_result:
        print(f"Raw text: {review_result.get('raw_text', 'N/A')}")  # Show the raw output
    else:
        print('Review generation failed')

Processed prompts: 100%|██████████| 1/1 [02:14<00:00, 134.71s/it, est. speed input: 2.44 toks/s, output: 43.90 toks/s]

I will use Standard Mode for comprehensive thinking. As a professional reviewer, I will simulate 3 different reviewers, followed by the verification thinking. Then I will output the Finally Review Output. Let me think - Standard Mode means I will output the original review, followed by the verification thinking. Considering that I am currently in standard mode, I should think from my existing knowledge and consider some related work content when writing about weaknesses. Then I will output the Finally Review Output and Meta Review Output:

\boxed_simreviewers{
## Reviewer 1

### Summary

This paper introduces a new technique for enhancing the performance of large language models (LLMs). The proposed method combines supervised fine-tuning with reinforcement learning from human feedback. The authors present a novel training algorithm and provide theoretical analysis of its convergence properties. The results demonstrate significant improvements on several benchmark datasets.

### Soundne
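
In the Standard Mode output above, each simulated reviewer starts with a `## Reviewer N` heading. If you want to read one reviewer at a time from the raw text, splitting on that heading is a simple approach. This is a sketch that assumes the heading format shown in the output; `split_reviewers` is a hypothetical helper, not part of DeepReviewer.

```python
import re

# Matches headings such as "## Reviewer 1" on a line of their own.
REVIEWER_HEADING = re.compile(r"^## Reviewer (\d+)[ \t]*$", re.MULTILINE)


def split_reviewers(text):
    """Map reviewer number -> the text of that reviewer's section."""
    matches = list(REVIEWER_HEADING.finditer(text))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        # A section runs until the next reviewer heading or the end of the text.
        end = following.start() if following else len(text)
        sections[int(current.group(1))] = text[current.end():end].strip()
    return sections


# Hypothetical text shaped like the Standard Mode output.
sample_text = (
    "## Reviewer 1\n\n### Summary\n\nFirst.\n\n"
    "## Reviewer 2\n\n### Summary\n\nSecond.\n"
)
for number, section in split_reviewers(sample_text).items():
    print(number, repr(section))
```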




## 4. Parsing Review Results

Let's parse and examine the structured information from the Standard Mode reviews. We'll print the meta-review, each simulated reviewer's feedback, and the overall decision.

In [5]:
# Process and print the parsed results for each paper
for i, review_result in enumerate(standard_review_results):
    print(f"\n--- Parsed Review for Paper: {papers[i]['title']} ---")
    if review_result:
        # --- Meta-Review (if available) ---
        # .get() avoids a KeyError when a section is missing from the result
        meta_review = review_result.get('meta_review')
        if meta_review:
            print("\n--- Meta-Review ---")
            print(f"  Summary: {meta_review.get('summary', 'N/A')}")
            print(f"  Rating: {meta_review.get('rating', 'N/A')}")
            print(f"  Soundness: {meta_review.get('soundness', 'N/A')}")
            print(f"  Presentation: {meta_review.get('presentation', 'N/A')}")
            # ... (access other meta-review sections as needed)

        # --- Individual Reviewer Feedback (if available) ---
        reviews = review_result.get('reviews')
        if reviews:
            print("\n--- Individual Reviewer Feedback ---")
            # Use j here so the outer paper index i is not overwritten
            for j, reviewer_feedback in enumerate(reviews):
                print(f"  Reviewer {j+1}:")
                print(f"    Summary: {reviewer_feedback.get('summary', 'N/A')}")
                print(f"    Rating: {reviewer_feedback.get('rating', 'N/A')}")
                print(f"    Strengths: {reviewer_feedback.get('strengths', 'N/A')}")
                print(f"    Weaknesses: {reviewer_feedback.get('weaknesses', 'N/A')}")

        # --- Overall Decision (if available) ---
        decision = review_result.get('decision')
        if decision:
            print("\n--- Overall Decision ---")
            print(f"  Decision: {decision}")

    else:
        print("Review generation failed.")


--- Parsed Review for Paper: A Novel Method for Improving LLM Performance ---

--- Meta-Review ---
  Summary: This paper introduces a method for enhancing the performance of large language models (LLMs) by combining supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The authors propose a two-stage training process where the LLM is first fine-tuned on a domain-specific dataset using supervised learning, and then further refined using reinforcement learning to align its outputs with human preferences. The method section describes the standard cross-entropy loss used for SFT and a policy gradient method for RLHF, including the formulation of the reward function. The experimental section presents results from two experiments: one on a sentiment analysis task (Dataset X) and another on a question-answering task (Dataset Y). In both experiments, the proposed method outperforms traditional SFT and RL without human feedback, as measured by accuracy and F1 scor
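
Beyond printing, the same fields can be aggregated, for example into a reviewer count and an average rating. The sketch below assumes the `reviews` and `rating` keys used in the parsing cell, and it does not assume a particular rating format: it takes the first number it finds in each rating field. `parse_rating` and `summarize_ratings` are hypothetical helpers, and the mock data is invented for illustration.

```python
import re
from statistics import mean


def parse_rating(value):
    """Pull the first number out of a rating field, or return None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    match = re.search(r"-?\d+(?:\.\d+)?", str(value or ""))
    return float(match.group()) if match else None


def summarize_ratings(review_result):
    """Count reviewers and average whichever ratings could be parsed."""
    reviews = review_result.get("reviews") or []
    parsed = (parse_rating(review.get("rating")) for review in reviews)
    ratings = [rating for rating in parsed if rating is not None]
    return {
        "reviewer_count": len(reviews),
        "average_rating": mean(ratings) if ratings else None,
    }


# Hypothetical result; real rating values may be formatted differently.
mock_result = {"reviews": [{"rating": "6"}, {"rating": 5}, {"rating": "N/A"}]}
print(summarize_ratings(mock_result))
```

Ratings that cannot be parsed are skipped rather than counted as zero, so one malformed field does not drag the average down.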

## 5. Advanced Usage (Optional)

This section covers some more advanced features of DeepReviewer.

### 5.1 Using a Custom Model

If you have a fine-tuned DeepReviewer model, you can load it using the `custom_model_name` parameter:

In [None]:
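# Example of using a custom model (uncomment and set the path to your checkpoint)
# reviewer_custom = DeepReviewer(custom_model_name='/path/to/your/custom/model')
# evaluate() expects a list of LaTeX strings, as in the earlier cells
# custom_review_results = reviewer_custom.evaluate([paper['latex'] for paper in papers])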

### 5.2 Reviewing Multiple Papers

DeepReviewer supports batch processing, allowing you to review multiple papers efficiently. Simply pass a *list* of paper strings (extracted from the 'latex' key of your paper dictionaries) to the `evaluate()` method.  This is already demonstrated in the examples above, as we loaded a list of papers from the JSON file.
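
For a large collection, you may prefer to submit the list in chunks, for example to save intermediate results between chunks. The sketch below assumes, as the loops in this tutorial do, that `evaluate()` returns one result per input in the same order. `review_in_batches` and `fake_evaluate` are hypothetical; `fake_evaluate` only stands in for `reviewer.evaluate` so the example runs without a GPU.

```python
def review_in_batches(papers, evaluate, batch_size=4, **evaluate_kwargs):
    """Call evaluate() on slices of the paper list and pair results with titles."""
    paired = []
    for start in range(0, len(papers), batch_size):
        batch = papers[start:start + batch_size]
        results = evaluate([paper["latex"] for paper in batch], **evaluate_kwargs)
        for paper, result in zip(batch, results):
            paired.append((paper["title"], result))
    return paired


def fake_evaluate(latex_list, mode="Fast Mode"):
    """Stand-in for reviewer.evaluate: returns one dictionary per input."""
    return [{"raw_text": f"{mode}: {len(text)} chars"} for text in latex_list]


# Hypothetical papers, only to exercise the batching logic.
toy_papers = [{"title": f"Paper {n}", "latex": "x" * n} for n in range(1, 6)]
for title, result in review_in_batches(toy_papers, fake_evaluate, batch_size=2):
    print(title, result["raw_text"])
```

With a real model you would pass `reviewer.evaluate` in place of `fake_evaluate`, together with keyword arguments such as `mode="Standard Mode"` and `reviewer_num=3`.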

### 5.3 Using Multiple GPUs for Inference

If you are using a larger model, such as 14B, or you want to review papers more quickly, DeepReviewer can run on multiple GPUs. Set the `tensor_parallel_size` parameter, and DeepReviewer will distribute the model across that many GPUs.
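
vLLM, which DeepReviewer uses for inference, requires the model's attention-head count to be divisible by the tensor-parallel size. A minimal sketch of choosing a valid value follows. The helper name and the head count of 40 are illustrative assumptions; read the real count from the `num_attention_heads` field of your model's `config.json`.

```python
def choose_tensor_parallel_size(available_gpus, num_attention_heads):
    """Return the largest GPU count that evenly divides the attention heads."""
    if available_gpus < 1 or num_attention_heads < 1:
        raise ValueError("need at least one GPU and one attention head")
    for size in range(min(available_gpus, num_attention_heads), 0, -1):
        if num_attention_heads % size == 0:
            return size
    return 1  # unreachable in practice, since 1 divides every head count


# Hypothetical head count of 40, tried with different numbers of GPUs.
for gpus in (1, 2, 3, 4, 8):
    print(gpus, "GPUs ->", choose_tensor_parallel_size(gpus, 40))
```

With three GPUs and 40 heads the helper falls back to 2, because 3 does not divide 40.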


In [None]:
# Initialize DeepReviewer with the 14B model, split across 2 GPUs.
# You can raise 'tensor_parallel_size' (for example to 4) if more GPUs are available.
reviewer_multigpu = DeepReviewer(model_size="14B", device="cuda", tensor_parallel_size=2)

# Generate a review in Standard Mode with 3 simulated reviewers
multigpu_review_results = reviewer_multigpu.evaluate([paper['latex'] for paper in papers], mode="Standard Mode", reviewer_num=3)

# Print the raw generated text (for demonstration)
print("\n--- Standard Mode Review (Raw Text) ---")
print(multigpu_review_results[0]['raw_text'])

OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like WestlakeNLP/DeepReviewer-14B is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
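
The error above appears when the model cannot be downloaded and no cached copy exists. On a machine without access to the Hub, one workaround is to point `custom_model_name` at a local checkpoint directory, as in Section 1. The sketch below only checks for the `config.json` file named in the error message and builds the matching keyword arguments. `reviewer_kwargs` is a hypothetical helper, and the temporary directory stands in for a real checkpoint.

```python
import tempfile
from pathlib import Path


def reviewer_kwargs(local_dir, fallback_size="14B"):
    """Prefer a local checkpoint if it has a config.json, else use model_size."""
    config_file = Path(local_dir) / "config.json"
    if config_file.is_file():
        return {"custom_model_name": str(local_dir)}
    return {"model_size": fallback_size}


# Demonstration with a temporary directory instead of a real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    before = reviewer_kwargs(tmp)                 # no config.json yet
    (Path(tmp) / "config.json").write_text("{}")
    after = reviewer_kwargs(tmp)                  # config.json now present

print(before)
print(sorted(after))

# Intended use (not run here, requires the model and GPUs):
# reviewer_multigpu = DeepReviewer(**reviewer_kwargs("/path/to/local/model"),
#                                  device="cuda", tensor_parallel_size=2)
```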


## 6. Conclusion and Further Exploration

In this tutorial, you've learned how to use DeepReviewer to generate automated peer reviews, explore different review modes, and parse the structured output. DeepReviewer offers a powerful and flexible way to leverage the capabilities of large language models for scientific paper evaluation.

**Limitations and Ethical Considerations:**

*   **Not a Replacement for Human Review:** DeepReviewer is a tool to *assist* with peer review, not to replace the critical thinking and expertise of human reviewers.
*   **Potential for Bias:**  Like all LLMs, DeepReviewer can inherit biases from its training data.  Be mindful of this when interpreting the results.
*   **Over-Reliance:**  Avoid over-reliance on automated feedback.  Always critically evaluate the generated reviews.
*   **Transparency:** If you use DeepReviewer in your work, disclose its use transparently.

**Further Exploration:**

*   **Experiment with different papers:**  Try DeepReviewer with your own research papers or papers from various fields.
*   **Adjust parameters:**  Explore the effects of `reviewer_num` and `max_tokens`.
*   **Explore the vLLM documentation:**  Learn more about the underlying inference engine for advanced usage.
*   **Contribute to the project:**  If you find issues or have suggestions, consider contributing to the DeepReviewer project (if it's open-source).

Thank you for exploring DeepReviewer! We hope this tutorial has been helpful.