# Tutorial 2: Understanding and Using CycleReviewer

## Introduction

CycleReviewer (also known as WhizReviewer) is a set of specialized large language models that have undergone extensive supervised training specifically for academic peer review. The model is available in three sizes:

- 8B parameters (based on Llama3.1)
- 70B parameters (based on Llama3.1)
- 123B parameters (based on Mistral-Large-2)

These models are designed to simulate the peer review process by evaluating research papers across multiple dimensions, providing detailed feedback, and generating comprehensive reviews that closely mirror those produced by human academic reviewers.

## Setting Up the Environment

Let's start by importing the necessary libraries and initializing the CycleReviewer model:

In [1]:
import json

from ai_researcher import CycleReviewer

# Initialize CycleReviewer with the 8B parameter model (for faster inference)
# You can choose "8B", "70B", or "123B" depending on available computational resources
reviewer = CycleReviewer(model_size="8B")

INFO 02-26 00:33:07 __init__.py:183] Automatically detected platform cuda.
INFO 02-26 00:33:09 config.py:2364] Downcasting torch.float32 to torch.float16.
INFO 02-26 00:33:16 config.py:526] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 02-26 00:33:16 config.py:1538] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 02-26 00:33:16 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/zhuminjun/llama70/CycleReviewer-8B', speculative_config=None, tokenizer='/zhuminjun/llama70/CycleReviewer-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=50000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, deco

Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]


INFO 02-26 00:36:30 model_runner.py:1116] Loading model weights took 14.9888 GB
INFO 02-26 00:36:31 worker.py:266] Memory profiling takes 0.64 seconds
INFO 02-26 00:36:31 worker.py:266] the current vLLM instance can use total_gpu_memory (79.14GiB) x gpu_memory_utilization (0.95) = 75.18GiB
INFO 02-26 00:36:31 worker.py:266] model weights take 14.99GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 58.91GiB.
INFO 02-26 00:36:31 executor_base.py:108] # CUDA blocks: 30163, # CPU blocks: 2048
INFO 02-26 00:36:31 executor_base.py:113] Maximum concurrency for 50000 tokens per request: 9.65x
INFO 02-26 00:36:33 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_

Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:15<00:00,  2.27it/s]

INFO 02-26 00:36:49 model_runner.py:1563] Graph capturing finished in 15 secs, took 0.26 GiB
INFO 02-26 00:36:49 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 18.18 seconds





## Loading a Paper to Review

Let's load a paper from a JSON file and prepare it for review:

In [2]:
# Load a paper from our JSON file
with open('generated_paper.json', 'r', encoding='utf-8') as f:
    papers = json.load(f)

# Print basic information about the paper
print(f"Paper Title: {papers[0]['title']}")
print(f"Abstract length: {len(papers[0]['abstract'])} characters")
print(f"LaTeX content length: {len(papers[0]['latex'])} characters")

Paper Title: Scientific Peer Reviewer Agents}


Abstract length: 1177 characters
LaTeX content length: 41664 characters


## Reviewing a Single Paper

Now, let's use CycleReviewer to evaluate the paper:

In [4]:
# Generate the review
review_result = reviewer.evaluate([paper['latex'] for paper in papers])

Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 10/10 [02:44<00:00, 16.45s/it, est. speed input: 571.37 toks/s, output: 316.69 toks/s]


## Parsing and Analyzing Review Results

Let's parse the review to extract structured information including ratings, strengths, weaknesses, and recommendations:

In [7]:
# Parse the review
review_data = review_result[0]

for i in range(len(review_result)):
    if review_result[i] != None:
        review_data = review_result[i]
        # Print summary of the review
        print(f"Title: {papers[i]['title']}")
        print(f"Average Rating: {review_data['avg_rating']:.2f}/10")
        print(f"Decision: {review_data['paper_decision']}")
        print(f"Number of Reviewers: {len(review_data['reviews'])}")
        print("\nIndividual Ratings:")
        for i, rating in enumerate(review_data['rating']):
            print(f"  Reviewer {i+1}: {rating}")
        print('---***---')

Title: Scientific Peer Reviewer Agents}


Average Rating: 4.00/10
Decision: Reject
Number of Reviewers: 4

Individual Ratings:
  Reviewer 1: 1.0
  Reviewer 2: 5.0
  Reviewer 3: 5.0
  Reviewer 4: 5.0
---***---
Title: Multi-Agent Review: Simulating Human Reviewers for Scientific Peer Review with Large Language Models}


Average Rating: 3.00/10
Decision: Reject
Number of Reviewers: 4

Individual Ratings:
  Reviewer 1: 3.0
  Reviewer 2: 3.0
  Reviewer 3: 3.0
  Reviewer 4: 3.0
---***---
Title: Referees in AI for Scientific Peer Review}


Average Rating: 4.00/10
Decision: Reject
Number of Reviewers: 4

Individual Ratings:
  Reviewer 1: 1.0
  Reviewer 2: 5.0
  Reviewer 3: 5.0
  Reviewer 4: 5.0
---***---
Title: Evaluating LLM-based AI Reviewer Agent for Scientific Peer Review}


Average Rating: 5.00/10
Decision: Reject
Number of Reviewers: 4

Individual Ratings:
  Reviewer 1: 5.0
  Reviewer 2: 5.0
  Reviewer 3: 5.0
  Reviewer 4: 5.0
---***---
Title: AI-Powered Peer Review Can Help Scientific P

## Let's examine some key feedback from the reviewers:

In [9]:
# Display key strengths mentioned by reviewers
print("KEY STRENGTHS IDENTIFIED:")
print("-" * 50)
for i, strength in enumerate(review_data['strength']):
    print(f"Reviewer {i+1}:")
    # Print the first 200 characters of each strength for brevity
    print(strength[:500] + "..." if len(strength) > 200 else strength)
    print()

# Display key weaknesses mentioned by reviewers
print("\nKEY WEAKNESSES IDENTIFIED:")
print("-" * 50)
for i, weakness in enumerate(review_data['weaknesses']):
    print(f"Reviewer {i+1}:")
    # Print the first 200 characters of each weakness for brevity
    print(weakness[:500] + "..." if len(weakness) > 200 else weakness)
    print()

KEY STRENGTHS IDENTIFIED:
--------------------------------------------------
Reviewer 1:
- The paper proposes a computational model for scientific discovery, emphasizing the critical role of scientific peer review, and introduces the Scientific Peer Reviewer Agent (SPRA) for simulating the peer review process of scientific papers. The Human-LLM Reviewer Agent (HLL-RA) is developed for iterative review optimization.
- The impact of HLL-RA is validated through the creation of a new synthetic dataset (Syn-RS) of peer reviews for scientific papers, which are rated by humans.
- Based on ...

Reviewer 2:
- The paper proposes a computational model for scientific discovery, emphasizing the critical role of scientific peer review, and introduces the Scientific Peer Reviewer Agent (SPRA) for simulating the peer review process of scientific papers. The Human-LLM Reviewer Agent (HLL-RA) is developed for iterative review optimization.
- The impact of HLL-RA is validated through the creation of a ne

### Meta Review

Let's look at the meta review, which synthesizes the individual reviewer feedback:

In [10]:
print("META REVIEW:")
print("=" * 50)
print(review_data['meta_review'])

META REVIEW:
The paper proposes a computational model for scientific discovery, emphasizing the critical role of scientific peer review, and introduces the Scientific Peer Reviewer Agent (SPRA) for simulating the peer review process of scientific papers. The Human-LLM Reviewer Agent (HLL-RA) is developed for iterative review optimization. The impact of HLL-RA is validated through the creation of a new synthetic dataset (Syn-RS) of peer reviews for scientific papers, which are rated by humans. Based on not only the simulated reviews but also the reviews selected by HLL-RA, the LLMs are further fine-tuned to better align with human judgment, validated through achieving new state-of-the-art (SOTA) performance on the RewardBench benchmark.

The reviewers are unanimous in their opinion that the paper is not well written, and that the main contribution of this paper is not clear. The paper does not provide a clear description of the method, experiments, results, limitations, and future work.

## Finding the Best Paper

Now, let's find the highest-rated paper from our reviews:

In [21]:
# Find the best paper
rating_max = 0
best_paper_num = 0
for i in range(len(review_result)):
    if review_result[i] != None:
        ratings = review_data['avg_rating']
        if ratings > rating_max:
            best_paper_num=i
            rating_max = ratings


print("=" * 50)
print("BEST PAPER SELECTED:")
print(f"Title: {papers[best_paper_num]['title']}...")
print(f"Average Rating: {review_result[best_paper_num]['avg_rating']:.2f}/10")
print(f"Decision: {review_result[best_paper_num]['paper_decision']}")
print("=" * 50)

# Print key strengths from the highest-rated paper's review
best_review = review_result[best_paper_num]
print("\nKEY STRENGTHS OF THE BEST PAPER:")
for i, strength in enumerate(best_review.get('strength', [])):
    if strength:
        print(f"\nStrength {i+1}:")
        print(strength[:300] + "..." if len(strength) > 300 else strength)

BEST PAPER SELECTED:
Title: Scientific Peer Reviewer Agents}

...
Average Rating: 4.00/10
Decision: Reject

KEY STRENGTHS OF THE BEST PAPER:

Strength 1:
The paper proposes a method to train a peer-reviewer agent using a combination of human feedback and LLM-generated reviews.



Strength 2:
The paper introduces a novel application of LLMs to the iterative and social processes fundamental to the scientific method, and presents a fine-tuning approach to align LLMs with human judgment. The paper also presents a new synthetic dataset (Syn-RS) of high-quality reviews of scientific papers, w...

Strength 3:
The paper introduces a novel application of LLMs to the iterative and social processes fundamental to the scientific method, and presents a fine-tuning approach to align LLMs with human judgment. The paper also presents a new synthetic dataset (Syn-RS) of high-quality reviews of scientific papers, w...

Strength 4:
The paper introduces a novel application of LLMs to the iterative and soc

## Conclusion

In this tutorial, we've explored how to use CycleReviewer to evaluate academic research papers. We've seen how the model can:

1. Generate detailed reviews that assess papers on multiple dimensions
2. Provide specific feedback on strengths and weaknesses
3. Assign numerical ratings to quantify paper quality
4. Generate meta-reviews that synthesize multiple reviewer perspectives
5. Make accept/reject recommendations

CycleReviewer represents a significant advancement in automating the peer review process, offering researchers, publishers, and educators a powerful tool for evaluating scientific work. While it can provide valuable feedback and insights, it's important to remember that these automated reviews are best used in conjunction with human expert evaluation, especially for final publication decisions.