## Notebook Overview

This notebook demonstrates how to evaluate reviews generated by Large Language Models (LLMs) by comparing them with the original human-written reviews of the same paper. The evaluation workflow consists of the following steps:

1. **Summarization:** Both the human-generated reviews and the LLM-generated reviews are summarized.
2. **Matching Summarized Points:** The summarized points from both sets of reviews are compared to identify matches.
3. **Evaluation Metrics:** We calculate several metrics to assess the quality and similarity of the generated reviews relative to the human reviews. These metrics include:
   - Hit Rate
   - Jaccard Index
   - Szymkiewicz–Simpson Overlap Coefficient
   - Sørensen–Dice Coefficient
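
For reference, the four metrics can be written in terms of the set of summarized human review points $A$, the set of summarized LLM review points $B$, and the matched points $A \cap B$ (the hit count). This is a sketch assuming the standard set-overlap definitions; the notebook's `metric` module is not shown, although these formulas are consistent with the values printed at the end of the notebook.

```latex
\begin{align*}
\text{Hit Rate} &= \frac{|A \cap B|}{|A|} \\
\text{Jaccard Index} &= \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \\
\text{Szymkiewicz–Simpson} &= \frac{|A \cap B|}{\min(|A|, |B|)} \\
\text{Sørensen–Dice} &= \frac{2\,|A \cap B|}{|A| + |B|}
\end{align*}
```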

# Environment Setup

In [65]:
import json
import os
import sys
from typing import List, Tuple

import openai

# Add the parent directory to the import path so that `utils` and `prompts` resolve
sys.path.append('../')

from utils import clean_json_output
from prompts import SUMMARY_PROMPT, REVIEW_COMPARISON_RPOMPT

In [8]:
# Paste your OpenAI API key between the quotes (left blank here on purpose)
os.environ["OPENAI_API_KEY"] = ""
client = openai.Client()

# Example: Review Generation and Evaluation

In this example, we use the paper "Cyclic Orthogonal Convolutions for Long-Range Integration of Features" to demonstrate the review generation and evaluation process. You can access the paper [here](https://openreview.net/pdf?id=868DWd46dv2). The human-written reviews were obtained from the OpenReview platform, while the LLM reviews were generated with the GPT-4 Turbo model. This setup lets us compare automatically generated reviews against human expert reviews.

In [60]:
title = "Cyclic Orthogonal Convolutions for Long-Range Integration of Features"

In [53]:
example_human_reviews = [
    """
    The paper proposes cyclic orthogonal convolutions as a means to grow receptive fields fast in CNNs. The authors show a small improvement of their cyclic convolution model over a simple CNN baseline on CIFAR-10, ImageNet and Stylized ImageNet. Overall it's an interesting idea, but not executed very convincingly.

    The biological motivation is weak at best. Long-range horizontal connections in cortex, which the authors use as motivation, are feature specific, i.e. between corresponding orientation domains. In contrast, in the authors' setup, they interact across all features. Moreover, the long-range connections are only along the x and y direction, but not in oblique directions. In my opinion, the author's proposal is not closer to biology than vision transformers, which also provide an all-to-all spatial interaction, albeit with a different mechanism and arguably stronger performance on large-scale datasets.

    The experiments with simple CNNs are nice and show a trend in the right direction, but in order to show that the cyclic convolutions are also of practical use, more extensive and competitive results would be necessary. The authors argue that also in ResNets receptive fields grow sublinearly with depth. If that's the case, why don't they show that incorporating cyclic orthogonal convolutions improves a standard ResNet-50 model on ImageNet?

    I don't find the pathfinder results very convincing. It has been shown before (this workshop, last year) that CNNs can also learn Pathfinder once the training setup is slightly adjusted """, 
    """
    This paper attempts to enable CNNs to learn long range spatial dependencies, typically only possible at great depth, in the early layers. To acheive this the authors propose CycleNet, a network of 'cycles' of orthogonal convolutions. These convolutions are performed across the three coordinate planes and have a subtaintially larger receptive field than a typical convolution without a dramatic increase in the number of parameters. The motivation for this work is comprehensive and the architecture is well described and intuitive. Experimental results show that CycleNet significantly improves performance over a baseline on the pathfinder challenge and also provides a modest improvement / increase in parameter efficiency on CIFAR-10.

    The authors touch on a biological basis for the ideas explored here but I feel that the full potential of this line of reasoning is not realised. For example, the authors show improved generalisation to stylised ImageNet only in an Appendix when it is arguably among the most exciting results of the paper. These could be further augmented with the addition of other biological similarity measures such as the brain-score (https://www.brain-score.org/). Finally, it would be valuable for the authors to delve deeper into the related biology, perhaps identifying specific cell types or psychophysical results that they feel are better represented by the CycleNet model.

    Overall, this is a well presented and clearly motivated work with promising results, a strong accept.
    """,
    """
    This article proposes a convolutional network architecture to address the lack of connectivity between features of spatially distant locations within a layer. The authors propose CycleNet, which consists of the concatenation of convolutional operations on the three pairs dimensions - (x, y), (x, z) and (y, z) - instead of only on (x, y). The paper studies several properties of CycleNet compared to some baselines models: the performance on CIFAR-10, the receptive field size of the learnt features and the performance on the Pathfinder challenge.

    This is a well written paper, which presents a simple and reasonable idea to address a weakness of standard convolutional models - the lack of connectivity between distant pixels or features. While the analysis of the proposed architecture does not outperform standard models on image classification tasks, the performance is close enough and, importantly, the experiments show the advantageous properties of CycleNet on other dimensions beyond classification accuracy, such as the receptive field size of the features and the performance on other tasks such as Pathfinder. I think the choice of experiments is sound and extensive enough for a workshop submission. Therefore, I have a generally positive impression of this paper and I recommend its acceptance to the SVRHM 2021.

    Nonetheless, I have a few comments about potential weakness or aspects that could be improved, as well as some questions. First, I believe that the paper should more transparently present the less positive results of CycleNet from the experimental setup. For example, the authors show the performance on CIFAR-10 compared to a basic CNN baseline in Figure 3 of the main body of the paper, but leave for the supplementary material the results on ImageNet, where CycleNet achieve comparably worse classification accuracy. I argue that this introduces an analytical bias that can be misleading. Second, I think the paper could be improved by more in-depth discussion of the limitations of the proposal and directions for future work. Finally, I would also have appreciated a longer discussion on what the gap is that this new architecture aims to fill if the issue it addresses can be mitigated or solved by architectures such as transformers. I encourage the authors to consider these changes for their camera-ready version, if the paper is accepted.
    """
]

In [55]:
example_gpt4_reviews = [
    """
    [Significance and novelty]
    1. The paper proposes a novel architecture, CycleNet, based on cyclic orthogonal convolutions that allows efficient information flow between features and locations across an entire image with a small number of layers. [2. The architecture differs from standard Convolutional Neural Networks (CNNs) as it allows for long-range integration of features in (x, y), (x, z), and (y, z) coordinates through the use of a cycle of three orthogonal convolutions. 3. CycleNet obtains competitive results on image classification tasks on CIFAR-10 and ImageNet datasets, while transferring better to stylised images and outperforming CNNs on the Pathfinder challenge, where integration of distant features is crucial. 4. The study hypothesizes that long-range integration favours recognition of objects by shape rather than texture, adding to the significance and novelty of the research.

    [Potential reasons for acceptance]
    1. The paper presents a novel architecture, CycleNet, which uses cyclic orthogonal convolutions, a new approach for long-range integration of features in CNNs. [2. The paper's experimental results show that CycleNet obtains competitive performance on image classification tasks on CIFAR-10 and ImageNet datasets. 3. CycleNet transfers better than CNNs to stylised images and outperforms them on the Pathfinder challenge, where integration of distant features is crucial. 4. The study provides insights into the importance of long-range integration of features for efficient recognition, contributing to the advancement of knowledge in the field.

    [Potential reasons for rejection]
    1. The paper lacks a comprehensive comparison with state-of-the-art methods for long-range integration of features in CNNs. [2. Some parts of the paper, particularly the theoretical explanations, can be unclear or lack sufficient detail. 3. The study could benefit from a more detailed experimental setup, such as more diverse datasets or a larger number of experiments, to further validate the proposed architecture's efficacy. 4. The authors fail to provide a clear explanation of why cyclic orthogonal convolutions lead to efficient long-range integration, and some explanations might require further justification or validation.

    [Suggestions for improvement]
    1. Conduct a comprehensive comparison with existing methods for long-range integration of features in CNNs and provide a detailed analysis of the results and performance differences. [2. Clarify the theoretical explanations and provide detailed, comprehensive justifications for the choices made in the design of the proposed architecture. 3. Provide a more detailed experimental setup, including more diverse datasets, a larger number of experiments, and control groups for a quantitative comparison of the proposed architecture's efficacy. 4. Validate the hypothesis by investigating the role of long-range integration of features in other recognition tasks or applications, and extend the research by implementing variations of the proposed architecture.
    """
]

# Evaluation Workflow

## Summarization

In [76]:
def summary_reviews(reviews: List[str], title: str, client: openai.Client) -> Tuple[str, int]:
    """
    Summarizes a list of reviews with the GPT-4 Turbo model and returns the result as a JSON string.

    Args:
        reviews (List[str]): List of review strings to be summarized.
        title (str): Title of the paper the reviews pertain to.
        client (openai.Client): OpenAI client instance used to send the request.

    Returns:
        Tuple[str, int]: The JSON-formatted summary and the number of summarized points
        (top-level entries) it contains.
    """
    # Construct the review messages with proper formatting
    review_messages = "\n\n".join(reviews) + "\n\n"
    prompt = SUMMARY_PROMPT.format(Title=title, Review_Text=review_messages)

    # Use the GPT-4 Turbo model to generate a summary
    completion = client.chat.completions.create(
        model="gpt-4-turbo", 
        messages=[{"role": "system", "content": prompt}]
    )
    
    # Extract and clean the JSON output
    output = clean_json_output(completion.choices[0].message.content)
    length = len(json.loads(output))
    
    return output, length
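
The `clean_json_output` helper is imported from `utils` and its implementation is not shown in this notebook. Chat models often wrap JSON in a Markdown code fence or add a sentence before or after it, which would make `json.loads` fail. The sketch below shows one way such a cleaning step could work; the function name `clean_json_output_sketch` and its behavior are assumptions for illustration, not the actual helper.

```python
import json
import re


def clean_json_output_sketch(raw: str) -> str:
    """Hypothetical stand-in for utils.clean_json_output (the real helper is not shown)."""
    text = raw.strip()
    # Drop a leading code fence (optionally tagged "json") and a trailing fence
    text = re.sub(r"^`{3}(?:json)?\s*", "", text)
    text = re.sub(r"\s*`{3}$", "", text)
    # Keep only the outermost {...} span in case the model added commentary
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]
    return text


FENCE = "`" * 3  # built programmatically to keep this example easy to embed
raw_reply = FENCE + 'json\n{"1": {"summary": "s", "verbatim": "v"}}\n' + FENCE
cleaned = clean_json_output_sketch(raw_reply)
parsed = json.loads(cleaned)
print(len(parsed))  # number of summarized points in the cleaned reply
```

Because `summary_reviews` calls `json.loads` on the cleaned string, any reply that is still not valid JSON after cleaning raises `json.JSONDecodeError`; wrapping the call in a retry is one possible way to handle that.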

In [77]:
human_reviews_summary, human_review_summary_length = summary_reviews(example_human_reviews, title=title, client=client)
print(human_reviews_summary)

{
    "1": {
        "summary": "Weak biological motivation and lack of improved results on standard CNN applications like ResNet.",
        "verbatim": "The biological motivation is weak at best. Long-range horizontal connections in cortex, which the authors use as motivation, are feature specific, i.e. between corresponding orientation domains. In contrast, in the authors' setup, they interact across all features. Moreover, the long-range connections are only along the x and y direction, but not in oblique directions. In my opinion, the author's proposal is not closer to biology than vision transformers, which also provide an all-to-all spatial interaction, albeit with a different mechanism and arguably stronger performance on large-scale datasets. If that's the case, why don't they show that incorporating cyclic orthogonal convolutions improves a standard ResNet-50 model on ImageNet?"
    },
    "2": {
        "summary": "Insufficient convincing performance data and transparency in 

In [80]:
gpt_reviews_summary, gpt_reviews_summary_length = summary_reviews(example_gpt4_reviews, title=title, client=client)
print(gpt_reviews_summary)

{
    "1": {
        "summary": "Lacks a comprehensive comparison with state-of-the-art methods.",
        "verbatim": "The paper lacks a comprehensive comparison with state-of-the-art methods for long-range integration of features in CNNs."
    },
    "2": {
        "summary": "Theoretical explanations are unclear or insufficiently detailed.",
        "verbatim": "Some parts of the paper, particularly the theoretical explanations, can be unclear or lack sufficient detail."
    },
    "3": {
        "summary": "Experimental setup needs more diversity and expansiveness.",
        "verbatim": "The study could benefit from a more detailed experimental setup, such as more diverse datasets or a larger number of experiments, to further validate the proposed architecture's efficacy."
    },
    "4": {
        "summary": "Lacks clear explanation and justification for the efficiency of cyclic orthogonal convolutions.",
        "verbatim": "The authors fail to provide a clear explanation of why 

## Match Summarized Points

In [81]:
def match_reviews(human_reviews: str, gpt_reviews: str, client: openai.Client) -> Tuple[str, int]:
    """
    Compares the summarized points of the human-written and GPT-generated reviews and identifies matching pairs.

    Args:
        human_reviews (str): JSON-formatted summary of human-written reviews.
        gpt_reviews (str): JSON-formatted summary of GPT-generated reviews.
        client (openai.Client): OpenAI client instance for sending requests.

    Returns:
        Tuple[str, int]: The JSON-formatted matches (keyed like "A1-B4", each with a rationale
        and a similarity score) and the number of matched pairs.
    """
    prompt = REVIEW_COMPARISON_RPOMPT.format(Review_A=human_reviews, Review_B=gpt_reviews)
    
    completion = client.chat.completions.create(
        model="gpt-4-turbo", messages=[{"role": "system", "content": prompt}]
    )

    output = clean_json_output(completion.choices[0].message.content)
    length = len(json.loads(output))
    
    return output, length

In [84]:
reviews_match, reviews_match_length = match_reviews(human_reviews_summary, gpt_reviews_summary, client=client)
print(reviews_match)

{
    "A1-B4": {
        "rationale": "Both Review A1 and Review B4 critique the biological basis and theoretical justification behind using cyclic orthogonal convolutions. A1 is critical of how the biological motivation doesn't align with true biological features and B4 points out that the rationale for the efficiency of such convolutions lacks clarity and needs further justification.",
        "similarity": "7"
    },
    "A2-B3": {
        "rationale": "Review A2 and Review B3 both express concerns about the adequacy of the experimental setups. A2 focuses on the insufficient transparency and performance data across different datasets, including the absence of detailed results on commonly recognized benchmarks like ImageNet. B3 suggests the need for more diverse datasets and a broader range of experiments to validate the architecture's efficacy, aligning with the concerns in A2 about not properly showcasing performance across datasets.",
        "similarity": "7"
    }
}


## Count Matches

In [89]:
def count_hits(matched_reviews: str, threshold: int = 7) -> int:
    """
    Counts the matched pairs in a JSON-formatted comparison of reviews whose
    similarity score meets a given threshold.

    Args:
        matched_reviews (str): JSON-formatted string containing comparison data.
        threshold (int): Minimum similarity score for a pair to count as a hit. Default is 7.

    Returns:
        int: Number of matched pairs with similarity >= threshold. Pairs are not de-duplicated.
    """
    comparison = json.loads(matched_reviews)
    hit_count = sum(1 for value in comparison.values() if int(value["similarity"]) >= threshold)
    return hit_count

In [90]:
hit_count = count_hits(reviews_match)
print("The number of high-similarity hits between human and GPT-4 reviews is: {}".format(hit_count))

The number of high-similarity hits between human and GPT-4 reviews is: 2
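
`count_hits` counts matched pairs, so if the matcher ever paired the same human point with two different GPT points, that point would be counted twice and the hit rate could exceed 1. The sketch below is one possible variant that counts each human point at most once. It assumes the keys always follow the `A<i>-B<j>` pattern seen in the output above; the function name `count_unique_human_hits` is hypothetical and not part of the notebook.

```python
import json


def count_unique_human_hits(matched_reviews: str, threshold: int = 7) -> int:
    """Count distinct human-side points (the 'A<i>' part of each key) with a qualifying match."""
    comparison = json.loads(matched_reviews)
    matched_human_points = {
        key.split("-")[0]  # "A1-B4" -> "A1"; assumes the "A<i>-B<j>" key format
        for key, value in comparison.items()
        if int(value["similarity"]) >= threshold
    }
    return len(matched_human_points)


# Illustrative data only: A1 is matched twice, A2 falls below the threshold
example = json.dumps({
    "A1-B4": {"rationale": "...", "similarity": "7"},
    "A1-B2": {"rationale": "...", "similarity": "8"},
    "A2-B3": {"rationale": "...", "similarity": "5"},
})
print(count_unique_human_hits(example))  # 1
```

For the match result in this notebook, both pairs involve different human points (A1 and A2), so this variant and `count_hits` agree.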


## Evaluation Metrics

In [91]:
from metric import (
    calculate_hit_rate,
    calculate_jaccard_index,
    calculate_sorensen_dice_coefficient,
    calculate_szymkiewicz_simpson_coefficient,
)

In [92]:
print("Hit Rate:", calculate_hit_rate(hit_count, human_review_summary_length))
print("Jaccard Index:", calculate_jaccard_index(hit_count, human_review_summary_length, gpt_reviews_summary_length))
print("Sørensen-Dice Coefficient:", calculate_sorensen_dice_coefficient(hit_count, human_review_summary_length, gpt_reviews_summary_length))
print("Szymkiewicz-Simpson Coefficient:", calculate_szymkiewicz_simpson_coefficient(hit_count, human_review_summary_length, gpt_reviews_summary_length))

Hit Rate: 0.6666666666666666
Jaccard Index: 0.4
Sørensen-Dice Coefficient: 0.5714285714285714
Szymkiewicz-Simpson Coefficient: 0.6666666666666666
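
The printed values can be reproduced from three counts alone. The sketch below is a hypothetical re-implementation using the standard overlap definitions, since the `metric` module itself is not shown; the function name `overlap_metrics` is an assumption. The counts of 3 human points and 4 GPT points are inferred from the printed results (the summary outputs above are truncated), not read from the notebook state.

```python
def overlap_metrics(hits: int, n_human: int, n_llm: int) -> dict:
    """Hypothetical re-implementation; the notebook's `metric` module is not shown."""
    if min(n_human, n_llm) <= 0:
        raise ValueError("both summaries must contain at least one point")
    return {
        # Share of human points that the LLM review also raised
        "hit_rate": hits / n_human,
        # Matched points over the union of both sets of points
        "jaccard": hits / (n_human + n_llm - hits),
        # Twice the matched points over the total number of points
        "sorensen_dice": 2 * hits / (n_human + n_llm),
        # Matched points over the size of the smaller set
        "szymkiewicz_simpson": hits / min(n_human, n_llm),
    }


# Counts inferred from the printed results: 2 hits, 3 human points, 4 GPT points
metrics = overlap_metrics(hits=2, n_human=3, n_llm=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example the hit rate and the Szymkiewicz–Simpson coefficient coincide because the human summary is the smaller of the two sets.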
