Evaluate the quality of the extracted content by comparing it against the human reference annotations.

In [1]:
import sys
sys.path.append("..")

import nest_asyncio
nest_asyncio.apply()

import pandas as pd
import jsonlines as jl

from massw.metrics import compute_metrics, flatten_metrics

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.3.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





INFO:tensorflow:Reading checkpoint /Users/jimmy/.cache/huggingface/metrics/bleurt/BLEURT-20-D12/downloads/extracted/2b0bd60025f714bf0eca857470aa967f784a446243ab3666b88cb6794a07c374/BLEURT-20-D12.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint BLEURT-20-D12
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:BLEURT-20-D12
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... max_seq_length:512
INFO:tensorflow:... vocab_file:None
INFO:tensorflow:... do_lower_case:None
INFO:tensorflow:... sp_model:sent_piece
INFO:tensorflow:... dynamic_seq_length:True
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating SentencePiece tokenizer.
INFO:tensorflow:Creating SentencePiece tokenizer.
INFO:tensorflow:Will load model: /Users/jimmy/.cache/huggingface/metrics/bleurt/BLEURT-20-D12/downloads/extracted/2b0bd60025f714bf0eca857470aa967f784a446243ab3666b88cb6794a07c374/BLEURT-20-D12/sent_piece.model.
INFO:tensorflow:BLEURT initialized.
[nltk_data] Downloading package wordnet to /Users/jimmy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jimmy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/jimmy/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
aspects = ["context", "key_idea", "method", "outcome", "future_impact"]
annotation_path = "../data/annotation_0531.jsonl"
prediction_models = ["gpt4", "gpt35", "mixtral"]
prediction_paths = {model: f"../data/{model}_0531.jsonl" for model in prediction_models}

In [3]:
predictions = {}

with jl.open(annotation_path) as f:
    annotations = []
    for line in f:
        annotations.append({
            "id": line["id"],
            "texts": line["displayed_text"],
            "context": line["label_annotations"]["Multi-aspect Summary"]["Context"],
            "key_idea": line["label_annotations"]["Multi-aspect Summary"]["Key idea"],
            "method": line["label_annotations"]["Multi-aspect Summary"]["Method"],
            "outcome": line["label_annotations"]["Multi-aspect Summary"]["Outcome"],
            "future_impact": line["label_annotations"]["Multi-aspect Summary"]["Future Impact"],
        })

for model in prediction_models:
    prediction_path = prediction_paths[model]
    with jl.open(prediction_path) as f:
        predictions[model] = []
        for line in f:
            predictions[model].append({
                "id": line["id"],
                "context": line["Context"],
                "key_idea": line["Key Idea"],
                "method": line["Method"],
                "outcome": line["Outcome"],
                "future_impact": line["Future Impact"],
            })

print("Length of annotation records:", len(annotations))
for model in prediction_models:
    print(f"Length of predictions for {model}:", len(predictions[model]))

Length of annotation records: 240
Length of predictions for gpt4: 120
Length of predictions for gpt35: 120
Length of predictions for mixtral: 120


In [4]:
# Keep only annotations for papers labeled by at least 2 annotators
id_counts = pd.Series([a["id"] for a in annotations]).value_counts()
annotations = [a for a in annotations if id_counts[a["id"]] >= 2]
print("Length of annotation records after filtering:", len(annotations))

Length of annotation records after filtering: 240
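The filter above can be checked on toy data. This is a minimal sketch with hypothetical records (not the real annotation file) showing that indexing the `value_counts()` result by id keeps only ids that occur at least twice.

```python
import pandas as pd

# Hypothetical toy records: paper "a" has two annotations, paper "b" has one.
toy_annotations = [
    {"id": "a", "context": "first annotator"},
    {"id": "a", "context": "second annotator"},
    {"id": "b", "context": "only annotator"},
]

# Count how many annotation records each id has.
toy_counts = pd.Series([a["id"] for a in toy_annotations]).value_counts()

# Keep records whose id appears at least twice.
toy_kept = [a for a in toy_annotations if toy_counts[a["id"]] >= 2]
print([a["id"] for a in toy_kept])
```

In the run above the record count stays at 240 after filtering, so no record was dropped by this step.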


In [5]:
# Find intersection of annotations and predictions
annotation_ids = {a["id"] for a in annotations}
prediction_ids = {model: {p["id"] for p in predictions[model]} for model in prediction_models}
common_ids = annotation_ids
for model in prediction_models:
    common_ids = common_ids.intersection(prediction_ids[model])
print("Number of common records:", len(common_ids))

Number of common records: 120
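The loop over models can also be written as a single call, since `set.intersection` accepts any number of iterables. A small sketch with hypothetical id sets standing in for `annotation_ids` and `prediction_ids`:

```python
# Hypothetical id sets; the names and values are illustrative only.
toy_annotation_ids = {"p1", "p2", "p3", "p4"}
toy_prediction_ids = {
    "model_a": {"p1", "p2", "p3"},
    "model_b": {"p2", "p3", "p4"},
}

# Intersect the annotation ids with every model's prediction ids at once.
toy_common = toy_annotation_ids.intersection(*toy_prediction_ids.values())
print(sorted(toy_common))
```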


In [22]:
import warnings
warnings.filterwarnings("ignore")
# Compute the selected metrics, using each model's output as the prediction
# and the human annotations as the ground-truth references

metrics = {}

refs = {}
for aspect in aspects:
    refs[aspect] = []
    for idx in common_ids:
        refs[aspect].append([a[aspect] for a in annotations if a["id"] == idx])

for model in prediction_models:
    print(f"Computing metrics for {model}")
    metrics[model] = {}
    model_predictions = {}
    for aspect in aspects:
        model_predictions[aspect] = []
        for idx in common_ids:
            found = [p[aspect] for p in predictions[model] if p["id"] == idx]
            assert len(found) == 1
            model_predictions[aspect].append(found[0])

    for aspect in aspects:
        print(f"Computing metrics for {model} on {aspect}")
        metrics[model][aspect] = compute_metrics(
            predictions=model_predictions[aspect],
            references=refs[aspect],
            metric_names=["nahit"]
            # metric_names=["bleurt", "cosine", "bertscore", "rouge", "bleu"]
        )

Computing metrics for gpt4
Computing metrics for gpt4 on context
Computing metrics for gpt4 on key_idea
Computing metrics for gpt4 on method
Computing metrics for gpt4 on outcome
Computing metrics for gpt4 on future_impact
Computing metrics for gpt35
Computing metrics for gpt35 on context
Computing metrics for gpt35 on key_idea
Computing metrics for gpt35 on method
Computing metrics for gpt35 on outcome
Computing metrics for gpt35 on future_impact
Computing metrics for mixtral
Computing metrics for mixtral on context
Computing metrics for mixtral on key_idea
Computing metrics for mixtral on method
Computing metrics for mixtral on outcome
Computing metrics for mixtral on future_impact
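The cell above rescans the full annotation and prediction lists for every id. A possible alternative, sketched here on hypothetical toy records, is to index the records by id once and then look them up. The variable names are illustrative and not part of the notebook.

```python
from collections import defaultdict

# Hypothetical toy data: two annotations per paper, one prediction per paper.
toy_annotations = [
    {"id": "p1", "context": "ref 1a"},
    {"id": "p1", "context": "ref 1b"},
    {"id": "p2", "context": "ref 2a"},
    {"id": "p2", "context": "ref 2b"},
]
toy_predictions = [
    {"id": "p2", "context": "pred 2"},
    {"id": "p1", "context": "pred 1"},
]
toy_common_ids = sorted({"p1", "p2"})  # fixed order keeps preds and refs aligned

# Group all reference texts under their paper id in a single pass.
refs_by_id = defaultdict(list)
for a in toy_annotations:
    refs_by_id[a["id"]].append(a["context"])

# One prediction per id, so a plain dict is enough.
pred_by_id = {p["id"]: p["context"] for p in toy_predictions}

toy_refs = [refs_by_id[i] for i in toy_common_ids]
toy_preds = [pred_by_id[i] for i in toy_common_ids]
print(toy_preds, toy_refs)
```

Iterating over a sorted list of ids rather than the raw set also makes the order of predictions and references reproducible between runs.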


In [23]:
# Flatten the metric results for each model and aspect so they can be tabulated
for model in prediction_models:
    for aspect in aspects:
        metrics[model][aspect] = flatten_metrics(metrics[model][aspect])

In [26]:
# Set float precision
pd.set_option('display.precision', 3)
gpt4_df = pd.DataFrame(metrics["gpt4"]).rename(columns={"future_impact": "projected_impact"})
gpt35_df = pd.DataFrame(metrics["gpt35"]).rename(columns={"future_impact": "projected_impact"})
mixtral_df = pd.DataFrame(metrics["mixtral"]).rename(columns={"future_impact": "projected_impact"})

In [27]:
display(gpt4_df)
display(gpt35_df)
display(mixtral_df)

metric,context,key_idea,method,outcome,projected_impact
N/A-precision,1.0,0.0,0.5,0.267,0.932
N/A-recall,0.583,0.0,0.421,0.364,0.923
N/A-f1,0.737,0.0,0.457,0.308,0.928
N/A in pred,0.117,0.008,0.133,0.125,0.858
N/A in ref,0.2,0.008,0.158,0.092,0.867


metric,context,key_idea,method,outcome,projected_impact
N/A-precision,0.0,0.0,1.0,0.583,1.0
N/A-recall,0.0,0.0,0.105,0.636,0.346
N/A-f1,0.0,0.0,0.19,0.609,0.514
N/A in pred,0.0,0.0,0.017,0.1,0.3
N/A in ref,0.2,0.008,0.158,0.092,0.867


metric,context,key_idea,method,outcome,projected_impact
N/A-precision,1.0,0.0,0.667,0.25,0.951
N/A-recall,0.042,0.0,0.421,0.364,0.75
N/A-f1,0.08,0.0,0.516,0.296,0.839
N/A in pred,0.008,0.008,0.1,0.133,0.683
N/A in ref,0.2,0.008,0.158,0.092,0.867
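The implementation behind the `"nahit"` metric is not shown in this notebook. The sketch below is only one plausible reading of the table rows: it treats "the text is N/A" as a binary label and scores predictions against a single reference per item. The function name `na_hit_scores`, the exact-match test, and the single-reference setup are all assumptions. The F1 values in the tables are at least consistent with a harmonic mean of precision and recall (for example, 1.0 and 0.583 give 0.737).

```python
def na_hit_scores(predictions, references, na_token="N/A"):
    """Hypothetical helper: score how well predicted N/A labels match reference N/A labels."""
    pred_na = [p.strip() == na_token for p in predictions]
    ref_na = [r.strip() == na_token for r in references]

    # Items where both the prediction and the reference are N/A.
    true_pos = sum(p and r for p, r in zip(pred_na, ref_na))
    n_pred = sum(pred_na)
    n_ref = sum(ref_na)

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "N/A-precision": precision,
        "N/A-recall": recall,
        "N/A-f1": f1,
        "N/A in pred": n_pred / len(predictions),
        "N/A in ref": n_ref / len(references),
    }


# Toy inputs (hypothetical): one shared N/A, one spurious N/A, one missed N/A.
toy_preds = ["N/A", "A method.", "N/A", "An outcome."]
toy_refs = ["N/A", "N/A", "Some text.", "An outcome."]
scores = na_hit_scores(toy_preds, toy_refs)
print(scores)
```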


In [16]:
rows = ["Cosine Embedding", "BLEURT", "BERTScore-f1", "BLEU", "ROUGE-1"]
# rows = ["Cosine Embedding", "BLEURT", "BERTScore-f1", "BLEU", "ROUGE-1", "LLM Similarity"]
gpt4_df.loc[rows].to_csv("gpt4_quality.csv", float_format="%.3f")
gpt35_df.loc[rows].to_csv("gpt35_quality.csv", float_format="%.3f")
mixtral_df.loc[rows].to_csv("mixtral_quality.csv", float_format="%.3f")

display(gpt4_df.loc[rows])
display(gpt35_df.loc[rows])
display(mixtral_df.loc[rows])

metric,context,key_idea,method,outcome,projected_impact
Cosine Embedding,0.94,0.944,0.894,0.931,0.916
BLEURT,0.607,0.582,0.51,0.603,0.611
BERTScore-f1,0.934,0.928,0.908,0.933,0.933
BLEU,0.384,0.375,0.197,0.355,0.282
ROUGE-1,0.604,0.572,0.45,0.596,0.563


metric,context,key_idea,method,outcome,projected_impact
Cosine Embedding,0.934,0.936,0.895,0.928,0.876
BLEURT,0.597,0.575,0.51,0.608,0.498
BERTScore-f1,0.934,0.927,0.91,0.934,0.905
BLEU,0.524,0.439,0.197,0.452,0.17
ROUGE-1,0.635,0.582,0.445,0.626,0.371


metric,context,key_idea,method,outcome,projected_impact
Cosine Embedding,0.944,0.949,0.905,0.933,0.917
BLEURT,0.645,0.636,0.554,0.674,0.635
BERTScore-f1,0.946,0.943,0.92,0.948,0.936
BLEU,0.59,0.556,0.295,0.665,0.384
ROUGE-1,0.693,0.662,0.509,0.707,0.599


In [11]:
# This split assumes every paper has exactly two annotations: after sorting
# by id, the even and odd positions each hold one annotation per paper.
annotations.sort(key=lambda x: x["id"])
annotations_1 = annotations[::2]
annotations_2 = annotations[1::2]

# Use one as reference and the other as prediction
cross_val_metrics = {}
for aspect in aspects:
    cross_val_metrics[aspect] = flatten_metrics(
        compute_metrics(
            predictions=[a[aspect] for a in annotations_1],
            references=[a[aspect] for a in annotations_2],
            metric_names=["bleurt", "cosine", "bertscore", "rouge", "bleu"]))
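The even/odd slicing above only pairs annotations correctly when each id occurs exactly twice. A minimal sketch of a more explicit split, on hypothetical toy records, groups by id first and checks that assumption before pairing:

```python
from collections import defaultdict

# Hypothetical toy annotations, deliberately out of order.
toy_annotations = [
    {"id": "b", "context": "b-1"},
    {"id": "a", "context": "a-1"},
    {"id": "a", "context": "a-2"},
    {"id": "b", "context": "b-2"},
]

# Group annotations by paper id, preserving their original order within each id.
by_id = defaultdict(list)
for a in toy_annotations:
    by_id[a["id"]].append(a)

# Fail loudly if any paper does not have exactly two annotations.
assert all(len(group) == 2 for group in by_id.values())

# One annotation per paper in each group, aligned by sorted id.
toy_group_1 = [by_id[i][0] for i in sorted(by_id)]
toy_group_2 = [by_id[i][1] for i in sorted(by_id)]
print([a["context"] for a in toy_group_1], [a["context"] for a in toy_group_2])
```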


In [12]:
human_df = pd.DataFrame(cross_val_metrics).rename(columns={"future_impact": "projected_impact"})
human_df

metric,context,key_idea,method,outcome,projected_impact
Cosine Embedding,0.935,0.944,0.9,0.936,0.941
BLEU,0.594,0.464,0.357,0.608,0.642
Precision-1,0.694,0.563,0.524,0.699,0.757
Precision-2,0.608,0.472,0.373,0.62,0.69
Length Ratio,0.991,1.133,1.031,1.089,0.953
ROUGE-1,0.703,0.637,0.54,0.737,0.748
ROUGE-2,0.633,0.546,0.381,0.661,0.686
BERTScore-precision,0.942,0.934,0.922,0.947,0.959
BERTScore-recall,0.943,0.944,0.926,0.954,0.951
BERTScore-f1,0.942,0.938,0.924,0.95,0.955


In [13]:
human_df.loc[rows].to_csv("human_agreement.csv", float_format="%.3f")

display(human_df.loc[rows])

metric,context,key_idea,method,outcome,projected_impact
Cosine Embedding,0.935,0.944,0.9,0.936,0.941
BLEURT,0.656,0.618,0.559,0.671,0.742
BERTScore-f1,0.942,0.938,0.924,0.95,0.955
BLEU,0.594,0.464,0.357,0.608,0.642
ROUGE-1,0.703,0.637,0.54,0.737,0.748
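To read the model scores against the human agreement baseline, one option is to subtract the human row from each model row. The sketch below does this for BLEURT only, with the values copied from the tables printed in this notebook (models in the order they were displayed). The variable names are illustrative.

```python
import pandas as pd

aspect_columns = ["context", "key_idea", "method", "outcome", "projected_impact"]

# BLEURT rows copied from the tables shown in this notebook.
bleurt = pd.DataFrame(
    [
        [0.656, 0.618, 0.559, 0.671, 0.742],  # human agreement
        [0.607, 0.582, 0.510, 0.603, 0.611],  # gpt4
        [0.597, 0.575, 0.510, 0.608, 0.498],  # gpt35
        [0.645, 0.636, 0.554, 0.674, 0.635],  # mixtral
    ],
    index=["human", "gpt4", "gpt35", "mixtral"],
    columns=aspect_columns,
)

# Subtract the human row from each model row; negative values mean the model
# scores below the agreement between the two human annotators.
gap = (bleurt.drop(index="human") - bleurt.loc["human"]).round(3)
print(gap)
```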


In [21]:
# Print a few examples for manual inspection
ids = list(common_ids)[:5]
for idx in ids:
    print("ID:", idx)
    for model in prediction_models:
        found = [p for p in predictions[model] if p["id"] == idx]
        assert len(found) == 1
        print(f"[{model}]")
        print("Context:", found[0]["context"])
        print("Key Idea:", found[0]["key_idea"])
        print("Method:", found[0]["method"])
        print("Outcome:", found[0]["outcome"])
        print("Future Impact:", found[0]["future_impact"])
    human_annotations = [a for a in annotations if a["id"] == idx]
    for a in human_annotations:
        print("[Human]")
        print("Context:", a["context"])
        print("Key Idea:", a["key_idea"])
        print("Method:", a["method"])
        print("Outcome:", a["outcome"])
        print("Future Impact:", a["future_impact"])

    print("\n")

ID: ffd14676-a525-479f-a74e-2c5d3a85c510
[gpt4]
Context: Interest in parallel systems has been revived, focusing on computation through excitatory and inhibitory interactions in networks of neuron-like units, particularly for early stages of visual processing and the representation of small local fragments.
Key Idea: The paper tackles the challenge of representing shapes in parallel systems and proposes mechanisms for shape perception and visual attention, offering a novel interpretation of the Gestalt principle 'the whole is more than the sum of its parts'.
Method: N/A
Outcome: N/A
Future Impact: N/A
[gpt35]
Context: There has been a recent revival of interest in parallel systems in which computation is performed by excitatory and inhibitory interactions within a network of relatively simple, neuronlike units. This paper considers the difficulties involved in representing shapes in parallel systems.
Key Idea: The authors suggest ways of representing shapes in parallel systems which pr