# Extension: NLI as a metric
This extension is inspired by prior work (https://aclanthology.org/2021.emnlp-main.619.pdf) that implements a QG/QA framework to evaluate summarization methods. In particular, the paper introduces an evaluation metric that employs an **NLI** (***Natural Language Inference***) pipeline with this configuration:

*   **premise**: question + answer on SRC
*   **hypothesis**: question + answer on BT

For example, for the question “Where were the Red Hot Chili Peppers formed?”, the response answer “LA”, and the knowledge answer “Los Angeles”, we run the NLI model with “Where were the Red Hot Chili Peppers formed? Los Angeles” as the premise and “Where were the Red Hot Chili Peppers formed? LA” as the hypothesis. In this extension, the answer on SRC plays the role of the knowledge answer and the answer on BT plays the role of the response answer.
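
The configuration above can be sketched in a few lines of Python. This is only an illustration: `build_nli_pair`, `classify_pair`, and `toy_nli_model` are hypothetical names that do not come from the repository, and the toy model is a stand-in for a real NLI classifier, which would be passed in as the `nli_model` callable.

```python
from typing import Callable, Tuple


def build_nli_pair(question: str, src_answer: str, bt_answer: str) -> Tuple[str, str]:
    """Return (premise, hypothesis) for one question.

    premise    = question + answer obtained on SRC
    hypothesis = question + answer obtained on BT
    """
    premise = f"{question} {src_answer}"
    hypothesis = f"{question} {bt_answer}"
    return premise, hypothesis


def classify_pair(premise: str, hypothesis: str,
                  nli_model: Callable[[str, str], str]) -> str:
    """Run a caller-supplied NLI model and return the label it predicts."""
    return nli_model(premise, hypothesis)


def toy_nli_model(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for a real NLI model (illustration only).

    It predicts "entailment" only when both strings match after lowercasing.
    """
    return "entailment" if premise.lower() == hypothesis.lower() else "neutral"


premise, hypothesis = build_nli_pair(
    "Where were the Red Hot Chili Peppers formed?", "Los Angeles", "LA"
)
print("PREMISE:", premise)
print("HYPOTHESIS:", hypothesis)
print("label:", classify_pair(premise, hypothesis, toy_nli_model))
```

A trained NLI model would be expected to treat “LA” and “Los Angeles” as equivalent, which is exactly the case that string-based metrics miss.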



In [None]:
!git clone https://github.com/soniafoco/askqe_biomqm

Cloning into 'askqe_biomqm'...
remote: Enumerating objects: 243, done.
remote: Counting objects: 100% (243/243), done.
remote: Compressing objects: 100% (160/160), done.
remote: Total 243 (delta 130), reused 182 (delta 69), pack-reused 0 (from 0)
Receiving objects: 100% (243/243), 4.33 MiB | 2.35 MiB/s, done.
Resolving deltas: 100% (130/130), done.


In [None]:
%cd /content/askqe_biomqm/extension_nli_metric

/content/askqe_biomqm


## Question generation
If you have already run the baseline, you can reuse the questions generated there.

You will find them at `askqe_biomqm/baseline/QG/vanilla_qwen-3b.jsonl`.

In [None]:
!python QG/code/qwen-3b.py \
  --output_path vanilla_qwen-3b.jsonl \
  --prompt vanilla

## Backtranslation

In [None]:
!python backtranslation.py \
  --input_path QG/vanilla_qwen-3b.jsonl \
  --output_path QG/vanilla_bt_qwen-3b.jsonl

## Question answering (prompt forcing "no answer")

In [None]:
!python QA/code/qwen-3b-noanswer.py \
  --input_path QG/vanilla_qwen-3b.jsonl \
  --output_path QA/vanilla_src_na_qwen-3b.jsonl \
  --sentence_type src

In [None]:
!python QA/code/qwen-3b-noanswer.py \
  --input_path QG/vanilla_qwen-3b.jsonl \
  --output_path QA/vanilla_bt_na_qwen-3b.jsonl \
  --sentence_type bt_tgt

## Basic evaluation

In [None]:
!python evaluation/sbert/sbert-noanswer.py

In [None]:
!python evaluation/string-comparison/string-comparison-noanswer.py
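
As an illustration of what the string-comparison step measures, here is a minimal sketch of token-level F1 and exact match between the SRC answer and the BT answer. It assumes SQuAD-style normalization (lowercasing, dropping punctuation and English articles); the function names are hypothetical and the repository's script may normalize differently.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(src_answer: str, bt_answer: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize_answer(src_answer) == normalize_answer(bt_answer)


def token_f1(src_answer: str, bt_answer: str) -> float:
    """Harmonic mean of token precision and recall between the two answers."""
    src_tokens = normalize_answer(src_answer).split()
    bt_tokens = normalize_answer(bt_answer).split()
    if not src_tokens or not bt_tokens:
        # Both empty counts as a match, one empty as a complete miss
        return float(src_tokens == bt_tokens)
    overlap = sum((Counter(src_tokens) & Counter(bt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(bt_tokens)
    recall = overlap / len(src_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Increased liking of school.", "Increased school enjoyment."))
print(exact_match("Increased liking of school.", "Increased school enjoyment."))
```

For this pair the two answers share two of their tokens, so F1 is 4/7 (about 0.571) and exact match is false, even though the answers mean the same thing.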

## NLI evaluation

In [None]:
!python evaluation/nli/nli_metric.py \
  --input evaluation/string-comparison/biomqm_f1_na.jsonl \
  --output evaluation/nli/biomqm_nli.jsonl

Streaming output truncated to the last 5000 lines.
PREMISE: What aspect did boys benefit more from? Increased liking of school.
HYPOTHESIS: What aspect did boys benefit more from? Increased school enjoyment.
label: entailment
{'f1': 0.5714285714285715, 'em': False, 'chrf': 48.7011980935128, 'bleu': 24.840753130578644}
PREMISE: What term describes the increased liking of school? Increased liking of school.
HYPOTHESIS: What term describes the increased liking of school? school enjoyment.
label: entailment
{'f1': 0.3333333333333333, 'em': False, 'chrf': 21.743228227401705, 'bleu': 18.393972058572114}
{'f1': 1.0, 'em': True, 'chrf': 100.0, 'bleu': 100.00000000000004}
PREMISE: What does the term 'average change' refer to in this context? the change in youth mental health.
HYPOTHESIS: What does the term 'average change' refer to in this context? the change in youth mental health over time.
label: neutral
{'f1': 0.8333333333333333, 'em': False, 'chrf': 93.33448629070456, 'bleu': 
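
The per-question labels have to be aggregated into a single score per sentence pair. The sketch below shows one possible aggregation and is an assumption, not the repository's confirmed logic: an exact match counts as 1 without an NLI call (the output above appears to print no PREMISE/HYPOTHESIS pair for the exact-match record), `entailment` counts as 1, and any other label counts as 0. The names `question_score` and `average_nli` are hypothetical.

```python
from typing import Dict, List, Optional

ENTAILMENT = "entailment"


def question_score(em: bool, label: Optional[str]) -> float:
    """Score one question (assumed scheme, see the lead-in)."""
    if em:
        # Identical answers: treated as entailed without calling the NLI model
        return 1.0
    return 1.0 if label == ENTAILMENT else 0.0


def average_nli(records: List[Dict]) -> Optional[float]:
    """Average the per-question scores; None when there are no questions."""
    if not records:
        return None
    scores = [question_score(r["em"], r.get("label")) for r in records]
    return sum(scores) / len(scores)


# Toy records mirroring the four questions in the output above
records = [
    {"em": False, "label": "entailment"},
    {"em": False, "label": "entailment"},
    {"em": True, "label": None},
    {"em": False, "label": "neutral"},
]
print(average_nli(records))
```

Under this assumed scheme the four questions above average to 0.75; a scheme that gives partial credit to `neutral` would yield a higher value.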

In [None]:
!apt-get update
!apt-get install build-essential python3-dev libxml2-dev libxslt1-dev

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://cli.github.com/packages stable InRelease [3,917 B]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [85.0 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [2,361 kB]
Hit:8 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,894 kB]
Get:10 http://ar

In [None]:
!pip install spacy==2.1.9

Collecting spacy==2.1.9
  Using cached spacy-2.1.9.tar.gz (30.7 MB)
  error: subprocess-exited-with-error
  
  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Installing build dependencies ... error
[1;31merror[0m: [1msubprocess-exited-with-error[0m

[31m×[0m [32mpip subprocess to install build dependencies[0m did not run successfully.
[31m│[0m exit code: [1;36m1[0m
[31m╰─>[0m See above for output.

[1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.


In [None]:
!pip install allennlp allennlp-models

Collecting allennlp
  Using cached allennlp-2.10.1-py3-none-any.whl.metadata (21 kB)
Collecting allennlp-models
  Using cached allennlp_models-2.10.1-py3-none-any.whl.metadata (23 kB)
INFO: pip is looking at multiple versions of allennlp to determine which version is compatible with other requirements. This could take a while.
Collecting allennlp
  Using cached allennlp-2.10.0-py3-none-any.whl.metadata (20 kB)
  Using cached allennlp-2.9.3-py3-none-any.whl.metadata (19 kB)
  Using cached allennlp-2.9.2-py3-none-any.whl.metadata (19 kB)
  Using cached allennlp-2.9.1-py3-none-any.whl.metadata (19 kB)
  Using cached allennlp-2.9.0-py3-none-any.whl.metadata (18 kB)
  Using cached allennlp-2.8.0-py3-none-any.whl.metadata (17 kB)
  Using cached allennlp-2.7.0-py3-none-any.whl.metadata (17 kB)
INFO: pip is still looking at multiple versions of allennlp to determine which version is compatible with other requirements. This could take a while.
  Using cached allennlp-2.6.0-py3-none-any.whl.meta

In [None]:
!python evaluation/nli/nli_metric_allennlp.py \
  --input evaluation/string-comparison/biomqm_f1_na.jsonl \
  --output evaluation/nli/biomqm_nli_allennlp.jsonl

Traceback (most recent call last):
  File "/content/askqe_biomqm/extension_nli_metric/nli_metric_allennlp.py", line 1, in <module>
    from allennlp.predictors.predictor import Predictor
ModuleNotFoundError: No module named 'allennlp'


In [None]:
import json

# Severity levels, ranked from least (0) to most (4) severe
severity_levels = {"critical": 4, "major": 3, "minor": 2, "neutral": 1, "no error": 0}

def get_highest_severity(xcomet_annotations):
    """Return the highest severity among the error spans in the given annotations."""
    max_severity = "no error"

    # No annotations (missing, None, or empty list) means no error
    if not xcomet_annotations:
        return max_severity

    for annotation in xcomet_annotations:
        error_severity = annotation.get("severity", "no error")
        # Compare case-insensitively; the original casing is kept in the result
        if severity_levels[error_severity.lower()] > severity_levels[max_severity.lower()]:
            max_severity = error_severity

    return max_severity


def process_jsonl(input_file, output_file):
    """Read a JSONL file and add a "severity" field with the highest severity per record."""
    with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
        for line in infile:
            data = json.loads(line)
            # Default to an empty list when the record has no "errors_tgt" field
            data["severity"] = get_highest_severity(data.get("errors_tgt", []))
            outfile.write(json.dumps(data, ensure_ascii=False) + "\n")

input_jsonl = "evaluation/nli/biomqm_nli.jsonl"  # Output of the NLI evaluation step
output_jsonl = "evaluation/nli/biomqm_nli_high.jsonl"  # Same records plus the "severity" field
process_jsonl(input_jsonl, output_jsonl)


In [None]:
import json
import pandas as pd

results = []

file_nli = "evaluation/nli/biomqm_nli_high.jsonl"  # Output of the previous cell

with open(file_nli, "r", encoding="utf-8") as f_nli:
    for line in f_nli:
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            print(f"Skipping invalid JSON line: {line.strip()}")
            continue

        # Read both fields from the current record so that values
        # never carry over from a previous line
        list_scores = data.get("scores")
        avg_nli = data.get("avg_nli")

        # Truthiness check: records with no scores, or with a missing
        # or zero avg_nli, are skipped
        if list_scores and avg_nli:
            for scores in list_scores:
                results.append({
                    "Language": f"{data['lang_src']}-{data['lang_tgt']}",
                    "Severity": data["severity"],
                    "F1": scores["f1"],
                    "EM": scores["em"],
                    "CHRF": scores["chrf"],
                    "BLEU": scores["bleu"],
                    "NLI": avg_nli
                })


# Build a DataFrame with one row per question
df = pd.DataFrame(results)

print(results)

# Compute the mean of each metric for every language/severity combination
summary = df.groupby(["Language", "Severity"]).mean()

# Show the final result
display(summary)

[{'Language': 'en-de', 'Severity': 'Minor', 'F1': 1.0, 'EM': True, 'CHRF': 100.0, 'BLEU': 100.00000000000004, 'NLI': 1.0}, {'Language': 'en-de', 'Severity': 'Minor', 'F1': 0.5, 'EM': False, 'CHRF': 70.23643949930458, 'BLEU': 49.99999999999999, 'NLI': 1.0}, {'Language': 'en-de', 'Severity': 'Minor', 'F1': 0.8333333333333334, 'EM': False, 'CHRF': 73.9517143193614, 'BLEU': 53.7284965911771, 'NLI': 1.0}, {'Language': 'en-de', 'Severity': 'Minor', 'F1': 0.875, 'EM': False, 'CHRF': 98.05611302199749, 'BLEU': 84.08964152537145, 'NLI': 1.0}, {'Language': 'en-de', 'Severity': 'Minor', 'F1': 0, 'EM': False, 'CHRF': 87.72426647426647, 'BLEU': 0.0, 'NLI': 1.0}, {'Language': 'en-de', 'Severity': 'Major', 'F1': 0.8000000000000002, 'EM': False, 'CHRF': 79.67290664116794, 'BLEU': 66.87403049764218, 'NLI': 1.0}, {'Language': 'en-de', 'Severity': 'Major', 'F1': 0.5, 'EM': False, 'CHRF': 47.36402486402487, 'BLEU': 49.99999999999999, 'NLI': 1.0}, {'Language': 'en-de', 'Severity': 'Minor', 'F1': 0.66666666

| Language | Severity | F1 | EM | CHRF | BLEU | NLI |
|---|---|---|---|---|---|---|
| en-de | Critical | 0.652755 | 0.4375 | 73.826526 | 60.112141 | 1.0 |
| en-de | Major | 0.692103 | 0.449251 | 77.169951 | 60.841105 | 0.883671 |
| en-de | Minor | 0.709226 | 0.50075 | 76.805058 | 63.04407 | 0.869787 |
| en-de | Neutral | 0.710317 | 0.439024 | 80.831886 | 61.9187 | 0.891986 |
| en-de | no error | 0.715373 | 0.52461 | 77.334985 | 65.134739 | 0.887821 |
| en-es | Critical | 0.671872 | 0.372093 | 75.136116 | 55.807089 | 0.83885 |
| en-es | Major | 0.701596 | 0.390977 | 75.547778 | 56.873539 | 0.883848 |
| en-es | Minor | 0.734935 | 0.452796 | 78.295853 | 62.575432 | 0.884432 |
| en-es | Neutral | 0.76982 | 0.541176 | 79.23737 | 66.330548 | 0.943682 |
| en-es | no error | 0.730074 | 0.495726 | 80.131709 | 65.05669 | 0.933316 |
