# **Term Extraction Ensemble (BERT + spaCy + Dictionary)**

This notebook implements a complete post-processing pipeline for the ATE-IT Subtask A (Automatic Term Extraction).  
It takes the raw predictions from a fine-tuned **BERT token classification model** and combines them with **spaCy noun-chunk spans** and a **gold-derived domain vocabulary** to produce a higher-quality list of domain terms for each sentence.

### Pipeline Summary
1. **Load BERT and spaCy predictions**  
   - Import model outputs in ATE-IT JSON format.  
   - Map predictions to sentence identifiers for easy lookup.

2. **Normalize and clean BERT terms**  
   - Remove punctuation, unify quotes, lowercase, collapse whitespace.  
   - Filter out spurious or generic one-word candidates.

3. **Build a domain vocabulary from the gold training set**  
   - Normalize gold terms.  
   - Track frequencies to identify strong (repeated) vs. weak (rare) terms.

4. **Merge BERT + spaCy + Dictionary knowledge**  
   - **Upgrade** short BERT terms to longer spaCy spans when they form a valid multi-word expression present in the gold vocabulary.  
   - **Add** additional spaCy multi-word spans only if they appear in the gold vocabulary.  
   - **Filter out** generic, meaningless, or uninformative unigrams.  
   - **Normalize and deduplicate** final terms.

5. **Generate final ensemble predictions**  
   - For each sentence, produce an improved term list combining all signals.  
   - Output saved in ATE-IT JSON format.

### Goal
The notebook aims to improve both recall and precision of automatic term extraction by combining:
- contextual predictions (BERT),
- linguistic structure (spaCy),
- and domain consistency (gold vocabulary).

The expectation is that this hybrid ensemble outperforms each component used alone; the evaluation section reports the scores obtained on the dev set.


#### Imports and file paths
We load:
- the **train** file to extract the gold vocabulary,
- the **dev** file (gold) for evaluation and text,
- BERT and spaCy predictions on the dev set.

In [28]:
import json
import os


def load_json(path: str):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
    

def save_json(obj, path: str):
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)
    print(f"✓ Saved cleaned predictions to {path}")

In [29]:
DATA_DIR = "../data/"
PRED_DIR = "../src/predictions/"

TRAIN_FILE = os.path.join(DATA_DIR, "subtask_a_train.json")
DEV_FILE = os.path.join(DATA_DIR, "subtask_a_dev.json")

BERT_DEV_PRED_FILE = os.path.join(
    PRED_DIR, "subtask_a_dev_bert_preds_2e-5_cased_cleaned.json"
)
SPACY_DEV_PRED_FILE = os.path.join(
    PRED_DIR, "subtask_a_dev_spacy_trained_preds.json"
)

ENSEMBLE_OUT_FILE = os.path.join(
    PRED_DIR, "subtask_a_dev_ensemble_bert_2e-5_cased_spacy_dictfilter.json"
)

os.makedirs(PRED_DIR, exist_ok=True)

In [30]:
print(BERT_DEV_PRED_FILE)
print(SPACY_DEV_PRED_FILE)

../src/predictions/subtask_a_dev_bert_preds_2e-5_cased_cleaned.json
../src/predictions/subtask_a_dev_spacy_trained_preds.json


### 1. Normalization and training vocabulary

We first define a canonical normalization function `norm()`. The cells that follow rely on it to build:
- a **normalized vocabulary** of gold terms from the training set,
- a helper map from prediction JSONs to `(doc, par, sent) → term_list`.


In [31]:
import re
import unicodedata

def norm(t: str) -> str:
    if not t:
        return ""
    t = t.lower()
    t = unicodedata.normalize("NFKC", t)
    t = t.replace("’", "'").replace("`", "'")
    t = t.replace("“", '"').replace("”", '"')
    t = " ".join(t.split())
    # strip punctuation at the edges
    t = t.strip(".,;:-'\"()[]{}")
    return t
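
To make the effect of each normalization step concrete, here is a small sketch. `norm_sketch` is a condensed copy of `norm()` repeated only so the snippet runs on its own, and the sample strings are invented for illustration.

```python
import unicodedata

def norm_sketch(t: str) -> str:
    """Condensed copy of norm(), repeated so this snippet is self-contained."""
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("’", "'").replace("`", "'")      # unify apostrophes
    t = t.replace("“", '"').replace("”", '"')      # unify double quotes
    t = " ".join(t.split())                        # collapse whitespace
    return t.strip(".,;:-'\"()[]{}")               # trim edge punctuation

# Invented samples: curly quotes, brackets, a typographic apostrophe,
# and the "fi" ligature (U+FB01), which NFKC expands to plain "fi".
examples = ["  “Centro  di Raccolta”. ", "(RAEE)", "l’olio esausto;", "\ufb01ltro"]
for raw in examples:
    print(repr(raw), "->", repr(norm_sketch(raw)))
```

Note that NFKC leaves the typographic apostrophe untouched, which is why the explicit `replace` calls are still needed.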

### 2. Build a gold-derived vocabulary

We now construct a **normalized vocabulary** of domain terms from the gold training set.  
This vocabulary is later used to:
- validate candidate terms,
- decide which spaCy spans are trustworthy,
- avoid keeping generic words that never appear as gold terms.


In [32]:

def build_train_vocab(train_data: dict) -> set:
    vocab = set()
    for entry in train_data["data"]:
        for term in entry.get("term_list", []):
            n = norm(term)
            if n:
                vocab.add(n)
    return vocab


def build_term_map(pred_json: dict) -> dict:
    """
    Build a mapping:
        (document_id, paragraph_id, sentence_id) -> list of predicted terms
    from a prediction JSON in the ATE-IT format.
    """
    m = {}
    for e in pred_json["data"]:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        m[key] = e.get("term_list", []) or []
    return m
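
As a rough illustration of the mapping that `build_term_map()` produces, the sketch below builds two toy prediction files and compares their sentence keys. The ids and terms are invented, and `term_map_sketch` is a hypothetical stand-in that only mirrors the keying scheme.

```python
# Toy prediction files in the ATE-IT layout; ids and terms are invented.
bert_json = {"data": [
    {"document_id": "doc_a", "paragraph_id": 1, "sentence_id": 0,
     "term_list": ["raccolta"]},
    {"document_id": "doc_a", "paragraph_id": 1, "sentence_id": 1,
     "term_list": None},  # a null list falls back to []
]}
spacy_json = {"data": [
    {"document_id": "doc_a", "paragraph_id": 1, "sentence_id": 0,
     "term_list": ["centro di raccolta"]},
]}

def term_map_sketch(pred_json):
    """Same keying scheme as build_term_map(): (doc, par, sent) -> terms."""
    return {
        (e["document_id"], e["paragraph_id"], e["sentence_id"]):
            e.get("term_list", []) or []
        for e in pred_json["data"]
    }

bert_toy = term_map_sketch(bert_json)
spacy_toy = term_map_sketch(spacy_json)

# Sentences predicted by one system but absent from the other.
only_in_bert = sorted(set(bert_toy) - set(spacy_toy))
print(bert_toy[("doc_a", 1, 0)], spacy_toy[("doc_a", 1, 0)])
print("missing from spaCy:", only_in_bert)
```

Comparing key sets like this is a cheap way to spot sentences for which one of the two systems produced no entry at all.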

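#### Helpers: sentence-level term map and term frequencies

The `build_term_map()` helper defined above converts a prediction JSON in ATE-IT format into a dictionary:
`(document_id, paragraph_id, sentence_id) -> list of predicted terms`.

This makes it easy to align BERT and spaCy predictions for the same sentence.

The next cell additionally counts how often each normalized gold term occurs in the training set, so that strong terms (three or more occurrences) can be told apart from weak ones (a single occurrence).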


In [33]:
from collections import Counter

def build_train_vocab_with_freq(train_data):
    freq = Counter()
    for e in train_data["data"]:
        for term in e.get("term_list", []):
            n = norm(term)  # a local named `norm` would shadow the function
            if n:
                freq[n] += 1
    strong = {t for t, c in freq.items() if c >= 3}
    weak   = {t for t, c in freq.items() if c == 1}
    return freq, strong, weak
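
The strong/weak split can be checked on a toy example. The term lists below are invented and assumed to be already normalized; the thresholds are the ones used above.

```python
from collections import Counter

# Invented, already-normalized gold terms, one inner list per sentence.
toy_term_lists = [
    ["raccolta differenziata", "rifiuti urbani"],
    ["raccolta differenziata"],
    ["raccolta differenziata", "isola ecologica"],
    ["rifiuti urbani"],
]

freq = Counter(t for terms in toy_term_lists for t in terms)
strong = {t for t, c in freq.items() if c >= 3}   # seen three or more times
weak = {t for t, c in freq.items() if c == 1}     # seen exactly once
middle = set(freq) - strong - weak                # seen exactly twice

print("strong:", strong)
print("weak:  ", weak)
print("middle:", middle)
```

With these thresholds, a term that occurs exactly twice belongs to neither set, which is worth keeping in mind if the two sets are later used as a partition.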


### 3. Generic heads and acronym heuristics

We now define:
- a small list of **generic heads** (e.g. *rifiuti, materiali, servizio*),
- a heuristic to detect **acronyms** (e.g. *RAEE, TARI*; the check is letters-only, so *R1* is not matched),
- a filter for **generic unigrams** that never appear as gold terms.

The goal is to:
- keep important single-word terms if they are in the gold vocabulary,
- discard only very generic heads that never occur as true domain terms.


In [35]:
GENERIC_HEADS = {
    "rifiuti", "materiali", "utenti", "plastica", "carta",
    "residui", "tariffe", "gestore", "servizio", "modalità",
    "conferimento", "costi", "parte", "quota", "impianto"
}


def looks_like_acronym(n: str) -> bool:
    # e.g. "tmb", "raee", "r.a.e.e." (letters only, 2 to 6 characters)
    n_clean = n.replace(".", "")
    return 2 <= len(n_clean) <= 6 and n_clean.isalpha()


def filter_generic_unigrams(terms, train_vocab_norm):
    filtered = []
    for t in terms:
        n = norm(t)
        tokens = n.split()
        if len(tokens) == 1:
            # Discard "quota", "parte", etc. if they never occur as gold terms.
            # Note: looks_like_acronym() does not affect this decision.
            if n in GENERIC_HEADS and n not in train_vocab_norm:
                continue
        filtered.append(t)
    return filtered
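
The whitelist behaviour described above can be sketched on toy data. `GENERIC_HEADS_TOY`, `toy_vocab` and `keep_term` are hypothetical names used only for this illustration, and the candidates are assumed to be already normalized.

```python
GENERIC_HEADS_TOY = {"parte", "quota", "carta"}
toy_vocab = {"carta", "raccolta differenziata"}   # hypothetical gold vocabulary

def keep_term(term: str) -> bool:
    """Drop a unigram only if it is a generic head absent from the vocabulary."""
    if len(term.split()) == 1:
        return not (term in GENERIC_HEADS_TOY and term not in toy_vocab)
    return True  # multiword terms are never touched by this filter

candidates = ["parte", "carta", "quota variabile", "compostaggio"]
kept = [t for t in candidates if keep_term(t)]
print(kept)
```

Here *parte* is dropped, *carta* survives because the toy vocabulary contains it, and the multiword *quota variabile* is kept even though its head is generic.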

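### 4. Multiword upgrade: BERT → spaCy spans

BERT sometimes predicts short fragments (e.g. *raccolta*) where the gold term is a
longer span (e.g. *centro di raccolta*). Matching is done on whole tokens, so a
fragment such as *ferro* would not be upgraded to *materiali ferrosi*.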

We therefore:
1. Look for **spaCy multiword spans** that:
   - are present in the gold vocabulary,
   - contain the BERT term as a contiguous token subsequence.
2. If such a span exists, we **upgrade** the BERT term to the longer spaCy span.
3. We also maintain a small list of **GENERIC_BAD** terms that we never keep.


In [36]:
GENERIC_BAD = {
    "parte", "gestione", "città", "territorio", "comune",
    "ore", "no", "si", "anno", "mese", "giorno"
} 
def contains_as_subspan(longer: str, shorter: str) -> bool:
    long_tokens = longer.split()
    short_tokens = shorter.split()
    L, S = len(long_tokens), len(short_tokens)
    if S > L:
        return False
    for i in range(L - S + 1):
        if long_tokens[i:i+S] == short_tokens:
            return True
    return False


def upgrade_with_longer_spacy(bert_terms, spacy_terms, train_vocab_norm):
    """
    Upgrade BERT terms to longer spaCy spans ONLY WHEN BENEFICIAL.
    """
    final = []
    seen = set()
    
    spacy_norm_map = {norm(t): t for t in spacy_terms or []}

    for b in bert_terms or []:
        b_norm = norm(b)
        if not b_norm or b_norm in GENERIC_BAD:
            continue

        best = None

        # search longest valid spaCy span containing the BERT term
        for s_norm, s in spacy_norm_map.items():
            if len(s_norm.split()) < 2:
                continue
            if s_norm not in train_vocab_norm:
                continue
            if contains_as_subspan(s_norm, b_norm):
                if best is None or len(s_norm.split()) > len(norm(best).split()):
                    best = s


        chosen = best if best else b
        c_norm = norm(chosen)

        if c_norm not in seen and c_norm not in GENERIC_BAD:
            final.append(chosen)
            seen.add(c_norm)

    return final
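
A condensed walk-through of the upgrade rule, under the assumption that all inputs are already normalized. `contains_tokens` mirrors `contains_as_subspan()`, and the vocabulary and term lists are invented.

```python
def contains_tokens(longer: str, shorter: str) -> bool:
    """True if the tokens of `shorter` appear contiguously inside `longer`."""
    lt, st = longer.split(), shorter.split()
    return any(lt[i:i + len(st)] == st for i in range(len(lt) - len(st) + 1))

toy_vocab = {"centro di raccolta", "centro di raccolta comunale"}
bert_terms = ["raccolta", "ferro"]
spacy_terms = ["centro di raccolta", "centro di raccolta comunale", "materiali ferrosi"]

upgraded = []
for b in bert_terms:
    candidates = [
        s for s in spacy_terms
        if len(s.split()) >= 2 and s in toy_vocab and contains_tokens(s, b)
    ]
    # Prefer the span with the most tokens; keep the BERT term if none qualifies.
    upgraded.append(max(candidates, key=lambda s: len(s.split()), default=b))

print(upgraded)
```

In this toy run *raccolta* is upgraded to the longest qualifying span, while *ferro* stays as it is: *materiali ferrosi* is neither in the toy vocabulary nor a token-level match.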


### 5. BERT + spaCy + vocabulary merge
We now define the main merge function that combines:
- cleaned **BERT terms**,
- **spaCy noun-chunk spans**,
- and the **gold-derived vocabulary**.

The strategy is:
1. First, **upgrade** BERT terms to longer spaCy spans when they match a gold term.
2. Then, **add extra spaCy multiword spans** that:
   - are in the gold vocabulary,
   - are not already covered,
   - are not clearly generic or meaningless.


In [37]:
def merge_bert_spacy_with_dict(bert_terms, spacy_terms, train_vocab_norm):
    """
    BEST ensemble so far:
    1. upgrade BERT with spaCy
    2. add dictionary-filtered spaCy spans
    3. skip generic or meaningless words
    """
    upgraded = upgrade_with_longer_spacy(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm,
    )

    final = upgraded[:]
    seen = {norm(t) for t in upgraded}

    for s in spacy_terms or []:
        s_norm = norm(s)

        if len(s_norm.split()) < 2:  # count tokens on the normalized form, as in the upgrade step
            continue
        if s_norm not in train_vocab_norm:
            continue
        if s_norm in seen:
            continue
        if s_norm in GENERIC_BAD:
            continue

        final.append(s)
        seen.add(s_norm)

    return final
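
The second step of the strategy (adding dictionary-filtered spaCy spans) can be sketched in isolation. All values below are invented and assumed to be normalized; `upgraded` stands in for the output of the upgrade step.

```python
toy_vocab = {"isola ecologica", "rifiuti urbani", "carta"}
upgraded = ["rifiuti urbani"]   # pretend output of the upgrade step
spacy_spans = ["isola ecologica", "rifiuti urbani", "carta", "orario di apertura"]

final = list(upgraded)
seen = set(upgraded)
for span in spacy_spans:
    is_multiword = len(span.split()) >= 2
    # Only multiword spans that are in the vocabulary and not yet covered are added.
    if is_multiword and span in toy_vocab and span not in seen:
        final.append(span)
        seen.add(span)

print(final)
```

Only *isola ecologica* is added: *rifiuti urbani* is already covered, *carta* is a unigram, and *orario di apertura* is not in the toy vocabulary.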


### 6. Per-sentence merge helper
We define a helper `merge_sentence()` that:
1. Applies the BERT+spaCy+vocabulary merge strategy.
2. Removes only **truly generic unigrams** (using the gold vocabulary as a whitelist).
3. Normalizes and deduplicates the final term list.

This function is called once per sentence in the dev/test set.


In [38]:
def merge_sentence(bert_terms, spacy_terms, train_vocab_norm):
    merged = merge_bert_spacy_with_dict(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm
    )
    merged = filter_generic_unigrams(merged, train_vocab_norm)

    # dedupe and normalize
    seen = set()
    final = []
    for t in merged:
        n = norm(t)
        if n not in seen:
            final.append(n)
            seen.add(n)
    return final
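
The final normalize-and-deduplicate step keeps the first occurrence of each normalized form. A compact equivalent, assuming a simplified stand-in normalizer (`simple_norm`, hypothetical) instead of the full `norm()`:

```python
raw_terms = ["Rifiuti Urbani", "rifiuti urbani", "Isola ecologica.", "isola  ecologica"]

def simple_norm(t: str) -> str:
    # Stand-in for norm(): lowercase, collapse whitespace, trim edge punctuation.
    return " ".join(t.lower().split()).strip(".,;:")

# dict.fromkeys keeps insertion order, so the first occurrence of each form wins.
deduped = list(dict.fromkeys(simple_norm(t) for t in raw_terms))
print(deduped)
```

The explicit loop used in `merge_sentence()` and this one-liner should behave the same way; the loop is simply easier to extend with extra conditions.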


### 7. Build the BERT + spaCy ensemble using `merge_sentence()`

In [39]:
from tqdm import tqdm

# ---- Load train data and build vocabulary ----
train_data = load_json(TRAIN_FILE)

train_vocab_norm = build_train_vocab(train_data)
print(f"# unique normalized terms from train gold: {len(train_vocab_norm)}")

# ---- Load dev gold (for evaluation) ----
dev_data = load_json(DEV_FILE)

# ---- Load BERT and spaCy predictions ----
bert_pred = load_json(BERT_DEV_PRED_FILE)
spacy_pred = load_json(SPACY_DEV_PRED_FILE)

# Convert JSON predictions → dict[(doc,par,sent)] → [terms...]
bert_map = build_term_map(bert_pred)
spacy_map = build_term_map(spacy_pred)

# ---- Build ensemble predictions using merge_sentence ----
ensemble_output = {"data": []}

print("Building improved BERT+spaCy ensemble ...")

for idx, row in enumerate(tqdm(dev_data["data"])):

    key = (row["document_id"], row["paragraph_id"], row["sentence_id"])

    bert_terms = bert_map.get(key, []) or []
    spacy_terms = spacy_map.get(key, []) or []

    #  NEW MERGE FUNCTION 
    merged_terms = merge_sentence(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm
    )

    # Debug on first 3
    if idx < 3:
        print("\n---------------------------------------")
        print("Sentence", idx)
        print("TEXT:", row["sentence_text"])
        print("  BERT  :", bert_terms)
        print("  SPACY :", spacy_terms)
        print("  MERGED:", merged_terms)

    # Save
    ensemble_output["data"].append({
        "document_id": row["document_id"],
        "paragraph_id": row["paragraph_id"],
        "sentence_id": row["sentence_id"],
        "term_list": merged_terms,
    })





# unique normalized terms from train gold: 710
Building improved BERT+spaCy ensemble ...


100%|██████████| 577/577 [00:00<00:00, 288555.31it/s]


---------------------------------------
Sentence 0
TEXT: Non Domestica; CAMPEGGI, DISTRIBUTORI CARBURANTI, PARCHEGGI; 1,22; 4,73 
  BERT  : []
  SPACY : []
  MERGED: []

---------------------------------------
Sentence 1
TEXT: Il presente disciplinare per la gestione dei centri di raccolta comunali è stato redatto ai sensi e per effetto del DM 13/05/2009, pubblicato sulla G.U. n. 165 del 18/07/2009, con il quale sono state apportate le modifiche sostanziali al DM 08/04/2008, Disciplina dei centri di raccolta dei rifiuti urbani raccolti in modo differenziato, come previsto dall'art. 183, comma 7, lettera cc) del Dlgs 3 aprile 2006, n. 152, e ss.mm.ii.
  BERT  : ['gestione dei centri di raccolta comunali', 'centri di raccolta dei rifiuti urbani raccolti in modo differenziato']
  SPACY : ['gestione dei centri di raccolta comunali', 'centri di raccolta dei rifiuti urbani raccolti']
  MERGED: ['gestione dei centri di raccolta comunali', 'centri di raccolta dei rifiuti urbani raccolti in mo




#### Save predictions

In [40]:
# ---- Save final merged predictions ----
with open(ENSEMBLE_OUT_FILE, "w", encoding="utf-8") as f:
    json.dump(ensemble_output, f, ensure_ascii=False, indent=2)

print(f"\nEnsemble predictions saved to: {ENSEMBLE_OUT_FILE}")



Ensemble predictions saved to: ../src/predictions/subtask_a_dev_ensemble_bert_2e-5_cased_spacy_dictfilter.json


### Evaluation and results

In [41]:
def type_f1_score(gold_standard, system_output):
  """
  Evaluates a term extraction system's performance using Type Precision,
  Type Recall, and Type F1 score based on the set of unique terms extracted
  at least once across the entire dataset.

  Args:
    gold_standard: A list of lists, where each inner list contains the
                   gold standard terms for an item.
    system_output: A list of lists, where each inner list contains the
                   terms extracted by the system for the corresponding item.

  Returns:
    A tuple containing the Type Precision, Type Recall, and Type F1 score.
  """

  # Get the set of all unique gold standard terms across the dataset
  all_gold_terms = set()
  for item_terms in gold_standard:
    all_gold_terms.update(item_terms)

  # Get the set of all unique system extracted terms across the dataset
  all_system_terms = set()
  for item_terms in system_output:
    all_system_terms.update(item_terms)

  # Calculate True Positives (terms present in both sets)
  type_true_positives = len(all_gold_terms.intersection(all_system_terms))

  # Calculate False Positives (terms in system output but not in gold standard)
  type_false_positives = len(all_system_terms - all_gold_terms)

  # Calculate False Negatives (terms in gold standard but not in system output)
  type_false_negatives = len(all_gold_terms - all_system_terms)

  # Calculate Type Precision, Type Recall, and Type F1 score
  type_precision = type_true_positives / (type_true_positives + type_false_positives) if (type_true_positives + type_false_positives) > 0 else 0
  type_recall = type_true_positives / (type_true_positives + type_false_negatives) if (type_true_positives + type_false_negatives) > 0 else 0
  type_f1 = 2 * (type_precision * type_recall) / (type_precision + type_recall) if (type_precision + type_recall) > 0 else 0

  return type_precision, type_recall, type_f1

In [42]:
                   
def micro_f1_score(gold_standard, system_output):
  """
  Evaluates a term extraction system's performance using Precision, Recall,
  and F1 score based on individual term matching (micro-average).

  Args:
    gold_standard: A list of lists, where each inner list contains the
        gold standard terms for an item.
    system_output: A list of lists, where each inner list contains the
                   terms extracted by the system for the corresponding item.

  Returns:
    A tuple containing the Precision, Recall, and F1 score.
  """
  total_true_positives = 0
  total_false_positives = 0
  total_false_negatives = 0

  # Iterate through each item's gold standard and system output terms
  for gold, system in zip(gold_standard, system_output):
    # Convert to sets for efficient comparison
    gold_set = set(gold)
    system_set = set(system)

    # Calculate True Positives, False Positives, and False Negatives for the current item
    true_positives = len(gold_set.intersection(system_set))
    false_positives = len(system_set - gold_set)
    false_negatives = len(gold_set - system_set)

    # Accumulate totals across all items
    total_true_positives += true_positives
    total_false_positives += false_positives
    total_false_negatives += false_negatives

  # Calculate Precision, Recall, and F1 score (micro-average)
  precision = total_true_positives / (total_true_positives + total_false_positives) if (total_true_positives + total_false_positives) > 0 else 0
  recall = total_true_positives / (total_true_positives + total_false_negatives) if (total_true_positives + total_false_negatives) > 0 else 0
  f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

  return precision, recall, f1
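
The difference between the two metrics is easiest to see on a tiny invented example: micro-averaging pools per-sentence counts, while the type-level score compares the two pooled vocabularies once. The helper `prf` is a hypothetical name introduced for this sketch.

```python
gold = [["raccolta differenziata", "rifiuti urbani"], ["raccolta differenziata"]]
pred = [["raccolta differenziata"], ["raccolta differenziata", "carta"]]

def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from raw counts, with zero-division guards."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Micro: count matches sentence by sentence, then pool the counts.
tp = sum(len(set(g) & set(s)) for g, s in zip(gold, pred))
fp = sum(len(set(s) - set(g)) for g, s in zip(gold, pred))
fn = sum(len(set(g) - set(s)) for g, s in zip(gold, pred))
micro = prf(tp, fp, fn)

# Type-level: pool all terms first, then compare the two vocabularies once.
gold_types = set().union(*gold)
pred_types = set().union(*pred)
type_level = prf(len(gold_types & pred_types),
                 len(pred_types - gold_types),
                 len(gold_types - pred_types))

print("micro:", micro)
print("type: ", type_level)
```

The term *raccolta differenziata* is matched in both sentences, so it counts twice for the micro score but only once at type level, which is why the two scores differ on the same predictions.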

In [43]:

# Extract gold and predicted term lists
dev_gold = [entry["term_list"] for entry in dev_data["data"]]
ensemble_preds = [entry["term_list"] for entry in ensemble_output["data"]]

precision, recall, f1 = micro_f1_score(dev_gold, ensemble_preds)
type_precision, type_recall, type_f1 = type_f1_score(dev_gold, ensemble_preds)

print("\n=====================================================")
print("    IMPROVED BERT + SPACY + DICTIONARY MERGE")
print("=====================================================")

print("\nMicro-averaged Metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")

print("\nType-level Metrics:")
print(f"  Type Precision: {type_precision:.4f}")
print(f"  Type Recall:    {type_recall:.4f}")
print(f"  Type F1 Score:  {type_f1:.4f}")


    IMPROVED BERT + SPACY + DICTIONARY MERGE

Micro-averaged Metrics:
  Precision: 0.7277
  Recall:    0.7406
  F1 Score:  0.7341

Type-level Metrics:
  Type Precision: 0.6443
  Type Recall:    0.6736
  Type F1 Score:  0.6586
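
As a quick sanity check, the F1 values can be recomputed from the printed precision and recall. This only uses the rounded figures shown above, so agreement is expected to four decimals at most.

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

micro_f1 = f1(0.7277, 0.7406)   # micro precision / recall as printed above
type_f1 = f1(0.6443, 0.6736)    # type-level precision / recall as printed above
print(f"{micro_f1:.4f} {type_f1:.4f}")
```

Both recomputed values round to the F1 scores reported in the output.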


#### Error Analysis: False Positives and False Negatives

To better understand model behaviour, we compute:
- **False Positives (FP):** predicted terms that are not present in the gold list.
- **False Negatives (FN):** gold terms that the system failed to predict.

This helps identify:
- systematic missing multiword expressions,
- over-predicted generic terms,
- vocabulary mismatches,
- potential improvements in filtering or upgrading logic.


In [44]:

import pandas as pd


def get_fp_fn_from_listformat(gold_entries, pred_entries):
    """
    Compute false positives and false negatives by expanding each sentence-level
    term list into flat (doc, par, sent, term) rows and comparing them as sets.

    Returns:
        fp_df : DataFrame of false positives
        fn_df : DataFrame of false negatives
    """

    gold_rows = []
    pred_rows = []

    # ---- Expand GOLD ----
    for e in gold_entries:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        for t in e.get("term_list", []):
            n = norm(t)
            if n:
                gold_rows.append((*key, n))

    # ---- Expand PRED ----
    for e in pred_entries:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        for t in e.get("term_list", []):
            n = norm(t)
            if n:
                pred_rows.append((*key, n))

    # ---- Compute FP / FN ----
    gold_set = set(gold_rows)
    pred_set = set(pred_rows)

    fp = sorted(list(pred_set - gold_set))
    fn = sorted(list(gold_set - pred_set))

    fp_df = pd.DataFrame(fp, columns=["document_id", "paragraph_id", "sentence_id", "term"])
    fn_df = pd.DataFrame(fn, columns=["document_id", "paragraph_id", "sentence_id", "term"])

    return fp_df, fn_df
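
The core of this comparison is a set difference over `(doc, par, sent, term)` tuples. A minimal sketch with invented rows, using only the standard library:

```python
# Invented rows; each tuple is (document_id, paragraph_id, sentence_id, term).
gold_rows = {
    ("doc_a", 1, 0, "raccolta differenziata"),
    ("doc_a", 1, 1, "raccolta differenziata"),
}
pred_rows = {
    ("doc_a", 1, 0, "raccolta differenziata"),
    ("doc_a", 1, 1, "carta"),
}

false_positives = sorted(pred_rows - gold_rows)   # predicted, not in gold
false_negatives = sorted(gold_rows - pred_rows)   # in gold, not predicted

print("FP:", false_positives)
print("FN:", false_negatives)
```

Because the sentence key is part of each tuple, a term found correctly in one sentence does not compensate for missing the same term in another sentence.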



In [45]:
gold_entries = dev_data["data"]      # gold JSON
pred_entries = ensemble_output["data"]  # merged predictions JSON

fp_df, fn_df = get_fp_fn_from_listformat(gold_entries, pred_entries)

print("False Positives:", len(fp_df))
print("False Negatives:", len(fn_df))

display(fp_df.head(20))
display(fn_df.head(20))


False Positives: 125
False Negatives: 117


| | document_id | paragraph_id | sentence_id | term |
|---|---|---|---|---|
| 0 | doc_agropoli_09 | 11 | 0 | denominazione conferita |
| 1 | doc_auletta_01 | 9 | 0 | isee |
| 2 | doc_auletta_13 | 36 | 1 | segnalazione per disservizi |
| 3 | doc_battipaglia_02 | 6 | 0 | ambiente e gestione dei rifiuti |
| 4 | doc_battipaglia_02 | 20 | 0 | forme di gestione dei rifiuti |
| 5 | doc_battipaglia_13 | 20 | 0 | manutenzione verde pubblico |
| 6 | doc_capaccio_06 | 10 | 1 | oli conferiti |
| 7 | doc_capaccio_06 | 10 | 1 | oli esausti |
| 8 | doc_capaccio_06 | 10 | 1 | recupero |
| 9 | doc_capaccio_06 | 10 | 1 | trasporto |


| | document_id | paragraph_id | sentence_id | term |
|---|---|---|---|---|
| 0 | doc_agropoli_13 | 1 | 14 | conferire |
| 1 | doc_auletta_13 | 36 | 1 | gestore dello spazzamento e lavaggio delle strade |
| 2 | doc_auletta_13 | 40 | 2 | condizioni igieniche e di decoro |
| 3 | doc_battipaglia_02 | 6 | 0 | gestione dei rifiuti |
| 4 | doc_battipaglia_02 | 20 | 0 | emissione |
| 5 | doc_battipaglia_13 | 2 | 3 | carta |
| 6 | doc_capaccio_06 | 7 | 1 | raccolta degli oli esausti |
| 7 | doc_capaccio_06 | 10 | 1 | conferiti |
| 8 | doc_capaccio_06 | 10 | 1 | recupero degli oli esausti |
| 9 | doc_capaccio_10 | 3 | 3 | sacchetto trasparente |


#### Sentence-level error table

To inspect model behavior at the sentence level, we build a table where each row
corresponds to one sentence that has at least one error (FP or FN).

For each sentence we store:
- the number of gold and predicted terms,
- how many **missing** terms (FN) and **extra** terms (FP),
- the list of missing/extra terms (normalized),
- the original sentence text.


In [46]:
 
def build_sentence_error_table(dev_data, pred_data):
    """
    Build a table with one row per sentence that has at least one error.
    Terms are compared after norm(), using the (doc, par, sent) key.
    """

    # ---- Build gold map and remember the sentence text for each key ----
    gold_map = {}
    text_map = {}
    for e in dev_data["data"]:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        gold_map[key] = {norm(t) for t in e.get("term_list", [])}
        text_map.setdefault(key, e["sentence_text"])

    # ---- Build pred map ----
    pred_map = {}
    for e in pred_data["data"]:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        pred_map[key] = {norm(t) for t in e.get("term_list", [])}

    # ---- Build rows ----
    rows = []
    for key, gold_set in gold_map.items():
        doc, par, sent = key
        pred_set = pred_map.get(key, set())

        missing = gold_set - pred_set   # false negatives for this sentence
        extra = pred_set - gold_set     # false positives for this sentence

        if missing or extra:
            rows.append({
                "document_id": doc,
                "paragraph_id": par,
                "sentence_id": sent,
                "n_gold": len(gold_set),
                "n_pred": len(pred_set),
                "n_missing": len(missing),
                "missing_terms": sorted(missing),
                "n_extra": len(extra),
                "extra_terms": sorted(extra),
                "sentence_text": text_map[key],
            })

    return pd.DataFrame(rows)


errors_df = build_sentence_error_table(dev_data, ensemble_output)
print(f"Error table shape: {errors_df.shape}")
errors_df.head(20)


Error table shape: (129, 10)


| | document_id | paragraph_id | sentence_id | n_gold | n_pred | n_missing | missing_terms | n_extra | extra_terms | sentence_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | doc_caserta_06 | 3 | 1 | 2 | 2 | 2 | [disciplina dei centri di raccolta dei rifiuti... | 2 | [centri di raccolta dei rifiuti urbani raccolt... | Il presente disciplinare per la gestione dei c... |
| 1 | doc_poggiomarino_01 | 6 | 1 | 1 | 1 | 1 | [raccolta] | 1 | [servizio supplementare di raccolta] | È un Servizio Supplementare di raccolta, rivol... |
| 2 | doc_nola_05 | 2 | 2 | 2 | 3 | 0 | [] | 1 | [servizio di raccolta dei rifiuti derivanti da... | ll servizio di raccolta dei rifiuti derivanti ... |
| 3 | doc_poggiomarino_12 | 17 | 4 | 0 | 1 | 0 | [] | 1 | [carta] | - giornali; - la carta per alimenti; |
| 4 | doc_capaccio_10 | 3 | 3 | 2 | 1 | 1 | [sacchetto trasparente] | 0 | [] | MULTIMATERIALE; Sacchetto blu trasparente; Lun... |
| 5 | doc_salerno_05 | 11 | 2 | 1 | 2 | 0 | [] | 1 | [tessuto] | Indumenti usati, accessori, lenzuola, coperte,... |
| 6 | doc_caserta_06 | 6 | 2 | 1 | 1 | 1 | [gestione del centro di raccolta] | 1 | [centro di raccolta] | - alla vigilanza nel rispetto delle norme del ... |
| 7 | doc_capaccio_15 | 5 | 12 | 1 | 2 | 1 | [pile portatili, batterie e accumulatori al pi... | 2 | [pile portatili, utenze domestiche] | PILE PORTATILI, BATTERIE E ACCUMULATORI AL PIO... |
| 8 | doc_nola_05 | 2 | 0 | 2 | 0 | 2 | [frazione verde, ritiro] | 0 | [] | RITIRO FRAZIONE VERDE |
| 9 | doc_salerno_05 | 7 | 6 | 2 | 5 | 2 | [conferiti, plastica, acciaio e alluminio] | 5 | [acciaio, alluminio, cartoni per liquidi, plas... | I cartoni per liquidi vanno conferiti con plas... |
