# **Term Extraction Ensemble (BERT + spaCy + Dictionary)**

This notebook implements a complete post-processing pipeline for the ATE-IT Subtask A (Automatic Term Extraction).  
It takes the raw predictions from a fine-tuned **BERT token classification model** and combines them with **spaCy noun-chunk spans** and a **gold-derived domain vocabulary** to produce a higher-quality list of domain terms for each sentence.

### Pipeline Summary
1. **Load BERT and spaCy predictions**  
   - Import model outputs in ATE-IT JSON format.  
   - Map predictions to sentence identifiers for easy lookup.

2. **Normalize and clean BERT terms**  
   - Strip leading and trailing punctuation, unify quotes, lowercase, collapse whitespace.  
   - Filter out spurious or generic one-word candidates.

3. **Build a domain vocabulary from the gold training set**  
   - Normalize gold terms.  
   - Track frequencies to identify strong (at least 3 occurrences) vs. weak (a single occurrence) terms.

4. **Merge BERT + spaCy + Dictionary knowledge**  
   - **Upgrade** short BERT terms to longer spaCy spans when they form a valid multi-word expression present in the gold vocabulary.  
   - **Add** additional spaCy multi-word spans only if they appear in the gold vocabulary.  
   - **Filter out** generic, meaningless, or uninformative unigrams.  
   - **Normalize and deduplicate** final terms.

5. **Generate final ensemble predictions**  
   - For each sentence, produce an improved term list combining all signals.  
   - Output saved in ATE-IT JSON format.

### Goal
The notebook aims to improve both the recall and the precision of automatic term extraction by combining:
- contextual predictions (BERT),
- linguistic structure (spaCy),
- and domain consistency (gold vocabulary).

The hybrid ensemble is expected to outperform each component used alone; the dev-set scores are reported in the evaluation cells.


In [98]:
import json, os
def load_json(path: str):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
    

def save_json(obj, path: str):
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)
    print(f"✓ Saved cleaned predictions to {path}")

In [99]:
DATA_DIR = "../data/"
PRED_DIR = "../src/predictions/"

TRAIN_FILE = os.path.join(DATA_DIR, "subtask_a_train.json")
DEV_FILE = os.path.join(DATA_DIR, "subtask_a_dev.json")

# BERT_DEV_PRED_FILE = os.path.join(
#     PRED_DIR, "subtask_a_dev_bert_token_classification_preds_clean.json"
# )
BERT_DEV_PRED_FILE = os.path.join(
    PRED_DIR, "subtask_a_dev_bert_preds_2e-5_changed_cleaned.json"
)
SPACY_DEV_PRED_FILE = os.path.join(
    PRED_DIR, "subtask_a_dev_spacy_trained_preds.json"
)

ENSEMBLE_OUT_FILE = os.path.join(
    PRED_DIR, "subtask_a_dev_ensemble_bert_2e-5_changed_spacy_dictfilter_metrics_changed.json"
)

os.makedirs(PRED_DIR, exist_ok=True)

In [100]:
print(BERT_DEV_PRED_FILE)
print(SPACY_DEV_PRED_FILE)

../src/predictions/subtask_a_dev_bert_preds_2e-5_changed_cleaned.json
../src/predictions/subtask_a_dev_spacy_trained_preds.json


In [101]:
import re
import unicodedata

def norm(t: str) -> str:
    if not t:
        return ""
    t = t.lower()
    t = unicodedata.normalize("NFKC", t)
    t = t.replace("’", "'").replace("`", "'")
    t = t.replace("“", '"').replace("”", '"')
    t = " ".join(t.split())
    # strip punctuation at the edges
    t = t.strip(".,;:-'\"()[]{}")
    return t
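
A quick way to check what the normalization does is to run it on a few raw strings. The sketch below uses a standalone copy of the same steps under the hypothetical name `norm_demo`; the example strings are invented for illustration.

```python
import unicodedata


def norm_demo(t: str) -> str:
    """Standalone copy of the normalization steps, for illustration only."""
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("’", "'").replace("`", "'")      # unify apostrophes
    t = t.replace("“", '"').replace("”", '"')      # unify double quotes
    t = " ".join(t.split())                        # collapse whitespace
    return t.strip(".,;:-'\"()[]{}")               # strip edge punctuation only


examples = ["  “Raccolta   Differenziata”. ", "(Isola Ecologica);", "l’umido", "R.A.E.E."]
for raw in examples:
    print(repr(raw), "->", repr(norm_demo(raw)))
# '  “Raccolta   Differenziata”. ' -> 'raccolta differenziata'
# '(Isola Ecologica);' -> 'isola ecologica'
# 'l’umido' -> "l'umido"
# 'R.A.E.E.' -> 'r.a.e.e'
```

Only punctuation at the edges is stripped, so internal dots and apostrophes survive while a trailing dot does not.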

In [102]:
# ==============================
# Train vocabulary from gold terms
# ==============================

def build_train_vocab(train_data: dict) -> set:
    vocab = set()
    for entry in train_data["data"]:
        for term in entry.get("term_list", []):
            n = norm(term)
            if n:
                vocab.add(n)
    return vocab


def build_term_map(pred_json: dict) -> dict:
    """
    Build a mapping:
        (document_id, paragraph_id, sentence_id) -> list of predicted terms
    from a prediction JSON in the ATE-IT format.
    """
    m = {}
    for e in pred_json["data"]:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        m[key] = e.get("term_list", []) or []
    return m
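
As a rough illustration of the lookup structure, the sketch below builds the same kind of map from a toy prediction object. The field names follow the ones used in this notebook; the document ids and terms are invented.

```python
# Toy predictions in the JSON layout used in this notebook (values are invented).
toy_pred = {
    "data": [
        {"document_id": "doc_a", "paragraph_id": 1, "sentence_id": 0,
         "term_list": ["rifiuti urbani"]},
        {"document_id": "doc_a", "paragraph_id": 1, "sentence_id": 1,
         "term_list": None},  # a missing list is treated as empty
    ]
}

# Key each sentence by (document_id, paragraph_id, sentence_id).
term_map = {
    (e["document_id"], e["paragraph_id"], e["sentence_id"]): e.get("term_list") or []
    for e in toy_pred["data"]
}

print(term_map[("doc_a", 1, 0)])           # ['rifiuti urbani']
print(term_map[("doc_a", 1, 1)])           # []
print(term_map.get(("doc_b", 9, 9), []))   # [] for a sentence with no predictions
```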

In [103]:
from collections import Counter

def build_train_vocab_with_freq(train_data):
    freq = Counter()
    for e in train_data["data"]:
        for term in e.get("term_list", []):
            # use a local name that does not shadow the norm() function
            n = norm(term)
            if n:
                freq[n] += 1
    strong = {t for t, c in freq.items() if c >= 3}  # seen at least 3 times
    weak   = {t for t, c in freq.items() if c == 1}  # seen exactly once
    return freq, strong, weak
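
The strong/weak split can be illustrated on a toy example. This is a minimal sketch that assumes the terms are already normalized; the term lists are invented.

```python
from collections import Counter

# Toy gold term lists, one per sentence (invented, already normalized).
toy_term_lists = [
    ["raccolta differenziata", "isola ecologica"],
    ["raccolta differenziata", "compostaggio"],
    ["raccolta differenziata", "isola ecologica"],
]

freq = Counter(term for terms in toy_term_lists for term in terms)
strong = {t for t, c in freq.items() if c >= 3}  # seen at least 3 times
weak = {t for t, c in freq.items() if c == 1}    # seen exactly once

print(sorted(strong))  # ['raccolta differenziata']
print(sorted(weak))    # ['compostaggio']
```

With these thresholds, a term seen exactly twice ("isola ecologica" here) falls into neither bucket.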


For each sentence, the merge:

- keeps all BERT terms as the baseline,
- adds spaCy terms only if their normalized form appears in the train vocabulary, they are multi-word, and they are not duplicates.
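
The add rule can be sketched on toy data as follows. The helper `add_vocab_spans` is hypothetical and uses plain lowercasing in place of the notebook's `norm()`; the terms and vocabulary are invented.

```python
def add_vocab_spans(bert_terms, spacy_terms, vocab):
    """Keep BERT terms; append multi-word spaCy terms that are in the vocabulary."""
    merged = list(bert_terms)
    seen = {t.lower() for t in bert_terms}
    for s in spacy_terms:
        key = s.lower()
        if len(key.split()) >= 2 and key in vocab and key not in seen:
            merged.append(s)
            seen.add(key)
    return merged


toy_vocab = {"centro di raccolta", "raccolta differenziata"}
toy_bert = ["raccolta differenziata"]
toy_spacy = ["centro di raccolta", "raccolta differenziata", "sacco azzurro", "rifiuti"]

result = add_vocab_spans(toy_bert, toy_spacy, toy_vocab)
print(result)  # ['raccolta differenziata', 'centro di raccolta']
```

"sacco azzurro" is skipped because it is not in the vocabulary, and "rifiuti" because it is a single word.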

In [104]:
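# NOTE: this redefinition overrides the earlier norm(); it skips NFKC
# normalization and the backtick / curly double-quote replacements.
def norm(t: str) -> str: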
    if not t:
        return ""
    t = t.lower().strip()
    t = " ".join(t.split())
    t = t.replace("’", "'")
    t = t.strip(".,;:-'\"()[]{}")
    return t

In [105]:
GENERIC_HEADS = {
    "rifiuti", "materiali", "utenti", "plastica", "carta",
    "residui", "tariffe", "gestore", "servizio", "modalità",
    "conferimento", "costi", "parte", "quota", "impianto"
}


def looks_like_acronym(n: str) -> bool:
    # e.g. "tmb", "raee", "r.a.e.e."
    n_clean = n.replace(".", "")
    return 2 <= len(n_clean) <= 6 and n_clean.isalpha()


def filter_generic_unigrams(terms, train_vocab_norm):
    filtered = []
    for t in terms:
        n = norm(t)
        tokens = n.split()
        if len(tokens) == 1:
            if n in GENERIC_HEADS and n not in train_vocab_norm:
                # drop "quota", "parte", etc. when they never occur as gold terms
                continue
        filtered.append(t)
    return filtered
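
To make the effect of the unigram filter concrete, here is a small sketch on toy data. The set `GENERIC_DEMO` is a hypothetical subset of the generic heads, and the inputs are assumed to be already normalized.

```python
GENERIC_DEMO = {"rifiuti", "servizio", "parte"}  # illustrative subset


def drop_generic_unigrams(terms, vocab):
    """Drop single-word generic terms unless the vocabulary contains them."""
    kept = []
    for t in terms:
        is_unigram = len(t.split()) == 1
        if is_unigram and t in GENERIC_DEMO and t not in vocab:
            continue
        kept.append(t)
    return kept


toy_vocab = {"rifiuti"}
toy_terms = ["rifiuti", "servizio", "servizio di raccolta", "vetro"]
print(drop_generic_unigrams(toy_terms, toy_vocab))
# ['rifiuti', 'servizio di raccolta', 'vetro']
```

"rifiuti" survives because it occurs in the vocabulary; "servizio" is dropped; multi-word terms and non-generic unigrams are never touched.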

In [106]:
GENERIC_BAD = {
    "parte", "gestione", "città", "territorio", "comune",
    "ore", "no", "si", "anno", "mese", "giorno"
}


def contains_as_subspan(longer: str, shorter: str) -> bool:
    long_tokens = longer.split()
    short_tokens = shorter.split()
    L, S = len(long_tokens), len(short_tokens)
    if S > L:
        return False
    for i in range(L - S + 1):
        if long_tokens[i:i+S] == short_tokens:
            return True
    return False


def upgrade_with_longer_spacy(bert_terms, spacy_terms, train_vocab_norm):
    """
    Upgrade BERT terms to longer spaCy spans ONLY WHEN BENEFICIAL.
    """
    final = []
    seen = set()
    
    spacy_norm_map = {norm(t): t for t in spacy_terms or []}

    for b in bert_terms or []:
        b_norm = norm(b)
        if not b_norm or b_norm in GENERIC_BAD:
            continue

        best = None

        # search longest valid spaCy span containing the BERT term
        for s_norm, s in spacy_norm_map.items():
            if len(s_norm.split()) < 2:
                continue
            if s_norm not in train_vocab_norm:
                continue
            if contains_as_subspan(s_norm, b_norm):
                if best is None or len(s_norm.split()) > len(norm(best).split()):
                    best = s


        chosen = best if best else b
        c_norm = norm(chosen)

        if c_norm not in seen and c_norm not in GENERIC_BAD:
            final.append(chosen)
            seen.add(c_norm)

    return final
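
The upgrade rule matches on whole tokens, not on characters. The following sketch shows the idea for a single BERT term; `is_token_subspan` and `upgrade_one` are hypothetical helpers that assume already-normalized input, and the spans and vocabulary are invented.

```python
def is_token_subspan(longer: str, shorter: str) -> bool:
    """True if the tokens of `shorter` occur contiguously inside `longer`."""
    long_toks, short_toks = longer.split(), shorter.split()
    n = len(short_toks)
    return any(long_toks[i:i + n] == short_toks
               for i in range(len(long_toks) - n + 1))


def upgrade_one(bert_term, spacy_terms, vocab):
    """Return the longest in-vocabulary multi-word span containing bert_term."""
    best = None
    for s in spacy_terms:
        if len(s.split()) < 2 or s not in vocab:
            continue
        if is_token_subspan(s, bert_term):
            if best is None or len(s.split()) > len(best.split()):
                best = s
    return best or bert_term


toy_vocab = {"centro di raccolta", "centro di raccolta comunale"}
toy_spans = ["centro di raccolta", "centro di raccolta comunale", "raccolta porta a porta"]

print(is_token_subspan("centro di raccolta", "raccolta"))  # True
print(is_token_subspan("centro di raccolta", "colta"))     # False (not a whole token)
print(upgrade_one("raccolta", toy_spans, toy_vocab))       # centro di raccolta comunale
print(upgrade_one("porta", toy_spans, toy_vocab))          # porta (no in-vocab span)
```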


In [107]:
def merge_bert_spacy_with_dict(bert_terms, spacy_terms, train_vocab_norm):
    """
    BEST ensemble so far:
    1. upgrade BERT with spaCy
    2. add dictionary-filtered spaCy spans
    3. skip generic or meaningless words
    """
    upgraded = upgrade_with_longer_spacy(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm,
    )

    final = upgraded[:]
    seen = {norm(t) for t in upgraded}

    for s in spacy_terms or []:
        s_norm = norm(s)

        if len(s_norm.split()) < 2:
            continue
        if s_norm not in train_vocab_norm:
            continue
        if s_norm in seen:
            continue
        if s_norm in GENERIC_BAD:
            continue

        final.append(s)
        seen.add(s_norm)

    return final


In [108]:
def merge_sentence(bert_terms, spacy_terms, train_vocab_norm):
    merged = merge_bert_spacy_with_dict(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm
    )
    merged = filter_generic_unigrams(merged, train_vocab_norm)

    # dedupe and normalize
    seen = set()
    final = []
    for t in merged:
        n = norm(t)
        if n not in seen:
            final.append(n)
            seen.add(n)
    return final


In [109]:
                   
def micro_f1_score(gold_standard, system_output):
  """
  Evaluates a term extraction system's performance using Precision, Recall,
  and F1 score based on individual term matching (micro-average).

  Args:
    gold_standard: A list of lists, where each inner list contains the
        gold standard terms for an item.
    system_output: A list of lists, where each inner list contains the
                   terms extracted by the system for the corresponding item.

  Returns:
    A tuple containing the Precision, Recall, and F1 score.
  """
  total_true_positives = 0
  total_false_positives = 0
  total_false_negatives = 0

  # Iterate through each item's gold standard and system output terms
  for gold, system in zip(gold_standard, system_output):
    # Convert to sets for efficient comparison
    gold_set = set(gold)
    system_set = set(system)

    # Calculate True Positives, False Positives, and False Negatives for the current item
    true_positives = len(gold_set.intersection(system_set))
    false_positives = len(system_set - gold_set)
    false_negatives = len(gold_set - system_set)

    # Accumulate totals across all items
    total_true_positives += true_positives
    total_false_positives += false_positives
    total_false_negatives += false_negatives

  # Calculate Precision, Recall, and F1 score (micro-average)
  precision = total_true_positives / (total_true_positives + total_false_positives) if (total_true_positives + total_false_positives) > 0 else 0
  recall = total_true_positives / (total_true_positives + total_false_negatives) if (total_true_positives + total_false_negatives) > 0 else 0
  f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

  return precision, recall, f1

In [110]:
def type_f1_score(gold_standard, system_output):
  """
  Evaluates a term extraction system's performance using Type Precision,
  Type Recall, and Type F1 score based on the set of unique terms extracted
  at least once across the entire dataset.

  Args:
    gold_standard: A list of lists, where each inner list contains the
                   gold standard terms for an item.
    system_output: A list of lists, where each inner list contains the
                   terms extracted by the system for the corresponding item.

  Returns:
    A tuple containing the Type Precision, Type Recall, and Type F1 score.
  """

  # Get the set of all unique gold standard terms across the dataset
  all_gold_terms = set()
  for item_terms in gold_standard:
    all_gold_terms.update(item_terms)

  # Get the set of all unique system extracted terms across the dataset
  all_system_terms = set()
  for item_terms in system_output:
    all_system_terms.update(item_terms)

  # Calculate True Positives (terms present in both sets)
  type_true_positives = len(all_gold_terms.intersection(all_system_terms))

  # Calculate False Positives (terms in system output but not in gold standard)
  type_false_positives = len(all_system_terms - all_gold_terms)

  # Calculate False Negatives (terms in gold standard but not in system output)
  type_false_negatives = len(all_gold_terms - all_system_terms)

  # Calculate Type Precision, Type Recall, and Type F1 score
  type_precision = type_true_positives / (type_true_positives + type_false_positives) if (type_true_positives + type_false_positives) > 0 else 0
  type_recall = type_true_positives / (type_true_positives + type_false_negatives) if (type_true_positives + type_false_negatives) > 0 else 0
  type_f1 = 2 * (type_precision * type_recall) / (type_precision + type_recall) if (type_precision + type_recall) > 0 else 0

  return type_precision, type_recall, type_f1
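
The difference between the two metrics is easiest to see on a tiny example. This is a self-contained sketch with invented data; `prf` is a hypothetical helper that applies the same precision/recall/F1 formulas.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = [["rifiuti urbani", "compostaggio"], ["rifiuti urbani"]]
pred = [["rifiuti urbani"], ["rifiuti urbani", "sacco azzurro"]]

# Micro: count matches per sentence, then sum over sentences.
tp = sum(len(set(g) & set(s)) for g, s in zip(gold, pred))
fp = sum(len(set(s) - set(g)) for g, s in zip(gold, pred))
fn = sum(len(set(g) - set(s)) for g, s in zip(gold, pred))
micro = prf(tp, fp, fn)

# Type: compare the sets of unique terms over the whole dataset.
gold_types = set().union(*gold)
pred_types = set().union(*pred)
type_scores = prf(len(gold_types & pred_types),
                  len(pred_types - gold_types),
                  len(gold_types - pred_types))

print("micro:", [round(x, 3) for x in micro])        # micro: [0.667, 0.667, 0.667]
print("type :", [round(x, 3) for x in type_scores])  # type : [0.5, 0.5, 0.5]
```

The repeated correct term "rifiuti urbani" counts twice in the micro scores but only once at type level, which is why the type-level numbers are lower here.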

### BUILD BERT + SPACY ENSEMBLE USING merge_sentence()

In [None]:
from tqdm import tqdm
import json

# ---- Load train data and build vocabulary ----
with open(TRAIN_FILE, "r", encoding="utf-8") as f:
    train_data = json.load(f)

train_vocab_norm = build_train_vocab(train_data)
print(f"# unique normalized terms from train gold: {len(train_vocab_norm)}")

# ---- Load dev gold (for evaluation) ----
with open(DEV_FILE, "r", encoding="utf-8") as f:
    dev_data = json.load(f)

# ---- Load BERT and spaCy predictions ----
with open(BERT_DEV_PRED_FILE, "r", encoding="utf-8") as f:
    bert_pred = json.load(f)

with open(SPACY_DEV_PRED_FILE, "r", encoding="utf-8") as f:
    spacy_pred = json.load(f)

# Convert JSON predictions → dict[(doc,par,sent)] → [terms...]
bert_map = build_term_map(bert_pred)
spacy_map = build_term_map(spacy_pred)

# ---- Build ensemble predictions using merge_sentence ----
ensemble_output = {"data": []}

print("Building improved BERT+spaCy ensemble ...")

for idx, row in enumerate(tqdm(dev_data["data"])):

    key = (row["document_id"], row["paragraph_id"], row["sentence_id"])

    bert_terms = bert_map.get(key, []) or []
    spacy_terms = spacy_map.get(key, []) or []

    #  NEW MERGE FUNCTION 
    merged_terms = merge_sentence(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm
    )

    # Debug on first 3
    if idx < 3:
        print("\n---------------------------------------")
        print("Sentence", idx)
        print("TEXT:", row["sentence_text"])
        print("  BERT  :", bert_terms)
        print("  SPACY :", spacy_terms)
        print("  MERGED:", merged_terms)

    # Save
    ensemble_output["data"].append({
        "document_id": row["document_id"],
        "paragraph_id": row["paragraph_id"],
        "sentence_id": row["sentence_id"],
        "term_list": merged_terms,
    })





# unique normalized terms from train gold: 710
Building improved BERT+spaCy ensemble ...


100%|██████████| 577/577 [00:00<00:00, 108219.53it/s]


---------------------------------------
Sentence 0
TEXT: Non Domestica; CAMPEGGI, DISTRIBUTORI CARBURANTI, PARCHEGGI; 1,22; 4,73 
  BERT  : []
  SPACY : []
  MERGED: []

---------------------------------------
Sentence 1
TEXT: Il presente disciplinare per la gestione dei centri di raccolta comunali è stato redatto ai sensi e per effetto del DM 13/05/2009, pubblicato sulla G.U. n. 165 del 18/07/2009, con il quale sono state apportate le modifiche sostanziali al DM 08/04/2008, Disciplina dei centri di raccolta dei rifiuti urbani raccolti in modo differenziato, come previsto dall'art. 183, comma 7, lettera cc) del Dlgs 3 aprile 2006, n. 152, e ss.mm.ii.
  BERT  : ['disciplinare per la gestione dei centri di raccolta comunali', 'centri di raccolta dei rifiuti urbani raccolti in']
  SPACY : ['gestione dei centri di raccolta comunali', 'centri di raccolta dei rifiuti urbani raccolti']
  MERGED: ['disciplinare per la gestione dei centri di raccolta comunali', 'centri di raccolta dei rifiuti 




#### Save predictions

In [112]:
# ---- Save final merged predictions ----
with open(ENSEMBLE_OUT_FILE, "w", encoding="utf-8") as f:
    json.dump(ensemble_output, f, ensure_ascii=False, indent=2)

print(f"\nEnsemble predictions saved to: {ENSEMBLE_OUT_FILE}")



Ensemble predictions saved to: ../src/predictions/subtask_a_dev_ensemble_bert_2e-5_changed_spacy_dictfilter_metrics_changed.json


In [113]:

# Extract gold and predicted term lists (same sentence order as dev_data)
dev_gold = [entry["term_list"] for entry in dev_data["data"]]
ensemble_preds = [entry["term_list"] for entry in ensemble_output["data"]]

precision, recall, f1 = micro_f1_score(dev_gold, ensemble_preds)
type_precision, type_recall, type_f1 = type_f1_score(dev_gold, ensemble_preds)

print("\n=====================================================")
print("    IMPROVED BERT + SPACY + DICTIONARY MERGE")
print("=====================================================")

print("\nMicro-averaged Metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")

print("\nType-level Metrics:")
print(f"  Type Precision: {type_precision:.4f}")
print(f"  Type Recall:    {type_recall:.4f}")
print(f"  Type F1 Score:  {type_f1:.4f}")


    IMPROVED BERT + SPACY + DICTIONARY MERGE

Micro-averaged Metrics:
  Precision: 0.7687
  Recall:    0.7295
  F1 Score:  0.7486

Type-level Metrics:
  Type Precision: 0.7277
  Type Recall:    0.6736
  Type F1 Score:  0.6996


In [114]:
import pandas as pd
def get_fp_fn_from_listformat(gold_entries, pred_entries):
    """
    gold_entries: list of rows from dev_data["data"]
    pred_entries: list of rows from ensemble_output["data"]
    
    Each entry has:
        - document_id
        - paragraph_id
        - sentence_id
        - term_list (list of terms)
    
    Returns DataFrames:
        fp_df (false positives)
        fn_df (false negatives)
    """

    gold_rows = []
    pred_rows = []

    # --- Expand GOLD ---
    for e in gold_entries:
        doc = e["document_id"]
        par = e["paragraph_id"]
        sid = e["sentence_id"]
        for t in e["term_list"]:
            t_norm = norm(t)
            if t_norm:
                gold_rows.append((doc, par, sid, t_norm))

    # --- Expand PRED ---
    for e in pred_entries:
        doc = e["document_id"]
        par = e["paragraph_id"]
        sid = e["sentence_id"]
        for t in e["term_list"]:
            t_norm = norm(t)
            if t_norm:
                pred_rows.append((doc, par, sid, t_norm))

    gold_set = set(gold_rows)
    pred_set = set(pred_rows)

    fp = pred_set - gold_set
    fn = gold_set - pred_set

    fp_df = pd.DataFrame(list(fp),
                         columns=["document_id", "paragraph_id", "sentence_id", "term"])
    fn_df = pd.DataFrame(list(fn),
                         columns=["document_id", "paragraph_id", "sentence_id", "term"])

    return fp_df, fn_df


In [115]:
gold_entries = dev_data["data"]      # gold JSON
pred_entries = ensemble_output["data"]  # merged predictions JSON

fp_df, fn_df = get_fp_fn_from_listformat(gold_entries, pred_entries)

print("False Positives:", len(fp_df))
print("False Negatives:", len(fn_df))

display(fp_df.head(20))
display(fn_df.head(20))


False Positives: 99
False Negatives: 122


| | document_id | paragraph_id | sentence_id | term |
|---|---|---|---|---|
| 0 | doc_sorrento_20 | 1 | 3 | plastica mono |
| 1 | doc_santegidiodelmontealbino_03 | 15 | 2 | sacco azzurro |
| 2 | doc_sorrento_20 | 1 | 3 | raccolta differenziata |
| 3 | doc_salerno_06 | 27 | 1 | svuotamento |
| 4 | doc_caserta_06 | 10 | 4 | frazioni di rifiuti |
| 5 | doc_prataprincipatodiultra_02 | 2 | 0 | porta a porta spinto |
| 6 | doc_nocerainferiore_06 | 2 | 1 | sistema di raccolta differenziata dei rifiuti ... |
| 7 | doc_sorrento_10 | 28 | 2 | parte fissa |
| 8 | doc_caserta_02 | 68 | 8 | pannocarta |
| 9 | doc_poggiomarino_12 | 23 | 2 | centro di raccolta differenziata |


| | document_id | paragraph_id | sentence_id | term |
|---|---|---|---|---|
| 0 | doc_sorrento_10 | 28 | 2 | coefficienti per la determinazione della parte... |
| 1 | doc_capaccio_10 | 6 | 6 | sacchetto trasparente |
| 2 | doc_salerno_03 | 2 | 27 | sacchetti |
| 3 | doc_nocerainferiore_06 | 10 | 1 | depositare i rifiuti |
| 4 | doc_sorrento_10 | 56 | 0 | costi variabili |
| 5 | doc_capaccio_21 | 21 | 1 | imballaggi in cartone |
| 6 | doc_sorrento_22 | 2 | 0 | materiali ferrosi |
| 7 | doc_sorrento_15 | 2 | 0 | materiali in plastica |
| 8 | doc_poggiomarino_12 | 17 | 65 | r1 |
| 9 | doc_nocerainferiore_06 | 4 | 0 | carta, cartone, cartoncino |


#### SENTENCE-LEVEL ERROR ANALYSIS

In [116]:
from collections import defaultdict
from typing import List
def collect_sentence_errors(pred_json, gold_json):

    """
    Build structures:
       errors[(doc,par,sent)] = {
           "gold": [...],
           "pred": [...],
           "fp": [...],
           "fn": [...],
       }
    and a list sorted by the number of errors.
    """
    pred_map = {}
    for e in pred_json["data"]:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        pred_map[key] = set(norm(t) for t in e.get("term_list", []))

    gold_map = {}
    for e in gold_json["data"]:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        gold_map[key] = set(norm(t) for t in e.get("term_list", []))

    errors = {}
    for key in gold_map:
        gold = gold_map[key]
        pred = pred_map.get(key, set())

        fp = sorted(pred - gold)
        fn = sorted(gold - pred)

        errors[key] = {
            "gold": gold,
            "pred": pred,
            "fp": fp,
            "fn": fn,
            "err_count": len(fp) + len(fn),
        }

    # sort keys by number of errors descending
    sorted_keys = sorted(errors.keys(), key=lambda k: errors[k]["err_count"], reverse=True)
    return errors, sorted_keys


In [117]:
def print_top_error_sentences(errors, sorted_keys, dev_data, top_n=10):
    """
    Print the top_n sentences with the most errors: text, gold, pred, FP, FN.
    """
    # Build sentence map
    sent_map = {}
    for r in dev_data["data"]:
        key = (r["document_id"], r["paragraph_id"], r["sentence_id"])
        sent_map[key] = r["sentence_text"]

    print(f"\n=== Top {top_n} error sentences ===")
    for i, key in enumerate(sorted_keys[:top_n]):
        doc, par, sent = key
        e = errors[key]
        txt = sent_map[key]

        print("\n------------------------------------------------------------")
        print(f"[{i+1}] Doc: {doc}  Par: {par}  Sent: {sent}")
        print("TEXT :", txt)
        print("GOLD :", sorted(e['gold']))
        print("PRED :", sorted(e['pred']))
        print("FP   :", e['fp'])
        print("FN   :", e['fn'])


In [118]:

from typing import List
ITALIAN_BAD_ENDINGS = {
    "di", "dei", "degli", "del", "della", "dello", "delle",
    "e", "ed",
    "a", "ai", "agli", "al", "alla", "alle", "allo",
    "da", "dal", "dai", "dagli", "dalla", "dalle",
    "con", "per", "su", "tra", "fra"
}


def looks_truncated(term: str) -> bool:
    """
    Heuristic: terms that end with a function word
    (e.g. 'gestione dei', 'batterie e') are probably truncated.
    """
    tokens = term.split()
    if len(tokens) < 2:
        return False  # a single word can be a valid term (e.g. 'multimateriale')
    last = tokens[-1]
    return last in ITALIAN_BAD_ENDINGS


class Span:
    def __init__(self, term: str, start: int, end: int):
        self.term = term
        self.start = start
        self.end = end

    def length(self) -> int:
        return self.end - self.start

    def __repr__(self):
        return f"Span(term={self.term!r}, start={self.start}, end={self.end})"

def find_spans(sentence: str, term: str) -> List[Span]:
    """
    Find all occurrences (character spans) of 'term' in 'sentence' (case-insensitive).
    """
    spans = []
    sent_l = sentence.lower()
    t = term.lower()
    start = 0
    while True:
        idx = sent_l.find(t, start)
        if idx == -1:
            break
        spans.append(Span(term, idx, idx + len(t)))
        start = idx + 1
    return spans
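
The two helpers in this cell can be tried out on a toy sentence. The sketch below re-implements them in compact form under hypothetical names (`ends_with_function_word`, `char_spans`), with a reduced list of endings and an invented sentence.

```python
BAD_ENDINGS_DEMO = {"di", "dei", "del", "della", "e", "per"}  # illustrative subset


def ends_with_function_word(term: str) -> bool:
    """Multi-word terms ending in a function word are probably truncated."""
    tokens = term.split()
    return len(tokens) >= 2 and tokens[-1] in BAD_ENDINGS_DEMO


def char_spans(sentence: str, term: str):
    """All (start, end) character offsets of term in sentence, case-insensitive."""
    spans, start = [], 0
    s, t = sentence.lower(), term.lower()
    while (idx := s.find(t, start)) != -1:
        spans.append((idx, idx + len(t)))
        start = idx + 1  # advance by one so overlapping matches are also found
    return spans


toy_sentence = "La raccolta dei rifiuti e la raccolta del vetro."
print(ends_with_function_word("raccolta dei"))  # True
print(ends_with_function_word("raccolta"))      # False
print(char_spans(toy_sentence, "Raccolta"))     # [(3, 11), (29, 37)]
```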


In [119]:
def explain_removal_path(term, sentence_text, bert_terms_raw, spacy_terms_raw, train_vocab_norm):
    """
    Show whether a term was removed because:
      - truncated
      - nested
      - generic unigram filter
      - not in train vocabulary
      - overridden by spaCy upgrade
      - dedup
    """
    n = norm(term)

    print("\n=== DEBUG TERM:", term, "→", n, "===")

    # Step 1: raw BERT terms
    print("1) Present in raw BERT?", term in bert_terms_raw)

    # Step 2: truncation check
    if looks_truncated(n):
        print("⚠️  Removed because it looks truncated (ending stopword).")
    
    # Step 3: nested removal check
    spans = find_spans(sentence_text, n)
    print("Span(s):", spans)
    if len(spans) > 1:
        print("Potential nested conflict → check longest-span rule.")

    # Step 4: generic unigram?
    if len(n.split()) == 1:
        if n in GENERIC_HEADS:
            print("⚠️  Likely removed: generic unigram not in training vocab.")
        if n in GENERIC_BAD:
            print("⚠️  Likely removed: generic BAD term (stopword-like).")

    # Step 5: train vocab
    in_vocab = n in train_vocab_norm
    print("In train vocabulary?", in_vocab)

    # Step 6: spaCy upgrade
    up = upgrade_with_longer_spacy(bert_terms_raw, spacy_terms_raw, train_vocab_norm)
    if n not in [norm(x) for x in up]:
        print("❗ Not present after spaCy upgrade → may have been replaced by longer span.")

    print("Done.")


In [120]:
errors, sorted_keys = collect_sentence_errors(ensemble_output, dev_data)

# Print the 15 worst sentences
print_top_error_sentences(errors, sorted_keys, dev_data, top_n=15)

# Debug a specific term in a bad sentence
bad_key = sorted_keys[0]  # worst sentence
doc, par, sent = bad_key
row = next(r for r in dev_data["data"] if (r["document_id"],r["paragraph_id"],r["sentence_id"]) == bad_key)

bert_raw = bert_map.get(bad_key, [])
spacy_raw = spacy_map.get(bad_key, [])
sentence_text = row["sentence_text"]

explain_removal_path("gestione dei rifiuti urbani", sentence_text, bert_raw, spacy_raw, train_vocab_norm)


=== Top 15 error sentences ===

------------------------------------------------------------
[1] Doc: doc_salerno_05  Par: 7  Sent: 6
TEXT : I cartoni per liquidi vanno conferiti con plastica, acciaio e alluminio.
GOLD : ['conferiti', 'plastica, acciaio e alluminio']
PRED : ['acciaio', 'alluminio', 'cartoni', 'plastica', 'vanno conferiti']
FP   : ['acciaio', 'alluminio', 'cartoni', 'plastica', 'vanno conferiti']
FN   : ['conferiti', 'plastica, acciaio e alluminio']

------------------------------------------------------------
[2] Doc: doc_poggiomarino_12  Par: 17  Sent: 65
TEXT : R1 frigoriferi e sistemi per il condizionamento – R2 lavatrici, lavastoviglie, cucine, scaldabagni – R3 Tv e Monitor – R4 Piccoli elettrodomestici, hardware da Information Tecnology, elettronica di consumo, apparecchi illuminanti – R5 Sorgenti luminose.
GOLD : ['frigoriferi e sistemi per il condizionamento', 'lavatrici, lavastoviglie, cucine, scaldabagni', 'r1', 'r2', 'r3', 'r4', 'r5']
PRED : []
FP   : []
FN 