# **Term Extraction Ensemble (BERT + spaCy + Dictionary)**

This notebook implements a complete post-processing pipeline for the ATE-IT Subtask A (Automatic Term Extraction).  
It takes the raw predictions from a fine-tuned **BERT token classification model** and combines them with **spaCy noun-chunk spans** and a **gold-derived domain vocabulary** to produce a higher-quality list of domain terms for each sentence.

### Pipeline Summary
1. **Load BERT and spaCy predictions**  
   - Import model outputs in ATE-IT JSON format.  
   - Map predictions to sentence identifiers for easy lookup.

2. **Normalize and clean BERT terms**  
   - Strip punctuation at the edges, unify quotes, lowercase, collapse whitespace.  
   - Filter out spurious or generic one-word candidates.

3. **Build a domain vocabulary from the gold training set**  
   - Normalize gold terms.  
   - Track frequencies to identify strong (repeated) vs. weak (rare) terms.

4. **Merge BERT + spaCy + Dictionary knowledge**  
   - **Upgrade** short BERT terms to longer spaCy spans when they form a valid multi-word expression present in the gold vocabulary.  
   - **Add** additional spaCy multi-word spans only if they appear in the gold vocabulary.  
   - **Filter out** generic, meaningless, or uninformative unigrams.  
   - **Normalize and deduplicate** final terms.

5. **Generate final ensemble predictions**  
   - For each sentence, produce an improved term list combining all signals.  
   - Output saved in ATE-IT JSON format.

### Goal
The notebook aims to improve both the recall and the precision of automatic term extraction by combining:
- contextual predictions (BERT),
- linguistic structure (spaCy),
- and domain consistency (gold vocabulary).

This hybrid ensemble is expected to outperform each component used alone.


#### Imports and file paths
We load:
- the **train** and **dev** files (gold), to extract the gold vocabulary,
- the **test** file, for the sentence text and sentence IDs,
- BERT and spaCy predictions on the test set.
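
For orientation, the sketch below shows the record shape that the code in this notebook relies on: a top-level `data` list whose entries carry `document_id`, `paragraph_id`, `sentence_id`, `sentence_text` and `term_list`. The field names are taken from the cells below; the values (and the ID types) are invented for illustration.

```python
import json

# Hypothetical example record: only the field names come from the notebook code.
example = {
    "data": [
        {
            "document_id": "doc_001",
            "paragraph_id": 1,
            "sentence_id": 2,
            "sentence_text": "La raccolta differenziata riduce i rifiuti.",
            "term_list": ["raccolta differenziata", "rifiuti"],
        }
    ]
}

# Round-trip through JSON with the same options that save_json() uses.
serialized = json.dumps(example, ensure_ascii=False, indent=2)
restored = json.loads(serialized)

# The (document, paragraph, sentence) triple identifies one sentence.
first = restored["data"][0]
key = (first["document_id"], first["paragraph_id"], first["sentence_id"])
print(key, first["term_list"])
```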

In [12]:
import json, os
def load_json(path: str):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
    

def save_json(obj, path: str):
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)
    print(f"✓ Saved cleaned predictions to {path}")

In [None]:
DATA_DIR = "../../data/"
PRED_DIR = "../../src/final_train_dev_training/predictions/"

TRAIN_FILE = os.path.join(DATA_DIR, "subtask_a_train.json")
DEV_FILE = os.path.join(DATA_DIR, "subtask_a_dev.json")
TEST_FILE  = os.path.join(DATA_DIR, "test.json")

BERT_TEST_PRED_FILE = os.path.join(
    PRED_DIR, "subtask_a_test_bert_preds_train_dev_cleaned.json"
)
SPACY_TEST_PRED_FILE = os.path.join( 
    PRED_DIR, "subtask_a_test_spacy_trained_preds_train_dev.json" 
)

ENSEMBLE_OUT_FILE = os.path.join(
    PRED_DIR, "subtask_a_test_ensemble_bert_spacy_dictfilter.json"
)

os.makedirs(PRED_DIR, exist_ok=True)

In [14]:
train_data = load_json(TRAIN_FILE)
dev_data   = load_json(DEV_FILE)

# merge the two sets
full_train_dev = {
    "data": train_data["data"] + dev_data["data"]
}


### 1. Normalization and training vocabulary

We first define a canonical normalization function `norm()`. The cells that follow use it to build:
- a **normalized vocabulary** of gold terms from the training set,
- a helper map from prediction JSONs to `(doc, par, sent) → term_list`.


In [15]:
import re
import unicodedata

def norm(t: str) -> str:
    if not t:
        return ""
    t = t.lower()
    t = unicodedata.normalize("NFKC", t)
    t = t.replace("’", "'").replace("`", "'")
    t = t.replace("“", '"').replace("”", '"')
    t = " ".join(t.split())
    # strip punctuation and leftover spaces at the edges
    t = t.strip(" .,;:-'\"()[]{}")
    return t
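
A few sample inputs make the effect of the normalization concrete. The block below mirrors the notebook's `norm()` so that it runs on its own; the sample strings are made up for illustration.

```python
import unicodedata

def norm(t: str) -> str:
    # Mirrors the notebook's norm(): lowercase, NFKC, unify quotes, trim edges.
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("\u2019", "'").replace("`", "'")        # curly apostrophe, backtick
    t = t.replace("\u201c", '"').replace("\u201d", '"')   # curly double quotes
    t = " ".join(t.split())                               # collapse whitespace
    return t.strip(" .,;:-'\"()[]{}")                     # trim edge punctuation

# Hypothetical inputs: quoted and padded term, bracketed acronym,
# curly apostrophe inside a word, empty string.
samples = ["  \u201cRaccolta   Differenziata\u201d. ", "(RAEE)", "l\u2019umido", ""]
normalized = [norm(s) for s in samples]
print(normalized)
```

Only punctuation at the edges is removed, so an apostrophe inside a word survives.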

### 2. Build a gold-derived vocabulary

We now construct a **normalized vocabulary** of domain terms from the gold training set.  
This vocabulary is later used to:
- validate candidate terms,
- decide which spaCy spans are trustworthy,
- avoid keeping generic words that never appear as gold terms.
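
The validation step can be sketched on toy data. The block repeats `norm()` and the `build_train_vocab()` function from the next cell so that it is self-contained; the toy datasets and candidate terms are assumptions made for the example.

```python
import unicodedata

def norm(t: str) -> str:
    # Mirrors the notebook's norm(): lowercase, NFKC, unify quotes, trim edges.
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("\u2019", "'").replace("`", "'")
    t = t.replace("\u201c", '"').replace("\u201d", '"')
    t = " ".join(t.split())
    return t.strip(" .,;:-'\"()[]{}")

def build_train_vocab(*datasets) -> set:
    # Accepts either {"data": [...]} dictionaries or plain lists of entries.
    vocab = set()
    for data in datasets:
        rows = data["data"] if isinstance(data, dict) and "data" in data else data
        for entry in rows:
            for term in entry.get("term_list", []):
                n = norm(term)
                if n:
                    vocab.add(n)
    return vocab

# Hypothetical gold data in the two accepted shapes.
toy_train = {"data": [{"term_list": ["Raccolta differenziata", "RAEE"]}]}
toy_dev = [{"term_list": ["raccolta differenziata.", "isola ecologica"]}, {}]

vocab = build_train_vocab(toy_train, toy_dev)
print(sorted(vocab))

# A candidate is validated when its normalized form is a known gold term.
candidates = ["Isola Ecologica", "servizio"]
validated = [c for c in candidates if norm(c) in vocab]
print(validated)
```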


In [16]:

def build_train_vocab(*datasets) -> set:
    vocab = set()
    for data in datasets:
        rows = data["data"] if isinstance(data, dict) and "data" in data else data
        for entry in rows:
            for term in entry.get("term_list", []):
                n = norm(term)
                if n:
                    vocab.add(n)
    return vocab



def build_term_map(pred_json: dict) -> dict:
    """
    Build a mapping:
        (document_id, paragraph_id, sentence_id) -> list of predicted terms
    from a prediction JSON in the ATE-IT format.
    """
    m = {}
    for e in pred_json["data"]:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        m[key] = e.get("term_list", []) or []
    return m
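
As a quick illustration of how `build_term_map()` lets two prediction files be aligned, the sketch below runs it on two small, invented prediction dictionaries. The IDs and terms are hypothetical.

```python
def build_term_map(pred_json: dict) -> dict:
    # Same logic as the notebook helper: sentence key -> list of predicted terms.
    m = {}
    for e in pred_json["data"]:
        key = (e["document_id"], e["paragraph_id"], e["sentence_id"])
        m[key] = e.get("term_list", []) or []
    return m

# Hypothetical predictions from two systems for the same document.
bert_pred = {"data": [
    {"document_id": "d1", "paragraph_id": 0, "sentence_id": 0, "term_list": ["rifiuti"]},
    {"document_id": "d1", "paragraph_id": 0, "sentence_id": 1, "term_list": None},
]}
spacy_pred = {"data": [
    {"document_id": "d1", "paragraph_id": 0, "sentence_id": 0, "term_list": ["rifiuti urbani"]},
]}

bert_map = build_term_map(bert_pred)
spacy_map = build_term_map(spacy_pred)

# Align both systems on the sentence key; a missing sentence falls back to [].
aligned = {k: (bert_map[k], spacy_map.get(k, [])) for k in bert_map}
print(aligned)
```

A `term_list` of `None` is turned into an empty list, so downstream code can iterate without extra checks.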

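#### Helper: map predictions to sentence IDs

The helper `build_term_map()` defined in the previous cell converts a prediction JSON in ATE-IT format into a dictionary:
`(document_id, paragraph_id, sentence_id) -> list of predicted terms`.

This makes it easy to align BERT and spaCy predictions for the same sentence.

The next cell adds a frequency-aware variant of the vocabulary builder, `build_train_vocab_with_freq()`, which labels gold terms as strong (seen at least 3 times) or weak (seen exactly once); terms seen twice fall into neither set.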


In [17]:
from collections import Counter

def build_train_vocab_with_freq(train_data):
    freq = Counter()
    for e in train_data["data"]:
        for term in e.get("term_list", []):
            n = norm(term)
            if n:
                freq[n] += 1
    strong = {t for t, c in freq.items() if c >= 3}
    weak   = {t for t, c in freq.items() if c == 1}
    return freq, strong, weak
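
The strong/weak split can be checked on a toy dataset. The block below repeats `norm()` and `build_train_vocab_with_freq()` so that it runs on its own; the three toy entries are invented.

```python
import unicodedata
from collections import Counter

def norm(t: str) -> str:
    # Mirrors the notebook's norm(): lowercase, NFKC, unify quotes, trim edges.
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("\u2019", "'").replace("`", "'")
    t = t.replace("\u201c", '"').replace("\u201d", '"')
    t = " ".join(t.split())
    return t.strip(" .,;:-'\"()[]{}")

def build_train_vocab_with_freq(train_data):
    # Count normalized gold terms, then split by frequency thresholds.
    freq = Counter()
    for e in train_data["data"]:
        for term in e.get("term_list", []):
            n = norm(term)
            if n:
                freq[n] += 1
    strong = {t for t, c in freq.items() if c >= 3}
    weak = {t for t, c in freq.items() if c == 1}
    return freq, strong, weak

# Hypothetical gold annotations.
toy = {"data": [
    {"term_list": ["RAEE", "rifiuti urbani"]},
    {"term_list": ["raee", "isola ecologica"]},
    {"term_list": ["Raee", "isola ecologica"]},
]}

freq, strong, weak = build_train_vocab_with_freq(toy)
print(dict(freq), strong, weak)
```

Note that a term seen exactly twice (here `isola ecologica`) is neither strong nor weak under these thresholds.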



### 3. Generic heads and acronym heuristics

We now define:
- a small list of **generic heads** (e.g. *rifiuti, materiali, servizio*),
- a heuristic to detect **acronyms** (e.g. *RAEE*, *TARI*; it accepts letters only, so *R1* does not match),
- a filter for **generic unigrams** that never appear as gold terms.

The goal is to:
- keep important single-word terms if they are in the gold vocabulary,
- discard only very generic heads that never occur as true domain terms.


In [18]:
GENERIC_HEADS = {
    "rifiuti", "materiali", "utenti", "plastica", "carta",
    "residui", "tariffe", "gestore", "servizio", "modalità",
    "conferimento", "costi", "parte", "quota", "impianto"
}


def looks_like_acronym(n: str) -> bool:
    # e.g. "tmb", "raee", "r.a.e.e." (letters only, 2 to 6 characters once dots are removed)
    n_clean = n.replace(".", "")
    return 2 <= len(n_clean) <= 6 and n_clean.isalpha()


def filter_generic_unigrams(terms, train_vocab_norm):
    filtered = []
    for t in terms:
        n = norm(t)
        if len(n.split()) == 1 and n in GENERIC_HEADS and n not in train_vocab_norm:
            # discard "quota", "parte", etc. if they never occur as gold terms;
            # looks_like_acronym() is not consulted: a generic head outside the
            # gold vocabulary is dropped regardless of its shape
            continue
        filtered.append(t)
    return filtered
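
A small run shows which unigrams the filter removes. The sketch below repeats `norm()`, `GENERIC_HEADS` and a single-condition form of `filter_generic_unigrams()`; the input terms and the one-word vocabulary are assumptions for the example.

```python
import unicodedata

def norm(t: str) -> str:
    # Mirrors the notebook's norm(): lowercase, NFKC, unify quotes, trim edges.
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("\u2019", "'").replace("`", "'")
    t = t.replace("\u201c", '"').replace("\u201d", '"')
    t = " ".join(t.split())
    return t.strip(" .,;:-'\"()[]{}")

GENERIC_HEADS = {
    "rifiuti", "materiali", "utenti", "plastica", "carta",
    "residui", "tariffe", "gestore", "servizio", "modalità",
    "conferimento", "costi", "parte", "quota", "impianto",
}

def filter_generic_unigrams(terms, train_vocab_norm):
    # Drop a term only if it is a single generic head that is not a gold term.
    filtered = []
    for t in terms:
        n = norm(t)
        if len(n.split()) == 1 and n in GENERIC_HEADS and n not in train_vocab_norm:
            continue
        filtered.append(t)  # the original surface form is kept
    return filtered

# Hypothetical candidates and gold vocabulary.
terms = ["Quota", "rifiuti", "servizio", "quota variabile", "compostaggio"]
vocab = {"rifiuti"}

kept = filter_generic_unigrams(terms, vocab)
print(kept)
```

Here `rifiuti` survives because the toy vocabulary whitelists it, while multiword terms and non-generic unigrams are never touched by this filter.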

### 4. Multiword upgrade: BERT → spaCy spans

BERT sometimes predicts short fragments (e.g. *ferrosi*) where the gold term is a
longer span (e.g. *materiali ferrosi*).

We therefore:
1. Look for **spaCy multiword spans** that:
   - are present in the gold vocabulary,
   - contain the BERT term as a contiguous token subsequence.
2. If such a span exists, we **upgrade** the BERT term to the longer spaCy span.
3. We also maintain a small list of **GENERIC_BAD** terms that we never keep.
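
The "contiguous token subsequence" condition can be illustrated in isolation. The sketch below is a compact rewrite of the `contains_as_subspan()` function from the next cell, intended to behave the same way; the example strings are invented.

```python
def contains_as_subspan(longer: str, shorter: str) -> bool:
    # True when the tokens of `shorter` appear consecutively inside `longer`.
    long_tokens, short_tokens = longer.split(), shorter.split()
    n, m = len(long_tokens), len(short_tokens)
    return any(long_tokens[i:i + m] == short_tokens for i in range(n - m + 1))

whole_token = contains_as_subspan("materiali ferrosi", "ferrosi")
substring_only = contains_as_subspan("materiali ferrosi", "ferro")
contiguous = contains_as_subspan("raccolta differenziata dei rifiuti", "differenziata dei")
with_gap = contains_as_subspan("raccolta dei rifiuti", "raccolta rifiuti")
print(whole_token, substring_only, contiguous, with_gap)
```

Matching is done on whole tokens, so a character-level substring such as `ferro` does not match inside `ferrosi`, and tokens separated by a gap do not count as a subspan.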


In [19]:
GENERIC_BAD = {
    "parte", "gestione", "città", "territorio", "comune",
    "ore", "no", "si", "anno", "mese", "giorno"
} 
def contains_as_subspan(longer: str, shorter: str) -> bool:
    long_tokens = longer.split()
    short_tokens = shorter.split()
    L, S = len(long_tokens), len(short_tokens)
    if S > L:
        return False
    for i in range(L - S + 1):
        if long_tokens[i:i+S] == short_tokens:
            return True
    return False


def upgrade_with_longer_spacy(bert_terms, spacy_terms, train_vocab_norm):
    """
    Upgrade BERT terms to longer spaCy spans ONLY WHEN BENEFICIAL.
    """
    final = []
    seen = set()
    
    spacy_norm_map = {norm(t): t for t in spacy_terms or []}

    for b in bert_terms or []:
        b_norm = norm(b)
        if not b_norm or b_norm in GENERIC_BAD:
            continue

        best = None

        # search longest valid spaCy span containing the BERT term
        for s_norm, s in spacy_norm_map.items():
            if len(s_norm.split()) < 2:
                continue
            if s_norm not in train_vocab_norm:
                continue
            if contains_as_subspan(s_norm, b_norm):
                if best is None or len(s_norm.split()) > len(norm(best).split()):
                    best = s


        chosen = best if best else b
        c_norm = norm(chosen)

        if c_norm not in seen and c_norm not in GENERIC_BAD:
            final.append(chosen)
            seen.add(c_norm)

    return final
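
To see the upgrade in action, the sketch below reproduces the functions of this cell in compact form and runs them on invented inputs. The BERT terms, spaCy spans and vocabulary are hypothetical, and `GENERIC_BAD` is shortened to two entries.

```python
import unicodedata

def norm(t: str) -> str:
    # Mirrors the notebook's norm(): lowercase, NFKC, unify quotes, trim edges.
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("\u2019", "'").replace("`", "'")
    t = t.replace("\u201c", '"').replace("\u201d", '"')
    t = " ".join(t.split())
    return t.strip(" .,;:-'\"()[]{}")

GENERIC_BAD = {"parte", "gestione"}  # shortened list for the example

def contains_as_subspan(longer: str, shorter: str) -> bool:
    lt, st = longer.split(), shorter.split()
    return any(lt[i:i + len(st)] == st for i in range(len(lt) - len(st) + 1))

def upgrade_with_longer_spacy(bert_terms, spacy_terms, train_vocab_norm):
    final, seen = [], set()
    spacy_norm_map = {norm(t): t for t in spacy_terms or []}
    for b in bert_terms or []:
        b_norm = norm(b)
        if not b_norm or b_norm in GENERIC_BAD:
            continue
        best = None
        for s_norm, s in spacy_norm_map.items():
            # Only multiword spans that are known gold terms may replace b.
            if len(s_norm.split()) < 2 or s_norm not in train_vocab_norm:
                continue
            if contains_as_subspan(s_norm, b_norm):
                if best is None or len(s_norm.split()) > len(norm(best).split()):
                    best = s
        chosen = best if best else b
        c_norm = norm(chosen)
        if c_norm not in seen and c_norm not in GENERIC_BAD:
            final.append(chosen)
            seen.add(c_norm)
    return final

# Hypothetical inputs.
bert = ["ferrosi", "parte", "compostaggio", "materiali ferrosi"]
spacy = ["materiali ferrosi", "i materiali ferrosi pesanti", "compostaggio domestico"]
vocab = {"materiali ferrosi", "compostaggio"}

upgraded = upgrade_with_longer_spacy(bert, spacy, vocab)
print(upgraded)
```

In this toy run `ferrosi` is upgraded to `materiali ferrosi`, `parte` is dropped as generic, `compostaggio` stays unchanged because `compostaggio domestico` is not a gold term, and the second `materiali ferrosi` is skipped as a duplicate.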


### 5. BERT + spaCy + vocabulary merge
We now define the main merge function that combines:
- cleaned **BERT terms**,
- **spaCy noun-chunk spans**,
- and the **gold-derived vocabulary**.

The strategy is:
1. First, **upgrade** BERT terms to longer spaCy spans when they match a gold term.
2. Then, **add extra spaCy multiword spans** that:
   - are in the gold vocabulary,
   - are not already covered,
   - are not clearly generic or meaningless.


In [20]:
def merge_bert_spacy_with_dict(bert_terms, spacy_terms, train_vocab_norm):
    """
    BEST ensemble so far:
    1. upgrade BERT with spaCy
    2. add dictionary-filtered spaCy spans
    3. skip generic or meaningless words
    """
    upgraded = upgrade_with_longer_spacy(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm,
    )

    final = upgraded[:]
    seen = {norm(t) for t in upgraded}

    for s in spacy_terms or []:
        s_norm = norm(s)

        if len(s_norm.split()) < 2:
            continue
        if s_norm not in train_vocab_norm:
            continue
        if s_norm in seen:
            continue
        if s_norm in GENERIC_BAD:
            continue

        final.append(s)
        seen.add(s_norm)

    return final
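
The second stage of the merge (adding dictionary-filtered spaCy spans) can be sketched separately from the upgrade step. The helper name `add_dict_filtered_spans` is hypothetical and exists only for this example; it follows the loop above, counts tokens on the normalized span, and uses a shortened `GENERIC_BAD`. All inputs are invented.

```python
import unicodedata

def norm(t: str) -> str:
    # Mirrors the notebook's norm(): lowercase, NFKC, unify quotes, trim edges.
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("\u2019", "'").replace("`", "'")
    t = t.replace("\u201c", '"').replace("\u201d", '"')
    t = " ".join(t.split())
    return t.strip(" .,;:-'\"()[]{}")

GENERIC_BAD = {"parte", "gestione"}  # shortened list for the example

def add_dict_filtered_spans(base_terms, spacy_terms, train_vocab_norm):
    # Hypothetical helper: append spaCy multiword spans that are gold terms
    # and are not already covered by base_terms.
    final = list(base_terms)
    seen = {norm(t) for t in base_terms}
    for s in spacy_terms or []:
        s_norm = norm(s)
        if len(s_norm.split()) < 2:          # multiword spans only
            continue
        if s_norm not in train_vocab_norm:   # must be a known gold term
            continue
        if s_norm in seen or s_norm in GENERIC_BAD:
            continue
        final.append(s)
        seen.add(s_norm)
    return final

# Hypothetical inputs: output of the upgrade step plus raw spaCy spans.
upgraded = ["materiali ferrosi"]
spacy = ["materiali ferrosi", "Raccolta Differenziata", "servizio pubblico", "rifiuti"]
vocab = {"materiali ferrosi", "raccolta differenziata", "rifiuti"}

merged = add_dict_filtered_spans(upgraded, spacy, vocab)
print(merged)
```

The spaCy unigram `rifiuti` is not added even though it is a gold term, because this stage is restricted to multiword spans; unigrams enter the result only through BERT.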


### 6. Per-sentence merge helper
We define a helper `merge_sentence()` that:
1. Applies the BERT+spaCy+vocabulary merge strategy.
2. Removes only **truly generic unigrams** (using the gold vocabulary as a whitelist).
3. Normalizes and deduplicates the final term list.

This function is called once per sentence in the test set.


In [21]:
def merge_sentence(bert_terms, spacy_terms, train_vocab_norm):
    merged = merge_bert_spacy_with_dict(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm
    )
    merged = filter_generic_unigrams(merged, train_vocab_norm)

    # dedupe and normalize
    seen = set()
    final = []
    for t in merged:
        n = norm(t)
        if n not in seen:
            final.append(n)
            seen.add(n)
    return final
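
The final normalize-and-deduplicate step keeps the first occurrence of each normalized term and preserves order. A minimal sketch, with a hypothetical helper name `normalize_and_dedupe` and invented inputs:

```python
import unicodedata

def norm(t: str) -> str:
    # Mirrors the notebook's norm(): lowercase, NFKC, unify quotes, trim edges.
    if not t:
        return ""
    t = unicodedata.normalize("NFKC", t.lower())
    t = t.replace("\u2019", "'").replace("`", "'")
    t = t.replace("\u201c", '"').replace("\u201d", '"')
    t = " ".join(t.split())
    return t.strip(" .,;:-'\"()[]{}")

def normalize_and_dedupe(terms):
    # Hypothetical helper: same loop as the tail of merge_sentence().
    seen, final = set(), []
    for t in terms:
        n = norm(t)
        if n not in seen:
            final.append(n)
            seen.add(n)
    return final

# Surface variants of the same terms (invented).
raw = ["Raccolta Differenziata", "raccolta differenziata.", "RAEE", "(raee)", "compostaggio"]

deduped = normalize_and_dedupe(raw)
# Equivalent one-liner: dict keys preserve insertion order.
deduped_alt = list(dict.fromkeys(norm(t) for t in raw))
print(deduped)
```

Because the output contains normalized strings, the submitted terms are lowercase even when the prediction files contained capitalized forms.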


### 7. Build the BERT + spaCy ensemble using `merge_sentence()`

In [22]:
from tqdm import tqdm  # import the function, not the module

# ---- Load TRAIN + DEV data and build vocabulary ----
print("Loading TRAIN + DEV data to build vocabulary...")
train_data = load_json(TRAIN_FILE)
dev_data   = load_json(DEV_FILE)

train_vocab_norm = build_train_vocab(train_data, dev_data)
print(f"# unique normalized terms from TRAIN+DEV gold: {len(train_vocab_norm)}")

# ---- Load TEST data (to iterate over sentences in the correct order) ----
print("Loading TEST data...")
test_data = load_json(TEST_FILE)
test_rows = test_data["data"] if isinstance(test_data, dict) and "data" in test_data else test_data
print(f"# test sentences: {len(test_rows)}")

# ---- Load BERT, SpaCy predictions ----
print("Loading BERT predictions (TEST)...")
bert_pred = load_json(BERT_TEST_PRED_FILE)
bert_map = build_term_map(bert_pred)

print("Loading SpaCy predictions (TEST)...")
spacy_pred = load_json(SPACY_TEST_PRED_FILE)
spacy_map = build_term_map(spacy_pred)

# ---- Build ensemble predictions using merge_sentence ----
ensemble_output = {"data": []}

print("\nBuilding BERT + SpaCy ensemble on TEST ...")

for idx, row in enumerate(tqdm(test_rows)):
    key = (row["document_id"], row["paragraph_id"], row["sentence_id"])

    bert_terms  = bert_map.get(key, []) or []
    spacy_terms = spacy_map.get(key, []) or []

    merged_terms = merge_sentence(
        bert_terms=bert_terms,
        spacy_terms=spacy_terms,
        train_vocab_norm=train_vocab_norm
    )

    # Debug output for the first 3 sentences
    if idx < 3:
        print("\n---------------------------------------")
        print("Sentence", idx)
        print("TEXT:", row["sentence_text"])
        print("  BERT      :", bert_terms)
        print("  SPACY     :", spacy_terms)
        print("  MERGED OUT:", merged_terms)

    ensemble_output["data"].append({
        "document_id": row["document_id"],
        "paragraph_id": row["paragraph_id"],
        "sentence_id": row["sentence_id"],
        "term_list": merged_terms,
    })

# ---- Save final merged predictions (SUBMISSION) ----
save_json(ensemble_output, ENSEMBLE_OUT_FILE)
print(f"\nEnsemble predictions saved to: {ENSEMBLE_OUT_FILE}")


Loading TRAIN + DEV data to build vocabulary...
# unique normalized terms from TRAIN+DEV gold: 812
Loading TEST data...
# test sentences: 1142
Loading BERT predictions (TEST)...
Loading SpaCy predictions (TEST)...

Building BERT + SpaCy ensemble on TEST ...


100%|██████████| 1142/1142 [00:00<00:00, 137565.56it/s]


---------------------------------------
Sentence 0
TEXT: COMUNE DI AMATO
  BERT      : []
  SPACY     : []
  MERGED OUT: []

---------------------------------------
Sentence 1
TEXT: PROVINCIA DI CATANZARO
  BERT      : []
  SPACY     : []
  MERGED OUT: []

---------------------------------------
Sentence 2
TEXT: (UFFICIO DEL SINDACO)
  BERT      : []
  SPACY     : []
  MERGED OUT: []
✓ Saved cleaned predictions to ../../src/final_train_dev_training/predictions/subtask_a_test_ensemble_bert_spacy_dictfilter.json

Ensemble predictions saved to: ../../src/final_train_dev_training/predictions/subtask_a_test_ensemble_bert_spacy_dictfilter.json
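
Because the loop reads predictions with `.get(key, [])`, a sentence that is absent from a prediction file silently receives an empty term list. A simple coverage check, sketched below with a hypothetical helper `missing_keys` and toy data, can reveal such gaps before submission.

```python
def missing_keys(rows, term_map):
    """Return the sentence keys found in rows but absent from term_map."""
    missing = []
    for r in rows:
        key = (r["document_id"], r["paragraph_id"], r["sentence_id"])
        if key not in term_map:
            missing.append(key)
    return missing

# Hypothetical test rows and a prediction map that lacks one sentence.
toy_rows = [
    {"document_id": "d1", "paragraph_id": 0, "sentence_id": 0},
    {"document_id": "d1", "paragraph_id": 0, "sentence_id": 1},
]
toy_bert_map = {("d1", 0, 0): ["rifiuti"]}

missing = missing_keys(toy_rows, toy_bert_map)
print(f"{len(missing)} sentence(s) without predictions: {missing}")
```

In the notebook the same function could be called as `missing_keys(test_rows, bert_map)` and `missing_keys(test_rows, spacy_map)`.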





#### Save predictions