# LLM-based Reranking of Candidate Terms for ATE-IT (Subtask A)

This notebook shows how to:

1. Load the ensemble predictions from **BERT + spaCy + dictionary filter**  
   (`subtask_a_dev_ensemble_bert_spacy_dictfilter.json`).
2. (Optionally) Load the original **dev sentences** (to give context to the LLM).
3. Call a **Gemini LLM** to **rerank and filter** candidate terms for each sentence.
4. Save the new predictions in the **same competition format** (JSON with `data` → `term_list`).

### Imports and basic setup

In [68]:
import os
import json
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional

from dotenv import load_dotenv
import google.generativeai as genai
from tqdm import tqdm

print("✓ Imports loaded")


✓ Imports loaded


In [69]:
# Current working dir = .../ATE-IT_SofiaMaule/src
CWD = Path.cwd()
REPO_ROOT = CWD.parent  # .../ATE-IT_SofiaMaule

PREDICTIONS_PATH = REPO_ROOT / "src" / "predictions" / "subtask_a_dev_ensemble_bert_spacy_dictfilter.json"
DEV_SENTENCES_PATH = REPO_ROOT / "data" / "subtask_a_dev.json"
OUTPUT_PATH = REPO_ROOT / "src" / "predictions" / "subtask_a_dev_reranked_llm_2.json"

print("Current working dir:", CWD)
print("Repo root          :", REPO_ROOT)
print("Predictions path   :", PREDICTIONS_PATH)
print("Dev sentences path :", DEV_SENTENCES_PATH)
print("Output path        :", OUTPUT_PATH)




Current working dir: c:\Users\super\Documents\UniPd\ATA\ATE-IT_SofiaMaule\src
Repo root          : c:\Users\super\Documents\UniPd\ATA\ATE-IT_SofiaMaule
Predictions path   : c:\Users\super\Documents\UniPd\ATA\ATE-IT_SofiaMaule\src\predictions\subtask_a_dev_ensemble_bert_spacy_dictfilter.json
Dev sentences path : c:\Users\super\Documents\UniPd\ATA\ATE-IT_SofiaMaule\data\subtask_a_dev.json
Output path        : c:\Users\super\Documents\UniPd\ATA\ATE-IT_SofiaMaule\src\predictions\subtask_a_dev_reranked_llm_2.json


### 3. Initialize GROQ  model
 with an API key stored in `.env` as `GROQI_API_KEY`.




In [70]:
from groq import Groq
from dotenv import load_dotenv
import os

# Load env
load_dotenv()

api_key = os.getenv("GROQ_API_KEY")
if not api_key:
    raise RuntimeError("GROQ_API_KEY not found in .env")

client = Groq(api_key=api_key)

print("✓ Groq client initialized")

# Choose a Groq-supported model
#model_name = "openai/gpt-oss-20b"
model_name = "qwen/qwen3-32b"


✓ Groq client initialized


In [71]:
# Load predictions

with open(PREDICTIONS_PATH, "r", encoding="utf-8") as f:
    ensemble_pred = json.load(f)

# Sanity check that the top-level key is "data"
assert "data" in ensemble_pred, "Unexpected format: 'data' key not found in predictions JSON"

print(f"Loaded {len(ensemble_pred['data'])} prediction entries")

#  Load dev sentences

if DEV_SENTENCES_PATH.exists():
    with open(DEV_SENTENCES_PATH, "r", encoding="utf-8") as f:
        dev_sentences = json.load(f)
    assert "data" in dev_sentences, "Unexpected format: 'data' key not found in dev sentences JSON"
    print(f"Loaded {len(dev_sentences['data'])} dev sentences")
else:
    dev_sentences = None
    print("Dev sentences file not found. Reranking will use only candidate terms, without sentence context.")


Loaded 577 prediction entries
Loaded 577 dev sentences


In [72]:
# Build index: (doc, par, sent) -> sentence_text

sentence_index: Dict[Tuple[str, int, int], str] = {}

if dev_sentences is not None:
    for entry in dev_sentences["data"]:
        key = (entry["document_id"], entry["paragraph_id"], entry["sentence_id"])
        sentence_index[key] = entry["sentence_text"]

    print(f"Indexed {len(sentence_index)} sentence texts")


Indexed 577 sentence texts


### Define the LLM prompt for reranking

We now define a **prompt template**. For each sentence, we will provide:

- The **sentence text** (if available).
- The list of **candidate terms** from BERT+spaCy.

We ask the LLM to:

1. Decide which candidates are **real domain-relevant terms** in context.
2. Assign a **score** in `[0, 1]` (e.g., 0.0–1.0) representing term quality.
3. Return a **strict JSON** object with this structure:

```json
{
  "reranked_terms": [
    {"term": "centri di raccolta", "score": 0.95, "keep": true},
    {"term": "disciplina", "score": 0.80, "keep": true},
    ...
  ]
}
Then we will:
- Keep only items where keep == true.
- Sort them by score descending.
- Use the sorted term strings as the new term_list for that sentence.

## 1. Domain-aware reranking prompt

We enrich the system prompt with **explicit domain knowledge** and tell the model that its scores will be combined with a rule-based domain scorer.


In [73]:
system_prompt_rerank = """
You are an automatic term extraction *reranking* agent for Italian municipal waste management texts.

You will receive:
- one sentence (in Italian), and
- a list of candidate terms extracted by a baseline system.

Your task:
- For each candidate term, decide if it is a good domain-relevant term in the context of the sentence.
- Assign a relevance score between 0.0 and 1.0 (higher = better).
- Decide whether to keep or discard each candidate.

A valid "term" in this task is:
- a single- or multi-word expression
- that refers to a concept in the municipal waste management domain
- typically nouns or noun phrases (sometimes adjectives or verbs as part of a phrase)
- examples: "tassa rifiuti", "tari", "isola ecologica comunale", "impianto di trattamento rifiuti urbani"

Non-terms are:
- generic function words (e.g., "e", "di", "per", "che")
- pure numbers or dates not part of a waste term
- person names, city names, street names (unless part of an official name of a waste service)

IMPORTANT DOMAIN-SPECIFIC RULES:
- DO NOT output generic single materials as terms (e.g., "plastica", "carta", "metalli", "metallo", "alluminio", "vetro"),
  unless they are part of a multi-word term (e.g., "raccolta plastica", "plastica, acciaio e alluminio").
- Prefer complete multi-word terms over shorter fragments.
  For example, prefer "modalità di conferimento" over "modalità" alone.
- Single-word terms are usually NOT valid unless they refer to well-defined waste concepts
  such as "TARI", "TARES", or "disciplinare" when clearly used as the waste regulation.
- DO NOT discard relevant acronyms commonly used in the waste domain, such as:
  "CCR", "RUP", "RAEE", "R.A.E.E.", "PAP". These acronyms should normally be kept as valid terms.

OUTPUT FORMAT (STRICT JSON):

You MUST output ONLY a JSON object with this exact structure:

{
  "reranked_terms": [
    {"term": "...", "score": 0.0, "keep": true or false},
    ...
  ]
}

Rules:
- Do not add new terms that are not in the candidate list.
- Do not modify the spelling of the candidates.
- If a candidate looks truncated or not a full concept, set "keep": false and give it a low score.
- If a candidate matches a valid domain concept, do NOT remove it just because it is short.
- If no candidate is good, you may return an empty list: "reranked_terms": [].
- The JSON must be valid and parseable by Python's json.loads().
"""


In [74]:

def build_llm_input(
    sentence_text: Optional[str],
    candidates: List[str]
) -> str:
    """
    Build the USER part of the prompt sent to Gemini, consistent with the 'System: ... / User: ...' pattern.
    """
    lines = []

    if sentence_text is not None:
        lines.append(f"Sentence (Italian): {sentence_text}")
    else:
        lines.append("Sentence: [NOT AVAILABLE]")

    lines.append("Candidate terms:")
    for t in candidates:
        lines.append(f"- {t}")

    lines.append(
        "\nNow produce ONLY the JSON object with the structure described in the system instructions."
    )

    return "\n".join(lines)


##  Rule-based domain scorer (hybrid with LLM)

We now define a small **domain vocabulary** and a function that adjusts the LLM scores:

- boost terms that match domain patterns (TARI, RAEE, centro di raccolta, conferire, ecc.)
- penalize generic or clearly bad terms
- detect truncated phrases


In [75]:
# Domain vocab + helpers for hybrid scoring

import math
ACRONYMS = {
    "ccr",
    "rup",
    "raee",
    "r.a.e.e.",
    "pap",
}

BAD_SINGLE_WORDS = {
    "plastica",
    "carta",
    "cartone",
    "metallo",
    "metalli",
    "alluminio",
    "vetro",
}
# You can refine / expand this over time
DOMAIN_STRONG_TERMS = {
    "rifiuti urbani",
    "rifiuti ingombranti",
    "rifiuti pericolosi",
    "raccolta differenziata",
    "raccolta porta a porta",
    "servizio di raccolta",
    "servizio di igiene urbana",
    "centro di raccolta",
    "centri di raccolta comunali",
    "isola ecologica",
    "ecocentro",
    "piattaforma ecologica",
    "impianto di trattamento rifiuti",
    "impianto di smaltimento",
    "tassa rifiuti",
    "tari",
    "disciplinare",
    "regolamento",
    "utenze domestiche",
    "utenze non domestiche",
    "modalità di conferimento",
    "modalità di raccolta",
    "conferimento",
    "conferire",
    "conferiti",
    "vanno conferiti",
}

# Substring-based signals (if term contains one of these, it's likely relevant)
DOMAIN_KEYWORD_SUBSTRINGS = [
    "rifiuti",
    "raccolta",
    "confer",
    "ecologic",
    "centro di raccolta",
    "impianto",
    "tariff",
    "tassa",
    "tari",
    "raee",
    "isola ecologica",
]

# Terms that are often too generic if used alone
GENERIC_WEAK_TERMS = {
    "rifiuti",
    "plastica",
    "carta",
    "vetro",
    "metalli",
    "alluminio",
    "legno",
}

FUNCTION_ENDINGS = {"di", "dei", "degli", "delle", "del", "e", "o", "ed", "al", "allo", "alla", "ai", "agli", "alle"}




In [76]:

def normalize_term_text(t: str) -> str:
    return " ".join(t.lower().strip().split())


def looks_truncated(term: str) -> bool:
    """
    Heuristic: terms that end with a function word or are extremely short / odd.
    """
    t = normalize_term_text(term)
    tokens = t.split()
    if len(tokens) == 0:
        return True
    if len(tokens) == 1 and tokens[0] in {"r", "oo"}:
        return True
    if tokens[-1] in FUNCTION_ENDINGS:
        return True
    return False


def contains_domain_substring(term: str) -> bool:
    t = normalize_term_text(term)
    return any(sub in t for sub in DOMAIN_KEYWORD_SUBSTRINGS)


# Hyperparameters for hybrid scoring
LLM_WEIGHT = 0.7       # how much we trust the LLM score
RULE_WEIGHT = 0.3      # how much we trust the rules
KEEP_THRESHOLD = 0.35  # final score threshold to keep a term


def hybrid_score_term(term: str, llm_score: float, sentence_text: str | None = None) -> tuple[float, bool]:
    """
    Combine LLM score with domain-driven rule-based score.

    Returns:
      final_score, keep_flag
    """
    t_norm = normalize_term_text(term)
    base_rule_score = 0.5  # neutral baseline

    # Strong domain terms → boost
    if t_norm in {normalize_term_text(x) for x in DOMAIN_STRONG_TERMS}:
        base_rule_score += 0.3

    # Contains domain-ish substring → slight boost
    if contains_domain_substring(term):
        base_rule_score += 0.15

    # Weak generic terms used alone → penalty
    if t_norm in GENERIC_WEAK_TERMS and len(t_norm.split()) == 1:
        base_rule_score -= 0.2

    # Truncated or clearly bad-looking → strong penalty
    if looks_truncated(term):
        base_rule_score -= 0.4

    # Clip rule score to [0, 1]
    rule_score = max(0.0, min(1.0, base_rule_score))

    # Combine LLM + rules
    llm_score_clipped = max(0.0, min(1.0, llm_score))
    final_score = LLM_WEIGHT * llm_score_clipped + RULE_WEIGHT * rule_score

    # Decision to keep
    
    keep_flag = final_score >= KEEP_THRESHOLD

    # Safety: if LLM says keep AND rule_score > 0.7, force keep even if threshold borderline
    if llm_score_clipped >= 0.6 and rule_score >= 0.7:
        keep_flag = True

    return final_score, keep_flag

###  LLM call helper

We now define a helper function that:

1. Builds the prompt text.
2. Calls `model.generate_content(...)`.
3. Extracts and parses the JSON part.
4. Returns a **list of dicts** (`{"term", "score", "keep"}`) or an empty list if something fails.


We call Groq as before, but then **post-process each term** with `hybrid_score_term(...)` 
and overwrite both `score` and `keep` according to the hybrid logic.


In [77]:
# ✅ Groq-based LLM reranking with simple domain heuristics

def call_llm_rerank(
    sentence_text: Optional[str],
    candidates: List[str],
    dry_run: bool = False
) -> List[Dict[str, Any]]:

    if not candidates:
        return []

    user_prompt = build_llm_input(sentence_text, candidates)

    # Dry-run option to skip API call
    if dry_run:
        return [{"term": c, "score": 0.5, "keep": True} for c in candidates]

    try:
        # Call Groq LLM — ChatCompletion style
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_prompt_rerank},
                {"role": "user", "content": user_prompt},
            ],
            temperature=0.15,
        )

        text = response.choices[0].message.content.strip()

    except Exception as e:
        print("⚠️ LLM call failed:", e)
        # Fallback: keep all as-is
        return [{"term": c, "score": 0.5, "keep": True} for c in candidates]

    # Try to parse JSON from the model output
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Try to extract substring containing JSON
        try:
            start = text.index("{")
            end = text.rindex("}") + 1
            data = json.loads(text[start:end])
        except Exception as e2:
            print("⚠️ JSON parsing failed, using fallback:", e2)
            return [{"term": c, "score": 0.5, "keep": True} for c in candidates]

    reranked = data.get("reranked_terms", [])
    cleaned: List[Dict[str, Any]] = []

    # --- Base cleaning from JSON ---
    for item in reranked:
        term = item.get("term")
        if term is None or term not in candidates:
            continue

        # Safe parsing
        try:
            score = float(item.get("score", 0.0))
        except Exception:
            score = 0.0

        keep = bool(item.get("keep", True))

        cleaned.append({"term": term, "score": score, "keep": keep})

    # --- Domain heuristics post-processing ---

    # 1) Always keep acronyms (if present among candidates)
    for t in candidates:
        if t.lower() in ACRONYMS:
            # Check if already present
            found = next((x for x in cleaned if x["term"] == t), None)
            if found:
                found["keep"] = True
                # boost score a bit
                found["score"] = max(found["score"], 0.8)
            else:
                cleaned.append({"term": t, "score": 0.8, "keep": True})

    # 2) Remove generic single-material words if the model kept them
    for item in cleaned:
        term_lower = item["term"].lower().strip()
        # if it is exactly one token and in BAD_SINGLE_WORDS → force drop
        if len(term_lower.split()) == 1 and term_lower in BAD_SINGLE_WORDS:
            item["keep"] = False
            # optionally lower score
            item["score"] = min(item["score"], 0.1)

    # If everything got filtered out, fall back to keeping all candidates
    if not cleaned:
        cleaned = [{"term": c, "score": 0.5, "keep": True} for c in candidates]

    return cleaned





### Apply reranking to all sentences

We now:

1. Iterate over each entry in `ensemble_pred["data"]`.
2. Extract:
   - `document_id`, `paragraph_id`, `sentence_id`
   - `term_list` (candidate terms)
3. Look up the **sentence text** using `sentence_index` (if available).
4. Call `call_llm_rerank(...)`.
5. Filter and sort terms:
   - Keep only `keep == True`.
   - Sort by `score` descending.
6. Build a new `data` list with the **same structure** as the original predictions, but with reranked `term_list`.

We also add a `dry_run` option for debugging without making real API calls.

In [78]:
#  Apply reranking

def rerank_all_entries(
    predictions: Dict[str, Any],
    sentence_index: Dict[Tuple[str, int, int], str],
    use_sentence_context: bool = True,
    dry_run: bool = False,
) -> Dict[str, Any]:
    """
    Apply LLM reranking to all entries in predictions["data"].

    Returns a new dict with the same structure, but reranked term_list.
    """
    new_data = []

    for entry in tqdm(predictions["data"], desc="Reranking terms"):
        doc_id = entry["document_id"]
        par_id = entry["paragraph_id"]
        sent_id = entry["sentence_id"]
        candidates = entry.get("term_list", [])

        key = (doc_id, par_id, sent_id)
        sentence_text = sentence_index.get(key) if (use_sentence_context and sentence_index) else None

        reranked_items = call_llm_rerank(
            sentence_text=sentence_text,
            candidates=candidates,
            dry_run=dry_run,
        )

        # Filter by keep==True and sort by score (descending)
        kept = [item for item in reranked_items if item["keep"]]
        kept_sorted = sorted(kept, key=lambda x: x["score"], reverse=True)

        new_term_list = [item["term"] for item in kept_sorted]

        new_entry = {
            "document_id": doc_id,
            "paragraph_id": par_id,
            "sentence_id": sent_id,
            "term_list": new_term_list,
        }
        new_data.append(new_entry)

    return {"data": new_data}


## 8. Quick dry-run (no real LLM calls)

Before spending tokens, we can test the pipeline in **dry_run** mode, which:

- Skips real LLM calls.
- Assigns a dummy score of 0.5 to every candidate.
- Keeps all terms, but goes through the whole structure.

This is useful to detect path / format issues.


In [79]:
# 8. Dry run test on a small subset

test_predictions = {
    "data": ensemble_pred["data"][:5]  # only first 5 entries for a quick test
}

reranked_test = rerank_all_entries(
    predictions=test_predictions,
    sentence_index=sentence_index,
    use_sentence_context=True,
    dry_run=True,   # <-- no real API calls
)

print(json.dumps(reranked_test["data"][:2], indent=2, ensure_ascii=False))


Reranking terms: 100%|██████████| 5/5 [00:00<?, ?it/s]

[
  {
    "document_id": "doc_praiano_07",
    "paragraph_id": 32,
    "sentence_id": 7,
    "term_list": []
  },
  {
    "document_id": "doc_caserta_06",
    "paragraph_id": 3,
    "sentence_id": 1,
    "term_list": [
      "disciplinare",
      "centri di raccolta comunali",
      "disciplina",
      "centri di raccolta dei rifiuti urbani raccolti"
    ]
  }
]





In [80]:
# 9. Full reranking with real LLM calls

USE_SENTENCE_CONTEXT = True    # set to False if you don't have dev sentences
DRY_RUN = False                # set to True if you want to test without API calls

reranked_full = rerank_all_entries(
    predictions=ensemble_pred,
    sentence_index=sentence_index,
    use_sentence_context=USE_SENTENCE_CONTEXT,
    dry_run=DRY_RUN,
)

print("✓ Completed LLM reranking")


Reranking terms: 100%|██████████| 577/577 [30:14<00:00,  3.14s/it]

✓ Completed LLM reranking





In [81]:

# 10. Save output

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)

with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    json.dump(reranked_full, f, ensure_ascii=False, indent=2)

print(f"✓ Saved reranked predictions to: {OUTPUT_PATH}")


✓ Saved reranked predictions to: c:\Users\super\Documents\UniPd\ATA\ATE-IT_SofiaMaule\src\predictions\subtask_a_dev_reranked_llm_2.json


## 11. Evaluate reranking with Micro / Type F1

We now evaluate the **reranked LLM output** against the **gold dev annotations**, using the usual:

- `micro_f1_score(...)`
- `type_f1_score(...)`

We assume those functions are already defined in the notebook (as you pasted).


In [82]:
def micro_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Precision, Recall, and F1 score 
    based on individual term matching (micro-average).
    """
    total_true_positives = 0
    total_false_positives = 0
    total_false_negatives = 0
    
    for gold, system in zip(gold_standard, system_output):
        gold_set = set(gold)
        system_set = set(system)
        
        true_positives = len(gold_set.intersection(system_set))
        false_positives = len(system_set - gold_set)
        false_negatives = len(gold_set - system_set)
        
        total_true_positives += true_positives
        total_false_positives += false_positives
        total_false_negatives += false_negatives
    
    precision = total_true_positives / (total_true_positives + total_false_positives) if (total_true_positives + total_false_positives) > 0 else 0
    recall = total_true_positives / (total_true_positives + total_false_negatives) if (total_true_positives + total_false_negatives) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return precision, recall, f1, total_true_positives, total_false_positives, total_false_negatives


def type_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Type Precision, Type Recall, and Type F1 score
    based on the set of unique terms extracted at least once across the entire dataset.
    """
    all_gold_terms = set()
    for item_terms in gold_standard:
        all_gold_terms.update(item_terms)
    
    all_system_terms = set()
    for item_terms in system_output:
        all_system_terms.update(item_terms)
    
    type_true_positives = len(all_gold_terms.intersection(all_system_terms))
    type_false_positives = len(all_system_terms - all_gold_terms)
    type_false_negatives = len(all_gold_terms - all_system_terms)
    
    type_precision = type_true_positives / (type_true_positives + type_false_positives) if (type_true_positives + type_false_positives) > 0 else 0
    type_recall = type_true_positives / (type_true_positives + type_false_negatives) if (type_true_positives + type_false_negatives) > 0 else 0
    type_f1 = 2 * (type_precision * type_recall) / (type_precision + type_recall) if (type_precision + type_recall) > 0 else 0
    
    return type_precision, type_recall, type_f1


print("✓ Evaluation functions defined")

✓ Evaluation functions defined


In [83]:
with open(DEV_SENTENCES_PATH, "r", encoding="utf-8") as f:
    gold_json = json.load(f)

with open(OUTPUT_PATH, "r", encoding="utf-8") as f:
    reranked_json = json.load(f)

with open(PREDICTIONS_PATH, "r", encoding="utf-8") as f:
    ensemble_json = json.load(f)

gold_data = gold_json["data"]
reranked_data = reranked_json["data"]
ensemble_data = ensemble_json["data"]

In [84]:
def check_alignment(gold, system, name: str):
    if len(gold) != len(system):
        raise ValueError(f"[{name}] Length mismatch: gold={len(gold)}, system={len(system)}")

    for i, (g, s) in enumerate(zip(gold, system)):
        g_key = (g["document_id"], g["paragraph_id"], g["sentence_id"])
        s_key = (s["document_id"], s["paragraph_id"], s["sentence_id"])
        if g_key != s_key:
            raise ValueError(
                f"[{name}] ID mismatch at index {i}:\n"
                f"  gold   = {g_key}\n"
                f"  system = {s_key}"
            )

    print(f"✓ Alignment OK for {name}")

check_alignment(gold_data, ensemble_data, name="Ensemble vs Gold")
check_alignment(gold_data, reranked_data, name="Reranked vs Gold")

✓ Alignment OK for Ensemble vs Gold
✓ Alignment OK for Reranked vs Gold


In [85]:
# Extract term lists (gold, ensemble, reranked)

gold_terms = [entry.get("term_list", []) for entry in gold_data]
ensemble_terms = [entry.get("term_list", []) for entry in ensemble_data]
reranked_terms = [entry.get("term_list", []) for entry in reranked_data]

print("Example gold terms     :", gold_terms[1][:5])
print("Example ensemble terms :", ensemble_terms[1][:5])
print("Example reranked terms :", reranked_terms[1][:5])


Example gold terms     : ['disciplina dei centri di raccolta dei rifiuti urbani raccolti in modo differenziato', 'disciplinare per la gestione dei centri di raccolta comunali']
Example ensemble terms : ['disciplinare', 'centri di raccolta comunali', 'disciplina', 'centri di raccolta dei rifiuti urbani raccolti']
Example reranked terms : ['disciplinare', 'centri di raccolta comunali']


In [86]:
# Compute metrics for original ensemble baseline

ens_p, ens_r, ens_f1, ens_tp, ens_fp, ens_fn = micro_f1_score(gold_terms, ensemble_terms)
ens_tp_p, ens_tp_r, ens_tp_f1 = type_f1_score(gold_terms, ensemble_terms)

print("=== Ensemble (BERT + spaCy + dict) ===")
print(f"Micro Precision : {ens_p:.3f}")
print(f"Micro Recall    : {ens_r:.3f}")
print(f"Micro F1        : {ens_f1:.3f}")
print(f"  TP / FP / FN  : {ens_tp} / {ens_fp} / {ens_fn}")
print()
print(f"Type Precision  : {ens_tp_p:.3f}")
print(f"Type Recall     : {ens_tp_r:.3f}")
print(f"Type F1         : {ens_tp_f1:.3f}")


=== Ensemble (BERT + spaCy + dict) ===
Micro Precision : 0.739
Micro Recall    : 0.696
Micro F1        : 0.717
  TP / FP / FN  : 314 / 111 / 137

Type Precision  : 0.688
Type Recall     : 0.620
Type F1         : 0.652


In [87]:
#  Compute metrics for LLM-reranked system

llm_p, llm_r, llm_f1, llm_tp, llm_fp, llm_fn = micro_f1_score(gold_terms, reranked_terms)
llm_tp_p, llm_tp_r, llm_tp_f1 = type_f1_score(gold_terms, reranked_terms)

print("=== LLM Reranked (Groq) ===")
print(f"Micro Precision : {llm_p:.3f}")
print(f"Micro Recall    : {llm_r:.3f}")
print(f"Micro F1        : {llm_f1:.3f}")
print(f"  TP / FP / FN  : {llm_tp} / {llm_fp} / {llm_fn}")
print()
print(f"Type Precision  : {llm_tp_p:.3f}")
print(f"Type Recall     : {llm_tp_r:.3f}")
print(f"Type F1         : {llm_tp_f1:.3f}")


=== LLM Reranked (Groq) ===
Micro Precision : 0.813
Micro Recall    : 0.539
Micro F1        : 0.648
  TP / FP / FN  : 243 / 56 / 208

Type Precision  : 0.750
Type Recall     : 0.521
Type F1         : 0.615


=== LLM Reranked (Groq) ===
Micro Precision : 0.739
Micro Recall    : 0.696
Micro F1        : 0.717
  TP / FP / FN  : 314 / 111 / 137

Type Precision  : 0.688
Type Recall     : 0.620
Type F1         : 0.652

In [88]:
# Compact comparison summary

import pandas as pd

summary = pd.DataFrame(
    [
        ["ensemble", ens_p, ens_r, ens_f1, ens_tp_p, ens_tp_r, ens_tp_f1],
        ["llm_reranked", llm_p, llm_r, llm_f1, llm_tp_p, llm_tp_r, llm_tp_f1],
    ],
    columns=[
        "model",
        "micro_precision",
        "micro_recall",
        "micro_f1",
        "type_precision",
        "type_recall",
        "type_f1",
    ],
)

summary


Unnamed: 0,model,micro_precision,micro_recall,micro_f1,type_precision,type_recall,type_f1
0,ensemble,0.738824,0.696231,0.716895,0.688073,0.619835,0.652174
1,llm_reranked,0.812709,0.538803,0.648,0.75,0.520661,0.614634


## 11. Next steps and tuning ideas

Some ideas to improve the reranking quality:

1. **Adjust the prompt**:
   - Be more strict or more permissive in the instructions.
   - Emphasize multi-word terms or certain POS patterns.

2. **Control filtering**:
   - After reranking, you can:
     - Keep only the top-K terms per sentence (e.g., top 3 or top 5).
     - Discard terms with score `< 0.4` (or another threshold).

3. **Domain-specific heuristics**:
   - Penalize candidates that end with stopwords like *"di"*, *"dei"*, *"e"*, etc.
   - Boost terms that include frequent domain keywords (*"rifiuti"*, *"raccolta"*, *"centro di raccolta"*, etc.).

You can implement these as a post-processing step on `reranked_full["data"]` before saving, or directly instruct the LLM in the prompt.


## ADVANCED DEBUG ANALYSIS FOR RERANKING QUALITY

In [89]:
from collections import Counter, defaultdict
import pandas as pd

print("=== DEBUG ANALYSIS START ===\n")

debug_rows = []

for i, (gold, ens, rer) in enumerate(zip(gold_terms, ensemble_terms, reranked_terms)):

    gold_set = set(gold)
    ens_set = set(ens)
    rer_set = set(rer)

    lost_terms = list(ens_set - rer_set)
    added_terms = list(rer_set - ens_set)

    false_positives = list(rer_set - gold_set)
    true_positives = list(rer_set & gold_set)
    new_true_positives = list((rer_set - ens_set) & gold_set)

    hard_missed = list(gold_set - ens_set - rer_set)

    debug_rows.append({
        "index": i,
        "gold": gold,
        "ensemble": ens,
        "reranked": rer,
        "lost_terms": lost_terms,
        "added_terms": added_terms,
        "remaining_false_positives": false_positives,
        "new_true_positives": new_true_positives,
        "hard_missed_terms": hard_missed,
    })

df_debug = pd.DataFrame(debug_rows)
print("✓ Debug dataframe created")


=== DEBUG ANALYSIS START ===

✓ Debug dataframe created


### Lost terms during reranking

In [90]:
lost_counter = Counter(
    t for row in debug_rows for t in row["lost_terms"]
)

print("\n=== TOP LOST TERMS (reranking removed them but they were in ensemble) ===")
for term, count in lost_counter.most_common(20):
    print(f"{term:40s}  →  removed {count} times")



=== TOP LOST TERMS (reranking removed them but they were in ensemble) ===
vetro                                     →  removed 13 times
plastica                                  →  removed 10 times
conferire                                 →  removed 8 times
raccolta                                  →  removed 5 times
alluminio                                 →  removed 5 times
carta                                     →  removed 5 times
busta                                     →  removed 3 times
sacchetti                                 →  removed 3 times
rifiuti                                   →  removed 3 times
essere conferiti                          →  removed 3 times
sacco                                     →  removed 3 times
utente                                    →  removed 3 times
materiali                                 →  removed 2 times
depositare                                →  removed 2 times
secchiello                                →  removed 2 times
conferit

### added terms with reranking

In [91]:
added_counter = Counter(
    t for row in debug_rows for t in row["added_terms"]
)

print("\n=== TERMS ADDED BY RERANKING (not in ensemble) ===")
for term, count in added_counter.most_common(20):
    print(f"{term:40s}  →  added {count} times")



=== TERMS ADDED BY RERANKING (not in ensemble) ===


### False positives rimasti

In [92]:
fp_counter = Counter(
    t for row in debug_rows for t in row["remaining_false_positives"]
)

print("\n=== REMAINING FALSE POSITIVES (still wrong after reranking) ===")
for term, count in fp_counter.most_common(20):
    print(f"{term:40s}  →  FP {count} times")



=== REMAINING FALSE POSITIVES (still wrong after reranking) ===
rifiuti                                   →  FP 3 times
centro di raccolta                        →  FP 2 times
raccolta differenziata                    →  FP 2 times
banda stagnata                            →  FP 2 times
centri di raccolta comunali               →  FP 1 times
disciplinare                              →  FP 1 times
servizio di raccolta dei rifiuti derivanti da sfalci e potature  →  FP 1 times
batterie e accumulatori al piombo derivanti dalla manutenzione  →  FP 1 times
pile portatili                            →  FP 1 times
tari                                      →  FP 1 times
sacchetti del non differenziabile         →  FP 1 times
utenze domestiche per la raccolta differenziata  →  FP 1 times
rifiuti non pericolosi e non ingombranti  →  FP 1 times
servizio integrato gestione rifiuti –     →  FP 1 times
raccolta differenziata e servizi complementari  →  FP 1 times
incendi dei rifiuti                  

### true positives added with reranking

In [93]:
tp_gain_counter = Counter(
    t for row in debug_rows for t in row["new_true_positives"]
)

print("\n=== TRUE POSITIVES ADDED BY RERANKING (good improvements) ===")
for term, count in tp_gain_counter.most_common(20):
    print(f"{term:40s}  →  NEW TP {count} times")



=== TRUE POSITIVES ADDED BY RERANKING (good improvements) ===


### Hard cases (termini gold mancanti sia da ensemble che reranking)

In [94]:
hard_counter = Counter(
    t for row in debug_rows for t in row["hard_missed_terms"]
)

print("\n=== HARD MISSED TERMS (neither ensemble nor reranking found them) ===")
for term, count in hard_counter.most_common(20):
    print(f"{term:40s}  →  MISSED {count} times")



=== HARD MISSED TERMS (neither ensemble nor reranking found them) ===
sacchetto trasparente                     →  MISSED 4 times
plastica, acciaio e alluminio             →  MISSED 3 times
conferiti                                 →  MISSED 3 times
raccolta                                  →  MISSED 3 times
frazione verde                            →  MISSED 2 times
carta, cartone, cartoncino                →  MISSED 2 times
conferimento                              →  MISSED 2 times
modalità di conferimento                  →  MISSED 2 times
rifiuti                                   →  MISSED 2 times
busta con legaccio                        →  MISSED 2 times
carta - cartone - tetra pak               →  MISSED 2 times
misure di gestione ambientale             →  MISSED 2 times
r.a.e.e.                                  →  MISSED 2 times
tariffe                                   →  MISSED 2 times
conferire                                 →  MISSED 2 times
plastica                     

In [95]:
# Sentences where LLM made the worst damage:
df_debug["lost_count"] = df_debug["lost_terms"].apply(len)
df_debug["fp_count"] = df_debug["remaining_false_positives"].apply(len)

worst_lost = df_debug.sort_values("lost_count", ascending=False).head(10)
worst_fp = df_debug.sort_values("fp_count", ascending=False).head(10)

print("\n=== SENTENCES WHERE RERANKING LOST MOST TERMS ===")
worst_lost[["index", "lost_terms"]]

print("\n=== SENTENCES WHERE RERANKING HAS MOST FALSE POSITIVES ===")
worst_fp[["index", "remaining_false_positives"]]



=== SENTENCES WHERE RERANKING LOST MOST TERMS ===

=== SENTENCES WHERE RERANKING HAS MOST FALSE POSITIVES ===


Unnamed: 0,index,remaining_false_positives
402,402,"[oli conferiti, oli esausti, recupero]"
35,35,[batterie e accumulatori al piombo derivanti d...
1,1,"[centri di raccolta comunali, disciplinare]"
97,97,"[servizio integrato gestione rifiuti –, raccol..."
299,299,"[raccolta differenziata, porta a porta spinto]"
371,371,"[banda stagnata, lattine]"
415,415,"[parte fissa, parte variabile]"
538,538,"[scarti di produzione, tessuti sporchi ed inqu..."
6,6,[servizio di raccolta dei rifiuti derivanti da...
70,70,[centro di raccolta]
