# spaCy Term Extraction for Italian Text

This notebook demonstrates two approaches to term extraction using spaCy:
1. **Baseline**: Rule-based extraction with EntityRuler
2. **Trained**: Custom NER model fine-tuned for term extraction

Dataset: EvalITA 2025 ATE-IT (Automatic Term Extraction - Italian Testbed)

## Setup and Imports
- The available Italian spaCy models are documented at https://spacy.io/models/it

In [19]:
#!python -m spacy download it_core_news_sm
#!python -m spacy download it_core_news_md
#!python -m spacy download it_core_news_lg

In [20]:
import json
import os
import random
from pathlib import Path
from typing import List, Dict, Tuple

import spacy
from spacy.tokens import DocBin, Doc
from spacy.training import Example
from spacy.util import minibatch, compounding
from spacy.pipeline import EntityRuler
from tqdm import tqdm

# Load the Italian model; spacy.load raises OSError when the package is not installed
try:
    nlp = spacy.load('it_core_news_md')
    print("✓ Italian model loaded successfully")
except OSError:
    print("Model not found. Install with: python -m spacy download it_core_news_md")

✓ Italian model loaded successfully


## Data Loading and Processing

In [21]:
def load_jsonl(path: str) -> List[Dict]:
    """Load a JSON lines file or JSON array file."""
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read().strip()
    if not text:
        return []
    try:
        # Try parsing as single JSON object/array
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSONL (one JSON per line)
        data = []
        for line in text.splitlines():
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data


def build_sentence_gold_map(records: List[Dict]) -> List[Dict]:
    """Convert dataset rows into list of sentences with aggregated terms.
    
    Handles both formats:
    - Records with 'term_list' field (list of terms) for input files in json format
    - Records with individual 'term' field (one term per row) for input files in csv format
    """
    out = {}
    
    # Support both dict with 'data' key and plain list
    if isinstance(records, dict) and 'data' in records:
        rows = records['data']
    else:
        rows = records
    
    for r in rows:
        key = (r.get('document_id'), r.get('paragraph_id'), r.get('sentence_id'))
        if key not in out:
            out[key] = {
                'document_id': r.get('document_id'),
                'paragraph_id': r.get('paragraph_id'),
                'sentence_id': r.get('sentence_id'),
                'sentence_text': r.get('sentence_text', ''),
                'terms': []
            }
        
        # Support both 'term_list' (list) and 'term' (single value)
        if isinstance(r.get('term_list'), list):
            for t in r.get('term_list'):
                if t and t not in out[key]['terms']:
                    out[key]['terms'].append(t)
        else:
            term = r.get('term')
            if term and term not in out[key]['terms']:
                out[key]['terms'].append(term)
    
    return list(out.values())


# Test: Load a small sample
test_data = {
    'data': [
        {
            'document_id': 'doc1',
            'paragraph_id': 'p1',
            'sentence_id': 's1',
            'sentence_text': 'La tassa di successione è un tributo.',
            'term_list': ['tassa di successione', 'tributo']
        }
    ]
}

test_sentences = build_sentence_gold_map(test_data)
assert len(test_sentences) == 1
assert test_sentences[0]['terms'] == ['tassa di successione', 'tributo']
print("✓ Data loading functions work correctly")

✓ Data loading functions work correctly


In [22]:
# Load actual training and dev data
train_data = load_jsonl('../data/subtask_a_train.json')
dev_data = load_jsonl('../data/subtask_a_dev.json')

train_sentences = build_sentence_gold_map(train_data)
dev_sentences = build_sentence_gold_map(dev_data)

print(f"Training sentences: {len(train_sentences)}")
print(f"Dev sentences: {len(dev_sentences)}")
print(f"\nExample sentence:")
print(f"  Text: {train_sentences[6]['sentence_text']}")
print(f"  Terms: {train_sentences[6]['terms']}")

Training sentences: 2308
Dev sentences: 577

Example sentence:
  Text: AFFIDAMENTO DEL “SERVIZIO DI SPAZZAMENTO, RACCOLTA, TRASPORTO E SMALTIMENTO/RECUPERO DEI RIFIUTI URBANI ED ASSIMILATI E SERVIZI COMPLEMENTARI DELLA CITTA' DI AGROPOLI” VALEVOLE PER UN QUINQUENNIO
  Terms: ['raccolta', 'recupero', 'servizio di raccolta', 'servizio di spazzamento', 'smaltimento', 'trasporto']


## Evaluation Metrics

Using the official evaluation metrics from the competition.

In [23]:
def micro_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Precision, Recall, and F1 score 
    based on individual term matching (micro-average).
    
    Args:
        gold_standard: List of lists, where each inner list contains gold standard terms
        system_output: List of lists, where each inner list contains extracted terms
    
    Returns:
        Tuple containing (precision, recall, f1, tp, fp, fn)
    """
    total_true_positives = 0
    total_false_positives = 0
    total_false_negatives = 0
    
    # Iterate through each item's gold standard and system output terms
    for gold, system in zip(gold_standard, system_output):
        # Convert to sets for efficient comparison
        gold_set = set(gold)
        system_set = set(system)
        
        # Calculate TP, FP, FN for the current item
        true_positives = len(gold_set.intersection(system_set))
        false_positives = len(system_set - gold_set)
        false_negatives = len(gold_set - system_set)
        
        # Accumulate totals across all items
        total_true_positives += true_positives
        total_false_positives += false_positives
        total_false_negatives += false_negatives
    
    # Calculate Precision, Recall, and F1 score (micro-average)
    precision = total_true_positives / (total_true_positives + total_false_positives) if (total_true_positives + total_false_positives) > 0 else 0
    recall = total_true_positives / (total_true_positives + total_false_negatives) if (total_true_positives + total_false_negatives) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return precision, recall, f1, total_true_positives, total_false_positives, total_false_negatives


def type_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Type Precision, Type Recall, and Type F1 score
    based on the set of unique terms extracted at least once across the entire dataset.
    
    Args:
        gold_standard: List of lists, where each inner list contains gold standard terms
        system_output: List of lists, where each inner list contains extracted terms
    
    Returns:
        Tuple containing (type_precision, type_recall, type_f1)
    """
    # Get the set of all unique gold standard terms across the dataset
    all_gold_terms = set()
    for item_terms in gold_standard:
        all_gold_terms.update(item_terms)
    
    # Get the set of all unique system extracted terms across the dataset
    all_system_terms = set()
    for item_terms in system_output:
        all_system_terms.update(item_terms)
    
    # Calculate True Positives (terms present in both sets)
    type_true_positives = len(all_gold_terms.intersection(all_system_terms))
    
    # Calculate False Positives (terms in system output but not in gold standard)
    type_false_positives = len(all_system_terms - all_gold_terms)
    
    # Calculate False Negatives (terms in gold standard but not in system output)
    type_false_negatives = len(all_gold_terms - all_system_terms)
    
    # Calculate Type Precision, Type Recall, and Type F1 score
    type_precision = type_true_positives / (type_true_positives + type_false_positives) if (type_true_positives + type_false_positives) > 0 else 0
    type_recall = type_true_positives / (type_true_positives + type_false_negatives) if (type_true_positives + type_false_negatives) > 0 else 0
    type_f1 = 2 * (type_precision * type_recall) / (type_precision + type_recall) if (type_precision + type_recall) > 0 else 0
    
    return type_precision, type_recall, type_f1


# Test: Simple case
gold_test = [['term1', 'term2'], ['term3']]
pred_test = [['term1', 'term4'], ['term3']]
precision, recall, f1, tp, fp, fn = micro_f1_score(gold_test, pred_test)
assert tp == 2  # term1 and term3
assert fp == 1  # term4
assert fn == 1  # term2
print("✓ Evaluation functions work correctly")
print(f"  Test metrics: P={precision:.2f}, R={recall:.2f}, F1={f1:.2f}")

# Test type-level metrics
type_p, type_r, type_f1 = type_f1_score(gold_test, pred_test)
print(f"  Type metrics: P={type_p:.2f}, R={type_r:.2f}, F1={type_f1:.2f}")

✓ Evaluation functions work correctly
  Test metrics: P=0.67, R=0.67, F1=0.67
  Type metrics: P=0.67, R=0.67, F1=0.67
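
To make the difference between the two metric families concrete, here is a small self-contained sketch. It re-implements both scores compactly under the same list-of-lists input format; the helper names `prf`, `micro_prf` and `type_prf` are illustrative and not part of the notebook. In the toy data, the system misses `rifiuti` in one sentence but finds it elsewhere, which lowers micro recall while leaving type recall untouched.

```python
from typing import List, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 on empty denominators)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Count matches sentence by sentence, then pool the counts."""
    tp = fp = fn = 0
    for g, s in zip(gold, pred):
        g_set, s_set = set(g), set(s)
        tp += len(g_set & s_set)
        fp += len(s_set - g_set)
        fn += len(g_set - s_set)
    return prf(tp, fp, fn)


def type_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Compare the sets of distinct terms over the whole dataset."""
    gold_types = set().union(*gold)
    pred_types = set().union(*pred)
    return prf(
        len(gold_types & pred_types),
        len(pred_types - gold_types),
        len(gold_types - pred_types),
    )


# Toy data: "rifiuti" is missed in the second sentence only
gold = [["rifiuti"], ["rifiuti"], ["rifiuti", "raccolta"]]
pred = [["rifiuti"], [], ["rifiuti", "raccolta"]]

micro = micro_prf(gold, pred)
types = type_prf(gold, pred)
print("micro P/R/F1:", [round(x, 3) for x in micro])
print("type  P/R/F1:", [round(x, 3) for x in types])
```

Micro scores reward finding a term every time it occurs, whereas type scores only ask whether each distinct term was found at least once.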


## Baseline Model: Rule-Based EntityRuler

A simple approach using spaCy's EntityRuler:
- Creates exact match patterns from training terms
- Fast and deterministic
- No generalization to unseen terms
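
The first bullet can be sketched without loading spaCy: each term becomes a list of per-token `LOWER` constraints, which is the dictionary shape that EntityRuler patterns use. The helper name `term_to_pattern` is hypothetical, and splitting on whitespace is a simplifying assumption: spaCy's Italian tokenizer may split elided forms such as `dell'acqua` into more than one token, in which case a whitespace-split pattern would not match.

```python
from typing import Dict, List


def term_to_pattern(term: str, label: str = "TERM") -> Dict:
    """Turn a term into an EntityRuler-style, case-insensitive token pattern."""
    tokens = term.lower().split()  # assumption: whitespace tokens match spaCy tokens
    return {"label": label, "pattern": [{"LOWER": tok} for tok in tokens]}


terms: List[str] = ["tributo", "tassa di successione"]

# Longest terms first, so multiword terms are listed before their sub-terms
ordered = sorted(terms, key=lambda s: (len(s.split()), len(s)), reverse=True)
patterns = [term_to_pattern(t) for t in ordered]

for p in patterns:
    print(p)
```

Because every token is compared through `LOWER`, the same pattern matches `Tassa di successione` and `TASSA DI SUCCESSIONE`, but never a term that is absent from the training list.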

In [24]:
import re
import unicodedata

def norm(t: str) -> str:
    """Normalize a term: lowercase, NFKC, unify quotes, collapse whitespace, trim edge punctuation."""
    if not t:
        return ""
    t = t.lower()
    t = unicodedata.normalize("NFKC", t)
    t = t.replace("’", "'").replace("`", "'")
    t = t.replace("“", '"').replace("”", '"')
    t = " ".join(t.split())
    # Strip punctuation at the edges
    t = t.strip(".,;:-'\"()[]{}")
    return t

In [25]:
class SpacyRuleBaseline:
    """Rule-based extractor using EntityRuler."""

    def __init__(self, model: str = "it_core_news_sm"):
        """
        If only the EntityRuler is needed, spacy.blank("it") would be a lighter
        alternative. For now the pretrained model is kept.
        """
        try:
            self.nlp = spacy.load(model)
            print(f"✓ Loaded spaCy model: {model}")
        except Exception:
            print(f"Could not load {model}, using blank('it') instead")
            self.nlp = spacy.blank("it")

        self.ruler = None

    def build(self, terms: List[str]):
        """Build EntityRuler patterns from a term list (token-based, LOWER, normalized)."""

        # 1) Normalize and deduplicate the terms
        norm_terms = set()
        for t in terms:
            if not t:
                continue
            n = norm(t)  # norm() is defined in the previous cell
            if not n:
                continue
            norm_terms.add(n)

        # 2) Create token-based LOWER patterns, sorted by length (descending)
        patterns = []
        # Sort by token count, then character length, to favor multiword terms
        def sort_key(s: str):
            toks = s.split()
            return (len(toks), len(s))

        for t_norm in sorted(norm_terms, key=sort_key, reverse=True):
            tokens = t_norm.split()
            # Optional filter, e.g. keep only multiword terms:
            # if len(tokens) < 2:
            #     continue

            token_pattern = [{"LOWER": tok} for tok in tokens]
            patterns.append({"label": "TERM", "pattern": token_pattern})

        # 3) Replace any existing entity_ruler
        if "entity_ruler" in self.nlp.pipe_names:
            self.nlp.remove_pipe("entity_ruler")

        self.nlp.add_pipe("entity_ruler", name="entity_ruler", first=True)
        self.ruler = self.nlp.get_pipe("entity_ruler")
        self.ruler.add_patterns(patterns)

        print(f"Built EntityRuler with {len(patterns)} patterns")

    def predict(self, texts: List[str]) -> List[List[str]]:
        """Extract terms from texts."""
        if self.ruler is None:
            raise RuntimeError("Model not built. Call build() first.")

        results = []
        for doc in tqdm(self.nlp.pipe(texts, batch_size=32),
                        desc="Predicting", total=len(texts)):
            # Collect all TERM entities exactly as they appear in the text
            terms = [ent.text for ent in doc.ents if ent.label_ == "TERM"]
            results.append(terms)
        return results

    def save(self, path: str):
        """Save EntityRuler patterns."""
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        if self.ruler:
            self.ruler.to_disk(path)
            print(f"Model saved to {path}")

    def load(self, path: str):
        """Load EntityRuler patterns."""
        if "entity_ruler" in self.nlp.pipe_names:
            self.nlp.remove_pipe("entity_ruler")

        self.nlp.add_pipe("entity_ruler", name="entity_ruler", first=True)
        self.ruler = self.nlp.get_pipe("entity_ruler")
        self.ruler.from_disk(path)
        print(f"Model loaded from {path}")


In [26]:
# Test: Simple predictions
baseline = SpacyRuleBaseline()
baseline.build(['tributo', 'tassa di successione'])
test_preds = baseline.predict(['Il tributo è una tassa di successione.'])
assert 'tributo' in test_preds[0]
assert 'tassa di successione' in test_preds[0]
print("✓ Baseline model works correctly")
print(f"  Test predictions: {test_preds[0]}")

✓ Loaded spaCy model: it_core_news_sm
Built EntityRuler with 2 patterns


Predicting: 100%|██████████| 1/1 [00:00<00:00, 117.78it/s]

✓ Baseline model works correctly
  Test predictions: ['tributo', 'tassa di successione']





### Run and Evaluate Baseline Model

In [27]:
# Extract unique terms from training data
train_terms = set()
for s in train_sentences:
    train_terms.update(t for t in s['terms'] if t)

print(f"Unique training terms: {len(train_terms)}")

# Build baseline model
baseline_model = SpacyRuleBaseline()
baseline_model.build(sorted(train_terms))

# Predict on dev set
dev_texts = [s['sentence_text'] for s in dev_sentences]
dev_gold = [s['terms'] for s in dev_sentences]

print("\nRunning baseline predictions...")
baseline_preds = baseline_model.predict(dev_texts)

# Evaluate
precision, recall, f1, tp, fp, fn = micro_f1_score(dev_gold, baseline_preds)
type_precision, type_recall, type_f1 = type_f1_score(dev_gold, baseline_preds)

print("\nBaseline Model Results:")
print("Micro-averaged metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")
print(f"  TP={tp}, FP={fp}, FN={fn}")
print("\nType-level metrics:")
print(f"  Type Precision: {type_precision:.4f}")
print(f"  Type Recall:    {type_recall:.4f}")
print(f"  Type F1 Score:  {type_f1:.4f}")

# Store metrics for later comparison
baseline_metrics = {
    'precision': precision,
    'recall': recall,
    'f1': f1,
    'type_precision': type_precision,
    'type_recall': type_recall,
    'type_f1': type_f1
}

Unique training terms: 713
✓ Loaded spaCy model: it_core_news_sm
Built EntityRuler with 710 patterns

Running baseline predictions...


Predicting: 100%|██████████| 577/577 [00:05<00:00, 105.19it/s]


Baseline Model Results:
Micro-averaged metrics:
  Precision: 0.2889
  Recall:    0.3415
  F1 Score:  0.3130
  TP=154, FP=379, FN=297

Type-level metrics:
  Type Precision: 0.3494
  Type Recall:    0.3595
  Type F1 Score:  0.3544
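
One thing worth checking when reading these numbers: `predict` returns each match exactly as it appears in the sentence, while the gold terms in the example sentence shown earlier are lowercase even though the sentence itself is upper-case. Under exact string matching, such surface-form differences count as errors. The sketch below lowercases and deduplicates predictions before scoring. It is an illustration only: the helper `normalize_predictions` is hypothetical, and whether the official scorer applies any normalization is an assumption this notebook does not confirm.

```python
from typing import List


def normalize_predictions(preds: List[List[str]]) -> List[List[str]]:
    """Lowercase, collapse whitespace and deduplicate terms per sentence.

    First-seen order is preserved within each sentence.
    """
    normalized = []
    for terms in preds:
        seen = {}
        for term in terms:
            key = " ".join(term.lower().split())
            if key:
                seen.setdefault(key, None)  # dict keeps insertion order
        normalized.append(list(seen))
    return normalized


# Hypothetical raw predictions for two sentences
raw_preds = [["RACCOLTA", "Trasporto", "raccolta"], []]
clean_preds = normalize_predictions(raw_preds)
print(clean_preds)
```

The cleaned lists can be passed to `micro_f1_score` and `type_f1_score` in place of the raw predictions to see how much of the error comes from casing alone.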





In [28]:
# Save baseline model
baseline_model.save('models/spacy_baseline_refined')

Model saved to models/spacy_baseline_refined


## Trained Model: Custom NER

A neural approach that learns from examples:
- Fine-tunes spaCy's NER component on the term extraction task
- Learns patterns and context from labeled data
- Can generalize to similar terms not seen during training
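
Training an NER component needs character offsets for every gold term in its sentence. A minimal sketch of that alignment step follows, using only the standard library. It makes two labeled assumptions about what is desirable here: matching is case-insensitive, and matches must sit on word boundaries. The function name `find_term_spans` is hypothetical, and overlaps are resolved by keeping the longest span.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]


def find_term_spans(text: str, terms: List[str], label: str = "TERM") -> List[Span]:
    """Locate terms in text and return non-overlapping (start, end, label) spans."""
    candidates: List[Span] = []
    for term in terms:
        if not term:
            continue
        # Assumption: case-insensitive match that does not start or end inside a word
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            candidates.append((match.start(), match.end(), label))

    # Longest span first, then leftmost; keep a span only if it overlaps nothing kept
    candidates.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, span_label in candidates:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, span_label))
    return sorted(kept)


text = "SERVIZIO DI RACCOLTA e trasporto dei rifiuti"
spans = find_term_spans(text, ["raccolta", "servizio di raccolta", "trasporto"])
for start, end, span_label in spans:
    print(start, end, span_label, repr(text[start:end]))
```

A plain case-sensitive `str.find` would miss the upper-case occurrence in this sentence, so the choice of matching strategy directly affects how many training examples end up with annotations.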

In [29]:
class SpacyTrainedModel:
    """Trainable NER model for term extraction."""
    
    def __init__(self, model: str = 'it_core_news_sm'):
        self.model_name = model
        self.nlp = None
    
    def _prepare_training_data(self, sentences: List[str], term_lists: List[List[str]]) -> List[Example]:
        """Convert to SpaCy training format with character-span annotations."""
        training_data = []
        
        for sent_text, terms in zip(sentences, term_lists):
            doc = self.nlp.make_doc(sent_text)
            entities = []
            
            # Find character spans for each term
            for term in terms:
                if not term:
                    continue
                
                # Find all occurrences
                start_idx = 0
                while True:
                    start_idx = sent_text.find(term, start_idx)
                    if start_idx == -1:
                        break
                    
                    end_idx = start_idx + len(term)
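                    # NOTE: 'expand' can widen the span to token boundaries, but the raw
                    # (start_idx, end_idx) offsets are what gets stored below; offsets that
                    # do not align with token boundaries may be ignored by spaCy at training time
                    span = doc.char_span(start_idx, end_idx, label='TERM', alignment_mode='expand')
                    if span is not None:
                        entities.append((start_idx, end_idx, 'TERM'))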
                    
                    start_idx = end_idx
            
            # Resolve overlaps: spaCy's NER assigns each token to at most one entity,
            # so nested terms cannot both be annotated. When "servizio di raccolta" and
            # "raccolta" overlap, the nested "raccolta" is dropped even though the gold
            # annotations list both terms.
            entities = self._remove_overlapping(entities)
            example = Example.from_dict(doc, {'entities': entities})
            training_data.append(example)
        
        return training_data
    
    def _remove_overlapping(self, entities: List[Tuple[int, int, str]]) -> List[Tuple[int, int, str]]:
        """Drop overlapping spans: the earliest start wins, and for equal starts the longer span."""
        if not entities:
            return []
        
        # Sort by start, then by length (descending)
        entities = sorted(entities, key=lambda x: (x[0], -(x[1] - x[0])))
        
        non_overlapping = []
        for start, end, label in entities:
            overlaps = False
            for prev_start, prev_end, _ in non_overlapping:
                if not (end <= prev_start or start >= prev_end):
                    overlaps = True
                    break
            if not overlaps:
                non_overlapping.append((start, end, label))
        
        return non_overlapping
    
    def train(self, sentences: List[str], term_lists: List[List[str]], 
              n_iter: int = 30, dropout: float = 0.2, batch_size: int = 8):
        """Train NER model on labeled data."""
        print(f"Initializing model: {self.model_name}")
        
        # Load base model
        try:
            self.nlp = spacy.load(self.model_name)
        except OSError:
            print("Model not found, using blank Italian model")
            self.nlp = spacy.blank('it')
        
        # Setup NER
        if 'ner' not in self.nlp.pipe_names:
            ner = self.nlp.add_pipe('ner')
        else:
            ner = self.nlp.get_pipe('ner')
        ner.add_label('TERM')
        
        # Prepare training data
        print("Preparing training examples...")
        train_examples = self._prepare_training_data(sentences, term_lists)
        train_examples = [ex for ex in train_examples if len(ex.reference.ents) > 0] # Keep only examples with entities
        print(f"Training on {len(train_examples)} examples")
        
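        # Train only the NER component; all other pipes are disabled during updates
        other_pipes = [pipe for pipe in self.nlp.pipe_names if pipe != 'ner']
        with self.nlp.disable_pipes(*other_pipes):
            # Continue from the pretrained weights for it_core_news_sm;
            # every other model name goes through begin_training()
            if self.model_name == 'it_core_news_sm':
                optimizer = self.nlp.resume_training()
            else:
                optimizer = self.nlp.begin_training()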

            for iteration in tqdm(range(n_iter), desc="Training", total=n_iter):
                random.shuffle(train_examples)
                losses = {}
                batches = minibatch(train_examples, size=compounding(4.0, batch_size, 1.001))
                
                for batch in batches:
                    self.nlp.update(batch, drop=dropout, losses=losses)
                
                if iteration % 5 == 0:
                    print(f"  Iteration {iteration}: Loss = {losses.get('ner', 0):.3f}")
        
        print("Training complete!")
    
    def predict(self, sentences: List[str]) -> List[List[str]]:
        """Extract terms from sentences."""
        if self.nlp is None:
            raise RuntimeError("Model not trained. Call train() or load() first.")
        
        results = []
        for doc in self.nlp.pipe(sentences, batch_size=32):
            terms = [ent.text for ent in doc.ents if ent.label_ == 'TERM']
            results.append(terms)
        return results
    
    def save(self, path: str):
        """Save trained model."""
        if self.nlp is None:
            raise RuntimeError("No model to save")
        
        output_dir = Path(path)
        output_dir.mkdir(parents=True, exist_ok=True)
        self.nlp.to_disk(output_dir)
        print(f"Model saved to {output_dir}")
    
    def load(self, path: str):
        """Load trained model."""
        self.nlp = spacy.load(path)
        if 'ner' not in self.nlp.pipe_names:
            raise ValueError("Loaded model doesn't have NER component")
        print(f"Model loaded from {path}")


# Test: Simple training
test_model = SpacyTrainedModel()
test_sents = ['Il tributo è importante.', 'La tassa di successione è un tributo.']
test_terms = [['tributo'], ['tassa di successione', 'tributo']]
test_model.train(test_sents, test_terms, n_iter=20)
test_pred = test_model.predict(['Il tributo è fondamentale.'])
assert 'tributo' in test_pred[0]
print(f"✓ Trained model works correctly")
print(f"  Test prediction: {test_pred[0]}")

Initializing model: it_core_news_sm
Preparing training examples...
Training on 2 examples


Training:  15%|█▌        | 3/20 [00:00<00:00, 21.44it/s]

  Iteration 0: Loss = 5.953


Training:  45%|████▌     | 9/20 [00:00<00:00, 22.93it/s]

  Iteration 5: Loss = 4.633


Training:  75%|███████▌  | 15/20 [00:00<00:00, 24.10it/s]

  Iteration 10: Loss = 3.874


Training: 100%|██████████| 20/20 [00:00<00:00, 23.06it/s]

  Iteration 15: Loss = 2.896
Training complete!
✓ Trained model works correctly
  Test prediction: ['tributo']





### Train and Evaluate Trained Model

Note: in the run recorded below, this cell took about 7.5 minutes (30 iterations at roughly 15 s each).

**Additional configurations to test**
- Keep overlapping entities
- Keep documents with 0 entities in the training set
- Change hyperparameters (*n_iter*, *dropout*, *batch_size*)
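
The hyperparameter bullet can be organized as a small grid. The sketch below only enumerates configurations; the value ranges are hypothetical placeholders, not settings evaluated in this notebook. Each dictionary uses the keyword names of `SpacyTrainedModel.train`, so it could be passed as `trained_model.train(train_texts, train_term_lists, **cfg)`.

```python
from itertools import product

# Hypothetical search space; keys match the keyword arguments of train()
grid = {
    "n_iter": [20, 30],
    "dropout": [0.1, 0.2],
    "batch_size": [8, 16],
}

# Cartesian product of all values, one dict per configuration
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

for cfg in configs:
    print(cfg)
print(f"{len(configs)} configurations")
```

Since each configuration requires a full training run, it is worth keeping the grid small or lowering `n_iter` for a first pass.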

In [30]:
# Prepare training data
train_texts = [s['sentence_text'] for s in train_sentences]
train_term_lists = [s['terms'] for s in train_sentences]

# Initialize and train model
trained_model = SpacyTrainedModel(model='it_core_news_md')  # changed from the default it_core_news_sm

trained_model.train(
    train_texts,
    train_term_lists,
    n_iter=30,
    dropout=0.1,  # changed from the default 0.2
    batch_size=8
)

Initializing model: it_core_news_md
Preparing training examples...




Training on 623 examples


Training:   3%|▎         | 1/30 [00:15<07:30, 15.53s/it]

  Iteration 0: Loss = 2764.503


Training:  20%|██        | 6/30 [01:32<06:09, 15.38s/it]

  Iteration 5: Loss = 569.971


Training:  37%|███▋      | 11/30 [02:50<04:53, 15.47s/it]

  Iteration 10: Loss = 332.613


Training:  53%|█████▎    | 16/30 [04:06<03:35, 15.39s/it]

  Iteration 15: Loss = 205.661


Training:  70%|███████   | 21/30 [05:24<02:19, 15.47s/it]

  Iteration 20: Loss = 174.807


Training:  87%|████████▋ | 26/30 [06:38<00:59, 14.91s/it]

  Iteration 25: Loss = 114.507


Training: 100%|██████████| 30/30 [07:38<00:00, 15.30s/it]

Training complete!





In [31]:
dev_texts = [s['sentence_text'] for s in dev_sentences]
dev_gold = [s['terms'] for s in dev_sentences]

In [32]:
# Predict on dev set
print("Running trained model predictions...")
trained_preds = trained_model.predict(dev_texts)

# Evaluate
precision, recall, f1, tp, fp, fn = micro_f1_score(dev_gold, trained_preds)
type_precision, type_recall, type_f1 = type_f1_score(dev_gold, trained_preds)

print("\nTrained Model Results:")
print("Micro-averaged metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")
print(f"  TP={tp}, FP={fp}, FN={fn}")
print("\nType-level metrics:")
print(f"  Type Precision: {type_precision:.4f}")
print(f"  Type Recall:    {type_recall:.4f}")
print(f"  Type F1 Score:  {type_f1:.4f}")

# Store metrics for later comparison
trained_metrics = {
    'precision': precision,
    'recall': recall,
    'f1': f1,
    'type_precision': type_precision,
    'type_recall': type_recall,
    'type_f1': type_f1
}

Running trained model predictions...

Trained Model Results:
Micro-averaged metrics:
  Precision: 0.5894
  Recall:    0.3215
  F1 Score:  0.4161
  TP=145, FP=101, FN=306

Type-level metrics:
  Type Precision: 0.5597
  Type Recall:    0.3678
  Type F1 Score:  0.4439


In [33]:
# Save trained model
trained_model.save('models/spacy_trained_new_md')

Model saved to models\spacy_trained_new_md


## Results Comparison

In [None]:
import pandas as pd

# Micro-averaged comparison
# Baseline row, disabled in this run:
# {
#     'Model': 'Baseline (Rules)',
#     'Precision': baseline_metrics['precision'],
#     'Recall': baseline_metrics['recall'],
#     'F1': baseline_metrics['f1']
# },
results_df = pd.DataFrame([
    {
        'Model': 'Trained (NER)',
        'Precision': trained_metrics['precision'],
        'Recall': trained_metrics['recall'],
        'F1': trained_metrics['f1']
    },
])

print("Micro-averaged Metrics:")
print(results_df.to_markdown(index=False))

# Type-level comparison
# Baseline row, disabled in this run:
# {
#     'Model': 'Baseline (Rules)',
#     'Type Precision': baseline_metrics['type_precision'],
#     'Type Recall': baseline_metrics['type_recall'],
#     'Type F1': baseline_metrics['type_f1']
# },
type_results_df = pd.DataFrame([
    {
        'Model': 'Trained (NER)',
        'Type Precision': trained_metrics['type_precision'],
        'Type Recall': trained_metrics['type_recall'],
        'Type F1': trained_metrics['type_f1']
    }
])

print("\n\nType-level Metrics:")
print(type_results_df.to_markdown(index=False))



Micro-averaged Metrics:
| Model         |   Precision |   Recall |       F1 |
|:--------------|------------:|---------:|---------:|
| Trained (NER) |    0.589431 | 0.321508 | 0.416069 |


Type-level Metrics:
| Model         |   Type Precision |   Type Recall |   Type F1 |
|:--------------|-----------------:|--------------:|----------:|
| Trained (NER) |         0.559748 |      0.367769 |   0.44389 |


# Disabled code, left in the cell as a string literal and echoed as its output:
# f1_improvement = (trained_metrics['f1'] - baseline_metrics['f1']) / baseline_metrics['f1'] * 100
# type_f1_improvement = (trained_metrics['type_f1'] - baseline_metrics['type_f1']) / baseline_metrics['type_f1'] * 100
# print(f"\n\nMicro F1 Score improvement: {f1_improvement:+.1f}%")
# print(f"Type F1 Score improvement: {type_f1_improvement:+.1f}%")
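
Since the tables above list only the trained model, a side-by-side view can be rebuilt from the figures printed in the two result cells. The sketch below hard-codes those dev-set numbers (rounded to four decimals, as printed) and computes the relative F1 change; the helper `relative_change` is illustrative.

```python
# Dev-set figures copied from the result cells above
results = {
    "Baseline (Rules)": {"precision": 0.2889, "recall": 0.3415, "f1": 0.3130, "type_f1": 0.3544},
    "Trained (NER)": {"precision": 0.5894, "recall": 0.3215, "f1": 0.4161, "type_f1": 0.4439},
}


def relative_change(new: float, old: float) -> float:
    """Relative change of new over old, in percent."""
    return (new - old) / old * 100


print(f"{'Model':<18}{'P':>8}{'R':>8}{'F1':>8}{'Type F1':>9}")
for name, m in results.items():
    print(f"{name:<18}{m['precision']:>8.4f}{m['recall']:>8.4f}{m['f1']:>8.4f}{m['type_f1']:>9.4f}")

base, trained = results["Baseline (Rules)"], results["Trained (NER)"]
f1_gain = relative_change(trained["f1"], base["f1"])
type_f1_gain = relative_change(trained["type_f1"], base["type_f1"])
print(f"\nMicro F1 change: {f1_gain:+.1f}%")
print(f"Type F1 change:  {type_f1_gain:+.1f}%")
```

On these figures the trained model gains mainly through precision (0.59 against 0.29), while its recall is slightly below the baseline's (0.32 against 0.34).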

## Save Predictions to Files

In [35]:
def save_predictions(predictions: List[List[str]], 
                     sentences: List[Dict], 
                     output_path: str):
    """Save predictions in competition format."""
    output = {'data': []}
    for pred, sent in zip(predictions, sentences):
        output['data'].append({
            'document_id': sent['document_id'],
            'paragraph_id': sent['paragraph_id'],
            'sentence_id': sent['sentence_id'],
            'term_list': pred
        })
    
    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    print(f"Saved {len(predictions)} predictions to {output_path}")


# Save both sets of predictions
#save_predictions(baseline_preds, dev_sentences, 'predictions/new_subtask_a_dev_spacy_baseline_preds.json')
save_predictions(trained_preds, dev_sentences, 'predictions/new/subtask_a_dev_spacy_trained_MD_preds_refined.json')

Saved 577 predictions to predictions/new/subtask_a_dev_spacy_trained_MD_preds_refined.json
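
A quick way to sanity-check a written file is to read it back and verify its shape. The sketch below assumes the layout produced by `save_predictions` (a top-level `data` list whose rows carry the three identifiers and a `term_list`); the helper `check_predictions_file` is hypothetical, and the demo writes a one-row file to a temporary directory rather than touching the real predictions.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"document_id", "paragraph_id", "sentence_id", "term_list"}


def check_predictions_file(path: str) -> int:
    """Validate the structure of a predictions file and return its row count."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    rows = payload["data"]
    for i, row in enumerate(rows):
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing keys: {sorted(missing)}")
        if not isinstance(row["term_list"], list):
            raise ValueError(f"row {i}: term_list must be a list")
    return len(rows)


# Demo on a temporary file with a single, made-up row
sample = {"data": [{"document_id": "doc1", "paragraph_id": "p1",
                    "sentence_id": "s1", "term_list": ["tributo"]}]}
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "preds.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)
    n_rows = check_predictions_file(sample_path)

print(f"{n_rows} row(s) validated")
```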


## Load and Test Saved Models

In [36]:
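# Test loading baseline
# NOTE: this loads 'models/spacy_baseline', not the 'models/spacy_baseline_refined'
# directory written by the save cell above; change the path to reload this run's model
loaded_baseline = SpacyRuleBaseline()
loaded_baseline.load('models/spacy_baseline')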
test_preds_baseline = loaded_baseline.predict([dev_texts[0]])
assert test_preds_baseline[0] == baseline_preds[0]
print("✓ Baseline model saved and loaded correctly")

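# Test loading trained model
# NOTE: this loads 'models/spacy_trained', not the 'models/spacy_trained_new_md'
# directory written by the save cell above; change the path to reload this run's model
loaded_trained = SpacyTrainedModel()
loaded_trained.load('models/spacy_trained')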
test_preds_trained = loaded_trained.predict([dev_texts[0]])
assert test_preds_trained[0] == trained_preds[0]
print("✓ Trained model saved and loaded correctly")

print("\nAll models successfully saved and can be reloaded!")

✓ Loaded spaCy model: it_core_news_sm
Model loaded from models/spacy_baseline


Predicting: 100%|██████████| 1/1 [00:00<?, ?it/s]

✓ Baseline model saved and loaded correctly





Model loaded from models/spacy_trained
✓ Trained model saved and loaded correctly

All models successfully saved and can be reloaded!


## Example Predictions

In [37]:
# Show example predictions from both models
example_idx = 120
example_text = dev_texts[example_idx]
example_gold = dev_gold[example_idx]
example_baseline = baseline_preds[example_idx]
example_trained = trained_preds[example_idx]

print(f"Sentence: {example_text}\n")
print(f"Gold terms: {example_gold}\n")
print(f"Baseline predictions: {example_baseline}")
print(f"Trained predictions: {example_trained}\n")

# Show what each model got right/wrong
baseline_correct = set(example_baseline) & set(example_gold)
trained_correct = set(example_trained) & set(example_gold)

print(f"Baseline correct: {baseline_correct}")
print(f"Trained correct: {trained_correct}")

Sentence: a. per incendi dei rifiuti nei contenitori € 2.000

Gold terms: ['rifiuti']

Baseline predictions: ['rifiuti']
Trained predictions: ['rifiuti']

Baseline correct: {'rifiuti'}
Trained correct: {'rifiuti'}
