# BERT Token Classification for Italian Term Extraction

This notebook demonstrates a BERT-based approach to term extraction:
- Uses BIO tagging scheme (Beginning-Inside-Outside)
- Fine-tunes Italian BERT model for token classification
- Trains on labeled data to recognize term boundaries

Dataset: EvalITA 2025 ATE-IT (Automatic Term Extraction - Italian Testbed)


## Setup and Imports

In [71]:
import json
import os
import numpy as np
import torch
from transformers import (
    AutoTokenizer, #choose this
    AutoModelForTokenClassification,  #choose this
    TrainingArguments, 
    Trainer,
    DataCollatorForTokenClassification
)
from torch.utils.data import Dataset
import pandas as pd

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Setup complete")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Setup complete
PyTorch version: 2.8.0+cpu
CUDA available: False


In [None]:
# Define label mappings for BIO tagging scheme
label_list = ['O', 'B-TERM', 'I-TERM']
label2id = {k: v for v, k in enumerate(label_list)}
id2label = {v: k for v, k in enumerate(label_list)}

print(f"Labels: {label_list}")
print(f"Label to ID: {label2id}")

# Model configuration
model_name = "dbmdz/bert-base-italian-cased-wwm" # for everything before: bert-base-italian-uncased
output_model_dir = "models/bert_token_classification_3e-5_changed" #TODO CHANHGE

print(f"\nModel: {model_name}")
print(f"Output directory: {output_model_dir}")

Labels: ['O', 'B-TERM', 'I-TERM']
Label to ID: {'O': 0, 'B-TERM': 1, 'I-TERM': 2}

Model: dbmdz/bert-base-italian-uncased
Output directory: models/bert_token_classification_2e-5_changed


## Data Loading and Processing

In [37]:
def load_jsonl(path: str):
    """Load a JSON lines file or JSON array file."""
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read().strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = []
        for line in text.splitlines():
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data

#aggregate the terms concatenating the paragraphs
def build_sentence_gold_map(records):
    """Convert dataset rows into list of sentences with aggregated terms."""
    out = {}
    
    if isinstance(records, dict) and 'data' in records:
        rows = records['data']
    else:
        rows = records
    
    for r in rows:
        key = (r.get('document_id'), r.get('paragraph_id'), r.get('sentence_id'))
        if key not in out:
            out[key] = {
                'document_id': r.get('document_id'),
                'paragraph_id': r.get('paragraph_id'),
                'sentence_id': r.get('sentence_id'),
                'sentence_text': r.get('sentence_text', ''),
                'terms': []
            }
        
        if isinstance(r.get('term_list'), list):
            for t in r.get('term_list'):
                if t and t not in out[key]['terms']:
                    out[key]['terms'].append(t)
        else:
            term = r.get('term')
            if term and term not in out[key]['terms']:
                out[key]['terms'].append(term)
    
    return list(out.values())


print("✓ Data loading functions defined")

✓ Data loading functions defined


In [38]:
# Load training and dev data
train_data = load_jsonl('../data/subtask_a_train.json')
dev_data = load_jsonl('../data/subtask_a_dev.json')

train_sentences = build_sentence_gold_map(train_data)
dev_sentences = build_sentence_gold_map(dev_data)

print(f"Training sentences: {len(train_sentences)}")
print(f"Dev sentences: {len(dev_sentences)}")
print(f"\nExample sentence:")
print(f"  Text: {train_sentences[6]['sentence_text']}")
print(f"  Terms: {train_sentences[6]['terms']}")

Training sentences: 2308
Dev sentences: 577

Example sentence:
  Text: AFFIDAMENTO DEL “SERVIZIO DI SPAZZAMENTO, RACCOLTA, TRASPORTO E SMALTIMENTO/RECUPERO DEI RIFIUTI URBANI ED ASSIMILATI E SERVIZI COMPLEMENTARI DELLA CITTA' DI AGROPOLI” VALEVOLE PER UN QUINQUENNIO
  Terms: ['raccolta', 'recupero', 'servizio di raccolta', 'servizio di spazzamento', 'smaltimento', 'trasporto']


In [39]:
def preprocess_text(text, force_lower=True):
    # fix encoding issues
    text = text.replace("\u00a0", " ")

    # normalize spaces
    text = " ".join(text.split())

    # unify apostrophes
    text = text.replace("’", "'").replace("`", "'")

    # lowercase if model is uncased
    if force_lower:
        text = text.lower()

    # remove weird control characters
    text = "".join(c for c in text if c.isprintable())

    return text


In [40]:
for entry in train_sentences:
    # pulisci il testo della frase
    entry["sentence_text"] = preprocess_text(entry["sentence_text"])
    # (opzionale ma consigliato) pulisci anche i termini gold
    entry["terms"] = [preprocess_text(t) for t in entry["terms"]]

for entry in dev_sentences:
    entry["sentence_text"] = preprocess_text(entry["sentence_text"])
    entry["terms"] = [preprocess_text(t) for t in entry["terms"]]

print(f"Training sentences: {len(train_sentences)}")
print(f"Dev sentences: {len(dev_sentences)}")
print(f"\nExample sentence:")
print(f"  Text: {train_sentences[6]['sentence_text']}")
print(f"  Terms: {train_sentences[6]['terms']}")

Training sentences: 2308
Dev sentences: 577

Example sentence:
  Text: affidamento del “servizio di spazzamento, raccolta, trasporto e smaltimento/recupero dei rifiuti urbani ed assimilati e servizi complementari della citta' di agropoli” valevole per un quinquennio
  Terms: ['raccolta', 'recupero', 'servizio di raccolta', 'servizio di spazzamento', 'smaltimento', 'trasporto']


## Evaluation Metrics

Using the official evaluation metrics from the competition.

In [41]:
def micro_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Precision, Recall, and F1 score 
    based on individual term matching (micro-average).
    """
    total_true_positives = 0
    total_false_positives = 0
    total_false_negatives = 0
    
    for gold, system in zip(gold_standard, system_output):
        gold_set = set(gold)
        system_set = set(system)
        
        true_positives = len(gold_set.intersection(system_set))
        false_positives = len(system_set - gold_set)
        false_negatives = len(gold_set - system_set)
        
        total_true_positives += true_positives
        total_false_positives += false_positives
        total_false_negatives += false_negatives
    
    precision = total_true_positives / (total_true_positives + total_false_positives) if (total_true_positives + total_false_positives) > 0 else 0
    recall = total_true_positives / (total_true_positives + total_false_negatives) if (total_true_positives + total_false_negatives) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return precision, recall, f1, total_true_positives, total_false_positives, total_false_negatives


def type_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Type Precision, Type Recall, and Type F1 score
    based on the set of unique terms extracted at least once across the entire dataset.
    """
    all_gold_terms = set()
    for item_terms in gold_standard:
        all_gold_terms.update(item_terms)
    
    all_system_terms = set()
    for item_terms in system_output:
        all_system_terms.update(item_terms)
    
    type_true_positives = len(all_gold_terms.intersection(all_system_terms))
    type_false_positives = len(all_system_terms - all_gold_terms)
    type_false_negatives = len(all_gold_terms - all_system_terms)
    
    type_precision = type_true_positives / (type_true_positives + type_false_positives) if (type_true_positives + type_false_positives) > 0 else 0
    type_recall = type_true_positives / (type_true_positives + type_false_negatives) if (type_true_positives + type_false_negatives) > 0 else 0
    type_f1 = 2 * (type_precision * type_recall) / (type_precision + type_recall) if (type_precision + type_recall) > 0 else 0
    
    return type_precision, type_recall, type_f1


print("✓ Evaluation functions defined")

✓ Evaluation functions defined


## Initialize BERT Model and Tokenizer

Always load
- tokenizer
- bert model

In [42]:
# Initialize tokenizer and model
print("Initializing BERT tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, 
    num_labels=len(label_list), #labels we're trying to predict
    id2label=id2label, 
    label2id=label2id
)

print(f"✓ Tokenizer loaded: {tokenizer.__class__.__name__}")
print(f"✓ Model loaded with {model.num_labels} labels")
print(f"  Vocabulary size: {tokenizer.vocab_size}")


Initializing BERT tokenizer and model...


Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-italian-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ Tokenizer loaded: BertTokenizerFast
✓ Model loaded with 3 labels
  Vocabulary size: 31102


## BIO Tag Generation for Training Data

In [43]:
import string
# Initialize tokenizer and model
print("Initializing BERT tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, 
    num_labels=len(label_list), #labels we're trying to predict
    id2label=id2label, 
    label2id=label2id
)

print(f"✓ Tokenizer loaded: {tokenizer.__class__.__name__}")
print(f"✓ Model loaded with {model.num_labels} labels")
print(f"  Vocabulary size: {tokenizer.vocab_size}")

Initializing BERT tokenizer and model...


Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-italian-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ Tokenizer loaded: BertTokenizerFast
✓ Model loaded with 3 labels
  Vocabulary size: 31102


## Process Training and Dev Data with BIO Tags

In [44]:
def create_ner_tags(text: str, terms: list[str], tokenizer, label2id: dict):
    """
    Crea token e BIO tag per una frase, dato l'elenco dei termini gold.

    text: frase pre-processata (come in preprocess_text)
    terms: lista di termini gold pre-processati
    tokenizer: tokenizer HuggingFace
    label2id: dict, es. {'O': 0, 'B-TERM': 1, 'I-TERM': 2}

    Ritorna:
        tokens: list[str]
        ner_tags: list[int] (stessa lunghezza di tokens)
    """

    # --- 1) Trova tutti gli span (start_char, end_char) dei termini nel testo ---

    def is_boundary(ch: str | None) -> bool:
        """True se il carattere è None o non alfanumerico (quindi buon confine di parola)."""
        if ch is None:
            return True
        return not ch.isalnum()

    spans = []  # lista di (start, end)
    for term in terms:
        term = term.strip()
        if not term:
            continue

        start = 0
        while True:
            idx = text.find(term, start)
            if idx == -1:
                break

            end = idx + len(term)

            # Controllo confini di parola
            before = text[idx - 1] if idx > 0 else None
            after = text[end] if end < len(text) else None

            if is_boundary(before) and is_boundary(after):
                spans.append((idx, end))

            start = idx + len(term)

    # opzionale: ordina gli span per inizio
    spans.sort(key=lambda x: x[0])

    # --- 2) Tokenizza con offset mapping ---
    encoded = tokenizer(
        text,
        return_offsets_mapping=True,
        add_special_tokens=False
    )

    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
    offsets = encoded["offset_mapping"]

    ner_tags = [label2id["O"]] * len(tokens)

    # --- 3) Assegna BIO tag in base agli span ---
    for i, (tok_start, tok_end) in enumerate(offsets):
        # alcuni tokenizer possono dare (0, 0) per token speciali, ma noi add_special_tokens=False
        if tok_start == tok_end:
            ner_tags[i] = label2id["O"]
            continue

        tag = "O"
        for span_start, span_end in spans:
            # se il token inizia dentro uno span
            if tok_start >= span_start and tok_start < span_end:
                if tok_start == span_start:
                    tag = "B-TERM"
                else:
                    tag = "I-TERM"
                break

        ner_tags[i] = label2id[tag]

    return tokens, ner_tags


In [45]:
import pandas as pd
# Process training data
print("Processing training data...")
for i, entry in enumerate(train_sentences):
    text = entry['sentence_text']
    terms = entry['terms']
    
    tokens, ner_tags = create_ner_tags(text, terms, tokenizer, label2id)
    entry['tokens'] = tokens
    entry['ner_tags'] = ner_tags
    
    if i % 1000 == 0:
        print(f"  Processed {i}/{len(train_sentences)}")

print(f"✓ Training data processed: {len(train_sentences)} sentences")

# Process dev data
print("\nProcessing dev data...")
for i, entry in enumerate(dev_sentences):
    text = entry['sentence_text']
    terms = entry['terms']
    
    tokens, ner_tags = create_ner_tags(text, terms, tokenizer, label2id)
    entry['tokens'] = tokens
    entry['ner_tags'] = ner_tags
    
    if i % 200 == 0:
        print(f"  Processed {i}/{len(dev_sentences)}")

print(f"✓ Dev data processed: {len(dev_sentences)} sentences")

print(f"\nSample train sentence:")
print(f"  Text: {train_sentences[6]['sentence_text']}")
print(f"  Terms: {train_sentences[6]['terms']}")
token_tags = []
for token, tag in zip(train_sentences[6]['tokens'], train_sentences[6]['ner_tags']):
    token_tags.append((token, id2label[tag]))
print(f"\n{pd.DataFrame(token_tags, columns=['Token', 'Tag']).to_markdown()}")

Processing training data...
  Processed 0/2308
  Processed 1000/2308
  Processed 2000/2308
✓ Training data processed: 2308 sentences

Processing dev data...
  Processed 0/577
  Processed 200/577
  Processed 400/577
✓ Dev data processed: 577 sentences

Sample train sentence:
  Text: affidamento del “servizio di spazzamento, raccolta, trasporto e smaltimento/recupero dei rifiuti urbani ed assimilati e servizi complementari della citta' di agropoli” valevole per un quinquennio
  Terms: ['raccolta', 'recupero', 'servizio di raccolta', 'servizio di spazzamento', 'smaltimento', 'trasporto']

|    | Token         | Tag    |
|---:|:--------------|:-------|
|  0 | affidamento   | O      |
|  1 | del           | O      |
|  2 | “             | O      |
|  3 | servizio      | B-TERM |
|  4 | di            | I-TERM |
|  5 | spa           | I-TERM |
|  6 | ##zzamento    | I-TERM |
|  7 | ,             | O      |
|  8 | raccolta      | B-TERM |
|  9 | ,             | O      |
| 10 | trasporto     | 

## Prepare Dataset for BERT Training

In [46]:
class TokenClassificationDataset(Dataset):
    """Dataset per token classification usando i token BERT e le ner_tags già pre-calcolate."""
    
    def __init__(self, sentences, tokenizer, max_length=512):
        """
        sentences: lista di dict (train_sentences / dev_sentences),
                   ognuno con 'sentence_text', 'tokens', 'ner_tags'
        """
        self.sentences = sentences
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self, idx):
        entry = self.sentences[idx]
        
        # subtoken BERT (senza special tokens) e label allineate 1:1
        bert_tokens = entry["tokens"]
        bert_labels = entry["ner_tags"]  # lista di int (id delle label)

        # converti i token in ids
        subtoken_ids = self.tokenizer.convert_tokens_to_ids(bert_tokens)

        # rispetta il max_length: lasciamo spazio per CLS e SEP
        max_subtokens = self.max_length - 2
        subtoken_ids = subtoken_ids[:max_subtokens]
        bert_labels = bert_labels[:max_subtokens]

        # costruisci input_ids con CLS e SEP
        input_ids = [self.tokenizer.cls_token_id] + subtoken_ids + [self.tokenizer.sep_token_id]

        # mask: 1 per token reali (CLS + subtokens + SEP)
        attention_mask = [1] * len(input_ids)

        # labels: -100 per CLS/SEP, poi le nostre label
        labels = [-100] + bert_labels + [-100]

        assert len(input_ids) == len(attention_mask) == len(labels)

        # NON facciamo padding qui: ci pensa DataCollatorForTokenClassification
        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }


In [47]:
print("Creating training datasets...")

train_dataset = TokenClassificationDataset(
    sentences=train_sentences,
    tokenizer=tokenizer,
    max_length=512
)

dev_dataset = TokenClassificationDataset(
    sentences=dev_sentences,
    tokenizer=tokenizer,
    max_length=512,
)

print(f"✓ Training dataset: {len(train_dataset)} examples")
print(f"✓ Dev dataset: {len(dev_dataset)} examples")


Creating training datasets...
✓ Training dataset: 2308 examples
✓ Dev dataset: 577 examples


## Configure Training Arguments

In [48]:
# Setup data collator for token classification
# Data collator is used to dynamically pad inputs and labels
data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer,
    padding=True,
    return_tensors="pt"
)
print("✓ Data collator initialized")

✓ Data collator initialized


In [49]:
from seqeval.metrics import precision_score, recall_score, f1_score
import numpy as np

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_true = []
    batch_pred = []

    for pred, lab in zip(preds, label_ids):
        true_labels = []
        pred_labels = []
        for p, l in zip(pred, lab):
            if l == -100:
                continue
            true_labels.append(id2label[l])
            pred_labels.append(id2label[p])
        batch_true.append(true_labels)
        batch_pred.append(pred_labels)

    return batch_pred, batch_true

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    y_pred, y_true = align_predictions(logits, labels)

    return {
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
    }


In [None]:
# Define training arguments
""" training_args = TrainingArguments(
    output_dir=output_model_dir,
    learning_rate= 3e-5,    # before 2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01, #
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False, #push the model to hugging face
    logging_steps=100,
    save_total_limit=2,
    seed=42,
    fp16=torch.cuda.is_available(), #only if you have gpu
    report_to="none"
) """

training_args = TrainingArguments(
    output_dir=output_model_dir,
    learning_rate=3e-5,                  # TORNA a 2e-5 (2e-5_changed) - PRIMA 3e-5
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,                  # tetto massimo
    weight_decay=0.01,

    # !!! PARAMETRO CORRETTO !!!
    eval_strategy="epoch",         # non eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,

    # scheduler + warmup
    lr_scheduler_type="linear",
    warmup_ratio=0.1,

    logging_steps=100,
    save_total_limit=2,
    seed=42,
    fp16=torch.cuda.is_available(),
    report_to="none",

    # best model basato sulla F1 (se compute_metrics la espone)
    metric_for_best_model="f1",
    greater_is_better=True,
)

print("✓ Training configuration ready")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Learning rate: {training_args.learning_rate}")

✓ Training configuration ready
  Batch size: 16
  Epochs: 7
  Learning rate: 2e-05


## Train BERT Model

Note: This cell might take several minutes to run.


**Additional configurations to test**
- Aggregate training samples per paragraph/document
- Change hyperparameters (*learning_rate*, *batch_size*, *num_train_epochs*, *weight_decay*)

In [51]:
from transformers import EarlyStoppingCallback

# Initialize Trainer
print("Initializing Trainer...")
""" trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
) """
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,                 # con F1 token-level
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
print("✓ Trainer initialized")
print(f"  Training samples: {len(train_dataset)}")
print(f"  Evaluation samples: {len(dev_dataset)}")

Initializing Trainer...
✓ Trainer initialized
  Training samples: 2308
  Evaluation samples: 577


  trainer = Trainer(


In [52]:
# Start training
print("="*60)
print("Starting model training...")
print("="*60)

import time
training_start_time = time.time()

train_result = trainer.train()
#trainer.train(resume_from_checkpoint="models/bert_token_classification/checkpoint-725")


training_duration = time.time() - training_start_time

print("\n" + "="*60)
print("✓ TRAINING COMPLETED!")
print("="*60)
print(f"Training time: {training_duration/60:.2f} minutes")

Starting model training...


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,0.5242,0.166736,0.445783,0.585973,0.506354
2,0.1661,0.132566,0.681395,0.662896,0.672018
3,0.0754,0.126107,0.648649,0.760181,0.7
4,0.0644,0.128943,0.648,0.733032,0.687898
5,0.0342,0.134354,0.697674,0.746606,0.721311
6,0.0228,0.137543,0.70021,0.755656,0.726877
7,0.0177,0.140826,0.703158,0.755656,0.728462





✓ TRAINING COMPLETED!
Training time: 62.09 minutes


## Save Trained Model

In [53]:
# Save the trained model
print("Saving trained model...")

os.makedirs(output_model_dir, exist_ok=True)
trainer.save_model(output_model_dir)
tokenizer.save_pretrained(output_model_dir)

print(f"✓ Model saved to: {output_model_dir}")

Saving trained model...
✓ Model saved to: models/bert_token_classification_2e-5_changed


## Inference Function

In [63]:
# Load the trained model for inference
print("Loading trained model for inference...")

inference_model = AutoModelForTokenClassification.from_pretrained(output_model_dir)
inference_tokenizer = AutoTokenizer.from_pretrained(output_model_dir)
inference_model.eval()

print(f"✓ Model loaded from: {output_model_dir}")

Loading trained model for inference...
✓ Model loaded from: models/bert_token_classification_2e-5_changed


## Predict on Dev Set

In [64]:
def clean_term(term: str) -> str:
    t = term.strip()
    # normalizza spazi
    t = " ".join(t.split())
    # togli punteggiatura solo ai bordi (non in mezzo)
    t = t.strip(string.punctuation + "«»“”'\"")
    return t.lower()


In [65]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inference_model = AutoModelForTokenClassification.from_pretrained(output_model_dir).to(device)
id2label = inference_model.config.id2label  # override quella globale, se vuoi
inference_tokenizer = AutoTokenizer.from_pretrained(output_model_dir)
inference_model.eval()


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31102, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [66]:
def perform_inference(model, tokenizer, text, id2label):
    encoded = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        return_offsets_mapping=True,
        max_length=512,
    )
    offset_mapping = encoded["offset_mapping"][0]  # (seq_len, 2)
    attention_mask = encoded["attention_mask"][0]

    inputs = {k: v.to(device) for k, v in encoded.items() if k != "offset_mapping"}

    with torch.no_grad():
        outputs = model(**inputs)
        predicted_labels = outputs.logits.argmax(dim=-1)[0].cpu()

    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
    labels = [id2label[p.item()] for p in predicted_labels]

    predicted_terms = []
    current_term_chars = []

    text_proc = text  # già preprocessato prima di chiamare la funzione

    for (token, label, mask, (start, end)) in zip(tokens, labels, attention_mask, offset_mapping):
        if mask.item() == 0:
            continue  # padding

        # salta special token (meglio che per stringa)
        if start == 0 and end == 0:
            continue

        if label == "B-TERM":
            if current_term_chars:
                predicted_terms.append("".join(current_term_chars))
                current_term_chars = []
            current_term_chars.append(text_proc[start:end])
        elif label == "I-TERM" and current_term_chars:
            current_term_chars.append(" " + text_proc[start:end])
        else:
            if current_term_chars:
                predicted_terms.append("".join(current_term_chars))
                current_term_chars = []

    if current_term_chars:
        predicted_terms.append("".join(current_term_chars))

    predicted_terms = [clean_term(t) for t in predicted_terms if clean_term(t)]
    return predicted_terms


In [67]:
# Run inference on all dev sentences
print("Running inference on dev set...")
bert_preds = []

for i, sentence in enumerate(dev_sentences):
    if i % 200 == 0:
        print(f"  Processing {i}/{len(dev_sentences)}")
    
    predicted_terms = perform_inference(
        inference_model,
        inference_tokenizer,
        preprocess_text(sentence["sentence_text"]), #CHANGED
        id2label
    )
    bert_preds.append(predicted_terms)

print(f"✓ Inference completed: {len(bert_preds)} predictions")

Running inference on dev set...
  Processing 0/577
  Processing 200/577
  Processing 400/577
✓ Inference completed: 577 predictions


In [68]:
metrics = trainer.evaluate()
print(metrics)


{'eval_loss': 0.14082567393779755, 'eval_precision': 0.7031578947368421, 'eval_recall': 0.755656108597285, 'eval_f1': 0.7284623773173392, 'eval_runtime': 20.9162, 'eval_samples_per_second': 27.586, 'eval_steps_per_second': 1.769, 'epoch': 7.0}


In [69]:
def normalize_for_eval(t: str) -> str:
    # stessa logica di clean_term / preprocess_text
    t = t.strip().lower()
    t = " ".join(t.split())
    # togli punteggiatura ai bordi
    t = t.strip(string.punctuation + "«»“”'\"")
    return t

def compute_term_level_metrics_from_bert(dev_sentences, bert_preds):
    """
    dev_sentences: lista di dict (già pre-processati) con campo 'terms'
    bert_preds:    lista parallela di liste di termini predetti
    """
    assert len(dev_sentences) == len(bert_preds)

    tp = fp = fn = 0

    for entry, pred_terms in zip(dev_sentences, bert_preds):
        gold_terms = [normalize_for_eval(t) for t in entry["terms"]]
        pred_terms = [normalize_for_eval(t) for t in pred_terms]

        gold_set = set(gold_terms)
        pred_set = set(pred_terms)

        tp += len(gold_set & pred_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)

    precision = tp / (tp + fp + 1e-8)
    recall    = tp / (tp + fn + 1e-8)
    f1        = 2 * precision * recall / (precision + recall + 1e-8)

    print("=== TERM-LEVEL METRICS (DIRECT FROM bert_preds) ===")
    print(f"Precision: {precision:.3f}")
    print(f"Recall:    {recall:.3f}")
    print(f"F1:        {f1:.3f}")
    print(f"TP={tp}, FP={fp}, FN={fn}")

    return precision, recall, f1

compute_term_level_metrics_from_bert(dev_sentences, bert_preds)


=== TERM-LEVEL METRICS (DIRECT FROM bert_preds) ===
Precision: 0.460
Recall:    0.446
F1:        0.453
TP=201, FP=236, FN=250


(0.4599542333990857, 0.44567627493468565, 0.45270269769374955)

In [70]:
# Prepare gold standard and predictions for evaluation
dev_gold = [s['terms'] for s in dev_sentences]

# Evaluate using competition metrics
precision, recall, f1, tp, fp, fn= micro_f1_score(dev_gold, bert_preds)
type_precision, type_recall, type_f1 = type_f1_score(dev_gold, bert_preds)

print("\n" + "="*60)
print("BERT TOKEN CLASSIFICATION RESULTS")
print("="*60)
print("\nMicro-averaged Metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")
print(f"  TP={tp}, FP={fp}, FN={fn}")

print("\nType-level Metrics:")
print(f"  Type Precision: {type_precision:.4f}")
print(f"  Type Recall:    {type_recall:.4f}")
print(f"  Type F1 Score:  {type_f1:.4f}")
print("="*60)


BERT TOKEN CLASSIFICATION RESULTS

Micro-averaged Metrics:
  Precision: 0.4600
  Recall:    0.4457
  F1 Score:  0.4527
  TP=201, FP=236, FN=250

Type-level Metrics:
  Type Precision: 0.3948
  Type Recall:    0.3802
  Type F1 Score:  0.3874


BERT TOKEN CLASSIFICATION RESULTS (2e-5)

Micro-averaged Metrics:
Precision: 0.7079
Recall:    0.6718
F1 Score:  0.6894
TP=303, FP=125, FN=148

Type-level Metrics:
Type Precision: 0.6545
Type Recall:    0.5950
Type F1 Score:  0.6234

In [61]:
# Save predictions in competition format
def save_predictions(predictions, sentences, output_path):
    """Save predictions in competition format."""
    output = {'data': []}
    for pred, sent in zip(predictions, sentences):
        output['data'].append({
            'document_id': sent['document_id'],
            'paragraph_id': sent['paragraph_id'],
            'sentence_id': sent['sentence_id'],
            'term_list': pred
        })
    
    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    print(f"✓ Saved {len(predictions)} predictions to {output_path}")


save_predictions(bert_preds, dev_sentences, 'predictions/subtask_a_dev_bert_token_classification_preds_extended_2e-5_changed.json') #CHANGE

✓ Saved 577 predictions to predictions/subtask_a_dev_bert_token_classification_preds_extended_2e-5_changed.json


## Example Predictions

In [62]:
# Show example predictions
print("Example Predictions:\n")

count = 0
for i in range(len(dev_sentences)):
    if len(dev_gold[i]) > 0 and count < 5:
        print(f"Sentence: {dev_sentences[i]['sentence_text'][:100]}...")
        print(f"Gold terms: {dev_gold[i][:5]}")
        print(f"BERT predictions: {bert_preds[i][:5]}")
        
        correct = set(dev_gold[i]) & set(bert_preds[i])
        missed = set(dev_gold[i]) - set(bert_preds[i])
        wrong = set(bert_preds[i]) - set(dev_gold[i])
        
        print(f"✓ Correct: {len(correct)}")
        print(f"✗ Missed: {len(missed)}")
        print(f"✗ Wrong: {len(wrong)}")
        print("-"*80)
        print()
        
        count += 1

Example Predictions:

Sentence: il presente disciplinare per la gestione dei centri di raccolta comunali è stato redatto ai sensi e ...
Gold terms: ['disciplina dei centri di raccolta dei rifiuti urbani raccolti in modo differenziato', 'disciplinare per la gestione dei centri di raccolta comunali']
BERT predictions: ['disciplinare per la gestione dei centri di raccolta comunali', 'centri di raccolta dei rifiuti urbani raccolti in']
✓ Correct: 1
✗ Missed: 1
✗ Wrong: 1
--------------------------------------------------------------------------------

Sentence: è un servizio supplementare di raccolta, rivolto a famiglie con bambini al di sotto dei 3 anni o con...
Gold terms: ['raccolta']
BERT predictions: ['servizio']
✓ Correct: 0
✗ Missed: 1
✗ Wrong: 1
--------------------------------------------------------------------------------

Sentence: ll servizio di raccolta dei rifiuti derivanti da sfalci e potature è gestito dalla buttol srl con il...
Gold terms: ['servizio di raccolta dei rifiu