# BERT Token Classification for Italian Term Extraction

This notebook demonstrates a BERT-based approach to term extraction:
- Uses the BIO tagging scheme (Beginning-Inside-Outside)
- Fine-tunes an Italian BERT model for token classification
- Trains on labeled data to recognize term boundaries

Dataset: EvalITA 2025 ATE-IT (Automatic Term Extraction - Italian Testbed)
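
As a toy illustration of the BIO scheme, the sketch below tags whitespace-separated words rather than BERT subtokens. `bio_tags` is a hypothetical helper written only for this example; the notebook's own tagging works on character offsets and is defined further down.

```python
def bio_tags(words, terms):
    """Tag whitespace-separated words with BIO labels for the given terms."""
    tags = ["O"] * len(words)
    for term in terms:
        term_words = term.split()
        n = len(term_words)
        for i in range(len(words) - n + 1):
            # Only tag positions that are still free, so terms never overwrite each other
            if words[i:i + n] == term_words and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B-TERM"                 # first word of the term
                for j in range(i + 1, i + n):
                    tags[j] = "I-TERM"             # remaining words of the term
    return tags

words = "affidamento del servizio di spazzamento".split()
tags = bio_tags(words, ["servizio di spazzamento"])
for word, tag in zip(words, tags):
    print(f"{word:12s} {tag}")
```

The first word of a term gets `B-TERM`, every following word of the same term gets `I-TERM`, and everything else stays `O`.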


## Setup and Imports

In [1]:
import json
import os
import numpy as np
import torch
from transformers import (
    AutoTokenizer,  # resolves the tokenizer class from the checkpoint name
    AutoModelForTokenClassification,  # adds a token-level classification head
    TrainingArguments, 
    Trainer,
    DataCollatorForTokenClassification
)
from torch.utils.data import Dataset
import pandas as pd

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Setup complete")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Setup complete
PyTorch version: 2.8.0+cpu
CUDA available: False


In [2]:
# Define label mappings for BIO tagging scheme
label_list = ['O', 'B-TERM', 'I-TERM']
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(f"Labels: {label_list}")
print(f"Label to ID: {label2id}")

# Model configuration
model_name = "dbmdz/bert-base-italian-uncased" 
output_model_dir = "models/bert_token_classification_2e-5_uncased_train_dev" 

print(f"\nModel: {model_name}")
print(f"Output directory: {output_model_dir}")

Labels: ['O', 'B-TERM', 'I-TERM']
Label to ID: {'O': 0, 'B-TERM': 1, 'I-TERM': 2}

Model: dbmdz/bert-base-italian-uncased
Output directory: models/bert_token_classification_2e-5_uncased_train_dev


## Data Loading and Processing

In [3]:
def load_jsonl(path: str):
    """Load a JSON lines file or JSON array file."""
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read().strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = []
        for line in text.splitlines():
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data

# Aggregate gold terms per sentence, keyed by (document_id, paragraph_id, sentence_id)
def build_sentence_gold_map(records):
    """Convert dataset rows into list of sentences with aggregated terms."""
    out = {}
    
    if isinstance(records, dict) and 'data' in records:
        rows = records['data']
    else:
        rows = records
    
    for r in rows:
        key = (r.get('document_id'), r.get('paragraph_id'), r.get('sentence_id'))
        if key not in out:
            out[key] = {
                'document_id': r.get('document_id'),
                'paragraph_id': r.get('paragraph_id'),
                'sentence_id': r.get('sentence_id'),
                'sentence_text': r.get('sentence_text', ''),
                'terms': []
            }
        
        if isinstance(r.get('term_list'), list):
            for t in r.get('term_list'):
                if t and t not in out[key]['terms']:
                    out[key]['terms'].append(t)
        else:
            term = r.get('term')
            if term and term not in out[key]['terms']:
                out[key]['terms'].append(term)
    
    return list(out.values())


print("✓ Data loading functions defined")

✓ Data loading functions defined


In [4]:
# Load training and dev data
train_data = load_jsonl('../../data/subtask_a_train.json')
dev_data = load_jsonl('../../data/subtask_a_dev.json')

train_sentences = build_sentence_gold_map(train_data)
dev_sentences = build_sentence_gold_map(dev_data)

print(f"Training sentences: {len(train_sentences)}")
print(f"Dev sentences: {len(dev_sentences)}")
print(f"\nExample sentence:")
print(f"  Text: {train_sentences[6]['sentence_text']}")
print(f"  Terms: {train_sentences[6]['terms']}")

Training sentences: 2308
Dev sentences: 577

Example sentence:
  Text: AFFIDAMENTO DEL “SERVIZIO DI SPAZZAMENTO, RACCOLTA, TRASPORTO E SMALTIMENTO/RECUPERO DEI RIFIUTI URBANI ED ASSIMILATI E SERVIZI COMPLEMENTARI DELLA CITTA' DI AGROPOLI” VALEVOLE PER UN QUINQUENNIO
  Terms: ['raccolta', 'recupero', 'servizio di raccolta', 'servizio di spazzamento', 'smaltimento', 'trasporto']


In [5]:
def preprocess_text(text, force_lower=True):
    # fix encoding issues
    text = text.replace("\u00a0", " ")

    # normalize spaces
    text = " ".join(text.split())

    # unify apostrophes
    text = text.replace("’", "'").replace("`", "'")

    # lowercase if model is uncased
    if force_lower:
        text = text.lower()

    # remove weird control characters
    text = "".join(c for c in text if c.isprintable())

    return text


In [6]:
for entry in train_sentences:
    # clean the sentence text
    entry["sentence_text"] = preprocess_text(entry["sentence_text"])
    # (optional but recommended) clean the gold terms the same way
    entry["terms"] = [preprocess_text(t) for t in entry["terms"]]

for entry in dev_sentences:
    entry["sentence_text"] = preprocess_text(entry["sentence_text"])
    entry["terms"] = [preprocess_text(t) for t in entry["terms"]]

all_sentences = train_sentences + dev_sentences
print(f"Total TRAIN+DEV sentences: {len(all_sentences)}")

Total TRAIN+DEV sentences: 2885
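
Some gold terms do not occur verbatim in their sentence. In the example sentence shown earlier, 'servizio di raccolta' is elliptical ("servizio di spazzamento, raccolta, ..."), so a plain string search cannot locate it. A minimal sketch for listing such cases follows; `unmatched_terms` is a hypothetical helper, the sample is a shortened copy of that example, and the check is a simple substring test that ignores word boundaries.

```python
def unmatched_terms(sentences):
    """Return (sentence_index, term) pairs whose term is not a substring of the sentence."""
    missing = []
    for i, entry in enumerate(sentences):
        for term in entry["terms"]:
            if term not in entry["sentence_text"]:
                missing.append((i, term))
    return missing

# Shortened, already lowercased version of the example sentence
sample = [{
    "sentence_text": "servizio di spazzamento, raccolta, trasporto e "
                     "smaltimento/recupero dei rifiuti",
    "terms": ["raccolta", "recupero", "servizio di raccolta",
              "servizio di spazzamento", "smaltimento", "trasporto"],
}]

missing = unmatched_terms(sample)
print(missing)
```

Running the same check over `all_sentences` would show how many gold terms can never receive a BIO tag under exact string matching.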


## Initialize BERT Model and Tokenizer

Load both from the same checkpoint:
- tokenizer
- BERT model

In [7]:
# Initialize tokenizer and model
print("Initializing BERT tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, 
    num_labels=len(label_list), #labels we're trying to predict
    id2label=id2label, 
    label2id=label2id
)

print(f"✓ Tokenizer loaded: {tokenizer.__class__.__name__}")
print(f"✓ Model loaded with {model.num_labels} labels")
print(f"  Vocabulary size: {tokenizer.vocab_size}")


Initializing BERT tokenizer and model...


Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-italian-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ Tokenizer loaded: BertTokenizerFast
✓ Model loaded with 3 labels
  Vocabulary size: 31102


## BIO Tag Generation for Training Data

In [8]:
import string


## Process Training + Dev Data with BIO Tags

In [10]:
def create_ner_tags(text: str, terms: list[str], tokenizer, label2id: dict):
    """
    Build tokens and BIO tags for one sentence, given its list of gold terms.

    text: preprocessed sentence (see preprocess_text)
    terms: list of preprocessed gold terms
    tokenizer: Hugging Face tokenizer
    label2id: dict, e.g. {'O': 0, 'B-TERM': 1, 'I-TERM': 2}

    Returns:
        tokens: list[str]
        ner_tags: list[int] (same length as tokens)
    """

    # --- 1) Find every (start_char, end_char) span of the terms in the text ---

    def is_boundary(ch: str | None) -> bool:
        """True if the character is None or non-alphanumeric (i.e. a valid word boundary)."""
        if ch is None:
            return True
        return not ch.isalnum()

    spans = []  # list of (start, end)
    for term in terms:
        term = term.strip()
        if not term:
            continue

        start = 0
        while True:
            idx = text.find(term, start)
            if idx == -1:
                break

            end = idx + len(term)

            # Check word boundaries
            before = text[idx - 1] if idx > 0 else None
            after = text[end] if end < len(text) else None

            if is_boundary(before) and is_boundary(after):
                spans.append((idx, end))

            start = idx + len(term)

    # Sort spans by start offset, so the earliest-starting span wins when spans overlap
    spans.sort(key=lambda x: x[0])

    # --- 2) Tokenize with offset mapping ---
    encoded = tokenizer(
        text,
        return_offsets_mapping=True,
        add_special_tokens=False
    )

    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
    offsets = encoded["offset_mapping"]

    ner_tags = [label2id["O"]] * len(tokens)

    # --- 3) Assign BIO tags based on the spans ---
    for i, (tok_start, tok_end) in enumerate(offsets):
        # some tokenizers return (0, 0) for special tokens; none are expected here since add_special_tokens=False
        if tok_start == tok_end:
            ner_tags[i] = label2id["O"]
            continue

        tag = "O"
        for span_start, span_end in spans:
            # the token starts inside a span
            if tok_start >= span_start and tok_start < span_end:
                if tok_start == span_start:
                    tag = "B-TERM"
                else:
                    tag = "I-TERM"
                break

        ner_tags[i] = label2id[tag]

    return tokens, ner_tags
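
The span logic can be tried without downloading a model. The sketch below is an illustration only: it replaces the BERT tokenizer with a regex that splits words and punctuation, and `find_term_spans` and `tag_tokens` are hypothetical names. It follows the same two rules as `create_ner_tags`: a match counts only if it sits on word boundaries, and a token is tagged by the first span (sorted by start) that contains its start offset.

```python
import re

def find_term_spans(text, terms):
    """Return sorted (start, end) character spans of terms that sit on word boundaries."""
    spans = []
    for term in terms:
        for m in re.finditer(re.escape(term), text):
            before = text[m.start() - 1] if m.start() > 0 else ""
            after = text[m.end()] if m.end() < len(text) else ""
            # "".isalnum() is False, so the string edges count as boundaries
            if not before.isalnum() and not after.isalnum():
                spans.append((m.start(), m.end()))
    return sorted(spans)

def tag_tokens(text, spans):
    """Tag regex tokens (a stand-in for BERT subtokens) from character spans."""
    tokens, tags = [], []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end in spans:
            if start <= m.start() < end:
                tag = "B-TERM" if m.start() == start else "I-TERM"
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags

text = "servizio di raccolta dei rifiuti, raccoltadifferenziata"
spans = find_term_spans(text, ["raccolta", "servizio di raccolta"])
tokens, tags = tag_tokens(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token:24s} {tag}")
```

Two behaviors are worth noting. The occurrence of "raccolta" inside "raccoltadifferenziata" is rejected by the boundary check, and the nested term "raccolta" is absorbed into the longer "servizio di raccolta" span, so it is tagged `I-TERM` rather than starting a term of its own.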


In [11]:
import pandas as pd
# Process training+dev data
print("Processing TRAIN+DEV data...")
for i, entry in enumerate(all_sentences):
    text = entry["sentence_text"]
    terms = entry["terms"]

    tokens, ner_tags = create_ner_tags(text, terms, tokenizer, label2id)
    entry["tokens"] = tokens
    entry["ner_tags"] = ner_tags

    if i % 1000 == 0:
        print(f"  Processed {i}/{len(all_sentences)}")

print(f"✓ TRAIN+DEV data processed: {len(all_sentences)} sentences")

print(f"\nSample sentence:")
print(f"  Text: {all_sentences[6]['sentence_text']}")
print(f"  Terms: {all_sentences[6]['terms']}")
token_tags = []
for token, tag in zip(all_sentences[6]['tokens'], all_sentences[6]['ner_tags']):
    token_tags.append((token, id2label[tag]))
print(f"\n{pd.DataFrame(token_tags, columns=['Token', 'Tag']).to_markdown()}")


Processing TRAIN+DEV data...
  Processed 0/2885
  Processed 1000/2885
  Processed 2000/2885
✓ TRAIN+DEV data processed: 2885 sentences

Sample sentence:
  Text: affidamento del “servizio di spazzamento, raccolta, trasporto e smaltimento/recupero dei rifiuti urbani ed assimilati e servizi complementari della citta' di agropoli” valevole per un quinquennio
  Terms: ['raccolta', 'recupero', 'servizio di raccolta', 'servizio di spazzamento', 'smaltimento', 'trasporto']

|    | Token         | Tag    |
|---:|:--------------|:-------|
|  0 | affidamento   | O      |
|  1 | del           | O      |
|  2 | “             | O      |
|  3 | servizio      | B-TERM |
|  4 | di            | I-TERM |
|  5 | spa           | I-TERM |
|  6 | ##zzamento    | I-TERM |
|  7 | ,             | O      |
|  8 | raccolta      | B-TERM |
|  9 | ,             | O      |
| 10 | trasporto     | B-TERM |
| 11 | e             | O      |
| 12 | smaltimento   | B-TERM |
| 13 | /             | O      |
| 14 | recupero  

## Prepare Dataset for BERT Training

In [12]:
class TokenClassificationDataset(Dataset):
    """Token classification dataset built from BERT tokens and precomputed ner_tags."""
    
    def __init__(self, sentences, tokenizer, max_length=512):
        """
        sentences: list of dicts (e.g. train_sentences / dev_sentences),
                   each with 'sentence_text', 'tokens', 'ner_tags'
        """
        self.sentences = sentences
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self, idx):
        entry = self.sentences[idx]
        
        # BERT subtokens (without special tokens) with labels aligned 1:1
        bert_tokens = entry["tokens"]
        bert_labels = entry["ner_tags"]  # list of int label ids

        # convert tokens to ids
        subtoken_ids = self.tokenizer.convert_tokens_to_ids(bert_tokens)

        # respect max_length, leaving room for CLS and SEP
        max_subtokens = self.max_length - 2
        subtoken_ids = subtoken_ids[:max_subtokens]
        bert_labels = bert_labels[:max_subtokens]

        # build input_ids with CLS and SEP
        input_ids = [self.tokenizer.cls_token_id] + subtoken_ids + [self.tokenizer.sep_token_id]

        # mask: 1 for every real token (CLS + subtokens + SEP)
        attention_mask = [1] * len(input_ids)

        # labels: -100 for CLS/SEP (ignored by the loss), our labels in between
        labels = [-100] + bert_labels + [-100]

        assert len(input_ids) == len(attention_mask) == len(labels)

        # No padding here: DataCollatorForTokenClassification handles it
        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }


In [17]:
print("Creating training dataset (train+dev)...")

train_dev_dataset = TokenClassificationDataset(
    sentences=all_sentences,
    tokenizer=tokenizer,
    max_length=512
)

print(f"✓ TRAIN+DEV dataset: {len(train_dev_dataset)} examples")


Creating training dataset (train+dev)...
✓ TRAIN+DEV dataset: 2885 examples


## Configure Training Arguments

In [18]:
# Setup data collator for token classification
# Data collator is used to dynamically pad inputs and labels
data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer,
    padding=True,
    return_tensors="pt"
)
print("✓ Data collator initialized")

✓ Data collator initialized
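
To make "dynamic padding" concrete, here is a rough pure-Python sketch of what happens to one batch: every example is padded on the right to the longest sequence in that batch, and labels are padded with -100 so the loss ignores those positions. `pad_batch` is a hypothetical stand-in, not the collator's actual code, and the token ids are made up.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every feature to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad)   # 0 = padding
        batch["labels"].append(f["labels"] + [label_pad_id] * pad)        # ignored by the loss
    return batch

# Two examples of different length; ids are invented for illustration
features = [
    {"input_ids": [2, 11, 12, 13, 3], "attention_mask": [1] * 5, "labels": [-100, 1, 2, 0, -100]},
    {"input_ids": [2, 21, 3],         "attention_mask": [1] * 3, "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])
print(batch["labels"][1])
```

Padding per batch instead of to a fixed 512 tokens keeps batches of short sentences small, which matters for CPU training.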


In [19]:
from seqeval.metrics import precision_score, recall_score, f1_score
import numpy as np

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_true = []
    batch_pred = []

    for pred, lab in zip(preds, label_ids):
        true_labels = []
        pred_labels = []
        for p, l in zip(pred, lab):
            if l == -100:
                continue
            true_labels.append(id2label[l])
            pred_labels.append(id2label[p])
        batch_true.append(true_labels)
        batch_pred.append(pred_labels)

    return batch_pred, batch_true

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    y_pred, y_true = align_predictions(logits, labels)

    return {
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
    }
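
The scores above are entity-level: a predicted term counts only if both of its boundaries match the gold span exactly. The sketch below illustrates that idea in plain Python. It is not seqeval's implementation; `extract_spans` and `span_prf` are hypothetical names, and unlike seqeval's default mode this sketch simply ignores an `I-TERM` that has no preceding `B-TERM`.

```python
def extract_spans(tags):
    """Return the set of (start, end) token spans encoded by a BIO tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):      # sentinel closes a trailing span
        if start is not None and tag != "I-TERM":
            spans.add((start, i))
            start = None
        if tag == "B-TERM":
            start = i
    return spans

def span_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(y_true, y_pred):
        gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [["B-TERM", "I-TERM", "O", "B-TERM"]]
y_pred = [["B-TERM", "O",      "O", "B-TERM"]]
scores = span_prf(y_true, y_pred)
print(scores)
```

In this example the first predicted term is one token too short, so it counts as both a false positive and a false negative even though three of four token labels are correct.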


In [20]:
training_args = TrainingArguments(
    output_dir=output_model_dir,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,

    # final run: no evaluation, because the dev set is merged into the training data
    eval_strategy="no",
    save_strategy="epoch",   
    load_best_model_at_end=False,

    lr_scheduler_type="linear",
    warmup_ratio=0.1,

    logging_steps=100,
    save_total_limit=2,
    seed=42,
    fp16=torch.cuda.is_available(),
    report_to="none",
)

print("✓ Training configuration ready (FINAL TRAIN+DEV)")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Learning rate: {training_args.learning_rate}")


✓ Training configuration ready (FINAL TRAIN+DEV)
  Batch size: 16
  Epochs: 7
  Learning rate: 2e-05
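
A back-of-the-envelope check of what these settings imply for the schedule. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the warmup rounding is also an assumption and may differ between transformers versions. `linear_lr` is a hypothetical helper that mirrors a linear warmup followed by linear decay to zero.

```python
import math

num_examples = 2885      # TRAIN+DEV sentences reported above
batch_size = 16
epochs = 7
warmup_ratio = 0.1

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)   # rounding is an assumption

def linear_lr(step, peak_lr=2e-5):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

print(f"steps/epoch={steps_per_epoch}, total={total_steps}, warmup={warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions the learning rate reaches its peak of 2e-5 after roughly the first tenth of training and then decays for the remaining steps.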


## Train BERT Model

Note: This cell can take a while to run; the run recorded below took about 35 minutes on CPU.


**Additional configurations to test**
- Aggregate training samples per paragraph/document
- Change hyperparameters (*learning_rate*, *batch_size*, *num_train_epochs*, *weight_decay*)
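
One possible way to try the first idea is to merge the sentence entries of each paragraph before BIO tagging. The sketch below is an assumption about how that could look, not something the notebook does: `group_by_paragraph` is a hypothetical helper, it expects entries shaped like the output of `build_sentence_gold_map`, it assumes the entries are already in reading order, and the sample records are invented.

```python
def group_by_paragraph(sentences):
    """Merge sentence entries that share (document_id, paragraph_id)."""
    grouped = {}
    for s in sentences:
        key = (s["document_id"], s["paragraph_id"])
        if key not in grouped:
            grouped[key] = {
                "document_id": s["document_id"],
                "paragraph_id": s["paragraph_id"],
                "texts": [],
                "terms": [],
            }
        grouped[key]["texts"].append(s["sentence_text"])
        for term in s["terms"]:
            if term not in grouped[key]["terms"]:      # keep terms unique per paragraph
                grouped[key]["terms"].append(term)
    return [
        {
            "document_id": g["document_id"],
            "paragraph_id": g["paragraph_id"],
            "sentence_text": " ".join(g["texts"]),
            "terms": g["terms"],
        }
        for g in grouped.values()
    ]

# Invented records for illustration only
sentences = [
    {"document_id": "doc1", "paragraph_id": 1, "sentence_id": 1,
     "sentence_text": "nuovo calendario raccolta rifiuti", "terms": ["raccolta rifiuti"]},
    {"document_id": "doc1", "paragraph_id": 1, "sentence_id": 2,
     "sentence_text": "plastica, alluminio, lattine", "terms": ["plastica", "alluminio"]},
    {"document_id": "doc1", "paragraph_id": 2, "sentence_id": 1,
     "sentence_text": "centro abitato", "terms": []},
]
paragraphs = group_by_paragraph(sentences)
print(len(paragraphs), paragraphs[0]["sentence_text"])
```

Longer inputs give the model more context but are more likely to hit the 512-token limit, where the dataset class truncates tokens and labels.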

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dev_dataset,
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class
    data_collator=data_collator,
)

print("✓ Trainer initialized (FINAL TRAIN+DEV)")
print(f"  Training samples: {len(train_dev_dataset)}")

✓ Trainer initialized (FINAL TRAIN+DEV)
  Training samples: 2885



In [22]:
# Start training
print("="*60)
print("Starting model training...")
print("="*60)

import time
training_start_time = time.time()

train_result = trainer.train()
training_duration = time.time() - training_start_time

print("\n" + "="*60)
print("✓ TRAINING COMPLETED!")
print("="*60)
print(f"Training time: {training_duration/60:.2f} minutes")

Starting model training...




| Step | Training Loss |
|-----:|--------------:|
|  100 | 0.5442 |
|  200 | 0.1829 |
|  300 | 0.1112 |
|  400 | 0.0989 |
|  500 | 0.0715 |
|  600 | 0.0509 |
|  700 | 0.0456 |
|  800 | 0.0357 |
|  900 | 0.0298 |
| 1000 | 0.0221 |





✓ TRAINING COMPLETED!
Training time: 34.84 minutes


## Save Trained Model

In [23]:
# Save the trained model
print("Saving trained model...")

os.makedirs(output_model_dir, exist_ok=True)
trainer.save_model(output_model_dir)
tokenizer.save_pretrained(output_model_dir)

print(f"✓ Model saved to: {output_model_dir}")

Saving trained model...
✓ Model saved to: models/bert_token_classification_2e-5_uncased_train_dev


## Inference Function

In [13]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
test_data = load_jsonl('../../data/test.json')
test_sentences = build_sentence_gold_map(test_data)

In [14]:
# Load the trained model for inference
print("Loading trained model for inference...")

inference_model = AutoModelForTokenClassification.from_pretrained(output_model_dir)
inference_tokenizer = AutoTokenizer.from_pretrained(output_model_dir)
inference_model.to(device)  # keep the model on the same device as the inputs
inference_model.eval()
id2label = inference_model.config.id2label

print(f"✓ Model loaded from: {output_model_dir}")

Loading trained model for inference...
✓ Model loaded from: models/bert_token_classification_2e-5_uncased_train_dev


In [15]:
def clean_term(term: str) -> str:
    t = term.strip()
    # normalize whitespace
    t = " ".join(t.split())
    # strip punctuation at the edges only (not inside the term)
    t = t.strip(string.punctuation + "«»“”'\"")
    return t.lower()


In [17]:
def perform_inference(model, tokenizer, text: str, id2label: dict) -> list[str]:
    """
    Perform token classification inference on a single text and
    extract TERM spans using the BIO scheme.
    """
    # Preprocess text exactly as in training
    text = preprocess_text(text)

    # Tokenize with offset mapping (we keep it in case we want spans later)
    encoded = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        return_offsets_mapping=True
    )

    # Move tensors to whichever device the model lives on
    encoded = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v
               for k, v in encoded.items()}

    offset_mapping = encoded.pop("offset_mapping")  # not used directly now

    with torch.no_grad():
        outputs = model(**encoded)
        logits = outputs.logits
        predictions = torch.nn.functional.softmax(logits, dim=-1)
        predicted_labels = torch.argmax(predictions, dim=-1)

    # Convert ids back to tokens and labels
    input_ids = encoded["input_ids"][0]
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    labels = [id2label[p.item()] for p in predicted_labels[0]]

    # Extract terms using BIO scheme
    predicted_terms = []
    current_term_tokens = []

    for token, label in zip(tokens, labels):
        # Skip special tokens
        if token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
            continue

        if label == "B-TERM":
            # If we were building a previous term, close it
            if current_term_tokens:
                term_str = tokenizer.convert_tokens_to_string(current_term_tokens)
                term_str = clean_term(term_str)
                if term_str:
                    predicted_terms.append(term_str)
            # Start a new term
            current_term_tokens = [token]

        elif label == "I-TERM" and current_term_tokens:
            # Continue the current term
            current_term_tokens.append(token)

        else:
            # Outside a term or I-TERM without a current B-TERM
            if current_term_tokens:
                term_str = tokenizer.convert_tokens_to_string(current_term_tokens)
                term_str = clean_term(term_str)
                if term_str:
                    predicted_terms.append(term_str)
                current_term_tokens = []

    # Close any pending term at the end
    if current_term_tokens:
        term_str = tokenizer.convert_tokens_to_string(current_term_tokens)
        term_str = clean_term(term_str)
        if term_str:
            predicted_terms.append(term_str)

    return predicted_terms
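
Rebuilding terms with `convert_tokens_to_string` can change the surface form: in the sample predictions further down, the input "CARTA/CARTONE" comes back as 'carta / cartone'. One alternative, sketched here under assumptions, is to slice each term out of the preprocessed text using the offset mapping that `perform_inference` already requests. `spans_from_offsets` is a hypothetical helper, and the offsets and labels below are invented (one token per word or punctuation mark) so the example runs without a model.

```python
def spans_from_offsets(text, labels, offsets):
    """Group B-/I-TERM tokens into character spans and slice them from the text."""
    terms, start, end = [], None, None
    for label, (tok_start, tok_end) in zip(labels, offsets):
        if tok_start == tok_end:              # special tokens map to (0, 0)
            continue
        if label == "B-TERM":
            if start is not None:             # close the previous term
                terms.append(text[start:end])
            start, end = tok_start, tok_end
        elif label == "I-TERM" and start is not None:
            end = tok_end                     # extend the current term
        else:
            if start is not None:
                terms.append(text[start:end])
            start = end = None
    if start is not None:                     # close a term that ends the sentence
        terms.append(text[start:end])
    return terms

text = '"carta/cartone"'
# Hypothetical tokenization: [CLS] " carta / cartone " [SEP]
offsets = [(0, 0), (0, 1), (1, 6), (6, 7), (7, 14), (14, 15), (0, 0)]
labels = ["O", "O", "B-TERM", "I-TERM", "I-TERM", "O", "O"]

terms = spans_from_offsets(text, labels, offsets)
print(terms)
```

Because the term is copied from the text itself, spacing around characters such as "/" or "'" is preserved exactly as it was in the input.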


In [18]:
# Run inference on all test sentences
import string

print("Running inference on test set...")
bert_preds = []
test_sentences = build_sentence_gold_map(test_data)
for i, sentence in enumerate(test_sentences):
    if i % 200 == 0:
        print(f"  Processing {i}/{len(test_sentences)}")
    
    predicted_terms = perform_inference(
        inference_model,
        inference_tokenizer,
        sentence["sentence_text"], 
        id2label
    )
    bert_preds.append(predicted_terms)

print(f"Inference completed: {len(bert_preds)} predictions")

Running inference on test set...
  Processing 0/1142
  Processing 200/1142
  Processing 400/1142
  Processing 600/1142
  Processing 800/1142
  Processing 1000/1142
Inference completed: 1142 predictions


In [20]:
# Save predictions in competition format
def save_predictions(predictions, sentences, output_path):
    """Save predictions in competition format."""
    output = {'data': []}
    for pred, sent in zip(predictions, sentences):
        output['data'].append({
            'document_id': sent['document_id'],
            'paragraph_id': sent['paragraph_id'],
            'sentence_id': sent['sentence_id'],
            'term_list': pred
        })
    
    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    print(f"✓ Saved {len(predictions)} predictions to {output_path}")


save_predictions(bert_preds, test_sentences, 'predictions/subtask_a_dev_bert_preds_train_dev.json') 

✓ Saved 1142 predictions to predictions/subtask_a_dev_bert_preds_train_dev.json


In [16]:
for i in range(50):
    print(test_sentences[i]["sentence_text"])
    print("→", bert_preds[i])
    print()


COMUNE DI AMATO
→ []

PROVINCIA DI CATANZARO
→ []

(UFFICIO DEL SINDACO)
→ []

Via Marconi, 14 – 88040 Amato (CZ)
→ []

protocollo.amato@asmepec.it - 0961993045
→ []

NUOVO CALENDARIO RACCOLTA RIFIUTI
→ ['calendario raccolta rifiuti']

(Decorrenza 12/06/2023)
→ []

CENTRO ABITATO
→ []

Lunedì
→ []

Martedì
→ []

Mercoledì
→ []

Giovedì
→ []

Venerdì
→ []

"UMIDO"
→ ['umido']

e
→ []

"INDIFFERENZIATO"
→ ['indifferenziato']

"VETRO"
→ ['vetro']

(1°, 3° ed eventuale 5° martedì del mese)
→ []

"UMIDO"
→ ['umido']

e
→ []

"CARTA/CARTONE"
→ ['carta / cartone']

"MULTIMATERIALE"
→ ['multimateriale']

(plastica, alluminio, lattine)
→ ['plastica', 'alluminio', 'lattine']

"UMIDO"
→ ['umido']

e
→ []

(Su richiesta)
→ []

Ausili per l'igiene di bambini, anziani e malati
→ []

CONTRADE
→ []

Lunedì
→ []

Giovedì
→ []

"INDIFFERENZIATO" e "VETRO"
→ ['indifferenziato', 'vetro']

"MULTIMATERIALE" (plastica, alluminio, lattine) e "CARTA/CARTONE"
→ ['multimateriale', 'plastica', 'alluminio', 'carta