This repository contains the code and supporting materials for the paper Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models, presented at the ParlaCLARIN V workshop at LREC 2026.
The project implements a pipeline for transcribing scanned Italian parliamentary proceedings, segmenting the resulting text into document elements, identifying speakers, and linking those speakers to entities in the Chamber of Deputies knowledge base.
Title: Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models
Authors: Luigi Curini, Alfio Ferrara, Giovanni Pagano, Sergio Picascia
Venue: ParlaCLARIN V: Interoperability, Multilinguality, and Multimodality in Parliamentary Corpora, LREC 2026
arXiv: https://arxiv.org/abs/2603.28103
.
|-- main.py # End-to-end transcription and entity-linking pipeline
|-- evaluation.py # Evaluation script for transcription and speaker-tagging metrics
|-- src/ # Python package used by the pipeline
|   |-- config.py # Model, data, generation, and SPARQL configuration
|   |-- document_processor.py # PDF download, page rendering, and OCR-text loading
|   |-- vlm_handler.py # vLLM/Qwen-VL transcription wrapper
|   |-- postprocessor.py # Sentence merging, speaker cleanup, and text fixes
|   |-- kb_retriever.py # Chamber of Deputies SPARQL and local KB retrieval
|   |-- speaker_matcher.py # Fuzzy speaker-to-URI matching
|   `-- utils.py # JSON and parsing utilities
|-- scripts/
|   |-- downloader_sedute.py # Retrieve sitting metadata and PDF URLs from dati.camera.it
|   |-- downloader_deputati.py # Build Chamber member metadata resources
|   |-- downloader_regno_30.py # Helper for the 30th legislature of the Kingdom period
|   |-- wikidata_matcher.py # Add Wikidata party information for Chamber people
|   `-- preprocessing/
|       |-- run_ocr_pipeline.sh # Orchestrate DotsOCR preprocessing through vLLM
|       |-- dotsocr_inference.py # Batch DotsOCR client for parliamentary PDFs
|       `-- generation_config.json # Generation settings for the DotsOCR server
|-- data/ # Metadata and lookup tables used by the pipeline
|-- requirements.txt # Pinned Python dependencies
|-- pyproject.toml # Package metadata for editable installation
`-- LICENSE # GPL-3.0 license
Generated artifacts are written under output/ by default. Evaluation inputs and
predictions are expected under evaluation/ when reproducing the paper metrics.
The external DotsOCR checkout or model directory is not part of this repository.
The pipeline has six main stages.
- Session metadata collection. The scripts in scripts/ retrieve sitting metadata, PDF links, legislature membership, and auxiliary knowledge-base resources from the Chamber of Deputies SPARQL endpoint and Wikidata.
- OCR-assisted layout extraction. scripts/preprocessing/run_ocr_pipeline.sh starts a DotsOCR vLLM server and runs dotsocr_inference.py over the PDFs listed in data/convocations.csv. The result is one OCR text file per page.
- Vision-language transcription. main.py renders each PDF page as an image, combines the image with the DotsOCR text, and sends both to the VLM configured in src/config.py. The model returns structured JSON objects with speaker, type, and content fields.
- Post-processing. src/postprocessor.py normalizes speaker labels, repairs common transcription artifacts, merges interrupted sentence fragments, and keeps page-level context.
- Speaker linking. src/kb_retriever.py retrieves people active in a parliamentary sitting, and src/speaker_matcher.py links extracted speaker names to Chamber URIs with a combination of role-aware rules and fuzzy matching.
- Evaluation. evaluation.py computes WER, CER, normalized WER/CER, and speaker-tagging precision, recall, and F1.
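To give a feel for the fuzzy-matching half of the speaker-linking stage, here is a minimal sketch built on the standard library's difflib. It is not the code in src/speaker_matcher.py: the function names, the 0.8 threshold, and the example.org URIs are all assumptions made for illustration, and the role-aware rules mentioned above are left out.

```python
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return " ".join(kept.lower().split())


def match_speaker(label: str, candidates: dict, threshold: float = 0.8):
    """Return the URI of the most similar candidate name, or None.

    `candidates` maps a person's name to a URI (hypothetical structure).
    `threshold` is an illustrative cut-off, not a value from the paper.
    """
    target = normalize_name(label)
    best_uri, best_score = None, 0.0
    for name, uri in candidates.items():
        score = SequenceMatcher(None, target, normalize_name(name)).ratio()
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri if best_score >= threshold else None


# Placeholder candidates; real URIs would come from the Chamber knowledge base.
candidates = {
    "Rossi Mario": "http://example.org/person/1",
    "Bianchi Luigi": "http://example.org/person/2",
}
```

With these placeholders, `match_speaker("ROSSI MARIO.", candidates)` resolves to the first URI, and a label with no close candidate returns `None`, which leaves the record unlinked rather than forcing a wrong match.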
The code requires Python 3.10 or newer. The default configuration is designed for a CUDA machine with enough GPU memory to run vLLM-backed vision-language models. The published experiments use large VLMs, so CPU-only execution is not practical for full reproduction.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

Some dependencies, especially torch, vllm, and DotsOCR, may also
require system libraries, CUDA drivers, and model downloads outside this
repository.
Most run-time settings live in src/config.py:
- MODEL_PATH and PROCESSOR_PATH select the VLM and processor.
- CONVOCATIONS_FILE points to the sitting metadata CSV.
- EARLY_OCR_DIR points to the directory containing per-page OCR text files.
- OUTPUT_DIR controls where raw and processed predictions are written.
- SAMPLE_DOCS_CONFIG can be set to a dictionary to run only a sample of the corpus during development.
- SPARQL_ENDPOINT selects the Chamber of Deputies knowledge-base endpoint.
Before running the full pipeline, make sure EARLY_OCR_DIR matches the output of
your DotsOCR preprocessing run. The preprocessing script writes page text under
output/ocr/ by default, while the main configuration can be adjusted to any
compatible directory.
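A quick way to confirm that the OCR directory is populated before a long run is to count the page files below it. The helper below is a hypothetical sketch, not part of the repository; in particular, the `*.txt` pattern is an assumption about how the per-page OCR files are named.

```python
from pathlib import Path


def count_ocr_pages(ocr_dir: str, pattern: str = "*.txt") -> int:
    """Count OCR text files found anywhere below `ocr_dir`.

    Raises FileNotFoundError when the directory does not exist, so a
    misconfigured path fails early instead of partway through a run.
    """
    root = Path(ocr_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"OCR directory not found: {root}")
    return sum(1 for path in root.rglob(pattern) if path.is_file())
```

Calling `count_ocr_pages("output/ocr")` from the repository root and comparing the result with the expected number of pages is a cheap check that the value of EARLY_OCR_DIR and the preprocessing output agree.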
The data/ directory contains the metadata needed to reproduce the experiments:
- convocations.csv and related CSV files list parliamentary sittings, dates, PDF URLs, and assembly flags.
- legislature_deputati.json, legislature_senatori.json, and legislature_map.json support session-level retrieval of eligible speakers.
- deputati_uris.txt lists Chamber person URIs.
- wikidata_party_data.json and wikidata_party_data.csv provide auxiliary party information retrieved from Wikidata.
To regenerate metadata, use the scripts in scripts/. Several of these scripts
use relative paths and external SPARQL services; if you run them from a different
working directory, adjust their paths accordingly.
From the preprocessing directory:
cd scripts/preprocessing
bash run_ocr_pipeline.sh

This script clones DotsOCR if needed, starts a local vLLM server, waits for the
server health endpoint, and runs the batch client over the PDFs listed in
data/convocations.csv. Outputs are written as page-level text files, grouped by
sitting and PDF index.
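The wait-for-health step is done in shell by run_ocr_pipeline.sh. For readers who want to drive the server from Python instead, the same idea can be sketched as a polling loop. Everything here is illustrative: the function names, the timeout and interval defaults, and the URL in the usage note are assumptions, not values taken from the script.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> Callable[[], bool]:
    """Build a probe that reports True when `url` answers with HTTP 200."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    return probe


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or HTTP error: the server is not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A call such as `wait_until_ready(http_probe("http://localhost:8000/health"))` would block until the server responds; the host, port, and path must match whatever the local server actually exposes.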
For large runs, check GPU allocation, CUDA_VISIBLE_DEVICES, HF_MODEL_PATH,
and the vLLM memory settings in run_ocr_pipeline.sh before launching the job.
After OCR preprocessing is available and src/config.py points to the correct
OCR directory, run:
python main.py

The script processes the sittings listed in CONVOCATIONS_FILE, downloads each
PDF, renders pages, runs the configured VLM, post-processes the model output,
links speakers, and saves final JSON files under:
output/processed/
Each output record contains the transcribed text, document element type, inferred
speaker label, page number, and linked speaker_uri values when a match is
available.
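As a sketch of how such records might be consumed downstream, the snippet below groups transcribed text by linked speaker. The exact JSON schema is an assumption: the `speaker`, `type`, and `content` keys follow the fields named for the model output, `speaker_uri` follows the text above, while the `page` key, the `type` values, the sample sentences, and the example.org URIs are invented for illustration. Because the shape of `speaker_uri` is not specified here, the helper accepts either a single string or a list.

```python
import json
from collections import defaultdict


def speeches_by_speaker(records):
    """Group record content by linked speaker URI.

    Records without a linked URI are collected under the key None.
    """
    grouped = defaultdict(list)
    for record in records:
        uris = record.get("speaker_uri") or [None]
        if isinstance(uris, str):
            uris = [uris]
        for uri in uris:
            grouped[uri].append(record["content"])
    return dict(grouped)


# Illustrative records only; real files are written under output/processed/.
sample = json.loads("""
[
  {"speaker": "PRESIDENTE", "type": "speech", "page": 1,
   "content": "La seduta comincia alle 15.",
   "speaker_uri": ["http://example.org/person/1"]},
  {"speaker": null, "type": "heading", "page": 1,
   "content": "Ordine del giorno", "speaker_uri": []},
  {"speaker": "PRESIDENTE", "type": "speech", "page": 2,
   "content": "Ha facolta di parlare.",
   "speaker_uri": "http://example.org/person/1"}
]
""")

grouped = speeches_by_speaker(sample)
```

Keeping unlinked records under `None` makes it easy to inspect which elements (headings, procedural notes, or unmatched speakers) did not receive a URI.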
The evaluation reported in the paper was conducted on the gold-standard files
provided by the IPSA repository. The evaluation
workflow expects those text, image, and XML files under evaluation/ and writes
predictions to evaluation/predictions/.
python evaluation.py

The script reports:
- WER and CER for the full DotsOCR + VLM pipeline.
- Normalized WER and CER after lowercasing, punctuation removal, and whitespace normalization.
- Speaker-tagging precision, recall, and F1 using bag-of-entities matching.
- Intermediate DotsOCR-only WER/CER metrics.
The final summary is written to evaluation/predictions/metrics.txt.
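For reference, the metric families listed above can be sketched in plain Python. This is not the code in evaluation.py. It assumes the textbook definitions: edit distance divided by reference length for WER and CER, ASCII-only punctuation removal for normalization, and a multiset reading of bag-of-entities matching in which repeated speaker labels count once per occurrence. Function names are hypothetical.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def bag_of_entities_prf(gold, pred):
    """Precision, recall, and F1 over multisets of speaker labels."""
    gold_bag, pred_bag = Counter(gold), Counter(pred)
    true_pos = sum((gold_bag & pred_bag).values())
    precision = true_pos / sum(pred_bag.values()) if pred_bag else 0.0
    recall = true_pos / sum(gold_bag.values()) if gold_bag else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Normalized WER and CER then amount to `wer(normalize(ref), normalize(hyp))` and `cer(normalize(ref), normalize(hyp))`. Because the bag-of-entities comparison ignores order, a speaker tagged on the wrong line but present on the page still counts as a hit, which is worth keeping in mind when reading the F1 figures.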
- The pipeline depends on external services: the Chamber of Deputies SPARQL endpoint, Wikidata, Hugging Face model hosting, and source PDF URLs.
- Large model outputs can vary when model versions, generation backends, CUDA libraries, or vLLM versions change. The repository pins Python dependencies in requirements.txt; record any model or hardware changes when comparing runs.
- Use SAMPLE_DOCS_CONFIG in src/config.py for small smoke tests before launching corpus-scale jobs.
- Generated outputs, local evaluation artifacts, model checkouts, and DotsOCR files should remain outside version control unless they are intentionally being released as part of a data artifact.
The code is released under the GNU General Public License v3.0. See LICENSE for
full terms.
If you use this repository, please cite the paper. Once the official workshop proceedings entry is available, cite that version. Until then, cite the arXiv record:
@misc{curini2026transcription,
title = {Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models},
author = {Curini, Luigi and Ferrara, Alfio and Pagano, Giovanni and Picascia, Sergio},
year = {2026},
eprint = {2603.28103},
archivePrefix = {arXiv},
primaryClass = {cs.DL},
doi = {10.48550/arXiv.2603.28103},
url = {https://arxiv.org/abs/2603.28103},
note = {Presented at ParlaCLARIN V: Interoperability, Multilinguality, and Multimodality in Parliamentary Corpora, LREC 2026}
}