TRIPS

This repository contains the code and supporting materials for the paper Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models, presented at the ParlaCLARIN V workshop at LREC 2026.

The project implements a pipeline for transcribing scanned Italian parliamentary proceedings, segmenting the resulting text into document elements, identifying speakers, and linking those speakers to entities in the Chamber of Deputies knowledge base.

Paper

Title: Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

Authors: Luigi Curini, Alfio Ferrara, Giovanni Pagano, Sergio Picascia

Venue: ParlaCLARIN V: Interoperability, Multilinguality, and Multimodality in Parliamentary Corpora, LREC 2026

arXiv: https://arxiv.org/abs/2603.28103

Repository Contents

.
|-- main.py                         # End-to-end transcription and entity-linking pipeline
|-- evaluation.py                   # Evaluation script for transcription and speaker-tagging metrics
|-- src/                            # Python package used by the pipeline
|   |-- config.py                   # Model, data, generation, and SPARQL configuration
|   |-- document_processor.py       # PDF download, page rendering, and OCR-text loading
|   |-- vlm_handler.py              # vLLM/Qwen-VL transcription wrapper
|   |-- postprocessor.py            # Sentence merging, speaker cleanup, and text fixes
|   |-- kb_retriever.py             # Chamber of Deputies SPARQL and local KB retrieval
|   |-- speaker_matcher.py          # Fuzzy speaker-to-URI matching
|   `-- utils.py                    # JSON and parsing utilities
|-- scripts/
|   |-- downloader_sedute.py        # Retrieve sitting metadata and PDF URLs from dati.camera.it
|   |-- downloader_deputati.py      # Build Chamber member metadata resources
|   |-- downloader_regno_30.py      # Helper for the 30th legislature of the Kingdom period
|   |-- wikidata_matcher.py         # Add Wikidata party information for Chamber people
|   `-- preprocessing/
|       |-- run_ocr_pipeline.sh     # Orchestrate DotsOCR preprocessing through vLLM
|       |-- dotsocr_inference.py    # Batch DotsOCR client for parliamentary PDFs
|       `-- generation_config.json  # Generation settings for the DotsOCR server
|-- data/                           # Metadata and lookup tables used by the pipeline
|-- requirements.txt                # Pinned Python dependencies
|-- pyproject.toml                  # Package metadata for editable installation
`-- LICENSE                         # GPL-3.0 license

Generated artifacts are written under output/ by default. Evaluation inputs and predictions are expected under evaluation/ when reproducing the paper metrics. The external DotsOCR checkout or model directory is not part of this repository.

Pipeline Overview

The pipeline has six main stages.

  1. Session metadata collection. The scripts in scripts/ retrieve sitting metadata, PDF links, legislature membership, and auxiliary knowledge-base resources from the Chamber of Deputies SPARQL endpoint and Wikidata.

  2. OCR-assisted layout extraction. scripts/preprocessing/run_ocr_pipeline.sh starts a DotsOCR vLLM server and runs dotsocr_inference.py over the PDFs listed in data/convocations.csv. The result is one OCR text file per page.

  3. Vision-language transcription. main.py renders each PDF page as an image, combines the image with the DotsOCR text, and sends both to the VLM configured in src/config.py. The model returns structured JSON objects with speaker, type, and content fields.

  4. Post-processing. src/postprocessor.py normalizes speaker labels, repairs common transcription artifacts, merges interrupted sentence fragments, and keeps page-level context.

  5. Speaker linking. src/kb_retriever.py retrieves people active in a parliamentary sitting, and src/speaker_matcher.py links extracted speaker names to Chamber URIs with a combination of role-aware rules and fuzzy matching.

  6. Evaluation. evaluation.py computes WER, CER, normalized WER/CER, and speaker-tagging precision, recall, and F1.
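
The fuzzy-matching half of stage 5 can be illustrated with a minimal sketch. This is not the code in src/speaker_matcher.py: it uses the standard-library difflib as a stand-in for whatever similarity measure the repository applies, omits the role-aware rules entirely, and the function names, the 0.8 threshold, and the example.org URIs are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so casing and spacing do not affect scores."""
    return " ".join(name.lower().split())


def match_speaker(label, candidates, threshold=0.8):
    """Return the URI of the most similar candidate name, or None.

    `candidates` maps a person's name to a URI (hypothetical structure).
    `threshold` is an illustrative cut-off, not a value taken from the repository.
    """
    best_uri, best_score = None, 0.0
    for name, uri in candidates.items():
        score = SequenceMatcher(None, normalize(label), normalize(name)).ratio()
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri if best_score >= threshold else None


# Hypothetical candidates for one sitting; real URIs come from the knowledge base.
candidates = {
    "Rossi Mario": "http://example.org/person/1",
    "Bianchi Luigi": "http://example.org/person/2",
}

# An OCR-style confusion ("0" for "o") still resolves to the intended person.
linked = match_speaker("ROSSI MARI0", candidates)
# A label that resembles no candidate stays unlinked.
unlinked = match_speaker("Presidente", candidates)
```

Restricting the candidates to people active in the sitting, as stage 5 describes, keeps the candidate set small and reduces the chance that a noisy label matches a homonym from another legislature.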

Installation

The code requires Python 3.10 or newer. The default configuration is designed for a CUDA machine with enough GPU memory to run vLLM-backed vision-language models. The published experiments use large VLMs, so CPU-only execution is not practical for full reproduction.

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

Some dependencies, especially torch, vllm, and DotsOCR, may also require system libraries, CUDA drivers, and model downloads outside this repository.

Configuration

Most run-time settings live in src/config.py:

  • MODEL_PATH and PROCESSOR_PATH select the VLM and processor.
  • CONVOCATIONS_FILE points to the sitting metadata CSV.
  • EARLY_OCR_DIR points to the directory containing per-page OCR text files.
  • OUTPUT_DIR controls where raw and processed predictions are written.
  • SAMPLE_DOCS_CONFIG can be set to a dictionary to run only a sample of the corpus during development.
  • SPARQL_ENDPOINT selects the Chamber of Deputies knowledge-base endpoint.

Before running the full pipeline, make sure EARLY_OCR_DIR matches the output of your DotsOCR preprocessing run. The preprocessing script writes page text under output/ocr/ by default; set EARLY_OCR_DIR to that path or to any other compatible directory.
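
A quick way to confirm that the OCR directory is populated before launching a long job is sketched below. The helper name count_ocr_pages and the .txt extension are assumptions; adjust the pattern to whatever your preprocessing run actually writes.

```python
from pathlib import Path


def count_ocr_pages(ocr_dir: str, pattern: str = "*.txt") -> int:
    """Count page-level OCR files under `ocr_dir`, searching subdirectories.

    The "*.txt" pattern is an assumption about the file extension.
    """
    root = Path(ocr_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"OCR directory not found: {root}")
    return sum(1 for path in root.rglob(pattern) if path.is_file())


# Example (path taken from the default mentioned above):
# print(count_ocr_pages("output/ocr"))
```

A count of zero, or a FileNotFoundError, suggests that EARLY_OCR_DIR and the preprocessing output directory are out of sync.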

Data and Metadata

The data/ directory contains the metadata needed to reproduce the experiments:

  • convocations.csv and related CSV files list parliamentary sittings, dates, PDF URLs, and assembly flags.
  • legislature_deputati.json, legislature_senatori.json, and legislature_map.json support session-level retrieval of eligible speakers.
  • deputati_uris.txt lists Chamber person URIs.
  • wikidata_party_data.json and wikidata_party_data.csv provide auxiliary party information retrieved from Wikidata.

To regenerate metadata, use the scripts in scripts/. Several of these scripts use relative paths and external SPARQL services; if you run them from a different working directory, adjust their paths accordingly.

Running OCR Preprocessing

From the preprocessing directory:

cd scripts/preprocessing
bash run_ocr_pipeline.sh

This script clones DotsOCR if needed, starts a local vLLM server, waits for the server health endpoint, and runs the batch client over the PDFs listed in data/convocations.csv. Outputs are written as page-level text files, grouped by sitting and PDF index.
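
The wait-for-health step can be pictured as a bounded polling loop. The sketch below is a Python illustration of that idea, not a transcription of run_ocr_pipeline.sh; the function names, the retry counts, and the URL in the usage comment (including host, port, and path) are assumptions.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 5.0, sleep=time.sleep) -> bool:
    """Call `probe()` up to `attempts` times, sleeping `delay` seconds between tries.

    Returns True as soon as the probe succeeds, False if it never does.
    """
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False


# Usage (assumed address; check the script for the real host and port):
# ready = wait_until_ready(lambda: http_ok("http://localhost:8000/health"))
```

Bounding the number of attempts matters on shared clusters: if the model fails to load, the job exits instead of holding a GPU allocation indefinitely.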

For large runs, check GPU allocation, CUDA_VISIBLE_DEVICES, HF_MODEL_PATH, and the vLLM memory settings in run_ocr_pipeline.sh before launching the job.

Running the Main Pipeline

After OCR preprocessing is available and src/config.py points to the correct OCR directory, run:

python main.py

The script processes the sittings listed in CONVOCATIONS_FILE, downloads each PDF, renders pages, runs the configured VLM, post-processes the model output, links speakers, and saves final JSON files under:

output/processed/

Each output record contains the transcribed text, document element type, inferred speaker label, page number, and linked speaker_uri values when a match is available.
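
As a sketch of how these records might be consumed downstream, the snippet below groups transcribed content by linked speaker. The field names speaker, type, content, and speaker_uri follow the descriptions in this README; the page key, the "speech" type value, the assumption that each file holds a JSON list of records with a single speaker_uri per record, and the example.org URI are illustrative guesses.

```python
import json
from collections import defaultdict


def load_records(path):
    """Load one processed file, assumed to contain a JSON list of records."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def speeches_by_speaker(records):
    """Group record contents by speaker_uri, falling back to the raw speaker label."""
    grouped = defaultdict(list)
    for rec in records:
        key = rec.get("speaker_uri") or rec.get("speaker") or "UNKNOWN"
        grouped[key].append(rec.get("content", ""))
    return dict(grouped)


# Hypothetical records shaped like the description above.
sample = [
    {"speaker": "PRESIDENTE", "type": "speech", "content": "La seduta è aperta.",
     "page": 1, "speaker_uri": None},
    {"speaker": "ROSSI", "type": "speech", "content": "Chiedo di parlare.",
     "page": 1, "speaker_uri": "http://example.org/person/1"},
    {"speaker": "ROSSI", "type": "speech", "content": "Grazie.",
     "page": 2, "speaker_uri": "http://example.org/person/1"},
]

grouped = speeches_by_speaker(sample)
```

Falling back to the raw label keeps unlinked speakers visible, which is useful when inspecting where the linking stage fails.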

Evaluation

The evaluation reported in the paper was conducted on the gold-standard files provided by the IPSA repository. The evaluation workflow expects those text, image, and XML files under evaluation/ and writes predictions to evaluation/predictions/.

python evaluation.py

The script reports:

  • WER and CER for the full DotsOCR + VLM pipeline.
  • Normalized WER and CER after lowercasing, punctuation removal, and whitespace normalization.
  • Speaker-tagging precision, recall, and F1 using bag-of-entities matching.
  • Intermediate DotsOCR-only WER/CER metrics.

The final summary is written to evaluation/predictions/metrics.txt.
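
For readers unfamiliar with these metrics, the following self-contained sketch shows one common way to compute them. It is not the code in evaluation.py: the normalization strips only ASCII punctuation, and treating "bag-of-entities" as a multiset comparison of gold and predicted speaker labels is an interpretation, not something this README specifies. All function names are hypothetical.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


def bag_prf(gold, pred):
    """Precision, recall, and F1 over multisets of entity labels."""
    g, p = Counter(gold), Counter(pred)
    tp = sum((g & p).values())  # multiset intersection counts the true positives
    precision = tp / sum(p.values()) if p else 0.0
    recall = tp / sum(g.values()) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this sketch, a reference "La seduta, è aperta." and a hypothesis "la seduta è aperta" differ in raw WER but score 0.0 after normalize, which is the gap the normalized metrics are meant to expose.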

Reproducibility Notes

  • The pipeline depends on external services: the Chamber of Deputies SPARQL endpoint, Wikidata, Hugging Face model hosting, and source PDF URLs.
  • Model outputs can vary when model versions, generation backends, CUDA libraries, or vLLM versions change. The repository pins Python dependencies in requirements.txt; record any model or hardware changes when comparing runs.
  • Use SAMPLE_DOCS_CONFIG in src/config.py for small smoke tests before launching corpus-scale jobs.
  • Generated outputs, local evaluation artifacts, model checkouts, and DotsOCR files should remain outside version control unless they are intentionally being released as part of a data artifact.

License

The code is released under the GNU General Public License v3.0. See LICENSE for full terms.

Citation

If you use this repository, please cite the paper. Once the official workshop proceedings entry is available, cite that version. Until then, cite the arXiv record:

@misc{curini2026transcription,
  title = {Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models},
  author = {Curini, Luigi and Ferrara, Alfio and Pagano, Giovanni and Picascia, Sergio},
  year = {2026},
  eprint = {2603.28103},
  archivePrefix = {arXiv},
  primaryClass = {cs.DL},
  doi = {10.48550/arXiv.2603.28103},
  url = {https://arxiv.org/abs/2603.28103},
  note = {Presented at ParlaCLARIN V: Interoperability, Multilinguality, and Multimodality in Parliamentary Corpora, LREC 2026}
}
