scriptogre/romanian-law-data

Romanian Law Data

Zstd-compressed Parquet exports of the Romanian legal corpus (acts, articles, paragraphs) for use with DuckDB. Sourced from legislatie.just.ro (Ministry of Justice) via its public SOAP API.

The export runs daily via GitHub Actions. Download the latest bundle from Releases.

Tables

| Table | Content | Rows |
|---|---|---|
| acte | One row per act (LEGE, OUG, HG, ORDIN, DECIZIE, …) with metadata + full text | ~187k |
| articole | One row per article (parsed from acte.content) | ~993k |
| alineate | One row per paragraph, the finest citation unit (e.g. art. 188 alin. (1)) | ~1.96M |
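To make the citation unit concrete, here is a minimal sketch that splits a string such as "art. 188 alin. (1)" into an article number and an optional paragraph number. The `Citation` type and `parse_citation` helper are hypothetical names used for illustration. They are not part of this repository, and the pattern covers only the simple form shown above.

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    """Hypothetical container for a parsed citation (not from the repo)."""
    article: int
    paragraph: Optional[int]


# Matches "art. 188" optionally followed by "alin. (1)".
# Assumption: plain integer numbering only; other citation forms are ignored.
_CITATION_RE = re.compile(
    r"art\.\s*(\d+)(?:\s+alin\.\s*\((\d+)\))?",
    re.IGNORECASE,
)


def parse_citation(text: str) -> Optional[Citation]:
    """Return the first article/paragraph citation found in text, if any."""
    match = _CITATION_RE.search(text)
    if match is None:
        return None
    article = int(match.group(1))
    paragraph = int(match.group(2)) if match.group(2) else None
    return Citation(article, paragraph)
```

A row in alineate would correspond to a citation with both parts present, while a row in articole corresponds to the article part alone.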

Tables use Romanian legal vocabulary (acte, articole, alineate); columns use English SQL convention (type, published_at, gazette_number, …) with Romanian COMMENT ON metadata in create_views.sql.

Subject lenses

create_views.sql also exposes pre-filtered views over acte for each canonical code and for jurisprudence:

| View | Filters |
|---|---|
| constitutie | Constituția României (1991, republicată 2003) |
| cod_civil | Legea 287/2009 |
| cod_penal | Legea 286/2009 |
| cod_muncii | Legea 53/2003 (republicată) |
| cod_procedura_civila | Legea 134/2010 (republicată) |
| cod_procedura_penala | Legea 135/2010 |
| cod_fiscal | Legea 227/2015 |
| jurisprudenta | CCR + ÎCCJ decisions |

Usage

# Download the latest bundle and unpack it into data/
gh release download -R scriptogre/romanian-law-data
mkdir -p data
tar xzf laws.tar.gz -C data/

# Query article 188 of the Penal Code (Python)
import duckdb

conn = duckdb.connect()
# Register the tables and views shipped with the bundle
conn.execute(open("data/create_views.sql").read())
rows = conn.execute("""
    SELECT act_citation, link, article_citation, content
    FROM articole
    WHERE act_id IN (SELECT id FROM cod_penal)
      AND article_number = 188
""").fetchall()

Pipeline

collect.py    SOAP API → data/raw_acts.jsonl
normalize.py  fix encoding, dedup, extract dates + gazette → stdout
parse.py      extract articles + alineate                  → stdout
export.py     write parquet bundle + sha256                ← stdin

collect checkpoints its output to data/raw_acts.jsonl because the SOAP API is slow and rate-limited, so the raw responses are worth caching. normalize → parse → export run as a single pipe, with no intermediate JSONL on disk.
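The one-pipe design means each stage reads JSON Lines from stdin and writes JSON Lines to stdout. The sketch below shows the general shape of such a stage, using deduplication as the example. It is not the repository's normalize.py: the `id` key and the function names `dedup_records` and `run_stage` are assumptions made for illustration.

```python
import json
import sys
from typing import Iterable, Iterator, Optional, TextIO


def dedup_records(lines: Iterable[str], key: str = "id") -> Iterator[dict]:
    """Yield each JSONL record once, keyed on a (hypothetical) id field."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        if record[key] in seen:
            continue
        seen.add(record[key])
        yield record


def run_stage(src: Optional[TextIO] = None, dst: Optional[TextIO] = None) -> None:
    """Stream records from src to dst one at a time, never holding the corpus."""
    src = sys.stdin if src is None else src
    dst = sys.stdout if dst is None else dst
    for record in dedup_records(src):
        # ensure_ascii=False keeps Romanian diacritics readable in the output
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling `run_stage()` with no arguments turns the module into a filter that can sit between two `|` symbols. Only the set of seen keys stays in memory, which is what lets the stages run without intermediate files.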

uv sync
uv run python -m scripts.collect
uv run python -m scripts.normalize \
  | uv run python -m scripts.parse \
  | uv run python -m scripts.export
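Since export.py writes a sha256 alongside the bundle, a consumer can check a downloaded archive before unpacking it. The sketch below is one way to do that with the standard library. The helper names are hypothetical, and the name and format of the published checksum file are not specified here, so the expected digest is passed in as a plain hex string.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are not loaded into memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex: str) -> bool:
    """Compare a file's SHA-256 with an expected hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_checksum("laws.tar.gz", expected)` returns True only when the archive's digest equals `expected`, where `expected` is the hex string read from wherever the release publishes it.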

License

The corpus is published by the Romanian Ministry of Justice and is public information. This repository only provides format conversion + pipeline tooling.
