Dali is open evidentiary infrastructure for evaluating whether AI-generated legal citations remain reproducible, attributable, and defensible under scrutiny.
Dali is designed for probabilistic AI systems operating in high-consequence legal environments where provenance and reconstructability matter.
A citation checker asks whether a citation exists. Dali asks whether the citation, retrieval pathway, verification state, and policy context can still be reconstructed months or years later during litigation, audit, or appellate review.
| Concept | What it means |
|---|---|
| Citation integrity | Whether the cited authority exists and resolves to a real source |
| Workflow reconstructability | Whether the pathway that produced the citation can be traced |
| Reconstructable evidence | Whether the result can be reproduced and re-verified under a versioned policy |
                 Legal AI workflow
                         |
                         v
                Citation generated
                         |
                         v
        Can this workflow be reconstructed
            and replayed if challenged?
                         |
                         v
                  Dali evaluates
                         |
      +-----------+------------+-------------+
      |           |            |             |
      v           v            v             v
 Attribution Provenance  Replayability Defensibility
Dali produces a versioned CitationIntegrityResult for every evaluated citation, including reproducible scoring metadata and evidence hashes so benchmark runs can be replayed consistently over time.
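As a rough illustration of what a deterministic evidence hash can look like, the sketch below hashes a canonical JSON serialization of a result record. The field names, the policy version strings, and the `evidence_hash` helper are hypothetical; they are not taken from Dali's actual CitationIntegrityResult schema.

```python
import hashlib
import json


def evidence_hash(record: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON form of `record`.

    Sorting keys and fixing separators makes the serialization, and
    therefore the hash, independent of dict insertion order.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical record; field names are illustrative assumptions.
result = {
    "case_id": "mata-v-avianca-2023",
    "verification": "FAILED",
    "policy_version": "example-policy-1",
}
# Same content, different key order.
reordered = {
    "policy_version": "example-policy-1",
    "verification": "FAILED",
    "case_id": "mata-v-avianca-2023",
}

same = evidence_hash(result) == evidence_hash(reordered)
changed = evidence_hash({**result, "policy_version": "example-policy-2"}) != evidence_hash(result)
print(same, changed)
```

Including the policy version in the hashed record means a replay under a different policy yields a different hash, while a replay under the same policy and inputs yields the same one.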
| Tier | Corpus | Purpose |
|---|---|---|
| Tier 1 | Court-documented citation failures (e.g. Mata v. Avianca) | Deterministic, policy-versioned ground truth |
| Tier 2 | Synthetic probe corpus across US, UK / Commonwealth, Brazil, adversarial traps, and cross-jurisdictional policy/academic | Live model evaluation |
Tier 1 is the benchmark standard. Tier 2 extends evaluation to model-facing prompt behavior.
450 prompt evaluations across 3 OpenAI models produced 524 citations in aggregate, evaluated under a deterministic, policy-versioned verification pipeline.
Tier 1 corpus (canonical standard): 3 scoring-eligible cases (Mata v. Avianca, US v. Cohen, Park v. Kim). Expanding this corpus is the highest-priority contribution track; see CONTRIBUTING.md. The 524-citation figures above are Tier 2 synthetic probe results.
0% 25% 50% 75% 100%
├──────────┼──────────┼──────────┼──────────┤
GPT-4o-mini 49% ████████████░░░░░░░░░░░░░ → 94 cites, 16% return HTTP 404
GPT-4.1 94% ████████████████████████░ → 374 cites, 23% return HTTP 404
GPT-4o 26% ██████░░░░░░░░░░░░░░░░░░░ → 56 cites, 20% return HTTP 404
GPT-4.1 was the most engaged model and the most fabrication-prone: of its 374 citations, 86 point to URLs that do not exist. On adversarial citation-trap prompts specifically, GPT-4.1 took the bait 76% of the time, fabricating 48% of those URLs.
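The headline figures can be cross-checked with simple arithmetic. The sketch below uses only the counts quoted in this section; the variable names are illustrative.

```python
# Citation counts per model, as reported in the chart above.
cites = {"GPT-4o-mini": 94, "GPT-4.1": 374, "GPT-4o": 56}

# The per-model counts should add up to the aggregate figure.
total = sum(cites.values())

# GPT-4.1: 86 of its 374 citations point to URLs that do not exist.
gpt41_404_rate = 86 / cites["GPT-4.1"]

print(total)                    # 524
print(f"{gpt41_404_rate:.0%}")  # 23%
```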
US-only legal benchmarks underweight risk in places where AI legal tooling is being deployed but training-data coverage is thinner. Aggregated across all 524 generated citations, grouped by jurisdictional evaluation track:
| Jurisdiction track | Verified (HTTP 200) | Confirmed fabricated (HTTP 404) |
|---|---|---|
| UK / Commonwealth (UKSC, BAILII) | 76% | 5% |
| Cross-jurisdictional research / policy | 57% | 27% |
| US legal (cases, statutes, contracts) | 33% | 17% |
| Adversarial citation traps | 29% | 47% |
| Brazil (Portuguese, civil law) | 3% | 9% |
UK common-law citation structure appears to transfer well from training data. Brazilian Portuguese civil-law citations transferred worst of all evaluated tracks, with only 3% resolving successfully under deterministic verification. This is the risk that US-only benchmarks underweight: jurisdictions where legal citation structure, language distribution, and public-authority coverage differ materially from the dominant English-language training corpora. A cross-jurisdictional benchmark surfaces these gaps before AI output reaches a court.
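The two columns in the table correspond to HTTP outcomes for each cited URL. A minimal sketch of that bucketing follows, assuming each citation has already been resolved to a status code. The `classify` and `rates_by_track` helpers, the track labels, and the sample data are hypothetical, and Dali's actual pipeline may treat redirects, timeouts, and other statuses differently.

```python
from collections import Counter, defaultdict


def classify(status_code: int) -> str:
    """Bucket an HTTP status the way the table's columns are labeled."""
    if status_code == 200:
        return "verified"
    if status_code == 404:
        return "fabricated"
    # Anything else is neither verified nor confirmed fabricated.
    return "other"


def rates_by_track(records):
    """Map each track to the share of its citations in each bucket."""
    counts = defaultdict(Counter)
    for track, status_code in records:
        counts[track][classify(status_code)] += 1
    return {
        track: {bucket: n / sum(c.values()) for bucket, n in c.items()}
        for track, c in counts.items()
    }


# Made-up sample data, not benchmark results.
sample = [("uk", 200), ("uk", 200), ("uk", 404), ("uk", 403), ("br", 404), ("br", 500)]
rates = rates_by_track(sample)
print(rates["uk"]["verified"])  # 0.5
```

Under this reading, citations that return neither 200 nor 404 fall into a third bucket, which would be consistent with the two table columns not summing to 100%.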
→ Bar charts, per-model leaderboard, full per-jurisdiction breakdown, methodology, and reproducible run instructions: results/v0.2/
git clone https://github.com/yenk/Dali
cd Dali
python -m venv .venv

Activate the environment:
# Bash / Zsh
source .venv/bin/activate
# Fish
source .venv/bin/activate.fish

pip install -r requirements.txt
python runners/run_integrity.py \
--corpus benchmarks/tier1/corpus/citation_failure_cases.json \
--output results/demo/integrity.json

This runs the deterministic Tier 1 evaluator locally. No API keys or hosted services are required.
Expected output:
INFO run_integrity: loading corpus: benchmarks/tier1/corpus/citation_failure_cases.json
INFO run_integrity: corpus: 4 total, 3 scoring-eligible, 0 pre-canonical, 1 needs-verification
INFO run_integrity: evaluating 3 record(s)
INFO run_integrity: evaluating: mata-v-avianca-2023
INFO run_integrity: evaluating: us-v-cohen-2023
INFO run_integrity: evaluating: mata-derivative-reporter-swap-001
INFO run_integrity: wrote 3 result(s) to results/demo/integrity.json
--- Integrity Run Summary ---
case_id: mata-v-avianca-2023
authority: Mata v. Avianca, Inc.
citation: Varghese v. China Southern Airlines Co., 925 F.3d 1339 (11th Cir. 2019)
source_url: https://www.courtlistener.com/docket/63107798/mata-v-avianca-inc/
verification: FAILED
recoverability: infeasible
risk: critical
case_id: us-v-cohen-2023
authority: United States v. Cohen (post-conviction motion citation incident)
citation: Three nonexistent federal decisions cited in a supervised-release termination mo...
source_url: https://www.courtlistener.com/docket/8009608/united-states-v-cohen/
verification: FAILED
recoverability: infeasible
risk: critical
Each result is a CitationIntegrityResult artifact with reconstructability, defensibility risk, verification recoverability, and a deterministic evidence hash.
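To inspect the output programmatically, something like the following may work. It assumes the results file is a JSON array of objects whose keys match the labels in the printed summary (`case_id`, `verification`, `risk`). That layout is an assumption, and `summarize_results` is a hypothetical helper, so check the actual file before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_results(path):
    """Count results by risk level and list case IDs that failed verification.

    Assumes a JSON array of objects with `case_id`, `verification`,
    and `risk` keys (an assumed layout, not a documented schema).
    """
    results = json.loads(Path(path).read_text(encoding="utf-8"))
    by_risk = Counter(r.get("risk", "unknown") for r in results)
    failed = [r["case_id"] for r in results if r.get("verification") == "FAILED"]
    return by_risk, failed


if __name__ == "__main__":
    path = Path("results/demo/integrity.json")
    # Only summarize if the Tier 1 run above has produced the file.
    if path.exists():
        by_risk, failed = summarize_results(path)
        print(dict(by_risk))
        print(failed)
```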
For Tier 2 setup, model registry, and benchmark commands, see docs/examples.md.
Using the canonical corpus and the shared CitationIntegrityResult contract, you can:
- evaluate AI-assisted citation workflows against real court-documented failures
- measure provenance continuity and workflow reconstructability
- test retrieval and RAG systems for authority integrity regressions
- compare citation integrity behavior across models or pipeline versions
- replay evaluations under fixed policy versions for reproducibility
- produce deterministic benchmark artifacts and evidence hashes
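One way to use evidence hashes for regression testing is to compare two runs case by case. The sketch below assumes each run is available as a mapping from case ID to evidence hash; the `diff_runs` helper and the sample hashes are hypothetical.

```python
def diff_runs(baseline: dict, candidate: dict) -> dict:
    """Compare two {case_id: evidence_hash} mappings.

    Returns case IDs whose hash changed, plus IDs present in only one run.
    """
    shared = baseline.keys() & candidate.keys()
    return {
        "changed": sorted(c for c in shared if baseline[c] != candidate[c]),
        "missing": sorted(baseline.keys() - candidate.keys()),
        "added": sorted(candidate.keys() - baseline.keys()),
    }


# Made-up hashes for illustration only.
baseline = {"case-a": "aaa111", "case-b": "bbb222"}
candidate = {"case-a": "aaa111", "case-b": "ccc333", "case-c": "ddd444"}

print(diff_runs(baseline, candidate))
# {'changed': ['case-b'], 'missing': [], 'added': ['case-c']}
```

An empty `changed` list under the same policy version is what a reproducible replay would be expected to show.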
Near-term roadmap:
- eyecite integration as the canonical legal citation parser
- CourtListener-backed canonical citation schema and resolution layer
- Evidence JSON v1.0 RFC publication
- expanded cross-jurisdiction benchmark corpus (UK/Commonwealth, Brazil)
- deterministic replay and reproducibility artifacts
- multi-model comparison runs across OpenAI, Claude, Gemini, and open-weight models
- expanded benchmark coverage for misattribution, proposition drift, and fabricated authority detection
- contributor and academic partnership expansion around legal AI reproducibility research
Longer-range direction: docs/roadmap.md.
See CONTRIBUTING.md for the quick start, corpus field reference, and contribution tracks. Open issues are tagged "good first issue" and "help wanted".
For methodology, scoring rubric, and policy versioning, see METHODOLOGY.md and docs/policy-versioning.md.
To cite Dali, see CITATION.cff, or use:
@software{dali-2026,
title = {Dali: Evidentiary Infrastructure for Legal AI},
author = {Kha, Yen},
year = {2026},
version = {0.2.0},
organization = {GammaLex AI Inc.},
url = {https://github.com/yenk/Dali},
note = {Evaluates whether AI-generated legal citations remain reproducible, attributable, and defensible under scrutiny}
}

License: MIT. See LICENSE.
Dali is an open evidentiary infrastructure project for legal AI systems.
Maintained by GammaLex AI Inc. Primary author: Yen Kha.