Dali

Dali is open evidentiary infrastructure for evaluating whether AI-generated legal citations remain reproducible, attributable, and defensible under scrutiny.

Dali is designed for probabilistic AI systems operating in high-consequence legal environments where provenance and reconstructability matter.

A citation checker asks whether a citation exists. Dali asks whether the citation, retrieval pathway, verification state, and policy context can still be reconstructed months or years later during litigation, audit, or appellate review.

Core concepts

Concept	What it means
Citation integrity	Whether the cited authority exists and resolves to a real source
Workflow reconstructability	Whether the pathway that produced the citation can be traced
Reconstructable evidence	Whether the result can be reproduced and re-verified under a versioned policy

How it works

        Legal AI workflow
                |
                v
         Citation generated
                |
                v
   Can this workflow be reconstructed
        and replayed if challenged?
                |
                v
          Dali evaluates
                |
   +-----------+------------+-------------+
   |            |            |            |          
   v            v            v            v           
Attribution  Provenance  Replayability  Defensibility

Dali produces a versioned CitationIntegrityResult for every evaluated citation, including reproducible scoring metadata and evidence hashes so benchmark runs can be replayed consistently over time.

Evaluation tiers

Tier	Corpus	Purpose
Tier 1	Court-documented citation failures (e.g. Mata v. Avianca)	Deterministic, policy-versioned ground truth
Tier 2	Synthetic probe corpus across US, UK / Commonwealth, Brazil, adversarial traps, and cross-jurisdictional policy/academic	Live model evaluation

Tier 1 is the benchmark standard. Tier 2 extends evaluation to model-facing prompt behavior.

Latest results (v0.2 · 2026-05-26)

450 prompt evaluations across 3 OpenAI models produced 524 citations in aggregate, evaluated under a deterministic, policy-versioned verification pipeline.

Tier 1 corpus (canonical standard): 3 scoring-eligible cases (Mata v. Avianca, US v. Cohen, Park v. Kim). Expanding this corpus is the highest-priority contribution track, see CONTRIBUTING.md. The 524-citation figures above are Tier 2 synthetic probe results.

The model that cited most willingly also fabricated most often

                       0%        25%        50%        75%       100%
                       ├──────────┼──────────┼──────────┼──────────┤
  GPT-4o-mini   49%    ████████████░░░░░░░░░░░░░  → 94 cites, 16% return HTTP 404
  GPT-4.1       94%    ████████████████████████░  → 374 cites, 23% return HTTP 404
  GPT-4o        26%    ██████░░░░░░░░░░░░░░░░░░░  → 56 cites, 20% return HTTP 404

GPT-4.1 was the most engaged model and the most fabrication-prone: of its 374 citations, 86 point to URLs that do not exist. On adversarial citation-trap prompts specifically, GPT-4.1 took the bait 76% of the time, fabricating 48% of those URLs.

Why we test across jurisdictions

US-only legal benchmarks underweight risk in places where AI legal tooling is being deployed but training-data coverage is thinner. Aggregated across all 524 generated citations, grouped by jurisdictional evaluation track:

Jurisdiction track	Verified (HTTP 200)	Confirmed fabricated (HTTP 404)
UK / Commonwealth (UKSC, BAILII)	76%	5%
Cross-jurisdictional research / policy	57%	27%
US legal (cases, statutes, contracts)	33%	17%
Adversarial citation traps	29%	47%
Brazil (Portuguese, civil law)	3%	9%

UK common-law citation structure transfers cleanly from training data. Brazilian Portuguese civil-law showed the weakest transferability across all evaluated tracks, with only 3% resolving successfully under deterministic verification. US-only legal benchmarks underweight risk where legal citation structure, language distribution, and public-authority coverage differ materially from dominant English-language training corpora. A cross-jurisdictional benchmark is how you find these gaps before the AI is in front of a court.

→ Bar charts, per-model leaderboard, full per-jurisdiction breakdown, methodology, and reproducible run instructions: results/v0.2/

Quick start

git clone https://github.com/yenk/Dali
cd Dali
python -m venv .venv

Activate the environment:

# Bash / Zsh
source .venv/bin/activate

# Fish
source .venv/bin/activate.fish

pip install -r requirements.txt
python runners/run_integrity.py \
  --corpus benchmarks/tier1/corpus/citation_failure_cases.json \
  --output results/demo/integrity.json

This runs the deterministic Tier 1 evaluator locally. No API keys or hosted services required.

Expected output:

INFO run_integrity: loading corpus: benchmarks/tier1/corpus/citation_failure_cases.json
INFO run_integrity: corpus: 4 total, 3 scoring-eligible, 0 pre-canonical, 1 needs-verification
INFO run_integrity: evaluating 3 record(s)
INFO run_integrity:   evaluating: mata-v-avianca-2023
INFO run_integrity:   evaluating: us-v-cohen-2023
INFO run_integrity:   evaluating: mata-derivative-reporter-swap-001
INFO run_integrity: wrote 3 result(s) to results/demo/integrity.json

--- Integrity Run Summary ---

  case_id:        mata-v-avianca-2023
  authority:      Mata v. Avianca, Inc.
  citation:       Varghese v. China Southern Airlines Co., 925 F.3d 1339 (11th Cir. 2019)
  source_url:     https://www.courtlistener.com/docket/63107798/mata-v-avianca-inc/
  verification:   FAILED
  recoverability: infeasible
  risk:           critical

  case_id:        us-v-cohen-2023
  authority:      United States v. Cohen (post-conviction motion citation incident)
  citation:       Three nonexistent federal decisions cited in a supervised-release termination mo...
  source_url:     https://www.courtlistener.com/docket/8009608/united-states-v-cohen/
  verification:   FAILED
  recoverability: infeasible
  risk:           critical

Each result is a CitationIntegrityResult artifact with reconstructability, defensibility risk, verification recoverability, and a deterministic evidence hash.

For Tier 2 setup, model registry, and benchmark commands see docs/examples.md.

What this enables

Using the canonical corpus and the shared CitationIntegrityResult contract, you can:

evaluate AI-assisted citation workflows against real court-documented failures
measure provenance continuity and workflow reconstructability
test retrieval and RAG systems for authority integrity regressions
compare citation integrity behavior across models or pipeline versions
replay evaluations under fixed policy versions for reproducibility
produce deterministic benchmark artifacts and evidence hashes

Near-term roadmap

eyecite integration as the canonical legal citation parser
CourtListener-backed canonical citation schema and resolution layer
Evidence JSON v1.0 RFC publication
expanded cross-jurisdiction benchmark corpus (UK/Commonwealth, Brazil)
deterministic replay and reproducibility artifacts
multi-model comparison runs across OpenAI, Claude, Gemini, and open-weight models
expanded benchmark coverage for misattribution, proposition drift, and fabricated authority detection
contributor and academic partnership expansion around legal AI reproducibility research

Longer-range direction: docs/roadmap.md.

Contributing

See CONTRIBUTING.md for the quick start, corpus field reference, and contribution tracks. Open issues are tagged good first issue and help wanted.

For methodology, scoring rubric, and policy versioning see METHODOLOGY.md and docs/policy-versioning.md.

How to cite

See CITATION.cff, or:

@software{dali-2026,
  title        = {Dali: Evidentiary Infrastructure for Legal AI},
  author       = {Kha, Yen},
  year         = {2026},
  version      = {0.2.0},
  organization = {GammaLex AI Inc.},
  url          = {https://github.com/yenk/Dali},
  note         = {Evaluates whether AI-generated legal citations remain reproducible, attributable, and defensible under scrutiny}
}

License

MIT. See LICENSE.

Dali is an open evidentiary infrastructure project for legal AI systems.

Maintained by GammaLex AI Inc. Primary author: Yen Kha.

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.github		.github
benchmarks		benchmarks
corpus		corpus
dali_mcp		dali_mcp
docs		docs
results		results
runners		runners
schemas		schemas
scoring		scoring
specs		specs
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
METHODOLOGY.md		METHODOLOGY.md
README.md		README.md
SECURITY.md		SECURITY.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dali

Core concepts

How it works

Evaluation tiers

Latest results (v0.2 · 2026-05-26)

The model that cited most willingly also fabricated most often

Why we test across jurisdictions

Quick start

What this enables

Near-term roadmap

Contributing

How to cite

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Dali

Core concepts

How it works

Evaluation tiers

Latest results (v0.2 · 2026-05-26)

The model that cited most willingly also fabricated most often

Why we test across jurisdictions

Quick start

What this enables

Near-term roadmap

Contributing

How to cite

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages