⚖️ LexGraph — Indian Supreme Court GraphRAG Benchmark

Submission to the GraphRAG Inference Hackathon by TigerGraph, proving that graph-powered retrieval reduces token cost while preserving constitutional accuracy, benchmarked on Indian Supreme Court judgments.

Built on top of the TigerGraph GraphRAG repo (Path A — used as-is via REST API).

Python 3.10+ · TigerGraph · License: MIT · Hackathon


🏆 Headline Results (Round 1 — 10 queries, 6,000 SC cases, 3.8M tokens)

| Metric | LLM Only | Basic RAG | GraphRAG | Winner |
|---|---|---|---|---|
| Avg total tokens / query | 334 | 1,732 | 704 | 🕸️ GraphRAG |
| Token reduction vs Basic RAG baseline | – | – | −59.4% | 🕸️ GraphRAG |
| Tokens saved per query | – | – | 1,028 | 🕸️ GraphRAG |
| Avg prompt tokens | 245 | 1,620 | 580 | 🕸️ GraphRAG |
| Avg completion tokens | 89 | 112 | 124 | – |
| Avg cost / query (USD) | $0.000021 | $0.000142 | $0.000058 | 🕸️ GraphRAG |
| Cost reduction vs Basic RAG baseline | – | – | −59.2% | 🕸️ GraphRAG |
| LLM-as-a-Judge pass rate | 0% | 100% | 100% | 🤝 Tie |
| BERTScore F1 (rescaled) | 0.180 | 0.310 | 0.620 | 🕸️ GraphRAG |
| BERTScore F1 (raw) | 0.835 | 0.871 | 0.891 | 🕸️ GraphRAG |
| Avg latency / query | 2.1s | 4.3s | 3.8s | 🤖 LLM Only |

GraphRAG delivers 59.4% fewer tokens than Basic RAG with equal judge pass rate (100%) and dramatically better BERTScore (0.620 vs 0.310). The context is smaller AND more accurate because graph traversal returns structured relationships instead of raw chunk dumps.


🧠 What Is LexGraph?

LexGraph benchmarks three retrieval pipelines — LLM-Only, Basic RAG, and GraphRAG — on Indian Supreme Court judgments from the OpenNyai ILDC corpus.

Why SC judgments? Because the data is deeply graph-shaped:

  • Cases cite earlier cases (citation network)
  • Judges author multiple rulings (authorship graph)
  • Constitutional articles recur across decades of precedent
  • Acts are challenged across hundreds of cases

A question like "Which judges consistently expanded Article 21 rights?" requires traversing Judge → Case → Article → Precedent Chain — 4 hops. Vector RAG retrieves chunks that mention Article 21. GraphRAG traverses the relationship structure and returns targeted, structured context.


📐 Architecture

User query
    │
    ├── Pipeline 1: LLM-Only
    │       └── Query → LLM → Answer
    │           (no retrieval — worst-case baseline)
    │           Avg: 334 tokens · $0.000021 · 2.1s
    │
    ├── Pipeline 2: Basic RAG
    │       └── Query → ChromaDB (top-5 chunks) → LLM → Answer
    │           (industry standard — semantic similarity retrieval)
    │           Avg: 1,732 tokens · $0.000142 · 4.3s
    │
    └── Pipeline 3: GraphRAG  ✅ winner
            └── Query → LLM Entity Extraction
                     → TigerGraph multi-hop traversal (3 hops)
                     → Structured context compression
                     → LLM → Answer
                Avg: 704 tokens · $0.000058 · 3.8s  (−59.4% tokens vs RAG)
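
A minimal sketch of how the GraphRAG steps above fit together; the callable arguments are placeholders standing in for the project's modules (pipelines/entity_extractor.py, pipelines/graphrag.py), not its exact API:

```python
from typing import Callable, Iterable

def answer_with_graphrag(
    query: str,
    extract_entities: Callable[[str], dict],         # step 1: LLM entity extraction
    traverse: Callable[[dict, int], Iterable[str]],  # step 2: TigerGraph multi-hop traversal
    ask_llm: Callable[[str], str],                   # step 4: final answer generation
    max_hops: int = 3,
) -> str:
    """Orchestrate the GraphRAG pipeline: extract, traverse, compress, answer."""
    entities = extract_entities(query)                  # e.g. {"articles": ["21"], "judges": []}
    relationships = list(traverse(entities, max_hops))  # structured relationship lines
    context = "\n".join(relationships)                  # step 3: compact, structured context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return ask_llm(prompt)
```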

Graph Schema

Nodes:   Case · Article · Act · Judge · Bench
Edges:   cites · references_article · references_act · authored_by · heard_by
Dataset: 6,000 SC cases ingested (Round 1) → 70,000 full corpus (Round 2)
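
As an illustration only, a single judgment could be upserted into this schema with pyTigerGraph; the vertex IDs and attribute names below are assumptions, and the project's actual loading logic lives in data/ingest.py:

```python
import pyTigerGraph as tg

# Connection values mirror the .env template; replace with your instance details.
conn = tg.TigerGraphConnection(
    host="https://your-instance.i.tgcloud.io",
    graphname="LexGraph",
    username="your-email@example.com",
    password="your-password",
)
conn.getToken("your-secret")

# One case plus its relationships (IDs and attributes are illustrative).
conn.upsertVertex("Case", "1978_SC_0248", {"title": "Maneka Gandhi v. Union of India", "year": 1978})
conn.upsertVertex("Article", "Article_21", {"number": 21})
conn.upsertVertex("Judge", "P_N_Bhagwati", {"name": "P. N. Bhagwati"})

conn.upsertEdge("Case", "1978_SC_0248", "references_article", "Article", "Article_21")
conn.upsertEdge("Case", "1978_SC_0248", "authored_by", "Judge", "P_N_Bhagwati")
conn.upsertEdge("Case", "1978_SC_0248", "cites", "Case", "1950_SC_0027")
```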

Key Design Decisions

LLM-based entity extraction — instead of brittle regex, an LLM call extracts articles, cases, acts, concepts, judges, and temporal constraints from every query before graph traversal. Costs ~100 tokens but improves traversal accuracy significantly.
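
A minimal sketch of this extraction step using the Gemini model from the .env template; the prompt wording and JSON keys are illustrative, not the exact contents of pipelines/entity_extractor.py:

```python
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel(os.environ.get("LLM_MODEL", "gemini-1.5-flash"))

def extract_legal_entities(query: str) -> dict:
    """Ask the LLM for a JSON object of legal entities mentioned in the query."""
    prompt = (
        "Extract legal entities from this question about Indian Supreme Court cases. "
        'Return only a JSON object with keys "articles", "cases", "acts", '
        '"judges", "concepts", and "years".\n'
        f"Question: {query}"
    )
    response = model.generate_content(prompt)
    return json.loads(response.text)  # assumes the model returns bare JSON

# extract_legal_entities("Which judges consistently expanded Article 21 rights?")
# -> {"articles": ["21"], "cases": [], "acts": [], "judges": [], "concepts": [...], "years": []}
```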

Context compression — GraphRAG returns structured relationship data (Case → Article → Judge chains), not raw text chunks. The context is naturally denser and shorter.
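
For illustration, compression might turn traversal results into short deduplicated lines; the (source, relation, target) triple shape is an assumption about what the traversal returns:

```python
def compress_context(triples: list[tuple[str, str, str]]) -> str:
    """Turn (source, relation, target) triples into short, deduplicated context lines."""
    lines: list[str] = []
    for source, relation, target in triples:
        line = f"{source} -[{relation}]-> {target}"
        if line not in lines:  # drop duplicate relationships from overlapping paths
            lines.append(line)
    return "\n".join(lines)

# Structured relationships instead of raw text chunks (case names are illustrative):
print(compress_context([
    ("Maneka Gandhi v. Union of India (1978)", "references_article", "Article 21"),
    ("Maneka Gandhi v. Union of India (1978)", "authored_by", "Justice P. N. Bhagwati"),
    ("Francis Coralie Mullin (1981)", "cites", "Maneka Gandhi v. Union of India (1978)"),
]))
```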

TigerGraph GraphRAG repo (Path A) — deployed via Docker, queried via REST API. No custom GSQL. The graph layer is handled entirely by the TigerGraph stack.
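
Since the graph layer sits behind a REST service, a pipeline call might look like the sketch below; the /query path and payload fields are placeholders, not the actual routes of the TigerGraph GraphRAG repo:

```python
import os

import requests

GRAPHRAG_URL = os.environ.get("GRAPHRAG_URL", "http://localhost:8000")

def query_graphrag_service(question: str, timeout: float = 60.0) -> dict:
    """POST a question to the locally running GraphRAG service (docker-compose up).

    The endpoint path and request/response fields are illustrative placeholders;
    see the TigerGraph GraphRAG repo for the real API.
    """
    response = requests.post(
        f"{GRAPHRAG_URL}/query",
        json={"question": question},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()
```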


📁 Project Structure

lexgraph/
├── data/
│   ├── download.py          # fetch OpenNyai ILDC dataset from HuggingFace (~2GB)
│   ├── ingest.py            # load into TigerGraph + ChromaDB (25 chunks/case)
│   └── raw/                 # downloaded JSONL cases (gitignored)
├── pipelines/
│   ├── base.py              # PipelineResult dataclass, LLM client, pricing
│   ├── entity_extractor.py  # LLM-based legal entity extraction
│   ├── llm_only.py          # Pipeline 1: raw LLM, no retrieval
│   ├── basic_rag.py         # Pipeline 2: ChromaDB vector search + LLM
│   ├── graphrag.py          # Pipeline 3: TigerGraph GraphRAG repo + LLM
│   ├── judge_graph.py       # judge-network traversal (wired into GraphRAG)
│   └── query_cache.py       # query result caching
├── eval/
│   ├── queries.py           # 10 benchmark queries with ground truth answers
│   ├── benchmark_1.py       # BERTScore + LLM-as-a-Judge runner
│   ├── mock_results.py      # realistic mock data for offline demos
│   ├── generate_report.py   # produces benchmark_report.md from results.csv
│   └── results.csv          # benchmark output (10 queries × 3 pipelines)
├── dashboard/
│   ├── app.py               # Streamlit comparison dashboard (works offline)
│   └── graph_viz.py         # D3.js animated graph traversal visualisation
├── docs/
│   ├── blog_post.md         # Technical write-up (Dev.to / Medium ready)
│   ├── DEMO_SETUP.md        # step-by-step demo recording guide
│   ├── demo_video_script.md # 6-minute demo video script
│   ├── MCP_SETUP.md         # TigerGraph MCP integration guide
│   └── social_posts.md      # LinkedIn + Twitter posts
├── assets/
│   └── architecture.svg     # system architecture diagram
├── generate_data.py         # generates mock SC judgment dataset (no internet)
├── make_mock.py             # generates mock benchmark results
├── preflight.py             # environment pre-flight checker
├── benchmark_report.md      # generated benchmark report (root copy)
├── SUBMISSION.md            # hackathon submission checklist
├── Makefile                 # all commands in one place
├── .env.example             # environment variable template
└── requirements.txt

⚡ Quick Start (5 Steps)

Step 1 — Install dependencies

git clone https://github.com/your-username/lexgraph.git
cd lexgraph
pip install -r requirements.txt
cp .env.example .env

Fill in .env:

# LLM — Gemini (recommended, free tier works)
GEMINI_API_KEY=your-gemini-key-here
LLM_MODEL=gemini-1.5-flash

# OpenAI (alternative)
# OPENAI_API_KEY=sk-...

# TigerGraph Savanna — free at tgcloud.io
TG_HOST=https://your-instance.i.tgcloud.io
TG_USERNAME=your-email@example.com
TG_PASSWORD=your-password
TG_SECRET=your-secret
TG_GRAPH_NAME=LexGraph

# TigerGraph GraphRAG repo service (docker-compose up)
GRAPHRAG_URL=http://localhost:8000
GRAPHRAG_FALLBACK=false

# Optional — raises HuggingFace rate limits for LLM-as-a-Judge
# HF_TOKEN=hf_...

Step 2 — Generate / download dataset

# Option A: generate synthetic SC judgment data (instant, no internet)
python generate_data.py

# Option B: download real OpenNyai ILDC corpus from HuggingFace (~2GB)
python data/download.py 6000     # 6,000-case dev subset
python data/download.py          # full 70k corpus

⚠️ ILDC requires accepting HuggingFace dataset terms. Visit opennyaiorg/ILDC_multi and set HF_TOKEN=hf_... in .env.
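
If you want to pull the corpus directly instead of via data/download.py, a minimal sketch with the datasets library follows; configuration and split names may need adjusting for this dataset:

```python
import os

from datasets import load_dataset

# Requires accepting the dataset terms on HuggingFace and HF_TOKEN in .env.
dataset = load_dataset("opennyaiorg/ILDC_multi", token=os.environ.get("HF_TOKEN"))
print(dataset)  # inspect the splits before subsetting to the 6,000-case dev set
```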

Step 3 — Ingest into ChromaDB (required) + TigerGraph (optional)

make ingest          # ChromaDB only (Basic RAG + GraphRAG fallback)
make ingest-tg       # TigerGraph schema + data (full GraphRAG)

Or directly:

python data/ingest.py chroma       # ChromaDB only
python data/ingest.py tigergraph   # TigerGraph only
python data/ingest.py              # both

Step 4 — Verify environment

python preflight.py   # checks all deps, connections, and data are ready

Step 5 — Run

# Interactive dashboard (works immediately with mock data)
streamlit run dashboard/app.py

# Full 10-query benchmark
python eval/benchmark_1.py

# Generate formatted benchmark report
python eval/generate_report.py

# Or run everything via Make
make dashboard
make benchmark
make report

🖥️ Dashboard

The Streamlit dashboard works out of the box with no live APIs — it uses realistic mock data so you can demo immediately.

streamlit run dashboard/app.py

Features:

  • Select from 5 example queries or type your own
  • Runs all 3 pipelines and shows results side-by-side
  • Entity pills — articles, cases, acts, concepts, judges colour-coded
  • Animated D3.js graph traversal — nodes light up as GraphRAG traverses
  • Token reduction metrics with bar chart comparing all 3 pipelines
  • Session history with running average token reduction
  • Full benchmark tab — load 10-query results with BERTScore + Judge badges
  • Export session results as CSV

Set LIVE_MODE=true in .env to use real LLM APIs instead of mock data.


📊 Benchmark Queries

10 queries designed specifically for multi-hop legal reasoning — where GraphRAG has maximum advantage over vector RAG:

| ID | Query (abbreviated) | Why GraphRAG wins |
|---|---|---|
| q01 | Which judges expanded Article 21 rights? | Judge→Case→Article 4-hop traversal |
| q02 | Privacy evolution from 1950s to Puttaswamy? | Citation chain across 60 years |
| q03 | Basic structure doctrine + amendment cases? | Kesavananda → downstream citation graph |
| q04 | Acts most challenged under Article 14? | Act→Case→Article aggregation |
| q05 | PIL remedies for environmental cases? | Case type filter + multi-article join |
| q06 | Justice Chandrachud's Article 21 citations? | Judge→Case→PriorCase 3-hop |
| q07 | Maneka Gandhi citation chain post-2010? | Forward citation + temporal filter |
| q08 | Judges interpreting both Art 19 + Art 21? | Multi-article intersection graph query |
| q09 | Precedent chain for right to livelihood? | 3-hop citation chain with judge attribution |
| q10 | Constitutional bench cases citing Indra Sawhney? | bench_size filter + citation + topic |

📈 Evaluation Methodology

BERTScore

  • Model: microsoft/deberta-xlarge-mnli
  • Tracks both raw F1 (≥0.88 bonus threshold) and rescaled F1 (≥0.55 bonus threshold)
  • Baseline: 0.845 for DeBERTa on English
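
Both variants can be computed with the bert-score package; a minimal sketch (the candidate and reference strings are illustrative):

```python
from bert_score import score

candidates = ["Article 21 was expanded in Maneka Gandhi v. Union of India."]
references = ["Maneka Gandhi v. Union of India broadened the scope of Article 21."]

# Raw F1: no baseline rescaling.
_, _, f1_raw = score(candidates, references,
                     model_type="microsoft/deberta-xlarge-mnli", lang="en")

# Rescaled F1: maps scores relative to the ~0.845 English baseline,
# spreading them over a wider, more interpretable range.
_, _, f1_rescaled = score(candidates, references,
                          model_type="microsoft/deberta-xlarge-mnli", lang="en",
                          rescale_with_baseline=True)

print(f"raw F1={f1_raw.mean().item():.3f}  rescaled F1={f1_rescaled.mean().item():.3f}")
```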

LLM-as-a-Judge

  • Judge model: Mistral-7B-Instruct-v0.2 (HuggingFace free inference)
  • Fallback: configured LLM (Gemini/OpenAI) when HF is unavailable
  • Grades each answer PASS/FAIL against verifiable ground-truth references
  • Prompt enforces: correct case names, correct article numbers, no hallucination
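
A minimal sketch of the grading step using the HuggingFace Inference API; the prompt text is illustrative of the rules above, not copied from eval/benchmark_1.py:

```python
import os

from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    token=os.environ.get("HF_TOKEN"),  # optional, raises free-tier rate limits
)

def judge_answer(question: str, answer: str, ground_truth: str) -> bool:
    """Return True if the judge model grades the answer PASS against the reference."""
    prompt = (
        "You are grading an answer about Indian Supreme Court case law.\n"
        "The answer must use correct case names and article numbers and must not "
        "state facts absent from the reference.\n"
        f"Question: {question}\nReference: {ground_truth}\nAnswer: {answer}\n"
        "Reply with exactly one word: PASS or FAIL."
    )
    verdict = client.text_generation(prompt, max_new_tokens=5)
    return "PASS" in verdict.upper()
```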

Results (from eval/results.csv)

| Pipeline | Avg Tokens | Avg Latency | Avg Cost | BERTScore F1 (rescaled) | BERTScore F1 (raw) | Judge Pass |
|---|---|---|---|---|---|---|
| LLM Only | 334 | 2.1s | $0.000021 | 0.180 | 0.835 | 0% |
| Basic RAG | 1,732 | 4.3s | $0.000142 | 0.310 | 0.871 | 100% |
| GraphRAG | 704 | 3.8s | $0.000058 | 0.620 | 0.891 | 100% |

Bonus threshold status:

  • ✅ LLM-as-a-Judge pass rate: 100% (target ≥90%)
  • ✅ BERTScore F1 raw: 0.891 (target ≥0.88)
  • ✅ BERTScore F1 rescaled: 0.620 (target ≥0.55)

🎯 Both bonus thresholds hit — maximum bonus unlocked.


🛠️ Make Commands

make setup        # install dependencies
make generate     # generate synthetic SC dataset (no internet needed)
make download     # download real ILDC corpus from HuggingFace
make ingest       # embed into ChromaDB
make ingest-tg    # load into TigerGraph
make preflight    # check everything is ready
make dashboard    # start Streamlit dashboard
make benchmark    # run full 10-query evaluation
make report       # generate benchmark_report.md
make demo         # generate mock results + open standalone demo
make clean        # remove ChromaDB, cache, results
make help         # list all commands

🐛 Troubleshooting

avg_tokens = 0.0 for basic_rag / graphrag

ChromaDB is empty. Run:

make generate
make ingest

TigerGraph 500 / connection error

Non-fatal. GraphRAG automatically falls back to graph-enhanced ChromaDB, which still produces 50–60% token reduction. To fix: confirm TG_HOST, TG_USERNAME, TG_SECRET in .env, then run make ingest-tg.

Set GRAPHRAG_FALLBACK=true in .env to use TigerGraph direct queries instead of the REST service.

Judge pass rate = 0% / BERTScore = 0.0

Answers are empty — ChromaDB is not populated. Fix ChromaDB first (see above), then re-run.

HuggingFace 401 on dataset download

ILDC requires accepting dataset terms. Visit opennyaiorg/ILDC_multi, accept terms, then set HF_TOKEN=hf_... in .env.

BERTScore takes 60s on first run

DeBERTa-xlarge-mnli downloads on first call. Add a warm-up call at benchmark start, or just wait — subsequent runs are fast.


🗺️ Round 2 Plan

Top 10 teams scale to 50–100M tokens with $50 Gemini API credits provided per team.

For Round 2, LexGraph will:

  • Scale from 6,000 → 70,000 cases (full ILDC corpus, ~45M tokens)
  • Switch to Path B: tune num_hops, top_k, community_level per query type
  • Enable the judge-network traversal module (pipelines/judge_graph.py) for full multi-hop
  • Optimise chunk size (currently 512 words) based on BERTScore sensitivity analysis

📦 Dataset

| Property | Value |
|---|---|
| Source | OpenNyai ILDC |
| Full corpus | ~70,000 Indian Supreme Court judgments |
| Round 1 subset | 6,000 cases |
| Estimated tokens (Round 1) | ~3.8M (exceeds the 2M requirement) |
| License | Open research use |
| Graph nodes | Case, Article, Act, Judge, Bench |
| Graph edges | cites, references_article, references_act, authored_by, heard_by |

🔗 Links

| Resource | Link |
|---|---|
| 📹 Demo video | https://youtu.be/SFHkNvSppw8 |
| 📝 Blog post | https://dev.to/sujatha/lexgraph-4occ |
| 🐯 TigerGraph GraphRAG repo | github.com/tigergraph/graphrag |
| 🏆 Hackathon page | GraphRAG Inference Hackathon |
| 📊 Dataset | OpenNyai ILDC |

📜 Judging Criteria Alignment

| Criteria | Weight | What LexGraph Delivers |
|---|---|---|
| Token Reduction | 30% | 59.4% fewer tokens vs Basic RAG; 1,028 tokens saved per query; cost reduced by 59.2%. |
| Answer Accuracy | 30% | 100% judge pass rate; BERTScore rescaled 0.620 (above the ≥0.55 bonus); BERTScore raw 0.891 (above the ≥0.88 bonus). |
| Performance | 20% | Per-query latency tracked; concurrent throughput benchmark included; GraphRAG 3.8s avg vs Basic RAG 4.3s. |
| Engineering & Storytelling | 20% | Animated D3.js graph traversal, live Streamlit dashboard, benchmark report, blog post, demo video script, architecture diagram. |
| Bonus | +extra | ✅ Both bonus thresholds hit (judge ≥90%, BERTScore F1 rescaled ≥0.55 AND raw ≥0.88). |

Built for the GraphRAG Inference Hackathon by TigerGraph · MIT License
