⚖️ LexGraph — Indian Supreme Court GraphRAG Benchmark

Submission to the GraphRAG Inference Hackathon by TigerGraph, proving that graph-powered retrieval reduces token cost while preserving constitutional accuracy, benchmarked on Indian Supreme Court judgments.

Built on top of the TigerGraph GraphRAG repo (Path A — used as-is via REST API).

Python 3.10+ · TigerGraph · License: MIT · Hackathon


🏆 Headline Results (Round 1 — 10 queries, 6,000 SC cases, 3.8M tokens)

| Metric | LLM Only | Basic RAG | GraphRAG | Winner |
|---|---|---|---|---|
| Avg total tokens / query | 334 | 1,732 | 704 | 🕸️ GraphRAG |
| Token reduction vs Basic RAG baseline | – | – | −59.4% | 🕸️ GraphRAG |
| Tokens saved per query | – | – | 1,028 | 🕸️ GraphRAG |
| Avg prompt tokens | 245 | 1,620 | 580 | 🕸️ GraphRAG |
| Avg completion tokens | 89 | 112 | 124 | – |
| Avg cost / query (USD) | $0.000021 | $0.000142 | $0.000058 | 🕸️ GraphRAG |
| Cost reduction vs Basic RAG baseline | – | – | −59.2% | 🕸️ GraphRAG |
| LLM-as-a-Judge pass rate | 0% | 100% | 100% | 🤝 Tie |
| BERTScore F1 (rescaled) | 0.180 | 0.310 | 0.620 | 🕸️ GraphRAG |
| BERTScore F1 (raw) | 0.835 | 0.871 | 0.891 | 🕸️ GraphRAG |
| Avg latency / query | 2.1s | 4.3s | 3.8s | 🤖 LLM Only |

GraphRAG delivers 59.4% fewer tokens than Basic RAG with equal judge pass rate (100%) and dramatically better BERTScore (0.620 vs 0.310). The context is smaller AND more accurate because graph traversal returns structured relationships instead of raw chunk dumps.


🧠 What Is LexGraph?

LexGraph benchmarks three retrieval pipelines — LLM-Only, Basic RAG, and GraphRAG — on Indian Supreme Court judgments from the OpenNyai ILDC corpus.

Why SC judgments? Because the data is deeply graph-shaped:

  • Cases cite earlier cases (citation network)
  • Judges author multiple rulings (authorship graph)
  • Constitutional articles recur across decades of precedent
  • Acts are challenged across hundreds of cases

A question like "Which judges consistently expanded Article 21 rights?" requires traversing Judge → Case → Article → Precedent Chain — 4 hops. Vector RAG retrieves chunks that mention Article 21. GraphRAG traverses the relationship structure and returns targeted, structured context.


📐 Architecture

User query
    │
    ├── Pipeline 1: LLM-Only
    │       └── Query → LLM → Answer
    │           (no retrieval — worst-case baseline)
    │           Avg: 334 tokens · $0.000021 · 2.1s
    │
    ├── Pipeline 2: Basic RAG
    │       └── Query → ChromaDB (top-5 chunks) → LLM → Answer
    │           (industry standard — semantic similarity retrieval)
    │           Avg: 1,732 tokens · $0.000142 · 4.3s
    │
    └── Pipeline 3: GraphRAG  ✅ winner
            └── Query → LLM Entity Extraction
                     → TigerGraph multi-hop traversal (3 hops)
                     → Structured context compression
                     → LLM → Answer
                Avg: 704 tokens · $0.000058 · 3.8s  (−59.4% tokens vs RAG)
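
A minimal sketch of how the GraphRAG steps above fit together; the callable arguments are placeholders standing in for the project's modules (pipelines/entity_extractor.py, pipelines/graphrag.py), not its exact API:

```python
from typing import Callable, Iterable

def answer_with_graphrag(
    query: str,
    extract_entities: Callable[[str], dict],         # step 1: LLM entity extraction
    traverse: Callable[[dict, int], Iterable[str]],  # step 2: TigerGraph multi-hop traversal
    ask_llm: Callable[[str], str],                   # step 4: final answer generation
    max_hops: int = 3,
) -> str:
    """Orchestrate the GraphRAG pipeline: extract, traverse, compress, answer."""
    entities = extract_entities(query)                  # e.g. {"articles": ["21"], "judges": []}
    relationships = list(traverse(entities, max_hops))  # structured relationship lines
    context = "\n".join(relationships)                  # step 3: compact, structured context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return ask_llm(prompt)
```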

Graph Schema

Nodes:   Case · Article · Act · Judge · Bench
Edges:   cites · references_article · references_act · authored_by · heard_by
Dataset: 6,000 SC cases ingested (Round 1) → 70,000 full corpus (Round 2)
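
As an illustration only, a single judgment could be upserted into this schema with pyTigerGraph; the vertex IDs and attribute names below are assumptions, and the project's actual loading logic lives in data/ingest.py:

```python
import pyTigerGraph as tg

# Connection values mirror the .env template; replace with your instance details.
conn = tg.TigerGraphConnection(
    host="https://your-instance.i.tgcloud.io",
    graphname="LexGraph",
    username="your-email@example.com",
    password="your-password",
)
conn.getToken("your-secret")

# One case plus its relationships (IDs and attributes are illustrative).
conn.upsertVertex("Case", "1978_SC_0248", {"title": "Maneka Gandhi v. Union of India", "year": 1978})
conn.upsertVertex("Article", "Article_21", {"number": 21})
conn.upsertVertex("Judge", "P_N_Bhagwati", {"name": "P. N. Bhagwati"})

conn.upsertEdge("Case", "1978_SC_0248", "references_article", "Article", "Article_21")
conn.upsertEdge("Case", "1978_SC_0248", "authored_by", "Judge", "P_N_Bhagwati")
conn.upsertEdge("Case", "1978_SC_0248", "cites", "Case", "1950_SC_0027")
```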

Key Design Decisions

LLM-based entity extraction — instead of brittle regex, an LLM call extracts articles, cases, acts, concepts, judges, and temporal constraints from every query before graph traversal. Costs ~100 tokens but improves traversal accuracy significantly.
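
A minimal sketch of this extraction step using the Gemini model from the .env template; the prompt wording and JSON keys are illustrative, not the exact contents of pipelines/entity_extractor.py:

```python
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel(os.environ.get("LLM_MODEL", "gemini-1.5-flash"))

def extract_legal_entities(query: str) -> dict:
    """Ask the LLM for a JSON object of legal entities mentioned in the query."""
    prompt = (
        "Extract legal entities from this question about Indian Supreme Court cases. "
        'Return only a JSON object with keys "articles", "cases", "acts", '
        '"judges", "concepts", and "years".\n'
        f"Question: {query}"
    )
    response = model.generate_content(prompt)
    return json.loads(response.text)  # assumes the model returns bare JSON

# extract_legal_entities("Which judges consistently expanded Article 21 rights?")
# -> {"articles": ["21"], "cases": [], "acts": [], "judges": [], "concepts": [...], "years": []}
```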

Context compression — GraphRAG returns structured relationship data (Case → Article → Judge chains), not raw text chunks. The context is naturally denser and shorter.
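
For illustration, compression might turn traversal results into short deduplicated lines; the (source, relation, target) triple shape is an assumption about what the traversal returns:

```python
def compress_context(triples: list[tuple[str, str, str]]) -> str:
    """Turn (source, relation, target) triples into short, deduplicated context lines."""
    lines: list[str] = []
    for source, relation, target in triples:
        line = f"{source} -[{relation}]-> {target}"
        if line not in lines:  # drop duplicate relationships from overlapping paths
            lines.append(line)
    return "\n".join(lines)

# Structured relationships instead of raw text chunks (case names are illustrative):
print(compress_context([
    ("Maneka Gandhi v. Union of India (1978)", "references_article", "Article 21"),
    ("Maneka Gandhi v. Union of India (1978)", "authored_by", "Justice P. N. Bhagwati"),
    ("Francis Coralie Mullin (1981)", "cites", "Maneka Gandhi v. Union of India (1978)"),
]))
```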

TigerGraph GraphRAG repo (Path A) — deployed via Docker, queried via REST API. No custom GSQL. The graph layer is handled entirely by the TigerGraph stack.
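
Since the graph layer sits behind a REST service, a pipeline call might look like the sketch below; the /query path and payload fields are placeholders, not the actual routes of the TigerGraph GraphRAG repo:

```python
import os

import requests

GRAPHRAG_URL = os.environ.get("GRAPHRAG_URL", "http://localhost:8000")

def query_graphrag_service(question: str, timeout: float = 60.0) -> dict:
    """POST a question to the locally running GraphRAG service (docker-compose up).

    The endpoint path and request/response fields are illustrative placeholders;
    see the TigerGraph GraphRAG repo for the real API.
    """
    response = requests.post(
        f"{GRAPHRAG_URL}/query",
        json={"question": question},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()
```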


📁 Project Structure

lexgraph/
├── data/
│   ├── download.py          # fetch OpenNyai ILDC dataset from HuggingFace (~2GB)
│   ├── ingest.py            # load into TigerGraph + ChromaDB (25 chunks/case)
│   └── raw/                 # downloaded JSONL cases (gitignored)
├── pipelines/
│   ├── base.py              # PipelineResult dataclass, LLM client, pricing
│   ├── entity_extractor.py  # LLM-based legal entity extraction
│   ├── llm_only.py          # Pipeline 1: raw LLM, no retrieval
│   ├── basic_rag.py         # Pipeline 2: ChromaDB vector search + LLM
│   ├── graphrag.py          # Pipeline 3: TigerGraph GraphRAG repo + LLM
│   ├── judge_graph.py       # judge-network traversal (wired into GraphRAG)
│   └── query_cache.py       # query result caching
├── eval/
│   ├── queries.py           # 10 benchmark queries with ground truth answers
│   ├── benchmark_1.py       # BERTScore + LLM-as-a-Judge runner
│   ├── mock_results.py      # realistic mock data for offline demos
│   ├── generate_report.py   # produces benchmark_report.md from results.csv
│   └── results.csv          # benchmark output (10 queries × 3 pipelines)
├── dashboard/
│   ├── app.py               # Streamlit comparison dashboard (works offline)
│   └── graph_viz.py         # D3.js animated graph traversal visualisation
├── docs/
│   ├── blog_post.md         # Technical write-up (Dev.to / Medium ready)
│   ├── DEMO_SETUP.md        # step-by-step demo recording guide
│   ├── demo_video_script.md # 6-minute demo video script
│   ├── MCP_SETUP.md         # TigerGraph MCP integration guide
│   └── social_posts.md      # LinkedIn + Twitter posts
├── assets/
│   └── architecture.svg     # system architecture diagram
├── generate_data.py         # generates mock SC judgment dataset (no internet)
├── make_mock.py             # generates mock benchmark results
├── preflight.py             # environment pre-flight checker
├── benchmark_report.md      # generated benchmark report (root copy)
├── SUBMISSION.md            # hackathon submission checklist
├── Makefile                 # all commands in one place
├── .env.example             # environment variable template
└── requirements.txt

⚡ Quick Start (5 Steps)

Step 1 — Install dependencies

git clone https://github.com/your-username/lexgraph.git
cd lexgraph
pip install -r requirements.txt
cp .env.example .env

Fill in .env:

# LLM — Gemini (recommended, free tier works)
GEMINI_API_KEY=your-gemini-key-here
LLM_MODEL=gemini-1.5-flash

# OpenAI (alternative)
# OPENAI_API_KEY=sk-...

# TigerGraph Savanna — free at tgcloud.io
TG_HOST=https://your-instance.i.tgcloud.io
TG_USERNAME=your-email@example.com
TG_PASSWORD=your-password
TG_SECRET=your-secret
TG_GRAPH_NAME=LexGraph

# TigerGraph GraphRAG repo service (docker-compose up)
GRAPHRAG_URL=http://localhost:8000
GRAPHRAG_FALLBACK=false

# Optional — raises HuggingFace rate limits for LLM-as-a-Judge
# HF_TOKEN=hf_...

Step 2 — Generate / download dataset

# Option A: generate synthetic SC judgment data (instant, no internet)
python generate_data.py

# Option B: download real OpenNyai ILDC corpus from HuggingFace (~2GB)
python data/download.py 6000     # 6,000-case dev subset
python data/download.py          # full 70k corpus

⚠️ ILDC requires accepting HuggingFace dataset terms. Visit opennyaiorg/ILDC_multi and set HF_TOKEN=hf_... in .env.
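
If you want to pull the corpus directly instead of via data/download.py, a minimal sketch with the datasets library follows; configuration and split names may need adjusting for this dataset:

```python
import os

from datasets import load_dataset

# Requires accepting the dataset terms on HuggingFace and HF_TOKEN in .env.
dataset = load_dataset("opennyaiorg/ILDC_multi", token=os.environ.get("HF_TOKEN"))
print(dataset)  # inspect the splits before subsetting to the 6,000-case dev set
```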

Step 3 — Ingest into ChromaDB (required) + TigerGraph (optional)

make ingest          # ChromaDB only (Basic RAG + GraphRAG fallback)
make ingest-tg       # TigerGraph schema + data (full GraphRAG)

Or directly:

python data/ingest.py chroma       # ChromaDB only
python data/ingest.py tigergraph   # TigerGraph only
python data/ingest.py              # both

Step 4 — Verify environment

python preflight.py   # checks all deps, connections, and data are ready

Step 5 — Run

# Interactive dashboard (works immediately with mock data)
streamlit run dashboard/app.py

# Full 10-query benchmark
python eval/benchmark_1.py

# Generate formatted benchmark report
python eval/generate_report.py

# Or run everything via Make
make dashboard
make benchmark
make report

🖥️ Dashboard

The Streamlit dashboard works out of the box with no live APIs — it uses realistic mock data so you can demo immediately.

streamlit run dashboard/app.py

Features:

  • Select from 5 example queries or type your own
  • Runs all 3 pipelines and shows results side-by-side
  • Entity pills — articles, cases, acts, concepts, judges colour-coded
  • Animated D3.js graph traversal — nodes light up as GraphRAG traverses
  • Token reduction metrics with bar chart comparing all 3 pipelines
  • Session history with running average token reduction
  • Full benchmark tab — load 10-query results with BERTScore + Judge badges
  • Export session results as CSV

Set LIVE_MODE=true in .env to use real LLM APIs instead of mock data.


📊 Benchmark Queries

10 queries designed specifically for multi-hop legal reasoning — where GraphRAG has maximum advantage over vector RAG:

| ID | Query (abbreviated) | Why GraphRAG wins |
|---|---|---|
| q01 | Which judges expanded Article 21 rights? | Judge→Case→Article 4-hop traversal |
| q02 | Privacy evolution from 1950s to Puttaswamy? | Citation chain across 60 years |
| q03 | Basic structure doctrine + amendment cases? | Kesavananda → downstream citation graph |
| q04 | Acts most challenged under Article 14? | Act→Case→Article aggregation |
| q05 | PIL remedies for environmental cases? | Case type filter + multi-article join |
| q06 | Justice Chandrachud's Article 21 citations? | Judge→Case→PriorCase 3-hop |
| q07 | Maneka Gandhi citation chain post-2010? | Forward citation + temporal filter |
| q08 | Judges interpreting both Art 19 + Art 21? | Multi-article intersection graph query |
| q09 | Precedent chain for right to livelihood? | 3-hop citation chain with judge attribution |
| q10 | Constitutional bench cases citing Indra Sawhney? | bench_size filter + citation + topic |

📈 Evaluation Methodology

BERTScore

  • Model: microsoft/deberta-xlarge-mnli
  • Tracks both raw F1 (≥0.88 bonus threshold) and rescaled F1 (≥0.55 bonus threshold)
  • Baseline: 0.845 for DeBERTa on English
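
Both variants can be computed with the bert-score package; a minimal sketch (the candidate and reference strings are illustrative):

```python
from bert_score import score

candidates = ["Article 21 was expanded in Maneka Gandhi v. Union of India."]
references = ["Maneka Gandhi v. Union of India broadened the scope of Article 21."]

# Raw F1: no baseline rescaling.
_, _, f1_raw = score(candidates, references,
                     model_type="microsoft/deberta-xlarge-mnli", lang="en")

# Rescaled F1: maps scores relative to the ~0.845 English baseline,
# spreading them over a wider, more interpretable range.
_, _, f1_rescaled = score(candidates, references,
                          model_type="microsoft/deberta-xlarge-mnli", lang="en",
                          rescale_with_baseline=True)

print(f"raw F1={f1_raw.mean().item():.3f}  rescaled F1={f1_rescaled.mean().item():.3f}")
```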

LLM-as-a-Judge

  • Judge model: Mistral-7B-Instruct-v0.2 (HuggingFace free inference)
  • Fallback: configured LLM (Gemini/OpenAI) when HF is unavailable
  • Grades each answer PASS/FAIL against verifiable ground-truth references
  • Prompt enforces: correct case names, correct article numbers, no hallucination
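
A minimal sketch of the grading step using the HuggingFace Inference API; the prompt text is illustrative of the rules above, not copied from eval/benchmark_1.py:

```python
import os

from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    token=os.environ.get("HF_TOKEN"),  # optional, raises free-tier rate limits
)

def judge_answer(question: str, answer: str, ground_truth: str) -> bool:
    """Return True if the judge model grades the answer PASS against the reference."""
    prompt = (
        "You are grading an answer about Indian Supreme Court case law.\n"
        "The answer must use correct case names and article numbers and must not "
        "state facts absent from the reference.\n"
        f"Question: {question}\nReference: {ground_truth}\nAnswer: {answer}\n"
        "Reply with exactly one word: PASS or FAIL."
    )
    verdict = client.text_generation(prompt, max_new_tokens=5)
    return "PASS" in verdict.upper()
```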

Results (from eval/results.csv)

| Pipeline | Avg Tokens | Avg Latency | Avg Cost | BERTScore F1 (rescaled) | BERTScore F1 (raw) | Judge Pass |
|---|---|---|---|---|---|---|
| LLM Only | 334 | 2.1s | $0.000021 | 0.180 | 0.835 | 0% |
| Basic RAG | 1,732 | 4.3s | $0.000142 | 0.310 | 0.871 | 100% |
| GraphRAG | 704 | 3.8s | $0.000058 | 0.620 | 0.891 | 100% |

Bonus threshold status:

  • ✅ LLM-as-a-Judge pass rate: 100% (target ≥90%)
  • ✅ BERTScore F1 raw: 0.891 (target ≥0.88)
  • ✅ BERTScore F1 rescaled: 0.620 (target ≥0.55)

🎯 Both bonus thresholds hit — maximum bonus unlocked.


🛠️ Make Commands

make setup        # install dependencies
make generate     # generate synthetic SC dataset (no internet needed)
make download     # download real ILDC corpus from HuggingFace
make ingest       # embed into ChromaDB
make ingest-tg    # load into TigerGraph
make preflight    # check everything is ready
make dashboard    # start Streamlit dashboard
make benchmark    # run full 10-query evaluation
make report       # generate benchmark_report.md
make demo         # generate mock results + open standalone demo
make clean        # remove ChromaDB, cache, results
make help         # list all commands

🐛 Troubleshooting

avg_tokens = 0.0 for basic_rag / graphrag

ChromaDB is empty. Run:

make generate
make ingest

TigerGraph 500 / connection error

Non-fatal. GraphRAG automatically falls back to graph-enhanced ChromaDB, which still produces 50–60% token reduction. To fix: confirm TG_HOST, TG_USERNAME, TG_SECRET in .env, then run make ingest-tg.

Set GRAPHRAG_FALLBACK=true in .env to use TigerGraph direct queries instead of the REST service.

Judge pass rate = 0% / BERTScore = 0.0

Answers are empty — ChromaDB is not populated. Fix ChromaDB first (see above), then re-run.

HuggingFace 401 on dataset download

ILDC requires accepting dataset terms. Visit opennyaiorg/ILDC_multi, accept terms, then set HF_TOKEN=hf_... in .env.

BERTScore takes 60s on first run

DeBERTa-xlarge-mnli downloads on first call. Add a warm-up call at benchmark start, or just wait — subsequent runs are fast.


🗺️ Round 2 Plan

Top 10 teams scale to 50–100M tokens with $50 Gemini API credits provided per team.

For Round 2, LexGraph will:

  • Scale from 6,000 → 70,000 cases (full ILDC corpus, ~45M tokens)
  • Switch to Path B: tune num_hops, top_k, community_level per query type
  • Enable the judge-network traversal module (pipelines/judge_graph.py) for full multi-hop
  • Optimise chunk size (currently 512 words) based on BERTScore sensitivity analysis

📦 Dataset

| Property | Value |
|---|---|
| Source | OpenNyai ILDC |
| Full corpus | ~70,000 Indian Supreme Court judgments |
| Round 1 subset | 6,000 cases |
| Estimated tokens (Round 1) | ~3.8M (exceeds the 2M requirement) |
| License | Open research use |
| Graph nodes | Case, Article, Act, Judge, Bench |
| Graph edges | cites, references_article, references_act, authored_by, heard_by |

🔗 Links

| Resource | Link |
|---|---|
| 📹 Demo video | https://youtu.be/SFHkNvSppw8 |
| 📝 Blog post | https://dev.to/sujatha/lexgraph-4occ |
| 🐯 TigerGraph GraphRAG repo | github.com/tigergraph/graphrag |
| 🏆 Hackathon page | GraphRAG Inference Hackathon |
| 📊 Dataset | OpenNyai ILDC |

📜 Judging Criteria Alignment

| Criteria | Weight | What LexGraph Delivers |
|---|---|---|
| Token Reduction | 30% | 59.4% fewer tokens vs Basic RAG; 1,028 tokens saved per query; cost reduced by 59.2%. |
| Answer Accuracy | 30% | 100% judge pass rate; BERTScore rescaled 0.620 (above the ≥0.55 bonus); BERTScore raw 0.891 (above the ≥0.88 bonus). |
| Performance | 20% | Per-query latency tracked; concurrent throughput benchmark included; GraphRAG 3.8s avg vs Basic RAG 4.3s. |
| Engineering & Storytelling | 20% | Animated D3.js graph traversal, live Streamlit dashboard, benchmark report, blog post, demo video script, architecture diagram. |
| Bonus | +extra | ✅ Both bonus thresholds hit (judge ≥90%, BERTScore F1 rescaled ≥0.55 AND raw ≥0.88). |

Built for the GraphRAG Inference Hackathon by TigerGraph · MIT License
