wikipath

Vietnamese family-tree explorer. Open data, source-cited, contribution-friendly.

wikipath is a public reference for the family relationships of notable Vietnamese people — kings, scholars, artists, politicians, athletes. Search a name, view their family tree visually, find the relationship path between any two people, and trace every fact back to its source.

Status: pre-alpha, local development. Public site coming to wikipath.app. SPEC v1.2 lives in SPEC.md; 8 features are locked. A seed dataset (~6,000 persons, ~250 bios, ~1,500 avatars) is included; full bio enrichment runs on demand and requires a Kyma API key.

Features

| # | Feature | What it does |
| --- | --- | --- |
| F1 | Multi-source data pipeline | Wikidata + Wikipedia VN + LLM extraction + community contributions; every fact carries a source-sentence quote |
| F2 | Diacritic-aware search | "ng phu trong" matches "Nguyễn Phú Trọng"; the alt-name index covers tên hiệu, miếu hiệu, bút danh (see the folding sketch below the table) |
| F3 | Family tree visualization | Vertical layout, era-aware coloring, 4 generations up + 3 down by default |
| F4 | Detail modal | Quick stats, source badges, "Mở cây từ đây" / "So sánh quan hệ" actions |
| F5 | Compare / family-path | Shortest path between any two people, with Vietnamese kinship labels |
| F6 | Tiered community contribution | Anonymous Suggest → authenticated Edit → trusted Moderate, with a permanent audit log |
| F7 | Avatar rendering | Wikidata P18 photos via Wikimedia Commons, monogram fallback |
| F8 | Engagement analytics | Anonymous event tracking → popularity-driven enrichment priority |
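
For a sense of how F2's matching works, here is a minimal diacritic-folding sketch in Python. It is illustrative only — the production search path lives in the Go API and the DuckDB alt-name index, and the folding rules there may differ:

```python
import unicodedata

def fold(text: str) -> str:
    """Lower-case and strip Vietnamese diacritics (including đ → d)."""
    text = text.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

assert fold("Nguyễn Phú Trọng") == "nguyen phu trong"

# "ng phu trong" matches when every query token is a prefix of some folded name token.
query = fold("ng phu trong").split()
name = fold("Nguyễn Phú Trọng").split()
print(all(any(n.startswith(q) for n in name) for q in query))  # True
```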

Schema highlights (Vietnamese-aware)

The data model reflects Vietnamese kinship conventions rather than mapping to Western defaults:

  • Đa thê (rank) — multiple spouses ordered by rank (chính / thứ)
  • Names by kind — tên húy, tên hiệu, tên thụy, miếu hiệu, pháp danh, bút danh, tên cúng cơm
  • Half-siblings — sibling_paternal (cùng cha khác mẹ) vs sibling_maternal (cùng mẹ khác cha) vs sibling_full; see the sketch below
  • Child kinds — birth / adopted / step / foster / ritual_kin
  • Historicity — confirmed / probable / legendary / mythological
  • Era + dynasty — Lý, Trần, Lê, Mạc, Trịnh, Tây Sơn, Nguyễn, Hiện đại

See SPEC.md §3 for the full schema.
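
To make the half-sibling distinction concrete, here is an illustrative classification over a hypothetical, heavily simplified parent map — not the actual SPEC.md §3 schema, which also carries spouse rank, child kinds, historicity and sources:

```python
# Hypothetical simplified data: person -> (father_id, mother_id).
parents = {
    "A": ("F1", "M1"),
    "B": ("F1", "M2"),  # shares only the father with A
    "C": ("F2", "M1"),  # shares only the mother with A
    "D": ("F1", "M1"),  # shares both parents with A
}

def sibling_kind(x: str, y: str) -> str | None:
    fx, mx = parents[x]
    fy, my = parents[y]
    if fx == fy and mx == my:
        return "sibling_full"
    if fx == fy:
        return "sibling_paternal"  # cùng cha khác mẹ
    if mx == my:
        return "sibling_maternal"  # cùng mẹ khác cha
    return None

print(sibling_kind("A", "B"), sibling_kind("A", "C"), sibling_kind("A", "D"))
# sibling_paternal sibling_maternal sibling_full
```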

Architecture

                       SPARQL                         User browser
                          │                                 │
     Wikidata ─────► import_wikidata.py                     │
                          │                                 ▼
  Wikipedia VN ─► import_wiki_vi.py / enrich_async.py ─► Next.js (app/)
                          │                                 │
                          ▼                                 │ fetch /api/*
                   DuckDB (wikipath.duckdb) ◄──────►  Go API (cmd/api)
                          ▲                                 │
                          └─────── analytics events ◄───────┘
  • Backend: Go 1.24 + stdlib net/http, talking to DuckDB via go-duckdb/v2. One binary; core routes /api/search, /api/p/<id>, /api/p/<id>/details and /api/path, plus engagement endpoints /api/event, /api/trending and /api/admin/popularity.
  • Read store: DuckDB embedded; recursive CTEs for ancestor/descendant collection; <50ms p95 for a 4-up + 3-down tree.
  • Frontend: Next.js 16 App Router + Tailwind v4 + shadcn/ui + @xyflow/react for the tree.
  • Data pipeline: Python (asyncio + aiohttp) for bulk SPARQL imports; LLM enrichment via the Kyma API gateway. Default extraction model: deepseek-v4-pro. See "LLM model choice" below for a benchmark of stable alternatives.
  • Hallucination guard: every LLM-extracted fact must include a source_sentence field that is a literal substring of the input article; confidence is downgraded when the match is partial; names failing the is_valid_person_name() filter are rejected at insert time.
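
A minimal sketch of that guard, assuming a fact dict shaped like the pipeline's JSON output — the function bodies and field handling here are illustrative, not the exact code in scripts/enrich_async.py:

```python
import re

def source_check(fact: dict, article_text: str) -> str | None:
    """Return a confidence level for an LLM-extracted fact, or None to reject it."""
    quote = (fact.get("source_sentence") or "").strip()
    if not quote:
        return None                      # no citation -> reject outright
    if quote in article_text:
        return "high"                    # literal substring of the article -> keep
    # Partial match: most quoted words still appear, so keep it but downgrade confidence.
    words = [w for w in re.findall(r"\w+", quote) if len(w) > 1]
    if words and sum(w in article_text for w in words) / len(words) >= 0.7:
        return "low"
    return None                          # likely fabricated quote -> reject

def is_valid_person_name(name: str) -> bool:
    """Illustrative stand-in for the insert-time name filter mentioned above."""
    return 1 < len(name.strip()) < 100 and not any(ch.isdigit() for ch in name)
```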

LLM model choice

The enrichment pipeline talks to any OpenAI-compatible chat completions endpoint. We use the Kyma API gateway because it's Vietnamese-friendly (Asia-region latency, multilingual model catalog) and lets us swap underlying models without touching code — just change the LLM_MODEL constant in scripts/enrich_async.py.
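
A minimal sketch of that call, using the openai Python client against an OpenAI-compatible endpoint. The base-URL environment variable and the response_format option are assumptions — the actual pipeline in scripts/enrich_async.py uses asyncio + aiohttp, and your gateway's URL and JSON-mode support may differ:

```python
import json
import os

from openai import OpenAI  # any OpenAI-compatible client will do

LLM_MODEL = "deepseek-v4-pro"  # swap models here, mirroring scripts/enrich_async.py

client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],   # placeholder: your Kyma gateway URL
    api_key=os.environ["KYMA_API_KEY"],
)

article_text = "..."  # body of a Vietnamese Wikipedia article

resp = client.chat.completions.create(
    model=LLM_MODEL,
    messages=[
        {"role": "system", "content": "Extract family relations as JSON; quote a source_sentence for every fact."},
        {"role": "user", "content": article_text},
    ],
    response_format={"type": "json_object"},  # only if the gateway/model supports JSON mode
    temperature=0,
)
facts = json.loads(resp.choices[0].message.content)
```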

We benchmarked four Kyma-hosted models on the same Wikipedia VN extraction task (3 articles, identical prompt + JSON schema, 60 s per-request timeout; results from python3 scripts/bench_models.py). The table also includes the current production default, deepseek-v4-pro, whose numbers come from a separate 200-profile pilot rather than this bench:

| Model | OK rate | Median latency | Source-sentence match | Notes |
| --- | --- | --- | --- | --- |
| deepseek-v4-pro (preview) | 99.5% (200-profile pilot) | ~5 s | high | Current production default |
| deepseek-v3 (stable) | 3/3 in bench | 14.3 s | 100% | Verified drop-in alternative |
| gemini-2.5-flash | 1/3 | 43.8 s | 100% | Intermittent timeouts on long articles |
| glm-4.5-air | 1/3 | 32.5 s | 100% | Intermittent on long articles |
| qwen-3.6-plus | 0/3 | n/a | n/a | Did not return valid JSON in our bench |

Recommendation: keep deepseek-v4-pro as the production default for quality; switch to deepseek-v3 if you want a stable-tier model from the same vendor. The others are not yet recommended for Vietnamese-language structured extraction without further tuning.

To re-run the bench against your own model shortlist:

python3 scripts/bench_models.py --models deepseek-v3 gemini-2.5-flash

Quick start (local dev)

Requirements: Go 1.24+, Python 3.11+, Node 20+, DuckDB CLI, pnpm.

# 1. Clone + install
git clone https://github.com/sonpiaz/wikipath.git
cd wikipath
cd web && pnpm install && cd ..

# 2. Initialize DuckDB schema
duckdb wikipath.duckdb < internal/schema/001_init.sql
duckdb wikipath.duckdb < internal/schema/002_engagement.sql

# 3. Seed with 49 hand-curated profiles
pip install duckdb PyYAML
python3 scripts/seed_db.py

# 4. Build + run the Go API
go build -o bin/wikipath-api ./cmd/api
./bin/wikipath-api -db wikipath.duckdb -addr :8090

# 5. In another shell, run the Next.js dev server
cd web && pnpm dev --port 3010
# open http://localhost:3010
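
With both processes up, a quick smoke test against the Go API from Python (the query-parameter names are assumptions — check the handlers in cmd/api for the exact contract):

```python
import requests  # pip install requests

API = "http://localhost:8090"

# Search (parameter name assumed).
r = requests.get(f"{API}/api/search", params={"q": "nguyen phu trong"}, timeout=5)
print(r.status_code, r.json() if r.ok else r.text)

# Shortest family path between two person IDs from the search results
# (parameter names assumed; the frontend route is /path/[from]/[to]):
# requests.get(f"{API}/api/path", params={"from": "<id1>", "to": "<id2>"}, timeout=5)
```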

Optional larger dataset (~5,000 persons; requires a Kyma API key exported as KYMA_API_KEY):

# Wikidata bulk import (~3 min)
python3 scripts/import_wikidata.py

# Image enrichment (~16s)
python3 scripts/import_images.py

# Bulk LLM enrichment of biographies (~3h wall time, ~$7 LLM cost
# at concurrency 15 against a Vietnamese-language model)
python3 scripts/pick_candidates.py --limit 5000 -o data/candidates-5k.json
python3 scripts/enrich_async.py --candidates data/candidates-5k.json --concurrency 15

Note: DuckDB serializes file locks across processes. Stop the Go API before running batch Python scripts; restart afterwards.
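
With the API stopped, you can also poke at the read path directly from Python. The sketch below shows the kind of recursive ancestor walk described under Architecture, against a hypothetical simplified parent_child(parent_id, child_id) edge table — the real tables in internal/schema/001_init.sql differ:

```python
import duckdb

con = duckdb.connect("wikipath.duckdb", read_only=True)

ancestors = con.execute(
    """
    WITH RECURSIVE up AS (
        SELECT parent_id, 1 AS depth
        FROM parent_child WHERE child_id = ?
        UNION ALL
        SELECT pc.parent_id, up.depth + 1
        FROM parent_child pc
        JOIN up ON pc.child_id = up.parent_id
        WHERE up.depth < 4          -- 4 generations up, matching the F3 default
    )
    SELECT parent_id, depth FROM up
    """,
    ["<person-id>"],
).fetchall()
print(ancestors)
```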

Project layout

wikipath/
├── cmd/api/              # Go HTTP API (read-write DuckDB)
├── internal/
│   ├── schema/           # SQL migrations (001_init, 002_engagement)
│   └── store/            # DuckDB query layer (search, tree, detail, path, events)
├── scripts/              # Python data pipeline
│   ├── seed_db.py
│   ├── import_wiki_vi.py
│   ├── import_wikidata.py
│   ├── import_images.py
│   ├── pick_candidates.py
│   ├── enrich_async.py
│   └── cleanup_bad_llm_names.py
├── web/                  # Next.js 16 frontend
│   ├── app/              # routes: / | /p/[id] | /path/[from]/[to]
│   ├── components/       # SearchBox, FamilyTree, PersonModal, …
│   └── lib/              # api.ts, track.ts, utils
├── data/
│   └── seed-vi.yml       # 49 hand-curated profiles
├── docs/
│   ├── SPEC-v0-celebrity-hops.md
│   └── SPRINT-PLAN-v0.md
├── SPEC.md               # v1.2 design + execution plan
├── PRIVACY.md            # subject + user-data privacy
├── TERMS.md
├── CODE-OF-CONDUCT.md
├── CONTRIBUTOR-AGREEMENT.md
├── TAKEDOWN.md
├── LICENSE               # MIT (code)
├── LICENSE-DATA          # CC BY-SA 4.0 + ODbL (database)
└── DATA-SOURCES.md

Contributing

We accept contributions at three tiers — see CONTRIBUTOR-AGREEMENT.md:

  • Tier 0 — Suggest: anonymous, no login. Submit a correction or addition with an optional source URL.
  • Tier 1 — Edit: authenticated via magic email link, requires CLA acceptance, edits enter a moderation queue.
  • Tier 2 — Moderate: trusted reviewers approve/reject pending edits.

All contributions are bound by the Code of Conduct. The short version: cite or stay quiet; respect Vietnamese cultural conventions; living people deserve stronger privacy defaults than the deceased.

Privacy

  • About persons in the database: living people default to opt_out; takedown requests are honored within 7 days. See PRIVACY.md §1.
  • About visitors to the site: an anonymous session_id only, no PII, 90-day retention, opt-out via localStorage. See PRIVACY.md §2.

Licenses

  • Code — MIT (see LICENSE)
  • Database — CC BY-SA 4.0 + ODbL (see LICENSE-DATA); upstream sources are listed in DATA-SOURCES.md

A note on the v0 archive

The original wikipath repo was a clean-room reimplementation of the "Six Degrees of Wikipedia" pattern — celebrity-hop BFS over a Wikipedia-mention graph. The v0 SPEC + sprint plan are preserved in docs/SPEC-v0-celebrity-hops.md and docs/SPRINT-PLAN-v0.md for historical reference; the v0 source code was removed when the project pivoted to the Vietnamese family-tree explorer in May 2026. v1 was written fresh against the new SPEC — no v0 source is imported by the current codebase.

Contact

Takedown / correction requests: see TAKEDOWN.md. Other inquiries: open an issue at https://github.com/sonpiaz/wikipath/issues.
