LLM-powered extraction pipeline that transforms any document into a typed, traversable knowledge graph in SurrealDB with NARS epistemic truth values and vector embeddings for semantic search.
[ANY DOCUMENT]
|
v
Kreuzberg text extraction (75+ formats, OCR)
|
v
Semantic chunker (paragraph boundaries, 800 char max, 200 char overlap)
|
v
Grammar Triangle annotation (no LLM, pure keyword activation)
- 20 NSM primitives (Wierzbicka semantic primes)
- 8 Qualia dimensions (valence, arousal, certainty, agency, ...)
- 5 Causality indicators (past/present/future + temporality + agency)
- dominant_mode classification
|
v
Claude LLM extraction via OpenRouter (batched, 4 chunks/call)
- Concepts: name, type, description, rung (R0-R9), NARS truth values
- Relations: 144-verb taxonomy, evidence quotes, temporal validity
|
v
SurrealDB graph storage
- Concept deduplication via SHA256 stable IDs
- NARS revision merges evidence across documents
- Native graph traversal with -> edges
|
v
Vector embeddings (text-embedding-3-small, 1536d, HNSW cosine index)
- Semantic search across all concepts
- Find-similar by embedding distance
|
v
3D visualization (3d-force-graph, self-contained HTML)
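The chunking stage in the pipeline above (paragraph boundaries, 800-character maximum, 200-character overlap) could look roughly like the sketch below. The function name `chunk_text`, the greedy packing strategy, and the hard split of oversized paragraphs are assumptions made for illustration; the project's own chunker may behave differently.

```python
import re


def chunk_text(text: str, max_chars: int = 800, overlap: int = 200) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars, carrying an overlap tail."""
    if max_chars - overlap <= 2:
        raise ValueError("overlap must leave room for new text in each chunk")

    # Paragraph boundaries are taken to be blank lines (an assumption).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Hard-split any paragraph that cannot fit next to an overlap tail
    # plus the two-character paragraph separator.
    room = max_chars - overlap - 2
    pieces: list[str] = []
    for paragraph in paragraphs:
        pieces.extend(paragraph[i:i + room] for i in range(0, len(paragraph), room))

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}\n\n{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The chunk is full: emit it and seed the next one with its tail.
        chunks.append(current)
        current = f"{current[-overlap:]}\n\n{piece}"
    if current:
        chunks.append(current)
    return chunks
```

Each chunk after the first starts with the last 200 characters of the previous one, so a sentence that straddles a boundary is still seen whole by at least one chunk.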
document (node) --contains--> chunk (node) --mentions--> concept (node)
                                                              |
                                                        relates (edge)
                                                              |
                                                        concept (node)
Node tables:
- `document`: source file metadata (title, mime_type, word_count, language, quality)
- `chunk`: text segment with Grammar Triangle annotations (NSM, Qualia, Causality, dominant_mode)
- `concept`: named entity, idea, or event with type, rung level (R0-R9), NARS truth values, aliases, tags, and vector embedding

Edge tables (TYPE RELATION, which enables native -> traversal):
- `contains`: document-to-chunk provenance
- `mentions`: chunk-to-concept provenance (with salience score)
- `relates`: concept-to-concept semantic edge (verb from the 144-verb taxonomy, evidence quote, temporal validity)
- Python 3.11+
- SurrealDB v3.x running locally
- OpenRouter API key (for LLM extraction + embeddings)
# Clone
git clone https://github.com/regentribes/regen-knowledge-graph.git
cd regen-knowledge-graph
# Create virtualenv
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your credentials
# Start SurrealDB
surreal start --bind 0.0.0.0:8000 --user root --pass <your-password> \
surrealkv:./data/brain.db
# Apply schema
set -a; source .env; set +a # export every variable defined in .env
python pipeline.py stats # bootstraps schema on first run

source .venv/bin/activate
export $(grep -v "^#" .env | xargs)
# Ingest a file with embeddings
python pipeline.py ingest path/to/document.pdf --embed -v
# Ingest a directory
python pipeline.py ingest ./docs/ --embed -v

# Semantic search
python pipeline.py search "regenerative agriculture" --limit 10
# Find similar concepts
python pipeline.py similar "concept:abc123"
# Database stats
python pipeline.py stats

# Generate interactive 3D HTML visualization
python pipeline.py viz
# Export graph as JSON
python pipeline.py export -o graph.json

| Command | Purpose |
|---|---|
| `python pipeline.py ingest <path> --embed -v` | Ingest file/directory + generate embeddings |
| `python pipeline.py embed [--force]` | Generate embeddings for all concepts |
| `python pipeline.py search "<query>" --limit N` | Semantic similarity search |
| `python pipeline.py similar <concept_id>` | Find similar concepts by embedding |
| `python pipeline.py stats` | Database overview |
| `python pipeline.py viz [--limit N]` | Generate 3D HTML visualization |
| `python pipeline.py export -o file.json` | Export graph as JSON |

Wrapper scripts in scripts/ provide JSON output for integration with bots and other tools.
| Script | Usage | Output |
|---|---|---|
| `scripts/ingest.sh <file>` | Ingest + search for connections | JSON: doc_id, counts, top concepts, connections |
| `scripts/query.sh "<query>"` | Semantic search + related edges | JSON: results, connections |
| `scripts/relate.sh "<A>" "<B>"` | Find paths between concepts | JSON: direct relations, shared neighbors |
| `scripts/capture.sh "<text>"` | Capture text snippet | JSON: doc_id, counts |
| `scripts/stats.sh` | Database overview | JSON: table counts, types, verbs |
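
Because the wrapper scripts emit JSON, a bot or other tool can treat each one as a function that returns a dict. The sketch below assumes a script writes exactly one JSON object to stdout and exits non-zero on failure; this README confirms neither detail, and `run_json_command` is a hypothetical helper, not part of the repository.

```python
import json
import subprocess


def run_json_command(argv: list[str], timeout: float = 300.0) -> dict:
    """Run a command and parse its stdout as a single JSON object."""
    proc = subprocess.run(
        argv,
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout,      # fail instead of hanging on a stuck script
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return json.loads(proc.stdout)
```

A call such as `run_json_command(["scripts/query.sh", "regenerative agriculture"])` would then return the parsed result. The exact key names (for example `results` and `connections`) are inferred from the Output column above and should be checked against real output.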
| File | Role |
|---|---|
| `pipeline.py` | CLI orchestrator; routes to all commands |
| `graph_extract.py` | Main pipeline: GraphExtractor class (Kreuzberg -> chunks -> LLM -> SurrealDB) |
| `llm_parser.py` | LLM extraction via OpenRouter: batched chunk processing, JSON parsing |
| `grammar_triangle.py` | Zero-cost semantic annotation (NSM + Qualia + Causality) + text chunker |
| `verbs.py` | 144-verb taxonomy (6 categories x 24 verbs), fuzzy matching and normalization |
| `embeddings.py` | Vector embedding generation (text-embedding-3-small) and semantic search |
| `schema.surql` | Full SurrealDB schema definition |
| `batch_ingest.py` | Rule-based bulk ingestion (no LLM, pattern matching only) |
| `batch_ingest_v2.py` | Improved rule-based bulk ingestion with extended patterns |
| `subagent_parser.py` | Alternative extraction via sub-agent spawning |
| `viz/graph_viz.py` | 3D visualization generator (queries SurrealDB, produces self-contained HTML) |

Every concept and relation carries a truth value (frequency, confidence, evidence_count) in the style of the Non-Axiomatic Reasoning System (NARS). When the same fact is observed in multiple documents, the evidence is merged by weighted revision:
w1 = c1 / (1 - c1) * n1 # prior weight
w2 = c2 / (1 - c2) * n2 # new evidence weight
freq_new = (w1*f1 + w2*f2) / (w1 + w2)
conf_new = (w1 + w2) / (w1 + w2 + 1.0)
n_new = n1 + n2

Confidence assignment by the LLM:
- Direct statement in text: 0.8-0.95
- Clearly implied: 0.4-0.7
- Speculative/hedged: 0.1-0.4
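
The revision rule above can be sketched as a small Python function. The `Truth` dataclass, the `revise` name, the `max_conf` clamp (which avoids dividing by zero when a confidence of exactly 1.0 arrives), and the handling of zero total weight are illustrative assumptions rather than the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Truth:
    frequency: float      # f: how often the statement held
    confidence: float     # c: how much evidence backs f, in [0, 1)
    evidence_count: int   # n: number of observations merged so far


def revise(prior: Truth, new: Truth, max_conf: float = 0.99) -> Truth:
    """Merge two truth values using the weighted revision formulas above."""
    c1 = min(prior.confidence, max_conf)  # assumed clamp, not in the README
    c2 = min(new.confidence, max_conf)
    w1 = c1 / (1.0 - c1) * prior.evidence_count  # prior weight
    w2 = c2 / (1.0 - c2) * new.evidence_count    # new evidence weight
    total = w1 + w2
    n_new = prior.evidence_count + new.evidence_count
    if total == 0.0:
        # No usable evidence on either side: keep the prior frequency.
        return Truth(prior.frequency, 0.0, n_new)
    freq_new = (w1 * prior.frequency + w2 * new.frequency) / total
    conf_new = total / (total + 1.0)
    return Truth(freq_new, conf_new, n_new)
```

For example, merging (0.9, 0.5, 1) with (0.7, 0.5, 1) gives both sides a weight of 1, so the result has frequency 0.8, confidence 2/3, and an evidence count of 2. Confidence rises even though the two sources disagree slightly, because more evidence has been seen.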
Six categories x 24 verbs. Every relates edge uses exactly one. LLM output is fuzzy-matched then normalized; unknown verbs fall back to RELATED_TO.
| Category | Verbs (sample) |
|---|---|
| Structural | IS_A, HAS_A, PART_OF, CONTAINS, DEPENDS_ON, CONTRADICTS, SUPPORTS |
| Causal | CAUSES, ENABLES, PREVENTS, TRIGGERS, TRANSFORMS, REQUIRES |
| Temporal | BEFORE, AFTER, DURING, PRECEDES, FOLLOWS, DEPRECATED |
| Epistemic | KNOWS, BELIEVES, INFERS, PREDICTS, DISCOVERS, CONCLUDES |
| Agentive | DOES, WANTS, DECIDES, CREATES, DESTROYS, CONTROLS |
| Experiential | FEELS, SEES, ENJOYS, FEARS, HOPES, EMBRACES |
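
One way to implement "fuzzy-match, normalize, fall back to RELATED_TO" is with the standard library's `difflib`. The sketch below uses only the sample verbs from the table, not the full 144; the `normalize_verb` name, the 0.8 similarity cutoff, and the choice of `difflib` are assumptions, and `verbs.py` may use a different matching strategy.

```python
import difflib
import re

# Sample verbs copied from the table above (a subset of the full taxonomy).
SAMPLE_VERBS = {
    "Structural": ["IS_A", "HAS_A", "PART_OF", "CONTAINS", "DEPENDS_ON", "CONTRADICTS", "SUPPORTS"],
    "Causal": ["CAUSES", "ENABLES", "PREVENTS", "TRIGGERS", "TRANSFORMS", "REQUIRES"],
    "Temporal": ["BEFORE", "AFTER", "DURING", "PRECEDES", "FOLLOWS", "DEPRECATED"],
    "Epistemic": ["KNOWS", "BELIEVES", "INFERS", "PREDICTS", "DISCOVERS", "CONCLUDES"],
    "Agentive": ["DOES", "WANTS", "DECIDES", "CREATES", "DESTROYS", "CONTROLS"],
    "Experiential": ["FEELS", "SEES", "ENJOYS", "FEARS", "HOPES", "EMBRACES"],
}
CATEGORY_BY_VERB = {verb: cat for cat, verbs in SAMPLE_VERBS.items() for verb in verbs}
FALLBACK = "RELATED_TO"


def normalize_verb(raw: str, cutoff: float = 0.8) -> str:
    """Map free-form LLM output onto a known verb, or fall back to RELATED_TO."""
    # "depends on" -> "DEPENDS_ON": collapse non-letters into single underscores.
    key = re.sub(r"[^A-Za-z]+", "_", raw).strip("_").upper()
    if key in CATEGORY_BY_VERB:
        return key
    # Otherwise take the closest known verb, if any is similar enough.
    matches = difflib.get_close_matches(key, list(CATEGORY_BY_VERB), n=1, cutoff=cutoff)
    return matches[0] if matches else FALLBACK
```

With this cutoff a small misspelling such as "causses" still resolves to CAUSES, while an unrelated string falls through to RELATED_TO. A lower cutoff would rescue more typos at the cost of more wrong matches.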
Abstraction depth assigned by the LLM per concept:
| Rung | Name | Example |
|---|---|---|
| R0 | Surface literal | "Redis", "port 8080" |
| R1 | Shallow inference | "HTTP server", "cache layer" |
| R2 | Contextual | "storage backend", "feature flag system" |
| R3 | Analogical | "SpineCache is like a blackboard" |
| R4 | Abstract pattern | "zero-copy architecture" |
| R5 | Structural schema | "content-addressable memory model" |
| R6 | Counterfactual | "if API were fixed, latency would halve" |
| R7 | Meta | "reasoning about the reasoning system" |
| R8 | Recursive | "the system observing its own observation" |
| R9 | Transcendent | "consciousness substrate" |
Every chunk is annotated before any LLM call using pure keyword activation, giving a continuous semantic fingerprint at zero API cost:
- NSM — 20 Wierzbicka semantic primitives (FEEL, THINK, KNOW, WANT, SEE, DO, HAPPEN, SAY, GOOD, BAD, EXIST, SELF, OTHER, BECAUSE, IF, CAN, MAYBE, AFTER, BEFORE, NOT)
- Qualia — 8 phenomenal dimensions (valence, arousal, intimacy, certainty, agency, emergence, continuity, abstraction)
- Causality — past/present/future tense balance + temporality + agency
- dominant_mode — emotional | cognitive | existential | emergent | relational | technical
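
To illustrate what "pure keyword activation" can mean, here is a minimal sketch for a handful of NSM primitives. The keyword lists are invented for this example (this README does not show the project's lexicon), and scoring each primitive as the share of tokens that hit its keywords is only one plausible definition of activation.

```python
import re
from collections import Counter

# Hypothetical keyword lists covering 5 of the 20 primitives.
NSM_KEYWORDS = {
    "FEEL": {"feel", "feels", "felt", "feeling"},
    "THINK": {"think", "thinks", "thought"},
    "KNOW": {"know", "knows", "knew", "known"},
    "WANT": {"want", "wants", "wanted"},
    "BECAUSE": {"because", "since", "therefore"},
}


def nsm_activation(text: str) -> dict[str, float]:
    """Return, per primitive, the fraction of tokens matching its keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in NSM_KEYWORDS}
    counts = Counter(tokens)
    return {
        name: sum(counts[word] for word in words) / len(tokens)
        for name, words in NSM_KEYWORDS.items()
    }
```

Because the scores come from counting tokens, annotating a chunk needs no network call, which is why this step can run on every chunk before any LLM request.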
Stable IDs from sha256(name.lower() + "_" + type)[:16]. UPSERT MERGE preserves all fields; NARS revision merges epistemic state when the same concept appears in multiple documents.
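
The stable-ID formula above translates almost directly into Python. The function name `concept_id`, the UTF-8 encoding, and the use of the hexadecimal digest are assumptions; the README gives only the expression itself.

```python
import hashlib


def concept_id(name: str, concept_type: str) -> str:
    """Derive a 16-character ID from the lowercased name and the type."""
    key = f"{name.lower()}_{concept_type}"
    # Assumed: UTF-8 bytes in, first 16 hex characters of the digest out.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Since the name is lowercased before hashing, "Permaculture" and "PERMACULTURE" of the same type map to the same ID, which is what lets repeated mentions across documents land on one record. The same name with a different type yields a different ID.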
-- All causal neighbours of a concept
SELECT ->relates[WHERE verb_category = 'Causal']->(concept)
FROM concept WHERE name = 'Permaculture';
-- 3-hop traversal
SELECT ->relates->(concept)->relates->(concept)->relates->concept
FROM concept WHERE name = 'Community Garden';
-- High-confidence structural relations
SELECT * FROM relates
WHERE verb_category = 'Structural' AND nars_confidence > 0.7;
-- Abstract concepts (R4+), ranked by confidence
SELECT name, type, rung, description, nars_confidence FROM concept
WHERE rung >= 4 ORDER BY nars_confidence DESC LIMIT 20;
-- Disputed claims: often false but confidently stated
SELECT * FROM concept
WHERE nars_frequency < 0.5 AND nars_confidence > 0.7;
-- Full document subgraph (concepts + provenance)
SELECT *, ->contains->(chunk)->mentions->(concept)
FROM document WHERE title = 'README.md';
-- Emotionally loaded chunks (high arousal + negative valence)
SELECT text, dominant_mode, qualia FROM chunk
WHERE qualia.arousal > 0.5 AND qualia.valence < 0.35;

| Variable | Required | Description |
|---|---|---|
| `OPENROUTER_API_KEY` | Yes | OpenRouter API key for LLM extraction and embeddings |
| `SURREAL_PASS` | Yes | SurrealDB root password |
| `SURREALDB_URL` | No | SurrealDB WebSocket URL (default: ws://127.0.0.1:8000) |
| `EXTRACTION_MODEL` | No | LLM model for extraction (default: anthropic/claude-sonnet-4) |
| `EMBEDDING_MODEL` | No | Embedding model (default: openai/text-embedding-3-small) |
Design specifications and project documentation live in docs/:
- `docs/PROJECT-objectives.md`: overall project goals
- `docs/USECASES.md`: 8 usage scenarios
- `docs/specs/`: detailed evaluation specs for each component

- Python: pipeline orchestration
- Kreuzberg: document text extraction (75+ formats, OCR)
- Claude Sonnet (via OpenRouter): concept and relation extraction
- SurrealDB: graph storage with native -> traversal
- OpenAI text-embedding-3-small (via OpenRouter): vector embeddings
- 3d-force-graph: interactive 3D visualization

MIT