RegenTribes Knowledge Graph

An LLM-powered extraction pipeline that turns any document Kreuzberg can read (75+ formats, with OCR) into a typed, traversable knowledge graph in SurrealDB, annotated with NARS epistemic truth values and vector embeddings for semantic search.

Architecture

[ANY DOCUMENT]
      |
      v
  Kreuzberg text extraction (75+ formats, OCR)
      |
      v
  Semantic chunker (paragraph boundaries, 800 char max, 200 char overlap)
      |
      v
  Grammar Triangle annotation (no LLM, pure keyword activation)
  - 20 NSM primitives (Wierzbicka semantic primes)
  - 8 Qualia dimensions (valence, arousal, certainty, agency, ...)
  - 5 Causality indicators (past/present/future + temporality + agency)
  - dominant_mode classification
      |
      v
  Claude LLM extraction via OpenRouter (batched, 4 chunks/call)
  - Concepts: name, type, description, rung (R0-R9), NARS truth values
  - Relations: 144-verb taxonomy, evidence quotes, temporal validity
      |
      v
  SurrealDB graph storage
  - Concept deduplication via SHA256 stable IDs
  - NARS revision merges evidence across documents
  - Native graph traversal with -> edges
      |
      v
  Vector embeddings (text-embedding-3-small, 1536d, HNSW cosine index)
  - Semantic search across all concepts
  - Find-similar by embedding distance
      |
      v
  3D visualization (3d-force-graph, self-contained HTML)
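
The chunking step in the diagram above (paragraph boundaries, 800 char max, 200 char overlap) could look roughly like the sketch below. This is not the code from grammar_triangle.py; the function name chunk_text, the blank-line paragraph rule, and the hard split for oversized paragraphs are all assumptions made for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 800, overlap: int = 200) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars characters.

    Hypothetical sketch: each new chunk starts with the last `overlap`
    characters of the previous one so context carries across boundaries.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    # Assumption: a paragraph boundary is one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            current = f"{current[-overlap:]}\n\n{para}" if overlap else para
        else:
            current = para
        # Hard-split anything that is still too long (e.g. one huge paragraph).
        while len(current) > max_chars:
            chunks.append(current[:max_chars])
            current = current[max_chars - overlap:]
    if current:
        chunks.append(current)
    return chunks
```

With the defaults, two 500-character paragraphs produce two chunks, and the second chunk begins with the last 200 characters of the first.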

Schema

document (node) --contains--> chunk (node) --mentions--> concept (node)
                                                              |
                                                         relates (edge)
                                                              |
                                                         concept (node)

Node tables:

document — source file metadata (title, mime_type, word_count, language, quality)
chunk — text segment with Grammar Triangle (NSM, Qualia, Causality, dominant_mode)
concept — named entity/idea/event with type, rung level (R0-R9), NARS truth values, aliases, tags, vector embedding

Edge tables (TYPE RELATION — enables native -> traversal):

contains — document to chunk provenance
mentions — chunk to concept provenance (with salience score)
relates — concept to concept semantic edge (verb from 144-verb taxonomy, evidence quote, temporal validity)

Quick Start

Prerequisites

Python 3.11+
SurrealDB v3.x running locally
OpenRouter API key (for LLM extraction + embeddings)

Setup

# Clone
git clone https://github.com/regentribes/regen-knowledge-graph.git
cd regen-knowledge-graph

# Create virtualenv
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Start SurrealDB
surreal start --bind 0.0.0.0:8000 --user root --pass <your-password> \
  surrealkv:./data/brain.db

# Apply schema
set -a; source .env; set +a  # export the variables so the Python process can read them
python pipeline.py stats  # bootstraps schema on first run

Ingest a document

source .venv/bin/activate
export $(grep -v "^#" .env | xargs)

# Ingest a file with embeddings
python pipeline.py ingest path/to/document.pdf --embed -v

# Ingest a directory
python pipeline.py ingest ./docs/ --embed -v

Search the graph

# Semantic search
python pipeline.py search "regenerative agriculture" --limit 10

# Find similar concepts
python pipeline.py similar "concept:abc123"

# Database stats
python pipeline.py stats
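
To make the ranking behind the search and similar commands concrete, here is a minimal brute-force sketch of cosine-similarity ranking. The pipeline itself uses an HNSW cosine index inside SurrealDB, which is approximate and far faster; the function names and the in-memory dictionary of vectors below are assumptions for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar(
    query: list[float],
    vectors: dict[str, list[float]],
    limit: int = 10,
) -> list[tuple[str, float]]:
    """Return the `limit` concept IDs whose embeddings are closest to `query`."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy 2-dimensional embeddings; real ones would have 1536 dimensions.
toy = {"concept:a": [1.0, 0.0], "concept:b": [0.0, 1.0], "concept:c": [1.0, 1.0]}
top_two = find_similar([1.0, 0.0], toy, limit=2)
```

Here top_two ranks concept:a first and concept:c second, because concept:b is orthogonal to the query.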

Visualize

# Generate interactive 3D HTML visualization
python pipeline.py viz

# Export graph as JSON
python pipeline.py export -o graph.json

Pipeline Commands

Command	Purpose
`python pipeline.py ingest <path> --embed -v`	Ingest file/directory + generate embeddings
`python pipeline.py embed [--force]`	Generate embeddings for all concepts
`python pipeline.py search "<query>" --limit N`	Semantic similarity search
`python pipeline.py similar <concept_id>`	Find similar concepts by embedding
`python pipeline.py stats`	Database overview
`python pipeline.py viz [--limit N]`	Generate 3D HTML visualization
`python pipeline.py export -o file.json`	Export graph as JSON

Shell Scripts

Wrapper scripts in scripts/ provide JSON output for integration with bots and other tools.

Script	Usage	Output
`scripts/ingest.sh <file>`	Ingest + search for connections	JSON: doc_id, counts, top concepts, connections
`scripts/query.sh "<query>"`	Semantic search + related edges	JSON: results, connections
`scripts/relate.sh "<A>" "<B>"`	Find paths between concepts	JSON: direct relations, shared neighbors
`scripts/capture.sh "<text>"`	Capture text snippet	JSON: doc_id, counts
`scripts/stats.sh`	Database overview	JSON: table counts, types, verbs
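
A bot or other tool could consume these scripts along the lines of the sketch below. It assumes each script prints a single JSON document on stdout and exits non-zero on failure, which the table implies but does not guarantee; the helper name run_json_command is hypothetical.

```python
import json
import subprocess
from typing import Any


def run_json_command(argv: list[str]) -> Any:
    """Run a command and parse its stdout as JSON.

    Assumption: the command writes exactly one JSON document to stdout.
    Raises subprocess.CalledProcessError on a non-zero exit code and
    json.JSONDecodeError if the output is not valid JSON.
    """
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


# Example call (requires the repository checkout and a running database):
# result = run_json_command(["scripts/query.sh", "regenerative agriculture"])
```

Passing the arguments as a list avoids shell quoting problems when a query contains spaces or quotes.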

Components

File	Role
`pipeline.py`	CLI orchestrator — routes to all commands
`graph_extract.py`	Main pipeline: `GraphExtractor` class (Kreuzberg -> chunks -> LLM -> SurrealDB)
`llm_parser.py`	LLM extraction via OpenRouter — batched chunk processing, JSON parsing
`grammar_triangle.py`	Zero-cost semantic annotation (NSM + Qualia + Causality) + text chunker
`verbs.py`	144-verb taxonomy (6 categories x 24 verbs), fuzzy matching and normalization
`embeddings.py`	Vector embedding generation (text-embedding-3-small) and semantic search
`schema.surql`	Full SurrealDB schema definition
`batch_ingest.py`	Rule-based bulk ingestion (no LLM, pattern matching only)
`batch_ingest_v2.py`	Improved rule-based bulk ingestion with extended patterns
`subagent_parser.py`	Alternative extraction via sub-agent spawning
`viz/graph_viz.py`	3D visualization generator (queries SurrealDB, produces self-contained HTML)

Key Design Decisions

NARS Epistemic Truth Values

Every concept and relation carries a truth value (frequency, confidence, evidence_count), written (f, c, n) below, from the Non-Axiomatic Reasoning System (NARS). When the same fact is observed in multiple documents, the prior value (f1, c1, n1) and the new value (f2, c2, n2) merge via weighted revision:

w1 = c1 / (1 - c1) * n1        # prior weight
w2 = c2 / (1 - c2) * n2        # new evidence weight
freq_new = (w1*f1 + w2*f2) / (w1 + w2)
conf_new = (w1 + w2) / (w1 + w2 + 1.0)
n_new    = n1 + n2

Confidence assignment by the LLM:

Direct statement in text: 0.8-0.95
Clearly implied: 0.4-0.7
Speculative/hedged: 0.1-0.4
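
The revision formula above translates directly into Python. The sketch below follows it term by term; the Truth tuple, the function name revise, and the input bounds check are assumptions added for illustration and may differ from the project's own code.

```python
from typing import NamedTuple


class Truth(NamedTuple):
    frequency: float      # f: share of positive evidence, 0..1
    confidence: float     # c: strictly between 0 and 1
    evidence_count: int   # n: number of observations


def revise(prior: Truth, new: Truth) -> Truth:
    """Merge two truth values using the weighted revision formula above."""
    f1, c1, n1 = prior
    f2, c2, n2 = new
    # Assumption: reject inputs that would divide by zero or give zero weight.
    for c, n in ((c1, n1), (c2, n2)):
        if not 0.0 < c < 1.0 or n < 1:
            raise ValueError("confidence must be in (0, 1) and evidence_count >= 1")
    w1 = c1 / (1 - c1) * n1        # prior weight
    w2 = c2 / (1 - c2) * n2        # new evidence weight
    total = w1 + w2
    return Truth(
        frequency=(w1 * f1 + w2 * f2) / total,
        confidence=total / (total + 1.0),
        evidence_count=n1 + n2,
    )


# A direct statement (c = 0.8) confirmed by an implied one (c = 0.5).
merged = revise(Truth(1.0, 0.8, 1), Truth(1.0, 0.5, 1))
```

In this example the weights are 4 and 1, so the merged confidence is 5/6 (about 0.83), higher than either input, while the frequency stays at 1.0.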

144-Verb Taxonomy

Six categories x 24 verbs. Every relates edge uses exactly one. LLM output is fuzzy-matched then normalized; unknown verbs fall back to RELATED_TO.

Category	Verbs (sample)
Structural	IS_A, HAS_A, PART_OF, CONTAINS, DEPENDS_ON, CONTRADICTS, SUPPORTS
Causal	CAUSES, ENABLES, PREVENTS, TRIGGERS, TRANSFORMS, REQUIRES
Temporal	BEFORE, AFTER, DURING, PRECEDES, FOLLOWS, DEPRECATED
Epistemic	KNOWS, BELIEVES, INFERS, PREDICTS, DISCOVERS, CONCLUDES
Agentive	DOES, WANTS, DECIDES, CREATES, DESTROYS, CONTROLS
Experiential	FEELS, SEES, ENJOYS, FEARS, HOPES, EMBRACES
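
The fuzzy-match-then-fallback behaviour might be sketched as follows. This uses the standard library's difflib and only the sample verbs from the table; the 0.8 cutoff and the function name normalize_verb are assumptions, and verbs.py may use a different matcher entirely.

```python
import difflib

# Sample verbs from the table above, not the full 144-verb taxonomy.
VERB_CATEGORIES = {
    "Structural": ["IS_A", "HAS_A", "PART_OF", "CONTAINS", "DEPENDS_ON",
                   "CONTRADICTS", "SUPPORTS"],
    "Causal": ["CAUSES", "ENABLES", "PREVENTS", "TRIGGERS", "TRANSFORMS",
               "REQUIRES"],
    "Temporal": ["BEFORE", "AFTER", "DURING", "PRECEDES", "FOLLOWS",
                 "DEPRECATED"],
    "Epistemic": ["KNOWS", "BELIEVES", "INFERS", "PREDICTS", "DISCOVERS",
                  "CONCLUDES"],
    "Agentive": ["DOES", "WANTS", "DECIDES", "CREATES", "DESTROYS", "CONTROLS"],
    "Experiential": ["FEELS", "SEES", "ENJOYS", "FEARS", "HOPES", "EMBRACES"],
}
KNOWN_VERBS = [verb for verbs in VERB_CATEGORIES.values() for verb in verbs]
FALLBACK_VERB = "RELATED_TO"


def normalize_verb(raw: str, cutoff: float = 0.8) -> str:
    """Map free-form LLM output onto a known verb, or fall back to RELATED_TO."""
    candidate = raw.strip().upper().replace(" ", "_").replace("-", "_")
    if candidate in KNOWN_VERBS:
        return candidate
    # Assumption: accept the closest known verb if it is similar enough.
    matches = difflib.get_close_matches(candidate, KNOWN_VERBS, n=1, cutoff=cutoff)
    return matches[0] if matches else FALLBACK_VERB
```

For example, normalize_verb("part of") yields PART_OF, normalize_verb("cause") is matched to CAUSES, and an unrelated string falls back to RELATED_TO.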

Rung Levels (R0-R9)

Abstraction depth assigned by the LLM per concept:

Rung	Name	Example
R0	Surface literal	"Redis", "port 8080"
R1	Shallow inference	"HTTP server", "cache layer"
R2	Contextual	"storage backend", "feature flag system"
R3	Analogical	"SpineCache is like a blackboard"
R4	Abstract pattern	"zero-copy architecture"
R5	Structural schema	"content-addressable memory model"
R6	Counterfactual	"if API were fixed, latency would halve"
R7	Meta	"reasoning about the reasoning system"
R8	Recursive	"the system observing its own observation"
R9	Transcendent	"consciousness substrate"

Grammar Triangle (Zero-Cost Annotation)

Every chunk is annotated before any LLM call using pure keyword activation, giving a continuous semantic fingerprint at zero API cost:

NSM — 20 Wierzbicka semantic primitives (FEEL, THINK, KNOW, WANT, SEE, DO, HAPPEN, SAY, GOOD, BAD, EXIST, SELF, OTHER, BECAUSE, IF, CAN, MAYBE, AFTER, BEFORE, NOT)
Qualia — 8 phenomenal dimensions (valence, arousal, intimacy, certainty, agency, emergence, continuity, abstraction)
Causality — past/present/future tense balance + temporality + agency
dominant_mode — emotional | cognitive | existential | emergent | relational | technical
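
Keyword activation for the NSM primitives could work along the lines of this sketch. The keyword sets are invented for illustration and cover only four of the twenty primitives, and normalizing by token count is an assumption; the project's real keyword lists and scoring are not shown here.

```python
import re

# Hypothetical keyword sets for four of the twenty primitives.
NSM_KEYWORDS = {
    "FEEL": {"feel", "feels", "felt", "feeling"},
    "THINK": {"think", "thinks", "thought"},
    "KNOW": {"know", "knows", "knew", "known"},
    "WANT": {"want", "wants", "wanted"},
}


def nsm_activation(text: str) -> dict[str, float]:
    """Score each primitive as the share of tokens that match its keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {prime: 0.0 for prime in NSM_KEYWORDS}
    return {
        prime: sum(token in keywords for token in tokens) / len(tokens)
        for prime, keywords in NSM_KEYWORDS.items()
    }


scores = nsm_activation("I feel that we know what we want")
```

The sentence has eight tokens and one hit each for FEEL, KNOW, and WANT, so those three score 0.125 and THINK scores 0.0. No API call is involved, which is what makes the annotation free.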

Concept Deduplication

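Stable IDs are derived from sha256(name.lower() + "_" + type)[:16], so the same name and type always map to the same record regardless of capitalization. UPSERT MERGE updates that record without discarding its existing fields; NARS revision merges epistemic state when the same concept appears in multiple documents.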
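
A minimal sketch of the ID scheme, assuming the hash input is UTF-8 encoded and the 16 characters are taken from the hex digest (the source expression does not spell out either detail). The function name concept_id and the "practice" type used in the example are hypothetical.

```python
import hashlib


def concept_id(name: str, concept_type: str) -> str:
    """Derive a 16-character stable ID from a concept's name and type."""
    key = name.lower() + "_" + concept_type
    # Assumptions: UTF-8 encoding, first 16 hex characters of the digest.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


same_a = concept_id("Permaculture", "practice")
same_b = concept_id("PERMACULTURE", "practice")
other = concept_id("Permaculture", "organization")
```

same_a and same_b are identical because the name is lowercased before hashing, while other differs because the type is part of the key. Note that only the name is lowercased, so the type string must be spelled consistently.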

SurrealDB Query Examples

-- All causal neighbours of a concept
SELECT ->relates[WHERE verb_category = 'Causal']->(concept)
FROM concept WHERE name = 'Permaculture';

-- 3-hop traversal
SELECT ->relates->(concept)->relates->(concept)->relates->concept
FROM concept WHERE name = 'Community Garden';

-- High-confidence structural relations
SELECT * FROM relates
WHERE verb_category = 'Structural' AND nars_confidence > 0.7;

-- Abstract concepts (R4+), ranked by confidence
SELECT name, type, rung, description, nars_confidence FROM concept
WHERE rung >= 4 ORDER BY nars_confidence DESC LIMIT 20;

-- Disputed claims: mostly negative evidence (frequency < 0.5) held at high confidence
SELECT * FROM concept
WHERE nars_frequency < 0.5 AND nars_confidence > 0.7;

-- Full document subgraph (concepts + provenance)
SELECT *, ->contains->(chunk)->mentions->(concept)
FROM document WHERE title = 'README.md';

-- Emotionally loaded chunks (high arousal + negative valence)
SELECT text, dominant_mode, qualia FROM chunk
WHERE qualia.arousal > 0.5 AND qualia.valence < 0.35;

Environment Variables

Variable	Required	Description
`OPENROUTER_API_KEY`	Yes	OpenRouter API key for LLM extraction and embeddings
`SURREAL_PASS`	Yes	SurrealDB root password
`SURREALDB_URL`	No	SurrealDB WebSocket URL (default: `ws://127.0.0.1:8000`)
`EXTRACTION_MODEL`	No	LLM model for extraction (default: `anthropic/claude-sonnet-4`)
`EMBEDDING_MODEL`	No	Embedding model (default: `openai/text-embedding-3-small`)

Docs

Design specifications and project documentation live in docs/:

docs/PROJECT-objectives.md — Overall project goals
docs/USECASES.md — 8 usage scenarios
docs/specs/ — Detailed evaluation specs for each component

Stack

Python — pipeline orchestration
Kreuzberg — document text extraction (75+ formats, OCR)
Claude Sonnet (via OpenRouter) — concept and relation extraction
SurrealDB — graph storage with native -> traversal
OpenAI text-embedding-3-small (via OpenRouter) — vector embeddings
3d-force-graph — interactive 3D visualization

License

MIT
