RegenTribes Knowledge Graph

An LLM-powered extraction pipeline that turns documents (75+ formats via Kreuzberg) into a typed, traversable knowledge graph in SurrealDB, with NARS epistemic truth values on every concept and relation and vector embeddings for semantic search.

Architecture

[ANY DOCUMENT]
      |
      v
  Kreuzberg text extraction (75+ formats, OCR)
      |
      v
  Semantic chunker (paragraph boundaries, 800 char max, 200 char overlap)
      |
      v
  Grammar Triangle annotation (no LLM, pure keyword activation)
  - 20 NSM primitives (Wierzbicka semantic primes)
  - 8 Qualia dimensions (valence, arousal, certainty, agency, ...)
  - 5 Causality indicators (past/present/future + temporality + agency)
  - dominant_mode classification
      |
      v
  Claude LLM extraction via OpenRouter (batched, 4 chunks/call)
  - Concepts: name, type, description, rung (R0-R9), NARS truth values
  - Relations: 144-verb taxonomy, evidence quotes, temporal validity
      |
      v
  SurrealDB graph storage
  - Concept deduplication via SHA256 stable IDs
  - NARS revision merges evidence across documents
  - Native graph traversal with -> edges
      |
      v
  Vector embeddings (text-embedding-3-small, 1536d, HNSW cosine index)
  - Semantic search across all concepts
  - Find-similar by embedding distance
      |
      v
  3D visualization (3d-force-graph, self-contained HTML)
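
The chunking step in the diagram above can be sketched as follows. This is a minimal illustration, not the code in grammar_triangle.py: the function name `chunk_text`, the blank-line paragraph split, and the way the overlap tail is carried forward are assumptions. Only the paragraph boundaries, the 800-character maximum, and the 200-character overlap come from the diagram.

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 200) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars characters.

    Each new chunk starts with the last `overlap` characters of the
    previous one (hypothetical overlap strategy).
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""

    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
            continue

        # Close the open chunk and seed the next one with its tail.
        tail = ""
        if current:
            chunks.append(current)
            tail = current[-overlap:] if overlap else ""
        candidate = f"{tail}\n\n{para}" if tail else para

        # A paragraph longer than the budget is hard-split with a sliding window.
        while len(candidate) > max_chars:
            chunks.append(candidate[:max_chars])
            candidate = candidate[max_chars - overlap:]
        current = candidate

    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries first keeps most chunks semantically whole; the sliding-window fallback only applies to paragraphs that exceed the budget on their own.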

Schema

document (node) --contains--> chunk (node) --mentions--> concept (node)
                                                              |
                                                         relates (edge)
                                                              |
                                                         concept (node)

Node tables:

  • document — source file metadata (title, mime_type, word_count, language, quality)
  • chunk — text segment with Grammar Triangle (NSM, Qualia, Causality, dominant_mode)
  • concept — named entity/idea/event with type, rung level (R0-R9), NARS truth values, aliases, tags, vector embedding

Edge tables (TYPE RELATION — enables native -> traversal):

  • contains — document to chunk provenance
  • mentions — chunk to concept provenance (with salience score)
  • relates — concept to concept semantic edge (verb from 144-verb taxonomy, evidence quote, temporal validity)

Quick Start

Prerequisites

  • Python 3.11+
  • SurrealDB v3.x running locally
  • OpenRouter API key (for LLM extraction + embeddings)

Setup

# Clone
git clone https://github.com/regentribes/regen-knowledge-graph.git
cd regen-knowledge-graph

# Create virtualenv
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Start SurrealDB
surreal start --bind 0.0.0.0:8000 --user root --pass <your-password> \
  surrealkv:./data/brain.db

# Apply schema
export $(grep -v "^#" .env | xargs)  # export .env variables so the Python process sees them
python pipeline.py stats  # bootstraps schema on first run

Ingest a document

source .venv/bin/activate
export $(grep -v "^#" .env | xargs)

# Ingest a file with embeddings
python pipeline.py ingest path/to/document.pdf --embed -v

# Ingest a directory
python pipeline.py ingest ./docs/ --embed -v

Search the graph

# Semantic search
python pipeline.py search "regenerative agriculture" --limit 10

# Find similar concepts
python pipeline.py similar "concept:abc123"

# Database stats
python pipeline.py stats
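
Both `search` and `similar` rank concepts by cosine similarity between embeddings. The pipeline relies on an HNSW cosine index for this; the brute-force sketch below only illustrates the metric. The function names and the toy 3-dimensional vectors are illustrative assumptions (the real embeddings have 1536 dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], concepts: dict[str, list[float]], k: int = 10):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in concepts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data, purely hypothetical.
concepts = {
    "permaculture": [1.0, 0.0, 0.0],
    "composting": [0.9, 0.1, 0.0],
    "port 8080": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], concepts, k=2)
```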

Visualize

# Generate interactive 3D HTML visualization
python pipeline.py viz

# Export graph as JSON
python pipeline.py export -o graph.json

Pipeline Commands

| Command | Purpose |
| --- | --- |
| `python pipeline.py ingest <path> --embed -v` | Ingest file/directory + generate embeddings |
| `python pipeline.py embed [--force]` | Generate embeddings for all concepts |
| `python pipeline.py search "<query>" --limit N` | Semantic similarity search |
| `python pipeline.py similar <concept_id>` | Find similar concepts by embedding |
| `python pipeline.py stats` | Database overview |
| `python pipeline.py viz [--limit N]` | Generate 3D HTML visualization |
| `python pipeline.py export -o file.json` | Export graph as JSON |

Shell Scripts

Wrapper scripts in scripts/ provide JSON output for integration with bots and other tools.

| Script | Usage | Output |
| --- | --- | --- |
| `scripts/ingest.sh <file>` | Ingest + search for connections | JSON: doc_id, counts, top concepts, connections |
| `scripts/query.sh "<query>"` | Semantic search + related edges | JSON: results, connections |
| `scripts/relate.sh "<A>" "<B>"` | Find paths between concepts | JSON: direct relations, shared neighbors |
| `scripts/capture.sh "<text>"` | Capture text snippet | JSON: doc_id, counts |
| `scripts/stats.sh` | Database overview | JSON: table counts, types, verbs |

Components

| File | Role |
| --- | --- |
| `pipeline.py` | CLI orchestrator: routes to all commands |
| `graph_extract.py` | Main pipeline: GraphExtractor class (Kreuzberg -> chunks -> LLM -> SurrealDB) |
| `llm_parser.py` | LLM extraction via OpenRouter: batched chunk processing, JSON parsing |
| `grammar_triangle.py` | Zero-cost semantic annotation (NSM + Qualia + Causality) + text chunker |
| `verbs.py` | 144-verb taxonomy (6 categories x 24 verbs), fuzzy matching and normalization |
| `embeddings.py` | Vector embedding generation (text-embedding-3-small) and semantic search |
| `schema.surql` | Full SurrealDB schema definition |
| `batch_ingest.py` | Rule-based bulk ingestion (no LLM, pattern matching only) |
| `batch_ingest_v2.py` | Improved rule-based bulk ingestion with extended patterns |
| `subagent_parser.py` | Alternative extraction via sub-agent spawning |
| `viz/graph_viz.py` | 3D visualization generator (queries SurrealDB, produces self-contained HTML) |

Key Design Decisions

NARS Epistemic Truth Values

Every concept and relation carries a truth value (frequency, confidence, evidence_count) from the Non-Axiomatic Reasoning System (NARS). When the same fact is observed in multiple documents, the stored value (f1, c1, n1) and the new observation (f2, c2, n2) merge via weighted revision:

w1 = c1 / (1 - c1) * n1        # prior weight
w2 = c2 / (1 - c2) * n2        # new evidence weight
freq_new = (w1*f1 + w2*f2) / (w1 + w2)
conf_new = (w1 + w2) / (w1 + w2 + 1.0)
n_new    = n1 + n2
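
The revision rule above translates directly into Python. This is a minimal sketch rather than the project's implementation: the `Truth` dataclass and `revise` function are hypothetical names, and the two guards (rejecting confidence of exactly 1.0, which would divide by zero, and returning zero confidence when both weights are zero) are assumptions the formulas leave open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Truth:
    frequency: float      # f: share of evidence supporting the statement
    confidence: float     # c: must lie in [0, 1)
    evidence_count: int   # n: number of observations merged so far


def _weight(t: Truth) -> float:
    """w = c / (1 - c) * n, as in the formulas above."""
    if not 0.0 <= t.confidence < 1.0:
        raise ValueError("confidence must be in [0, 1)")
    return t.confidence / (1.0 - t.confidence) * t.evidence_count


def revise(prior: Truth, new: Truth) -> Truth:
    """Merge a stored truth value with a new observation."""
    w1, w2 = _weight(prior), _weight(new)
    n_new = prior.evidence_count + new.evidence_count
    total = w1 + w2
    if total == 0.0:
        # Assumption: with no usable evidence, keep the prior frequency.
        return Truth(prior.frequency, 0.0, n_new)
    freq_new = (w1 * prior.frequency + w2 * new.frequency) / total
    conf_new = total / (total + 1.0)
    return Truth(freq_new, conf_new, n_new)


merged = revise(Truth(0.9, 0.8, 1), Truth(0.5, 0.5, 1))
```

In this example the weights are 4 and 1, so the merged frequency is (4 x 0.9 + 1 x 0.5) / 5 = 0.82, the confidence is 5/6 (about 0.83), and the evidence count is 2. The higher-confidence prior dominates the merged frequency, and confidence rises because total evidence grew.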

Confidence assignment by the LLM:

  • Direct statement in text: 0.8-0.95
  • Clearly implied: 0.4-0.7
  • Speculative/hedged: 0.1-0.4

144-Verb Taxonomy

Six categories x 24 verbs each (144 total). Every relates edge uses exactly one verb. The verb returned by the LLM is fuzzy-matched against the taxonomy and then normalized; unknown verbs fall back to RELATED_TO.

| Category | Verbs (sample) |
| --- | --- |
| Structural | IS_A, HAS_A, PART_OF, CONTAINS, DEPENDS_ON, CONTRADICTS, SUPPORTS |
| Causal | CAUSES, ENABLES, PREVENTS, TRIGGERS, TRANSFORMS, REQUIRES |
| Temporal | BEFORE, AFTER, DURING, PRECEDES, FOLLOWS, DEPRECATED |
| Epistemic | KNOWS, BELIEVES, INFERS, PREDICTS, DISCOVERS, CONCLUDES |
| Agentive | DOES, WANTS, DECIDES, CREATES, DESTROYS, CONTROLS |
| Experiential | FEELS, SEES, ENJOYS, FEARS, HOPES, EMBRACES |
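
The fuzzy-match-then-fallback behaviour can be sketched with the standard library's `difflib`. This is an illustration under stated assumptions, not the code in verbs.py: the `normalize_verb` name, the 0.8 similarity cutoff, and the use of `difflib` are all hypothetical, and `KNOWN_VERBS` holds only the Structural and Causal samples from the table rather than the full 144 verbs.

```python
import difflib
import re

# Sample subset of the taxonomy (Structural + Causal rows above).
KNOWN_VERBS = [
    "IS_A", "HAS_A", "PART_OF", "CONTAINS", "DEPENDS_ON", "CONTRADICTS",
    "SUPPORTS", "CAUSES", "ENABLES", "PREVENTS", "TRIGGERS", "TRANSFORMS",
    "REQUIRES",
]
FALLBACK = "RELATED_TO"


def normalize_verb(raw: str, cutoff: float = 0.8) -> str:
    """Map free-form LLM output onto a known verb, or fall back."""
    # Normalize: "depends on" / "depends-on" -> "DEPENDS_ON".
    candidate = re.sub(r"[^A-Za-z0-9]+", "_", raw.strip()).strip("_").upper()
    if candidate in KNOWN_VERBS:
        return candidate
    # Fuzzy match: closest known verb above the similarity cutoff, if any.
    matches = difflib.get_close_matches(candidate, KNOWN_VERBS, n=1, cutoff=cutoff)
    return matches[0] if matches else FALLBACK
```

With this sketch, `normalize_verb("depends on")` yields `DEPENDS_ON`, `normalize_verb("cause")` yields `CAUSES`, and an unrelated string such as `"xyzzy"` falls back to `RELATED_TO`. A strict cutoff trades recall for precision: a near-miss becomes a generic edge rather than a wrong specific one.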

Rung Levels (R0-R9)

Abstraction depth assigned by the LLM per concept:

| Rung | Name | Example |
| --- | --- | --- |
| R0 | Surface literal | "Redis", "port 8080" |
| R1 | Shallow inference | "HTTP server", "cache layer" |
| R2 | Contextual | "storage backend", "feature flag system" |
| R3 | Analogical | "SpineCache is like a blackboard" |
| R4 | Abstract pattern | "zero-copy architecture" |
| R5 | Structural schema | "content-addressable memory model" |
| R6 | Counterfactual | "if API were fixed, latency would halve" |
| R7 | Meta | "reasoning about the reasoning system" |
| R8 | Recursive | "the system observing its own observation" |
| R9 | Transcendent | "consciousness substrate" |

Grammar Triangle (Zero-Cost Annotation)

Every chunk is annotated before any LLM call using pure keyword activation, giving a continuous semantic fingerprint at zero API cost:

  • NSM — 20 Wierzbicka semantic primitives (FEEL, THINK, KNOW, WANT, SEE, DO, HAPPEN, SAY, GOOD, BAD, EXIST, SELF, OTHER, BECAUSE, IF, CAN, MAYBE, AFTER, BEFORE, NOT)
  • Qualia — 8 phenomenal dimensions (valence, arousal, intimacy, certainty, agency, emergence, continuity, abstraction)
  • Causality — past/present/future tense balance + temporality + agency
  • dominant_mode — emotional | cognitive | existential | emergent | relational | technical
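
Keyword activation of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the contents of grammar_triangle.py: the keyword lists are invented for the example, only four of the 20 NSM primitives are shown, and scoring each primitive as the fraction of matching tokens is one plausible way to get a continuous value.

```python
import re

# Hypothetical keyword lists for four of the 20 NSM primitives.
NSM_KEYWORDS = {
    "FEEL": {"feel", "feels", "felt", "emotion"},
    "THINK": {"think", "thinks", "thought", "consider"},
    "KNOW": {"know", "knows", "known", "understand"},
    "BECAUSE": {"because", "since", "therefore"},
}


def activate(text: str, lexicon: dict[str, set[str]]) -> dict[str, float]:
    """Score each primitive as the share of tokens that hit its keyword list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in lexicon}
    return {
        name: sum(token in words for token in tokens) / len(tokens)
        for name, words in lexicon.items()
    }


scores = activate("I think I know because I feel it", NSM_KEYWORDS)
```

The sentence has eight tokens and one hit per primitive, so each of the four scores is 0.125. The same pattern would extend to the Qualia and Causality dimensions with their own keyword lists.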

Concept Deduplication

Stable IDs are derived as sha256(name.lower() + "_" + type)[:16], so the same name (case-insensitive) and type always map to the same record. UPSERT MERGE preserves all fields; NARS revision merges epistemic state when the same concept appears in multiple documents.
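
A minimal Python sketch of the ID derivation, assuming UTF-8 encoding and the hexadecimal digest (the formula does not state either); the function name `concept_id` and the example concept types are hypothetical.

```python
import hashlib


def concept_id(name: str, concept_type: str) -> str:
    """First 16 hex characters of sha256(name.lower() + "_" + type)."""
    key = name.lower() + "_" + concept_type
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# Same name in different casing -> same ID; different type -> different ID.
same = concept_id("Permaculture", "practice") == concept_id("PERMACULTURE", "practice")
different = concept_id("Permaculture", "practice") != concept_id("Permaculture", "organization")
```

Because the type is part of the hash input, two concepts that share a name but differ in type are stored as separate records.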

SurrealDB Query Examples

-- All causal neighbours of a concept
SELECT ->relates[WHERE verb_category = 'Causal']->(concept)
FROM concept WHERE name = 'Permaculture';

-- 3-hop traversal
SELECT ->relates->(concept)->relates->(concept)->relates->concept
FROM concept WHERE name = 'Community Garden';

-- High-confidence structural relations
SELECT * FROM relates
WHERE verb_category = 'Structural' AND nars_confidence > 0.7;

-- Abstract concepts (R4+), ranked by confidence
SELECT name, type, rung, description, nars_confidence FROM concept
WHERE rung >= 4 ORDER BY nars_confidence DESC LIMIT 20;

-- Disputed claims: often false but confidently stated
SELECT * FROM concept
WHERE nars_frequency < 0.5 AND nars_confidence > 0.7;

-- Full document subgraph (concepts + provenance)
SELECT *, ->contains->(chunk)->mentions->(concept)
FROM document WHERE title = 'README.md';

-- Emotionally loaded chunks (high arousal + negative valence)
SELECT text, dominant_mode, qualia FROM chunk
WHERE qualia.arousal > 0.5 AND qualia.valence < 0.35;

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| OPENROUTER_API_KEY | Yes | OpenRouter API key for LLM extraction and embeddings |
| SURREAL_PASS | Yes | SurrealDB root password |
| SURREALDB_URL | No | SurrealDB WebSocket URL (default: ws://127.0.0.1:8000) |
| EXTRACTION_MODEL | No | LLM model for extraction (default: anthropic/claude-sonnet-4) |
| EMBEDDING_MODEL | No | Embedding model (default: openai/text-embedding-3-small) |

Docs

Design specifications and project documentation live in docs/:

  • docs/PROJECT-objectives.md — Overall project goals
  • docs/USECASES.md — 8 usage scenarios
  • docs/specs/ — Detailed evaluation specs for each component

Stack

  • Python — pipeline orchestration
  • Kreuzberg — document text extraction (75+ formats, OCR)
  • Claude Sonnet (via OpenRouter) — concept and relation extraction
  • SurrealDB — graph storage with native -> traversal
  • OpenAI text-embedding-3-small (via OpenRouter) — vector embeddings
  • 3d-force-graph — interactive 3D visualization

License

MIT
