Knowa

The problem: LLM token costs compound fast

The naive approach to building AI-powered apps is to load your documents into the prompt and let the LLM figure it out. It works in demos. It breaks in production.

A 1,000-page knowledge base is roughly 2–4 million tokens. At current API pricing, sending that on every request costs dollars per query. At 10,000 queries a day that adds up to tens of thousands of dollars every day, for context that is 95% irrelevant to the question being asked. As AI usage scales across teams and products, this becomes the dominant cost line.
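
The arithmetic behind that estimate can be sketched as follows. The price of $1 per million input tokens is an assumed placeholder, not a quoted rate; substitute your provider's current pricing. The 10,000-token figure is the approximate default retrieval window described under Understanding token savings.

```python
def query_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single request."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


def monthly_cost_usd(prompt_tokens: int, queries_per_day: int,
                     usd_per_million_tokens: float, days: int = 30) -> float:
    """Input-token cost of a month of traffic at a fixed prompt size."""
    return query_cost_usd(prompt_tokens, usd_per_million_tokens) * queries_per_day * days


# Assumed placeholder price; real rates vary by model and provider.
PRICE = 1.00  # USD per million input tokens

full_corpus = monthly_cost_usd(3_000_000, 10_000, PRICE)  # whole knowledge base in every prompt
retrieved = monthly_cost_usd(10_000, 10_000, PRICE)       # only the retrieved context

print(f"full corpus: ${full_corpus:,.0f}/month")
print(f"retrieved:   ${retrieved:,.0f}/month")
```

Output-token costs are ignored here; they are the same in both cases and do not change the comparison.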

Knowa's core job is to solve this. It indexes your documents once (as vector chunks, full-text pages, and a named-entity graph), then for each question extracts only the handful of chunks that are actually relevant: roughly 10,000 tokens at default settings, however large the corpus. At scale that is a 90–99% reduction in context sent to the LLM, with no loss in answer quality. Every query response includes a measured token savings figure so you can track this in production. Savings depend on corpus size; see Understanding token savings for what to expect at different scales.

This matters now and will matter much more as AI usage goes from prototype to product to company-wide infrastructure. Precision retrieval is not a nice-to-have — it is the difference between an AI feature that scales and one that breaks your budget.

What Knowa does

Knowa is a hybrid retrieval library and knowledge base server. It ingests documents from local directories, Notion, Confluence, and any custom source you connect — storing them as vector chunks and full-text pages in PostgreSQL, and as a named-entity knowledge graph in PostgreSQL (default), Neo4j, or Kuzu. For each question it retrieves only the most relevant context across all three representations and returns it ready to inject into any LLM or AI pipeline you choose.

The knowledge graph is built during indexing and gives retrieval a structural dimension that pure vector search misses — connecting people, products, organisations, and concepts across your entire document corpus so entity-centric questions ("what teams work on X?", "which pages mention both Y and Z?") get precise, targeted answers.
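
As a rough illustration of why an entity index helps with such questions, the sketch below builds an in-memory entity-to-pages map and answers a "which pages mention both Y and Z?" query by set intersection. The page titles, entity names, and the `EntityIndex` class are hypothetical; Knowa's actual graph schema and storage are not shown here.

```python
from collections import defaultdict


class EntityIndex:
    """Toy inverted index from entity name to the pages that mention it."""

    def __init__(self) -> None:
        self._pages_by_entity: dict[str, set[str]] = defaultdict(set)

    def add_page(self, page_title: str, entities: list[str]) -> None:
        for entity in entities:
            self._pages_by_entity[entity.lower()].add(page_title)

    def pages_mentioning_all(self, *entities: str) -> set[str]:
        """Pages that mention every one of the given entities."""
        if not entities:
            return set()
        page_sets = [self._pages_by_entity.get(e.lower(), set()) for e in entities]
        return set.intersection(*page_sets)


# Hypothetical pages and entities, as an extractor might emit them.
index = EntityIndex()
index.add_page("Payments runbook", ["Billing API", "Platform Team"])
index.add_page("Onboarding guide", ["Platform Team", "HR Portal"])
index.add_page("Billing FAQ", ["Billing API"])

both = index.pages_mentioning_all("Billing API", "Platform Team")
print(sorted(both))
```

A vector search for the same question would have to hope that a single chunk happens to mention both names; the entity lookup finds the page regardless of where in the page each name appears.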

How the graph is populated depends on how much coverage you need:

spaCy NER (default) — runs locally at indexing time with zero API cost. Recognises standard entity types (people, organisations, locations, dates, and more). You can swap the model to improve accuracy or target a specific domain: en_core_web_trf (transformer-based, better on ambiguous names), scispaCy models for scientific or biomedical text, or any spaCy-compatible model.
LLM entity enrichment (opt-in) — an additive second pass using any OpenAI-compatible model (gpt-4o-mini, Qwen3, Kimi2, and others). The LLM extracts domain-specific entities that a general NER model misses: product features, internal codenames, technical standards, abstract concepts — anything with meaning in your organisation's language. Runs concurrently across pages to keep indexing fast at scale, and costs roughly $0.30 per 1,000 pages with gpt-4o-mini.

Quickstart

1. Prerequisites

Python 3.11+
Docker (for PostgreSQL)
OpenAI API key

2. Start PostgreSQL

docker compose up -d

This starts a PostgreSQL 16 instance with pgvector pre-installed, exposed on localhost:5432. Data is persisted in a Docker volume across restarts.

3. Create a virtual environment

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

Or, if you have Miniconda installed:

conda create -n knowa python=3.12
conda activate knowa

4. Install

pip install -r requirements.txt
python -m spacy download en_core_web_sm
pip install -e .                 # registers the `knowa` CLI command

5. Configure

cp .env.example .env

Edit .env and fill in the required values:

Variable	Required	Notes
`DATABASE_URL`	Yes	Pre-filled to match `docker compose` — no change needed
`OPENAI_API_KEY`	Yes	Your OpenAI API key
`OPENAI_MODEL`	Yes	e.g. `gpt-5.4`
`API_KEY`	Yes	Any random secret — used to protect the REST API
`NOTION_API_KEY`	No	Required only for Notion sources
`CONFLUENCE_*`	No	All four vars required for Confluence sources
`SPACY_MODEL`	No	NER model for graph extraction (default: `en_core_web_sm`)
`ENTITY_LLM_MODEL`	No	Enable LLM entity enrichment, e.g. `gpt-4o-mini` (see below)

Generate a strong API key:

python3 -c "import secrets; print(secrets.token_urlsafe(32))"

6. Index your documents

knowa index /path/to/docs --name "My Docs"

Migrations run automatically on the first command. Supported file types: .md, .txt, .pdf, .docx

7. Chat with the index

knowa chat "What would you like to know?"

8. Run the admin UI

uvicorn knowa.api.main:app --reload --port 8000

Open http://localhost:8000/admin/ui — search, browse sources, and trigger rebuilds from the browser.

Three ways to use Knowa

As a Python library — embed Knowa directly in your app. You own the LLM call; Knowa handles indexing, retrieval, and context formatting. Works with Anthropic, OpenAI, Gemini, local models, LangChain, LlamaIndex, or any other framework.

As a REST API — run the FastAPI server and hit /query to get complete answers with citations from any language or tool, no Python required.

As a CLI tool — use the knowa command to index directories and chat with your knowledge base directly from the terminal, no server or code required.

Features

90–99% token reduction at scale: three-path hybrid retrieval (vector search, full-text search, property graph) extracts only what is relevant, and every query reports measured token savings so you can track efficiency in production. See Understanding token savings for realistic expectations at different corpus sizes.
Hierarchical chunking — documents are split into small child chunks for precise vector search, then expanded to larger parent chunks for full context — maximising relevance without losing surrounding information
Pluggable embedders — OpenAIEmbedder (default) or SentenceTransformerEmbedder (fully local, no API key needed at query time); implement the Embedder protocol to add your own
Multiple sources — Notion, Confluence Cloud, and local directories (.md, .txt, .pdf, .docx)
Incremental sync — only re-processes files changed since the last run
Zero LLM calls at index time by default — spaCy handles entity extraction; no OpenAI spend during indexing regardless of knowledge base size. For richer entity coverage, an optional LLM enrichment pass can be enabled (see Graph entity extraction)
Admin UI — per-source sync/rebuild controls, query interface, token savings tracking, and interactive entity graph visualization
CLI — full index and chat management without running the server
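
The hierarchical chunking feature listed above can be sketched in a few lines. This version splits on whitespace-separated words rather than model tokens, and the chunk sizes, the `Chunk` dataclass, and the de-duplication of parents in `expand_to_parents` are illustrative assumptions rather than Knowa's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    parent_id: int
    text: str


def split_words(words: list[str], size: int) -> list[list[str]]:
    return [words[i:i + size] for i in range(0, len(words), size)]


def build_chunks(text: str, parent_size: int = 200, child_size: int = 50):
    """Return (parents, children); each child records which parent it came from."""
    parents = [" ".join(p) for p in split_words(text.split(), parent_size)]
    children = [
        Chunk(parent_id, " ".join(c))
        for parent_id, parent in enumerate(parents)
        for c in split_words(parent.split(), child_size)
    ]
    return parents, children


def expand_to_parents(hits: list[Chunk], parents: list[str]) -> list[str]:
    """Swap matched children for their parents, keeping hit order, dropping repeats."""
    seen: set[int] = set()
    context = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(parents[hit.parent_id])
    return context


doc = " ".join(f"w{i}" for i in range(500))
parents, children = build_chunks(doc)
hits = [children[5], children[4], children[0]]  # pretend these matched a query
context = expand_to_parents(hits, parents)
print(len(parents), len(children), len(context))
```

The small children keep the similarity match precise, while the parent text handed to the LLM carries the surrounding paragraphs.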

Understanding token savings

The token savings figure shown after each query measures how much of your indexed corpus Knowa avoided sending to the LLM:

savings % = 1 − (tokens retrieved by this query / all parent-chunk tokens in the DB)

This is a corpus-size-relative metric. With a small corpus, retrieval may cover most of it on every query — and savings will be near 0%. That is expected and correct; it is not a bug or a misconfiguration.

How retrieval works (and why size matters)

By default Knowa retrieves the top 5 child chunks by vector similarity (TOP_K_CHUNKS=5), then expands each to its parent chunk (~2,048 tokens). That means roughly 10,000 tokens of context per query, regardless of corpus size.

Savings only accumulate once your total indexed content significantly exceeds that window.
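
The formula is simple enough to check by hand. The sketch below evaluates it for the default retrieval window at a few corpus sizes; `savings_pct` is an illustrative helper, not part of Knowa's API. It counts only the five expanded parent chunks, so it gives an upper bound: any additional context retrieved through the full-text or graph paths would lower the measured figure.

```python
def savings_pct(retrieved_tokens: int, corpus_tokens: int) -> float:
    """savings % = 1 - retrieved / corpus, clamped at 0 for tiny corpora."""
    if corpus_tokens <= 0:
        return 0.0
    return max(0.0, 1.0 - retrieved_tokens / corpus_tokens) * 100


PARENT_TOKENS = 2_048          # approximate parent chunk size
RETRIEVED = 5 * PARENT_TOKENS  # TOP_K_CHUNKS=5, each child expanded to its parent

for parent_chunks in (5, 50, 500, 5_000):
    corpus = parent_chunks * PARENT_TOKENS
    print(f"{parent_chunks:>5} parent chunks -> {savings_pct(RETRIEVED, corpus):.1f}% savings")
```

With five parent chunks in the whole corpus the window covers everything and savings are zero; the figure only climbs once the corpus is many times larger than the window.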

Assumptions

The estimates below assume:

Default settings — TOP_K_CHUNKS=5, parent chunk size ~2,048 tokens
Typical document sizes — wiki pages, Notion pages, Confluence pages: 500–3,000 words → 1–4 parent chunks each. Short README-style files (under 500 words) may produce a single chunk; long PDFs or DOCX files (5,000+ words) may produce 5–15 chunks each.
Mixed file types shift the breakpoints: a corpus of 10 long PDFs can behave like 50+ short markdown files in terms of total chunk count.

Expected savings by corpus size

Corpus	Approx. parent chunks	Expected savings
5 short docs	5–10	0–10% — retrieval covers most of the corpus
20–30 substantial pages	30–50	40–60%
100+ pages	100–200	70–85%
500+ pages	500+	85–95%
Large wiki (1,000+ pages)	1,000+	90–99%

What 0% savings actually means

A 0% savings reading does not mean retrieval is broken or that Knowa is not helping. It means your corpus is small enough that the retrieval window covers essentially all of it — which is the correct behaviour. The answer quality benefit (finding the right pages and synthesising a coherent answer) is present at any corpus size.

The savings badge becomes a useful production monitoring signal once your knowledge base grows large enough that retrieval is genuinely selective — typically 20–30 substantial documents or more.

Tuning

If you have a large corpus but still see low savings, consider reducing TOP_K_CHUNKS in your .env. The default of 5 is conservative; dropping to 3 cuts retrieved context by roughly 40% (three parent chunks instead of five) and increases savings, at the cost of slightly lower recall on broad questions.

Usage

CLI

The CLI connects directly to the database — no server needed.

# Index a directory — incremental by default, full rebuild on first run
knowa index /path/to/docs
knowa index /path/to/docs --name "Engineering Docs"   # attach a friendly label
knowa index /path/to/docs --full                       # force full rebuild
knowa index /path/to/docs --workers 4                  # parallel indexing (4 threads)

# See all indexed sources with page/chunk counts and labels
knowa list

# Chat with the index
knowa chat                                    # interactive REPL
knowa chat "What is our refund policy?"       # single-shot
knowa chat --source "Engineering Docs"        # scoped to one source
knowa chat --debug                            # show retrieval path before each answer
knowa chat --no-index /path/to/docs           # bypass index, read docs directly

# Inspect graph backend (entity/edge counts)
knowa debug

# Benchmark questions with and without index
knowa bench questions.json

# Clear index for a source
knowa clear /path/to/docs
knowa clear --source-id <notion-workspace-id>
knowa clear --source-id company.atlassian.net/ENG
knowa clear --all

Supported file types: .md, .txt, .pdf, .docx
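
The incremental behaviour of knowa index can be pictured as a comparison between a stored fingerprint and the current state of each file. The sketch below uses content hashes; whether Knowa compares hashes, modification times, or something else is not stated here, so treat `plan_sync` and its return shape as assumptions.

```python
import hashlib
import tempfile
from pathlib import Path

SUPPORTED = {".md", ".txt", ".pdf", ".docx"}  # file types listed above


def fingerprint(path: Path) -> str:
    """SHA-256 of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_sync(root: Path, previous: dict[str, str]):
    """Compare files under root with the previous run's fingerprints.

    Returns (to_index, to_delete, new_state).
    """
    current = {
        p.relative_to(root).as_posix(): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix.lower() in SUPPORTED
    }
    to_index = [name for name, digest in current.items() if previous.get(name) != digest]
    to_delete = sorted(name for name in previous if name not in current)
    return to_index, to_delete, current


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.md").write_text("first version")
    (root / "b.txt").write_text("never edited")
    (root / "old.txt").write_text("will be removed")
    first_index, first_delete, state = plan_sync(root, {})   # first run: everything is new

    (root / "a.md").write_text("second version")
    (root / "old.txt").unlink()
    (root / "c.md").write_text("new file")
    second_index, second_delete, _ = plan_sync(root, state)  # later run: only the differences

print(first_index, second_index, second_delete)
```

On the second run only the edited and newly added files are queued, and the removed file is flagged so its chunks can be dropped.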

Sample dataset

To quickly populate the knowledge base for testing the graph visualization:

python scripts/fetch_sample_docs.py --out /tmp/knowa_sample_docs
knowa index /tmp/knowa_sample_docs --name "Knowa Sample Docs"

This downloads ~20 Wikipedia articles (AI companies and researchers) and indexes them.

Library

from knowa import KnowledgeBase

# Create once at app startup — reuse across requests
kb = KnowledgeBase()
kb.index("/path/to/docs", label="Engineering Docs")

# Get formatted context ready to inject into any LLM
context = kb.get_context("What is our deployment process?")

Use with any LLM

import anthropic
from knowa import KnowledgeBase

kb = KnowledgeBase()
client = anthropic.Anthropic()

def answer(question: str) -> str:
    context = kb.get_context(question)
    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=f"Answer using only this context:\n\n{context}",
        messages=[{"role": "user", "content": question}],
    )
    return message.content[0].text

Inspect retrieved chunks

retrieve() gives you structured results before synthesis — filter, rerank, or log them:

from knowa import KnowledgeBase, RetrievedChunk

kb = KnowledgeBase()
chunks: list[RetrievedChunk] = kb.retrieve("What is our SLA?")

for c in chunks:
    print(f"  [{c.retrieval_type}] score={c.score:.3f} | {c.page_title}")

# Filter by confidence, then format
filtered = [c for c in chunks if c.score >= 0.3 or c.retrieval_type == "fts"]
context = kb.format_context(filtered)

Local embeddings (no OpenAI at query time)

pip install "knowa[st]"

from knowa import KnowledgeBase
from knowa.embedders.sentence_transformers import SentenceTransformerEmbedder

# Set EMBEDDING_DIMENSIONS=384 in .env before the first index run
embedder = SentenceTransformerEmbedder("all-MiniLM-L6-v2")
kb = KnowledgeBase(embedder=embedder)
kb.index("/path/to/docs", full=True)
context = kb.get_context("What is our refund policy?")  # no OpenAI calls

Multi-source / multi-tenant

kb.index("/docs/engineering", label="Engineering")
kb.index("/docs/legal", label="Legal")

eng_context = kb.get_context("deployment process", source_id="Engineering")
legal_context = kb.get_context("compliance requirements", source_id="Legal")

Chat with history

Manage history in your application and pass only the retrieved context to each turn:

from knowa import KnowledgeBase

kb = KnowledgeBase()
history: list[dict] = []

def chat(user_message: str) -> str:
    context = kb.get_context(user_message)
    messages = history[-10:] + [{"role": "user", "content": user_message}]
    response = your_llm(system=f"Answer using this context:\n\n{context}", messages=messages)
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": response.text})
    return response.text

Async usage

Retrieval is synchronous. Wrap with asyncio.to_thread to avoid blocking the event loop:

import asyncio
from knowa import KnowledgeBase

kb = KnowledgeBase()

async def handle_request(question: str, source: str | None = None) -> str:
    context = await asyncio.to_thread(kb.get_context, question, source)
    async with your_async_llm_client() as client:
        return await client.complete(context, question)

FastAPI dependency injection

import asyncio
from functools import lru_cache

from fastapi import Depends, FastAPI
from knowa import KnowledgeBase

app = FastAPI()

@lru_cache(maxsize=1)
def get_kb() -> KnowledgeBase:
    # One shared KnowledgeBase per process
    return KnowledgeBase()

@app.post("/ask")
async def ask(question: str, source: str | None = None, kb: KnowledgeBase = Depends(get_kb)):
    # Retrieval is synchronous, so run it off the event loop
    chunks = await asyncio.to_thread(kb.retrieve, question, source)
    context = kb.format_context(chunks)
    answer = await your_llm(context, question)  # placeholder for your own async LLM call
    citations = [{"title": c.page_title, "url": c.url} for c in chunks if c.page_title]
    return {"answer": answer, "citations": citations}

KnowledgeBase API reference

KnowledgeBase(database_url=None, embedder=None) — database_url falls back to DATABASE_URL env var; embedder defaults to OpenAIEmbedder. The embedder must match what was used when the index was built.

Method	Returns	Description
`index(path, label=None, full=False)`	`dict`	Index a local directory. Returns `{"indexed": N, "deleted": N, "errors": N}`.
`retrieve(question, source_id=None)`	`list[RetrievedChunk]`	Hybrid retrieval — no LLM calls.
`format_context(chunks)`	`str`	Format chunks into `[Title]\n---` blocks for LLM injection.
`get_context(question, source_id=None)`	`str`	`retrieve()` + `format_context()` + graph relationships in one call.

RetrievedChunk fields: content, score, page_id, page_title, source_id, source_type, url, retrieval_type ("vector" or "fts").

Production checklist

Concern	Guidance
Embedder reuse	Create one `KnowledgeBase` per process and reuse it. Re-creating per request is wasteful.
Dimension consistency	Set `EMBEDDING_DIMENSIONS` before the first `index()` call. Changing the embedder later requires a full rebuild — mixing dimensions returns wrong results silently.
Thread safety	`KnowledgeBase` is safe to call from multiple threads — the `psycopg2` pool is thread-safe.
Async blocking	Use `asyncio.to_thread(kb.retrieve, ...)` in async frameworks.
Context size	`get_context()` returns all retrieved chunks untruncated. For small context windows, filter by `score` and pass `format_context(filtered)`.
Re-indexing	Call `kb.index(path)` on a schedule for directories that change. Each call runs incrementally.
Error handling	`index()` returns `{"errors": N}`. Check this value and alert if non-zero.
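
For the context-size concern, one option is to keep the highest-scoring chunks that fit a token budget before formatting. The sketch uses a stand-in dataclass with the same `content` and `score` fields as `RetrievedChunk`, and counts words as a rough proxy for tokens; both are assumptions made so the example runs without Knowa installed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeChunk:
    """Stand-in exposing the two RetrievedChunk fields this sketch needs."""
    content: str
    score: float


def fit_to_budget(chunks, max_tokens: int,
                  count_tokens: Callable[[str], int] = lambda s: len(s.split())):
    """Keep the best-scoring chunks whose combined size stays within max_tokens."""
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.content)
        if used + cost <= max_tokens:
            kept.append(chunk)
            used += cost
    return kept


chunks = [
    FakeChunk("refunds within thirty days", 0.5),
    FakeChunk("refund policy overview", 0.9),
    FakeChunk("contact support", 0.7),
]
kept = fit_to_budget(chunks, max_tokens=5)
print([c.score for c in kept])
```

With the real library, the filtered list would then go to `kb.format_context(...)`, and a tokenizer-based counter can be passed as `count_tokens` for exact budgets.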

REST API

curl -X POST "http://localhost:8000/query" \
  -H "X-API-Key: <key>" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is our refund policy?", "response_mode": "with_citations"}'

Response modes: answer_only · with_citations · full (includes raw retrieved chunks)
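
From Python, the same request can be made with the standard library alone. The endpoint, header, and body fields mirror the curl call above; the response schema is not documented in this section, so the sketch simply returns the parsed JSON. The helper names are illustrative.

```python
import json
import urllib.request


def build_query_request(base_url: str, api_key: str, question: str,
                        response_mode: str = "with_citations") -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"question": question, "response_mode": response_mode}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/query",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def query(base_url: str, api_key: str, question: str, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_query_request(base_url, api_key, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request needs no running server; call query(...) to actually send it.
request = build_query_request("http://localhost:8000", "<key>", "What is our refund policy?")
print(request.get_method(), request.full_url)
```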

Graph entity extraction

Knowa builds a property graph from your documents by extracting named entities during indexing. Two mechanisms are available and can be used together.

spaCy NER (default)

Runs entirely locally, zero API cost, no latency added. Recognises standard entity types: people, organisations, locations, dates, and 14 others from the OntoNotes scheme.

SPACY_MODEL=en_core_web_sm    # default — fast, lightweight
SPACY_MODEL=en_core_web_trf   # transformer-based, better accuracy on ambiguous names
SPACY_MODEL=en_core_sci_lg    # scientific text (pip install scispacy + model)
SPACY_MODEL=en_ner_bc5cdr_md  # biomedical (pip install scispacy + model)

After changing the model, install it:

python -m spacy download en_core_web_trf
# or for scispaCy models follow https://allenai.github.io/scispacy/

LLM entity enrichment (opt-in)

An optional second pass that runs after spaCy and adds domain-specific entities spaCy misses — products, technologies, frameworks, abstract concepts, and any entity type relevant to your domain. Runs concurrently across pages (configurable parallelism) so it does not serialize indexing.

Supports any OpenAI-compatible provider. Examples:

# gpt-4o-mini (~$0.30 per 1,000 pages)
ENTITY_LLM_MODEL=gpt-4o-mini
ENTITY_LLM_API_KEY=sk-...           # optional — falls back to OPENAI_API_KEY

# Qwen3 via Dashscope
ENTITY_LLM_MODEL=qwen3-8b
ENTITY_LLM_API_KEY=sk-...
ENTITY_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

# Kimi2 (Moonshot)
ENTITY_LLM_MODEL=moonshot-v1-8k
ENTITY_LLM_API_KEY=sk-...
ENTITY_LLM_BASE_URL=https://api.moonshot.cn/v1

Concurrency is controlled by ENTITY_LLM_CONCURRENCY (default 5). Leave ENTITY_LLM_MODEL blank to keep the default zero-LLM-at-index-time behaviour.

Two-phase indexing: when you combine --workers with ENTITY_LLM_MODEL, the two phases run sequentially — Phase 1 (OCR, chunking, embedding, spaCy) completes across all pages first, then Phase 2 (LLM enrichment) runs concurrently at ENTITY_LLM_CONCURRENCY parallelism. The two concurrency settings are independent: --workers controls how many files are read and embedded at once; ENTITY_LLM_CONCURRENCY controls how many LLM API calls fire in the enrichment pass.
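
The bounded parallelism of the enrichment phase can be sketched with an `asyncio.Semaphore`. This only illustrates the idea of capping in-flight calls at ENTITY_LLM_CONCURRENCY; `enrich_page` is a hypothetical placeholder for the LLM request, and Knowa may implement the pass differently (for example with a thread pool).

```python
import asyncio

ENTITY_LLM_CONCURRENCY = 5  # mirrors the documented default

in_flight = 0
peak_in_flight = 0


async def enrich_page(page: str) -> list[str]:
    """Hypothetical stand-in for one LLM enrichment call."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)  # simulate network latency
    in_flight -= 1
    return [f"entity-from-{page}"]


async def enrich_all(pages: list[str], limit: int) -> list[list[str]]:
    semaphore = asyncio.Semaphore(limit)

    async def bounded(page: str) -> list[str]:
        async with semaphore:  # at most `limit` calls run at once
            return await enrich_page(page)

    return await asyncio.gather(*(bounded(p) for p in pages))


# Stand-in for Phase 1 output: every page is already chunked and embedded.
pages = [f"page-{i}" for i in range(20)]
results = asyncio.run(enrich_all(pages, ENTITY_LLM_CONCURRENCY))
print(f"{len(results)} pages enriched, peak concurrency {peak_in_flight}")
```

Because the cap applies only to this pass, raising --workers speeds up reading and embedding without increasing pressure on the LLM provider's rate limits.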

Choosing the right extraction method

spaCy model by document domain

Domain	Recommended model	Install	What it adds over `en_core_web_sm`
General wikis, internal docs, HR, support	`en_core_web_sm`	built-in	— baseline
Same domains, higher accuracy	`en_core_web_trf`	`python -m spacy download en_core_web_trf`	Transformer-based; better on ambiguous names and abbreviations
Scientific papers, research reports	`en_core_sci_lg`	`pip install scispacy` + model	Chemicals, genes, proteins, species, experimental methods
Biomedical / clinical	`en_ner_bc5cdr_md`	`pip install scispacy` + model	Diseases, drugs, adverse drug reactions (BC5CDR corpus)
Legal (case law, UK)	`en_blackstone_proto`	`pip install blackstone`	Legal concepts, case references, court names — UK-focused, not actively maintained
Finance, engineering, security	any of the above	—	General models catch orgs, people, dates well; domain-specific terms need LLM enrichment (see below)

Note: scispaCy models must be downloaded separately after pip install scispacy. See allenai.github.io/scispacy for the full model list and download URLs.

When to add LLM entity enrichment

spaCy extracts what it was trained on. LLM enrichment fills the gaps — domain-specific terms, internal jargon, and relationship types that no pre-trained NER model knows about.

Domain	spaCy alone	Add LLM enrichment when…
General wikis / internal docs	Good — people, orgs, locations covered	You need product names, team names, internal codenames, or project-specific concepts indexed as graph nodes
Legal	Partial — parties, orgs, dates caught; clause types and regulatory citations missed	Almost always — contract clause types (Indemnification, Force Majeure), party roles (Licensor, Licensee), regulation references (GDPR Art. 17, SOX §404) require LLM extraction
Finance	Partial — companies, dates, monetary values caught; financial instruments and risk terms missed	When indexing analyst reports, earnings calls, or contracts — extracts instruments (CDO, SPAC), risk categories, regulatory filings (10-K, 8-K)
Biomedical / clinical	Good with scispaCy — diseases, drugs, genes covered	When you need treatment protocols, trial phases, or mechanism-of-action relationships beyond standard NER types
Engineering / code docs	Weak — standard NER misses most technical entities	Almost always — APIs, services, error codes, config flags, version numbers, and dependency names are invisible to general NER models
Security / infosec	Weak — CVE IDs and threat actors are not standard NER types	Almost always — CVEs, attack techniques (MITRE ATT&CK), threat actors, vulnerability classes, affected products
HR / people ops	Good — people and org names covered	When role titles, skill taxonomies, or org-structure concepts matter for graph traversal
Customer support	Partial — product names and people caught; issue categories and feature areas missed	When support tickets reference internal product areas, error messages, or workflow steps that aren't in any NER vocabulary

Rule of thumb: if the entities that matter most in your domain are not people, organisations, locations, or dates — add LLM enrichment. The cost is low (gpt-4o-mini runs at roughly $0.30 per 1,000 pages) and the graph coverage improvement is substantial for technical and domain-specific corpora.

Adding sources

Notion

1. Go to notion.so/my-integrations → New integration → give it a name → copy the Internal Integration Token (starts with secret_, or ntn_ for newer integrations)
2. Set NOTION_API_KEY=secret_... in your .env
3. In Notion, open each root page you want indexed → ⋯ → Connections → Add connection → select your integration. Sub-pages inherit the connection automatically.
4. Trigger an initial full index:

curl -X POST "http://localhost:8000/admin/rebuild?full=true" -H "X-API-Key: <key>"

Sub-pages are indexed recursively. Container pages (no body content, only child pages) produce 0 chunks — their children are indexed normally. For a 5,000-page workspace the initial index takes 30–90 minutes depending on Notion API rate limits.

Confluence Cloud

1. Create an Atlassian API token at id.atlassian.com/manage-profile/security/api-tokens
2. Find your space key in the Confluence URL: https://yourcompany.atlassian.net/wiki/spaces/ENG/pages/... (here the space key is ENG)
3. Add to .env:

CONFLUENCE_BASE_URL=https://yourcompany.atlassian.net
CONFLUENCE_USERNAME=you@yourcompany.com
CONFLUENCE_API_TOKEN=<token from step 1>
CONFLUENCE_SPACE_KEY=ENG

4. Trigger an initial full index:

curl -X POST "http://localhost:8000/admin/rebuild?full=true" -H "X-API-Key: <key>"

All four vars must be set to enable the connector — leaving any blank disables it without error. Large spaces (1,000+ pages) may take 10–30 minutes.
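
Because a partially configured connector is disabled silently, a small pre-flight check can save debugging time. This is a standalone sketch that only inspects the four variable names listed above; it is not part of the Knowa CLI.

```python
import os
from typing import Mapping

CONFLUENCE_VARS = (
    "CONFLUENCE_BASE_URL",
    "CONFLUENCE_USERNAME",
    "CONFLUENCE_API_TOKEN",
    "CONFLUENCE_SPACE_KEY",
)


def missing_confluence_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Names of Confluence variables that are unset or blank."""
    return [name for name in CONFLUENCE_VARS if not env.get(name, "").strip()]


# Example with one blank and one absent variable.
example_env = {
    "CONFLUENCE_BASE_URL": "https://yourcompany.atlassian.net",
    "CONFLUENCE_USERNAME": "you@yourcompany.com",
    "CONFLUENCE_API_TOKEN": "",
}
missing = missing_confluence_vars(example_env)
if missing:
    print("Confluence connector would be disabled; missing:", ", ".join(missing))
```

Calling `missing_confluence_vars()` with no argument checks the real process environment, which is useful as a startup assertion in deployments that expect Confluence to be indexed.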

Deployment

Knowa is a single Docker container + PostgreSQL.

Docker

docker build -t knowa .
docker run -d --name knowa --env-file .env -p 8000:8000 knowa
docker logs -f knowa

Cloud deployment

Managed PostgreSQL

pgvector must be available. All major providers support it:

Provider	Notes
AWS RDS / Aurora	pgvector ships with supported PostgreSQL engine versions and needs no parameter group change; run `CREATE EXTENSION IF NOT EXISTS vector;`
Google Cloud SQL	No database flag is needed; run `CREATE EXTENSION IF NOT EXISTS vector;`
Azure Database for PostgreSQL	Add `vector` to the `azure.extensions` server parameter (allow-list), then run `CREATE EXTENSION IF NOT EXISTS vector;`
Supabase	pgvector pre-installed; run `CREATE EXTENSION IF NOT EXISTS vector;` from the SQL editor
Neon	pgvector pre-installed; run `CREATE EXTENSION IF NOT EXISTS vector;` from the console
Render / Railway	Create a PostgreSQL service; connect via psql and run the extension command

After provisioning, set DATABASE_URL in your environment.

Container hosting

The app exposes port 8000 and reads all config from environment variables:

Provider	Approach
AWS ECS / Fargate	Push to ECR, create a Fargate task, inject env vars via Secrets Manager
Google Cloud Run	Push to Artifact Registry, deploy as a Cloud Run service (`--port 8000`)
Azure Container Apps	Push to ACR, deploy as a Container App, configure env vars in the portal
Render	Connect GitHub repo, set Runtime to Docker, set Port to 8000
Railway	Connect GitHub repo or push a Docker image, add env vars in the dashboard
Fly.io	`fly launch` auto-detects the Dockerfile; set secrets with `fly secrets set`

Generic VPS

sudo apt update && sudo apt install -y docker.io
git clone <your-repo> && cd knowa
cp .env.example .env && nano .env
docker build -t knowa .
docker run -d --name knowa --env-file .env -p 8000:8000 --restart unless-stopped knowa

Running tests

Three layers — run in order.

Layer 1 — Unit tests (no external dependencies)

pip install pytest
pytest tests/unit/ -v
# Expected: 99 passed, 21 skipped

# Unlock full coverage:
python -m spacy download en_core_web_sm   # +8 spaCy graph extractor tests
pip install markdownify requests          # +13 Confluence connector tests
# Full install: 120 passed, 0 skipped

Layer 2 — Integration tests (PostgreSQL required)

Tests the full ingestion pipeline with fixture markdown files. No Notion API or OpenAI calls — embeddings use deterministic fake vectors.

createdb knowa_test
psql knowa_test -c "CREATE EXTENSION IF NOT EXISTS vector;"

TEST_DATABASE_URL=postgresql://postgres:postgres@localhost:5432/knowa_test \
OPENAI_API_KEY=sk-anything API_KEY=test \
pytest tests/integration/ -v

Layer 3 — E2E tests (Notion + OpenAI + PostgreSQL)

Tests the full system end-to-end against real Notion pages with deterministic content.

One-time setup:

1. Create three pages in your Notion workspace with the exact content from tests/e2e/notion_test_pages.md:
   - Knowa Test — FAQ
   - Knowa Test — Pricing
   - Knowa Test — Handbook
2. Connect your integration to each page (⋯ → Connections)
3. Copy the 32-char hex ID from each page's URL (strip dashes and page title)
4. Create .env.test:

TEST_DATABASE_URL=postgresql://postgres:postgres@localhost:5432/knowa_test
TEST_NOTION_FAQ_PAGE_ID=<32-char id>
TEST_NOTION_PRICING_PAGE_ID=<32-char id>
TEST_NOTION_HANDBOOK_PAGE_ID=<32-char id>

Run:

export $(cat .env.test | xargs)
pytest tests/e2e/ -v -s

E2E tests auto-skip if any required env var is missing — they will not break CI runs without Notion credentials. Each full run costs ~$0.05 in OpenAI API calls.

Operations

Force a full re-index

Use after schema changes, embedding model updates, or suspected index corruption:

curl -X POST "http://localhost:8000/admin/rebuild?full=true" -H "X-API-Key: <key>"
# or via CLI (no server needed):
knowa index /path/to/docs --full

Check sync state

SELECT source_id, source_type, label, last_synced_at FROM sync_state;

Check index sizes

SELECT
  relname AS table,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

Check index freshness

SELECT source_id, COALESCE(label, source_id) AS name,
       last_synced_at, NOW() - last_synced_at AS age
FROM sync_state
ORDER BY last_synced_at ASC;

Troubleshooting

"No wiki sources configured" (503 on /admin/rebuild)

At least one cloud connector must be fully configured: either NOTION_API_KEY or all four CONFLUENCE_* vars. This error does not apply to local directory sources — use knowa index <dir> for those. Verify env vars are loaded:

docker exec knowa env | grep -E 'NOTION|CONFLUENCE'

Graph tab shows no entities

Run knowa debug to check raw node/edge counts. If nodes = 0:

spaCy model not installed — run python -m spacy download en_core_web_sm then trigger a full rebuild
Purely technical content — code files, configs, and API references have few named entities. Graph works best with prose containing people, organisations, and places. Consider adding LLM entity enrichment for technical domains.
Indexed before graph extraction was set up — trigger a full rebuild from the Admin tab to re-run extraction on all pages.

Vector search returns no results

SELECT COUNT(*) FROM chunks WHERE chunk_type = 'child';

If 0, run a full rebuild. If non-zero, check for OpenAI API errors in the server logs — the query embedding may be failing.

Embeddings are slow or rate-limited

Embedding requests are batched at 100 texts per call (knowa/indexing/embedder.py). If hitting rate limits, reduce the batch size to 50 or request a rate limit increase from OpenAI.
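
If a smaller batch size alone is not enough, wrapping the embedding call in batching plus exponential backoff is a common pattern. The sketch below is generic: `embed_batch` is a caller-supplied function and `RateLimited` a hypothetical exception standing in for whatever error your embedding client raises; neither is taken from Knowa's source.

```python
import time
from typing import Callable, Iterator, Sequence


class RateLimited(Exception):
    """Hypothetical placeholder for a provider's rate-limit error."""


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(texts: Sequence[str], embed_batch: Callable, batch_size: int = 50,
              max_retries: int = 5, base_delay: float = 1.0, sleep=time.sleep) -> list:
    """Embed texts batch by batch, backing off 1s, 2s, 4s... on rate limits."""
    vectors = []
    for batch in batches(texts, batch_size):
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except RateLimited:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                sleep(base_delay * 2 ** attempt)
    return vectors


calls = {"n": 0}


def flaky_embed(batch):
    """Fake embedder that is rate-limited on its first call only."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimited
    return [[float(len(text))] for text in batch]


texts = [f"text {i}" for i in range(120)]
vectors = embed_all(texts, flaky_embed, sleep=lambda seconds: None)  # skip real waiting in the demo
print(len(vectors), "vectors from", calls["n"], "calls")
```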

Notion API 403 / pages missing from index

The Notion integration must be explicitly connected to each page: ⋯ → Connections → Add connection. Sub-pages inherit the connection automatically, but top-level pages do not.

"extension vector does not exist"

CREATE EXTENSION IF NOT EXISTS vector;

On AWS RDS, pgvector needs no Parameter Group or shared_preload_libraries change; if the command still fails, check that your PostgreSQL engine version is one that includes the extension.

IVFFlat index warning: "index requires more data"

The IVFFlat index needs at least lists × 3 rows to be useful. With lists=100 that is ~300 child chunks. This warning appears on small initial datasets and resolves automatically as the index grows.

Monitoring

No built-in metrics endpoint yet. Recommended additions for production:

Structured logging — replace logging.basicConfig with structlog for JSON logs
Rebuild alerts — if stats["errors"] > 0 after a rebuild, send a Slack/email notification
Query latency — add a FastAPI middleware timer and log P95 latency
Index freshness — alert if last_synced_at is older than your sync interval (the Operations query above makes a good health check)

Architecture

For full architecture documentation, schema, API spec, cost model, and extension points see DESIGN.md.

Tech stack

PostgreSQL + pgvector · FastAPI · OpenAI (embeddings + completions) · spaCy (configurable NER model) · tiktoken · Neo4j / Kuzu (optional graph backends) · any OpenAI-compatible LLM for entity enrichment (gpt-4o-mini, Qwen3, Kimi2, etc.)
