Personal Knowledge Operating System (PWBS)

A cognitive infrastructure that continuously ingests data from heterogeneous personal sources, builds a semantic knowledge model, and delivers context-aware briefings at the right moment — so knowledge workers spend less time searching and more time deciding.

Status: Phase 2 (MVP) — Closed Beta. Core pipeline (4 connectors → chunking → hybrid search → briefing generation) is functional end-to-end.

What I Built and Why It's Interesting

This is a solo-built RAG system with four non-trivial engineering decisions that go beyond standard library glue:

1. Semantic Coherence Chunking (ADR-021)

Standard chunkers split at punctuation or fixed token counts. My SemanticCoherenceChunker places chunk boundaries at actual topic shifts: it computes a cosine-similarity curve over consecutive sentence embeddings and treats local minima below an adaptive threshold (μ − k·σ, where μ is the mean and σ the standard deviation of the curve, scaled by a sensitivity factor k) as boundaries. Each chunk then covers one coherent topic, which measurably improves retrieval precision.

→ backend/pwbs/processing/semantic_chunker.py — 580 LOC, no NumPy dependency, provider-agnostic embedding callback
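The boundary detection described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the code in semantic_chunker.py: the function names, the sensitivity factor `k`, and the rule for handling the curve's endpoints are assumptions made for the example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_boundaries(embeddings: Sequence[Vector], k: float = 1.0) -> list[int]:
    """Return the sentence indices at which a new chunk starts.

    `k` (hypothetical parameter) scales the standard deviation in the
    adaptive threshold: mean - k * stddev of the similarity curve.
    """
    # Similarity between each sentence and its successor.
    sims = [cosine(embeddings[i], embeddings[i + 1])
            for i in range(len(embeddings) - 1)]
    if len(sims) < 3:
        return []  # too short for a meaningful curve

    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    threshold = mean - k * std

    boundaries = []
    for i, s in enumerate(sims):
        left = sims[i - 1] if i > 0 else float("inf")
        right = sims[i + 1] if i + 1 < len(sims) else float("inf")
        # A boundary is a local minimum that also falls below the threshold.
        if s < threshold and s <= left and s <= right:
            boundaries.append(i + 1)
    return boundaries
```

Because the threshold is derived from the document's own similarity distribution, a uniformly on-topic text produces few or no boundaries, while a text with one sharp topic change produces exactly one.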

2. Hybrid Search with RRF + Sigmoid Reranker (ADR-010)

Vector search fails on exact proper names; keyword search can't handle synonyms. The system combines both via Reciprocal Rank Fusion (Cormack et al., SIGIR 2009), then reranks with a composite score: cosine similarity (0.6) + normalized RRF (0.3) + sigmoid-based recency decay (0.1). The recency sigmoid gives recent documents a boost that decays smoothly with age instead of dropping off at a hard cutoff.

→ backend/pwbs/search/hybrid.py + reranker.py — includes an IR evaluation framework (nDCG@k, MRR, MAP, Precision/Recall@k)
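A rough sketch of the fusion and reranking arithmetic follows. The weights (0.6 / 0.3 / 0.1) come from the description above and k = 60 is the constant used in the RRF paper; the max-based RRF normalization and the sigmoid's midpoint and steepness are assumptions for illustration, not values taken from hybrid.py or reranker.py.

```python
import math
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranked list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def recency_score(age_days: float, midpoint_days: float = 30.0,
                  steepness: float = 0.1) -> float:
    """Sigmoid decay: close to 1 for fresh documents, 0.5 at the midpoint.

    midpoint_days and steepness are hypothetical tuning values.
    """
    z = min(steepness * (age_days - midpoint_days), 700.0)  # avoid overflow
    return 1.0 / (1.0 + math.exp(z))


def composite_score(cosine_sim: float, rrf: float, max_rrf: float,
                    age_days: float) -> float:
    """Weighted rerank score: 0.6 cosine + 0.3 normalized RRF + 0.1 recency."""
    norm_rrf = rrf / max_rrf if max_rrf else 0.0  # assumed normalization
    return 0.6 * cosine_sim + 0.3 * norm_rrf + 0.1 * recency_score(age_days)
```

Usage: fuse the vector and keyword result lists with `rrf_fuse([vector_ids, keyword_ids])`, take the maximum fused score as `max_rrf`, then sort candidates by `composite_score`.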

3. LLM Gateway with Fallback Cascade and Circuit Breaker

Three LLM providers (Claude → GPT-4 → Ollama) with automatic failover. Each provider call is wrapped in retries with exponential backoff (factor 5), transient-error classification, and per-request cost tracking. External API calls are protected by a three-state circuit breaker (CLOSED → OPEN → HALF_OPEN) to prevent cascading failures.

→ backend/pwbs/core/llm_gateway.py (520 LOC) + backend/pwbs/connectors/resilience.py
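The interplay between the fallback cascade and the circuit breaker can be illustrated with a simplified sketch. Class and function names, the failure threshold, and the reset timeout are hypothetical; retries, backoff, error classification, and cost tracking are left out to keep the example short.

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "closed"        # calls flow normally
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # trial calls are allowed after the timeout


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._clock = clock

    def allow(self) -> bool:
        """Return True if a call may be attempted right now."""
        if (self.state is State.OPEN
                and self._clock() - self._opened_at >= self.reset_timeout):
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_success(self) -> None:
        self._failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self._failures += 1
        # A failed trial call in HALF_OPEN re-opens the circuit at once.
        if (self.state is State.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self.state = State.OPEN
            self._opened_at = self._clock()


Provider = tuple[str, Callable[[str], str], CircuitBreaker]


def call_with_fallback(providers: list[Provider], prompt: str) -> tuple[str, str]:
    """Try providers in order, skipping any whose circuit is open."""
    for name, call, breaker in providers:
        if not breaker.allow():
            continue
        try:
            result = call(prompt)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers unavailable")
```

Keeping one breaker per provider means a provider that is down is skipped without waiting for its timeout on every request, while the next provider in the cascade keeps serving traffic.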

4. Envelope Encryption with Per-User Key Derivation (ADR-009)

GDPR's right to erasure means that deleting a user's account must make their data unrecoverable, ideally without re-encrypting or rewriting everything. I implemented HKDF-based per-user key derivation (owner_id as salt), Fernet DEK/KEK envelope encryption, and Argon2id password hashing (64 MB memory cost). Deleting a user's wrapped DEK leaves their ciphertext undecryptable, which amounts to effective data erasure.

→ backend/pwbs/connectors/oauth.py + backend/pwbs/services/user.py
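To make the key-derivation step concrete, here is a standard-library sketch of HKDF-SHA256 (RFC 5869) with the owner ID used as salt, as described above. It covers only the derivation; the Fernet wrapping of the DEK is omitted. The function names and the `info` label are hypothetical, and a production system would more likely use a vetted implementation such as the HKDF class in the `cryptography` package.

```python
import base64
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869): extract a pseudorandom key, then expand it."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long for HKDF-SHA256")
    # Extract step: PRK = HMAC(salt, input keying material).
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()
    # Expand step: T(i) = HMAC(PRK, T(i-1) || info || i).
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_kek: bytes, owner_id: str) -> bytes:
    """Derive a per-user key from the master KEK, with owner_id as salt.

    The result is 32 bytes in URL-safe base64, the format Fernet keys use.
    The info label below is a hypothetical context string.
    """
    raw = hkdf_sha256(master_kek, salt=owner_id.encode("utf-8"),
                      info=b"pwbs-user-kek")
    return base64.urlsafe_b64encode(raw)
```

Derivation is deterministic, so the per-user key never needs to be stored: it can be recomputed from the master KEK and the owner ID whenever a wrapped DEK has to be unwrapped.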

How It Was Built

This project was built solo over several months with the help of AI coding assistants (GitHub Copilot / Claude) for boilerplate generation and test scaffolding. All architectural decisions, algorithm designs, and trade-off evaluations are my own. The ADRs in docs/adr/ document my reasoning for each significant technical choice.

Features

Data ingestion — 4 connectors (Google Calendar, Notion, Obsidian, Zoom) via OAuth2 or local file watchers; cursor-based incremental sync
Processing pipeline — semantic chunking → batch embedding → two-stage NER (rule-based + LLM) → idempotent graph writes
Hybrid search — Weaviate vector similarity + PostgreSQL BM25, fused via RRF, reranked with cosine + recency
Context briefings — LLM-generated briefings (morning, meeting-prep, project, weekly) backed exclusively by retrieved context with full source attribution
Knowledge graph — Neo4j with weighted, time-decaying edges and pattern recognition (recurring themes, unresolved questions)
Graceful degradation — Neo4j optional (3 NullService implementations), Weaviate optional, LLM cascade with cache fallback
GDPR by design — per-user encryption keys, expires_at on every document, DELETE CASCADE, no LLM training on user data
Idempotent pipeline — every step is safe to re-run; cursor watermarks persisted after each batch
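The graceful-degradation item relies on the null-object pattern: when an optional backend such as Neo4j is not running, callers receive a stand-in that honors the same interface and returns empty results. The sketch below shows the idea with hypothetical names; it does not reproduce any of the three NullService implementations in the repository.

```python
from typing import Protocol


class GraphService(Protocol):
    """Interface that both the real and the null implementation satisfy."""

    def related_entities(self, entity: str) -> list[str]: ...


class NullGraphService:
    """Stand-in used when the graph database is disabled or unreachable."""

    def related_entities(self, entity: str) -> list[str]:
        return []  # no graph, so no related entities


def build_context(query: str, graph: GraphService) -> dict:
    """Callers never check whether the graph exists; they just call it."""
    related = graph.related_entities(query)
    return {"query": query, "related": related}
```

The benefit is that feature code contains no `if graph is None` branches; enabling or disabling Neo4j only changes which object is injected at startup.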

Tech Stack

Layer	Technology
Backend	Python 3.12+, FastAPI, Pydantic v2
Relational DB	PostgreSQL (users, connectors, documents, audit log)
Vector DB	Weaviate (chunk embeddings, hybrid search)
Graph DB	Neo4j (knowledge graph, entity relationships)
LLM	Anthropic Claude (primary), OpenAI GPT-4 (fallback), Ollama (local/offline)
Embeddings	OpenAI `text-embedding-3-small` (cloud), `all-MiniLM-L6-v2` via Sentence Transformers (local)
Frontend	Next.js (App Router), React, TypeScript, Tailwind CSS
Infrastructure	Docker Compose (local), Vercel (frontend), AWS ECS Fargate + RDS + ElastiCache (backend)
Task queue	Celery + Redis (active in MVP: ingestion, processing, briefing queues)
Migrations	Alembic
Testing	pytest, pytest-asyncio

Prerequisites

Docker and Docker Compose
Python 3.12+
Node.js 20+
API keys for at least one LLM provider (Anthropic or OpenAI) and one embedding provider
OAuth2 application credentials for any connectors you want to enable (Google, Notion, Zoom)

Installation

1. Clone the repository

git clone https://github.com/sauremilk/PWBS.git
cd PWBS

2. Configure environment variables

cp .env.example .env
# Edit .env — see Configuration section for required variables

3. Start backing services

docker compose up -d
# Starts PostgreSQL, Weaviate, and Redis
# Neo4j (Knowledge Graph) is optional — activate with:
#   docker compose --profile graph up -d

4. Set up the backend

cd backend
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"

# Run database migrations
alembic upgrade head

# Start the API server (hot-reload)
uvicorn pwbs.api.main:app --reload

5. Set up the frontend

# Run in a second terminal, from the repository root (uvicorn keeps the first one busy)
cd frontend
npm install
npm run dev

The API is available at http://localhost:8000 and the web app at http://localhost:3000.

Quickstart

Connect a data source

# Initiate OAuth2 flow for Google Calendar
curl http://localhost:8000/api/v1/connectors/google_calendar/auth-url \
  -H "Authorization: Bearer <access_token>"
# Returns a redirect URL — open in browser to complete OAuth consent

# Trigger a manual sync after connecting
curl -X POST http://localhost:8000/api/v1/connectors/<connection_id>/sync \
  -H "Authorization: Bearer <access_token>"

Semantic search

curl "http://localhost:8000/api/v1/search?q=product+roadmap+Q2&mode=hybrid&top_k=10" \
  -H "Authorization: Bearer <access_token>"

Generate a morning briefing

curl -X POST http://localhost:8000/api/v1/briefings/generate \
  -H "Authorization: Bearer <access_token>" \
  -H "Content-Type: application/json" \
  -d '{"type": "morning"}'

Implementing a new connector

Every connector extends BaseConnector and must be idempotent:

from pwbs.connectors.base import BaseConnector, ConnectorConfig, SyncResult
from pwbs.ingestion.models import UnifiedDocument

class MyConnector(BaseConnector):
    async def fetch_since(self, cursor: str | None) -> SyncResult:
        """Cursor-based fetch. Returns the new cursor for the next run."""
        ...

    async def normalize(self, raw: dict) -> UnifiedDocument:
        """Transform raw API response into the Unified Document Format."""
        ...
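What "cursor-based" and "idempotent" mean in practice can be shown with a self-contained toy that does not depend on the PWBS classes. Everything here (`InMemoryStore`, `fetch_page`, `sync`, the event fields) is hypothetical and stands in for a remote API and a database; timestamps are assumed to be unique ISO-8601 strings so that string comparison orders them correctly.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InMemoryStore:
    """Toy document store plus a persisted cursor (watermark)."""
    docs: dict = field(default_factory=dict)
    cursor: Optional[str] = None

    def upsert(self, doc: dict) -> None:
        # Keyed by source ID, so writing the same document twice is harmless.
        self.docs[doc["source_id"]] = doc


def fetch_page(events: list, cursor: Optional[str], page_size: int = 2) -> list:
    """Stand-in for a remote API: items changed after `cursor`, oldest first."""
    newer = [e for e in events if cursor is None or e["updated_at"] > cursor]
    return sorted(newer, key=lambda e: e["updated_at"])[:page_size]


def sync(events: list, store: InMemoryStore) -> int:
    """Pull everything newer than the stored cursor; return documents written."""
    written = 0
    while True:
        page = fetch_page(events, store.cursor)
        if not page:
            return written
        for raw in page:
            # Normalize the raw record into a uniform document shape.
            store.upsert({
                "source_id": raw["id"],
                "title": raw["title"].strip(),
                "updated_at": raw["updated_at"],
            })
            written += 1
        # Advance the watermark only after the whole batch has been written.
        store.cursor = page[-1]["updated_at"]
```

Running `sync` twice in a row writes nothing the second time, and an updated event overwrites its earlier version instead of creating a duplicate.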

Project Structure

backend/pwbs/           # Python 3.12 package
  api/                  # FastAPI app, v1 routes, middleware (auth, rate limit, audit)
  connectors/           # BaseConnector + 4 source implementations
  processing/           # Semantic chunking, embedding, NER, entity dedup
  search/               # Hybrid search (vector + keyword + RRF), reranker, evaluation
  briefing/             # Briefing generation with context modules
  graph/                # Neo4j knowledge graph (optional)
  core/                 # Config, exceptions, LLM gateway, encryption
  models/               # SQLAlchemy ORM (33 models)
  queue/tasks/          # Celery workers (ingestion, processing, briefing)
frontend/src/           # Next.js 15 App Router, TypeScript strict mode
docs/adr/               # Architecture Decision Records

Configuration

All secrets and settings are loaded from environment variables. Copy .env.example to .env and fill in values — never commit .env.

Required: DATABASE_URL, WEAVIATE_URL, REDIS_URL, JWT_PRIVATE_KEY, JWT_PUBLIC_KEY, ENCRYPTION_KEK, and at least one LLM API key (ANTHROPIC_API_KEY or OPENAI_API_KEY). Neo4j is optional (activate with --profile graph). See .env.example for the full variable list.

API Documentation

Interactive API docs (Swagger UI) are available at http://localhost:8000/docs in development mode. All endpoints require a JWT Bearer token; user identity is always extracted from the token, never from request bodies. Every database query filters by owner_id.

Architecture Overview

PWBS follows a modular monolith pattern in the current MVP (Phase 2). All backend logic runs in a single FastAPI process; modules communicate via typed Python interfaces, not HTTP. This enables rapid iteration while keeping the service boundary clear enough for a future service split in Phase 3.

flowchart TD
    subgraph Sources["Data Sources"]
        GCal[Google Calendar]
        Notion[Notion]
        Zoom[Zoom Transcripts]
        Obs[Obsidian Vault]
    end

    subgraph Ingestion["Ingestion Layer (IngestionAgent)"]
        Sync[Cursor-based Sync] --> UDF[Unified Document Format]
    end

    subgraph Processing["Processing Pipeline (ProcessingAgent)"]
        Chunk[Semantic Chunking] --> Embed[Embedding Generation]
        Embed --> NER[Named Entity Recognition]
    end

    subgraph Store["Knowledge Store"]
        PG[(PostgreSQL)]
        WV[(Weaviate)]
        N4J[(Neo4j Graph)]
    end

    subgraph Retrieval["Retrieval (SearchAgent + GraphAgent)"]
        Hybrid[Hybrid Search RRF]
        Graph[Graph Traversal]
    end

    subgraph Output["Output (BriefingAgent)"]
        LLM[LLM Gateway\nClaude / GPT-4 / Ollama]
        Brief[Briefing + SourceRefs]
    end

    Sources --> Sync
    UDF --> Processing
    NER --> PG
    NER --> N4J
    Embed --> WV
    Store --> Retrieval
    Retrieval --> LLM
    LLM --> Brief
    Brief --> API[FastAPI]
    API --> FE[Next.js Frontend]

The processing pipeline runs in three stages:

Chunking — two strategies: regex-based semantic splitting (fast, default) and embedding-based coherence chunking that detects topic shifts via cosine similarity curves (ADR-021)
Embedding — batch embedding generation (OpenAI or local Sentence Transformers)
NER + Graph — two-stage entity extraction (rule-based → LLM-based) followed by idempotent MERGE writes to Neo4j

See ARCHITECTURE.md for the full system design, database schemas, Weaviate collection configuration, and Neo4j graph model.

Development

Running tests

cd backend

# Unit tests (no network, no running databases required)
pytest tests/unit/ -v

# Integration tests (requires running Docker services)
pytest tests/integration/ -v --docker

Creating a database migration

cd backend
alembic revision --autogenerate -m "describe your change"
alembic upgrade head

Adding a new briefing type

Derive from BriefingTemplate, place the LLM prompt in pwbs/prompts/, and register the new type in the scheduler if it should run on a schedule. Every briefing must return sources: list[SourceRef] — the system does not deliver briefings without source attribution.
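The source-attribution rule can be enforced at construction time, so that a briefing without sources can never be created. The sketch below is an assumption about how such a guard might look; the `Briefing` and `SourceRef` fields shown are hypothetical and are not the project's actual models.

```python
from dataclasses import dataclass


@dataclass
class SourceRef:
    """Pointer back to a retrieved document (hypothetical fields)."""
    document_id: str
    title: str


@dataclass
class Briefing:
    type: str
    text: str
    sources: list[SourceRef]

    def __post_init__(self) -> None:
        # Refuse to build a briefing that cannot cite where its content came from.
        if not self.sources:
            raise ValueError("briefing has no source attribution")
```

Usage: `Briefing("morning", "Two meetings today.", [SourceRef("doc-1", "Calendar")])` succeeds, while passing an empty `sources` list raises `ValueError`.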

Architecture decisions

Significant architectural decisions are documented as Architecture Decision Records in docs/adr/. Use docs/adr/000-template.md as the starting point for new ADRs.

Contributing

Fork the repository and create a feature branch.
Follow the coding conventions in .github/instructions/ (applied automatically by GitHub Copilot).
Ensure all unit tests pass before opening a pull request.
For significant changes, create an ADR in docs/adr/ before writing code.
Do not commit .env or any file containing secrets.

Security issues should be reported privately rather than via public issues.

License

License terms have not yet been specified for this project.
