A cognitive infrastructure that continuously ingests data from heterogeneous personal sources, builds a semantic knowledge model, and delivers context-aware briefings at the right moment — so knowledge workers spend less time searching and more time deciding.
Status: Phase 2 (MVP) — Closed Beta. Core pipeline (4 connectors → chunking → hybrid search → briefing generation) is functional end-to-end.
This is a solo-built RAG system with four non-trivial engineering decisions that go beyond standard library glue:
1. Semantic Coherence Chunking (ADR-021)
Standard chunkers split at punctuation or fixed token counts. My SemanticCoherenceChunker places chunk boundaries at actual topic shifts by computing a cosine-similarity curve over consecutive sentence embeddings and detecting local minima below an adaptive threshold (μ − σ·stddev). This makes every chunk cover one coherent topic, which measurably improves retrieval precision.
→ backend/pwbs/processing/semantic_chunker.py — 580 LOC, no NumPy dependency, provider-agnostic embedding callback
2. Hybrid Search with RRF + Sigmoid Reranker (ADR-010)
Vector search fails on exact proper names; keyword search can't handle synonyms. The system combines both via Reciprocal Rank Fusion (Cormack et al., SIGIR 2009), then reranks with a composite score: cosine similarity (0.6) + normalized RRF (0.3) + sigmoid-based recency decay (0.1). The recency sigmoid ensures recent documents get a boost that decays smoothly rather than cliff-dropping.
→ backend/pwbs/search/hybrid.py + reranker.py — includes an IR evaluation framework (nDCG@k, MRR, MAP, Precision/Recall@k)
Three LLM providers (Claude → GPT-4 → Ollama) with automatic failover. Each provider call wraps retry with exponential backoff (factor 5), transient error classification, and per-request cost tracking. External API calls are protected by a three-state circuit breaker (CLOSED → OPEN → HALF_OPEN) to prevent cascade failures.
→ backend/pwbs/core/llm_gateway.py (520 LOC) + backend/pwbs/connectors/resilience.py
4. Envelope Encryption with Per-User Key Derivation (ADR-009)
GDPR requires that deleting a user's account makes their data cryptographically unreadable — without re-encrypting everything. I implemented HKDF-based per-user key derivation (owner_id as salt), Fernet DEK/KEK envelope encryption, and Argon2id password hashing (64 MB memory cost). Deleting the wrapped DEK = effective data erasure.
→ backend/pwbs/connectors/oauth.py + backend/pwbs/services/user.py
This project was built solo over several months with the help of AI coding assistants (GitHub Copilot / Claude) for boilerplate generation and test scaffolding. All architectural decisions, algorithm designs, and trade-off evaluations are my own. The ADRs in docs/adr/ document my reasoning for each significant technical choice.
- Data ingestion — 4 connectors (Google Calendar, Notion, Obsidian, Zoom) via OAuth2 or local file watchers; cursor-based incremental sync
- Processing pipeline — semantic chunking → batch embedding → two-stage NER (rule-based + LLM) → idempotent graph writes
- Hybrid search — Weaviate vector similarity + PostgreSQL BM25, fused via RRF, reranked with cosine + recency
- Context briefings — LLM-generated briefings (morning, meeting-prep, project, weekly) backed exclusively by retrieved context with full source attribution
- Knowledge graph — Neo4j with weighted, time-decaying edges and pattern recognition (recurring themes, unresolved questions)
- Graceful degradation — Neo4j optional (3 NullService implementations), Weaviate optional, LLM cascade with cache fallback
- GDPR by design — per-user encryption keys,
expires_aton every document,DELETE CASCADE, no LLM training on user data - Idempotent pipeline — every step is safe to re-run; cursor watermarks persisted after each batch
| Layer | Technology |
|---|---|
| Backend | Python 3.12+, FastAPI, Pydantic v2 |
| Relational DB | PostgreSQL (users, connectors, documents, audit log) |
| Vector DB | Weaviate (chunk embeddings, hybrid search) |
| Graph DB | Neo4j (knowledge graph, entity relationships) |
| LLM | Anthropic Claude (primary), OpenAI GPT-4 (fallback), Ollama (local/offline) |
| Embeddings | OpenAI text-embedding-3-small (cloud), all-MiniLM-L6-v2 via Sentence Transformers (local) |
| Frontend | Next.js (App Router), React, TypeScript, Tailwind CSS |
| Infrastructure | Docker Compose (local), Vercel (frontend), AWS ECS Fargate + RDS + ElastiCache (backend) |
| Task queue | Celery + Redis (active in MVP: ingestion, processing, briefing queues) |
| Migrations | Alembic |
| Testing | pytest, pytest-asyncio |
- Docker and Docker Compose
- Python 3.12+
- Node.js 20+
- API keys for at least one LLM provider (Anthropic or OpenAI) and one embedding provider
- OAuth2 application credentials for any connectors you want to enable (Google, Notion, Zoom)
git clone https://github.com/sauremilk/PWBS.git
cd PWBScp .env.example .env
# Edit .env — see Configuration section for required variablesdocker compose up -d
# Starts PostgreSQL, Weaviate, and Redis
# Neo4j (Knowledge Graph) is optional — activate with:
# docker compose --profile graph up -dcd backend
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
# Run database migrations
alembic upgrade head
# Start the API server (hot-reload)
uvicorn pwbs.api.main:app --reloadcd frontend
npm install
npm run devThe API is available at http://localhost:8000 and the web app at http://localhost:3000.
# Initiate OAuth2 flow for Google Calendar
curl http://localhost:8000/api/v1/connectors/google_calendar/auth-url \
-H "Authorization: Bearer <access_token>"
# Returns a redirect URL — open in browser to complete OAuth consent
# Trigger a manual sync after connecting
curl -X POST http://localhost:8000/api/v1/connectors/<connection_id>/sync \
-H "Authorization: Bearer <access_token>"curl "http://localhost:8000/api/v1/search?q=product+roadmap+Q2&mode=hybrid&top_k=10" \
-H "Authorization: Bearer <access_token>"curl -X POST http://localhost:8000/api/v1/briefings/generate \
-H "Authorization: Bearer <access_token>" \
-H "Content-Type: application/json" \
-d '{"type": "morning"}'Every connector extends BaseConnector and must be idempotent:
from pwbs.connectors.base import BaseConnector, ConnectorConfig, SyncResult
from pwbs.ingestion.models import UnifiedDocument
class MyConnector(BaseConnector):
async def fetch_since(self, cursor: str | None) -> SyncResult:
"""Cursor-based fetch. Returns the new cursor for the next run."""
...
async def normalize(self, raw: dict) -> UnifiedDocument:
"""Transform raw API response into the Unified Document Format."""
...backend/pwbs/ # Python 3.12 package
api/ # FastAPI app, v1 routes, middleware (auth, rate limit, audit)
connectors/ # BaseConnector + 4 source implementations
processing/ # Semantic chunking, embedding, NER, entity dedup
search/ # Hybrid search (vector + keyword + RRF), reranker, evaluation
briefing/ # Briefing generation with context modules
graph/ # Neo4j knowledge graph (optional)
core/ # Config, exceptions, LLM gateway, encryption
models/ # SQLAlchemy ORM (33 models)
queue/tasks/ # Celery workers (ingestion, processing, briefing)
frontend/src/ # Next.js 15 App Router, TypeScript strict mode
docs/adr/ # Architecture Decision Records
All secrets and settings are loaded from environment variables. Copy .env.example to .env and fill in values — never commit .env.
Required: DATABASE_URL, WEAVIATE_URL, REDIS_URL, JWT_PRIVATE_KEY, JWT_PUBLIC_KEY, ENCRYPTION_KEK, and at least one LLM API key (ANTHROPIC_API_KEY or OPENAI_API_KEY). Neo4j is optional (activate with --profile graph). See .env.example for the full variable list.
Interactive API docs (Swagger UI) are available at http://localhost:8000/docs in development mode. All endpoints require a JWT Bearer token; user identity is always extracted from the token, never from request bodies. Every database query filters by owner_id.
PWBS follows a modular monolith pattern in the current MVP (Phase 2). All backend logic runs in a single FastAPI process; modules communicate via typed Python interfaces, not HTTP. This enables rapid iteration while keeping the service boundary clear enough for a future service split in Phase 3.
flowchart TD
subgraph Sources["Data Sources"]
GCal[Google Calendar]
Notion[Notion]
Zoom[Zoom Transcripts]
Obs[Obsidian Vault]
end
subgraph Ingestion["Ingestion Layer (IngestionAgent)"]
Sync[Cursor-based Sync] --> UDF[Unified Document Format]
end
subgraph Processing["Processing Pipeline (ProcessingAgent)"]
Chunk[Semantic Chunking] --> Embed[Embedding Generation]
Embed --> NER[Named Entity Recognition]
end
subgraph Store["Knowledge Store"]
PG[(PostgreSQL)]
WV[(Weaviate)]
N4J[(Neo4j Graph)]
end
subgraph Retrieval["Retrieval (SearchAgent + GraphAgent)"]
Hybrid[Hybrid Search RRF]
Graph[Graph Traversal]
end
subgraph Output["Output (BriefingAgent)"]
LLM[LLM Gateway\nClaude / GPT-4 / Ollama]
Brief[Briefing + SourceRefs]
end
Sources --> Sync
UDF --> Processing
NER --> PG
NER --> N4J
Embed --> WV
Store --> Retrieval
Retrieval --> LLM
LLM --> Brief
Brief --> API[FastAPI]
API --> FE[Next.js Frontend]
The processing pipeline runs in three stages:
- Chunking — two strategies: regex-based semantic splitting (fast, default) and embedding-based coherence chunking that detects topic shifts via cosine similarity curves (ADR-021)
- Embedding — batch embedding generation (OpenAI or local Sentence Transformers)
- NER + Graph — two-stage entity extraction (rule-based → LLM-based) followed by idempotent
MERGEwrites to Neo4j
See ARCHITECTURE.md for the full system design, database schemas, Weaviate collection configuration, and Neo4j graph model.
cd backend
# Unit tests (no network, no running databases required)
pytest tests/unit/ -v
# Integration tests (requires running Docker services)
pytest tests/integration/ -v --dockercd backend
alembic revision --autogenerate -m "describe your change"
alembic upgrade headDerive from BriefingTemplate, place the LLM prompt in pwbs/prompts/, and register the new type in the scheduler if it should run on a schedule. Every briefing must return sources: list[SourceRef] — the system does not deliver briefings without source attribution.
Significant architectural decisions are documented as Architecture Decision Records in docs/adr/. Use docs/adr/000-template.md as the starting point for new ADRs.
- Fork the repository and create a feature branch.
- Follow the coding conventions in
.github/instructions/(applied automatically by GitHub Copilot). - Ensure all unit tests pass before opening a pull request.
- For significant changes, create an ADR in
docs/adr/before writing code. - Do not commit
.envor any file containing secrets.
Security issues should be reported privately rather than via public issues.
License terms have not yet been specified for this project.