Knowledge Hub

A production-grade personal knowledge base with semantic search, designed to run on a single machine. Index codebases, web pages, PDFs, YouTube transcripts, and notes — then search everything with natural language through an MCP server, HTTP API, or CLI.

Built for developers who want their own searchable knowledge graph without cloud dependencies.

Why This Exists

LLMs are powerful but stateless. Every conversation starts from scratch. Knowledge Hub gives your tools persistent memory:

  • Index your entire codebase — AST-aware chunking understands Python class/function boundaries, not just line counts
  • Search with natural language — "how does the exit manager calculate stop losses" finds the exact code
  • Inject into any Claude Code session — MCP server makes your knowledge base a native tool
  • RAG queries — ask questions, get synthesized answers with source citations
  • Zero cloud lock-in — runs entirely on your machine (Ollama embeddings, Qdrant vector DB, SQLite metadata)

Architecture

┌───────────────────────────────────────────────────────┐
│                      Interfaces                       │
│  ┌────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ MCP Server │  │ HTTP API     │  │ CLI           │  │
│  │ (stdio)    │  │ (port 8006)  │  │               │  │
│  └─────┬──────┘  └──────┬───────┘  └───────┬───────┘  │
│        └────────────────┼──────────────────┘          │
│                         ▼                             │
│                 ┌────────────────┐                    │
│                 │ IngestPipeline │                    │
│                 │ (orchestrator) │                    │
│                 └───────┬────────┘                    │
│        ┌────────────────┼─────────────┐               │
│        ▼                ▼             ▼               │
│  ┌───────────┐  ┌────────────┐  ┌───────────┐         │
│  │ Ingestors │  │ Chunker    │  │ Search    │         │
│  │ Web/PDF/  │  │ AST/Prose/ │  │ Engine    │         │
│  │ YouTube/  │  │ Markdown   │  │ (hybrid)  │         │
│  │ Code/Text │  │            │  │           │         │
│  └───────────┘  └────────────┘  └───────────┘         │
│                        │                              │
│        ┌───────────────┼─────────────┐                │
│        ▼               ▼             ▼                │
│  ┌────────────┐  ┌──────────┐  ┌───────────┐          │
│  │ Ollama     │  │ Qdrant   │  │ SQLite    │          │
│  │ Embeddings │  │ Vectors  │  │ Metadata  │          │
│  │ (768-dim)  │  │ (HNSW)   │  │ (WAL)     │          │
│  └────────────┘  └──────────┘  └───────────┘          │
└───────────────────────────────────────────────────────┘

| Component | Technology | Purpose |
|-----------|------------|---------|
| Vector DB | Qdrant (Docker) | Cosine similarity search, HNSW index, mmap disk storage |
| Embeddings | nomic-embed-text (Ollama) | 768-dim vectors, local inference, free |
| Metadata | SQLite (WAL mode) | Document tracking, dedup via content hashing |
| Chunking | Python `ast` module | Class/function boundary detection for code |
| Search | Hybrid scoring | Semantic similarity + time decay + source credibility |
| RAG | Claude API | Synthesized answers with source citations |
| MCP | FastMCP (stdio) | Native Claude Code integration |
| HTTP | FastAPI | REST API for remote access |

Quick Start

Prerequisites

  • Python 3.11+
  • Docker (for Qdrant)
  • Ollama with nomic-embed-text model
# Install Ollama and pull the embedding model
ollama pull nomic-embed-text

Install

git clone https://github.com/sowmith95/KnowledgeHub.git
cd KnowledgeHub

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install
pip install -e .

Start Qdrant

cd docker
docker compose up -d qdrant

Index Something

# Index a codebase
knowledge-hub ingest /path/to/your/project --code --name "My Project"

# Index a web page
knowledge-hub ingest https://docs.python.org/3/tutorial/classes.html

# Index a PDF
knowledge-hub ingest /path/to/paper.pdf

# Index a YouTube video
knowledge-hub ingest https://www.youtube.com/watch?v=dQw4w9WgXcQ

# Store a note
knowledge-hub ingest --text "Redis HGETALL returns all fields as strings" --title "Redis Notes"

Search

# Semantic search
knowledge-hub search "how does authentication work"

# Search only code
knowledge-hub search "database connection pooling" --type code

# RAG query (requires ANTHROPIC_API_KEY)
knowledge-hub ask "What is the retry strategy for failed API calls?"

MCP Server (Claude Code Integration)

The primary interface. Add to your Claude Code config (~/.claude.json):

{
  "mcpServers": {
    "knowledge-hub": {
      "command": "/path/to/KnowledgeHub/.venv/bin/python3",
      "args": ["-m", "knowledge_hub"],
      "env": {
        "QDRANT_HOST": "localhost",
        "QDRANT_PORT": "6333",
        "OLLAMA_URL": "http://localhost:11434",
        "KB_SQLITE_PATH": "/path/to/knowledge-hub-data/metadata.db"
      }
    }
  }
}

Once configured, these tools are available in every Claude Code session:

MCP Tools

| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `kb_search` | Semantic search across all indexed content | `query`, `top_k` (default 10), `source_type` filter |
| `kb_ingest_url` | Ingest a web page, YouTube video, or PDF | `url`, `force` (re-ingest even if unchanged) |
| `kb_ingest_code` | Index a local codebase directory | `path`, `name` |
| `kb_ingest_text` | Store raw text or markdown | `text`, `title`, `source_type` |
| `kb_query` | RAG: search + synthesize answer with citations | `question`, `top_k` (default 8) |
| `kb_list` | List all indexed documents | `source_type` filter, `limit` |
| `kb_stats` | System health and document counts | |
| `kb_delete` | Remove a document and its vectors | `doc_id` |

HTTP API

Start the HTTP server for remote access (e.g., via Tailscale):

knowledge-hub serve --host 0.0.0.0 --port 8006

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/health` | System health check |
| POST | `/ingest/url` | Ingest URL (web/PDF/YouTube) |
| POST | `/ingest/text` | Ingest raw text |
| POST | `/ingest/code` | Index a codebase directory |
| POST | `/search` | Semantic search |
| POST | `/query` | RAG query with synthesized answer |
| GET | `/documents` | List indexed documents |
| DELETE | `/documents/{doc_id}` | Delete a document |
| GET | `/stats` | System statistics |

Example API Call

curl -X POST http://localhost:8006/search \
  -H "Content-Type: application/json" \
  -d '{"query": "retry logic for API calls", "top_k": 5}'

How It Works

Ingestion Pipeline

Source → Detect Type → Extract Content → Hash Check → Chunk → Embed → Store
  1. Detect: IngestorRegistry tries YouTube → PDF → Code → Web (most specific first)
  2. Extract: Pull text content, title, metadata from source
  3. Hash Check: SHA-256 content hash — skip if unchanged (unless force=True)
  4. Chunk: Split into semantically meaningful pieces (strategy depends on content type)
  5. Embed: Generate 768-dim vectors via Ollama nomic-embed-text (batches of 32)
  6. Store: Upsert vectors to Qdrant + document metadata to SQLite

Chunking Strategies

Python Code (AST-based)

  • Parses with ast.parse() for exact node boundaries
  • Imports grouped as one chunk
  • Each function/class becomes its own chunk
  • Large classes split into per-method chunks with class signature as context
  • Decorators stay attached to their definitions
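The core idea can be sketched with the standard `ast` module. This toy version only handles top-level definitions and omits the import grouping and per-method splitting described above, so treat it as an illustration rather than the project's chunker:

```python
import ast


def chunk_python(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function/class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Include decorators by starting at the earliest decorator line
            start = min([node.lineno] + [d.lineno for d in node.decorator_list]) - 1
            chunks.append("\n".join(lines[start:node.end_lineno]))
    return chunks
```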

Prose (sentence-aware)

  • Splits at sentence boundaries (.!? followed by uppercase)
  • 256-word target chunks with 32-word overlap
  • Never splits mid-sentence
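A minimal sketch of sentence-aware chunking with word overlap (parameter names are illustrative, and the real chunker also enforces the safety limits listed further down):

```python
import re


def chunk_prose(text: str, target: int = 256, overlap: int = 32) -> list[str]:
    # Split only where ., !, or ? is followed by whitespace and an uppercase letter
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    chunks, current, count = [], [], 0
    for sent in sentences:
        current.append(sent)
        count += len(sent.split())
        if count >= target:
            chunks.append(" ".join(current))
            # Carry roughly `overlap` trailing words into the next chunk as context
            tail = " ".join(current).split()[-overlap:]
            current, count = [" ".join(tail)], len(tail)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because splitting happens only at sentence boundaries, a chunk may overshoot the word target slightly rather than cut a sentence in half.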

Markdown (heading-aware)

  • Splits at heading boundaries (# through ######)
  • Maintains parent heading context for sub-sections
  • Falls back to sentence chunking for large sections

Safety limits:

  • Max 5,000 characters per chunk
  • Max 3,500 characters sent to embedding model (nomic-embed-text has 8,192 token context)
  • Minimum 10 words per chunk (filters noise)

Search Scoring

Hybrid scoring combines three signals:

final_score = (similarity × 0.70) + (time_score × 0.15) + (source_weight × 0.15)

| Signal | Weight | Formula |
|--------|--------|---------|
| Semantic similarity | 70% | Cosine similarity from Qdrant |
| Time decay | 15% | `exp(-0.693 × days_old / 90)`, a half-life of 90 days |
| Source credibility | 15% | code: 1.3, pdf: 1.2, markdown: 1.1, article: 1.0, youtube: 0.9, text: 0.8 |

Deduplication: Max 3 chunks per document in results to prevent one large file from dominating.
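The scoring formula and table above translate directly into code. This is an illustrative reimplementation, not the project's source:

```python
import math

# Credibility weights per source type, from the scoring table
SOURCE_WEIGHTS = {"code": 1.3, "pdf": 1.2, "markdown": 1.1,
                  "article": 1.0, "youtube": 0.9, "text": 0.8}


def final_score(similarity: float, days_old: float, source_type: str) -> float:
    # exp(-ln(2) * age / half_life): time score halves every 90 days
    time_score = math.exp(-0.693 * days_old / 90)
    source_weight = SOURCE_WEIGHTS.get(source_type, 1.0)
    return similarity * 0.70 + time_score * 0.15 + source_weight * 0.15
```

So a fresh, perfectly matching article chunk scores exactly 1.0, while the same match from a year-old transcript scores noticeably lower.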

RAG Queries

When you use kb_query or the /query endpoint (requires ANTHROPIC_API_KEY):

  1. Search for the top-k most relevant chunks
  2. Build a context window with numbered sources
  3. Send to Claude (claude-sonnet-4-20250514) with a system prompt enforcing context-only answers
  4. Returns synthesized answer with [Source N] citations

Falls back to raw search results if no API key is configured.
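Step 2, the numbered context window, might look like the following sketch; `build_context` and the chunk dictionary keys are hypothetical names chosen for illustration:

```python
def build_context(chunks: list[dict]) -> str:
    # Number each retrieved chunk so the model can cite it as [Source N]
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"[Source {i}] ({chunk['title']})\n{chunk['text']}")
    return "\n\n".join(parts)
```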

Supported Content Types

| Type | Ingestor | Details |
|------|----------|---------|
| Codebases | CodeIngestor | 25+ file extensions, respects `.gitignore`, skips `node_modules`/tests/`.venv`, 1 MB file limit |
| Web pages | WebIngestor | Strips nav/footer/ads, extracts `<article>` or `<main>` content |
| PDFs | PDFIngestor | Local files or URLs, extracts via PyPDF2 |
| YouTube | YouTubeIngestor | Transcript extraction (manual → auto-generated → any language) |
| Raw text | Direct | Notes, analysis results, anything you want searchable |
| Markdown | Direct | Heading-aware chunking with hierarchy context |

Supported Code Languages

Python (AST-parsed), JavaScript, TypeScript, TSX/JSX, Go, Rust, Java, C, C++, SQL, Shell/Bash, YAML, TOML, JSON, HTML, CSS, Svelte, Vue, Terraform, HCL, Dockerfile, Makefile.

Configuration

All configuration via environment variables with sensible defaults:

| Variable | Default | Description |
|----------|---------|-------------|
| `QDRANT_HOST` | `localhost` | Qdrant server host |
| `QDRANT_PORT` | `6333` | Qdrant HTTP port |
| `QDRANT_GRPC_PORT` | `6334` | Qdrant gRPC port |
| `KB_COLLECTION` | `knowledge_hub` | Qdrant collection name |
| `KB_EMBEDDING_DIM` | `768` | Embedding vector dimensions |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama API endpoint |
| `KB_EMBEDDING_MODEL` | `nomic-embed-text` | Ollama model name |
| `KB_CHUNK_SIZE` | `256` | Target words per chunk |
| `KB_CHUNK_OVERLAP` | `32` | Overlap words between chunks |
| `KB_SQLITE_PATH` | `~/knowledge-hub-data/metadata.db` | SQLite database path |
| `KB_API_HOST` | `0.0.0.0` | HTTP server bind address |
| `KB_API_PORT` | `8006` | HTTP server port |
| `ANTHROPIC_API_KEY` | (none) | Required for RAG queries (`kb_query`) |
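In code, this pattern is typically a handful of `os.environ.get` calls with the defaults from the table above; a representative fragment (not the project's `config.py` verbatim):

```python
import os

# Defaults mirror the table above; unset ANTHROPIC_API_KEY disables RAG queries
QDRANT_HOST = os.environ.get("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.environ.get("QDRANT_PORT", "6333"))
KB_CHUNK_SIZE = int(os.environ.get("KB_CHUNK_SIZE", "256"))
KB_SQLITE_PATH = os.path.expanduser(
    os.environ.get("KB_SQLITE_PATH", "~/knowledge-hub-data/metadata.db"))
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")  # None if unset
```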

Docker Deployment

For running both Qdrant and the HTTP API server:

cd docker

# Start everything
docker compose up -d

# Or just Qdrant (if running MCP server locally)
docker compose up -d qdrant

The docker-compose.yml binds data to host directories for persistence:

  • Qdrant data: /Users/srp/knowledge-hub-data/qdrant
  • SQLite + app data: /Users/srp/knowledge-hub-data

Update the volume paths in docker-compose.yml for your system.

Project Structure

KnowledgeHub/
├── knowledge_hub/
│   ├── __init__.py
│   ├── __main__.py          # Entry point (MCP server)
│   ├── config.py             # All configuration (env vars)
│   ├── models.py             # Domain models (Document, Chunk, SearchResult)
│   ├── pipeline.py           # Orchestrator (ingest, search, query)
│   ├── embeddings.py         # Ollama nomic-embed-text client
│   ├── vector_store.py       # Qdrant wrapper (upsert, search, delete)
│   ├── metadata_db.py        # SQLite metadata (documents, dedup)
│   ├── chunker.py            # AST/sentence/markdown chunking
│   ├── api.py                # FastAPI HTTP server
│   ├── cli.py                # CLI interface
│   ├── ingestors/
│   │   ├── base.py           # BaseIngestor ABC
│   │   ├── web.py            # Web page ingestor
│   │   ├── pdf.py            # PDF ingestor
│   │   ├── youtube.py        # YouTube transcript ingestor
│   │   └── code.py           # Codebase directory ingestor
│   ├── search/
│   │   └── engine.py         # Hybrid search + RAG query
│   └── mcp_server/
│       └── server.py         # FastMCP tool definitions
├── docker/
│   ├── docker-compose.yml
│   └── Dockerfile
├── scripts/
│   ├── start.sh
│   ├── stop.sh
│   ├── configure-claude-code.sh
│   └── index-complextading.sh
├── pyproject.toml
└── README.md

License

MIT
