From f801f1ad4d0a49f5fa4b21511143bbe003710622 Mon Sep 17 00:00:00 2001 From: Laith Al-Saadoon Date: Sun, 10 May 2026 18:10:19 +0000 Subject: [PATCH 01/17] docs(repo): restore Astro Starlight site from pre-PR-#53 history The packages/docs/ Starlight site was deleted in commit 4431b53 (PR #53, OCH v1.0 -- M1 + M2 stabilize) under T-M2-3 with the explicit intent to spin it up as a separate repo. That separate repo never materialized; theagenticguy/opencodehub-docs does not exist. The README has carried the stale "being bootstrapped in a dedicated repo; this README will link it once published" placeholder for over a week while six v1-finalize milestones (M3-M7 + Track A-D) shipped. The live site at https://theagenticguy.github.io/opencodehub/ is still serving the May 1 build (29 tools advertised as 28; missing all M3-M7 coverage; no parse-runtime / LadybugDB / 20-scanner / Tracks A-D content). Pages does not auto-tear-down when its feeding workflow is removed, so the orphaned snapshot has been load-bearing for a week. This commit restores the 56 files exactly as they were at 4431b53^ plus a baseline `pnpm install` to update the lockfile. A follow-up commit will refresh content end-to-end against v1 reality and add the agent-friendly USAGE / discovery section. Build verified: `pnpm -F @opencodehub/docs build` produces 47 pages, internal links validate clean, Pagefind index generates, llm-nav banners patch all 46 .md files. 
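For reference, a restore of this shape can be sketched as below. This is an assumption about the general technique (checking paths out of the deleting commit's first parent), not a record of the exact commands used for this commit; `restore_from_parent` is a hypothetical helper name and is not part of this patch.

```shell
# Hypothetical helper (not part of this patch): restore the given paths
# as they existed in the first parent of COMMIT, i.e. just before COMMIT
# deleted or changed them. Restored files are staged in the index.
restore_from_parent() {
  commit="$1"
  shift
  git checkout "${commit}^" -- "$@"
}

# Assumed usage for this restore, with paths taken from the diffstat:
#   restore_from_parent 4431b53 packages/docs .github/workflows/pages.yml
#   pnpm install
#   pnpm -F @opencodehub/docs build
```

Checking out from `4431b53^` rather than from a later revert keeps the restored tree byte-identical to the pre-deletion state, which matches the "exactly as they were" intent stated above.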
--- .github/workflows/pages.yml | 59 + .gitignore | 4 + packages/docs/README.md | 77 + packages/docs/astro.config.mjs | 148 + packages/docs/package.json | 31 + packages/docs/public/.nojekyll | 0 packages/docs/public/favicon.svg | 6 + packages/docs/scripts/inject-llm-nav.mjs | 275 + packages/docs/src/assets/logo.svg | 8 + packages/docs/src/content.config.ts | 7 + .../src/content/docs/architecture/adrs.md | 201 + .../content/docs/architecture/determinism.md | 125 + .../content/docs/architecture/embeddings.md | 197 + .../content/docs/architecture/monorepo-map.md | 81 + .../src/content/docs/architecture/overview.md | 172 + .../architecture/parsing-and-resolution.md | 201 + .../docs/architecture/scanners-and-sarif.md | 200 + .../docs/architecture/scip-reconciliation.md | 200 + .../architecture/summarization-and-fusion.md | 202 + .../content/docs/architecture/supply-chain.md | 141 + .../adding-a-language-provider.md | 169 + .../docs/contributing/commit-conventions.md | 128 + .../src/content/docs/contributing/dev-loop.md | 141 + .../content/docs/contributing/ip-hygiene.md | 155 + .../src/content/docs/contributing/overview.md | 102 + .../docs/contributing/release-process.md | 128 + .../src/content/docs/contributing/testing.md | 150 + .../src/content/docs/guides/ci-integration.md | 86 + .../content/docs/guides/cross-repo-groups.md | 74 + .../content/docs/guides/indexing-a-repo.md | 101 + .../content/docs/guides/troubleshooting.md | 88 + .../docs/guides/using-with-claude-code.md | 99 + .../content/docs/guides/using-with-codex.md | 69 + .../content/docs/guides/using-with-cursor.md | 86 + .../docs/guides/using-with-opencode.md | 80 + .../docs/guides/using-with-windsurf.md | 80 + packages/docs/src/content/docs/index.mdx | 92 + .../docs/src/content/docs/mcp/overview.md | 76 + packages/docs/src/content/docs/mcp/prompts.md | 20 + .../docs/src/content/docs/mcp/resources.md | 24 + packages/docs/src/content/docs/mcp/tools.md | 85 + .../docs/src/content/docs/reference/cli.md | 385 ++ 
.../content/docs/reference/configuration.md | 60 + .../content/docs/reference/docmeta-schema.mdx | 98 + .../src/content/docs/reference/error-codes.md | 50 + .../src/content/docs/reference/languages.md | 72 + .../docs/skills/codehub-contract-map.mdx | 89 + .../content/docs/skills/codehub-document.mdx | 121 + .../docs/skills/codehub-onboarding.mdx | 86 + .../docs/skills/codehub-pr-description.mdx | 72 + .../docs/src/content/docs/skills/index.mdx | 84 + .../content/docs/start-here/codehub-init.md | 112 + .../content/docs/start-here/first-query.md | 104 + .../src/content/docs/start-here/install.md | 107 + .../content/docs/start-here/quick-start.md | 120 + .../docs/start-here/what-is-opencodehub.md | 62 + packages/docs/src/styles/custom.css | 31 + packages/docs/tsconfig.json | 5 + pnpm-lock.yaml | 5204 ++++++++++++++++- 59 files changed, 11076 insertions(+), 154 deletions(-) create mode 100644 .github/workflows/pages.yml create mode 100644 packages/docs/README.md create mode 100644 packages/docs/astro.config.mjs create mode 100644 packages/docs/package.json create mode 100644 packages/docs/public/.nojekyll create mode 100644 packages/docs/public/favicon.svg create mode 100644 packages/docs/scripts/inject-llm-nav.mjs create mode 100644 packages/docs/src/assets/logo.svg create mode 100644 packages/docs/src/content.config.ts create mode 100644 packages/docs/src/content/docs/architecture/adrs.md create mode 100644 packages/docs/src/content/docs/architecture/determinism.md create mode 100644 packages/docs/src/content/docs/architecture/embeddings.md create mode 100644 packages/docs/src/content/docs/architecture/monorepo-map.md create mode 100644 packages/docs/src/content/docs/architecture/overview.md create mode 100644 packages/docs/src/content/docs/architecture/parsing-and-resolution.md create mode 100644 packages/docs/src/content/docs/architecture/scanners-and-sarif.md create mode 100644 packages/docs/src/content/docs/architecture/scip-reconciliation.md create mode 100644 
packages/docs/src/content/docs/architecture/summarization-and-fusion.md create mode 100644 packages/docs/src/content/docs/architecture/supply-chain.md create mode 100644 packages/docs/src/content/docs/contributing/adding-a-language-provider.md create mode 100644 packages/docs/src/content/docs/contributing/commit-conventions.md create mode 100644 packages/docs/src/content/docs/contributing/dev-loop.md create mode 100644 packages/docs/src/content/docs/contributing/ip-hygiene.md create mode 100644 packages/docs/src/content/docs/contributing/overview.md create mode 100644 packages/docs/src/content/docs/contributing/release-process.md create mode 100644 packages/docs/src/content/docs/contributing/testing.md create mode 100644 packages/docs/src/content/docs/guides/ci-integration.md create mode 100644 packages/docs/src/content/docs/guides/cross-repo-groups.md create mode 100644 packages/docs/src/content/docs/guides/indexing-a-repo.md create mode 100644 packages/docs/src/content/docs/guides/troubleshooting.md create mode 100644 packages/docs/src/content/docs/guides/using-with-claude-code.md create mode 100644 packages/docs/src/content/docs/guides/using-with-codex.md create mode 100644 packages/docs/src/content/docs/guides/using-with-cursor.md create mode 100644 packages/docs/src/content/docs/guides/using-with-opencode.md create mode 100644 packages/docs/src/content/docs/guides/using-with-windsurf.md create mode 100644 packages/docs/src/content/docs/index.mdx create mode 100644 packages/docs/src/content/docs/mcp/overview.md create mode 100644 packages/docs/src/content/docs/mcp/prompts.md create mode 100644 packages/docs/src/content/docs/mcp/resources.md create mode 100644 packages/docs/src/content/docs/mcp/tools.md create mode 100644 packages/docs/src/content/docs/reference/cli.md create mode 100644 packages/docs/src/content/docs/reference/configuration.md create mode 100644 packages/docs/src/content/docs/reference/docmeta-schema.mdx create mode 100644 
packages/docs/src/content/docs/reference/error-codes.md create mode 100644 packages/docs/src/content/docs/reference/languages.md create mode 100644 packages/docs/src/content/docs/skills/codehub-contract-map.mdx create mode 100644 packages/docs/src/content/docs/skills/codehub-document.mdx create mode 100644 packages/docs/src/content/docs/skills/codehub-onboarding.mdx create mode 100644 packages/docs/src/content/docs/skills/codehub-pr-description.mdx create mode 100644 packages/docs/src/content/docs/skills/index.mdx create mode 100644 packages/docs/src/content/docs/start-here/codehub-init.md create mode 100644 packages/docs/src/content/docs/start-here/first-query.md create mode 100644 packages/docs/src/content/docs/start-here/install.md create mode 100644 packages/docs/src/content/docs/start-here/quick-start.md create mode 100644 packages/docs/src/content/docs/start-here/what-is-opencodehub.md create mode 100644 packages/docs/src/styles/custom.css create mode 100644 packages/docs/tsconfig.json diff --git a/.github/workflows/pages.yml b/.github/workflows/pages.yml new file mode 100644 index 0000000..3c61aa7 --- /dev/null +++ b/.github/workflows/pages.yml @@ -0,0 +1,59 @@ +name: Pages + +on: + push: + branches: [main] + paths: + - "packages/docs/**" + - ".github/workflows/pages.yml" + workflow_dispatch: + +permissions: + contents: read + pages: write + id-token: write + +concurrency: + group: pages + cancel-in-progress: false + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + - uses: jdx/mise-action@v4 + # NOTE: --ignore-scripts removed so sharp's native binary download + # and Playwright's chromium install (via rehype-mermaid) are allowed. + - run: pnpm install --frozen-lockfile + # rehype-mermaid renders ```mermaid fences to inline SVG at build + # time using Playwright chromium. --with-deps installs the apt + # libraries Playwright needs on ubuntu-latest. Do not use --with-deps + # on dev machines. 
+ - run: pnpm --filter @opencodehub/docs exec playwright install chromium --with-deps + # mise-action@v4 writes NODE_ENV=development to $GITHUB_ENV, which + # takes precedence over job-level env. Override at the step level + # (step env beats $GITHUB_ENV) to force Astro into production mode + # so Starlight wires Pagefind instead of shipping the dev stub + # ("Search is only available in production builds"). + # mise-action@v4 writes NODE_ENV=development to $GITHUB_ENV at setup + # time. Inline `env VAR=val` and step-level env: don't reach astro's + # Vite compilation reliably — Vite resolves `import.meta.env.DEV` at + # bundle time from process.env.NODE_ENV that the Vite module sees. + # Override $GITHUB_ENV directly so every subsequent step (including + # the Node process spawned by pnpm → astro) inherits production. + - run: echo "NODE_ENV=production" >> "$GITHUB_ENV" + - run: pnpm -F @opencodehub/docs build + - uses: actions/upload-pages-artifact@v5 + with: + path: packages/docs/dist + + deploy: + needs: build + runs-on: ubuntu-latest + environment: + name: github-pages + url: ${{ steps.deployment.outputs.page_url }} + steps: + - id: deployment + uses: actions/deploy-pages@v5 diff --git a/.gitignore b/.gitignore index 079fd48..69dae1e 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,10 @@ dist/ .tsbuildinfo *.tsbuildinfo +# Astro / Starlight build artifacts (packages/docs/) +.astro/ +packages/docs/dist/ + # Python (eval package) .venv/ __pycache__/ diff --git a/packages/docs/README.md b/packages/docs/README.md new file mode 100644 index 0000000..c22afbc --- /dev/null +++ b/packages/docs/README.md @@ -0,0 +1,77 @@ +# @opencodehub/docs + +Astro + Starlight documentation site for OpenCodeHub. Deployed to +GitHub Pages at https://theagenticguy.github.io/opencodehub/. 
+ +## Local development + +```bash +pnpm install +pnpm -F @opencodehub/docs dev # http://localhost:4321/opencodehub +pnpm -F @opencodehub/docs build # writes to packages/docs/dist +pnpm -F @opencodehub/docs preview # serves dist/ locally +``` + +Prefer the mise tasks from the repo root: + +```bash +mise run docs:dev +mise run docs:build +mise run docs:preview +``` + +## Site IA + +Top-level sections under `src/content/docs/`: + +- `start-here/` — install, quick-start, first query. +- `guides/` — editor integrations and task-oriented walkthroughs. +- `mcp/` — server overview, tool catalog, resources, prompts. +- `reference/` — CLI, error codes, language matrix, configuration. +- `architecture/` — monorepo map, determinism, supply chain, ADR index. +- `skills/` — Claude Code skill references. +- `contributing/` — dev loop, testing, release process. + +## ADRs + +Architecture decision records live at `/docs/adr/` at the repo root — 10 +files, numbered `0001-*.md` through `0010-*.md`. The Starlight site +surfaces them through an index page at +`src/content/docs/architecture/adrs.md`, so readers get both the canonical +source and a browsable index. + +## Starlight plugins + +Configured in `astro.config.mjs`: + +- `starlight-llms-txt` — emits `/llms.txt`, `/llms-full.txt`, and + `/llms-small.txt` at build time for LLM-crawlable bundles. +- `starlight-page-actions` — per-page "Copy as Markdown", "Open in ChatGPT", + "Open in Claude", and Share actions. +- `starlight-links-validator` — build-time broken-link check so shipped + bundles never carry dead links. + +## Authoring + +Pages live under `src/content/docs/`. Starlight picks up any +`.md` or `.mdx` file automatically; the sidebar auto-generates +per top-level directory. 
+ +Frontmatter fields we use: + +```yaml +--- +title: Page title +description: One-sentence SEO/summary +sidebar: + order: 1 # lower first; ties break alphabetically + label: Short # optional override +--- +``` + +## Deploy + +`.github/workflows/pages.yml` runs on pushes to `main` that touch +`packages/docs/**` or the workflow itself. It builds with +`withastro/action@v6` pinned to Node 22 and deploys with +`actions/deploy-pages@v5`. diff --git a/packages/docs/astro.config.mjs b/packages/docs/astro.config.mjs new file mode 100644 index 0000000..b02ccbb --- /dev/null +++ b/packages/docs/astro.config.mjs @@ -0,0 +1,148 @@ +import { defineConfig } from "astro/config"; +import starlight from "@astrojs/starlight"; +import starlightLinksValidator from "starlight-links-validator"; +import starlightLlmsTxt from "starlight-llms-txt"; +import starlightPageActions from "starlight-page-actions"; +import rehypeMermaid from "rehype-mermaid"; + +// https://astro.build/config +export default defineConfig({ + site: "https://theagenticguy.github.io", + base: "/opencodehub", + // Mermaid: render ```mermaid ``` fences to inline SVG at build time. + // excludeLangs is critical — without it, Shiki grabs the mermaid fence + // first and rehype-mermaid never sees it. + markdown: { + syntaxHighlight: { type: "shiki", excludeLangs: ["mermaid"] }, + rehypePlugins: [[rehypeMermaid, { strategy: "img-svg", dark: true }]], + }, + integrations: [ + starlight({ + title: "OpenCodeHub", + description: + "Apache-2.0 code intelligence graph + MCP server for AI coding agents.", + logo: { + src: "./src/assets/logo.svg", + replacesTitle: false, + }, + favicon: "/favicon.svg", + social: [ + { + icon: "github", + label: "GitHub", + href: "https://github.com/theagenticguy/opencodehub", + }, + ], + editLink: { + baseUrl: + "https://github.com/theagenticguy/opencodehub/edit/main/packages/docs/", + }, + lastUpdated: true, + credits: true, + plugins: [ + // 1) LLM-crawlable bundles. 
Emits /llms.txt, /llms-full.txt, + // /llms-small.txt at build time. Must run first so page-actions + // sees it already registered. + starlightLlmsTxt({ + projectName: "OpenCodeHub", + description: + "Apache-2.0 code intelligence graph + MCP server for AI coding agents. Gives agents callers, callees, processes, and blast radius in one MCP tool call — local, offline-capable, deterministic.", + details: + "OpenCodeHub indexes a repository into a hybrid structural + semantic knowledge graph and exposes it over the Model Context Protocol (MCP) to AI coding agents. The MCP server registers 28 tools spanning search, change-impact, findings, and cross-repo groups. The CLI binary is `codehub`. Runtime: Node 22, pnpm 10, DuckDB + hnsw_acorn storage, 15 tree-sitter languages, SCIP indexers for TypeScript / Python / Go / Rust / Java.", + promote: [ + "start-here/**", + "guides/**", + "mcp/**", + ], + demote: [ + "architecture/**", + "contributing/**", + ], + // Keep llms-small.txt tight by dropping internals-y prose. + exclude: [], + minify: { + note: true, + tip: true, + details: true, + whitespace: true, + caution: false, + danger: false, + }, + customSets: [ + { + label: "user-guide", + paths: ["start-here/**", "guides/**"], + description: + "User-facing pages only: install, quick-start, editor integration guides.", + }, + { + label: "mcp", + paths: ["mcp/**", "reference/**"], + description: + "MCP surface: server tools, resources, prompts, CLI reference, error codes, language matrix.", + }, + { + label: "contributing", + paths: ["contributing/**", "architecture/**"], + description: + "Developer and architecture docs: dev loop, release flow, ADRs, determinism, supply-chain.", + }, + ], + }), + + // 2) Per-page "Copy as Markdown" + "Open in ChatGPT" + "Open in + // Claude" + Share. IMPORTANT: do NOT set `baseUrl`, or this + // plugin will try to own /llms.txt too and collide with + // starlight-llms-txt. Leave llms generation to plugin #1. 
+ starlightPageActions({ + actions: { + markdown: true, + chatgpt: true, + claude: true, + t3chat: false, + v0: false, + }, + share: true, + }), + + // 3) Build-time broken-link check. Runs after content is built + // but before deploy, so llms-full.txt never ships dead links. + starlightLinksValidator({ + errorOnFallbackPages: false, + errorOnInconsistentLocale: false, + }), + ], + sidebar: [ + { + label: "Start Here", + autogenerate: { directory: "start-here" }, + }, + { + label: "User Guide", + autogenerate: { directory: "guides" }, + }, + { + label: "MCP Server", + autogenerate: { directory: "mcp" }, + }, + { + label: "Skills", + autogenerate: { directory: "skills" }, + }, + { + label: "Reference", + autogenerate: { directory: "reference" }, + }, + { + label: "Contributing", + autogenerate: { directory: "contributing" }, + }, + { + label: "Architecture", + autogenerate: { directory: "architecture" }, + }, + ], + customCss: ["./src/styles/custom.css"], + }), + ], +}); diff --git a/packages/docs/package.json b/packages/docs/package.json new file mode 100644 index 0000000..8e1e1ab --- /dev/null +++ b/packages/docs/package.json @@ -0,0 +1,31 @@ +{ + "name": "@opencodehub/docs", + "version": "0.0.0", + "private": true, + "description": "OpenCodeHub documentation site (Astro + Starlight)", + "license": "Apache-2.0", + "type": "module", + "engines": { + "node": ">=22.12.0" + }, + "scripts": { + "dev": "astro dev", + "start": "astro dev", + "build": "NODE_ENV=production astro build && node scripts/inject-llm-nav.mjs", + "preview": "astro preview", + "check": "astro check", + "clean": "rm -rf dist .astro" + }, + "dependencies": { + "@astrojs/starlight": "^0.38.4", + "astro": "^6.2.1", + "sharp": "^0.34.5" + }, + "devDependencies": { + "playwright": "^1.59.1", + "rehype-mermaid": "^3.0.0", + "starlight-links-validator": "^0.24.0", + "starlight-llms-txt": "^0.8.1", + "starlight-page-actions": "^0.6.0" + } +} diff --git a/packages/docs/public/.nojekyll 
b/packages/docs/public/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/packages/docs/public/favicon.svg b/packages/docs/public/favicon.svg new file mode 100644 index 0000000..6561943 --- /dev/null +++ b/packages/docs/public/favicon.svg @@ -0,0 +1,6 @@ + + + + + + diff --git a/packages/docs/scripts/inject-llm-nav.mjs b/packages/docs/scripts/inject-llm-nav.mjs new file mode 100644 index 0000000..a22f07c --- /dev/null +++ b/packages/docs/scripts/inject-llm-nav.mjs @@ -0,0 +1,275 @@ +#!/usr/bin/env node +// Postbuild: inject LLM-navigation helpers into every per-page .md emitted +// by starlight-page-actions, mirroring the pattern from +// https://code.claude.com/docs/en/agent-sdk/python.md: +// +// 1. Index banner at the top of every page pointing at /llms.txt +// 2. "See also" footer with 3-5 curated related-page links +// +// Runs after `astro build` against packages/docs/dist/**/*.md. + +import { promises as fs } from "node:fs"; +import path from "node:path"; +import { fileURLToPath } from "node:url"; + +const __dirname = path.dirname(fileURLToPath(import.meta.url)); +const DIST = path.resolve(__dirname, "..", "dist"); +const BASE = "/opencodehub"; +const SITE = "https://theagenticguy.github.io"; + +const INDEX_BANNER = `> ## Documentation Index +> Fetch the complete documentation index at: ${SITE}${BASE}/llms.txt +> Use this file to discover all available pages before exploring further. +> Scoped bundles: [user-guide](${SITE}${BASE}/_llms-txt/user-guide.txt) · [mcp](${SITE}${BASE}/_llms-txt/mcp.txt) · [contributing](${SITE}${BASE}/_llms-txt/contributing.txt) + +`; + +// Per-page "See also" — curated by section. +// Keys are the doc slug (path from dist/ without .md extension, leading slash). 
+const RELATED = { + "/index": [ + ["Quick start", `${BASE}/start-here/quick-start/`], + ["What is OpenCodeHub?", `${BASE}/start-here/what-is-opencodehub/`], + ["MCP tools", `${BASE}/mcp/tools/`], + ["CLI reference", `${BASE}/reference/cli/`], + ], + + // Start here + "/start-here/what-is-opencodehub": [ + ["Install", `${BASE}/start-here/install/`], + ["Quick start", `${BASE}/start-here/quick-start/`], + ["MCP overview", `${BASE}/mcp/overview/`], + ], + "/start-here/install": [ + ["Quick start", `${BASE}/start-here/quick-start/`], + ["First query", `${BASE}/start-here/first-query/`], + ["Troubleshooting", `${BASE}/guides/troubleshooting/`], + ], + "/start-here/quick-start": [ + ["First query", `${BASE}/start-here/first-query/`], + ["Indexing a repo", `${BASE}/guides/indexing-a-repo/`], + ["Using with Claude Code", `${BASE}/guides/using-with-claude-code/`], + ["MCP tools", `${BASE}/mcp/tools/`], + ], + "/start-here/first-query": [ + ["CLI reference", `${BASE}/reference/cli/`], + ["MCP tools", `${BASE}/mcp/tools/`], + ["Indexing a repo", `${BASE}/guides/indexing-a-repo/`], + ], + + // Guides + "/guides/indexing-a-repo": [ + ["CLI reference — analyze", `${BASE}/reference/cli/`], + ["Troubleshooting", `${BASE}/guides/troubleshooting/`], + ["Language matrix", `${BASE}/reference/languages/`], + ], + "/guides/using-with-claude-code": [ + ["Using with Cursor", `${BASE}/guides/using-with-cursor/`], + ["Using with Codex", `${BASE}/guides/using-with-codex/`], + ["MCP overview", `${BASE}/mcp/overview/`], + ["MCP tools", `${BASE}/mcp/tools/`], + ], + "/guides/using-with-cursor": [ + ["Using with Claude Code", `${BASE}/guides/using-with-claude-code/`], + ["Using with Codex", `${BASE}/guides/using-with-codex/`], + ["MCP overview", `${BASE}/mcp/overview/`], + ], + "/guides/using-with-codex": [ + ["Using with Claude Code", `${BASE}/guides/using-with-claude-code/`], + ["Using with Cursor", `${BASE}/guides/using-with-cursor/`], + ["MCP overview", `${BASE}/mcp/overview/`], + ], + 
"/guides/using-with-windsurf": [ + ["Using with Claude Code", `${BASE}/guides/using-with-claude-code/`], + ["Using with Cursor", `${BASE}/guides/using-with-cursor/`], + ["MCP overview", `${BASE}/mcp/overview/`], + ], + "/guides/using-with-opencode": [ + ["Using with Claude Code", `${BASE}/guides/using-with-claude-code/`], + ["Using with Cursor", `${BASE}/guides/using-with-cursor/`], + ["MCP overview", `${BASE}/mcp/overview/`], + ], + "/guides/cross-repo-groups": [ + ["MCP tools — group_*", `${BASE}/mcp/tools/`], + ["CLI reference — group", `${BASE}/reference/cli/`], + ["Indexing a repo", `${BASE}/guides/indexing-a-repo/`], + ], + "/guides/ci-integration": [ + ["CLI reference — verdict / detect-changes", `${BASE}/reference/cli/`], + ["MCP tools — verdict", `${BASE}/mcp/tools/`], + ["Error codes", `${BASE}/reference/error-codes/`], + ], + "/guides/troubleshooting": [ + ["CLI reference — doctor", `${BASE}/reference/cli/`], + ["Error codes", `${BASE}/reference/error-codes/`], + ["Install", `${BASE}/start-here/install/`], + ], + + // Reference + "/reference/cli": [ + ["MCP tools", `${BASE}/mcp/tools/`], + ["Configuration", `${BASE}/reference/configuration/`], + ["Error codes", `${BASE}/reference/error-codes/`], + ], + "/reference/configuration": [ + ["CLI reference", `${BASE}/reference/cli/`], + ["Using with Claude Code", `${BASE}/guides/using-with-claude-code/`], + ["Troubleshooting", `${BASE}/guides/troubleshooting/`], + ], + "/reference/error-codes": [ + ["CLI reference", `${BASE}/reference/cli/`], + ["MCP overview", `${BASE}/mcp/overview/`], + ["Troubleshooting", `${BASE}/guides/troubleshooting/`], + ], + "/reference/languages": [ + ["Adding a language provider", `${BASE}/contributing/adding-a-language-provider/`], + ["Indexing a repo", `${BASE}/guides/indexing-a-repo/`], + ["Architecture overview", `${BASE}/architecture/overview/`], + ], + + // MCP + "/mcp/overview": [ + ["MCP tools", `${BASE}/mcp/tools/`], + ["Resources", `${BASE}/mcp/resources/`], + ["Prompts", 
`${BASE}/mcp/prompts/`], + ["Using with Claude Code", `${BASE}/guides/using-with-claude-code/`], + ], + "/mcp/tools": [ + ["MCP overview", `${BASE}/mcp/overview/`], + ["Resources", `${BASE}/mcp/resources/`], + ["Prompts", `${BASE}/mcp/prompts/`], + ["CLI reference", `${BASE}/reference/cli/`], + ], + "/mcp/resources": [ + ["MCP overview", `${BASE}/mcp/overview/`], + ["MCP tools", `${BASE}/mcp/tools/`], + ["Prompts", `${BASE}/mcp/prompts/`], + ], + "/mcp/prompts": [ + ["MCP overview", `${BASE}/mcp/overview/`], + ["MCP tools", `${BASE}/mcp/tools/`], + ["Resources", `${BASE}/mcp/resources/`], + ], + + // Contributing + "/contributing/overview": [ + ["Dev loop", `${BASE}/contributing/dev-loop/`], + ["Commit conventions", `${BASE}/contributing/commit-conventions/`], + ["IP hygiene", `${BASE}/contributing/ip-hygiene/`], + ["Adding a language provider", `${BASE}/contributing/adding-a-language-provider/`], + ], + "/contributing/dev-loop": [ + ["Commit conventions", `${BASE}/contributing/commit-conventions/`], + ["Testing", `${BASE}/contributing/testing/`], + ["Release process", `${BASE}/contributing/release-process/`], + ], + "/contributing/commit-conventions": [ + ["Release process", `${BASE}/contributing/release-process/`], + ["Dev loop", `${BASE}/contributing/dev-loop/`], + ["Contributing overview", `${BASE}/contributing/overview/`], + ], + "/contributing/release-process": [ + ["Commit conventions", `${BASE}/contributing/commit-conventions/`], + ["Contributing overview", `${BASE}/contributing/overview/`], + ["Supply chain", `${BASE}/architecture/supply-chain/`], + ], + "/contributing/ip-hygiene": [ + ["Supply chain", `${BASE}/architecture/supply-chain/`], + ["Contributing overview", `${BASE}/contributing/overview/`], + ["Dev loop", `${BASE}/contributing/dev-loop/`], + ], + "/contributing/adding-a-language-provider": [ + ["Language matrix", `${BASE}/reference/languages/`], + ["Architecture overview", `${BASE}/architecture/overview/`], + ["Testing", 
`${BASE}/contributing/testing/`], + ], + "/contributing/testing": [ + ["Dev loop", `${BASE}/contributing/dev-loop/`], + ["Architecture overview", `${BASE}/architecture/overview/`], + ["Determinism", `${BASE}/architecture/determinism/`], + ], + + // Architecture + "/architecture/overview": [ + ["Monorepo map", `${BASE}/architecture/monorepo-map/`], + ["ADRs", `${BASE}/architecture/adrs/`], + ["Determinism", `${BASE}/architecture/determinism/`], + ["Supply chain", `${BASE}/architecture/supply-chain/`], + ], + "/architecture/monorepo-map": [ + ["Architecture overview", `${BASE}/architecture/overview/`], + ["Adding a language provider", `${BASE}/contributing/adding-a-language-provider/`], + ["Dev loop", `${BASE}/contributing/dev-loop/`], + ], + "/architecture/adrs": [ + ["Architecture overview", `${BASE}/architecture/overview/`], + ["Determinism", `${BASE}/architecture/determinism/`], + ["Supply chain", `${BASE}/architecture/supply-chain/`], + ], + "/architecture/determinism": [ + ["Architecture overview", `${BASE}/architecture/overview/`], + ["Testing", `${BASE}/contributing/testing/`], + ["ADRs", `${BASE}/architecture/adrs/`], + ], + "/architecture/supply-chain": [ + ["IP hygiene", `${BASE}/contributing/ip-hygiene/`], + ["Architecture overview", `${BASE}/architecture/overview/`], + ["Release process", `${BASE}/contributing/release-process/`], + ], +}; + +function slugForFile(mdPath) { + const rel = path.relative(DIST, mdPath).replace(/\\/g, "/"); + return "/" + rel.replace(/\.md$/, ""); +} + +function seeAlso(slug) { + const links = RELATED[slug]; + if (!links) return ""; + const lines = links.map(([label, href]) => `* [${label}](${href})`).join("\n"); + return `\n\n## See also\n\n${lines}\n`; +} + +async function walk(dir) { + const ents = await fs.readdir(dir, { withFileTypes: true }); + const out = []; + for (const e of ents) { + const full = path.join(dir, e.name); + if (e.isDirectory()) out.push(...(await walk(full))); + else if (e.isFile() && 
e.name.endsWith(".md")) out.push(full); + } + return out; +} + +async function main() { + let patched = 0; + let skipped = 0; + const files = await walk(DIST); + for (const file of files) { + // Skip llms.txt-family (they're already the index). + if (file.endsWith("/llms.txt")) continue; + if (file.includes("/_llms-txt/")) continue; + + const original = await fs.readFile(file, "utf8"); + + // Idempotency guard — don't double-inject. + if (original.startsWith("> ## Documentation Index")) { + skipped += 1; + continue; + } + + const slug = slugForFile(file); + const body = INDEX_BANNER + original + seeAlso(slug); + await fs.writeFile(file, body, "utf8"); + patched += 1; + } + + console.warn( + `[inject-llm-nav] patched ${patched} .md files, skipped ${skipped} already-patched`, + ); +} + +main().catch((err) => { + console.error("[inject-llm-nav] failed:", err); + process.exitCode = 1; +}); diff --git a/packages/docs/src/assets/logo.svg b/packages/docs/src/assets/logo.svg new file mode 100644 index 0000000..3ad310a --- /dev/null +++ b/packages/docs/src/assets/logo.svg @@ -0,0 +1,8 @@ + + + + + + + + diff --git a/packages/docs/src/content.config.ts b/packages/docs/src/content.config.ts new file mode 100644 index 0000000..7fbcf2c --- /dev/null +++ b/packages/docs/src/content.config.ts @@ -0,0 +1,7 @@ +import { defineCollection } from "astro:content"; +import { docsLoader } from "@astrojs/starlight/loaders"; +import { docsSchema } from "@astrojs/starlight/schema"; + +export const collections = { + docs: defineCollection({ loader: docsLoader(), schema: docsSchema() }), +}; diff --git a/packages/docs/src/content/docs/architecture/adrs.md b/packages/docs/src/content/docs/architecture/adrs.md new file mode 100644 index 0000000..a235d2e --- /dev/null +++ b/packages/docs/src/content/docs/architecture/adrs.md @@ -0,0 +1,201 @@ +--- +title: Architecture decision records +description: Index of OpenCodeHub ADRs — every accepted and superseded decision. 
+sidebar: + order: 30 +--- + +Every load-bearing architectural choice in OpenCodeHub is recorded as +an ADR under `docs/adr/` in the repo. This page is the index. Click +through to the source ADR for the full context, candidates +considered, and consequences. + +## Accepted + +### ADR 0001 — Storage backend selection + +**Status:** Accepted (2026-04-18; supersedes prior SQLite recommendation). + +**Decision:** DuckDB via `@duckdb/node-api`, with the `hnsw_acorn` +community extension for filter-aware vector search, the official `fts` +extension for BM25, and recursive CTEs with `USING KEY` for +memory-efficient graph traversal. All three choices are MIT. + +SQLite + `sqlite-vec` was considered and rejected because FTS5 has no +filtered-HNSW story and `sqlite-vec` HNSW was still early when this +ADR was written. LanceDB was considered and kept as a future alternate +adapter behind the `IGraphStore` interface. + +[Read ADR 0001](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0001-storage-backend.md) + +### ADR 0002 — Rust core spike deferred to v2.1+ + +**Status:** Accepted (2026-04-20). + +**Decision:** v2.0 ships pure TypeScript. A Rust NAPI-RS native core +is deferred to v2.1+ because the measured p95 single-file incremental +edit on the 100-file fixture (~195-250 ms) is well under the 1 s hard +gate, and the extrapolated cold full analyze on a 100k-LOC fixture +(~3-5 s) is well under the 30 s trigger from the PRD. + +Reopens if cold analyze on a user-reported 500k+ LOC repo exceeds 4 +minutes, p95 incremental edit on 10k+ files exceeds 30 s, or a +`--cpu-prof` run shows a single function burning >40% of wall clock. + +[Read ADR 0002](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0002-rust-core-deferred.md) + +### ADR 0004 — Hierarchical embeddings with filter-aware HNSW + +**Status:** Accepted (shipped as P03 in v1.1). 
+ +**Decision:** One `embeddings` table with a `granularity` discriminator +column (`symbol | file | community`) and a single HNSW index. +Filter-aware traversal via `hnsw_acorn` keeps the one index serving +every tier — the ACORN-1 algorithm pushes the granularity predicate +into the graph walk. + +ColBERT / token-level embeddings were rejected (10–30× storage, +bespoke index). RAPTOR tree-traversal was rejected — collapsed-tree + +filter-aware HNSW matches the recall at lower latency. + +[Read ADR 0004](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0004-hierarchical-embeddings.md) + +### ADR 0005 — SCIP replaces LSP; repomix is output-side only + +**Status:** Accepted (2026-04-26). + +**Decision:** The four per-language LSP phases and `@opencodehub/lsp-oracle` +are deleted and replaced with a single `scip-index` phase backed by +`@opencodehub/scip-ingest`. Oracle-edge provenance switches from +per-LSP to `scip:@`. The old LSP-specific reason +suffix `+lsp-unconfirmed` is renamed to `+scip-unconfirmed` (the old +constant is aliased for one release). + +This cuts ~10.6k LOC of LSP client and per-language phases, removes +the pyright / typescript-language-server binary dependency from npm +install, and reshapes indexing from stateful per-symbol JSON-RPC to +one-shot protobuf ingestion. + +[Read ADR 0005](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0005-scip-replaces-lsp.md) + +### ADR 0006 — SCIP indexer CI pins + +**Status:** Accepted (2026-04-27). 
+ +**Decision:** Pin table for the per-language SCIP indexers the gym +installs: + +| Language | Indexer | Version | Install channel | +|------------|-----------------|------------------|-----------------------------------------| +| TypeScript | scip-typescript | 0.4.0 | `npm install -g @sourcegraph/scip-typescript` | +| Python | scip-python | 0.6.6 | `npm install -g @sourcegraph/scip-python` | +| Go | scip-go | v0.2.3 | `go install github.com/scip-code/scip-go/cmd/scip-go` | +| Rust | rust-analyzer | stable component | `rustup component add rust-analyzer` | +| Java | scip-java | 0.12.3 | `coursier install scip-java` | + +Versions are mirrored in `.github/workflows/gym.yml` and +`packages/gym/baselines/performance.json` so the regression harness +has a single source of truth. The ADR also explains why `scip-go` +resolves to the `scip-code` fork rather than upstream `sourcegraph`. + +[Read ADR 0006](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0006-scip-indexer-pins.md) + +### ADR 0007 — Artifact factory + +**Status:** Accepted (2026-04-27). + +**Decision:** Ship an artifact-generation skill family inside +`plugins/opencodehub/` that turns the graph into committed Markdown. +Four P0 skills (`codehub-document`, `codehub-pr-description`, +`codehub-onboarding`, `codehub-contract-map`), six `doc-*` subagents, +Phase 0 precompute, `.docmeta.json` + Phase E assembler, PostToolUse +staleness hook, discoverability patches. + +Scope exclusions (durable, not timeline): no hosted/managed/SaaS tier, +no remote/HTTP MCP server, no agent SDK, no `grounding_pack` +compositor tool, no own coding agent, no LLM-based PR review, no +IDE plugin/LSP, no model fine-tuning. + +[Read ADR 0007](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0007-artifact-factory.md) + +### ADR 0008 — Document pattern port + +**Status:** Accepted (2026-04-27). 
+ +**Decision:** Adopt the four-phase document pattern (Phase 0 +precompute → Phase AB parallel content → Phase CD parallel diagrams + +specialty → Phase E deterministic assembler), adapted for OpenCodeHub +in three ways: six subagents (our supply-chain tools pre-digest a lot +of output), group mode as a first-class topology, and an extended +assembler contract that handles both `path:LOC` and `repo:path:LOC` +citation forms. + +Preserves the pattern invariants verbatim: shared-context files on +disk (not in-prompt copy-paste), eight-section agent scaffold, +deterministic Phase E (no LLM call), `.docmeta.json` as source of +truth for `--refresh`, no YAML frontmatter on outputs. + +[Read ADR 0008](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0008-document-pattern-port.md) + +### ADR 0009 — Artifact output conventions + +**Status:** Accepted (2026-04-27). + +**Decision:** Single authoritative output contract. `.codehub/docs/` +gitignored default; `--committed` opts in to `docs/codehub/`. Backtick +citation grammar with a single Phase E regex covering both single-repo +and group-qualified forms. `.docmeta.json` schema v1 with +`cross_repo_refs[]` for group mode. Mermaid-only diagrams (no +SVG/PNG). 20-node diagram cap with a Legend table for overflow. +Deterministic structure; non-deterministic prose; disclaimer on every +generated `README.md`. + +[Read ADR 0009](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0009-artifact-output-conventions.md) + +### ADR 0010 — Three dogfood findings from 2026-04-27 + +**Status:** Accepted (2026-04-27). + +**Decision:** Three small fixes landed after dogfooding `codehub init` +and the artifact factory against a private two-repo workspace. + +1. `--embeddings` now defaults `--embeddings-workers` to `"auto"` at + the CLI layer. Single-worker ONNX inference on 98k nodes took 56 + minutes; parallel workers cut that to single-digit minutes. +2. 
`codehub list` adds a `HEALTH` column that flags dangling registry + entries (`⚠ missing path`) and cleaned indexes (`⚠ no graph.duckdb`), + plus a trailing advisory when any row is unhealthy. Caught a real + registry typo where the `path` no longer existed on disk. +3. Phase 0 of `codehub-document` now includes a schema preflight — + subagents consult `information_schema.columns` once (cached in + `.prefetch.md`) before composing SQL, preventing `Binder Error` + failures from columns that don't exist (e.g., `nodes.path` was + assumed; the real columns are `name`, `file_path`, `method`). + +Full observations, root-cause traces, and evidence pointers in the ADR. + +[Read ADR 0010](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0010-dogfood-findings-2026-04-27.md) + +## Superseded + +### ADR 0003 — CI toolchain pins (gopls ↔ Go, pnpm build-script allowlist) + +**Status:** Superseded by ADR 0006 (2026-04-27). + +The gopls pin matrix is historical — OpenCodeHub no longer runs +long-running language servers; code-graph oracle edges come from SCIP +indexers. See ADR 0005 for the migration and ADR 0006 for the current +pin table. The pnpm lifecycle-script guidance remains in force and is +reiterated in ADR 0006. + +[Read ADR 0003](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0003-ci-toolchain-pins.md) + +## Adding an ADR + +New architectural decisions go under `docs/adr/NNNN-slug.md` using the +next numeric prefix. Keep the headings: Status, Date, Context, +Decision, Consequences, plus any ADR-specific sections. + +If a new decision supersedes an older one, update the superseded +ADR's status line with a forward link and add a reverse link from the +new ADR's context section. 
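A minimal sketch of the numbering rule, assuming ADR files follow the `NNNN-slug.md` convention described above. `nextAdrFilename` is a hypothetical helper written for illustration; it is not part of the repo's tooling.

```typescript
// Hypothetical helper (not part of OpenCodeHub): derive the next ADR filename
// from the names already present under docs/adr/.
function nextAdrFilename(existing: readonly string[], title: string): string {
  // Collect the numeric prefixes of files that match NNNN-*.md.
  const prefixes: number[] = [];
  for (const name of existing) {
    const match = /^(\d{4})-.+\.md$/.exec(name);
    if (match !== null) prefixes.push(Number.parseInt(match[1] ?? "0", 10));
  }
  // The next prefix is one past the highest in use, starting at 0001.
  const next = prefixes.length === 0 ? 1 : Math.max(...prefixes) + 1;
  // Lowercase, hyphen-separated slug derived from the title.
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `${String(next).padStart(4, "0")}-${slug}.md`;
}
```

Taking the maximum existing prefix rather than the file count keeps the numbering correct even when the directory has gaps.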
diff --git a/packages/docs/src/content/docs/architecture/determinism.md b/packages/docs/src/content/docs/architecture/determinism.md new file mode 100644 index 0000000..08578f7 --- /dev/null +++ b/packages/docs/src/content/docs/architecture/determinism.md @@ -0,0 +1,125 @@ +--- +title: Determinism contract +description: Identical inputs produce byte-identical graph hash. Why it matters and how we test it. +sidebar: + order: 40 +--- + +OpenCodeHub makes one load-bearing promise to agents and humans alike: +**identical inputs produce a byte-identical graph hash**. If you +analyze the same commit twice on the same machine — or on a different +machine with the same toolchain — you get the same `graphHash`. That +is the determinism contract. + +## Why it matters + +Three concrete reasons: + +- **Reproducibility.** An agent that reports a blast radius at + `graphHash=abc123` and a human reviewer who re-runs `codehub + analyze` should see the same graph. If the hash diverges, the + agent's claim is not auditable. +- **Cache-safety.** `codehub status` and CI runners assume that two + analyze runs at the same commit have the same output. Without + determinism, incremental caches would drift silently and staleness + detection would get unreliable. +- **Regression testing.** Every `feat` or `refactor` that touches the + ingestion pipeline has to demonstrate it did not move the hash + unintentionally. Determinism makes that assertion possible in one + line of CI. + +## What "inputs" means + +An input is: + +- Source tree contents at the current commit. +- Toolchain versions (Node 22.x, pnpm 10.33.2, tree-sitter grammars + pinned in `packages/ingestion/package.json`, SCIP indexer versions + pinned in `.github/workflows/gym.yml` per ADR 0006). +- OpenCodeHub version (the monorepo version pinned in + `release-please`). +- Any user-supplied configuration (AGENTS.md overrides, `.codehub/` + config). 
+ +Anything outside that list — wall-clock time, process ID, file-system +inode ordering — must not influence the hash. The ingestion phases +are pure: inputs in, relations out, no ambient state. + +## How we test it + +Acceptance gate 6 is the regression test. It: + +1. Copies a fixture repo into two temp directories. +2. `git init` + commit each (identical tree → identical commit hash). +3. Runs `codehub analyze --force --skip-agents-md` against each, + capturing the printed `graphHash`. +4. Asserts the two hashes are byte-identical. + +If the hashes diverge, the gate fails and the acceptance run exits +non-zero. See `scripts/acceptance.sh` gate 6 for the exact script. + +Two adjacent gates reinforce the contract: + +- **Gate 10 — embeddings determinism.** Runs the same double-analyze + with `--embeddings`. Skipped if model weights are not present + locally. Advisory-only today because embeddings do not yet propagate + into the headline `graphHash`; the gate prints the hashes so a + reviewer can spot drift manually. +- **Gym replay (`mise run gym:replay`).** Bit-exact re-invocation of + the pinned SCIP indexer against the frozen manifest. Catches drift + introduced by an indexer bump before it lands in `main`. + +Full analyze and incremental re-analyze at the same commit must +produce identical hashes (this is asserted explicitly in the +determinism CI gate, not just on a clean tree). That is the "full vs +incremental byte-identical" invariant called out in ADR 0002. + +## The `--offline` contract + +`codehub analyze --offline` is a separate but related guarantee: +**zero sockets opened** during the run. The flag sets +`OCH_WASM_ONLY=1` (which also forces the WASM-only tree-sitter +runtime path) and disables every non-filesystem I/O path in the +pipeline. + +"Zero sockets" is the literal, measurable claim. It is testable by +running under `strace -e connect` or the equivalent on macOS +(`dtruss`); a socket attempt is a bug. 
+ +Why it matters: OpenCodeHub is local-first. Your code never leaves +your machine by default. The `--offline` flag makes that an enforceable +contract for users who need to prove it. + +## Sources of non-determinism we actively guard against + +Ingestion phases are reviewed for the usual suspects: + +- **Set / map iteration order.** All emitted records are sorted by a + stable key before being persisted. Providers that emit + `extractPropertyAccesses` must return records sorted by + `(enclosingSymbolId, propertyName, startLine)` — see the + `LanguageProvider` interface docstring. +- **`Date.now()`, `crypto.randomUUID()`, any `Math.random()`.** + Banned in ingestion code. The graph-hash computation uses content + hashes, never timestamps. +- **File-system walk order.** `readdir` results are sorted by byte + value before dispatch. +- **Parallel worker output ordering.** Worker pools emit into + per-worker buffers that are concatenated in deterministic file order + at join time. + +A fresh contributor reviewing a PR that adds a new phase should ask: +"If I ran this twice on the same commit, would I get the same +bytes?" If the answer is not obviously yes, the phase is wrong. + +## Related + +- [ADR 0001 — Storage backend](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0001-storage-backend.md) — + "Deterministic writes given identical INSERT order" is a listed + positive of DuckDB vs. engines with random header UUIDs. +- [ADR 0002 — Rust core deferred](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0002-rust-core-deferred.md) — + calls out the "full vs incremental `graphHash` byte-identical" + determinism CI gate explicitly. +- [Contributing overview — Tenets](/opencodehub/contributing/overview/#tenets) — + "Determinism is non-negotiable" is the first tenet in `CONTRIBUTING.md`. +- `scripts/acceptance.sh` gate 6 — the runtime regression test. 
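The sort-before-persist guard from the list of non-determinism sources above can be sketched as follows. The `PropertyAccess` shape and both function names are assumptions made for illustration; only the sort key `(enclosingSymbolId, propertyName, startLine)` comes from the text.

```typescript
// Illustrative record shape; the real interface may carry more fields.
interface PropertyAccess {
  readonly enclosingSymbolId: string;
  readonly propertyName: string;
  readonly startLine: number;
}

// Locale-independent comparison by UTF-16 code unit. This avoids
// localeCompare, whose result can vary with the host's locale data.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Returns a new array ordered by (enclosingSymbolId, propertyName, startLine),
// so the output does not depend on the order records were discovered in.
function sortPropertyAccesses(records: readonly PropertyAccess[]): PropertyAccess[] {
  return [...records].sort(
    (x, y) =>
      compareStrings(x.enclosingSymbolId, y.enclosingSymbolId) ||
      compareStrings(x.propertyName, y.propertyName) ||
      x.startLine - y.startLine,
  );
}
```

Feeding the same records in two different orders should yield the same serialized output, which is the property the determinism gate relies on.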
diff --git a/packages/docs/src/content/docs/architecture/embeddings.md b/packages/docs/src/content/docs/architecture/embeddings.md new file mode 100644 index 0000000..c0bc47b --- /dev/null +++ b/packages/docs/src/content/docs/architecture/embeddings.md @@ -0,0 +1,197 @@ +--- +title: Embeddings +description: Three backends in a priority cascade, three tiers keyed by a granularity discriminator, one HNSW index with filter-aware ACORN traversal. +sidebar: + order: 50 +--- + +Embeddings are optional. When enabled, the pipeline produces vectors +at three granularities (symbol, file, community) from one of three +backends (ONNX local, HTTP/OpenAI-compat, SageMaker) and persists +them in one DuckDB table served by one HNSW index. This page covers +the backend cascade, the tier model, the storage shape, and why +`WHERE granularity='symbol'` does not collapse recall. + +## Backend cascade + +`openEmbedder(opts)` selects exactly one backend. The cascade is, in +order, **SageMaker → HTTP → ONNX**: + +1. If `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` is set, the SageMaker + backend runs. SigV4 auth, TEI-native wire format (raw + `list[list[float]]`, not OpenAI-wrapped), dynamic-import + credential + soft-fail. +2. Else if `CODEHUB_EMBEDDING_URL` + `CODEHUB_EMBEDDING_MODEL` are set, + the generic OpenAI-compatible HTTP backend runs. Base URL gets + `/embeddings` appended; 30 s timeout, 2 retries. +3. Else the local ONNX backend runs. Deterministic path; weights + loaded from the setup directory managed by + `@opencodehub/embedder/paths`. + +The **offline invariant** is enforced in three places +(`openEmbedder`, `tryOpenHttpEmbedder`, and the ingestion phase): +remote-env-var-set together with `offline=true` throws rather than +silently falling through. + +```mermaid +flowchart LR + opts[openEmbedder opts] --> sm{SAGEMAKER<br/>_ENDPOINT set?} + sm -- yes --> smem[SageMaker<br/>backend] + sm -- no --> http{URL + MODEL<br/>set?} + http -- yes --> httpem[HTTP backend<br/>OpenAI-compat] + http -- no --> onnxem[ONNX local<br/>backend] + smem --> embedder[Embedder] + httpem --> embedder + onnxem --> embedder +``` + +## Per-backend details + +### ONNX local + +The default. Deterministic 768-dim embeddings from +`Alibaba-NLP/gte-modernbert-base`. Weights live in the directory +managed by `@opencodehub/embedder/paths`; missing weights throw +`EmbedderNotSetupError`, which `codehub setup --embeddings` fixes. + +A Piscina worker pool (`embedder-pool.ts`) spins up when +`embeddingsWorkers >= 2`, running ONNX inference across worker +threads. Single-worker mode is the default and is good enough for +most repos. + +### HTTP (OpenAI-compatible) + +A generic path for any endpoint that speaks the OpenAI embeddings +wire format: + +- `CODEHUB_EMBEDDING_URL` — base URL (`/embeddings` is appended). +- `CODEHUB_EMBEDDING_MODEL` — model id passed through verbatim. +- `CODEHUB_EMBEDDING_DIMS` — dimensions (default 768). +- `CODEHUB_EMBEDDING_API_KEY` — bearer token. + +30 s timeout, 2 retries with 1 s backoff. + +### SageMaker + +Runtime client is dynamically imported, so a repo that does not use +SageMaker does not pay the AWS SDK bundle cost. Missing credentials +trigger a credential soft-fail (`CredentialsProviderError`, +`NoCredentialsError`, `ExpiredTokenException`) rather than an +exception — the phase reports `skippedReason: "no-credentials"` and +carries on. + +ModelId stamping is explicit to prevent silent cross-backend +pollution of the `embeddings.model` column: SageMaker rows carry +`gte-modernbert-base/sagemaker:`, ONNX rows carry +`gte-modernbert-base/fp32`, HTTP rows pass the configured model id +through. See the durable lesson linked below for the full pattern +(dynamic import, structural-typing seam, 413 split-retry). + +## Three tiers + +The `EmbeddingGranularity` discriminator is `"symbol" | "file" | +"community"`.
Each tier feeds one kind of query: + +| Tier | Unit | Character cap | +|-----------|------------------------------------------------------|----------------------------------| +| symbol | Callable or declaration (Function, Method, Constructor, Route, Tool, Class, Interface) | 1200 (body only; fused signature + summary add on top) | +| file | One vector per scanned file | 8192 tokens (`FILE_CHAR_CAP = 8192 * 4`) | +| community | One vector per Community node | N/A — built from member symbols | + +The default is `["symbol"]` to preserve v1.0 behavior. File and +community tiers opt in via `PipelineOptions.embeddingsGranularity`. + +Symbol-tier fusion combines `signature + summary + body` into the +embedded text when an LLM summary exists for the node. See +[Summarization and fusion](/opencodehub/architecture/summarization-and-fusion/) +for the formula. + +## Single HNSW index + +The storage shape is deliberately simple: one `embeddings` table, +one HNSW index over the `vector` column, one `granularity` column as +a discriminator. The v1.2 schema adds `granularity DEFAULT 'symbol'` +so v1.0 files auto-migrate in place. + +```sql +CREATE INDEX idx_embeddings_vec + ON embeddings USING HNSW (vector); +``` + +All three tiers share this index. Granularity filtering is pushed as +`WHERE e.granularity IN (…)` into the ACORN predicate, so selective +filters narrow the candidate set during traversal rather than being +applied after the fact. + +## Filter-aware HNSW (ACORN-1) + +The `hnsw_acorn` extension's ACORN-1 algorithm is the reason filters +like `WHERE language='python'` or `WHERE granularity='community'` +actually return results. Stock `duckdb-vss` post-filters: it walks +the top-k by cosine distance and drops rows that fail the predicate, +which collapses to zero recall under selective filters. ACORN pushes +the predicate into the traversal itself. 
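To make the recall collapse concrete, here is a toy comparison of post-filtering against filtering during selection. It is a hedged illustration only: brute-force sorting stands in for the HNSW graph walk, and every name below (`Candidate`, `postFilterTopK`, `filterFirstTopK`, the sample rows) is invented for the example rather than taken from `hnsw_acorn` or `duckdb-vss`.

```typescript
type Granularity = "symbol" | "file" | "community";

interface Candidate {
  readonly id: string;
  readonly granularity: Granularity;
  readonly distance: number; // smaller is closer
}

const byDistance = (a: Candidate, b: Candidate): number => a.distance - b.distance;

// Post-filter: take the k nearest rows first, then drop rows failing the predicate.
function postFilterTopK(
  rows: readonly Candidate[],
  k: number,
  keep: (c: Candidate) => boolean,
): Candidate[] {
  return [...rows].sort(byDistance).slice(0, k).filter(keep);
}

// Filter-first: apply the predicate during selection, then take the k nearest.
// Brute force stands in here for ACORN's in-traversal predicate check.
function filterFirstTopK(
  rows: readonly Candidate[],
  k: number,
  keep: (c: Candidate) => boolean,
): Candidate[] {
  return rows.filter(keep).sort(byDistance).slice(0, k);
}

// Made-up data: five close symbol rows and one farther community row.
const rows: Candidate[] = [
  { id: "s1", granularity: "symbol", distance: 0.1 },
  { id: "s2", granularity: "symbol", distance: 0.2 },
  { id: "s3", granularity: "symbol", distance: 0.3 },
  { id: "s4", granularity: "symbol", distance: 0.4 },
  { id: "s5", granularity: "symbol", distance: 0.5 },
  { id: "c1", granularity: "community", distance: 0.9 },
];

const onlyCommunity = (c: Candidate): boolean => c.granularity === "community";
```

With `k = 3`, `postFilterTopK(rows, 3, onlyCommunity)` returns nothing because the three nearest rows are all symbols, while `filterFirstTopK(rows, 3, onlyCommunity)` still finds `c1`. That second behavior is what ACORN's in-traversal predicate is meant to preserve.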
+ +Two DuckDB pragmas make this work: + +- `SET hnsw_acorn_threshold = 1.0` — force ACORN on every query + (default would skip ACORN on low-selectivity predicates). +- `SET hnsw_enable_experimental_persistence = true` — persist the + HNSW index across restarts. + +If `hnsw_acorn` fails to install or load (first-run requires network +to pull from the DuckDB community extension repo), the adapter falls +back to `vss` with a post-filter warning. If both fail, +`vectorExtension='none'` disables vector search entirely — queries +return zero rows plus a surfaced warning rather than crashing. + +## RaBitQ quantization + +`hnsw_acorn` supports RaBitQ quantization, documented at 21-30× +memory reduction versus fp32 vectors. It is a capability of the +extension rather than a separately-configured knob in OpenCodeHub — +enabling `hnsw_acorn` enables it. + +## Configuration knobs + +- `PipelineOptions.embeddings: boolean` — master on/off (default off). +- `PipelineOptions.embeddingsVariant: "fp32" | "int8"` — ONNX variant. +- `PipelineOptions.embeddingsModelDir` — override ONNX weights dir. +- `PipelineOptions.embeddingsGranularity` — tier selection (default + `["symbol"]`). +- `PipelineOptions.embeddingsWorkers` — Piscina pool size for ONNX. +- `PipelineOptions.embeddingsBatchSize` — default 32. +- `DuckDbStoreOptions.embeddingDim` — default 768. +- Env vars: `CODEHUB_EMBEDDING_SAGEMAKER_ENDPOINT` / `_REGION` / + `_MODEL` / `_DIMS`; `CODEHUB_EMBEDDING_URL` / `_MODEL` / `_DIMS` / + `_API_KEY`. + +## Gotchas + +- **ONNX fallback on silent SageMaker failure is blocked.** A + remote-env-var-set + offline=true combination throws. A missing + SageMaker endpoint with no env vars just picks ONNX — that is the + intended cascade, not a failure. +- **`vectorExtension='none'` is a real state.** Queries return no + rows and surface an extension warning. This is the air-gapped / + offline / extension-broken state; it is not an exception. 
+- **Graph-hash independence.** The embeddings phase does not + contribute to `graphHash` — embeddings are optional and + probabilistic across backends. Gate 10 (the embeddings determinism + gate) is advisory-only for this reason. +- **Content-hash keying.** `hashText(granularity, text)` is + `sha256(\0)`. Changing granularity + changes the hash, so the same text embedded at two tiers produces + two distinct cache rows. + +## Further reading + +- [ADR 0001 — Storage backend](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0001-storage-backend.md) + — why DuckDB + `hnsw_acorn`. +- [ADR 0004 — Hierarchical embeddings](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0004-hierarchical-embeddings.md) + — one table, three granularities, one HNSW index. +- [Summarization and fusion](/opencodehub/architecture/summarization-and-fusion/) + — where the symbol-tier text comes from. +- Durable lesson: `api-patterns/sagemaker-embedder-backend.md` — + dynamic-import + credential soft-fail + structural-typing seam + + modelId stamping + 413 split-retry. diff --git a/packages/docs/src/content/docs/architecture/monorepo-map.md b/packages/docs/src/content/docs/architecture/monorepo-map.md new file mode 100644 index 0000000..d3a111a --- /dev/null +++ b/packages/docs/src/content/docs/architecture/monorepo-map.md @@ -0,0 +1,81 @@ +--- +title: Monorepo map +description: Every OpenCodeHub workspace package, its folder, purpose, versioning, and key exports. +sidebar: + order: 20 +--- + +OpenCodeHub is a pnpm workspace under `packages/*`. Fourteen TypeScript +packages plus one Python harness (15 total). Ten of the TypeScript +packages are versioned independently by release-please; the rest are +internal harnesses or the Starlight docs site that ride along with the +monorepo version. The Python eval lives outside the pnpm package graph +entirely. + +## All packages + +| Package | Folder | Versioned? 
| Purpose | Key surface | +|-----------------------------|------------------------|------------|-----------------------------------------------------------|------------------------------------------------| +| `@opencodehub/analysis` | `packages/analysis` | yes | `impact`, `rename`, `detect_changes`, staleness logic | `computeImpact()`, `computeRename()` | +| `@opencodehub/cli` | `packages/cli` | yes | User-facing CLI | `codehub` bin | +| `@opencodehub/core-types` | `packages/core-types` | yes | Shared graph schema, `LanguageId`, determinism primitives | `LanguageId`, `SCIP_PROVENANCE_PREFIXES` | +| `@opencodehub/embedder` | `packages/embedder` | yes | Deterministic ONNX embedder (gte-modernbert-base) | `embed()`, `embedInt8()` | +| `@opencodehub/ingestion` | `packages/ingestion` | yes | 12-phase analyze pipeline, tree-sitter, language providers | `LanguageProvider` registry, pipeline phases | +| `@opencodehub/mcp` | `packages/mcp` | yes | stdio MCP server, tools, resources, prompts | `buildServer()` | +| `@opencodehub/sarif` | `packages/sarif` | yes | SARIF 2.1.0 Zod schemas, merge + enrich | `SarifLogSchema`, `mergeSarif()` | +| `@opencodehub/scanners` | `packages/scanners` | yes | Priority-1 scanner wrappers (semgrep, osv, etc.) 
| Subprocess runners | +| `@opencodehub/search` | `packages/search` | yes | Hybrid BM25 + RRF search | `hybridSearch()` | +| `@opencodehub/storage` | `packages/storage` | yes | DuckDB graph store (`@duckdb/node-api` + `hnsw_acorn` + `fts`) | `IGraphStore` | +| `@opencodehub/docs` | `packages/docs` | no | Starlight documentation site (Astro + starlight-llms-txt) | `pnpm -F @opencodehub/docs build` | +| `@opencodehub/gym` | `packages/gym` | no | SCIP-indexer differential gym + regression gates | `codehub-gym` bin | +| `@opencodehub/scip-ingest` | `packages/scip-ingest` | no | `.scip` protobuf reader + per-language indexer runners | `readScipFile()`, per-language runners | +| `@opencodehub/summarizer` | `packages/summarizer` | no | Structured code-symbol summarizer (Bedrock Converse + Zod) | `summarizeSymbol()` | +| `opencodehub-eval` | `packages/eval` | no (Python) | Parity + regression eval harness (98 core cases) | `pytest` suite driven by MCP stdio | + +## Versioning + +Ten packages get their own tag and changelog via `release-please`. They +are the public surface — anyone who takes a `peerDependency` on +OpenCodeHub gets versioned guarantees on these. + +The five unversioned packages (`docs`, `gym`, `scip-ingest`, +`summarizer`, `eval`) are harnesses, the documentation site, or +internal-only dependencies with no external consumer at v1.0. They move +in lockstep with the monorepo but do not publish independent tags. See +[Release process](/opencodehub/contributing/release-process/) for the +full table. + +## The CLI is the only bin + +The only packaged executable is `codehub` under `@opencodehub/cli`. +`@opencodehub/gym` exposes a `codehub-gym` bin for internal harness +use; it is not distributed separately. + +Every other package is a library imported by `cli`, `mcp`, or the +ingestion pipeline. + +## Dependency direction + +Think of it as two layers: + +- **Leaf libraries.** `core-types`, `sarif`, `embedder`, `storage`, + `search`, `summarizer`, `scip-ingest`. 
+- **Orchestrators.** `ingestion`, `analysis`, `scanners`, `mcp`, + `gym`, `cli`. + +Orchestrators import leaves; leaves do not import orchestrators. The +TypeScript project-references graph enforces this via +`tsc --noEmit`. + +## Python eval lives outside the graph + +`packages/eval` is a uv-managed Python project (Python 3.12, pytest, +anyio, mcp). It sits in the monorepo for colocation but is not in the +pnpm workspace. Run it with `mise run test:eval`; see +[Testing](/opencodehub/contributing/testing/#python-eval-harness). + +## Related files + +- `pnpm-workspace.yaml` — `packages/*` glob. +- `.release-please-config.json` — which packages are versioned. +- `packages/*/package.json` — per-package `name` and `description`. diff --git a/packages/docs/src/content/docs/architecture/overview.md b/packages/docs/src/content/docs/architecture/overview.md new file mode 100644 index 0000000..2dade9c --- /dev/null +++ b/packages/docs/src/content/docs/architecture/overview.md @@ -0,0 +1,172 @@ +--- +title: Architecture overview +description: Six-phase pipeline from source tree to MCP — parse, resolve, augment, index, cluster, serve — with links to each phase's deep page. +sidebar: + order: 10 +--- + +OpenCodeHub turns a source tree into a typed graph that agents can +query over MCP. The pipeline has six phases, and each phase has one +job. This page is the index. Each section names a phase, states its +one job, and links to the page that covers it in depth. + +## Pipeline at a glance + +```mermaid +flowchart LR + tree[Source tree] --> parse[Parse] + parse --> resolve[Resolve] + resolve --> augment[Augment<br/>SCIP] + augment --> index[Index<br/>BM25 + HNSW] + index --> cluster[Cluster<br/>communities + processes] + cluster --> serve[Serve<br/>MCP] +``` + +Fifteen tree-sitter grammars produce a unified `ParseCapture` stream. +Per-language resolvers turn captures into typed relations. Five SCIP +indexers upgrade heuristic edges to compiler-grade references where +available. DuckDB persists the graph, BM25, and HNSW in one embedded +file. Communities and processes are precomputed. An stdio MCP server +answers agent queries. + +## Where the data lives + +```mermaid +flowchart LR + subgraph duckdb[".codehub/graph.duckdb"] + nodes[(nodes)] + edges[(edges)] + embeddings[(embeddings)] + findings[(nodes WHERE<br/>kind='Finding')] + end + fts["fts_main_nodes_name<br/>(BM25)"] --- nodes + hnsw["idx_embeddings_vec<br/>(HNSW + ACORN)"] --- embeddings +``` + +Every tier — symbol, file, community — lives in one `embeddings` +table keyed by a `granularity` discriminator, so one HNSW index serves +all three. Findings reuse the `nodes` table with `kind='Finding'`. + +## The six phases + +### 1. Parse — source tree to captures + +One job: lex every file with its tree-sitter grammar and emit a +`ParseCapture[]` stream in a unified schema (tag, text, start/end +line+col, nodeType). Lines are 1-indexed, columns 0-indexed. + +Fifteen languages are registered via a compile-time exhaustive +`satisfies Record` table: TypeScript, +TSX, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin, +Swift, PHP, Dart. + +See [Parsing and resolution](/opencodehub/architecture/parsing-and-resolution/). + +### 2. Resolve — captures to typed relations + +One job: turn captures into typed edges (`DEFINES`, `HAS_METHOD`, +`HAS_PROPERTY`, `IMPORTS`, `EXTENDS`, `IMPLEMENTS`, `CALLS`) by +resolving names against a per-language symbol scope. + +A three-tier resolver handles the common case (same-file 0.95, +import-scoped 0.9, global 0.5). Python and the TS family opt into a +stack-graphs backend for tighter cross-module resolution. Heritage +linearization is per-language: C3, first-wins, single-inheritance, or +no-op. + +See [Parsing and resolution](/opencodehub/architecture/parsing-and-resolution/). + +### 3. Augment — SCIP indexers upgrade edges + +One job: run each repo's SCIP indexer, parse the resulting `.scip` +protobuf, and emit `CALLS` edges with `confidence=1.0` and +`reason=scip:@`. The `confidence-demote` phase then +rescales any heuristic edge the SCIP oracle contradicts from 0.5 to +0.2. + +Five indexers: scip-typescript 0.4.0, scip-python 0.6.6, scip-go +v0.2.3, scip-java 0.12.3, rust-analyzer (stable channel). Pins live +in `.github/workflows/gym.yml`. + +See [SCIP reconciliation](/opencodehub/architecture/scip-reconciliation/). + +### 4.
Index — BM25, HNSW, and scanners + +One job: persist the graph into DuckDB with search indexes wired up. + +- **`fts`** — BM25 over symbol names, docstrings, file paths. +- **`hnsw_acorn`** — filter-aware HNSW (ACORN-1 traversal, RaBitQ + quantization, 21-30× memory reduction). `vss` is the fallback. +- **Recursive CTEs with `USING KEY`** — multi-hop graph traversal. + +Embeddings are optional, gated on `PipelineOptions.embeddings`. Three +tiers (symbol, file, community) live in one table under one HNSW +index. A three-backend cascade selects one: ONNX local, OpenAI-compat +HTTP, or SageMaker. + +Scanners run separately through the `scan` MCP tool, merging SARIF +onto disk and indexing findings back into the `nodes` table. + +See [Embeddings](/opencodehub/architecture/embeddings/) and +[Scanners and SARIF](/opencodehub/architecture/scanners-and-sarif/). + +### 5. Cluster — communities and processes + +One job: group related symbols into communities (Louvain) and walk +call chains to produce processes (handler → service → data access). +Both are precomputed so MCP tools read them directly. + +Symbol-level LLM summaries are produced here when enabled. Summaries +are fused into the symbol-tier embedding text at ingestion time (not +query time) so retrieval runs against a pre-fused vector. + +See [Summarization and fusion](/opencodehub/architecture/summarization-and-fusion/). + +### 6. Serve — MCP over stdio + +One job: expose the graph through an stdio MCP server (`codehub +mcp`). Every tool returns a structured envelope with `next_steps` and, +when the index lags HEAD, a `_meta["codehub/staleness"]` block. No +daemon, no socket, no remote state. + +See [MCP tool map](/opencodehub/mcp/tools/) for the full +tool list. + +## Why this shape + +OpenCodeHub's primary user is an AI coding agent that needs callers, +callees, processes, and blast radius in one tool call — and needs the +answer to be reproducible across runs.
The six-phase shape is the +cheapest configuration that hits all three: + +- **Local + offline.** DuckDB is embedded. Indexing reads the + filesystem, nothing else. `codehub analyze --offline` opens zero + sockets. +- **Deterministic.** Phases are pure: same inputs → same outputs, + byte-identical `graphHash`. See [Determinism](/opencodehub/architecture/determinism/). +- **Apache-2.0, every transitive dep on the permissive allowlist.** + DuckDB is MIT, `hnsw_acorn` is MIT, tree-sitter is MIT. No BSL, no + AGPL, no source-available engines in the core. See + [Supply chain](/opencodehub/architecture/supply-chain/). + +## Reference ADRs + +| ADR | Topic | +|-----|-----------------------------------------------------------------------------| +| 0001 | Storage backend selection — why DuckDB + `hnsw_acorn` + `fts` | +| 0002 | Rust core deferred to v2.1+ — why v2.0 stays pure TypeScript | +| 0004 | Hierarchical embeddings — one table, three granularities, filter-aware HNSW | +| 0005 | SCIP replaces LSP — compiler-grade edges without long-running language servers | +| 0006 | SCIP indexer CI pins — current version table per language | + +See [ADRs](/opencodehub/architecture/adrs/) for the full list and +decisions. + +## Related pages + +- [Monorepo map](/opencodehub/architecture/monorepo-map/) — every + workspace package and what it owns. +- [Determinism](/opencodehub/architecture/determinism/) — the + reproducibility contract and how it is tested. +- [Supply chain](/opencodehub/architecture/supply-chain/) — SBOM, + license allowlist, vulnerability posture. 
diff --git a/packages/docs/src/content/docs/architecture/parsing-and-resolution.md b/packages/docs/src/content/docs/architecture/parsing-and-resolution.md new file mode 100644 index 0000000..6c57b51 --- /dev/null +++ b/packages/docs/src/content/docs/architecture/parsing-and-resolution.md @@ -0,0 +1,201 @@ +--- +title: Parsing and resolution +description: How 15 tree-sitter grammars produce a unified capture stream, how per-language resolvers turn captures into typed edges, and where stack-graphs opt in. +sidebar: + order: 20 +--- + +This page covers phases 1 and 2 of the pipeline: from source files to +typed `CALLS` / `EXTENDS` / `IMPLEMENTS` / `FETCHES` / `ACCESSES` +edges on the graph. The goal is to explain the moving parts — +grammars, the provider registry, resolver flavors, and import +semantics — well enough that adding a new language is a mechanical +exercise. + +## The tree-sitter layer + +Fifteen grammars are pinned through `packages/ingestion/package.json` +and loaded by a worker pool that clamps to `max(2, min(cpus, 8))` +threads. Each file is hashed and the resulting `ParseCapture[]` is +cached keyed on `(sha256, grammarSha, SCHEMA_VERSION)`, so a subsequent +analyze with the same content skips tree-sitter entirely. + +`ParseCapture` is the shared per-capture schema emitted by the worker +— one interface with 7 readonly fields: + +```ts +interface ParseCapture { + readonly tag: string; // e.g. "definition.function" + readonly text: string; + readonly startLine: number; // 1-indexed + readonly endLine: number; + readonly startCol: number; // 0-indexed + readonly endCol: number; + readonly nodeType: string; +} +``` + +The tag vocabulary is a clean-room set (`definition.*`, +`reference.*`, `doc`, `name`) that decouples the downstream providers +from each grammar's internal node naming. 
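The worker clamp and the parse-cache key described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `clampWorkerCount`, `parseCacheKey`, the `":"` separator, and the `SCHEMA_VERSION` value are all assumptions introduced here. Only the formula `max(2, min(cpus, 8))` and the key tuple `(sha256, grammarSha, SCHEMA_VERSION)` come from the text.

```typescript
import { createHash } from "node:crypto";

// Hypothetical value: the real SCHEMA_VERSION is not shown in this page.
const SCHEMA_VERSION = "1";

/** Clamp the worker count to max(2, min(cpus, 8)). */
export function clampWorkerCount(cpus: number): number {
  return Math.max(2, Math.min(cpus, 8));
}

/**
 * Build a cache key from (sha256 of the file content, grammarSha,
 * schema version). Joining with ":" is an illustrative choice only.
 */
export function parseCacheKey(
  source: string,
  grammarSha: string,
  schemaVersion: string = SCHEMA_VERSION,
): string {
  const contentSha = createHash("sha256").update(source, "utf8").digest("hex");
  return [contentSha, grammarSha, schemaVersion].join(":");
}
```

Because all three components feed the key, changing the file content, upgrading a grammar, or bumping the schema version each produces a different key, so stale captures are never returned for a lookup.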
+ +## The language provider registry + +Providers are registered via a compile-time exhaustive table: + +```ts +export const PROVIDERS = { + typescript: typescriptProvider, + tsx: tsxProvider, + javascript: javascriptProvider, + python: pythonProvider, + go: goProvider, + rust: rustProvider, + java: javaProvider, + csharp: csharpProvider, + c: cProvider, + cpp: cppProvider, + ruby: rubyProvider, + kotlin: kotlinProvider, + swift: swiftProvider, + php: phpProvider, + dart: dartProvider, +} satisfies Record; +``` + +The `satisfies` clause is load-bearing: if `LanguageId` gains a new +member and the table does not, the build fails. `getProvider(lang)` +and `listProviders()` are the two helpers the pipeline uses to reach +providers without hard-coding names. + +Each `LanguageProvider` exposes six hooks — `extractDefinitions`, +`extractCalls`, `extractImports`, `extractHeritage`, +`detectOutboundHttp`, `extractPropertyAccesses` — plus configuration +fields (`importSemantics`, `mroStrategy`, optional +`resolverStrategyName`). + +## Per-language resolvers + +Name resolution runs in two tiers. The default walker resolves a +reference against three scopes in order: + +| Scope | Confidence | +|--------------|------------| +| Same file | 0.95 | +| Import-scoped| 0.9 | +| Global | 0.5 | + +Heritage linearization — which matters when `super.foo()` can come +from any of several bases — is selected per language. Four flavors: + +| Strategy | Languages | +|----------------------|-----------------------------------------------------| +| `c3` | Python, Kotlin, Dart, C++, Ruby | +| `first-wins` | TypeScript, TSX, JavaScript, Rust | +| `single-inheritance` | Java, C#, PHP, Swift | +| `none` | Go, C | + +The `STRATEGIES` record in `providers/resolution/mro.ts` is the source +of truth; each provider declares `mroStrategy: MroStrategyName` and +the resolver dispatches on it. 
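As a rough sketch of the three-scope lookup in the confidence table above: the walker tries each scope in order and stamps the first hit with that scope's confidence. The type names, the `resolveName` function, and the map-based scopes are hypothetical stand-ins; only the scope order and the 0.95 / 0.9 / 0.5 values come from the text.

```typescript
type NodeId = string;

// Hypothetical shape: one name-to-node map per scope.
interface ResolutionScopes {
  sameFile: ReadonlyMap<string, NodeId>;
  imported: ReadonlyMap<string, NodeId>;
  global: ReadonlyMap<string, NodeId>;
}

interface Resolution {
  target: NodeId;
  confidence: number;
}

// Scope order and confidences as listed in the table above.
const TIERS = [
  { scope: "sameFile", confidence: 0.95 },
  { scope: "imported", confidence: 0.9 },
  { scope: "global", confidence: 0.5 },
] as const;

/** Return the first scope that knows `name`, or undefined if none does. */
export function resolveName(
  name: string,
  scopes: ResolutionScopes,
): Resolution | undefined {
  for (const tier of TIERS) {
    const target = scopes[tier.scope].get(name);
    if (target !== undefined) {
      return { target, confidence: tier.confidence };
    }
  }
  return undefined;
}
```

The first match wins, so a same-file definition shadows an imported or global symbol of the same name.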
+ +## Import-semantic taxonomy + +The provider contract enforces one of three import semantics: + +| Value | What it means | Example languages | +|--------------------|-----------------------------------------------------|-----------------------| +| `named` | Imports bring specific names into scope. | TS/TSX/JS, Rust, Java, C# | +| `namespace` | Imports bring a namespace; members accessed via dot.| Python | +| `package-wildcard` | Whole package is re-exported as one bag. | Go, Kotlin | + +The `package-wildcard` value has a concrete consequence: the resolver +does not chase cross-module names through the import, because the +package re-exports everything and the exact origin file is undecidable +from the import site alone. Go's `import "fmt"` followed by +`fmt.Println` does not tell the resolver which file inside `fmt` +defines `Println`; the SCIP augmenter fills that in when present. + +## What captures become + +Parse emits seven edge types directly (`DEFINES`, `HAS_METHOD`, +`HAS_PROPERTY`, `IMPORTS`, `EXTENDS`, `IMPLEMENTS`, `CALLS`). Two +more edge types come from later dedicated phases: + +- **`ACCESSES`** (read/write) — emitted by the `accesses` phase from + `extractPropertyAccesses` captures. When no matching field is + found, a synthetic `Property:unresolved:` stub anchors the + edge rather than dropping it. Intentional anchoring, not a bug. +- **`FETCHES`** — emitted by the `fetches` phase from + `detectOutboundHttp` captures. When no local `Route` matches the + URL pattern, the edge targets `fetches:unresolved:` pseudo-nodes + that `group_contracts` recognizes for cross-repo contract mapping. 
+ +## Stack-graphs opt-in + +Four providers opt into the stack-graphs resolver by setting +`resolverStrategyName: "stack-graphs"`: + +| Provider | Default resolver confidence gain | +|------------|----------------------------------| +| typescript | Tighter cross-file lookup | +| tsx | Same as typescript | +| javascript | Same as typescript | +| python | Attribute resolution across modules | + +Stack-graphs adds incremental, precise name-binding over the +heuristic three-tier walker — it models scope, inheritance, and +imports as a graph whose path-finding produces a deterministic binding. +The other 11 providers fall back to the default walker, which is +cheaper and good enough given that SCIP is expected to augment the +compiled languages. + +## The flow, end-to-end + +```mermaid +sequenceDiagram + participant File as Source file + participant Scan as scanPhase + participant Parse as parsePhase + participant Worker as ParsePool (tree-sitter) + participant Provider as LanguageProvider + participant Resolver as Resolver (default or stack-graphs) + participant Graph as KnowledgeGraph + + File->>Scan: scanned file metadata + Scan->>Parse: file + language + Parse->>Worker: grammar(source) + cache probe + Worker-->>Parse: ParseCapture[] + Parse->>Provider: extractDefinitions / Calls / Imports / Heritage + Provider-->>Parse: typed captures + Parse->>Resolver: resolve(name, scope) + Resolver-->>Parse: (target nodeId, confidence) + Parse->>Graph: DEFINES / CALLS / EXTENDS / IMPLEMENTS / ... +``` + +Stack-graphs-enabled providers route through the +`stackGraphsRouter` side of `getResolver()` instead of the default +walker; the rest of the pipeline is unchanged. + +## Gotchas + +- **Properties without a matching field produce synthetic + `Property:unresolved:` stubs**, not dropped edges. Queries + that BM25-rank over node IDs will see these stubs compete with real + symbols. See the durable lesson linked below. 
+- **`FETCHES` without a local route emit to `fetches:unresolved:` + pseudo-targets**. These are recognized by `group_contracts` when + fanning out cross-repo contract analysis. +- **`DEBUG_PHASE_MEM=1`** brackets `graphHash` with stderr telemetry + for memory profiling. +- **`PipelineOptions.force`** bypasses parse-cache lookups (still + writes fresh entries). Useful for debugging but not day-to-day. + +## Further reading + +- [Adding a language provider](/opencodehub/contributing/adding-a-language-provider/) + — the step-by-step contract for adding a 16th language. +- [SCIP reconciliation](/opencodehub/architecture/scip-reconciliation/) + — how compiler-grade edges demote heuristic ones. +- Durable lesson: `conventions/bm25-over-node-id-favors-stubs.md` — + why BM25 over node IDs needs to be gated against unresolved stubs. diff --git a/packages/docs/src/content/docs/architecture/scanners-and-sarif.md b/packages/docs/src/content/docs/architecture/scanners-and-sarif.md new file mode 100644 index 0000000..171939e --- /dev/null +++ b/packages/docs/src/content/docs/architecture/scanners-and-sarif.md @@ -0,0 +1,200 @@ +--- +title: Scanners and SARIF +description: Two scanner tiers, how SARIF enrichment preserves GHAS dedup, and how the findings baseline bucketizes new versus fixed versus unchanged results. +sidebar: + order: 40 +--- + +Scanners are a tier-one MCP surface: the `scan` tool is the only tool +that spawns processes (`openWorldHint=true`) and the only tool that is +non-idempotent. SARIF is the on-disk exchange format. This page +covers the catalog, the license distinction between bundled and +wrapped tools, how SARIF enrichment stays GHAS-compatible, and how +baseline diffs get bucketized. + +## Scanner tiers + +The catalog at `packages/scanners/src/catalog.ts` is a flat module: +one exported `ScannerSpec` per tool plus three aggregate arrays. 
+Selection is driven by the project profile (languages, IaC types, API +contracts) and can be overridden with an explicit scanner list. + +### Priority-1 (11 scanners) + +Always considered for a default scan; each one is gated on the +project's detected languages. + +- **semgrep** — multi-language static analysis, rule packs for common + bugs and insecure patterns. +- **betterleaks** — secret scanner, permissive license. +- **osv-scanner** — vulnerability scan against the OSV database + keyed on lockfiles. +- **bandit** — Python static security analyzer. +- **biome** — JS/TS formatter and linter in one binary. +- **pip-audit** — Python dependency vulnerability audit. +- **npm-audit** — npm dependency vulnerability audit. +- **ruff** — Python lint + format. +- **grype** — container image and filesystem vulnerability scanner. +- **checkov-docker-compose** — IaC policy scan scoped to + docker-compose files (kept in P1 for every repo with a compose file). +- **vulture** — Python dead-code detection. + +### Priority-2 (8 scanners) + +Opt-in or gated by profile fields beyond language: + +- **trivy** — broader container / IaC / SBOM scanner. +- **checkov** — full IaC policy coverage (Terraform, Kubernetes, + CloudFormation, Helm). +- **hadolint** — Dockerfile lint. Invoked as a subprocess only + (license note below). +- **tflint** — Terraform lint. Subprocess-only. +- **spectral** — OpenAPI / AsyncAPI contract lint. +- **radon** — Python complexity / maintainability metrics. +- **ty** — Python type checker. +- **clamav** — malware scan. Carries the `opt-in` flag so it is + excluded from every default gate; explicit `scanners: ["clamav"]` + turns it on. + +## License-incompatible wrappers + +hadolint (GPL-3.0) and tflint (MPL-2.0 + BUSL-1.1 depending on vendor +build) are not on the permissive license allowlist. OpenCodeHub still +supports them the same way it supports any other scanner: **wrap, +don't link**. 
+ +Concretely: + +- `packages/scanners/src/wrappers/hadolint.ts` and `.../tflint.ts` + spawn the OS binary, capture stdout as SARIF, and emit findings. +- The binary is a user-provided runtime dependency. OpenCodeHub does + not bundle it, ship it, or require it at install time. +- License obligations flow to the user who installed the scanner, + not to OpenCodeHub. + +This is the same pattern GitHub CodeQL uses with third-party SARIF +producers. See [Supply chain](/opencodehub/architecture/supply-chain/) +for the broader policy. + +A missing binary yields an empty SARIF run, not a crash — the catalog +is built to degrade gracefully when a wrapper's tool is not installed. + +## SARIF emission + +`@opencodehub/sarif` owns the schema, merge, enrichment, suppressions, +and baseline logic. Every scanner run produces SARIF v2.1.0, +zod-validated against the spec. + +### Rule IDs and fingerprints + +Two fingerprints are computed per result, under +`properties.opencodehub.*`: + +- `opencodehub/v1` — `sha256(scannerId \0 ruleId \0 filePath \0 + contextHash)[:32]`. The match key for baseline diffing. +- `primaryLocationLineHash` — `sha256(ruleId \0 filePath \0 + normalizedSnippet)[:16] + ":" + startLine`. The GHAS dedup key. + +**Invariant:** `result.fingerprints`, `partialFingerprints`, `ruleId`, +and `artifactLocation.uri` are never mutated by enrichment. All +enrichment goes under `properties.opencodehub.*`. This is how SARIF +output stays GHAS-compatible — GitHub's deduplication on +`primaryLocationLineHash` still works. + +### Enrichment fields + +`enrichWithProperties` adds graph-derived context to each result: + +- `blastRadius` — dependent count from `impact`. +- `community` — the containing Louvain community. +- `cochangeScore` — temporal co-change coefficient. +- `centrality` — node centrality. +- `temporalFixDensity` — how often this file has been a fix target. +- `busFactor` — unique recent authors. 
+- `cyclomaticComplexity` — McCabe complexity of the enclosing + function. +- `ownershipDrift` — recent change in top contributor. + +### Suppressions + +Two paths, same output: + +- **External YAML** — `.codehub/suppressions.yaml` declares + `{ruleId, filePathPattern, reason, expiresAt?}`. +- **Inline comment** — `// codehub-suppress: ` (or + `#`, `/* */` variants) in source. + +Both write to `result.suppressions[]` with `{kind: +"external"|"inSource", justification}`. Suppressions past their +`expiresAt` are dropped at load with a warning, so `codehub verdict` +can re-block the finding. + +## Findings baseline and delta + +Two SARIF files on disk: + +- `.codehub/scan.sarif` — the current scan. +- `.codehub/baseline.sarif` — the frozen baseline written by + `codehub scan --baseline`. + +`list_findings_delta` reads both and runs `diffSarif`. The match key +is the `opencodehub/v1` partial fingerprint, with a fallback to +`(ruleId, uri, startLine)` when the fingerprint is missing. Rename +follow-through is optional: if the storage layer supplies a +`renameChainFor` resolver (backed by `FileNode.renameHistoryChain` +from the temporal phase), a finding that followed a rename still +matches. + +Four buckets: + +| Bucket | Meaning | +|-------------|----------------------------------------------------------| +| `new` | In current, not in baseline. | +| `fixed` | In baseline, not in current. | +| `unchanged` | Same fingerprint, same contextHash. | +| `updated` | Same fingerprint, changed line / snippet. | + +When the current SARIF already carries baked-in `baselineState` tags +(written by `codehub scan --baseline`), `list_findings_delta` reuses +them instead of re-running the diff — the on-disk SARIF is the source +of truth. + +## The `scan` tool + +`scan` is deliberately the odd one out. 
Annotations: + +``` +readOnlyHint: false +destructiveHint: false +openWorldHint: true // spawns subprocesses +idempotentHint: false // writes disk, state-changing +``` + +The tool picks scanners via `selectScanners()`, which honors an +explicit list or falls back to profile-gated defaults. Concurrency is +clamped to `min(availableParallelism(), opts.concurrency ?? 4)`. A +per-wrapper failure does not abort the run — it just omits that +scanner's results from the merged SARIF. + +The merged SARIF is persisted to `.codehub/scan.sarif`; a summary +groups result counts by `tool.driver.name` and `result.level` +(defaulting to `note` when the scanner omits the level). + +## Configuration knobs + +- `ScanInput.timeoutMs` — per-scanner timeout (default 300_000, max + 600_000). +- `ScanInput.scanners` — explicit id list overrides profile gating. +- `ProjectProfileGate.languages / iacTypes / apiContracts` — stored + in `nodes WHERE kind='ProjectProfile'`; drives default selection. +- `.codehub/suppressions.yaml` — external suppression rules. + +## Related + +- [`scan` tool reference](/opencodehub/mcp/tools/) — + the full input schema. +- [`list_findings` tool reference](/opencodehub/mcp/tools/) + — querying findings stored as nodes. +- [Supply chain](/opencodehub/architecture/supply-chain/) — why + subprocess invocation is the right pattern for non-permissive + scanners. diff --git a/packages/docs/src/content/docs/architecture/scip-reconciliation.md b/packages/docs/src/content/docs/architecture/scip-reconciliation.md new file mode 100644 index 0000000..1286d04 --- /dev/null +++ b/packages/docs/src/content/docs/architecture/scip-reconciliation.md @@ -0,0 +1,200 @@ +--- +title: SCIP reconciliation +description: How five SCIP indexers augment the heuristic graph — ingest path, confidence demotion, provenance tagging, and the known gotchas that shaped the design. +sidebar: + order: 30 +--- + +SCIP is the augmenter, not the primary. 
OpenCodeHub's default +resolver produces a graph on its own; SCIP then runs for each +detected language, produces compiler-grade occurrences, and +reconciles against the heuristic edges. Heuristic edges never get +deleted — they get demoted. This page covers the ingest path, +reconciliation, and the corners that took a few iterations to get +right. + +## Why SCIP is an augmenter + +Three reasons SCIP does not replace the default resolver: + +- **Not every language has an indexer.** Only five of the 15 registered + providers have a pinned SCIP indexer. +- **SCIP requires a buildable repo.** Missing dependencies, unsettable + credentials, or a half-written feature branch all make the indexer + fall over. The heuristic resolver still produces a usable graph. +- **Rust and Java need build scripts to run.** SCIP is gated behind + `CODEHUB_ALLOW_BUILD_SCRIPTS=1`. Heuristic parsing is always safe. + +SCIP contributes `CALLS` edges with `confidence=1.0` — the oracle +tier — and the reconciliation phase rescales any colliding heuristic +edge to `confidence=0.2` with a `+scip-unconfirmed` suffix on the +reason. + +## Indexer pins + +Versions live in `.github/workflows/gym.yml` so gym replay catches +drift: + +| Indexer | Pin | Install channel | +|----------------|--------------|----------------------------------------------| +| scip-typescript| `0.4.0` | `npm install -g` | +| scip-python | `0.6.6` | `uv tool install` | +| scip-go | `v0.2.3` | `go install github.com/scip-code/scip-go/cmd/scip-go@...` | +| scip-java | `0.12.3` | `coursier install` | +| rust-analyzer | `stable` | `rustup component add rust-analyzer rust-src`| + +rust-analyzer tracks the stable channel rather than a pinned tag; ADR +0006 covers the decision. + +## The `.scip` ingest path + +`@opencodehub/scip-ingest` hand-rolls the protobuf reader (~130 LOC) +instead of pulling in buf plus codegen — the SCIP schema is small +enough that the extra build-time dependency is not worth the +maintenance burden. 
The public API is narrow: `parseScipIndex`, +`deriveIndex`, `deriveEdges`, `buildSymbolDefIndex`, `materialize`, +`runIndexer`, `detectLanguages`, `scipProvenanceReason`. + +The phase flow: + +1. `detectLanguages(repo)` — fs-based heuristic (tsconfig.json, + pyproject.toml, go.mod, Cargo.toml, pom.xml / build.gradle / + build.sbt). +2. For each detected language, `runIndexer()` spawns the per-language + binary and writes `.codehub/scip/.scip`. Fan-out uses + `Promise.all`; a per-language failure never aborts the run. +3. `parseScipIndex` decodes the protobuf into typed wire shapes. +4. `deriveIndex` + `deriveEdges` attribute each occurrence to a caller + (via innermost-enclosing `enclosing_range`) and a callee (via a + `symbolDef` table keyed on `SCIP_ROLE_DEFINITION` occurrences). +5. `emitEdges()` writes `CALLS` edges with `confidence=1.0` and + `reason=scip:@`. + +A cached `.scip` artifact that passes the freshness check is reused; +re-running an indexer is expensive, especially rust-analyzer. + +## Confidence demote + +The `confidence-demote` phase runs immediately after `scip-index` and +carries four constants: + +``` +HEURISTIC_CONFIDENCE = 0.5 +DEMOTED_CONFIDENCE = 0.2 +ORACLE_CONFIDENCE = 1.0 +UNCONFIRMED_SUFFIX = "+scip-unconfirmed" +``` + +It iterates edges twice: first to build the set of +`(from, type, to)` triples that SCIP has confirmed, second to demote +any matching heuristic edge. Three edge types are demotable: `CALLS`, +`REFERENCES`, `EXTENDS`. The demoted edge keeps its original reason +with the `+scip-unconfirmed` suffix so provenance is visible. + +The invariant: **SCIP replaces (never rejects) heuristic edges — +demote only, do not delete**. Downstream consumers can still filter +on confidence; the information is not lost. + +## Provenance tagging + +Every oracle-derived edge carries a reason of the form +`scip:@`, e.g. `scip:scip-python@0.6.6`. 
The +prefix set is declared once in `@opencodehub/core-types`: + +```ts +export const SCIP_PROVENANCE_PREFIXES = [ + "scip:scip-typescript@", + "scip:scip-python@", + "scip:scip-go@", + "scip:rust-analyzer@", + "scip:scip-java@", +] as const; +``` + +Consumers (summarizer trust filter, `verdict`, MCP tools) test against +this list rather than string-matching ad hoc. + +## The pipeline slice + +```mermaid +flowchart LR + heur[Heuristic edges<br/>confidence=0.5] --> reconcile + detect[detectLanguages] --> runner[runIndexer<br/>Promise.all] + runner --> scip[.codehub/scip/*.scip] + scip --> parse[parseScipIndex] + parse --> derive[deriveIndex<br/>innermost enclosing] + derive --> oracle[SCIP edges<br/>confidence=1.0] + oracle --> reconcile[confidence-demote] + reconcile --> kg[KnowledgeGraph] +``` + +`reconcile` is the phase that makes heuristic and oracle edges +coherent. Only `CALLS` edges currently flow from SCIP (see +limitations below). + +## Known gotchas + +The design has been shaped by four durable lessons. Each one is a +concrete bug that was found, fixed, and captured: + +- **Callee resolution must go through `symbolDef` keyed on + `SCIP_ROLE_DEFINITION`.** Resolving a callee from the first-seen + call site routes same-named symbols to wrong local nodes — a + Python method named `save` in multiple classes all collapse onto + whichever `save()` call happened first in the file. The + `buildSymbolDefIndex` path is the fix. See durable lesson + `architecture-patterns/scip-callee-definition-site.md`. +- **TS monorepos emit `dist/` paths in cross-package refs and `src/` + paths in defs.** The `symbolDef` table aliases the two so a + reference to `@acme/core/dist/foo.js` binds to its definition in + `packages/core/src/foo.ts`. See durable lesson + `architecture-patterns/scip-monorepo-dist-src-alias.md`. +- **SCIP is 0-indexed, the graph is 1-indexed.** The `+1` + conversion lives at the boundary in `scip-index.ts`. 
Getting this + wrong shifts every caller attribution by one line. See durable + lesson `conventions/scip-0-indexed-vs-graph-1-indexed.md`. +- **The protobuf reader is hand-rolled on purpose.** SCIP's schema + is small and stable; pulling in buf plus codegen would pay a + recurring build-time cost for decoding logic that fits in 130 + lines. See durable lesson + `conventions/scip-protobuf-hand-rolled-reader.md`. + +## Known limitations + +Two gaps are tracked for future work rather than hidden: + +- **`REFERENCES` edges are demotable but not yet emitted from SCIP.** + `emitEdges()` currently only writes `CALLS`. The `confidence-demote` + phase already handles `REFERENCES` if they arrive. 
+- **Heritage edges from SCIP relationships are not wired in.** + `DerivedRelation` exists in `scip-ingest` and carries + `IMPLEMENTS` / `TYPE_OF` synthesized from + `SymbolInformation.relationships.is_implementation`, but nothing + consumes it into the graph yet. The derivation code is ready; + `scip-index.ts:emitEdges` needs an additional branch. + +Both are partially-vestigial: the plumbing exists, the wiring does +not. They are not currently blocking, because the heuristic +`extractHeritage` hook covers the common cases. + +## Configuration knobs + +- `CODEHUB_DISABLE_SCIP=1` — the phase is a full no-op. +- `CODEHUB_ALLOW_BUILD_SCRIPTS=1` — required for the rust + java + runners (build.rs, gradle). +- `PipelineOptions.offline === true` — skips indexer runs entirely; + cached `.scip` artifacts are still consumed if present. + +## Further reading + +- [ADR 0005 — SCIP replaces LSP](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0005-scip-replaces-lsp.md) + — why SCIP (no long-running language server) over LSP. +- [ADR 0006 — SCIP indexer pins](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0006-scip-indexer-pins.md) + — the version table and rationale. +- [Determinism](/opencodehub/architecture/determinism/) — gym replay + catches indexer drift before it lands in main. +- Durable lessons: `architecture-patterns/scip-replaces-lsp.md`, + `architecture-patterns/scip-callee-definition-site.md`, + `architecture-patterns/scip-monorepo-dist-src-alias.md`, + `conventions/scip-0-indexed-vs-graph-1-indexed.md`, + `conventions/scip-protobuf-hand-rolled-reader.md`. 
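The 0-indexed versus 1-indexed gotcha listed under "Known gotchas" can be illustrated with a small boundary converter. This is a sketch under stated assumptions, not the code in `scip-index.ts`: `scipRangeToGraphSpan` and `GraphSpan` are hypothetical names, and columns are assumed to stay 0-indexed to match the `ParseCapture` convention. The 3- and 4-element range forms follow the SCIP occurrence encoding.

```typescript
// SCIP occurrence ranges are 0-based and come in two shapes:
//   [startLine, startChar, endLine, endChar]
//   [startLine, startChar, endChar]            (single-line shorthand)
type ScipRange =
  | readonly [number, number, number]
  | readonly [number, number, number, number];

// Hypothetical graph-side span: 1-indexed lines, 0-indexed columns.
interface GraphSpan {
  startLine: number;
  endLine: number;
  startCol: number;
  endCol: number;
}

/** Convert a SCIP range to graph coordinates by shifting lines by +1. */
export function scipRangeToGraphSpan(range: ScipRange): GraphSpan {
  if (range.length === 4) {
    const [startLine, startCol, endLine, endCol] = range;
    return { startLine: startLine + 1, endLine: endLine + 1, startCol, endCol };
  }
  // Three-element form: the occurrence starts and ends on the same line.
  const [line, startCol, endCol] = range;
  return { startLine: line + 1, endLine: line + 1, startCol, endCol };
}
```

Keeping the shift in one function at the ingest boundary means nothing downstream has to remember which coordinate system a line number came from.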
diff --git a/packages/docs/src/content/docs/architecture/summarization-and-fusion.md b/packages/docs/src/content/docs/architecture/summarization-and-fusion.md new file mode 100644 index 0000000..ea182f5 --- /dev/null +++ b/packages/docs/src/content/docs/architecture/summarization-and-fusion.md @@ -0,0 +1,202 @@ +--- +title: Summarization and fusion +description: Per-symbol LLM summaries via Bedrock + Haiku with ReAct retry, and how summaries fuse into the symbol-tier embedding at ingest time — not query time. +sidebar: + order: 60 +--- + +`@opencodehub/summarizer` produces per-symbol natural-language +summaries grounded in source. The ingestion `summarize` phase +persists them; the downstream `embeddings` phase fuses each summary +into the symbol-tier embedding text so retrieval runs against a +pre-fused vector. + +This page covers the schema, the Bedrock caching shape, the ReAct +retry loop, and where fusion happens. + +## Schema + +`SymbolSummary` is a Zod 4 schema with strict field bounds and a +SuperRefine that enforces citation completeness — every populated +field must carry ≥1 citation. + +| Field | Shape | +|---------------|-----------------------------------------------------------------| +| `purpose` | string (30-400 chars); becomes `summaryText` in the row. | +| `inputs` | `InputSpec[]`: name + type + description per input. | +| `returns` | `{type, type_summary (10-80), details (20-400)}`. | +| `side_effects`| array; each entry contains one of `reads|writes|emits|raises|mutates`. | +| `invariants` | array (nullable). | +| `citations` | ≥1; each has `field_name` enum + `line_start` + `line_end`. | + +`buildToolInputSchema()` runs `z.toJSONSchema(SymbolSummary)` and +strips `$schema` before handing it to Bedrock — any post-processing +that re-adds `$schema` breaks the cacheable prefix. A runtime +`validateCitationLines()` pass checks every citation range sits +inside the source span. 
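The citation range check mentioned above might look something like the sketch below. The real signature of `validateCitationLines()` is not shown in this page, so `checkCitationLines` is a hypothetical stand-in, and it assumes citation line numbers are absolute file lines compared against the symbol's `[startLine, endLine]` span.

```typescript
// Field names follow the schema table above; the interfaces themselves
// are illustrative, not the package's exported types.
interface Citation {
  field_name: string;
  line_start: number;
  line_end: number;
}

interface SourceSpan {
  startLine: number;
  endLine: number;
}

/** Return one message per citation that is inverted or outside the span. */
export function checkCitationLines(
  citations: readonly Citation[],
  span: SourceSpan,
): string[] {
  const errors: string[] = [];
  for (const c of citations) {
    if (c.line_start > c.line_end) {
      errors.push(`${c.field_name}: line_start ${c.line_start} is after line_end ${c.line_end}`);
    } else if (c.line_start < span.startLine || c.line_end > span.endLine) {
      errors.push(
        `${c.field_name}: lines ${c.line_start}-${c.line_end} fall outside span ${span.startLine}-${span.endLine}`,
      );
    }
  }
  return errors;
}
```

Returning messages rather than throwing fits the retry design described later on the page, where validation error text is fed back to the model.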
+ +## Model + caching + +Two constants govern the model choice: + +``` +DEFAULT_MODEL_ID = "global.anthropic.claude-haiku-4-5-20251001-v1:0" +DEFAULT_MAX_ATTEMPTS = 3 +``` + +`summarizeSymbol(client, input, options)` issues a Bedrock +`ConverseCommand` with structured output via tool use. Key knobs: + +- `toolChoice` is forced to `emit_symbol_summary` — the model MUST + call this tool; a text-only response is a retry. +- `inferenceConfig = {temperature: 0, maxTokens: 2048}`. +- `cachePoint` is placed **twice**: after the system prompt, and + after the tool spec inside `toolConfig.tools`. + +The dual `cachePoint` placement matters because Haiku 4.5's +cacheable-prefix floor is 4,096 tokens. `SYSTEM_PROMPT` is sized to +clear that floor with three worked examples baked in (`normalize_path` +as a pure function, `register_handler` as a side-effectful handler, +`LRUCache` as a constructor). The tool spec's cache point covers the +JSON Schema itself, which is stable as long as `$schema` is stripped +and `SUMMARIZER_PROMPT_VERSION` is unchanged. + +## ReAct retry + +The retry loop handles two failure modes: + +- **Schema-invalid tool call.** The model returns a tool use that + fails Zod validation. The Zod error text is fed back as + `toolResult(status: "error")` and the model retries. +- **No tool call at all.** The model returned text only. Same fix — + feed back an error and retry. + +`maxAttempts=3` is the default; three tries is enough in practice. A +third failure throws `SummarizerError` to the caller. + +## Ingestion invocation + +The ingestion call site is +`packages/ingestion/src/pipeline/phases/summarize.ts`. Its deps +include `confidence-demote`, so the trust filter (SCIP-touched +symbols only) sees finalized confidence scores. + +The phase applies four gates in strict order: + +1. **Offline** — `PipelineOptions.offline === true` is a hard no-op. +2. **Flag** — `PipelineOptions.summaries === true` required. +3. 
**Trust filter** — only symbols touched by a SCIP oracle + (confidence 1.0 with a reason prefixed by `scip:`) are candidates. + A repo without SCIP produces zero summaries even with + `summaries=true`. +4. **Cost cap** — `maxSummariesPerRun` (default 0) slices the + candidate list. A default run is a dry-run: it counts + `wouldHaveSummarized` without issuing a single Bedrock call. + +Reordering any gate silently changes cost behavior, so the order is +deliberately rigid. See the phase docstring for the full precedence +contract. + +Credential soft-fail is handled twice — once on client factory +construction, once on the first `send()` — so an SSO token that +expires mid-run produces `skippedReason: "no-credentials"` rather +than an uncaught exception. + +Successful rows persist as `SymbolSummaryRow`: +`{nodeId, contentHash, promptVersion, modelId, summaryText, +signatureSummary, returnsTypeSummary, createdAt}`. + +## Fusion at ingestion, not query time + +This is the bit to internalize: **fusion happens at ingestion, not at +query**. When the `embeddings` phase builds a symbol's vector, it +calls `symbolText(node, summary, body)`. If a summary row exists, +the embedded text is: + +``` +\n\n +``` + +with `bodyPiece` capped at `SYMBOL_BODY_CHAR_CAP = 1200`. Without a +summary, the fallback is `\n`. + +The resulting vector already encodes the signature, the summary, and +the body. Retrieval does not re-fuse at query time — it searches +against the pre-fused vector. This keeps query latency low and keeps +the query path free of LLM calls. + +```mermaid +sequenceDiagram + participant Summ as summarize phase + participant Bedrock + participant Emb as embeddings phase + participant HNSW as embeddings table + HNSW + + Summ->>Summ: filter by SCIP-trust + Summ->>Summ: cache probe (nodeId, contentHash, promptVersion) + alt cache miss + Summ->>Bedrock: Converse with dual cachePoint + Bedrock-->>Summ: tool_use purpose, inputs, returns, ... 
+ Summ->>Summ: Zod validate, ReAct retry on error + end + Summ->>Summ: persist SymbolSummaryRow + Emb->>Emb: symbolText(node, summary, body) — fuse + Emb->>HNSW: upsert symbol-tier vector +``` + +## Cache-key discriminator + +The cache key is `(nodeId, contentHash, promptVersion)`: + +- `contentHash` is `sha256` of the raw UTF-8 span `[startLine, + endLine]`. A whitespace-only edit inside the span changes the hash + and invalidates the cached summary for that symbol. +- `promptVersion` is `SUMMARIZER_PROMPT_VERSION = "1"`. Bumping this + constant invalidates every cached summary in one shot — the + prior rows survive in the cache (no deletion), but lookups miss. + Planned rollout is the new version coexisting with the old so a + rollback is cheap. + +## Cost profile + +Haiku 4.5 calls happen once per callable symbol at ingest time. A +re-ingest without a prompt-version bump is a cache hit. With the +default `maxSummariesPerRun=0`, the phase never contacts Bedrock — +the dry-run mode is the production default until an operator opts in. + +## Configuration knobs + +- `PipelineOptions.summaries: boolean` — master enable (default + false). +- `PipelineOptions.maxSummariesPerRun` — default 0 (dry-run). Counts + `wouldHaveSummarized` without calling Bedrock. +- `PipelineOptions.summaryModel` — override the default model id. +- `SummarizeOptions.maxAttempts` (default 3) / `maxTokens` (default + 2048). +- AWS SDK credentials via default chain — expired SSO soft-fails to + `skippedReason: "no-credentials"`. + +## Gotchas + +- **Trust filter excludes non-SCIP repos.** A repo without any SCIP + indexer configured produces zero summaries because no symbol is + SCIP-confirmed. This is intentional: summaries over uncertain + edges would pollute the downstream retrieval vector. +- **Whitespace-only edits bust the cache.** `contentHash` is over + the raw span, not a normalized form. A reformatter run will + re-summarize every touched symbol. 
This is a deliberate trade — + normalization would require per-language logic and is not worth it + for a once-per-symbol call. +- **`signatureSummary` appears in both `SymbolSummaryRow` and + `SearchResult`.** The two are populated by different paths: the + summarize phase writes one, the MCP query layer post-joins the + other. Storage-layer `search()` never fills it directly. + +## Further reading + +- [Embeddings](/opencodehub/architecture/embeddings/) — where the + symbol-tier fused text lands. +- [SCIP reconciliation](/opencodehub/architecture/scip-reconciliation/) + — the trust filter source. +- `@opencodehub/summarizer` package README — the schema field + bounds in one page. diff --git a/packages/docs/src/content/docs/architecture/supply-chain.md b/packages/docs/src/content/docs/architecture/supply-chain.md new file mode 100644 index 0000000..c4f390a --- /dev/null +++ b/packages/docs/src/content/docs/architecture/supply-chain.md @@ -0,0 +1,141 @@ +--- +title: Supply chain posture +description: SBOM, license allowlist, vulnerability gates, and how we handle non-permissive scanners. +sidebar: + order: 50 +--- + +OpenCodeHub ships under Apache-2.0 with a hard rule: every transitive +runtime dependency must sit on a permissive-license allowlist. This +page documents what we ship, the CI gates that prove it, and the +narrow set of tools we invoke as subprocesses rather than link +against. + +## What we ship + +Every release produces, in the `main` tree and the GitHub Release +artifacts: + +- **`SBOM.cdx.json`** — a CycloneDX v1.5 Software Bill of Materials + covering the full runtime dependency graph. Regenerated on every + release by `.github/workflows/sbom.yml`. +- **`THIRD_PARTY_LICENSES.md`** — a human-readable inventory of every + third-party package with its license text. +- **`NOTICE`** — the Apache-2.0 NOTICE file naming every attribution + we carry. +- **`CHANGELOG.md`** — generated by `release-please` from Conventional + Commits since the last release. 
+ +All four files are tracked in the repo. Consumers can audit them +without cloning the history. + +## License allowlist + +Every production dependency must be on this list: + +``` +Apache-2.0 +MIT +BSD-2-Clause +BSD-3-Clause +ISC +CC0-1.0 +BlueOak-1.0.0 +0BSD +``` + +The check is enforced by +[`license-checker-rseidelsohn`](https://www.npmjs.com/package/license-checker-rseidelsohn) +on every PR and as part of `mise run check:full`. See +[IP hygiene / License allowlist](/opencodehub/contributing/ip-hygiene/#license-allowlist) +for the exact command and the note on the one known acceptance-script +inconsistency. + +BSL, BUSL, PolyForm, Commons Clause, GPL, and AGPL are rejected +upfront. Source-available engines (e.g. LanceDB's former license, +Elastic) were considered and rejected in ADR 0001 specifically +because preserving Apache-2.0 distribution rights is load-bearing. + +## Vulnerability gates + +| Gate | Tool | Trigger | +|---------------------|------------------------------------------------|---------------------------------------| +| OSV scan | `osv-scanner scan source --lockfile pnpm-lock.yaml` | Every CI run + `mise run check:full` | +| CodeQL | `.github/workflows/codeql.yml` | Every push + weekly schedule | +| OpenSSF Scorecard | `.github/workflows/scorecard.yml` | Weekly + push to `main` | +| SARIF schema | `mise run sarif:validate` + acceptance gate 13 | Every scanner run | + +Release gate policy: zero open CVEs on the lockfile at release time. +If a bump is blocked (upstream has not shipped a fix, or the fix +requires a breaking change), the PR must document the CVE, the reason, +and a due date before release-please cuts the version. + +All scanner outputs are uploaded as SARIF to the GitHub Security tab, +so the org-wide view is one dashboard. 
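
As a rough illustration of the license allowlist applied to `SBOM.cdx.json`, the sketch below walks the `components` array of a CycloneDX JSON document and reports anything outside the list. The function name `findDisallowed` and the interface names are hypothetical, not part of OpenCodeHub. The check is deliberately strict: compound SPDX expressions and unlicensed components are flagged for manual review rather than parsed.

```ts
// Hypothetical sketch: flag SBOM components whose license is not on the allowlist.
// Field names follow the CycloneDX JSON layout (components[].licenses[]).
const ALLOWLIST = new Set([
  "Apache-2.0", "MIT", "BSD-2-Clause", "BSD-3-Clause",
  "ISC", "CC0-1.0", "BlueOak-1.0.0", "0BSD",
]);

interface SbomLicenseEntry {
  license?: { id?: string; name?: string };
  expression?: string; // SPDX expression, e.g. "MIT OR Apache-2.0"
}

interface SbomComponent {
  name: string;
  version?: string;
  licenses?: SbomLicenseEntry[];
}

interface Sbom {
  components?: SbomComponent[];
}

function findDisallowed(sbom: Sbom): string[] {
  const offenders: string[] = [];
  for (const component of sbom.components ?? []) {
    // Prefer the SPDX id, then an expression, then a free-form name.
    const ids = (component.licenses ?? []).map(
      (entry) => entry.license?.id ?? entry.expression ?? entry.license?.name ?? "UNKNOWN",
    );
    // A component passes only if it declares at least one license and
    // every declared value is an exact allowlist match.
    const ok = ids.length > 0 && ids.every((id) => ALLOWLIST.has(id));
    if (!ok) {
      const label = ids.join(", ") || "no license declared";
      offenders.push(`${component.name}@${component.version ?? "?"}: ${label}`);
    }
  }
  return offenders;
}
```

An empty result means every component declared only allowlisted licenses. A non-empty result lists the components to inspect by hand.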
+ +## Non-permissive scanners + +Some scanners that end users may want to run through `codehub scan` +— hadolint (GPL-3.0), tflint (MPL-2.0 / BUSL depending on vendor +build) — are not on the permissive allowlist. We still expose them. +The trick is **how**: we invoke them as subprocesses, we never +`import` them, never link them in, and never redistribute the +binaries. + +Concretely: + +- `packages/scanners/src/` is a thin shell-out layer. Each scanner + runner spawns the binary, captures stdout as SARIF, and emits + findings into the graph. +- The scanner binaries are a **user-provided runtime dependency**. + Users install them separately (via `brew`, `apt`, `choco`, the + vendor-published Docker image, etc.). OpenCodeHub does not ship + them, bundle them, or require them at install time. +- Scanner license obligations flow to the user running the scanner, + not to OpenCodeHub. + +This is the same pattern GitHub CodeQL uses with third-party SARIF +producers, and it is the reason OBJECTIVES.md can commit to an +Apache-2.0-end-to-end posture without crippling the scan surface. + +## SCIP indexers + +The SCIP indexers the gym uses (scip-typescript, scip-python, +scip-go, rust-analyzer, scip-java) follow the same subprocess-only +rule. They are installed via their language's native package +manager (`npm install -g`, `go install`, `rustup component add`, +`coursier install`) and invoked via subprocess. ADR 0006 pins the +versions and documents the install channel per language. + +## Lockfile policy + +- `pnpm-lock.yaml` is committed. +- Every install uses `--frozen-lockfile`. +- Dependency bumps are Conventional Commits under `build(deps): ...` + (or `chore(deps): ...` for devDependencies). +- Dependabot or manual bumps go through the same osv + license gates + as any other PR. + +## Verifying a release + +To verify a downloaded release: + +1. Pull the SBOM: `SBOM.cdx.json` at the release tag. +2. Confirm every component license is on the allowlist above. +3. 
Cross-check against `THIRD_PARTY_LICENSES.md` for any omissions. +4. Run `osv-scanner` against the tag's lockfile locally. + +The SBOM is deterministic — two regenerations at the same commit +produce the same bytes. That is an extension of the determinism +contract to the supply-chain layer. + +## Related + +- [IP hygiene](/opencodehub/contributing/ip-hygiene/) — the rules a + contributor has to follow to keep this posture. +- [ADR 0001 — Storage backend](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0001-storage-backend.md) — + why every storage-layer dependency is MIT. +- [ADR 0006 — SCIP indexer CI pins](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0006-scip-indexer-pins.md) — + current SCIP indexer version + install channel table. +- `SBOM.cdx.json`, `THIRD_PARTY_LICENSES.md`, `NOTICE`, `LICENSE` at + the repo root. diff --git a/packages/docs/src/content/docs/contributing/adding-a-language-provider.md b/packages/docs/src/content/docs/contributing/adding-a-language-provider.md new file mode 100644 index 0000000..d0405eb --- /dev/null +++ b/packages/docs/src/content/docs/contributing/adding-a-language-provider.md @@ -0,0 +1,169 @@ +--- +title: Adding a language provider +description: Four steps to wire a new language into the OpenCodeHub ingestion pipeline. +sidebar: + order: 60 +--- + +OpenCodeHub ships 15 tree-sitter language providers today: TypeScript, +TSX, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin, +Swift, PHP, and Dart. Five of them (TypeScript, Python, Go, Rust, Java) +are further upgraded with SCIP indexers for compiler-grade cross-module +edges. + +Adding a new language is four steps. The registry is compile-time +exhaustive, so the TypeScript build fails if you forget step three. + +## Step 1 — Pin the tree-sitter grammar + +Add the grammar as a pinned dependency in `packages/ingestion/package.json`. +Use a concrete semver; do not use `^` or `latest`. 
Grammars change AST +shapes between versions and a float range will silently break +extraction. + +```json title="packages/ingestion/package.json" +{ + "dependencies": { + "tree-sitter-": "1.2.3" + } +} +``` + +Then `pnpm install` and verify the grammar loads by running the parse +bootstrap tests locally. + +## Step 2 — Implement the provider + +Create `packages/ingestion/src/providers/.ts` exporting a +`LanguageProvider` object. The interface lives at +`packages/ingestion/src/providers/types.ts`. Required fields and +methods: + +| Member | Purpose | +|-----------------------|-------------------------------------------------------------------------| +| `id` | The `LanguageId` string (must already exist in `@opencodehub/core-types`) | +| `extensions` | File extensions this provider claims | +| `importSemantics` | `named` / `namespace` / `package-wildcard` (see below) | +| `mroStrategy` | `c3` / `first-wins` / `single-inheritance` / `none` (see below) | +| `typeConfig` | `{ structural, nominal, generics }` booleans | +| `heritageEdge` | `"EXTENDS"` / `"IMPLEMENTS"` / `null` | +| `extractDefinitions` | Emit one record per defined symbol | +| `extractCalls` | Emit one record per call site | +| `extractImports` | Parse `import` / `use` / `require` statements | +| `extractHeritage` | Emit inheritance / trait-impl / interface-implements edges | +| `isExported` | Predicate: is this definition publicly exported? | + +Optional hooks improve coverage: + +| Member | Purpose | +|---------------------------|-------------------------------------------------------------------| +| `detectOutboundHttp` | Detect `fetch("/api")`, `requests.get(url)`, `axios.post(url, ...)` | +| `extractPropertyAccesses` | Emit `ACCESSES` edges for `receiver.property` reads/writes | +| `preprocessImportPath` | Strip `.js` suffix for TS, resolve `__init__.py`, etc. 
| +| `inferImplicitReceiver` | Name for `this` / `self` inside a method body | +| `complexityDefinitionKinds` / `halsteadOperatorKinds` | Enable cyclomatic + Halstead metrics | + +### Picking `importSemantics` + +- **`named`** — the statement names specific symbols: + `import { foo } from "bar"` (TypeScript, JavaScript), `import foo.Bar` + (Java), `use std::io::Read` (Rust), `using System.IO` (C#). Use this + for most typed languages. +- **`namespace`** — the statement imports a whole module under a name: + `import os` / `from os import path` (Python). The resolver walks + `.` chains at call sites. +- **`package-wildcard`** — the statement pulls a whole package symbol + set into scope: `import "fmt"` (Go). Every exported symbol of `fmt` + becomes directly callable. + +Today's breakdown: `package-wildcard` is used by Go; `namespace` is +used by Python; everything else (12 languages) uses `named`. + +### Picking `mroStrategy` + +- **`c3`** — full C3 linearization. Raises on ambiguity. Used by + Python (matches CPython's MRO semantics). +- **`first-wins`** — left-to-right source order. Used by TypeScript, + TSX, JavaScript, and Rust. Fast, predictable, matches how these + languages' compilers actually resolve. +- **`single-inheritance`** — one `extends` chain plus a set of + interfaces. Used by Java, C#, Kotlin. The chain walk is cheap; the + implements set is checked at resolution time. +- **`none`** — no traditional inheritance. Used by Go (composition via + embedded fields, no `extends`). The method-resolution walker is + skipped entirely. + +If your language is new, pick the strategy that matches its compiler's +actual semantics. Do not invent a fifth option — the four above cover +every mainstream type system. + +## Step 3 — Register in the provider registry + +Open `packages/ingestion/src/providers/registry.ts` and add your +provider to the `providers` object. 
+ +```ts title="packages/ingestion/src/providers/registry.ts" +const providers = { + typescript: typescriptProvider, + // ... + zig: zigProvider, // new +} satisfies Record; +``` + +The `satisfies Record` clause is the +compile-time check. If you add `zig` to the `LanguageId` union in +`@opencodehub/core-types` but forget to register a provider, the +TypeScript build fails with a missing-key error. That is intentional — +the type error is how the registry stays exhaustive. + +## Step 4 — Add fixture tests + +Under `packages/ingestion/test/fixtures//` add source files that +exercise every extractor the provider implements. Use the +`parseFixture` helper from +`packages/ingestion/src/providers/test-helpers.ts`: + +```ts title="packages/ingestion/test/providers/.test.ts" +import { parseFixture } from "../../src/providers/test-helpers.js"; +import { Provider } from "../../src/providers/.js"; + +const result = await parseFixture(pool, "", "sample.", src); +const defs = Provider.extractDefinitions({ + filePath: "sample.", + captures: result.captures, + sourceText: src, +}); +// assert on defs... +``` + +Cover at minimum: a top-level function, a class with one method, an +import statement, a call to an imported symbol, and an exported vs. +non-exported symbol. If your language has generics / traits / +interfaces, add a fixture per heritage shape. + +The `parseFixture` helper returns a pool-borrowed `ParseCapture` array +that matches exactly what the ingestion pipeline passes in at runtime, +so the assertions you write here mirror production behaviour. + +## CI expectations + +Once the four steps are in place: + +- `mise run lint` — Biome check passes. +- `mise run typecheck` — registry exhaustiveness passes. +- `mise run test` — your fixture tests pass under `pnpm -r test`. +- `mise run banned-strings` — you did not accidentally copy names from + another project. 
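
The registry exhaustiveness that `mise run typecheck` enforces can be reproduced in isolation. The sketch below is self-contained and uses stand-in types: the three-member `LanguageId` union, the two-field `LanguageProvider`, and `providerForExtension` are all hypothetical, and it assumes the registry is checked against a record keyed by `LanguageId`. It needs TypeScript 4.9 or later for `satisfies`.

```ts
// Hypothetical stand-ins; the real types live in @opencodehub/core-types
// and packages/ingestion/src/providers/types.ts.
type LanguageId = "typescript" | "python" | "zig";

interface LanguageProvider {
  id: LanguageId;
  extensions: readonly string[];
}

const typescriptProvider: LanguageProvider = { id: "typescript", extensions: [".ts"] };
const pythonProvider: LanguageProvider = { id: "python", extensions: [".py"] };
const zigProvider: LanguageProvider = { id: "zig", extensions: [".zig"] };

// `satisfies` checks the object against the record type without widening it.
// Removing any key below makes `tsc` fail with a missing-property error.
const providers = {
  typescript: typescriptProvider,
  python: pythonProvider,
  zig: zigProvider,
} satisfies Record<LanguageId, LanguageProvider>;

// Look up the provider that claims a given file extension, if any.
function providerForExtension(ext: string): LanguageProvider | undefined {
  return Object.values(providers).find((p) => p.extensions.includes(ext));
}
```

Because the key set comes from the union, adding a member to `LanguageId` without a matching registry entry is caught at build time rather than at runtime.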
+ +If your language has an available SCIP indexer, a follow-up PR can add +it to `packages/scip-ingest/src/runners/` and `.github/workflows/gym.yml` +to upgrade heuristic edges to compiler-grade. That is not required for +shipping the heuristic provider. + +## Related files + +- `packages/ingestion/src/providers/types.ts` — the `LanguageProvider` + interface. +- `packages/ingestion/src/providers/registry.ts` — the exhaustive map. +- `packages/ingestion/src/providers/test-helpers.ts` — `parseFixture`. +- `@opencodehub/core-types` — the `LanguageId` union. diff --git a/packages/docs/src/content/docs/contributing/commit-conventions.md b/packages/docs/src/content/docs/contributing/commit-conventions.md new file mode 100644 index 0000000..8bce313 --- /dev/null +++ b/packages/docs/src/content/docs/contributing/commit-conventions.md @@ -0,0 +1,128 @@ +--- +title: Commit conventions +description: Conventional Commits grammar, scopes, and breaking-change rules for OpenCodeHub. +sidebar: + order: 30 +--- + +OpenCodeHub follows [Conventional Commits](https://www.conventionalcommits.org/). +The commit log on `main` is the input to `release-please` — malformed +messages break version bumps, changelog generation, and release notes. So +we enforce the grammar three times: `lefthook` at `commit-msg`, the +`commitlint` GitHub Action on every PR, and `release-please` itself. + +## Grammar + +``` +(): + +[optional body] + +[optional footer(s)] +``` + +- Lowercase type. +- Scope is a single workspace package name or a meta-scope. +- Subject is imperative, ≤ 72 chars, no trailing period. +- Body wraps at 100 cols. Explain *why*, not *what* — the diff tells you + *what*. +- Footers are standard (`BREAKING CHANGE:`, `Refs: #123`, `Signed-off-by: ...`). + +If you are unsure of the type or scope: + +```bash title="Interactive Conventional Commit prompt" +pnpm run commit +``` + +That wraps Commitizen and walks you through type, scope, subject, body, +and breaking-change flags. 
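
For intuition, the header rules above (lowercase type, single scope, subject of at most 72 characters with no trailing period) can be approximated with a small check. This is only a sketch: `checkHeader` and its regex are hypothetical, the scope is assumed to be optional, and the real enforcement is whatever commitlint is configured to do in this repo.

```ts
// Hypothetical approximation of the header grammar; not commitlint's rule set.
// Groups: 1 = type, 2 = scope (assumed optional), 3 = "!" marker, 4 = subject.
const HEADER = /^([a-z]+)(?:\(([a-z0-9-]+)\))?(!)?: (.+)$/;

interface HeaderCheck {
  ok: boolean;
  problems: string[];
}

function checkHeader(header: string): HeaderCheck {
  const match = HEADER.exec(header);
  if (!match) {
    return { ok: false, problems: ["header does not match type(scope): subject"] };
  }
  const problems: string[] = [];
  const subject = match[4] ?? "";
  if (subject.length > 72) {
    problems.push("subject is longer than 72 characters");
  }
  if (subject.endsWith(".")) {
    problems.push("subject ends with a period");
  }
  return { ok: problems.length === 0, problems };
}
```

The sketch does not know the valid type or scope names, so a header such as `foo(bar): baz` passes here even though the real hooks would reject it.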
+ +## Types + +| Type | Use for | In changelog? | +|------------|------------------------------------------------------------------------|----------------------| +| `feat` | New user-facing capability (CLI flag, MCP tool, indexer, etc.) | Yes — "Features" | +| `fix` | Bug fix | Yes — "Bug Fixes" | +| `perf` | Performance improvement with no behaviour change | Yes — "Performance" | +| `revert` | Revert an earlier commit | Yes — "Reverts" | +| `docs` | Documentation-only change (this site, READMEs, ADRs, comments) | Yes — "Documentation"| +| `refactor` | Internal reshuffle, no behaviour change | Yes — "Refactoring" | +| `test` | Adding or fixing tests | Hidden | +| `build` | Build system, dependency bumps, package metadata | Hidden | +| `ci` | CI workflow change | Hidden | +| `chore` | Housekeeping that fits nowhere else | Hidden | +| `style` | Formatting only — Biome runs on pre-commit, so this is rare | Hidden | +| `release` | Release-please-authored commits only (do not use by hand) | — | + +"Hidden" means the commit is still enforced and still shows up in the +git log — it just does not appear in `CHANGELOG.md`. See +`.release-please-config.json` for the source of truth on which sections +are visible. 
+ +## Scopes + +Workspace-package scopes map 1:1 to `packages//`: + +| Scope | Package | +|---------------|-------------------------------------| +| `analysis` | `@opencodehub/analysis` | +| `cli` | `@opencodehub/cli` (bin: `codehub`) | +| `core-types` | `@opencodehub/core-types` | +| `embedder` | `@opencodehub/embedder` | +| `gym` | `@opencodehub/gym` | +| `ingestion` | `@opencodehub/ingestion` | +| `mcp` | `@opencodehub/mcp` | +| `sarif` | `@opencodehub/sarif` | +| `scanners` | `@opencodehub/scanners` | +| `scip-ingest` | `@opencodehub/scip-ingest` | +| `search` | `@opencodehub/search` | +| `storage` | `@opencodehub/storage` | +| `summarizer` | `@opencodehub/summarizer` | + +Meta-scopes cover cross-cutting changes: + +| Meta-scope | Use for | +|------------|-----------------------------------------------------------| +| `deps` | Dependency bumps not tied to one package | +| `ci` | `.github/workflows/*.yml` changes | +| `docs` | `packages/docs/**` or top-level Markdown | +| `repo` | Root-level repo files (`.gitignore`, `mise.toml`, etc.) | +| `release` | Release-please-authored PRs only | + +## Breaking changes on 0.x + +OpenCodeHub is pre-1.0. The breaking-change rule is version-dependent: + +- **On 0.x:** `feat!` and a `BREAKING CHANGE:` footer both bump the + **minor** version (0.4.2 → 0.5.0). +- **After 1.0.0:** the same signals bump the **major** version. + +The `!` form is the short one: + +``` +feat(mcp)!: drop the `cypher` tool; use `sql` instead +``` + +The footer form is equivalent and plays nicer with long explanations: + +``` +feat(mcp): switch to SCIP-backed references + +BREAKING CHANGE: the `lsp-unconfirmed` reason suffix is now +`scip-unconfirmed`. Consumers that pattern-match on the old suffix +must update. +``` + +Use either form, not both. 
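
The version-dependent rule can be written down as a tiny function. This is a hypothetical sketch of the rule as stated here, not release-please's implementation; `bumpForBreakingChange` is an illustrative name, and it only accepts plain `major.minor.patch` strings.

```ts
// Hypothetical sketch: next version after a breaking change,
// following the 0.x rule (minor bump) and the post-1.0 rule (major bump).
function bumpForBreakingChange(version: string): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) {
    throw new Error(`not a plain semver version: ${version}`);
  }
  const major = Number(match[1]);
  const minor = Number(match[2]);
  // Pre-1.0: breaking changes bump the minor version and reset patch.
  // From 1.0.0 onward: breaking changes bump the major version.
  return major === 0 ? `0.${minor + 1}.0` : `${major + 1}.0.0`;
}

bumpForBreakingChange("0.4.2"); // "0.5.0"
bumpForBreakingChange("1.2.3"); // "2.0.0"
```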
+ +## Enforcement + +| Layer | Tool | Trigger | +|--------------------|----------------------------------------|------------------------| +| Local, pre-commit | `lefthook` + `commitlint --edit` | `commit-msg` hook | +| PR | `.github/workflows/commitlint.yml` | Every PR commit | +| Release | `release-please` action on push-to-main | New commit on `main` | + +If commitlint rejects your message locally, re-run `git commit` with a +fixed message — do not `--no-verify`. The tenet applies: every failure +is a blocker. diff --git a/packages/docs/src/content/docs/contributing/dev-loop.md b/packages/docs/src/content/docs/contributing/dev-loop.md new file mode 100644 index 0000000..7d0b657 --- /dev/null +++ b/packages/docs/src/content/docs/contributing/dev-loop.md @@ -0,0 +1,141 @@ +--- +title: Dev loop +description: Tools, install commands, and the mise task catalogue for local development. +sidebar: + order: 20 +--- + +The local dev loop is three commands once your toolchain is in place. This +page covers the toolchain pins, the full `mise` task catalogue, and when +to reach for the long-running `check:full` and `acceptance` targets. + +## Toolchain pins + +| Tool | Version | How it gets installed | +|--------|--------------|-------------------------------------------| +| Node | 22 (>=22.0.0) | `mise.toml` — matches root `engines.node` | +| pnpm | 10.33.2 | `mise.toml` + `packageManager` field | +| Python | 3.12 | `mise.toml` — only needed for `packages/eval` | +| uv | latest | `mise.toml` — Python package manager | + +The Python venv for the eval harness is auto-created by `mise` via this +stanza in `mise.toml`: + +```toml title="mise.toml" +[env] +_.python.venv = { path = "packages/eval/.venv", create = true } +``` + +You do not need `pyenv`, `nvm`, `direnv`, or a hand-rolled venv. `mise` +activates tools and environment variables when you `cd` into the repo. 
+ +## Three-command dev loop + +```bash title="Daily loop" +mise install # once per machine or after mise.toml changes +pnpm install --frozen-lockfile # once per pnpm-lock.yaml change +mise run check # every time you want to know if your branch is green +``` + +`mise run check` runs lint, typecheck, test, and the banned-strings sweep +in a single chain and stops on the first failure. The equivalent +`pnpm run check` is wired to the same task. + +## Individual checks + +Run one gate at a time when you want a faster loop: + +```bash +mise run lint # Biome check across packages/**/src, packages/**/test, scripts +mise run typecheck # tsc --noEmit across every workspace package +mise run test # pnpm -r test (each package's `test` script) +mise run banned-strings # scripts/check-banned-strings.sh +``` + +## Heavier gates + +```bash +mise run check:full # check + licenses + osv +mise run acceptance # 15 Definition-of-Done gates (soft: 7, 10, 11) +mise run smoke:mcp # boot MCP server over stdio, assert tools/list +mise run test:eval # Python eval harness (pytest under uv) +mise run gym # SCIP-indexer differential gym vs. frozen baseline +``` + +`check:full` adds the license allowlist (`license-checker-rseidelsohn`) and +the `osv-scanner` vulnerability scan against `pnpm-lock.yaml`. CI runs both +on every PR. + +`acceptance` is the full v1.0 Definition-of-Done. Some gates are soft — +they log but do not block — because they depend on optional binaries +(semgrep, embedder weights) or measure timings on the local machine. 
+ +## Full task catalogue + +Every task in `mise.toml`: + +| Task | Purpose | +|--------------------------|-------------------------------------------------------------------------| +| `install` | `pnpm install --frozen-lockfile` | +| `install:update` | `pnpm install` — allows the lockfile to update | +| `install:eval` | `uv sync` inside `packages/eval` | +| `bootstrap` | `install` + `install:eval` | +| `build` | `pnpm -r build` across every package | +| `build:cli` | Build only `@opencodehub/cli` | +| `build:clean` | Clean + full rebuild | +| `clean` | `pnpm -r clean` | +| `clean:all` | Clean + delete `node_modules` everywhere | +| `cli:link` | `pnpm link --global` — expose `codehub` system-wide for dev | +| `cli:unlink` | Reverse of `cli:link` | +| `cli:pack` | Produce a distributable tarball of the CLI | +| `cli:install-global` | Install the packed tarball globally with pnpm | +| `cli:uninstall-global` | Remove the globally installed `codehub` | +| `test` | `pnpm -r test` | +| `test:eval` | Python eval harness (`uv run pytest`) | +| `lint` | `biome check .` | +| `lint:fix` | `biome check --write .` | +| `format` | `biome format --write .` | +| `typecheck` | `pnpm -r exec tsc --noEmit` | +| `banned-strings` | `scripts/check-banned-strings.sh` | +| `licenses` | License allowlist check (prod deps, private packages excluded) | +| `osv` | `osv-scanner scan source --lockfile pnpm-lock.yaml` | +| `sarif:validate` | Validate emitted SARIF against the Zod schema | +| `check` | `lint` + `typecheck` + `test` + `banned-strings` | +| `check:full` | `check` + `licenses` + `osv` | +| `acceptance` | 15 v1.0 DoD gates (`scripts/acceptance.sh`) | +| `smoke:mcp` | Boot the MCP server over stdio and assert `tools/list` | +| `commit` | Commitizen-guided Conventional Commit prompt | +| `envinfo` | Print tool versions for bug reports | +| `gym` | SCIP-indexer differential gym run | +| `gym:baseline` | Lock a new baseline manifest | +| `gym:replay` | Bit-exact replay of a frozen 
manifest | +| `gym:refresh-expected` | Refresh corpus `expected:` lists from the current manifest | +| `analyze` | `codehub analyze` against the current repo | +| `status` | `codehub status` | +| `mcp` | Start the stdio MCP server | + +## Lefthook hooks + +`lefthook install` (run once after `pnpm install`) wires three hooks: + +| Hook | Runs | +|-------------|---------------------------------------------------------| +| `pre-commit` | Biome autofix on staged `.ts/.tsx/.js/.jsx/.json/.jsonc` + banned-strings sweep | +| `commit-msg` | `commitlint --edit` on the draft message | +| `pre-push` | `tsc --noEmit` across packages + `pnpm -r test` | + +The pre-push hook is the last safety net before CI picks up your branch. +If it fails on a supposedly-unrelated test, see [Tenets](/opencodehub/contributing/overview/#tenets): +we fix it, we do not skip it. + +## When to run `acceptance` + +Before opening a PR that touches any of: + +- The analyze pipeline (`packages/ingestion`, `packages/analysis`). +- Storage (`packages/storage`). +- The MCP server (`packages/mcp`). +- The graph-hash contract (anything that could affect determinism). +- `scripts/check-banned-strings.sh` or the CI workflows. + +Otherwise `mise run check` is enough locally; CI will run the full matrix. diff --git a/packages/docs/src/content/docs/contributing/ip-hygiene.md b/packages/docs/src/content/docs/contributing/ip-hygiene.md new file mode 100644 index 0000000..12769fb --- /dev/null +++ b/packages/docs/src/content/docs/contributing/ip-hygiene.md @@ -0,0 +1,155 @@ +--- +title: IP hygiene +description: The clean-room rule, the license allowlist, banned-strings sweep, and supply-chain gates. +sidebar: + order: 50 +--- + +OpenCodeHub is a clean-room implementation distributed under Apache-2.0. +That promise has to hold end to end — in the source we write, in the +dependencies we pull, and in the binaries we ship. This page documents +the rules and the CI gates that enforce them. 
+ +## The clean-room rule + +Do not copy code, comments, or test data from any source licensed under +PolyForm, BSL, Commons Clause, GPL, or AGPL. If a prior-art project +solves a problem we also want to solve, you may read its docs and +papers, but you may not look at its source while writing ours. When in +doubt, ask. + +The rule is boring. Our enforcement is not: every file on `main` goes +through a banned-strings sweep that rejects identifiers lifted verbatim +from projects we deliberately do not copy from. If one of those names +appears in your diff, CI turns red. + +## License allowlist + +Every production (transitive) dependency must be on this list: + +``` +Apache-2.0 +MIT +BSD-2-Clause +BSD-3-Clause +ISC +CC0-1.0 +BlueOak-1.0.0 +0BSD +``` + +The check runs via +[`license-checker-rseidelsohn`](https://www.npmjs.com/package/license-checker-rseidelsohn): + +```bash title="mise.toml — licenses task" +pnpm exec license-checker-rseidelsohn \ + --onlyAllow 'Apache-2.0;MIT;BSD-2-Clause;BSD-3-Clause;ISC;CC0-1.0;BlueOak-1.0.0;0BSD' \ + --excludePrivatePackages \ + --production +``` + +`--excludePrivatePackages` skips our own workspace packages; `--production` +skips `devDependencies` (which may legitimately include non-redistributable +tooling like scanners invoked as subprocesses — see below). + +Run it locally with `mise run licenses`, or let `mise run check:full` run +it as part of the extended gate. + +:::note[Known inconsistency] +`scripts/acceptance.sh` gate 5 currently uses a shorter allowlist that +omits `BlueOak-1.0.0` and `0BSD`. The authoritative list — the one we +enforce before publishing — is the `mise.toml` / CI version above. We +plan to reconcile the acceptance script to match. If you find a +BlueOak- or 0BSD-licensed transitive dep and acceptance fails but +`mise run licenses` passes, that is why. 
+::: + +## Banned-strings sweep + +`scripts/check-banned-strings.sh` is a `git grep` sweep over every +tracked file (and every untracked, non-ignored file) for identifiers we +have agreed never to use. It runs on `pre-commit` via lefthook, on +every CI job, and as acceptance gate 4. + +The banned literals are the names of prior-art projects and internal +planning artifacts we scrubbed before going public. The exact list +lives in `scripts/check-banned-strings.sh` — read it there, do not +memorize it here. If you need to reference one of these names in +documentation (this rarely happens), add the file to the pathspec +allowlist at the bottom of that script. + +The sweep also rejects planning-code regex patterns that belong to an +older internal planning model we do not ship. The patterns themselves +live in `scripts/check-banned-strings.sh` — reference the script if +you need to know what is being rejected. + +## Vulnerability scanning + +Every CI run and `mise run check:full` pass runs +[osv-scanner](https://github.com/google/osv-scanner) against +`pnpm-lock.yaml`: + +```bash +osv-scanner scan source --lockfile pnpm-lock.yaml . +``` + +Results are uploaded as SARIF to the GitHub Security tab. Release gate +policy: zero open CVEs on the lockfile at release time. + +## CodeQL + +`.github/workflows/codeql.yml` runs GitHub's CodeQL on the TypeScript +surface. Findings surface in the Security tab and block release PRs at +`high` severity. + +## OpenSSF Scorecard + +`.github/workflows/scorecard.yml` runs the +[OpenSSF Scorecard](https://scorecard.dev/) weekly and on every push to +`main`. It checks branch-protection posture, signed releases, pinned +dependencies, CI test runs, and a dozen other supply-chain signals. The +score is visible on the repo homepage via the badge. + +## Software Bill of Materials + +`SBOM.cdx.json` at the repo root is a CycloneDX v1.5 SBOM covering the +full runtime dependency graph. 
It is regenerated on every release by +`.github/workflows/sbom.yml` and attached to the GitHub Release. + +The human-readable companion is `THIRD_PARTY_LICENSES.md`, also at the +repo root, which enumerates every third-party package with its license +text. + +## Scanners that are not permissively licensed + +Some tools we expose via `codehub scan` and `codehub ingest-sarif` +(hadolint GPL-3.0, tflint MPL-2.0/BUSL) are not on the allowlist. We +resolve this by invoking them as subprocesses only — we never `import` +them, never statically link them, and never redistribute them. The +scanners are a user-provided runtime dependency, not an OpenCodeHub +dependency. See `packages/scanners/src/` for the thin wrapper that +shells out. + +This is the same pattern GitHub CodeQL uses with third-party SARIF +producers, and the same one that OBJECTIVES.md commits to explicitly. + +## If a gate fails + +Every failure is a blocker: + +- Banned literal found → rename the identifier or remove the borrowed + text. Do not add it to the allowlist unless you have a genuine + documentation reason. +- License allowlist violation → pick a different dep, wait for the dep + to relicense, or open an ADR explaining why this one is required. +- CVE on lockfile → bump the dep, patch-pin to a fixed version, or open + an advisory waiver in the PR description. Waivers must cite the CVE, + the reason the bump is not yet possible, and a due date. + +## Related files + +- `scripts/check-banned-strings.sh` — the sweep. +- `mise.toml` — `licenses` and `osv` tasks. +- `.github/workflows/{ci,codeql,scorecard,sbom}.yml` — CI gates. +- `SBOM.cdx.json`, `THIRD_PARTY_LICENSES.md`, `NOTICE`, `LICENSE` — what + ships in every release. 
diff --git a/packages/docs/src/content/docs/contributing/overview.md b/packages/docs/src/content/docs/contributing/overview.md new file mode 100644 index 0000000..122e47e --- /dev/null +++ b/packages/docs/src/content/docs/contributing/overview.md @@ -0,0 +1,102 @@ +--- +title: Contributing overview +description: Start here before you open a pull request against OpenCodeHub. +sidebar: + order: 10 +--- + +Welcome. OpenCodeHub is an Apache-2.0 code-intelligence graph plus MCP server +for AI coding agents. The project lives on a permissive, OSS-only stack and +makes a hard promise about determinism and offline-first behaviour — so the +contribution bar is specific, not generic. + +This page is the table of contents for contributors. Read it first, then work +through the page that matches what you want to do. + +## What we ship, and what we will not + +The primary product is the `codehub` CLI plus the stdio MCP server that +agents call over JSON-RPC. The scope is captured in +[OBJECTIVES.md](https://github.com/theagenticguy/opencodehub/blob/main/OBJECTIVES.md): + +- Graph-aware context (callers, callees, processes, blast radius) in one + MCP tool call. +- Apache-2.0 end to end, with every transitive runtime dep on the + permissive allowlist. +- Local, offline-capable, deterministic index. +- Fifteen tree-sitter languages, with SCIP indexers upgrading five of + them (TypeScript, Python, Go, Rust, Java) to compiler-grade edges. + +Explicit non-goals: + +- No hosted service. DuckDB is embedded and the MCP server is a stdio + process. +- No Rust port before we can measure it is needed (see + [ADR 0002](/opencodehub/architecture/adrs/)). + +Contributions that pull the project toward either non-goal will be sent +back — kindly, but sent back. + +## Who benefits from a contribution + +Three audiences benefit from most changes: + +1. 
**Agents.** Anything that makes tool responses richer, more structured, + or less ambiguous (typed errors, `next_steps`, `_meta` envelopes) helps + automated agent loops. +2. **Contributors.** Anything that shortens the dev loop, fixes flaky + tests, or documents a sharp edge helps the next person too. +3. **End users running the CLI.** Speed, offline robustness, and better + defaults show up here. + +If a change does not pay off for at least one of these three, it probably +does not belong. + +## Where to start + +If you are looking for an easy first ticket: + +- **Add or fix a language-provider fixture.** Every provider under + `packages/ingestion/src/providers/` is backed by fixtures in + `packages/ingestion/test/fixtures//`. More fixtures means more + extraction bugs caught. See + [Adding a language provider](/opencodehub/contributing/adding-a-language-provider/). +- **Doc improvements.** This site lives in `packages/docs/`. Fix a + typo, tighten a rationale, add a diagram, link a missing ADR. +- **MCP tool polish.** Every tool lives under + `packages/mcp/src/tools/.ts`. `next_steps`, error envelopes, and + response shapes all evolve in small PRs. + +## Read before you write code + +- [Dev loop](/opencodehub/contributing/dev-loop/) — `mise install`, + `pnpm install --frozen-lockfile`, `mise run check`, the full task + catalogue. +- [Commit conventions](/opencodehub/contributing/commit-conventions/) — + Conventional Commits are required; commitlint runs locally and in CI. +- [Release process](/opencodehub/contributing/release-process/) — how + release-please turns your commits into a version bump. +- [IP hygiene](/opencodehub/contributing/ip-hygiene/) — the clean-room + rule, the license allowlist, the banned-strings sweep. +- [Adding a language provider](/opencodehub/contributing/adding-a-language-provider/) — + four steps, compile-time enforced. 
+- [Testing](/opencodehub/contributing/testing/) — Node test runner, the + Python eval harness, the MCP smoke test, the acceptance gates. + +The canonical short form of these rules lives in +[CONTRIBUTING.md](https://github.com/theagenticguy/opencodehub/blob/main/CONTRIBUTING.md). +These pages expand the rationale. + +## Tenets + +These three are non-negotiable. They are reproduced verbatim from +`CONTRIBUTING.md`: + +- **Determinism is non-negotiable** — identical inputs must yield identical + graph-hash. +- **Offline-first** — `codehub analyze --offline` must open zero sockets. +- **Clean-room IP hygiene** — when in doubt, ask. + +The deeper rationale lives in +[Architecture / Determinism](/opencodehub/architecture/determinism/) and +[IP hygiene](/opencodehub/contributing/ip-hygiene/). diff --git a/packages/docs/src/content/docs/contributing/release-process.md b/packages/docs/src/content/docs/contributing/release-process.md new file mode 100644 index 0000000..1c5e209 --- /dev/null +++ b/packages/docs/src/content/docs/contributing/release-process.md @@ -0,0 +1,128 @@ +--- +title: Release process +description: How release-please turns your Conventional Commits into a versioned release and CHANGELOG.md. +sidebar: + order: 40 +--- + +OpenCodeHub releases are automated by +[release-please](https://github.com/googleapis/release-please). You do not +tag, you do not edit `CHANGELOG.md`, you do not hand-write release notes. +You write Conventional Commits on feature branches, merge them into `main`, +and a bot opens the release PR for you. + +This page explains how that works, where the configuration lives, and what +you need to know when your change lands in a release. + +## The pipeline + +1. You merge a PR into `main`. Each commit on `main` is a Conventional + Commit (see [Commit conventions](/opencodehub/contributing/commit-conventions/)). +2. `.github/workflows/release-please.yml` runs on every push to `main` and + calls `googleapis/release-please-action@v4`. +3. 
The action reads every commit since the last release tag and decides on + a version bump using the `changelog-sections` map in + `.release-please-config.json`. +4. It opens (or updates) a single release PR titled + "chore(root): release N.N.N". The PR body is the generated changelog. +5. When a maintainer merges that PR, the action cuts git tags, generates + `CHANGELOG.md` entries, and creates a GitHub Release. + +Because the repo uses `separate-pull-requests: false`, the whole monorepo +moves in a single release PR covering all versioned packages. The +`node-workspace` plugin (with `updatePeerDependencies: true`) keeps +cross-package versions and peer ranges consistent. + +## Versioned vs. unversioned packages + +`.release-please-config.json` declares 10 versioned packages. They each +get their own `package-name` and their own tag. + +| Package | Tag prefix | +|----------------------------|--------------------------------| +| `@opencodehub/analysis` | `@opencodehub/analysis-vN.N.N` | +| `@opencodehub/cli` | `@opencodehub/cli-vN.N.N` | +| `@opencodehub/core-types` | `@opencodehub/core-types-vN.N.N` | +| `@opencodehub/embedder` | `@opencodehub/embedder-vN.N.N` | +| `@opencodehub/ingestion` | `@opencodehub/ingestion-vN.N.N` | +| `@opencodehub/mcp` | `@opencodehub/mcp-vN.N.N` | +| `@opencodehub/sarif` | `@opencodehub/sarif-vN.N.N` | +| `@opencodehub/scanners` | `@opencodehub/scanners-vN.N.N` | +| `@opencodehub/search` | `@opencodehub/search-vN.N.N` | +| `@opencodehub/storage` | `@opencodehub/storage-vN.N.N` | + +Plus the root component `opencodehub` tagged as `root-vN.N.N`. + +Four packages are intentionally unversioned: `@opencodehub/gym`, +`@opencodehub/scip-ingest`, `@opencodehub/summarizer`, and the Python +`packages/eval` harness. They ride along with the monorepo version but do +not publish tags of their own. The gym and eval are harness code, not +product. 
`scip-ingest` and `summarizer` are internal dependencies with no +external consumer at v1.0 — they will start versioning once a public +contract exists. + +## Changelog sections + +`.release-please-config.json` controls which Conventional Commit types +show up in `CHANGELOG.md`: + +| Type | Section | Visible? | +|------------|-----------------|----------| +| `feat` | Features | Yes | +| `fix` | Bug Fixes | Yes | +| `perf` | Performance | Yes | +| `revert` | Reverts | Yes | +| `docs` | Documentation | Yes | +| `refactor` | Refactoring | Yes | +| `test` | Tests | Hidden | +| `build` | Build System | Hidden | +| `ci` | CI | Hidden | +| `chore` | Chores | Hidden | +| `style` | Style | Hidden | + +Hidden sections still land in git history and still trigger a patch bump +— they just do not appear in the release notes. + +## Tags + +`include-v-in-tag: true` means every tag is `vN.N.N`, not `N.N.N`. Tag +format: `-v` (e.g. `@opencodehub/cli-v0.4.2`) plus +a root tag `root-v0.4.2`. + +## Breaking changes on 0.x + +While OpenCodeHub sits on `0.x.y`, a `feat!` or `BREAKING CHANGE:` +footer bumps the **minor** version, not the major. That is intentional: +the 0.x prefix signals "not yet stable" and we want the freedom to break +things without forcing a 1.0 → 2.0 stampede. + +After the first 1.0.0 release, the same signals bump the major version. +See the breaking-change section in +[Commit conventions](/opencodehub/contributing/commit-conventions/#breaking-changes-on-0x). + +## What you do when your PR lands + +Nothing. release-please watches `main`. When you merge, the release PR +updates automatically. If your PR is a `fix` on top of a pending release +PR, the PR title and body refresh to include your fix. If yours is the +first commit since the last release, a new release PR is opened. + +If you are the maintainer about to cut a release: + +1. Check CI on the release PR is green. +2. 
Verify the changelog reads correctly — if a `feat!` is missing from + "Features" or a `BREAKING CHANGE:` footer was not picked up, fix the + offending commit via a follow-up commit with the right prefix rather + than editing release-please's output. +3. Merge the release PR. Tags, `CHANGELOG.md`, and the GitHub Release + are produced in one push. + +## Related files + +- `.release-please-config.json` — the config described above. +- `.release-please-manifest.json` — release-please's state file. Do not + hand-edit. +- `.github/workflows/release-please.yml` — the workflow that runs the + action. +- [Commit conventions](/opencodehub/contributing/commit-conventions/) — + what your commits need to look like to drive all of the above. diff --git a/packages/docs/src/content/docs/contributing/testing.md b/packages/docs/src/content/docs/contributing/testing.md new file mode 100644 index 0000000..bfbee88 --- /dev/null +++ b/packages/docs/src/content/docs/contributing/testing.md @@ -0,0 +1,150 @@ +--- +title: Testing +description: Test harnesses — Node test runner, Python eval, MCP smoke, acceptance gates, SCIP gym. +sidebar: + order: 70 +--- + +OpenCodeHub has five test surfaces. Each runs at a different cadence +and covers a different level of the stack. This page is the map. + +## Node tests — per-package + +Every TypeScript package has its own `test` script that runs the +[Node.js test runner](https://nodejs.org/api/test.html) against compiled +output: + +```bash +pnpm -r test +``` + +Conventions: + +- Test files live alongside source as `*.test.ts`. +- `tsc` compiles them into `dist/**/*.test.js`. +- Each package's `test` script is `node --test './dist/**/*.test.js'` + (or close — check `packages//package.json` for the exact form). +- No Jest, no Vitest. The stdlib test runner keeps the dev dependency + surface small and Apache-2.0 clean. + +`mise run test` runs the full matrix after a `build`.
The `pre-push` +lefthook hook runs the same command, so you usually catch failures +before CI does. + +### When to add a Node test + +Any time you touch code under `packages/*/src/`. Fixtures live in +`packages//test/fixtures/`. The `parseFixture` helper in +`packages/ingestion` (see +[Adding a language provider](/opencodehub/contributing/adding-a-language-provider/)) +is the standard tool for ingestion-side assertions. + +## Python eval harness + +The parity and regression eval lives in `packages/eval/`. It is a +pytest suite that drives the MCP server end-to-end against fixture +repos and asserts on the tool responses. + +```bash +mise run test:eval # uv sync + uv run pytest in packages/eval/ +``` + +`mise.toml` wires a per-project venv via +`_.python.venv = { path = "packages/eval/.venv", create = true }`, so +the first run creates the venv; subsequent runs reuse it. + +There are 49 parametrized cases. The release gate (acceptance gate 9) +requires ≥ 40 / 49 to pass. This is the floor that prevents +undetected regressions in MCP tool behaviour between releases. + +### When to add an eval case + +Any time you change the shape of an MCP tool response, the resolver, +or a ranking behaviour. Fixtures live under +`packages/eval/src/opencodehub_eval/fixtures/`. Test definitions live +under `packages/eval/src/opencodehub_eval/tests/`. + +## MCP smoke test + +`scripts/smoke-mcp.sh` boots the stdio MCP server, sends +`initialize` + `tools/list`, and asserts that the advertised tool +count matches `EXPECTED_TOOLS`. Run it directly or via: + +```bash +mise run smoke:mcp +``` + +:::caution[Known drift] +`scripts/smoke-mcp.sh` defaults `EXPECTED_TOOLS=19`. +`packages/mcp/src/server.ts` currently registers **28** tools, and the +top-level README cites **27**. The smoke test is therefore wrong on any +build that has not overridden `EXPECTED_TOOLS`. 
The fix is a one-line +update to the default; until it lands, use `EXPECTED_TOOLS=28 mise run +smoke:mcp` locally, or expect the acceptance gate 8 output to reflect +the stale count. +::: + +## Acceptance gates — v1.0 Definition of Done + +`scripts/acceptance.sh` runs all 15 Definition-of-Done gates. Mandatory +gates fail the run; soft gates (gates 7, 10, 11) log timings or skip +when a dependency binary is missing and do not change the exit code. + +```bash +mise run acceptance +``` + +| Gate | What it checks | Soft? | +|------|-----------------------------------------------------------------------------|-------| +| 1 | `pnpm install --frozen-lockfile` | no | +| 2 | `pnpm -r build` | no | +| 3 | `pnpm -r test` | no | +| 4 | banned-strings sweep | no | +| 5 | license allowlist | no | +| 6 | determinism — double-run `graphHash` identical | no | +| 7 | incremental reindex timings (5-run p95, logged only) | soft | +| 8 | MCP stdio boot + `tools/list` | no | +| 9 | Python eval harness — ≥ 40 / 49 cases pass | no | +| 10 | embeddings determinism (skipped if model weights absent) | soft | +| 11 | 100-file fixture incremental timing (5-run p95, logged only) | soft | +| 12 | scanner smoke — `codehub scan --scanners semgrep` emits SARIF | no | +| 13 | SARIF Zod-schema validation | no | +| 14 | license-audit smoke via the MCP tool | no | +| 15 | verdict smoke on a 2-commit fixture | no | + +Run acceptance before opening a PR that touches the analyze pipeline, +storage, the MCP server, or anything else called out in +[Dev loop / When to run acceptance](/opencodehub/contributing/dev-loop/#when-to-run-acceptance). + +## Gym — SCIP indexer differential tests + +The gym drives each per-language SCIP indexer against a frozen baseline +manifest and asserts that precision, recall, and F1 have not regressed +per language. It is the regression gate for compiler-grade edge +upgrades. 
+ +```bash +mise run gym # run against the frozen baseline +mise run gym:baseline # lock a new baseline manifest (careful) +mise run gym:replay # bit-exact replay of a frozen manifest +``` + +Baselines live at `packages/gym/baselines/`. The differential tests run +in CI via `.github/workflows/gym.yml` on every PR that touches +`packages/scip-ingest`, `packages/ingestion`, or the frozen corpus. + +## Tenets apply to failing tests too + +Every failure — a lint warning, a flaky eval, a soft acceptance gate +that turned hard because a binary became available — is a blocker until +it is fixed or explicitly waived. See the +[tenets block](/opencodehub/contributing/overview/#tenets). + +## Related files + +- `scripts/acceptance.sh` — the 15-gate runner. +- `scripts/smoke-mcp.sh` — MCP boot smoke. +- `packages/eval/src/opencodehub_eval/tests/` — Python parametrized + eval cases. +- `packages/gym/baselines/` — frozen gym baselines. +- `.github/workflows/{ci,gym}.yml` — CI workflows. diff --git a/packages/docs/src/content/docs/guides/ci-integration.md b/packages/docs/src/content/docs/guides/ci-integration.md new file mode 100644 index 0000000..dc50fe6 --- /dev/null +++ b/packages/docs/src/content/docs/guides/ci-integration.md @@ -0,0 +1,86 @@ +--- +title: CI integration +description: Emit CI workflows, compute PR verdicts, and gate PRs on detected changes. +sidebar: + order: 80 +--- + +OpenCodeHub is built for CI from day one. Every command that matters in +a pipeline emits structured exit codes, supports `--json`, and runs +offline against the committed index. + +## Scaffold a pipeline + +```bash title="emit opinionated CI workflows" +codehub ci-init +``` + +`ci-init` detects whether the repo is on GitHub or GitLab and writes +the corresponding workflow file. Pass `--platform github`, +`--platform gitlab`, or `--platform both` to override. Use +`--main-branch release` to change the base branch, and `--force` to +overwrite an existing workflow. 
+ +The emitted workflow runs `codehub analyze`, `codehub detect-changes +--scope compare --compare-ref origin/main --strict`, `codehub scan`, +and `codehub verdict` in that order. + +## Verdict: a 5-tier PR gate + +```bash title="compute a PR verdict" +codehub verdict --base main --head HEAD +``` + +`verdict` returns one of five tiers with a deterministic exit code: + +| Tier | Exit code | Meaning | +|---|---|---| +| `auto_merge` | 0 | Low-risk, no reviewer required by the graph. | +| `single_review` | 1 | One reviewer sufficient. | +| `dual_review` | 1 | Two reviewers recommended. | +| `expert_review` | 2 | Domain owner review required. | +| `block` | 3 | Do not merge — critical blast radius or policy fail. | + +Use the exit code directly in a CI step, or pass `--json` for the full +envelope with reasoning and contributing signals. + +## Detect changes on a PR + +```bash title="map the diff to graph symbols and processes" +codehub detect-changes --scope compare --compare-ref origin/main --strict +``` + +`detect-changes` returns the list of symbols, processes, and files +touched by the diff, each tagged with a risk tier. Exit codes: + +- `0` — OK (no HIGH/CRITICAL; MEDIUM allowed unless `--strict`). +- `1` — HIGH/CRITICAL found, or MEDIUM found with `--strict`. +- `2` — the command itself crashed. 
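
The three exit codes listed above can be turned into readable labels inside a CI step. This is a minimal sketch under stated assumptions: the function name `describe_detect_changes_exit` and the label strings are hypothetical, and only the exit codes documented on this page are handled.

```shell
# Hypothetical helper (not part of codehub): map a `codehub detect-changes`
# exit code to a short label, using only the codes documented above.
describe_detect_changes_exit() {
  case "$1" in
    0) echo "ok" ;;             # no HIGH/CRITICAL (and no MEDIUM under --strict)
    1) echo "risk-found" ;;     # HIGH/CRITICAL found, or MEDIUM with --strict
    2) echo "command-error" ;;  # the command itself crashed
    *) echo "unknown" ;;        # undocumented code; treat it as a failure
  esac
}
```

In a pipeline you would typically capture the status first, for example `status=0; codehub detect-changes --scope compare --compare-ref origin/main --strict || status=$?`, and then call `describe_detect_changes_exit "$status"`. Capturing the status this way keeps a shell running under `set -e` from aborting before the label is printed.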
+ +## Exit-code reference + +| Command | Exit 0 | Exit 1 | Exit 2 | Exit 3 | +|---|---|---|---|---| +| `analyze` | success | caught error | — | — | +| `detect-changes` | OK | risk found | caught error | — | +| `verdict` | `auto_merge` | `single_review` / `dual_review` | `expert_review` | `block` | +| `scan` | clean | findings at severity | scanner crashed | — | + +## Ingesting external SARIF + +If you already run another SAST tool, ingest its SARIF output into the +graph so the same `list_findings` MCP tool surfaces both sets: + +```bash title="ingest an external SARIF file" +codehub ingest-sarif path/to/report.sarif +``` + +The findings become `Finding` nodes with `FOUND_IN` edges to the +symbol and file they reference. + +## Next + +- [CLI reference](/opencodehub/reference/cli/) — every command, every + flag. +- [Error codes](/opencodehub/reference/error-codes/) — the fixed set of + MCP error codes your CI tooling may encounter. diff --git a/packages/docs/src/content/docs/guides/cross-repo-groups.md b/packages/docs/src/content/docs/guides/cross-repo-groups.md new file mode 100644 index 0000000..782f137 --- /dev/null +++ b/packages/docs/src/content/docs/guides/cross-repo-groups.md @@ -0,0 +1,74 @@ +--- +title: Cross-repo groups +description: Query and analyse a fleet of microservices as one group with codehub group. +sidebar: + order: 70 +--- + +A platform team with 40 microservices does not want to run 40 separate +`codehub query` commands to find "the users endpoint". Groups let you +bundle several indexed repos and hit them with one cross-repo search, +one contract scan, or one status probe. + +## Create a group + +```bash title="bundle three repos into a group named fleet" +codehub group create fleet repoA repoB repoC +``` + +The repo arguments must already be indexed (registered in +`~/.codehub/registry.json`). Use `codehub list` to see what is +registered, or `codehub analyze` inside each repo to register it. 
+ +Add `--description "core platform services"` to annotate the group. + +## Sync the group + +```bash title="rebuild the cross-repo contract registry" +codehub group sync fleet +``` + +`group sync` walks every repo in the group, rebuilds the contract +registry (HTTP routes, MCP tools, shared types), and populates the +cross-link table so route-change blast-radius is visible across +repos. + +## Query across every repo + +```bash title="fused BM25 + RRF search" +codehub group query fleet "users endpoint" +``` + +Cross-repo search runs BM25 (and embedding search, when each repo has +embeddings) against every member and fuses the ranked lists with +reciprocal-rank fusion (RRF). The result is a single ranked list of +hits annotated with their source repo. + +Pass `--limit 20` (the default) or `--json` for a script-friendly +envelope. + +## Contracts and cross-links + +```bash title="list HTTP contracts and cross-repo call edges" +codehub group contracts fleet +``` + +`group contracts` surfaces every HTTP route defined in the group, the +handler that serves it, and every known consumer (caller) across the +other repos in the group. Combined with `api_impact` over MCP, this is +how platform teams see the blast radius of a route change before +shipping it. + +## Other group commands + +| Command | Purpose | +|---|---| +| `codehub group list` | List every group on this machine. | +| `codehub group status ` | Show staleness and last sync time for a group. | +| `codehub group delete ` | Drop the group (repos stay indexed). | + +## MCP equivalents + +Every `group` CLI command has an MCP tool with the same name prefix: +`group_list`, `group_query`, `group_status`, `group_contracts`, +`group_sync`. See [MCP tools](/opencodehub/mcp/tools/). 
diff --git a/packages/docs/src/content/docs/guides/indexing-a-repo.md b/packages/docs/src/content/docs/guides/indexing-a-repo.md new file mode 100644 index 0000000..0706f64 --- /dev/null +++ b/packages/docs/src/content/docs/guides/indexing-a-repo.md @@ -0,0 +1,101 @@ +--- +title: Indexing a repo +description: Run codehub analyze, add embeddings, go offline, and manage .codehub state. +sidebar: + order: 10 +--- + +`codehub analyze` is the full indexing pipeline: parse with tree-sitter +(and SCIP for the five languages that have indexers), resolve imports +and inheritance, detect processes and clusters, build BM25 and HNSW +indexes, and write everything to `.codehub/` under the repo root. + +## Basic indexing + +```bash title="index the current repo" +codehub analyze +``` + +Re-run after significant changes. A no-op short-circuit skips work if +the index already matches `HEAD`; pass `--force` to rebuild. + +## Add semantic vectors + +```bash title="full index with embeddings" +codehub analyze --embeddings +``` + +`--embeddings` computes symbol and optional file/community vectors and +writes them to the HNSW index. After this, `codehub query` fuses BM25 +and vector results via reciprocal-rank fusion (RRF). + +Memory-constrained machines can use `--embeddings-int8` for quantised +vectors, `--embeddings-workers auto` to tune the worker pool, or +`--embeddings-batch-size 32` (default) to tune batch throughput. + +## Zero-network indexing + +```bash title="offline mode — no sockets" +codehub analyze --offline +``` + +`--offline` disables every code path that would open a socket. Combine +with cached embedder weights (see `codehub setup --embeddings +--model-dir `) to index fully air-gapped. + +## Staleness and status + +```bash title="check index freshness" +codehub status +``` + +`status` compares the index against the working tree and reports +staleness. 
MCP responses also carry an envelope field +`_meta["codehub/staleness"]` whenever the index lags `HEAD`, so agents +can detect drift without polling. + +## Resetting the index + +```bash title="delete the .codehub/ directory" +codehub clean +``` + +`codehub clean --all` deletes every index registered on the machine and +wipes `~/.codehub/registry.json`. + +## Granularity + +```bash title="index at symbol, file, and community level" +codehub analyze --granularity symbol,file,community +``` + +The pipeline produces hierarchical embeddings so a single query can +surface a symbol, the file that contains it, and the community the +symbol participates in. The default granularity is `symbol`. + +## What lives in `.codehub/` + +| Path | Purpose | +|---|---| +| `graph.duckdb` | The DuckDB database with symbols, edges, processes, and embeddings. | +| `meta.json` | Index metadata (graph hash, node counts, CLI version, toolchain pins). | +| `scan.sarif` | SARIF scan output when `codehub scan` has run. | +| `sbom.cdx.json` | CycloneDX SBOM when `codehub analyze --sbom` has run. | +| `coverage/` | Coverage bridge artefacts when `--coverage` has run. | + +## Other useful flags + +- `--sbom` — emit a CycloneDX SBOM alongside the index. +- `--coverage` — bridge coverage data into the graph. +- `--summaries` / `--no-summaries` — LLM-generated symbol summaries + (default on; capped by `--max-summaries`, default auto = 10% of + callables, hard cap 500). +- `--skills` — generate Claude Code skills from the graph. +- `--wasm-only` — force the WASM fallback for every tree-sitter + grammar (sets `OCH_WASM_ONLY=1`). +- `--strict-detectors` — fail the build if a detector (DET-O-001) + regresses. +- `--verbose` — noisier logs. + +See [CLI reference: analyze](/opencodehub/reference/cli/#analyze) for +the complete flag list. 
diff --git a/packages/docs/src/content/docs/guides/troubleshooting.md b/packages/docs/src/content/docs/guides/troubleshooting.md new file mode 100644 index 0000000..a7ae383 --- /dev/null +++ b/packages/docs/src/content/docs/guides/troubleshooting.md @@ -0,0 +1,88 @@ +--- +title: Troubleshooting +description: Fix native build failures, stale indexes, ambiguous-repo errors, and Windows quirks. +sidebar: + order: 90 +--- + +## Native build failures (tree-sitter or DuckDB) + +Symptoms: `pnpm install` fails while building `tree-sitter`, +`@duckdb/node-api`, or any other native addon. Error mentions +`node-gyp`, `python`, a C/C++ compiler, or `Visual Studio Build Tools`. + +Fix: + +```bash title="probe the native toolchain" +codehub doctor +``` + +`doctor` checks Node version, the platform's C/C++ toolchain, and +whether each native module can load. Follow the remediation hints it +prints. As a fallback, run any indexing command with `--wasm-only` +(which sets `OCH_WASM_ONLY=1`) to skip native tree-sitter bindings: + +```bash title="force WASM tree-sitter" +codehub analyze --wasm-only +``` + +## Stale index + +Symptoms: MCP responses carry `_meta["codehub/staleness"]`, or +`codehub query` returns symbols that no longer exist. + +Fix: + +```bash title="check then rebuild" +codehub status +codehub analyze --force +``` + +`status` reports how far behind `HEAD` the index is. `analyze --force` +rebuilds from scratch regardless of the no-op short-circuit. Run +`codehub analyze` after every significant pull to stay aligned. + +## `AMBIGUOUS_REPO` error from MCP tools + +Symptoms: an MCP tool returns an error envelope with +`error.code: "AMBIGUOUS_REPO"`. + +Cause: you have more than one repo indexed in +`~/.codehub/registry.json`, and the tool call did not include a `repo` +argument. + +Fix: pass a `repo` argument to every per-repo tool call. The value is +the repo name from `codehub list`. 
If you are driving the server from +an agent, tell the agent to include `repo` every time. + +## Windows quirks + +Native tree-sitter and DuckDB builds on Windows require the Microsoft +C++ Build Tools plus a matching Python for `node-gyp`. In practice the +fastest fix is to run everything under WSL2 — WSL2 ships with a +working toolchain out of the box and avoids path separator issues. + +If you must stay on native Windows: + +1. Install Visual Studio Build Tools with the "Desktop development + with C++" workload. +2. Install Python from the Microsoft Store (Python 3.12). +3. `npm config set msvs_version 2022` and `npm config set python + python3.12`. +4. Re-run `pnpm install --frozen-lockfile`. +5. If anything still fails, fall back to `codehub analyze --wasm-only`. + +## The index is missing a language I expected + +Check [supported languages](/opencodehub/reference/languages/). If the +language is listed but returns no symbols, the grammar may have +failed to load natively; retry with `--wasm-only`. If the language is +not listed, it is not yet registered — see +[adding a language provider](/opencodehub/contributing/adding-a-language-provider/). + +## More help + +- `codehub doctor --verbose` dumps every probe the doctor runs. +- File an issue at + [github.com/theagenticguy/opencodehub](https://github.com/theagenticguy/opencodehub/issues) + with the `doctor` output attached. diff --git a/packages/docs/src/content/docs/guides/using-with-claude-code.md b/packages/docs/src/content/docs/guides/using-with-claude-code.md new file mode 100644 index 0000000..5241f61 --- /dev/null +++ b/packages/docs/src/content/docs/guides/using-with-claude-code.md @@ -0,0 +1,99 @@ +--- +title: Using with Claude Code +description: Wire the codehub MCP server into Claude Code via the plugin or an MCP-only config. +sidebar: + order: 20 +--- + +There are two ways to connect OpenCodeHub to Claude Code. 
The **plugin** path +adds a PreToolUse hook that auto-augments rename-class edits with `impact` and +`detect_changes`. The **MCP-only** path wires the server without the hook. + +## Plugin (preferred) + +```bash title="install the Claude Code plugin" +codehub setup --plugin +``` + +`--plugin` installs the OpenCodeHub plugin into Claude Code. The plugin +registers a PreToolUse hook that runs before any edit that looks like a +rename or a cross-file refactor. The hook calls `impact` and +`detect_changes`, then feeds the results back to Claude Code as inline +context so the agent can adjust its plan before writing a diff. + +The plugin bundles the MCP server wiring too, so you do not need to +also run `setup --editors claude-code`. + +## MCP-only + +If you prefer the raw MCP connection without the hook: + +```bash title="write .mcp.json for the current project" +codehub setup --editors claude-code +``` + +The writer targets `/.mcp.json` (Claude Code's project scope). + +**Prerequisite:** `codehub` must be on your `PATH` — run +`mise run cli:link` from a checkout, or `mise run cli:install-global` +to install the packed tarball. See +[Install](/opencodehub/start-here/install/). + +The resulting entry looks like: + +```json title=".mcp.json" +{ + "mcpServers": { + "codehub": { + "command": "codehub", + "args": ["mcp"], + "env": {} + } + } +} +``` + +The server runs over stdio. Claude Code spawns it on demand, sends +JSON-RPC over stdin/stdout, and keeps it alive for the session. 
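
To make the stdio exchange concrete, here is a hedged sketch of the first message an MCP client sends. The `initialize` method and the `jsonrpc`, `id`, and `params` fields follow the MCP specification; the protocol version string, the client name `example-client`, and the helper name `mcp_initialize_request` are illustrative assumptions rather than values taken from OpenCodeHub.

```shell
# Hypothetical helper: print one newline-terminated JSON-RPC `initialize`
# request, the shape an MCP client writes to the server's stdin.
# The protocolVersion and clientInfo values are placeholders.
mcp_initialize_request() {
  printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"example-client","version":"0.0.0"}}}'
}
```

Claude Code performs this handshake for you, so the helper is useful only for manual debugging. Piping its output into `codehub mcp` should, in principle, produce an `initialize` response on stdout, though the exact response fields depend on the server build.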
+ +:::note[Fallback for unlinked checkouts] +If you cannot put `codehub` on `PATH`, point the MCP config at the +CLI's `dist/` entrypoint instead — same behaviour, longer path: + +```json title=".mcp.json (fallback)" +{ + "mcpServers": { + "codehub": { + "command": "node", + "args": ["/abs/path/to/opencodehub/packages/cli/dist/index.js", "mcp"], + "env": {} + } + } +} +``` +::: + +## Multi-editor setup + +`--editors` accepts any comma-separated subset of +`claude-code,cursor,codex,windsurf,opencode`. The default is all five. + +```bash title="wire Claude Code and Cursor together" +codehub setup --editors claude-code,cursor +``` + +## Reverting + +```bash title="remove the codehub entry the last setup wrote" +codehub setup --editors claude-code --undo +``` + +`--undo` removes only the `codehub` entry; any other `mcpServers` +entries in `.mcp.json` are preserved. + +## Next + +- [MCP tools](/opencodehub/mcp/tools/) — the full catalogue of 28 tools + Claude Code will see. +- [MCP overview](/opencodehub/mcp/overview/) — server name, transport, + envelope conventions. diff --git a/packages/docs/src/content/docs/guides/using-with-codex.md b/packages/docs/src/content/docs/guides/using-with-codex.md new file mode 100644 index 0000000..a4e2d37 --- /dev/null +++ b/packages/docs/src/content/docs/guides/using-with-codex.md @@ -0,0 +1,69 @@ +--- +title: Using with Codex +description: Wire the codehub MCP server into OpenAI Codex via codehub setup. +sidebar: + order: 40 +--- + +Codex reads its MCP config from `~/.codex/config.toml`. It is the only +one of the five supported editors that uses TOML instead of JSON. +`codehub setup` writes the correct TOML block for you. + +## Wire the MCP server + +```bash title="write ~/.codex/config.toml" +codehub setup --editors codex +``` + +The writer merges a `[mcp_servers.codehub]` table into the existing +TOML without touching other tables. 
+ +**Prerequisite:** `codehub` must be on your `PATH` — run +`mise run cli:link` from a checkout, or `mise run cli:install-global` +to install the packed tarball. See +[Install](/opencodehub/start-here/install/). + +The resulting block looks like: + +```toml title="~/.codex/config.toml" +[mcp_servers.codehub] +command = "codehub" +args = ["mcp"] +``` + +Restart Codex after the first write so it picks up the new server. +Codex spawns the server over stdio and keeps it alive for the session. + +:::note[Fallback for unlinked checkouts] +If you cannot put `codehub` on `PATH`, point Codex at the CLI's +`dist/` entrypoint instead — same behaviour, longer path: + +```toml title="~/.codex/config.toml (fallback)" +[mcp_servers.codehub] +command = "node" +args = ["/abs/path/to/opencodehub/packages/cli/dist/index.js", "mcp"] +``` +::: + +## Multi-editor setup + +`--editors` accepts any comma-separated subset of +`claude-code,cursor,codex,windsurf,opencode`. The default is all five. + +```bash title="wire Codex alongside Claude Code" +codehub setup --editors codex,claude-code +``` + +## Reverting + +```bash title="remove only the codehub entry" +codehub setup --editors codex --undo +``` + +`--undo` removes only the `[mcp_servers.codehub]` table. Other Codex +MCP servers are left alone. + +## Next + +- [MCP tools](/opencodehub/mcp/tools/) — the catalogue of 28 tools + Codex will see. diff --git a/packages/docs/src/content/docs/guides/using-with-cursor.md b/packages/docs/src/content/docs/guides/using-with-cursor.md new file mode 100644 index 0000000..e518876 --- /dev/null +++ b/packages/docs/src/content/docs/guides/using-with-cursor.md @@ -0,0 +1,86 @@ +--- +title: Using with Cursor +description: Wire the codehub MCP server into Cursor via codehub setup. +sidebar: + order: 30 +--- + +Cursor reads MCP servers from `~/.cursor/mcp.json` (global scope, shared +across all Cursor projects). `codehub setup` writes the entry for you. 
+ +## Wire the MCP server + +```bash title="write ~/.cursor/mcp.json" +codehub setup --editors cursor +``` + +The writer merges a `codehub` entry into the existing `mcpServers` +object without touching any other servers you may already have wired. + +**Prerequisite:** `codehub` must be on your `PATH` — run +`mise run cli:link` from a checkout, or `mise run cli:install-global` +to install the packed tarball. See +[Install](/opencodehub/start-here/install/). + +The entry has the same shape as Claude Code's: + +```json title="~/.cursor/mcp.json" +{ + "mcpServers": { + "codehub": { + "command": "codehub", + "args": ["mcp"], + "env": {} + } + } +} +``` + +Restart Cursor (or reload the window) after the first write so it picks +up the new server. Cursor spawns the server over stdio and keeps it +alive for the session. + +:::note[Fallback for unlinked checkouts] +If you cannot put `codehub` on `PATH`, point Cursor at the CLI's +`dist/` entrypoint instead — same behaviour, longer path: + +```json title="~/.cursor/mcp.json (fallback)" +{ + "mcpServers": { + "codehub": { + "command": "node", + "args": ["/abs/path/to/opencodehub/packages/cli/dist/index.js", "mcp"], + "env": {} + } + } +} +``` +::: + +## Using the tools + +Open Cursor's chat, select a model that supports tool use, and ask +questions like "What is the blast radius of `validateUser`?" or "Find +me everything related to the auth token refresh flow." Cursor will +call the codehub MCP tools directly and return structured results. + +See [MCP tools](/opencodehub/mcp/tools/) for the full catalogue of 28 +tools. + +## Multi-editor setup + +`--editors` accepts any comma-separated subset of +`claude-code,cursor,codex,windsurf,opencode`. The default is all five. 
+ +```bash title="wire Cursor alongside Claude Code" +codehub setup --editors cursor,claude-code +``` + +## Reverting + +```bash title="remove only the codehub entry" +codehub setup --editors cursor --undo +``` + +`--undo` removes only the `codehub` entry from `~/.cursor/mcp.json`. +Other MCP servers are left alone. diff --git a/packages/docs/src/content/docs/guides/using-with-opencode.md b/packages/docs/src/content/docs/guides/using-with-opencode.md new file mode 100644 index 0000000..c1dee36 --- /dev/null +++ b/packages/docs/src/content/docs/guides/using-with-opencode.md @@ -0,0 +1,80 @@ +--- +title: Using with OpenCode +description: Wire the codehub MCP server into OpenCode via codehub setup. +sidebar: + order: 60 +--- + +OpenCode reads MCP servers from `/opencode.json`. The OpenCode +schema nests servers under a top-level `mcp` key with a `type: "local"` +discriminator. `codehub setup` writes the correct shape for you. + +## Wire the MCP server + +```bash title="write opencode.json in the current project" +codehub setup --editors opencode +``` + +The writer merges a `codehub` entry into the existing `mcp` object. + +**Prerequisite:** `codehub` must be on your `PATH` — run +`mise run cli:link` from a checkout, or `mise run cli:install-global` +to install the packed tarball. See +[Install](/opencodehub/start-here/install/). + +The entry looks like: + +```json title="opencode.json" +{ + "mcp": { + "codehub": { + "type": "local", + "command": ["codehub", "mcp"], + "enabled": true + } + } +} +``` + +Reload OpenCode after the first write. The server runs over stdio for +the session. 
+ +:::note[Fallback for unlinked checkouts] +If you cannot put `codehub` on `PATH`, point OpenCode at the CLI's +`dist/` entrypoint instead — same behaviour, longer path: + +```json title="opencode.json (fallback)" +{ + "mcp": { + "codehub": { + "type": "local", + "command": ["node", "/abs/path/to/opencodehub/packages/cli/dist/index.js", "mcp"], + "enabled": true + } + } +} +``` +::: + +## Multi-editor setup + +`--editors` accepts any comma-separated subset of +`claude-code,cursor,codex,windsurf,opencode`. The default is all five. + +```bash title="wire OpenCode alongside Claude Code" +codehub setup --editors opencode,claude-code +``` + +## Reverting + +```bash title="remove only the codehub entry" +codehub setup --editors opencode --undo +``` + +`--undo` removes only the `codehub` entry from `opencode.json`. Other +MCP servers configured there are left alone. + +## Next + +- [MCP tools](/opencodehub/mcp/tools/) — the catalogue of 28 tools + OpenCode will see. diff --git a/packages/docs/src/content/docs/guides/using-with-windsurf.md b/packages/docs/src/content/docs/guides/using-with-windsurf.md new file mode 100644 index 0000000..34bcb87 --- /dev/null +++ b/packages/docs/src/content/docs/guides/using-with-windsurf.md @@ -0,0 +1,80 @@ +--- +title: Using with Windsurf +description: Wire the codehub MCP server into Windsurf via codehub setup. +sidebar: + order: 50 +--- + +Windsurf reads MCP servers from `~/.codeium/windsurf/mcp_config.json`. +`codehub setup` writes the entry for you. + +## Wire the MCP server + +```bash title="write Windsurf's MCP config" +codehub setup --editors windsurf +``` + +The writer merges a `codehub` entry into the existing `mcpServers` +object without touching other servers. + +**Prerequisite:** `codehub` must be on your `PATH` — run +`mise run cli:link` from a checkout, or `mise run cli:install-global` +to install the packed tarball. See +[Install](/opencodehub/start-here/install/). 
+ +The entry uses the same shape as Claude Code and Cursor: + +```json title="~/.codeium/windsurf/mcp_config.json" +{ + "mcpServers": { + "codehub": { + "command": "codehub", + "args": ["mcp"], + "env": {} + } + } +} +``` + +Reload Windsurf after the first write so it picks up the new server. +The server runs over stdio for the lifetime of the session. + +:::note[Fallback for unlinked checkouts] +If you cannot put `codehub` on `PATH`, point Windsurf at the CLI's +`dist/` entrypoint instead — same behaviour, longer path: + +```json title="~/.codeium/windsurf/mcp_config.json (fallback)" +{ + "mcpServers": { + "codehub": { + "command": "node", + "args": ["/abs/path/to/opencodehub/packages/cli/dist/index.js", "mcp"], + "env": {} + } + } +} +``` +::: + +## Multi-editor setup + +`--editors` accepts any comma-separated subset of +`claude-code,cursor,codex,windsurf,opencode`. The default is all five. + +```bash title="wire Windsurf alongside Cursor" +codehub setup --editors windsurf,cursor +``` + +## Reverting + +```bash title="remove only the codehub entry" +codehub setup --editors windsurf --undo +``` + +`--undo` removes only the `codehub` entry. Other Windsurf MCP servers +are left alone. + +## Next + +- [MCP tools](/opencodehub/mcp/tools/) — the catalogue of 28 tools + Windsurf will see. diff --git a/packages/docs/src/content/docs/index.mdx b/packages/docs/src/content/docs/index.mdx new file mode 100644 index 0000000..0fa98d6 --- /dev/null +++ b/packages/docs/src/content/docs/index.mdx @@ -0,0 +1,92 @@ +--- +title: OpenCodeHub +description: Apache-2.0 code intelligence graph + MCP server for AI coding agents. +template: splash +hero: + tagline: Code intelligence for AI coding agents, under Apache-2.0, on an all-OSS stack. 
+ image: + file: ../../assets/logo.svg + actions: + - text: Quick start + link: /opencodehub/start-here/quick-start/ + icon: right-arrow + variant: primary + - text: View on GitHub + link: https://github.com/theagenticguy/opencodehub + icon: external + variant: minimal +--- + +import { Card, CardGrid, LinkCard } from "@astrojs/starlight/components"; + +## Why OpenCodeHub + + + + Agents get callers, callees, processes, and blast radius in one + MCP tool call — no grep round-trips, no lossy embeddings alone. + + + `codehub analyze --offline` opens zero sockets. Your code never + leaves your machine. DuckDB + `hnsw_acorn` is the entire storage + stack — no daemon, no SaaS. + + + Every runtime dep sits on a permissive allowlist (Apache-2.0 / + MIT / BSD / ISC / CC0 / BlueOak / 0BSD). Fork, embed, and ship. + + + Identical inputs produce a byte-identical graph hash. + Reproducible. Auditable. Cacheable in CI. + + + +## Start here + + + + + + + + +## For contributors + + + + + + + diff --git a/packages/docs/src/content/docs/mcp/overview.md b/packages/docs/src/content/docs/mcp/overview.md new file mode 100644 index 0000000..267734a --- /dev/null +++ b/packages/docs/src/content/docs/mcp/overview.md @@ -0,0 +1,76 @@ +--- +title: MCP overview +description: Server name, transport, capabilities, and ambient conventions for the OpenCodeHub MCP server. +sidebar: + order: 10 +--- + +OpenCodeHub ships an MCP server that any Model-Context-Protocol client +can connect to over stdio. + +## Connection + +- **Server name:** `opencodehub` +- **Transport:** stdio (JSON-RPC over stdin/stdout) +- **Launch command:** `codehub mcp` +- **Capabilities:** `tools`, `resources`, `prompts` +- **Tool count:** 28 (registered in `packages/mcp/src/server.ts`) + +Clients spawn the `codehub mcp` process and exchange JSON-RPC frames +over its stdio pipes. Signals map to clean exits: `SIGINT` → 130, +`SIGTERM` → 143, stdin close → 0. 
+ +## Client setup + +Every supported editor has a one-command setup path: + +- [Claude Code](/opencodehub/guides/using-with-claude-code/) +- [Cursor](/opencodehub/guides/using-with-cursor/) +- [Codex](/opencodehub/guides/using-with-codex/) +- [Windsurf](/opencodehub/guides/using-with-windsurf/) +- [OpenCode](/opencodehub/guides/using-with-opencode/) + +All five use `codehub setup --editors ` and write into the +editor's native MCP config location. + +## Ambient conventions + +The server follows two conventions every client should know. + +### Optional `repo` argument + +Per-repo tools accept an optional `repo` string. Resolution rules: + +- **Exactly one repo in the registry:** `repo` is optional; the server + infers it. +- **Two or more repos and `repo` omitted:** the tool returns + `AMBIGUOUS_REPO` in the error envelope with a list of registered + repos in `hint`. +- **`repo` provided:** the server uses it directly. + +### Response envelope + +Every successful tool result carries two ambient fields alongside the +tool-specific payload: + +- **`next_steps: string[]`** — one-line agent-targeted hints ("call + `context` on the top result" / "stage edits then call + `detect_changes`"). Helper: `packages/mcp/src/next-step-hints.ts`. +- **`_meta["codehub/staleness"]`** — populated only when the index + lags `HEAD`. Carries the staleness envelope so the agent can decide + whether to trust the result or ask the user to re-run `codehub + analyze`. Constant: `STALENESS_META_KEY = "codehub/staleness"`. + +Error responses instead carry `isError: true`, +`structuredContent.error`, and no payload. See +[error codes](/opencodehub/reference/error-codes/). + +## What the server exposes + +- **28 tools** — search, navigation, change analysis, findings, + verdict, routes, cross-repo groups, and metadata. See + [tools](/opencodehub/mcp/tools/). +- **7 resources** — structured views over repos, clusters, and + processes. See [resources](/opencodehub/mcp/resources/). 
+- **5 prompts** — pre-baked agent playbooks. See + [prompts](/opencodehub/mcp/prompts/). diff --git a/packages/docs/src/content/docs/mcp/prompts.md b/packages/docs/src/content/docs/mcp/prompts.md new file mode 100644 index 0000000..e018dbd --- /dev/null +++ b/packages/docs/src/content/docs/mcp/prompts.md @@ -0,0 +1,20 @@ +--- +title: MCP prompts +description: The five pre-baked prompts the opencodehub server ships. +sidebar: + order: 40 +--- + +The `opencodehub` MCP server registers five prompts. Each one is a +pre-baked playbook the agent can invoke to drive a multi-step task +with the right tool-call sequence and the right framing. + +| Prompt | Purpose | +|---|---| +| `detect-impact` | Walk a staged or compared diff through `detect_changes` → `impact` → `verdict`, then summarise risk. | +| `review-pr` | Structured PR review: findings, risk, route and contract diffs, and a recommended verdict tier. | +| `explore-area` | Onboard the agent to an unfamiliar part of the repo via `query` and `context`, grouped by process. | +| `audit-dependencies` | Inventory dependencies with `dependencies` and `license_audit`, flag license outliers, list high-risk packages. | +| `generate-map` | Emit a Markdown map of the repo (modules, routes, MCP tools) using `route_map`, `tool_map`, and clusters. | + +Implementations live under `packages/mcp/src/prompts/`. diff --git a/packages/docs/src/content/docs/mcp/resources.md b/packages/docs/src/content/docs/mcp/resources.md new file mode 100644 index 0000000..e4722c7 --- /dev/null +++ b/packages/docs/src/content/docs/mcp/resources.md @@ -0,0 +1,24 @@ +--- +title: MCP resources +description: The seven MCP resources the opencodehub server publishes. +sidebar: + order: 30 +--- + +The `opencodehub` MCP server publishes seven resources alongside its +tools. Clients that honour MCP resources (Claude Code, Cursor) can +read them directly; clients that do not can usually reach the same +data via the corresponding tool. 
+ +| URI | Purpose | +|---|---| +| `codehub://repos` | All repos registered on this machine. | +| `codehub://repo-context` | High-level profile for one repo: language mix, entry points, top processes. | +| `codehub://repo-schema` | The graph schema (node kinds, edge kinds) for one repo. | +| `codehub://repo-clusters` | All clusters (communities) detected for one repo. | +| `codehub://repo-cluster` | One cluster with its members and connecting edges. | +| `codehub://repo-processes` | All execution-flow processes detected for one repo. | +| `codehub://repo-process` | One process with its ordered steps, files, and participating symbols. | + +Each resource returns JSON. Implementations live under +`packages/mcp/src/resources/`. diff --git a/packages/docs/src/content/docs/mcp/tools.md b/packages/docs/src/content/docs/mcp/tools.md new file mode 100644 index 0000000..5a5165f --- /dev/null +++ b/packages/docs/src/content/docs/mcp/tools.md @@ -0,0 +1,85 @@ +--- +title: MCP tools +description: All 28 MCP tools the opencodehub server registers, grouped by functional cluster. +sidebar: + order: 20 +--- + +The `opencodehub` MCP server registers **28 tools**, imported and +invoked from `packages/mcp/src/server.ts`. The canonical number is +taken live from `buildServer()` at startup. + +> `scripts/smoke-mcp.sh` currently expects 19 tools in its default +> `EXPECTED_TOOLS` env var — that is a stale smoke baseline, not the +> source of truth. + +Every per-repo tool accepts an optional `repo` argument; see +[MCP overview](/opencodehub/mcp/overview/) for the resolution rules. + +## Search and navigation + +| Tool | Purpose | Primary inputs | +|---|---|---| +| `list_repos` | List indexed repos on this machine. | — | +| `query` | Hybrid BM25 + vector code-graph search, grouped by process. | `text`, `repo?`, `limit?` | +| `context` | 360-degree view of one symbol: callers, callees, processes. | `symbol`, `repo?` | +| `impact` | Change-impact blast radius with risk tier. 
| `symbol`, `depth?`, `direction?`, `repo?` | +| `pack_codebase` | Pack a repo into an LLM-ready snapshot (repomix). | `path?`, `style?` | +| `sql` | Read-only SQL against the graph store; 5 s timeout. | `query`, `repo?` | + +## Change analysis + +| Tool | Purpose | Primary inputs | +|---|---|---| +| `detect_changes` | Map a git diff to indexed symbols and processes. | `scope?`, `compareRef?`, `repo?`, `strict?` | +| `rename` | Coordinated multi-file symbol rename with confidence-tagged edits. | `from`, `to`, `repo?`, `dryRun?` | +| `list_dead_code` | List dead and unreachable-export symbols. | `repo?` | +| `remove_dead_code` | Remove dead symbols from disk. | `repo?`, `targets` | + +## Findings and verdict + +| Tool | Purpose | Primary inputs | +|---|---|---| +| `scan` | Run Priority-1 scanners and ingest findings. | `scanners?`, `severity?`, `repo?` | +| `list_findings` | List SARIF findings for a repo. | `repo?`, `severity?` | +| `list_findings_delta` | Diff SARIF findings against a baseline. | `baseline`, `repo?` | +| `verdict` | 5-tier PR verdict. | `base?`, `head?`, `repo?` | +| `risk_trends` | Per-community risk trend plus 30-day projection. | `repo?` | + +## Routes and contracts + +| Tool | Purpose | Primary inputs | +|---|---|---| +| `route_map` | Map HTTP routes to handlers and consumers. | `repo?` | +| `api_impact` | Route change blast radius. | `route`, `repo?` | +| `shape_check` | Route response-shape mismatch check. | `route`, `repo?` | +| `tool_map` | Map MCP tool definitions defined in the repo. | `repo?` | + +## Cross-repo groups + +| Tool | Purpose | Primary inputs | +|---|---|---| +| `group_list` | List cross-repo groups on this machine. | — | +| `group_query` | Cross-repo BM25 + RRF search. | `group`, `text`, `limit?` | +| `group_status` | Staleness and last-sync report for a group. | `group` | +| `group_contracts` | Cross-repo HTTP contracts plus cross-links. | `group` | +| `group_sync` | Rebuild the cross-repo contract registry. 
| `group` | + +## Metadata + +| Tool | Purpose | Primary inputs | +|---|---|---| +| `project_profile` | Summary profile for the repo (language mix, entry points, owners). | `repo?` | +| `dependencies` | List external dependencies. | `repo?` | +| `license_audit` | Audit dependency licenses against the allowlist. | `repo?` | +| `owners` | List owners for a node. | `node`, `repo?` | + +## See also + +- [MCP overview](/opencodehub/mcp/overview/) — server name, transport, + envelope conventions. +- [Error codes](/opencodehub/reference/error-codes/) — the fixed error + envelope under `structuredContent.error`. +- [Resources](/opencodehub/mcp/resources/) — structured views + alongside the tools. +- [Prompts](/opencodehub/mcp/prompts/) — pre-baked agent playbooks. diff --git a/packages/docs/src/content/docs/reference/cli.md b/packages/docs/src/content/docs/reference/cli.md new file mode 100644 index 0000000..503e0b5 --- /dev/null +++ b/packages/docs/src/content/docs/reference/cli.md @@ -0,0 +1,385 @@ +--- +title: CLI reference +description: Every codehub command, flag, and exit code. +sidebar: + order: 10 +--- + +Binary: `codehub`. Source entry: `packages/cli/src/index.ts`. Published +entry: `packages/cli/dist/index.js`. Default error contract: an +unhandled throw writes `codehub: ` to stderr and sets +`process.exitCode = 1`. + +## `analyze` + +Index a repository. Runs the full pipeline: parse, resolve, cluster, +build BM25 + HNSW indexes, and write `.codehub/`. + +```bash title="usage" +codehub analyze [path] +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--force` | off | Rebuild even if the no-op short-circuit fires. | +| `--embeddings` | off | Compute semantic vectors. | +| `--embeddings-int8` | off | Quantise vectors to int8. | +| `--granularity ` | `symbol` | Any subset of `symbol,file,community`. | +| `--embeddings-workers ` | auto | Size of the embedding worker pool. | +| `--embeddings-batch-size ` | 32 | Batch size per worker. 
| +| `--offline` | off | Zero sockets. | +| `--verbose` | off | Noisier logs. | +| `--skip-agents-md` | off | Skip AGENTS.md ingestion. | +| `--sbom` | off | Emit `sbom.cdx.json` alongside the index. | +| `--coverage` | off | Bridge coverage data into the graph. | +| `--summaries` / `--no-summaries` | on | LLM-generated symbol summaries. | +| `--max-summaries ` | auto (10% of callables, cap 500) | Summary budget. | +| `--summary-model ` | — | Override the summary model. | +| `--skills` | off | Emit Claude Code skills. | +| `--wasm-only` | off | Force WASM tree-sitter; sets `OCH_WASM_ONLY=1`. | +| `--strict-detectors` | off | Fail the build if DET-O-001 regresses. | + +Exit codes: `0` success, `1` caught error. + +## `index` + +Register an existing `.codehub/` into `~/.codehub/registry.json` without +re-analysing. + +```bash title="usage" +codehub index [paths...] +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--force` | off | Overwrite an existing registry entry. | +| `--allow-non-git` | off | Permit registering a repo with no `.git`. | + +## `init` + +Bootstrap a repo for OpenCodeHub. Copies the Claude Code plugin assets +into `.claude/` (project scope, with hook tokens rewritten from +`${CLAUDE_PLUGIN_ROOT}` to `${CLAUDE_PROJECT_DIR}/.claude`), writes +`.mcp.json`, appends `.codehub/` to `.gitignore`, and seeds +`opencodehub.policy.yaml` with every rule commented out. + +```bash title="usage" +codehub init [path] +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--force` | off | Overwrite conflicting files under `.claude/`. | +| `--skip-mcp` | off | Skip writing `.mcp.json`. | +| `--skip-policy` | off | Skip seeding `opencodehub.policy.yaml`. | + +## `setup` + +Wire MCP config into supported editors, install the Claude Code +plugin, or download embedder weights. + +```bash title="usage" +codehub setup +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--editors ` | all | `claude-code,cursor,codex,windsurf,opencode`. 
| +| `--force` | off | Overwrite existing entries. | +| `--undo` | off | Remove only the `codehub` entry each writer added. | +| `--embeddings` | off | Download the embedder model weights. | +| `--int8` | off | Download int8-quantised weights. | +| `--model-dir ` | — | Custom weights directory. | +| `--plugin` | off | Install the Claude Code plugin. | + +## `mcp` + +Launch the stdio MCP server. + +```bash title="usage" +codehub mcp +``` + +Signal handling: `SIGINT` → 130, `SIGTERM` → 143, stdin close → 0. + +## `list` + +List repos indexed on this machine. + +```bash title="usage" +codehub list +``` + +## `status` + +Report index metadata and staleness for one repo. + +```bash title="usage" +codehub status [path] +``` + +## `clean` + +Delete the index at `[path]`. + +```bash title="usage" +codehub clean [path] +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--all` | off | Delete every registered index. | + +## `pack` + +Emit a single-file, LLM-ready, AST-compressed snapshot of the repo +(powered by repomix). + +```bash title="usage" +codehub pack [path] +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--style ` | `xml` | Output format. | +| `--no-compress` | off | Disable AST compression. | +| `--remove-comments` | off | Strip comments. | +| `--out ` | — | Output file. | + +## `query` + +Hybrid BM25 + embedding search. + +```bash title="usage" +codehub query +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--limit ` | 10 | Max results. | +| `--repo ` | current | Target repo (required when >1 indexed and no cwd match). | +| `--json` | off | Structured envelope. | +| `--content` | off | Include source content per result. | +| `--context ` | — | Extra context string for re-ranking. | +| `--goal ` | — | Goal string for re-ranking. | +| `--max-symbols ` | 50 | Cap on candidate symbols. | +| `--bm25-only` | off | Skip vector search. | +| `--rerank-top-k ` | 50 | Candidates fed into the re-ranker. | +| `--zoom` | off | Zoom into processes. 
| +| `--fanout ` | — | Fan-out per process. | +| `--granularity ` | symbol | Result granularity. | + +## `context` + +Callers, callees, and processes for one symbol. + +```bash title="usage" +codehub context +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--repo ` | current | Target repo. | +| `--json` | off | Structured envelope. | + +## `impact` + +Blast-radius for one symbol. + +```bash title="usage" +codehub impact +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--depth ` | 3 | BFS depth. | +| `--direction ` | both | Traversal direction. | +| `--repo ` | current | Target repo. | +| `--json` | off | Structured envelope. | +| `--target-uid ` | — | Disambiguate by graph UID. | +| `--file-path ` | — | Disambiguate by file. | +| `--kind ` | — | Disambiguate by kind. | + +## `detect-changes` + +Map a diff to symbols and processes. + +```bash title="usage" +codehub detect-changes +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--scope ` | `all` | Diff scope. | +| `--compare-ref ` | — | Ref for `--scope compare`. | +| `--repo ` | current | Target repo. | +| `--json` | off | Structured envelope. | +| `--strict` | off | Exit 1 on MEDIUM as well. | + +Exit codes: `0` OK, `1` HIGH/CRITICAL (or MEDIUM+ `--strict`), `2` caught error. + +## `verdict` + +5-tier PR verdict. + +```bash title="usage" +codehub verdict +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--base ` | `main` | Base ref. | +| `--head ` | `HEAD` | Head ref. | +| `--repo ` | current | Target repo. | +| `--json` | off | Structured envelope. | + +Exit codes: `auto_merge=0`, `single_review=1`, `dual_review=1`, +`expert_review=2`, `block=3`. + +## `group` + +Cross-repo group management. + +```bash title="usage" +codehub group create [--description ] +codehub group list +codehub group delete +codehub group status +codehub group query [--limit ] [--json] +codehub group sync [--json] +``` + +`--limit` defaults to 20 for `group query`. 
+ +## `ingest-sarif` + +Ingest a SARIF 2.1.0 file into the graph as `Finding` nodes plus +`FOUND_IN` edges. + +```bash title="usage" +codehub ingest-sarif +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--repo ` | current | Target repo. | + +## `scan` + +Run Priority-1 scanners and ingest findings. + +```bash title="usage" +codehub scan [path] +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--scanners ` | all | Scanner IDs. | +| `--with ` | — | Additional scanners. | +| `--output ` | `/.codehub/scan.sarif` | SARIF output path. | +| `--severity ` | `HIGH,CRITICAL` | Gate severity. | +| `--repo ` | current | Target repo. | +| `--concurrency ` | — | Scanner concurrency. | +| `--timeout ` | — | Per-scanner timeout. | + +Exit codes: `0` clean, `1` findings at severity, `2` scanner crashed. + +## `doctor` + +Probe the environment. + +```bash title="usage" +codehub doctor +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--skip-native` | off | Skip native-module probes. | +| `--repoRoot ` | cwd | Repo root to probe. | + +## `bench` + +Run the acceptance-gate bench suite and emit a dashboard. + +```bash title="usage" +codehub bench +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--acceptance ` | — | Acceptance manifest. | +| `--silent` | off | Suppress console output. | + +## `wiki` + +Emit a Markdown wiki for the repo. + +```bash title="usage" +codehub wiki +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--output ` | required | Destination directory. | +| `--repo ` | current | Target repo. | +| `--json` | off | Structured envelope. | +| `--offline` | off | Incompatible with `--llm`. | +| `--llm` | off | Enrich with LLM prose. | +| `--max-llm-calls ` | 0 (dry-run) | Budget. | +| `--llm-model ` | — | Override LLM model. | + +## `ci-init` + +Emit opinionated CI workflows. + +```bash title="usage" +codehub ci-init +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--platform ` | auto-detect | Target CI. 
| +| `--main-branch ` | `main` | Base branch. | +| `--repo ` | cwd | Repo root. | +| `--force` | off | Overwrite. | + +## `augment` + +Fast BM25 enrichment for editor PreToolUse hooks. Writes to stderr so +the hook can pipe it to the agent. + +```bash title="usage" +codehub augment +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--limit ` | 5 | Max hits. | + +## `eval-server` + +Launch the persistent loopback HTTP daemon that wraps MCP handlers +(used by SWE-bench loops). + +```bash title="usage" +codehub eval-server +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--port ` | 4848 | Listen port. | +| `--idle-timeout ` | 900 | Idle timeout. | + +## `sql` + +Read-only SQL against the graph store. + +```bash title="usage" +codehub sql +``` + +| Flag | Default | Purpose | +|---|---|---| +| `--repo ` | current | Target repo. | +| `--timeout ` | 5000 | Statement timeout. | +| `--json` | off | Structured envelope. | diff --git a/packages/docs/src/content/docs/reference/configuration.md b/packages/docs/src/content/docs/reference/configuration.md new file mode 100644 index 0000000..24abfdf --- /dev/null +++ b/packages/docs/src/content/docs/reference/configuration.md @@ -0,0 +1,60 @@ +--- +title: Configuration +description: Environment variables, on-disk layout, registry, and editor setup targets. +sidebar: + order: 20 +--- + +## Environment variables + +| Name | Purpose | +|---|---| +| `OCH_WASM_ONLY` | Force the WASM fallback for every tree-sitter grammar. Set to `1` by `codehub analyze --wasm-only`. | +| `CODEHUB_HOME` | Override `~/.codehub/` (where the registry and embedder weights live). | +| `CODEHUB_EMBEDDING_URL` | Endpoint URL for an external embedding service. | +| `CODEHUB_EMBEDDING_MODEL` | Model ID to request from the embedding service. | +| `CODEHUB_EMBEDDING_DIMS` | Integer dimensionality of the embedding model. | +| `CODEHUB_EMBEDDING_API_KEY` | API key for the embedding service (sent as `Authorization: Bearer ...`). 
| +| `NO_COLOR` | Standard convention; disables colored console output. | + +## On-disk layout: `.codehub/` + +`codehub analyze` writes everything under `/.codehub/`: + +| Path | Purpose | +|---|---| +| `graph.duckdb` | Primary DuckDB database: symbols, edges, processes, embeddings. | +| `meta.json` | Index metadata: graph hash, node counts, CLI version, toolchain pins. | +| `scan.sarif` | SARIF output from `codehub scan`. | +| `sbom.cdx.json` | CycloneDX SBOM when `codehub analyze --sbom` has run. | +| `coverage/` | Coverage bridge artefacts when `--coverage` has run. | + +Safe to delete and rebuild at any time via `codehub clean` + +`codehub analyze`. + +## Registry: `~/.codehub/registry.json` + +The registry maps each registered repo to its index path. It is +consulted by: + +- Every per-repo MCP tool that accepts an optional `repo` argument. +- `codehub list`, `codehub status`, `codehub clean --all`. +- `codehub group create` when resolving repo names. + +`CODEHUB_HOME` relocates the parent directory. + +## `codehub setup` targets + +Each editor writer has a fixed target path and merges a `codehub` +entry non-destructively: + +| Editor | Path | Format | +|---|---|---| +| `claude-code` | `/.mcp.json` | JSON | +| `cursor` | `~/.cursor/mcp.json` | JSON | +| `codex` | `~/.codex/config.toml` | TOML | +| `windsurf` | `~/.codeium/windsurf/mcp_config.json` | JSON | +| `opencode` | `/opencode.json` | JSON | + +`--undo` removes only the `codehub` entry each writer added; other +entries are preserved. diff --git a/packages/docs/src/content/docs/reference/docmeta-schema.mdx b/packages/docs/src/content/docs/reference/docmeta-schema.mdx new file mode 100644 index 0000000..ab621fd --- /dev/null +++ b/packages/docs/src/content/docs/reference/docmeta-schema.mdx @@ -0,0 +1,98 @@ +--- +title: ".docmeta.json schema" +description: "Manifest written by Phase E of codehub-document. Drives --refresh and cross-reference assembly." 
+--- + +import { Aside, Code } from "@astrojs/starlight/components"; + +`codehub-document` writes a `.docmeta.json` sidecar alongside the generated +Markdown tree at the end of every Phase E run. The file is the source of truth +for `--refresh` and for `codehub status` staleness reporting. + + + +## Schema (v1) + + + +### Top-level fields + +| Field | Type | Meaning | +|---|---|---| +| `$schema` | string | JSON Schema URL for v1. Locked. | +| `generated_at` | ISO-8601 | When Phase E completed. | +| `codehub_graph_hash` | `sha256:` | Taken from `list_repos` at orchestration start. The hash that anchors this doc tree. | +| `mode` | `"single-repo" \| "group"` | Whether the tree was produced by single-repo or group invocation. | +| `repo` | string \| null | The target repo (single mode) or the group root's registered repo reference (group mode). | +| `group` | string \| null | The group name (group mode only). | +| `staleness_at` | ISO-8601 | Lifted from the last MCP response's `_meta.codehub/staleness` envelope observed during assembly. | +| `sections[]` | array | One entry per generated Markdown file. | +| `cross_repo_refs[]` | array | Cross-repo links computed by Phase E. Only populated in group mode. | +| `frontmatter_removed[]` | string[] | Paths where Phase E stripped stray YAML frontmatter. Normally empty. | + +### `sections[]` entries + +| Field | Type | Meaning | +|---|---|---| +| `path` | string | Relative path from the docs root. | +| `agent` | string | The subagent that wrote this section (`doc-architecture`, `doc-reference`, etc.). Identifies ownership for `--refresh` dispatch. | +| `sources[]` | string[] | Source-file paths this section cites. Used by `--refresh` to decide staleness via mtime comparison. | +| `mtime` | ISO-8601 | When this section file was last written. | +| `citation_count` | number | Total backtick citations extracted by Phase E. | +| `mermaid_count` | number | Fenced ```` ```mermaid ```` blocks detected. 
| + +### `cross_repo_refs[]` entries (group mode only) + +| Field | Type | Meaning | +|---|---|---| +| `repo` | string | The sibling repo being linked. | +| `from_doc` | string | Relative path (from the group docs root) of the source doc. | +| `to_doc` | string | Relative path into the sibling repo's generated docs. | +| `contract_count` | number | Number of contracts sharing source citations across this cross-repo pair. Computed from `group_contracts`. | + +## How `--refresh` uses the schema + +1. Load `.docmeta.json`. +2. Compare the manifest's `codehub_graph_hash` against `list_repos`. If they match exactly, skip to step 5. +3. For each section, `stat` every `sources[i]`. If `max(source_mtime) > section.mtime`, mark it stale. +4. Collect stale sections + owners (`section.agent`); dispatch only the owning subagents with a `sections_to_refresh` list. +5. Always re-run Phase E (cross-reference assembly is cheap and idempotent). + +See [`references/cross-reference-spec.md`](https://github.com/theagenticguy/opencodehub/blob/main/plugins/opencodehub/skills/codehub-document/references/cross-reference-spec.md) inside the plugin for the Phase E algorithm. + +## Validation + +The JSON Schema is locked at v1. Breaking changes bump to v2 and keep v1 readers working for one release cycle. Run-time validation lives in `packages/analysis/src/docmeta.ts` (written as part of spec 001 Act phase). 
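As a rough illustration of the `--refresh` steps listed earlier on this page, the staleness decision could be sketched as below. This is a minimal sketch, not the code in `packages/analysis/src/docmeta.ts`: the names `DocMeta`, `DocSection`, `MtimeLookup`, and `planRefresh` are hypothetical, the graph hash is assumed to have been fetched from `list_repos` by the caller, and ignoring missing source files is an assumption the schema does not specify.

```typescript
// Hypothetical types: field names follow the schema tables on this page,
// but the shapes are illustrative, not the project's actual definitions.
interface DocSection {
  path: string;
  agent: string;
  sources: string[];
  mtime: string; // ISO-8601
}

interface DocMeta {
  codehub_graph_hash: string;
  sections: DocSection[];
}

// Assumed helper: returns a source file's mtime in epoch milliseconds,
// or undefined when the file cannot be stat'ed.
type MtimeLookup = (sourcePath: string) => number | undefined;

// Returns owner agent -> sections_to_refresh. An empty map means only
// Phase E needs to run.
function planRefresh(
  meta: DocMeta,
  currentGraphHash: string,
  mtimeOf: MtimeLookup,
): Map<string, string[]> {
  const sectionsToRefresh = new Map<string, string[]>();

  // Step 2: identical hashes skip straight to Phase E.
  if (meta.codehub_graph_hash === currentGraphHash) {
    return sectionsToRefresh;
  }

  for (const section of meta.sections) {
    const writtenAt = Date.parse(section.mtime);
    // Step 3: a section is stale when its newest source is newer than it.
    // Assumption: sources that cannot be stat'ed are ignored.
    const newestSource = Math.max(
      ...section.sources.map((source) => mtimeOf(source) ?? -Infinity),
    );
    if (newestSource > writtenAt) {
      // Step 4: group stale sections under the subagent that owns them.
      const owned = sectionsToRefresh.get(section.agent) ?? [];
      owned.push(section.path);
      sectionsToRefresh.set(section.agent, owned);
    }
  }
  return sectionsToRefresh;
}
```

Passing the mtime lookup in as a function keeps the decision logic free of filesystem access, so it can be exercised with fixed timestamps.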
+ +## See also + +- [ADR 0009 — Artifact output conventions](/opencodehub/architecture/adrs/#adr-0009--artifact-output-conventions) +- [Skills index](/opencodehub/skills/) +- [`codehub-document` skill](/opencodehub/skills/codehub-document/) diff --git a/packages/docs/src/content/docs/reference/error-codes.md b/packages/docs/src/content/docs/reference/error-codes.md new file mode 100644 index 0000000..7243879 --- /dev/null +++ b/packages/docs/src/content/docs/reference/error-codes.md @@ -0,0 +1,50 @@ +--- +title: Error codes +description: The fixed set of MCP error codes returned under structuredContent.error. +sidebar: + order: 30 +--- + +Every MCP tool that fails gracefully (i.e. the tool ran but the +operation could not complete) returns a uniform envelope under +`structuredContent.error` with `isError: true`. Protocol-level +failures (unknown tool name, malformed JSON-RPC) raise the SDK's +`McpError` instead and are not enumerated here. + +The canonical list lives at +[`packages/mcp/src/error-envelope.ts`](https://github.com/theagenticguy/opencodehub/blob/main/packages/mcp/src/error-envelope.ts). + +## Codes + +| Code | When it fires | Typical remediation | +|---|---|---| +| `STALENESS` | The index lags `HEAD` far enough to mistrust results. | `codehub analyze` (or `--force`). | +| `INVALID_INPUT` | A tool argument failed schema validation. | Correct the call; check required fields. | +| `NOT_FOUND` | The target symbol, repo, or group does not exist. | Confirm the name; run `codehub list` for repos. | +| `DB_ERROR` | DuckDB returned an error during the query. | Check `codehub doctor`; inspect `.codehub/graph.duckdb`. | +| `SCHEMA_MISMATCH` | The index was produced by a different CLI version with an incompatible schema. | `codehub analyze --force` to rebuild. | +| `RATE_LIMITED` | A downstream service (embedder, summariser) rate-limited the request. | Retry with backoff; reduce concurrency. 
| +| `INTERNAL` | Catch-all for unhandled exceptions reaching the tool boundary. | File an issue with the error `message`. | +| `NO_INDEX` | The repo has no `.codehub/` directory. | `codehub analyze `. | +| `AMBIGUOUS_REPO` | More than one repo is indexed and no `repo` argument was supplied. | Pass `repo` to the tool call. | + +## Envelope shape + +```json title="error envelope" +{ + "isError": true, + "content": [ + { "type": "text", "text": "Error (AMBIGUOUS_REPO): ...\nHint: ..." } + ], + "structuredContent": { + "error": { + "code": "AMBIGUOUS_REPO", + "message": "Multiple repos registered; specify `repo`.", + "hint": "One of: acme-api, acme-web" + } + } +} +``` + +Clients should key on `structuredContent.error.code` to decide whether +to retry, disambiguate, or abort. diff --git a/packages/docs/src/content/docs/reference/languages.md b/packages/docs/src/content/docs/reference/languages.md new file mode 100644 index 0000000..3290503 --- /dev/null +++ b/packages/docs/src/content/docs/reference/languages.md @@ -0,0 +1,72 @@ +--- +title: Supported languages +description: The 15 registered languages, which have SCIP indexers, and the WASM fallback. +sidebar: + order: 40 +--- + +Languages are registered at compile time in a `satisfies Record` table. Omitting a registered language raises a +build-time TypeScript error, so the table and this page cannot drift. + +## Registered languages (15) + +| Language | tree-sitter parse | SCIP indexer | +|---|---|---| +| TypeScript | yes | yes | +| TSX | yes | yes (via TypeScript) | +| JavaScript | yes | yes (via TypeScript) | +| Python | yes | yes | +| Go | yes | yes | +| Rust | yes | yes | +| Java | yes | yes | +| C# | yes | — | +| C | yes | — | +| C++ | yes | — | +| Ruby | yes | — | +| Kotlin | yes | — | +| Swift | yes | — | +| PHP | yes | — | +| Dart | yes | — | + +The five languages with a SCIP indexer get precise cross-file reference +resolution (ADR 0005). 
The other ten rely on tree-sitter's +symbol-level resolution, which is good enough for blast-radius within +a single module and degrades gracefully across module boundaries. + +## Native bindings and the WASM fallback + +Every grammar is loaded via native tree-sitter bindings by default. +Native bindings are faster but require a working C/C++ toolchain +(`node-gyp` + MSVC on Windows, `clang` + headers on macOS, `gcc` + +headers on Linux). They are compiled on install from source pins in +`packages/ingestion/package.json`. + +If native bindings fail to load — common on some minimal Linux +containers and on Windows without the Build Tools — run with +`--wasm-only` or export `OCH_WASM_ONLY=1`: + +```bash title="force WASM for every grammar" +codehub analyze --wasm-only +``` + +WASM is slightly slower but has no native dependency. The web surface +of OpenCodeHub always runs in WASM-only mode. + +## Adding a language + +Four steps, all committed together: + +1. Pin the tree-sitter grammar in `packages/ingestion/package.json`. +2. Implement `LanguageProvider` in + `packages/ingestion/src/providers/.ts`. +3. Add the entry to the registry in + `packages/ingestion/src/providers/registry.ts` — TypeScript fails + the build if the key is missing. +4. Add fixture tests under + `packages/ingestion/test/fixtures//`, using the + `parseFixture` helper from `test-helpers.ts`. + +See +[adding a language provider](/opencodehub/contributing/adding-a-language-provider/) +for the full walkthrough. diff --git a/packages/docs/src/content/docs/skills/codehub-contract-map.mdx b/packages/docs/src/content/docs/skills/codehub-contract-map.mdx new file mode 100644 index 0000000..3f2ddb2 --- /dev/null +++ b/packages/docs/src/content/docs/skills/codehub-contract-map.mdx @@ -0,0 +1,89 @@ +--- +title: "codehub-contract-map" +description: "Group-only. Consumer/producer contract matrix across a repo group, with Mermaid flow." 
+--- + +import { Aside } from "@astrojs/starlight/components"; + +Standalone group-only skill. Renders `group_contracts` into a Markdown + +Mermaid artifact. Fires on direct invocations ("map the contracts") +without needing the full `codehub-document` orchestration. + + + +## Frontmatter + +```yaml +name: codehub-contract-map +argument-hint: " [--output ] [--committed]" +color: magenta +model: sonnet +``` + +## Preconditions + +1. A `` positional argument is required. Missing or unknown group: + `Contract map requires a named group — run 'codehub group list' to see registered groups.` +2. Every member repo must be `fresh` per `mcp__opencodehub__group_status`. Stale members abort with named repos. + +## Process + +1. `mcp__opencodehub__group_list` — confirm ``. +2. `mcp__opencodehub__group_status({group})` — confirm freshness per member. +3. `mcp__opencodehub__group_contracts({group})` — the spine. +4. If zero contracts: write the artifact with a "No inter-repo contracts detected" banner. **Don't error** (spec 001 AC-5-5). +5. `mcp__opencodehub__group_query({group, text: "api handlers"})` — disambiguate producer-side locations. +6. `mcp__opencodehub__route_map({repo})` per member — for handler citations. +7. Build the N×N consumer/producer matrix + Mermaid flow + notable-contracts list. +8. Write to the resolved output path. + +## Output shape + +```markdown +# · Contract map + +## Contracts matrix +Rows = producers, columns = consumers. Cell = contract count. + +| | billing | core | web | +|-------|---------|------|-----| +| billing | — | 3 | 5 | +| core | — | — | 12 | +| web | — | — | — | + +## Flow +```mermaid +flowchart LR + web --> billing : 5 + web --> core : 12 + billing --> core : 3 +``` + +## Notable contracts +- **`web:packages/checkout/src/api.ts:22`** → **`billing:packages/api/src/handlers/invoice.ts:45`** + - Method: `POST /v1/invoices` + - Shape: `{amount, userId, idempotencyKey}` +... 
+``` + + + +## Arguments + +| Flag | Meaning | +|---|---| +| `` (required) | The group to map. Must appear in `group_list`. | +| `--output ` | Override output path. | +| `--committed` | Write to `docs//contracts.md` instead of `.codehub/groups//contracts.md`. | + +## Related + +- [codehub-document](/opencodehub/skills/codehub-document/) — full group-mode docs +- [ADR 0007 — Artifact factory](/opencodehub/architecture/adrs/#adr-0007--artifact-factory) +- [Skills index](/opencodehub/skills/) diff --git a/packages/docs/src/content/docs/skills/codehub-document.mdx b/packages/docs/src/content/docs/skills/codehub-document.mdx new file mode 100644 index 0000000..8e554d6 --- /dev/null +++ b/packages/docs/src/content/docs/skills/codehub-document.mdx @@ -0,0 +1,121 @@ +--- +title: "codehub-document" +description: "Primary artifact generator. Single-repo and group mode, 4-phase orchestration, .docmeta.json sidecar." +--- + +import { Aside, Tabs, TabItem } from "@astrojs/starlight/components"; + +Primary artifact generator. Applies the proven four-phase `/document` pattern +to OpenCodeHub's graph and extends it with first-class **group mode**. + +Writes a tree of cross-linked Markdown under `.codehub/docs/` (single-repo) +or `.codehub/groups//docs/` (group mode) plus a `.docmeta.json` +sidecar that drives `--refresh`. + +## Frontmatter + +```yaml +name: codehub-document +argument-hint: "[output-dir] [--group ] [--committed] [--refresh] [--section ]" +color: indigo +model: sonnet +``` + + + +## Preconditions + +1. `mcp__opencodehub__list_repos` returns the target. Otherwise: run `codehub analyze`. +2. `codehub status` reports fresh. Otherwise: run `codehub analyze`. +3. Group mode only: every member repo must be `fresh` per `mcp__opencodehub__group_status`. Stale members abort with named repos. + +## Four-phase orchestration + + + + Inline, no subagent. 
Writes two shared-context files on disk: + + - **`/.context.md`** (hard 200-line cap) — repo profile, top communities, top processes, routes, MCP tools, owners summary, staleness envelope. Group mode adds the manifest + contracts matrix + freshness table. + - **`/.prefetch.md`** — newline-delimited JSON ledger of tool calls with `{tool, args, sha256, keys, cached_at, truncated}`. Subagents read this instead of re-calling tools. + + Prompt dedup via filesystem, not copy-paste. + + + Four subagents dispatched in a single message: + + - `doc-architecture` → `architecture/{system-overview,module-map,data-flow}.md` + - `doc-reference` → `reference/{public-api,cli,mcp-tools}.md` + - `doc-behavior` → `behavior/{processes,state-machines}.md` + - `doc-analysis` → `analysis/{risk-hotspots,ownership,dead-code}.md` + + In group mode, fan-out multiplies by member count (4 × N subagents). + Claude Code's concurrent-Agent ceiling is ~10 per message — groups of + 3+ repos batch by role. + + + Two subagents in parallel: + + - `doc-diagrams` → `diagrams/{architecture,behavioral,structural}/*.md` + - `doc-cross-repo` → `cross-repo/{portfolio-map,contracts-matrix,dependency-flow}.md` *(group mode only)* + + Skipped silently in single-repo mode. + + + **Deterministic Markdown assembly. No LLM call.** + + 1. Regex over backtick `path:LOC` (or `repo:path:LOC`) citations. + 2. Build co-occurrence index: `source_file → [docs_citing_it]`. + 3. For any two docs sharing ≥ 2 common sources, append `## See also` footers. + 4. In group mode: add `## See also (other repos in group)` to every `cross-repo/*.md`. + 5. Write `README.md` (landing page with determinism disclaimer) + `.docmeta.json`. + + Same inputs, same output. See [`.docmeta.json` schema](/opencodehub/reference/docmeta-schema/). + + + +## Arguments + +| Flag | Meaning | +|---|---| +| `[output-dir]` | Where to write. Default `.codehub/docs/` (gitignored). With `--committed`, default flips to `docs/codehub/`. 
| +| `--group ` | Enable group mode. Phase 0 calls `group_list` + `group_status` + `group_contracts` + `group_query`. Phase CD dispatches `doc-cross-repo`. | +| `--committed` | Write to a committed path instead of `.codehub/docs/`. Does not touch `.gitignore`. | +| `--refresh` | Regenerate only sections whose `sources[]` mtimes are newer than the section's `mtime`. Phase E always re-runs. | +| `--section ` | Regenerate one named section (e.g., `architecture/system-overview`). | + +## Invocation examples + +```bash +# Single-repo, default gitignored output +/codehub-document + +# Group mode with an explicit output +/codehub-document docs/platform --group platform --committed + +# Refresh stale sections only +/codehub-document --refresh + +# One-section regenerate +/codehub-document --section architecture/system-overview +``` + +## Output contract + +See [ADR 0009](/opencodehub/architecture/adrs/#adr-0009--artifact-output-conventions) for the full contract. + +- No YAML frontmatter on outputs. +- Every factual claim carries a backtick `path:LOC` citation (or `repo:path:LOC` in group mode). +- Mermaid diagrams only (no SVG/PNG). +- `.docmeta.json` is the source of truth for `--refresh` and staleness. 
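The citation contract above is what makes the deterministic Phase E assembly possible. A minimal sketch of the co-occurrence step follows, under stated assumptions: the regex, the function names `citedSources` and `seeAlsoLinks`, and the single-line-number `LOC` form are illustrative guesses, not the plugin's actual implementation.

```typescript
// Assumed citation shape: `path:LOC` or `repo:path:LOC` inside backticks,
// where LOC is a single line number. The real Phase E regex may differ.
const CITATION = /`([^`\s]+):(\d+)`/g;

// Collect the distinct sources (path, or repo:path) that one document cites.
function citedSources(markdown: string): Set<string> {
  const sources = new Set<string>();
  for (const match of markdown.matchAll(CITATION)) {
    sources.add(match[1]);
  }
  return sources;
}

// Map each doc to the docs it shares at least `minShared` sources with.
function seeAlsoLinks(
  docs: Record<string, string>,
  minShared = 2,
): Map<string, string[]> {
  // Co-occurrence index: source_file -> [docs_citing_it].
  // Keys are sorted so the same inputs always yield the same output.
  const citedBy = new Map<string, string[]>();
  for (const docPath of Object.keys(docs).sort()) {
    for (const source of citedSources(docs[docPath])) {
      citedBy.set(source, [...(citedBy.get(source) ?? []), docPath]);
    }
  }

  // Count how many sources each ordered pair of docs has in common.
  const shared = new Map<string, Map<string, number>>();
  for (const citing of citedBy.values()) {
    for (const a of citing) {
      for (const b of citing) {
        if (a === b) continue;
        const row = shared.get(a) ?? new Map<string, number>();
        row.set(b, (row.get(b) ?? 0) + 1);
        shared.set(a, row);
      }
    }
  }

  // Keep only pairs that meet the threshold; these would get a
  // "## See also" footer.
  const links = new Map<string, string[]>();
  for (const [doc, row] of shared) {
    const related = [...row]
      .filter(([, count]) => count >= minShared)
      .map(([other]) => other)
      .sort();
    if (related.length > 0) links.set(doc, related);
  }
  return links;
}
```

Because the sketch uses only string matching and sorted iteration, it involves no LLM call and returns the same links for the same inputs, which is the property the Phase E description asks for.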
+ +## Related + +- [ADR 0007 — Artifact factory](/opencodehub/architecture/adrs/#adr-0007--artifact-factory) +- [ADR 0008 — Document pattern port](/opencodehub/architecture/adrs/#adr-0008--document-pattern-port) +- [ADR 0009 — Output conventions](/opencodehub/architecture/adrs/#adr-0009--artifact-output-conventions) +- [`.docmeta.json` schema](/opencodehub/reference/docmeta-schema/) +- [Skills index](/opencodehub/skills/) diff --git a/packages/docs/src/content/docs/skills/codehub-onboarding.mdx b/packages/docs/src/content/docs/skills/codehub-onboarding.mdx new file mode 100644 index 0000000..10d9074 --- /dev/null +++ b/packages/docs/src/content/docs/skills/codehub-onboarding.mdx @@ -0,0 +1,86 @@ +--- +title: "codehub-onboarding" +description: "ONBOARDING.md with a graph-centrality-ranked reading order and an end-to-end process walk." +--- + +import { Aside } from "@astrojs/starlight/components"; + +Produces a single ONBOARDING.md. The wedge is the **ranked reading order** +drawn from graph centrality — a generic README scaffold cannot produce this. + +## Frontmatter + +```yaml +name: codehub-onboarding +argument-hint: "[output-path] [--committed]" +color: green +model: sonnet +``` + +## Preconditions + +- `mcp__opencodehub__list_repos` must return the target. +- `codehub status` must be fresh. + +Both refuse loudly with a one-line remediation hint per spec 001 AC-3-1. + +## Process + +1. `mcp__opencodehub__project_profile` — languages, stacks, entry points. +2. `mcp__opencodehub__route_map` / `mcp__opencodehub__tool_map` — HTTP / MCP surface. +3. `mcp__opencodehub__sql` for top-centrality nodes: + ```sql + SELECT name, file_path, in_degree + out_degree AS centrality + FROM nodes + WHERE kind IN ('File','Module','Class') + ORDER BY centrality DESC + LIMIT 15 + ``` +4. `mcp__opencodehub__context` on the top 8 for one-line summaries. +5. `mcp__opencodehub__owners` on top 3 folders → "ask these humans" table. +6. Dispatch one specialty `doc-onboarding` subagent. +7. 
Assemble ONBOARDING.md and write to the resolved output path. + +## Output shape + +```markdown +# · Onboarding + +## TL;DR +2 sentences — what this repo does + the mental model to hold. + +## Stack +| Layer | Tech | Source | + +## Read these 10 files first (in order) +1. `packages/cli/src/bin.ts` — CLI entry point. (45 LOC) +2. `packages/mcp/src/server.ts` — MCP bootstrap. (320 LOC) +... (ranked by centrality) + +## Walk one process end-to-end +(the highest-step-count process, traced step by step) + +## Ask these humans +| Area | Owner | Share | + +## Next steps +- Concrete first actions. +``` + +## Arguments + +| Flag | Meaning | +|---|---| +| `[output-path]` | Where to write. Default: `.codehub/ONBOARDING.md` (gitignored). With `--committed`: `docs/ONBOARDING.md`. | +| `--committed` | Opt in to a committed path. | + + + +## Related + +- [codehub-document](/opencodehub/skills/codehub-document/) — for the full architecture book +- [Skills index](/opencodehub/skills/) diff --git a/packages/docs/src/content/docs/skills/codehub-pr-description.mdx b/packages/docs/src/content/docs/skills/codehub-pr-description.mdx new file mode 100644 index 0000000..1f08d60 --- /dev/null +++ b/packages/docs/src/content/docs/skills/codehub-pr-description.mdx @@ -0,0 +1,72 @@ +--- +title: "codehub-pr-description" +description: "Draft a PR body from detect_changes + verdict + owners + findings-delta. Refuses on a clean tree." +--- + +Linear skill. No subagents. Sonnet. Writes a Markdown PR body you can +paste into `gh pr create --body-file` (or let the Claude Code session +drive the GitHub CLI directly). + +## Frontmatter + +```yaml +name: codehub-pr-description +argument-hint: "[--base ] [--head ] [--out ]" +color: teal +model: sonnet +``` + +## Preconditions + +- `git diff --name-only ..` must return ≥ 1 path. **Refuses on a clean tree** with `No diff detected — resolve base/head or stage changes.` + +## Process + +1. Resolve `--base` (default `main`) and `--head` (default `HEAD`). +2. 
`mcp__opencodehub__detect_changes({base, head})` → affected symbols + processes. +3. `mcp__opencodehub__verdict({base, head})` → 5-tier merge recommendation. +4. `mcp__opencodehub__owners({paths})` → required reviewers per path. +5. `mcp__opencodehub__list_findings_delta({base, head})` → new / resolved scanner findings. +6. For verdict tier ≥ 3: `mcp__opencodehub__impact({symbol, direction: "downstream", depth: 2})` — spell out who breaks. +7. For public API changes: `mcp__opencodehub__api_impact({route})` when the diff touches a handler. +8. Assemble the Markdown body and write to `` (default `.codehub/pr/PR-.md`). + +## Output shape + +```markdown +# + +## Summary +2–3 sentences — what changes, why. + +## Verdict +**Tier