feat!: replace LSP oracle with SCIP indexers (TS/Py/Go/Rust/Java) #32
BREAKING CHANGE: @opencodehub/lsp-oracle is deleted. Oracle-edge
provenance prefixes change from `{pyright,typescript-language-server,
gopls,rust-analyzer}@<ver>` to `scip:<indexer>@<ver>`. The
`+lsp-unconfirmed` demoted-reason suffix becomes `+scip-unconfirmed`.
What changes:
- New `@opencodehub/scip-ingest` package: hand-rolled SCIP protobuf
reader, caller→callee derivation via enclosing-range attribution,
graph materialization (blast-radius / reachability / SCC), and
per-language indexer runners (scip-typescript, scip-python, scip-go,
rust-analyzer --scip, scip-java).
- Single `scip-index` ingestion phase replaces the four per-language
LSP upgrade phases. Preserves the `confidence=1.0 + reason prefix`
contract that confidence-demote / summarize / mcp-confidence /
cli-analyze auto-cap rely on.
- Gym replay harness migrated to the same SCIP rails: ScipClient runs
the indexer once, parses, pre-builds occurrence + definition
lookups, answers queryCallers/References/Implementations in-memory.
- New output-side `codehub pack` CLI + `pack_codebase` MCP tool
wrapping repomix. ADR 0005 documents why repomix cannot replace the
tree-sitter input-side chunker (per-file blobs only, no symbol
metadata, tokenizer mismatch, tsx/kotlin gap).
Numbers:
- Net −12,037 LOC across the diff (−15,572 deletions, +3,535 additions).
- 1,414/1,414 tests pass across 14 packages.
- End-to-end smoke on the bundled calcpkg SCIP fixture: add() tops
blast ranking (bwd_reach=9), matching the POC.
Follow-ups (captured in .erpaval/sessions/session-f8a300bc/lessons.yaml):
regenerate gym corpus baselines against SCIP indexers, retarget
.github/workflows/gym.yml caches, supersede ADR 0003 with a
SCIP-indexer pin ADR, extend the parser to decode
SymbolInformation.relationships for first-class IMPLEMENTS edges.
Follow-ups to the SCIP rip-and-replace (PR #32). The gym matrix now
exercises the SCIP indexers end-to-end on every PR.
CI workflow
- .github/workflows/gym.yml rewritten around a pinned SCIP-indexer env
block (scip-typescript@0.4.0, scip-python@0.6.6, scip-go@v0.2.3,
rust-analyzer stable component, scip-java@0.12.3 when a corpus lands).
Installs scip-python / scip-typescript via npm, scip-go via the
scip-code/ fork path (go.mod moved mid-2025), rust-analyzer via rustup
component. Build scripts are opt-in via CODEHUB_ALLOW_BUILD_SCRIPTS=1.
- Python jobs install fixture deps into a shared venv before indexing
so scip-python can resolve pip-registered packages.
Baselines + corpora
- All 6 corpus YAMLs flipped to SCIP tool names (scip-python,
scip-typescript, scip-go, rust-analyzer). Corpus path for sdk-python
fixed from `sdk-python` to `python/sdk-python` to match the other
corpus layouts.
- packages/gym/baselines/manifest.jsonl regenerated end-to-end against
the five SCIP indexers (62 records, 32 scored cases, 30 waived incl. 5
auto-waives for targets with zero matches inside the fixture).
- All 62 corpus `expected:` sets rewritten from the new manifest via a
new PEP 723 helper at
`packages/gym/baselines/scripts/refresh-expected.py` (uv-runnable,
auto-waives empty-result cases).
- packages/gym/baselines/performance.json toolchain section flipped to
SCIP indexers; fixture numbers are refreshed on each CI run via
run-smoke.mjs.
- run-analyze-with-stats.mjs + run-smoke.mjs renamed lspPhaseEdges ->
scipPhaseEdges; reason-prefix SQL switched to `scip:%`.
Parser + factory
- scip-ingest parser now decodes SymbolInformation.relationships so
IMPLEMENTS / TYPE_OF edges are first-class (DerivedRelation), no longer
best-effort. scip-factory.queryImplementations uses them.
- scip-factory gained a symbolName-based resolver path so corpus
targets that pin (line=1, col=1) + symbolName still resolve, and
queryCallers falls back to occurrence-containment attribution for
class / trait callees that aren't function-like.
- Monorepo TS runner now resolves tsconfig.json under `app/` (or any
conventional subdir) and passes it as a positional project arg to
scip-typescript so emitted doc paths stay rooted at projectRoot.
- sdk-python registered as a proper git submodule pinned to
5a6df59502dc618781b85e80b01706a19cd45828.
Docs
- ADR 0003 marked superseded.
- ADR 0006 added: SCIP indexer pin table + bump procedure.
- OBJECTIVES.md / SPECS.md restated in SCIP terms (phase DAG, oracle
contract, +scip-unconfirmed suffix).
- plugins/opencodehub/skills/{opencodehub-guide,opencodehub-impact-analysis}/SKILL.md:
confidenceBreakdown + SQL filter restated for SCIP prefixes.
- packages/gym/README.md + corpus/*/README.md restated.
- mise.toml dropped test:integration + validate:lsp-oracle; gym:* task
descriptions updated; gym:refresh-expected task added.
Verification
- pnpm run check: green (lint + typecheck + 1,414 tests +
banned-strings).
- Local end-to-end gym replay on all 6 corpora:
go/scip-go/callers cases=2 F1=1.000
go/scip-go/references cases=1 F1=1.000
python/scip-python cases=12 F1=1.000
rust/rust-analyzer cases=8 F1=1.000
typescript/scip-typescript cases=9 F1=1.000
gates: all passed. 32 scored, 0 failed, 30 waived.
- gym (go) failed because scip-go v0.2.3 requires Go >= 1.25 (module
declares it in its go.mod); the workflow was still pinned to Go 1.23
from the gopls era. Bump to 1.26 matches ADR 0006.
- gym (rust) failed because rust-analyzer scip reads stdlib for
cross-crate resolution and needs the rust-src component, not just
rust-analyzer. Install both via dtolnay/rust-toolchain.
## Summary
- Rip out `@opencodehub/lsp-oracle` and the four per-language LSP upgrade
phases; replace with a single `scip-index` phase backed by the new
`@opencodehub/scip-ingest` package. Full coverage of TypeScript, Python,
Go, Rust, and Java via the native SCIP indexers (scip-typescript,
scip-python, scip-go, rust-analyzer --scip, scip-java).
- Gym replay harness migrated to the same SCIP rails (`ScipClient` runs
the indexer once, pre-builds in-memory occurrence + definition lookups,
answers queryCallers/References/Implementations without re-decoding).
- New output-side `codehub pack` CLI + `pack_codebase` MCP tool wrapping
`repomix --compress`. ADR 0005 documents why repomix cannot replace the
tree-sitter input-side chunker (per-file text blobs, tokenizer mismatch,
missing tsx/kotlin coverage).
**BREAKING**: oracle-edge provenance prefixes change from
`{pyright,typescript-language-server,gopls,rust-analyzer}@<ver>` to
`scip:<indexer>@<ver>`. The `+lsp-unconfirmed` demoted-reason suffix
becomes `+scip-unconfirmed`. `LSP_PROVENANCE_PREFIXES` is kept as a
deprecated alias for `SCIP_PROVENANCE_PREFIXES` for one release.
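
For consumers that filter on reason strings, the rename can be illustrated with a small classifier. This is a minimal sketch, assuming a reason string is exactly `<prefix>` plus an optional `+lsp-unconfirmed` / `+scip-unconfirmed` suffix; `classifyReason`, `Provenance`, and `LEGACY_LSP_TOOLS` are hypothetical names for illustration, not exports of `@opencodehub/scip-ingest`.

```typescript
// Hypothetical helper: only the prefix and suffix formats come from this PR.
const LEGACY_LSP_TOOLS: readonly string[] = [
  "pyright",
  "typescript-language-server",
  "gopls",
  "rust-analyzer",
];

type Provenance =
  | { kind: "scip"; indexer: string; version: string; unconfirmed: boolean }
  | { kind: "legacy-lsp"; tool: string; version: string; unconfirmed: boolean }
  | { kind: "other" };

function classifyReason(reason: string): Provenance {
  const scipUnconfirmed = reason.endsWith("+scip-unconfirmed");
  const lspUnconfirmed = reason.endsWith("+lsp-unconfirmed");
  // Strip the demoted-reason suffix (assumed to be the final token).
  const base = reason.replace(/\+(scip|lsp)-unconfirmed$/, "");

  const at = base.lastIndexOf("@");
  if (at <= 0) return { kind: "other" };
  const name = base.slice(0, at);
  const version = base.slice(at + 1);

  if (name.startsWith("scip:") && name.length > "scip:".length) {
    return {
      kind: "scip",
      indexer: name.slice("scip:".length),
      version,
      unconfirmed: scipUnconfirmed,
    };
  }
  if (LEGACY_LSP_TOOLS.indexOf(name) !== -1) {
    return { kind: "legacy-lsp", tool: name, version, unconfirmed: lspUnconfirmed };
  }
  return { kind: "other" };
}

console.log(classifyReason("scip:scip-typescript@0.4.0"));
console.log(classifyReason("gopls@<ver>+lsp-unconfirmed"));
```

Rows classified as `legacy-lsp` were produced before this change and would need a re-index to pick up `scip:` provenance; the sketch does not attempt to rewrite them, since indexer versions do not carry over from the LSP tools.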
## Numbers
- Net **−12,037 LOC** (−15,572 deletions, +3,535 additions) across 81
files.
- 1,414/1,414 tests pass across 14 packages.
- `pnpm run check` clean (lint + typecheck + tests + banned-strings).
- End-to-end smoke on the bundled `calcpkg.scip` fixture: `add()` tops
blast ranking with 9 backward callers — matches the reference POC
ranking.
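
To make the blast-ranking number concrete: the figure can be read as a count of distinct symbols that reach a target through call edges. The following is a minimal sketch under that assumption; `backwardReach` and the toy edge list are illustrative only and are neither the scip-ingest materializer nor the `calcpkg` fixture.

```typescript
type Edge = { caller: string; callee: string };

// Count distinct symbols that can transitively reach `target` via call edges.
function backwardReach(edges: Edge[], target: string): number {
  // Reverse adjacency: callee -> direct callers.
  const callersOf = new Map<string, string[]>();
  for (const { caller, callee } of edges) {
    const list = callersOf.get(callee) ?? [];
    list.push(caller);
    callersOf.set(callee, list);
  }

  // Depth-first walk over reversed edges; `seen` guards against cycles.
  const seen = new Set<string>();
  const stack: string[] = [target];
  while (stack.length > 0) {
    const node = stack.pop()!;
    for (const caller of callersOf.get(node) ?? []) {
      if (caller !== target && !seen.has(caller)) {
        seen.add(caller);
        stack.push(caller);
      }
    }
  }
  return seen.size;
}

// Toy graph (not the real fixture).
const edges: Edge[] = [
  { caller: "main", callee: "calc" },
  { caller: "calc", callee: "add" },
  { caller: "calc", callee: "sub" },
  { caller: "sub", callee: "add" },
  { caller: "test", callee: "add" },
];
console.log(backwardReach(edges, "add")); // 4
```

Ranking every symbol by this count puts widely depended-on leaf functions first, which is consistent with `add()` topping the list in the smoke run.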
## Architecture
```
source tree
  ├─ tree-sitter parse     (unchanged: scan / parse / structure /
  │                         accesses / cross-file / mro / complexity)
  └─ scip-index phase      (NEW)
       ├─ detectLanguages(repoRoot)
       ├─ runIndexer(lang, …)   → .opencodehub/scip/<lang>.scip
       ├─ parseScipIndex(buf)   → ScipIndex
       ├─ deriveIndex(index)    → {symbols, edges}
       └─ CodeRelation(confidence=1.0, reason="scip:<indexer>@<ver>")
            │
            ▼
  confidence-demote → summarize / mcp-confidence / analyze-auto-cap
```
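
The `deriveIndex` step is described as caller→callee derivation via enclosing-range attribution. A minimal sketch of that idea follows, assuming each definition carries a four-element `[startLine, startChar, endLine, endChar]` enclosing range and that enclosing ranges nest properly. The types and the `deriveEdges` name are hypothetical and do not mirror the actual `ScipIndex` shape.

```typescript
// [startLine, startChar, endLine, endChar]; assumed normalized to four elements.
type Range = [number, number, number, number];

interface Definition { symbol: string; enclosing: Range }
interface Reference { symbol: string; at: Range }
interface CallEdge { caller: string; callee: string }

// True when `inner` lies entirely within `outer`.
function contains(outer: Range, inner: Range): boolean {
  const startsBefore =
    outer[0] < inner[0] || (outer[0] === inner[0] && outer[1] <= inner[1]);
  const endsAfter =
    outer[2] > inner[2] || (outer[2] === inner[2] && outer[3] >= inner[3]);
  return startsBefore && endsAfter;
}

// Attribute each reference to the innermost definition whose range encloses it.
function deriveEdges(defs: Definition[], refs: Reference[]): CallEdge[] {
  const edges: CallEdge[] = [];
  for (const ref of refs) {
    let best: Definition | undefined;
    for (const def of defs) {
      if (!contains(def.enclosing, ref.at)) continue;
      // Prefer the candidate nested inside the current best.
      if (best === undefined || contains(best.enclosing, def.enclosing)) {
        best = def;
      }
    }
    if (best !== undefined) {
      edges.push({ caller: best.symbol, callee: ref.symbol });
    }
  }
  return edges;
}

const defs: Definition[] = [
  { symbol: "main", enclosing: [0, 0, 10, 1] },
  { symbol: "helper", enclosing: [2, 2, 5, 3] }, // nested inside main
];
const refs: Reference[] = [
  { symbol: "add", at: [3, 4, 3, 7] },   // inside helper
  { symbol: "sub", at: [7, 2, 7, 5] },   // inside main only
  { symbol: "log", at: [20, 0, 20, 3] }, // top level, no enclosing definition
];
console.log(deriveEdges(defs, refs));
```

The nested loop is quadratic; a real implementation would more likely sort definitions per document or use an interval index, but the attribution rule stays the same.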
## Follow-ups (non-blocking)
Captured in `.erpaval/sessions/session-f8a300bc/lessons.yaml`:
- Regenerate gym corpus baselines (`packages/gym/corpus/**/*.yaml`,
`baselines/manifest.jsonl`) against SCIP indexers.
- Retarget `.github/workflows/gym.yml` from
gopls/pyright/tsserver/rust-analyzer caches to scip-* caches.
- Supersede `docs/adr/0003-ci-toolchain-pins.md` with a SCIP-indexer pin
ADR.
- Extend the scip-ingest parser to decode
`SymbolInformation.relationships` so IMPLEMENTS/EXTENDS edges are
first-class (today the minimal reader best-efforts them).
## Test plan
- [ ] Manual: `pnpm install && pnpm run check` on a clean clone.
- [ ] Manual: `pnpm --filter @opencodehub/scip-ingest test` against the
bundled `calcpkg.scip` fixture (6 tests, covers parse / derive /
materialize).
- [ ] Manual: `codehub analyze` on a TS monorepo (requires
`scip-typescript` on PATH). Expect `scip-index` phase to emit edges with
`reason: scip:scip-typescript@<ver>`.
- [ ] Manual: `codehub pack` on any repo — produces
`.codehub/pack/repo.xml`.
- [ ] Manual: MCP `pack_codebase` tool via `codehub mcp` stdio.
- [ ] CI: full `pnpm -r test` matrix green (gym.yml changes deferred per
follow-ups — existing LSP-matrix workflow may break on this branch; that
is expected and captured in the follow-up list).
## Risk
- **Gym corpus baselines** were captured against the old LSP clients.
They still load and still drive the new SCIP replay harness, but
per-case expected result sets need regeneration. Failures on the
`gym.yml` workflow today are content-level, not structural. Mitigated by
not running gym replay in required CI for this PR.
- **scip-java / rust-analyzer execute build scripts** during indexing.
Gated behind `CODEHUB_ALLOW_BUILD_SCRIPTS=1` on untrusted workspaces;
skip reason surfaces in the phase output otherwise.
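
The build-script gate can be pictured roughly as below. This is a sketch only: `gateIndexer`, `GateResult`, and the skip-reason wording are hypothetical, and the sketch gates unconditionally rather than modelling the trusted/untrusted workspace distinction. Only the `CODEHUB_ALLOW_BUILD_SCRIPTS=1` variable and the two affected indexers come from this PR.

```typescript
type GateResult = { run: true } | { run: false; skipReason: string };

// Indexers that this PR notes execute build scripts during indexing.
const BUILD_SCRIPT_INDEXERS = new Set<string>(["scip-java", "rust-analyzer"]);

// `env` is passed in explicitly so the check is easy to test.
function gateIndexer(
  indexer: string,
  env: Record<string, string | undefined>,
): GateResult {
  if (!BUILD_SCRIPT_INDEXERS.has(indexer)) return { run: true };
  if (env["CODEHUB_ALLOW_BUILD_SCRIPTS"] === "1") return { run: true };
  return {
    run: false,
    skipReason: `${indexer} executes build scripts; set CODEHUB_ALLOW_BUILD_SCRIPTS=1 to enable`,
  };
}

console.log(gateIndexer("rust-analyzer", {}));
console.log(gateIndexer("rust-analyzer", { CODEHUB_ALLOW_BUILD_SCRIPTS: "1" }));
```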