feat: build-time schema generation in Go (tree-sitter) #2782

Merged
tempusfrangit merged 12 commits into main from feat/go-schema-gen
Feb 27, 2026

@tempusfrangit

Summary

Replaces runtime Python schema generation with a pure Go implementation using tree-sitter, eliminating the need for a separate Rust binary (cog-schema-gen) or container-based schema extraction for SDK >= 0.17.0.
This is the alternative approach to #2774 (which embedded a Rust binary via go:embed), motivated by reviewer feedback about embedding complexity.

What Changed

Go CLI — Static Schema Generation

  • pkg/schema/python/parser.go — Tree-sitter AST walker that parses Python predict/train files into PredictorInfo. Handles Union[str, None], typing.List, Literal, enums, all default types, async iterators, etc.
  • pkg/schema/openapi.go — Full OpenAPI 3.0.2 generation from PredictorInfo with deterministic key ordering and compact JSON (for Docker label efficiency)
  • pkg/schema/generator.go — Public API: Generate(), GenerateFromSource(), GenerateCombined(), MergeSchemas(), plus COG_OPENAPI_SCHEMA env var bypass
  • pkg/schema/types.go — Type system with Union handling, TitleCase helpers, field resolution
  • pkg/image/build.go — Pre-build static schema gen (Go tree-sitter) for SDK >= 0.17.0, post-build legacy fallback (container-based) for SDK < 0.17.0

Coglet (Rust) — Bundled Schema + Edge Validation

  • Removed runtime schema generation: schema() removed from PredictHandler trait; coglet now reads .cog/openapi_schema.json from disk at startup
  • input_validation.rs — Validates prediction/training inputs against the OpenAPI schema at the HTTP edge before dispatching to Python. Returns 422 with pydantic-compatible error format
  • Training route idempotency — Training routes now have proper ID mismatch checks and existing-state returns (not just delegating to prediction routes)
  • COG_MAX_CONCURRENCY read from env var instead of Python introspection

Build Infrastructure

  • CGo enabled globally — Required for go-tree-sitter (C bindings). Uses zig for Linux cross-compilation, native clang for macOS
  • Release builds on macos-14 — Single ARM64 macOS runner cross-compiles all 4 targets
  • Dockerfile changes: cogEnvVars() injects COG_PREDICT_TYPE_STUB, COG_TRAIN_TYPE_STUB, COG_MAX_CONCURRENCY; coglet install is conditional

Tests

  • 100+ Go unit tests across parser, OpenAPI generator, and schema generator
  • 9 input validation unit tests (Rust)
  • 2 integration tests: static_schema_gen.txtar (end-to-end with Docker label verification) and legacy_sdk_schema.txtar (SDK < 0.17.0 fallback)

Design Decisions

  • Go tree-sitter over Rust binary — Direct library call, no embed/exec/temp files. CGo is the only tradeoff.
  • Compact JSON: json.Marshal, not MarshalIndent, for Docker labels
  • Belt-and-suspenders validation — Rust validates at HTTP edge for fast rejection; Python check_input() still runs in worker (to be removed in the follow-up PR)
  • SDK version gating — Static path for >= 0.17.0, legacy container-based path for older SDKs
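The SDK version gate can be sketched as follows. This is an illustration only: `isLegacy` is a hypothetical helper (the PR's own function is `isLegacySDKVersion()`, whose body is not shown here), and the sketch assumes that an empty or unparseable version means "unpinned" and therefore takes the static path.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// baseVersion matches a leading "MAJOR.MINOR" and ignores anything after it,
// such as a patch number or a pre-release suffix.
var baseVersion = regexp.MustCompile(`^(\d+)\.(\d+)`)

// isLegacy reports whether version is older than 0.17.0.
// Assumption: an empty or unparseable version is treated as non-legacy.
func isLegacy(version string) bool {
	m := baseVersion.FindStringSubmatch(version)
	if m == nil {
		return false
	}
	major, err := strconv.Atoi(m[1])
	if err != nil {
		return false
	}
	minor, err := strconv.Atoi(m[2])
	if err != nil {
		return false
	}
	// The threshold is 0.17.0, so the patch component never matters.
	return major == 0 && minor < 17
}

func main() {
	for _, v := range []string{"0.16.12", "0.17.0", "1.2.3", ""} {
		if isLegacy(v) {
			fmt.Printf("%q: legacy container-based schema generation\n", v)
		} else {
			fmt.Printf("%q: static tree-sitter schema generation\n", v)
		}
	}
}
```

Note that this sketch treats a pre-release such as 0.17.0rc1 as non-legacy; whether the real check does the same is not stated in the PR.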

Follow-up Work (Stacked PR)

  • _adt.py/_inspector.py elimination + full Rust input coercion
  • Dead Python SDK code deletion

Implement pkg/schema/ package for build-time extraction of Cog predictor
signatures via tree-sitter (smacker/go-tree-sitter, CGo). Replaces runtime
Python-based schema generation with static analysis.

- pkg/schema/types.go: core type system (OrderedMap, PrimitiveType, FieldType,
  InputField, OutputType, TypeAnnotation, etc.)
- pkg/schema/errors.go: typed SchemaError with error kinds
- pkg/schema/python/parser.go: tree-sitter AST walker that extracts imports,
  module-scope constants, BaseModel subclasses, InputRegistry patterns,
  function signatures, Input() metadata, and type annotations
- pkg/schema/python/parser_test.go: 50 tests covering basic predictors,
  Input() constraints, choices (literal, module var, dict keys/values, concat),
  optional/list/iterator/BaseModel outputs, train mode, InputRegistry
  (attribute + method), default_factory hard error, and error cases
Port schema.rs to Go: GenerateOpenAPISchema() produces a complete
OpenAPI spec from parsed predictor info. Includes input schema with
constraints, choices/enums, output types (single/list/iterator/concat/
object), fixed components (request/response/status/validation), and
post-processing (removeTitleNextToRef, fixNullableAnyOf).

Also adds OutputType.JSONType() to types.go, fixes pre-existing lint
issues in parser.go and errors.go.
Generate() reads a predict ref (e.g. predict.py:Predictor), loads the
source file, parses it, and produces OpenAPI JSON. Uses a Parser function
type to avoid import cycle between schema and schema/python.

COG_OPENAPI_SCHEMA env var allows bringing a pre-built schema file,
skipping all parsing and generation.
- Pre-build static schema gen (Go tree-sitter) for SDK >= 0.17.0
- Post-build legacy schema gen (container) for SDK < 0.17.0
- Add cogEnvVars() to Dockerfile: COG_PREDICT_TYPE_STUB, COG_TRAIN_TYPE_STUB, COG_MAX_CONCURRENCY
- Add isLegacySDKVersion() and conditional coglet install (SDK dep handles it when unpinned)
- Add GenerateCombined() and MergeSchemas() for predict+train schemas
- Export BaseVersionRe and add DetectLocalSDKVersion() to wheels package
- Update test expectations for new ENV lines and coglet install behavior
- Replace handler.schema() with load_bundled_schema() reading .cog/openapi_schema.json
- Remove schema() from PredictHandler trait (coglet-core)
- Remove schema() impl from PythonPredictHandler (worker_bridge.rs)
- Remove schema() method from PythonPredictor (predictor.rs)
- Read COG_MAX_CONCURRENCY from env instead of importing cog.config
- Update SDK detection to check cog.BasePredictor instead of cog._adt
- CGO_ENABLED=1 globally (goreleaser, mise.toml, CI env)
- goreleaser overrides: zig cc for linux/amd64 and linux/arm64
- darwin targets use native clang (zig lacks macOS SDK stubs)
- CI: mlugg/setup-zig@v2 in build-cog and release-dry-run jobs
- legacy_sdk_schema IT: verifies SDK < 0.17.0 falls back to runtime schema gen
- static_schema_gen IT: verifies Go tree-sitter schema in Docker label end-to-end
- Parser tests: falsy defaults (False, 0, 0.0, ""), no-input predictor,
  async iterators, typing.List, typing.Union[X, None], list[Path], all-optional
- Support typing.Union[X, None] via comma-aware generic parsing
- Compact JSON for schema output (no wasted bytes in Docker labels)
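To show why typing.Union[X, None] needs comma-aware parsing, here is a string-level sketch. It is not the PR's parser, which walks the tree-sitter AST; `splitTopLevel` and `unwrapOptional` are hypothetical helpers, and the sketch does not handle commas inside string literals.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTopLevel splits generic arguments on commas that are not nested
// inside brackets, so "Dict[str, int], None" yields two parts, not three.
func splitTopLevel(s string) []string {
	var parts []string
	depth, start := 0, 0
	for i, r := range s {
		switch r {
		case '[', '(':
			depth++
		case ']', ')':
			depth--
		case ',':
			if depth == 0 {
				parts = append(parts, strings.TrimSpace(s[start:i]))
				start = i + 1
			}
		}
	}
	if tail := strings.TrimSpace(s[start:]); tail != "" {
		parts = append(parts, tail)
	}
	return parts
}

// unwrapOptional returns the inner type and true for Union[X, None]
// (in either argument order); otherwise it returns the annotation unchanged.
func unwrapOptional(annotation string) (string, bool) {
	annotation = strings.TrimSpace(annotation)
	for _, prefix := range []string{"typing.Union[", "Union["} {
		if strings.HasPrefix(annotation, prefix) && strings.HasSuffix(annotation, "]") {
			args := splitTopLevel(annotation[len(prefix) : len(annotation)-1])
			if len(args) == 2 && args[1] == "None" {
				return args[0], true
			}
			if len(args) == 2 && args[0] == "None" {
				return args[1], true
			}
		}
	}
	return annotation, false
}

func main() {
	for _, a := range []string{"typing.Union[Dict[str, int], None]", "Union[str, None]", "list[Path]"} {
		inner, optional := unwrapOptional(a)
		fmt.Printf("%s -> inner=%s optional=%v\n", a, inner, optional)
	}
}
```

A naive split on every comma would break `Dict[str, int]` into two arguments and misread the union as having three members.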
Validates prediction and training inputs against the OpenAPI schema at the
Rust HTTP layer before dispatching to the Python worker. This catches missing
required fields and unknown fields early with pydantic-compatible error
responses (422 with detail array).

- Add InputValidator with jsonschema validation, $ref inlining, and
  additionalProperties enforcement
- Wire validators into PredictionService (separate Input/TrainingInput)
- Validate in create_prediction_with_id before slot acquisition
- Rewrite training routes with proper idempotency handling (ID mismatch
  check, existing state return) instead of delegating to prediction routes
- Add 7 unit tests for InputValidator, 2 for training idempotency
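The edge check described above (missing required fields and unknown fields, reported as a 422 with a detail array) can be sketched in Go for illustration, although the real validator lives in the Rust layer and uses jsonschema. The `issue` shape, the `loc` prefix, and the message strings are assumptions made for this sketch, not the actual coglet output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// issue mirrors one entry of a pydantic-style "detail" array.
// The exact field values are assumptions for this sketch.
type issue struct {
	Loc  []string `json:"loc"`
	Msg  string   `json:"msg"`
	Type string   `json:"type"`
}

// validate reports required fields that are absent and input fields the
// schema does not declare (additionalProperties enforcement).
func validate(required []string, known map[string]bool, input map[string]any) []issue {
	var issues []issue
	for _, name := range required {
		if _, ok := input[name]; !ok {
			issues = append(issues, issue{
				Loc:  []string{"body", "input", name},
				Msg:  "Field required",
				Type: "missing",
			})
		}
	}
	var extra []string
	for name := range input {
		if !known[name] {
			extra = append(extra, name)
		}
	}
	sort.Strings(extra) // map iteration order is random; sort for stable output
	for _, name := range extra {
		issues = append(issues, issue{
			Loc:  []string{"body", "input", name},
			Msg:  "Unexpected field",
			Type: "additional_properties",
		})
	}
	return issues
}

func main() {
	known := map[string]bool{"prompt": true, "steps": true}
	input := map[string]any{"steps": 4, "seed": 1}
	if issues := validate([]string{"prompt"}, known, input); len(issues) > 0 {
		body, _ := json.Marshal(map[string]any{"detail": issues})
		fmt.Println(422, string(body))
	}
}
```

Running the check before slot acquisition means a malformed request never occupies a prediction slot or reaches the Python worker.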
@tempusfrangit tempusfrangit requested a review from a team as a code owner February 27, 2026 21:17
The CLI sends all -i values as strings and relies on the schema to coerce
them to the correct types (integer, number, etc.). The static schema gen
emits allOf:[{$ref: ...}] for enum/choices fields, where the referenced
schema carries the concrete type but the wrapper does not. Without resolving
these wrappers, the CLI sends string '42' for integer fields and '3' for
integer enum choices, which the edge validator correctly rejects.

Also adds NewInputsForMode() to look up TrainingInput schema for train mode,
with fallback to Input for legacy schemas.
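The wrapper resolution described above can be sketched like this. `resolveType` and `coerce` are hypothetical helpers, and the sketch assumes references of the form `#/components/schemas/<name>`; it is meant to show the idea, not the CLI's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveType returns the property's own type, or the type of the schema
// referenced through an allOf:[{$ref: ...}] wrapper, or "" if unknown.
func resolveType(prop map[string]any, schemas map[string]any) string {
	if t, ok := prop["type"].(string); ok {
		return t
	}
	allOf, _ := prop["allOf"].([]any)
	for _, item := range allOf {
		obj, ok := item.(map[string]any)
		if !ok {
			continue
		}
		ref, ok := obj["$ref"].(string)
		if !ok {
			continue
		}
		name := strings.TrimPrefix(ref, "#/components/schemas/")
		if target, ok := schemas[name].(map[string]any); ok {
			if t, ok := target["type"].(string); ok {
				return t
			}
		}
	}
	return ""
}

// coerce converts a raw -i string to the Go value matching the schema type.
func coerce(raw, typ string) (any, error) {
	switch typ {
	case "integer":
		v, err := strconv.ParseInt(raw, 10, 64)
		return v, err
	case "number":
		v, err := strconv.ParseFloat(raw, 64)
		return v, err
	case "boolean":
		v, err := strconv.ParseBool(raw)
		return v, err
	default:
		return raw, nil
	}
}

func main() {
	schemas := map[string]any{
		"steps": map[string]any{"type": "integer", "enum": []any{1, 3, 5}},
	}
	prop := map[string]any{
		"allOf": []any{map[string]any{"$ref": "#/components/schemas/steps"}},
	}
	typ := resolveType(prop, schemas)
	v, err := coerce("3", typ)
	fmt.Printf("type=%s value=%v (%T) err=%v\n", typ, v, v, err)
}
```

Without the `allOf` lookup, `resolveType` would return an empty string for the wrapper and the value would stay a string, which is the case the edge validator rejects.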

@markphelps markphelps left a comment

looks good to me! my one suggestion is we add fuzz testing but can do that in a follow up PR


@michaeldwan michaeldwan left a comment

looks good, much simpler than the embedded rust version. nits about extracting some logic out into smaller functions, but not blocking

// --- Pre-build static schema generation ---
// When using the static path, generate schema BEFORE the Docker build so we
// fail fast on schema errors and the schema file is in the build context.
var schemaJSON []byte

wish we could break some of this out to smaller functions. I can see this important looking code mangled by accident, or lingering forever because the agents are afraid to delete stuff or break backwards compatibility the humans already forgot about

@tempusfrangit tempusfrangit merged commit 484a30a into main Feb 27, 2026
37 checks passed
@tempusfrangit tempusfrangit deleted the feat/go-schema-gen branch February 27, 2026 22:15