feat: build-time schema generation in Go (tree-sitter) #2782
Merged
tempusfrangit merged 12 commits into main on Feb 27, 2026
Implement pkg/schema/ package for build-time extraction of Cog predictor signatures via tree-sitter (smacker/go-tree-sitter, CGo). Replaces runtime Python-based schema generation with static analysis.
- pkg/schema/types.go: core type system (OrderedMap, PrimitiveType, FieldType, InputField, OutputType, TypeAnnotation, etc.)
- pkg/schema/errors.go: typed SchemaError with error kinds
- pkg/schema/python/parser.go: tree-sitter AST walker that extracts imports, module-scope constants, BaseModel subclasses, InputRegistry patterns, function signatures, Input() metadata, and type annotations
- pkg/schema/python/parser_test.go: 50 tests covering basic predictors, Input() constraints, choices (literal, module var, dict keys/values, concat), optional/list/iterator/BaseModel outputs, train mode, InputRegistry (attribute + method), default_factory hard error, and error cases
Port schema.rs to Go: GenerateOpenAPISchema() produces a complete OpenAPI spec from parsed predictor info. Includes input schema with constraints, choices/enums, output types (single/list/iterator/concat/object), fixed components (request/response/status/validation), and post-processing (removeTitleNextToRef, fixNullableAnyOf). Also adds OutputType.JSONType() to types.go and fixes pre-existing lint issues in parser.go and errors.go.
Generate() reads a predict ref (e.g. predict.py:Predictor), loads the source file, parses it, and produces OpenAPI JSON. Uses a Parser function type to avoid import cycle between schema and schema/python. COG_OPENAPI_SCHEMA env var allows bringing a pre-built schema file, skipping all parsing and generation.
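A rough sketch of the shape described above: the parser is passed in as a function value, so the schema package never imports schema/python, and the COG_OPENAPI_SCHEMA check short-circuits everything else. The PredictorInfo fields, the SplitRef helper, and the exact signatures are assumptions for illustration; only the names Generate, Parser, and COG_OPENAPI_SCHEMA come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// PredictorInfo is a trimmed, hypothetical stand-in for the parsed signature.
type PredictorInfo struct {
	Name   string   `json:"name"`
	Inputs []string `json:"inputs"`
}

// Parser is supplied by the caller, which avoids an import cycle between
// the package defining Generate and the package implementing the parser.
type Parser func(source []byte, name string) (*PredictorInfo, error)

// SplitRef (hypothetical helper) splits "predict.py:Predictor" into
// a file path and a class or function name.
func SplitRef(ref string) (file, name string, err error) {
	i := strings.LastIndex(ref, ":")
	if i <= 0 || i == len(ref)-1 {
		return "", "", fmt.Errorf("invalid predict ref %q", ref)
	}
	return ref[:i], ref[i+1:], nil
}

// Generate returns schema JSON for the given predict ref.
func Generate(ref string, parse Parser) ([]byte, error) {
	// A pre-built schema file bypasses parsing and generation entirely.
	if path := os.Getenv("COG_OPENAPI_SCHEMA"); path != "" {
		return os.ReadFile(path)
	}
	file, name, err := SplitRef(ref)
	if err != nil {
		return nil, err
	}
	src, err := os.ReadFile(file)
	if err != nil {
		return nil, err
	}
	info, err := parse(src, name)
	if err != nil {
		return nil, err
	}
	// Placeholder output; the real generator emits a full OpenAPI document.
	return json.Marshal(info)
}

func main() {
	dir, err := os.MkdirTemp("", "schema-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	file := filepath.Join(dir, "predict.py")
	if err := os.WriteFile(file, []byte("class Predictor: ...\n"), 0o644); err != nil {
		panic(err)
	}
	// A fake parser standing in for the tree-sitter implementation.
	fake := func(src []byte, name string) (*PredictorInfo, error) {
		return &PredictorInfo{Name: name, Inputs: []string{"prompt"}}, nil
	}
	out, err := Generate(file+":Predictor", fake)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```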
- Pre-build static schema gen (Go tree-sitter) for SDK >= 0.17.0
- Post-build legacy schema gen (container) for SDK < 0.17.0
- Add cogEnvVars() to Dockerfile: COG_PREDICT_TYPE_STUB, COG_TRAIN_TYPE_STUB, COG_MAX_CONCURRENCY
- Add isLegacySDKVersion() and conditional coglet install (SDK dep handles it when unpinned)
- Add GenerateCombined() and MergeSchemas() for predict+train schemas
- Export BaseVersionRe and add DetectLocalSDKVersion() to wheels package
- Update test expectations for new ENV lines and coglet install behavior
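The version gate above can be sketched as a simple prefix comparison. The regular expression and function body below are assumptions (the exported BaseVersionRe is not shown on this page), and treating an empty or unparseable version as non-legacy is a guess based on the "when unpinned" note:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// baseVersionRe is a hypothetical pattern capturing "major.minor" at the
// start of a version string such as "0.16.12" or "0.17.0".
var baseVersionRe = regexp.MustCompile(`^(\d+)\.(\d+)`)

// isLegacySDKVersion reports whether version is below 0.17.0.
// Pre-release suffixes (for example "0.17.0rc1") are ignored in this sketch.
func isLegacySDKVersion(version string) bool {
	m := baseVersionRe.FindStringSubmatch(version)
	if m == nil {
		// Unpinned or unparseable: assume a current SDK (an assumption).
		return false
	}
	major, _ := strconv.Atoi(m[1])
	minor, _ := strconv.Atoi(m[2])
	return major == 0 && minor < 17
}

func main() {
	for _, v := range []string{"0.16.12", "0.17.0", "1.2.3", ""} {
		fmt.Printf("%q legacy=%v\n", v, isLegacySDKVersion(v))
	}
}
```

With a check like this, the build can choose the pre-build static path or the post-build container path before any Docker work starts.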
- Replace handler.schema() with load_bundled_schema() reading .cog/openapi_schema.json
- Remove schema() from PredictHandler trait (coglet-core)
- Remove schema() impl from PythonPredictHandler (worker_bridge.rs)
- Remove schema() method from PythonPredictor (predictor.rs)
- Read COG_MAX_CONCURRENCY from env instead of importing cog.config
- Update SDK detection to check cog.BasePredictor instead of cog._adt
- CGO_ENABLED=1 globally (goreleaser, mise.toml, CI env)
- goreleaser overrides: zig cc for linux/amd64 and linux/arm64
- darwin targets use native clang (zig lacks macOS SDK stubs)
- CI: mlugg/setup-zig@v2 in build-cog and release-dry-run jobs
- legacy_sdk_schema IT: verifies SDK < 0.17.0 falls back to runtime schema gen
- static_schema_gen IT: verifies Go tree-sitter schema in Docker label end-to-end
- Parser tests: falsy defaults (False, 0, 0.0, ""), no-input predictor, async iterators, typing.List, typing.Union[X, None], list[Path], all-optional
- Support typing.Union[X, None] via comma-aware generic parsing
- Compact JSON for schema output (no wasted bytes in Docker labels)
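"Comma-aware" parsing matters because a naive split on "," breaks nested generics such as Union[Dict[str, int], None]. The sketch below shows the idea on plain strings; the actual parser works on a tree-sitter AST, and both function names here are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// splitTopLevel splits s on commas that are not nested inside brackets
// or parentheses, trimming whitespace around each part.
func splitTopLevel(s string) []string {
	var parts []string
	depth, start := 0, 0
	for i, r := range s {
		switch r {
		case '[', '(':
			depth++
		case ']', ')':
			depth--
		case ',':
			if depth == 0 {
				parts = append(parts, strings.TrimSpace(s[start:i]))
				start = i + 1
			}
		}
	}
	if tail := strings.TrimSpace(s[start:]); tail != "" {
		parts = append(parts, tail)
	}
	return parts
}

// unwrapOptional returns the inner type and true when the annotation is
// Union[X, None] or Union[None, X] (with or without the "typing." prefix).
func unwrapOptional(annotation string) (string, bool) {
	for _, prefix := range []string{"typing.Union[", "Union["} {
		if !strings.HasPrefix(annotation, prefix) || !strings.HasSuffix(annotation, "]") {
			continue
		}
		args := splitTopLevel(annotation[len(prefix) : len(annotation)-1])
		if len(args) == 2 && args[1] == "None" {
			return args[0], true
		}
		if len(args) == 2 && args[0] == "None" {
			return args[1], true
		}
	}
	return annotation, false
}

func main() {
	inner, optional := unwrapOptional("typing.Union[Dict[str, int], None]")
	fmt.Println(inner, optional) // prints: Dict[str, int] true
}
```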
Validates prediction and training inputs against the OpenAPI schema at the Rust HTTP layer before dispatching to the Python worker. This catches missing required fields and unknown fields early with pydantic-compatible error responses (422 with detail array).
- Add InputValidator with jsonschema validation, inlining, and additionalProperties enforcement
- Wire validators into PredictionService (separate Input/TrainingInput)
- Validate in create_prediction_with_id before slot acquisition
- Rewrite training routes with proper idempotency handling (ID mismatch check, existing state return) instead of delegating to prediction routes
- Add 7 unit tests for InputValidator, 2 for training idempotency
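The validator itself is written in Rust and is not shown on this page. To illustrate only the two checks named above (missing required fields and unknown fields), here is a Go sketch that builds a pydantic-style detail array. The loc/msg/type values are assumptions modeled on pydantic's usual error shape, not confirmed output from coglet:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// ValidationError mirrors one entry of a pydantic-style "detail" array.
type ValidationError struct {
	Loc  []string `json:"loc"`
	Msg  string   `json:"msg"`
	Type string   `json:"type"`
}

// validateInput reports required fields that are absent and fields that
// the schema does not declare (additionalProperties: false semantics).
func validateInput(required []string, known map[string]bool, input map[string]any) []ValidationError {
	var errs []ValidationError
	for _, name := range required {
		if _, ok := input[name]; !ok {
			errs = append(errs, ValidationError{
				Loc:  []string{"body", "input", name},
				Msg:  "Field required",
				Type: "missing",
			})
		}
	}
	var unknown []string
	for name := range input {
		if !known[name] {
			unknown = append(unknown, name)
		}
	}
	sort.Strings(unknown) // map iteration is random; sort for stable output
	for _, name := range unknown {
		errs = append(errs, ValidationError{
			Loc:  []string{"body", "input", name},
			Msg:  "Extra inputs are not permitted",
			Type: "extra_forbidden",
		})
	}
	return errs
}

func main() {
	errs := validateInput(
		[]string{"prompt"},
		map[string]bool{"prompt": true, "steps": true},
		map[string]any{"steps": 10, "seeed": 1},
	)
	body, err := json.Marshal(map[string]any{"detail": errs})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // body an HTTP handler would send with status 422
}
```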
The CLI sends all -i values as strings and relies on the schema to coerce
them to the correct types (integer, number, etc.). The static schema gen
emits allOf:[{$ref: ...}] for enum/choices fields, where the referenced
schema carries the concrete type but the wrapper does not. Without resolving
these wrappers, the CLI sends string '42' for integer fields and '3' for
integer enum choices, which the edge validator correctly rejects.
Also adds NewInputsForMode() to look up TrainingInput schema for train mode,
with fallback to Input for legacy schemas.
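A minimal sketch of the wrapper resolution described above, assuming the schema has been decoded into map[string]any and that refs point at #/components/schemas/<name>. The function names are hypothetical and the CLI's real coercion code is not shown on this page:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveType returns the property's JSON type, following
// allOf:[{$ref: ...}] wrappers into components when the wrapper
// itself carries no "type".
func resolveType(prop map[string]any, components map[string]any) string {
	if t, ok := prop["type"].(string); ok {
		return t
	}
	allOf, _ := prop["allOf"].([]any)
	for _, item := range allOf {
		m, _ := item.(map[string]any)
		ref, _ := m["$ref"].(string)
		name := strings.TrimPrefix(ref, "#/components/schemas/")
		if target, ok := components[name].(map[string]any); ok {
			if t, ok := target["type"].(string); ok {
				return t
			}
		}
	}
	return ""
}

// coerce converts a raw -i string into the Go value for the schema type.
func coerce(raw, typ string) (any, error) {
	switch typ {
	case "integer":
		n, err := strconv.ParseInt(raw, 10, 64)
		return n, err
	case "number":
		f, err := strconv.ParseFloat(raw, 64)
		return f, err
	case "boolean":
		b, err := strconv.ParseBool(raw)
		return b, err
	default:
		return raw, nil
	}
}

func main() {
	components := map[string]any{
		"steps": map[string]any{"type": "integer", "enum": []any{1, 3}},
	}
	prop := map[string]any{
		"allOf": []any{map[string]any{"$ref": "#/components/schemas/steps"}},
	}
	value, err := coerce("3", resolveType(prop, components))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%T %v\n", value, value) // prints: int64 3
}
```

Without the resolution step, resolveType would return an empty string for the wrapper and the value would be sent as the string "3", which is the rejection case described above.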
markphelps (Contributor) approved these changes on Feb 27, 2026:
looks good to me! my one suggestion is we add fuzz testing but can do that in a follow up PR
michaeldwan (Member) approved these changes on Feb 27, 2026:
looks good, much simpler than the embedded rust version. nits about extracting some logic out into smaller functions, but not blocking
// --- Pre-build static schema generation ---
// When using the static path, generate schema BEFORE the Docker build so we
// fail fast on schema errors and the schema file is in the build context.
var schemaJSON []byte
wish we could break some of this out to smaller functions. I can see this important looking code mangled by accident, or lingering forever because the agents are afraid to delete stuff or break backwards compatibility the humans already forgot about
Summary
Replaces runtime Python schema generation with a pure Go implementation using tree-sitter, eliminating the need for a separate Rust binary (cog-schema-gen) or container-based schema extraction for SDK >= 0.17.0.

This is the alternative approach to #2774 (which embedded a Rust binary via go:embed), motivated by reviewer feedback about embedding complexity.

What Changed

Go CLI: Static Schema Generation
- pkg/schema/python/parser.go: Tree-sitter AST walker that parses Python predict/train files into PredictorInfo. Handles Union[str, None], typing.List, Literal, enums, all default types, async iterators, etc.
- pkg/schema/openapi.go: Full OpenAPI 3.0.2 generation from PredictorInfo with deterministic key ordering and compact JSON (for Docker label efficiency)
- pkg/schema/generator.go: Public API: Generate(), GenerateFromSource(), GenerateCombined(), MergeSchemas(), plus COG_OPENAPI_SCHEMA env var bypass
- pkg/schema/types.go: Type system with Union handling, TitleCase helpers, field resolution
- pkg/image/build.go: Pre-build static schema gen (Go tree-sitter) for SDK >= 0.17.0, post-build legacy fallback (container-based) for SDK < 0.17.0

Coglet (Rust): Bundled Schema + Edge Validation
- schema() removed from PredictHandler trait; coglet now reads .cog/openapi_schema.json from disk at startup
- input_validation.rs: Validates prediction/training inputs against the OpenAPI schema at the HTTP edge before dispatching to Python. Returns 422 with pydantic-compatible error format
- COG_MAX_CONCURRENCY read from env var instead of Python introspection

Build Infrastructure
- go-tree-sitter (C bindings). Uses zig for Linux cross-compilation, native clang for macOS
- macos-14: Single ARM64 macOS runner cross-compiles all 4 targets
- cogEnvVars() injects COG_PREDICT_TYPE_STUB, COG_TRAIN_TYPE_STUB, COG_MAX_CONCURRENCY; coglet install is conditional

Tests
- static_schema_gen.txtar (end-to-end with Docker label verification) and legacy_sdk_schema.txtar (SDK < 0.17.0 fallback)

Design Decisions
- json.Marshal not MarshalIndent for Docker labels
- check_input() still runs in worker (removed in follow-up PR)

Follow-up Work (Stacked PR)
- _adt.py/_inspector.py elimination + full Rust input coercion