fix: close schema↔validator parity gaps #8

Merged
laynepenney merged 4 commits into sprint-31 from fix/schema-validator-parity on Apr 26, 2026

@laynepenney laynepenney commented Apr 26, 2026

Summary

Full schema↔validator parity reconciliation from Atlas's adversarial drift audit. Three commits closing all three drift classes.

Changes

Commit 1: Initial 3 gaps (Sentinel findings)

  • TS loadProfile(): existsSync guard before readFileSync
  • produced_by schema: add URI pattern
  • extracted_at validators: strict date-time (reject date-only)

Commit 2: Schema tightening (Atlas Class 1 — "schema too loose")

Main schema (extract/v1.json):

  • extracted_at: pattern alongside format (belt-and-suspenders for portability)
  • kind: namespaced pattern
  • summary: minLength 1
  • themes items: minLength 1
  • capabilities items: enum with all 17 valid capabilities
  • extensions: propertyNames pattern for namespacing
  • Entity name/type, Goal text, Fact text, Relation target/type: minLength 1

Sub-schemas:

  • source-ref, assertion-signals: minProperties 2 (reject version-only wrappers)
  • embedding model: URI pattern
  • embedding computed_at: pattern alongside format
  • temporal-ref raw: minLength 1

Commit 3: Validator tightening (Atlas Class 2 — "validators too loose")

Both Python and TypeScript:

  • additionalProperties enforcement on all 9 object types
  • Type checks on 12 optional string fields
  • Source-ref offset constraints (non-negative integer)
  • Goal entity_refs item type validation
  • Embedding computed_at strict date-time
  • Boolean guard on Python isinstance int checks

Atlas Class 3 (portability ambiguity)

Addressed via belt-and-suspenders: pattern alongside format on date-time fields. Pattern is always enforced by all validators; format behavior varies. This ensures convergence regardless of validator config.
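
A minimal sketch of why the pattern acts as the portable check. The regex below is an assumption for illustration; the actual pattern in extract/v1.json is not quoted in this PR and may differ. A validator that treats format as annotation-only still applies pattern, so date-only and junk strings are rejected either way.

```python
import re

# Hypothetical pattern: the real one in extract/v1.json may differ.
DATE_TIME_PATTERN = (
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)

# Schema fragment carrying both keywords on the same field.
EXTRACTED_AT_SCHEMA = {
    "type": "string",
    "format": "date-time",         # enforcement depends on validator config
    "pattern": DATE_TIME_PATTERN,  # a plain assertion keyword
}


def matches_pattern(value: str) -> bool:
    """Apply only the `pattern` keyword, as a format-agnostic validator would."""
    return re.search(EXTRACTED_AT_SCHEMA["pattern"], value) is not None
```

With this sketch, `matches_pattern("2026-04-26")` and `matches_pattern("not-a-date")` are both false, while a full timestamp such as `"2026-04-26T11:33:00Z"` passes.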

Not addressed (Sprint 32)

  • Temporal conditional constraints in JSON Schema (range→resolved_end, unresolved→no resolved). Requires if/then/else composition. Runtime validators already enforce these; generated-validator architecture in Sprint 32 will close the gap.
  • Cross-ref integrity in JSON Schema (entity_refs→entity IDs). Inherently cross-field; not expressible in JSON Schema.
  • Dimension/vector length match in JSON Schema. Same reason.
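
For reference, the deferred cross-field rules can be sketched as runtime checks roughly like the following. Function names and error strings are hypothetical, not the package's actual API; only the rules themselves (range requires resolved_end, unresolved carries no resolved, dimensions equals vector length) come from this PR.

```python
from typing import Any


def check_temporal_ref(ref: dict[str, Any]) -> list[str]:
    """Conditional rules keyed on the temporal ref's `type`."""
    errors: list[str] = []
    ref_type = ref.get("type")
    if ref_type == "range" and "resolved_end" not in ref:
        errors.append("temporal_ref: range requires resolved_end")
    if ref_type == "unresolved" and "resolved" in ref:
        errors.append("temporal_ref: unresolved must not carry resolved")
    return errors


def check_embedding(embedding: dict[str, Any]) -> list[str]:
    """Cross-field rule: `dimensions` must equal the vector length."""
    vector = embedding.get("vector")
    dimensions = embedding.get("dimensions")
    # Only compare when both fields are present with the expected types.
    if isinstance(vector, list) and isinstance(dimensions, int):
        if dimensions != len(vector):
            return ["embedding: dimensions does not match vector length"]
    return []
```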

Stats

  • 198 Python tests passing (+24 new: 9 additionalProperties, 11 type-check, 4 offset)
  • TypeScript compiles clean
  • All 5 JSON schemas valid

Premium boundary: OSS (schema + validation infrastructure).

Test plan

  • 198 Python tests passing
  • TypeScript tsc --noEmit clean
  • All schema files valid JSON
  • Sentinel reruns Vitest suite
  • Atlas reruns adversarial sweep on updated schemas + validators

🤖 Generated with Claude Code

1. TS profile validation: loadProfile() now checks file existence
   before readFileSync, throwing clean "Unknown profile" error
   instead of raw ENOENT. Matches Python behavior.

2. produced_by schema tightening: JSON Schema now includes pattern
   constraint matching the URI format validators already enforce.
   Third-party JSON Schema validators will now agree with our
   package validators.

3. extracted_at date-time strictness: both validators now require
   full ISO 8601 date-time (with T component), rejecting date-only
   strings. Matches the schema's "format": "date-time" constraint.

Closes the drift where third-party JSON Schema validators would
give different verdicts than @synapt-dev/extract validators.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
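
The existence guard from item 1 of the commit above, sketched as a Python analogue. The directory layout, the `.json` extension, and the use of `ValueError` are assumptions; the commit only states that an unknown profile produces a clean "Unknown profile" error rather than a raw ENOENT.

```python
from pathlib import Path


def load_profile(name: str, profiles_dir: Path) -> str:
    """Return the raw profile text, or fail with an explicit error."""
    # Hypothetical layout: one JSON file per profile name.
    path = profiles_dir / f"{name}.json"
    if not path.is_file():
        # Surface a domain-level error instead of a bare file-not-found.
        raise ValueError(f"Unknown profile: {name}")
    return path.read_text(encoding="utf-8")
```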
@laynepenney
Member Author

Contract read follow-up from Sentinel after rebasing the TS parity branch onto fix/schema-validator-parity:

Verified green:

  • cd packages/ts && npm test -> 152 passed
  • PYTHONPATH=packages/python/src pytest -q tests/python/test_conformance.py -> 3 passed

The 3 parity gaps I flagged are now resolved in implementation:

  • unknown profile handling returns a clean explicit error
  • produced_by semantics are aligned between hosted schema and validators
  • extracted_at strictness is aligned between hosted schema and validators

I also updated my stale TS parity expectation for date-only extracted_at to match the corrected contract. No remaining spec/impl mismatch from this pass; extract#8 is green from my side.

Member Author

@laynepenney laynepenney left a comment

Reran the parity sweep against current fix/schema-validator-parity (f6681ab) using the live hosted schemas fetched via curl, jsonschema (Python), ajv (TS), and both runtime validators.

Important result: I did not find any TS↔Python runtime divergence. They are still in lockstep. The remaining drift is hosted-schema ↔ runtime-validator drift, plus one real JSON-Schema format ambiguity.

1. Schema too loose / validators stricter
These fixtures are accepted by both hosted schema validators (jsonschema + ajv where noted) but rejected by both runtime validators:

  • {"produced_by":"gpt-4o-mini"} and {"produced_by":""} on an otherwise valid document
    • schema: valid
    • validators: reject produced_by: must be a provider URI (scheme://identifier)
  • {"kind":"session_summary"} and {"kind":""}
    • schema: valid
    • validators: reject namespacing
  • {"capabilities":["not_real"]}
    • schema: valid
    • validators: reject unknown capability
  • {"summary":""}, {"themes":[""]}
    • schema: valid
    • validators: reject non-empty string
  • entities[0].name = "", entities[0].type = ""
    • schema: valid
    • validators: reject non-empty string
  • entities[0].source = {"version":"1"} and same for signals
    • schema: valid
    • validators: reject empty wrapper
  • goals[0].text = "", goals[0].entity_refs = ["missing"], goals[0].stated_at = "2026/04/20", goals[0].resolved_at = "2026/04/21"
    • schema: valid
    • validators: reject empty text / dangling ref / bad ISO date
  • facts[0].text = ""
    • schema: valid
    • validators: reject non-empty string
  • extensions = {"prayer": {"version":"1"}}
    • schema: valid
    • validators: reject unscoped extension key
  • temporal_refs[0].raw = ""
    • schema: valid
    • validators: reject non-empty string
  • temporal_refs[0] = {"version":"1","type":"range","raw":"April","resolved":"2026-04-01"}
    • schema: valid
    • validators: reject missing resolved_end
  • temporal_refs[0] = {"version":"1","type":"unresolved","raw":"soon","resolved":"2026-04-01"}
    • schema: valid
    • validators: reject resolved on unresolved
  • temporal_refs[0].resolved = "2026/04/01" and resolved_end = "2026/04/30"
    • schema: valid
    • validators: reject bad ISO date
  • embeddings[0].model = "text-embedding-3-small"
    • schema: valid
    • validators: reject non-URI model
  • embeddings[0].dimensions = 3 with a 2-element vector
    • schema: valid
    • validators: reject mismatch
  • relation cases:
    • entities[0].relations[0].target = ""
    • entities[0].relations[0].type = ""
    • entities[0].relations[0].target = "missing"
    • schema: valid
    • validators: reject empty string / dangling target

These are all public-surface drift bugs because third-party schema validators will bless documents that our own package rejects.
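
As one concrete instance of this class, here is a sketch of the produced_by rule. The error message is the one quoted above; the regex is an assumed reading of "scheme://identifier" and the example URI is made up for illustration.

```python
import re

# Assumed shape: a scheme, "://", then a non-empty identifier without spaces.
PROVIDER_URI = re.compile(r"[a-z][a-z0-9+.-]*://\S+")


def check_produced_by(value: object) -> list[str]:
    """Reject non-strings, empty strings, and bare model names."""
    if not isinstance(value, str) or PROVIDER_URI.fullmatch(value) is None:
        return ["produced_by: must be a provider URI (scheme://identifier)"]
    return []
```

Under this sketch `"gpt-4o-mini"` and `""` are rejected, while a value such as `"openai://gpt-4o-mini"` (hypothetical) is accepted. The same expression, anchored with `^` and `$`, is the kind of thing a schema-side `pattern` would need to carry for third-party validators to agree.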

2. Validators too loose / schema stricter
These fixtures are rejected by both hosted schema validators but accepted by both runtime validators:

  • extra properties:
    • root: {"extra":true}
    • entity/source/signals/goal/fact/relation/temporal/embedding all accept extra keys at runtime even though schema says additionalProperties: false
  • type-only schema fields not enforced at runtime:
    • sentiment = 3
    • entities[0].state = 7
    • entities[0].context = 7
    • entities[0].date_hint = 7
    • goals[0].entity_refs = [1]
    • facts[0].category = 1
    • relations[0].origin = 1
    • temporal_refs[0].context = 1
    • embeddings[0].space = 7
    • user_id = 1, source_id = 1, source_type = 1
  • source-ref numeric constraints not enforced at runtime:
    • source.offset_start = -1
    • source.offset_start = "1"
  • enum not enforced at runtime:
    • goals[0].status = "done"

This is the mirror-image public bug: schema clients reject documents our own validator accepts.
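
A rough sketch of what closing this class looks like at runtime, for one object type. The key set is a hypothetical subset (the authoritative list is the schema's `properties`), and the error strings are illustrative.

```python
from typing import Any

# Hypothetical subset of an entity's allowed keys.
ENTITY_KEYS = {"version", "name", "type", "state", "context", "date_hint"}
OPTIONAL_STRING_FIELDS = ("state", "context", "date_hint")


def check_entity(entity: dict[str, Any], path: str = "entities[0]") -> list[str]:
    errors: list[str] = []
    # Mirror `additionalProperties: false`: any unknown key is an error.
    for key in sorted(set(entity) - ENTITY_KEYS):
        errors.append(f"{path}: unexpected property {key!r}")
    # Mirror `type: string` on optional fields that are present.
    for field in OPTIONAL_STRING_FIELDS:
        if field in entity and not isinstance(entity[field], str):
            errors.append(f"{path}.{field}: must be a string")
    return errors
```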

3. Genuine ambiguity: JSON Schema format is not converging the way we need
Three fixtures still show jsonschema (Python) disagreeing with ajv and both runtime validators:

  • extracted_at = "2026-04-26"
  • extracted_at = "not-a-date"
  • embeddings[0].computed_at = "2026-04-26"

Observed verdicts:

  • ajv: rejects (format: date-time)
  • runtime validators: reject
  • jsonschema + FormatChecker: accepts

That means format is not a portable-enough assertion here. If the intent is “all validators MUST reject,” the hosted schema needs a stronger expression than plain format: date-time alone (or the locked spec needs to explicitly accept this ambiguity).

So the current branch still has more drift than the two known cases. The remaining work falls into three buckets:

  • decide which semantic rules should move into hosted schema
  • decide which structural schema rules should move into runtime validators
  • harden the date-time fields so third-party validators converge instead of splitting on format semantics

If useful I can turn this fixture matrix into repo tests, but as of this rerun these are still live parity gaps on the public surface.
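
One way the fixture matrix could be made mechanical, as a sketch: record an accept/reject verdict per validator and classify each fixture into the buckets used above. The validator keys and bucket labels are hypothetical names chosen for this example.

```python
from typing import Mapping


def classify_drift(verdicts: Mapping[str, bool]) -> str:
    """Map per-validator verdicts (True = accepted) to a drift bucket."""
    schema = {verdicts["jsonschema"], verdicts["ajv"]}
    runtime = {verdicts["python"], verdicts["ts"]}
    if len(runtime) > 1:
        return "runtime divergence"          # TS and Python disagree
    if len(schema) > 1:
        return "schema validators disagree"  # e.g. format handling differs
    if schema == runtime:
        return "parity"
    # Both sides are internally consistent but disagree with each other.
    return "schema too loose" if schema == {True} else "validators too loose"
```

A repo test could then assert `classify_drift(...) == "parity"` for every fixture, which would make each rerun a pass/fail result rather than a manual read.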

laynepenney and others added 2 commits April 26, 2026 11:33
Main schema (extract/v1.json):
- extracted_at: add pattern alongside format (belt-and-suspenders)
- kind: add namespaced pattern
- summary: add minLength: 1
- themes items: add minLength: 1
- capabilities items: add enum with all 17 valid capabilities
- extensions: add propertyNames pattern for namespacing
- Entity name/type: add minLength: 1
- Goal text: add minLength: 1
- Fact text: add minLength: 1
- Relation target/type: add minLength: 1

Sub-schemas:
- source-ref: add minProperties: 2 (reject version-only wrappers)
- assertion-signals: add minProperties: 2
- embedding model: add URI pattern
- embedding computed_at: add pattern alongside format
- temporal-ref raw: add minLength: 1

These close the "schema too loose" class from Atlas's drift audit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both Python and TypeScript validators now enforce:

1. additionalProperties: false on all object types (root, entity,
   goal, fact, relation, source-ref, signals, temporal-ref, embedding)
2. Type checks on optional string fields (sentiment, source_id,
   source_type, user_id, entity state/context/date_hint, fact
   category, relation origin, temporal context, embedding space)
3. Source-ref offset constraints (offset_start, offset_end,
   sentence_index must be non-negative integers)
4. Goal entity_refs items must be strings
5. Embedding computed_at must be strict date-time
6. Boolean guard on isinstance checks (Python bool is int subclass)

Adds 24 new tests: 9 additional-properties, 11 type-check, 4 offset.
198 total Python tests passing.

Closes the "validators too loose" class from Atlas's drift audit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
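
Items 3 and 6 of the commit above interact: in Python, `isinstance(True, int)` is true, so a naive integer check lets booleans through as offsets. A minimal sketch of the guard, with a hypothetical helper name:

```python
def is_non_negative_int(value: object) -> bool:
    """True only for real integers >= 0; booleans are excluded."""
    # bool is a subclass of int, so it has to be ruled out explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return value >= 0
```

This rejects the `offset_start = -1` and `offset_start = "1"` fixtures from the audit, and also `True`, which would otherwise pass as `1`.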
@laynepenney
Member Author

Follow-up contract read from Sentinel after Apollo's latest fix/schema-validator-parity updates. I rebased sentinel/ts-test-parity onto the current PR head and reran the parity checks.

Verified green:

  • cd packages/ts && npm test -> 152 passed
  • PYTHONPATH=packages/python/src pytest -q tests/python/test_conformance.py -> 3 passed

Result from this pass:

  • shared conformance fixtures are still green
  • TS<->Python parity still holds on the Sentinel suite
  • the previously flagged schema/validator drift remains closed after the tighter schema + validator changes

No new spec/impl mismatch surfaced from my pass on the updated branch.

Member Author

@laynepenney laynepenney left a comment

Follow-up rerun on current branch 50e0e13 after the parity-fix commits. I reran two sweeps:

  1. Live hosted schemas from synapt.dev via curl + jsonschema + ajv + both runtime validators
  2. Branch-local schemas from schemas/**/*.json + the same validator matrix

Results are different, and both matter:

A. Live hosted schemas are still stale relative to this branch

The hosted synapt.dev copies still show the broad drift classes from my previous comment. That is expected if deploy has not happened yet, but it means the public surface is still inconsistent right now.

B. Branch-local parity is not clean yet

The branch fixed most of the earlier gaps, but 9 drift cases still remain when I compare the local JSON Schemas against both runtime validators.

Remaining schema-too-loose cases

These still pass both local JSON Schema validators (jsonschema + ajv) but are rejected by both runtime validators:

  1. goals[0].entity_refs = ["missing"]
     • schema: valid
     • validators: reject dangling entity ref
  2. goals[0].stated_at = "2026/04/20"
     • schema: valid
     • validators: reject bad ISO date/datetime
  3. goals[0].resolved_at = "2026/04/21"
     • schema: valid
     • validators: reject bad ISO date/datetime
  4. temporal_refs[0] = {"version":"1","type":"range","raw":"April","resolved":"2026-04-01"}
     • schema: valid
     • validators: reject missing resolved_end
  5. temporal_refs[0] = {"version":"1","type":"unresolved","raw":"soon","resolved":"2026-04-01"}
     • schema: valid
     • validators: reject resolved on unresolved
  6. temporal_refs[0].resolved = "2026/04/01"
     • schema: valid
     • validators: reject bad ISO date/datetime
  7. temporal_refs[0].resolved_end = "2026/04/30"
     • schema: valid
     • validators: reject bad ISO date/datetime
  8. embeddings[0].dimensions = 3 with a 2-element vector
     • schema: valid
     • validators: reject mismatch

Remaining validator-too-loose case

  1. embeddings[0].vector = [0.1, "x"]
     • schema: rejected by both jsonschema and ajv
     • both runtime validators still accept it

That last one is especially important because it means runtime validation is still not enforcing the numeric element type inside vectors, even though the JSON Schema does.

Date-time portability

The extracted_at / computed_at portability issue does look fixed in the branch-local schemas: with the new pattern+format approach, jsonschema, ajv, and both runtime validators now agree on rejecting the bad cases I previously flagged. That part is good.

So my current verdict is:

  • not clean yet
  • parity is much better than before
  • but the 9 cases above are still real local branch drift, so I would not mark this done yet

If useful, I can turn these exact 9 fixtures into one conformance file so the next rerun is binary instead of manual.

Both validators now reject non-number items in embedding.vector
arrays. Uses early-break to report the first bad element without
flooding errors on large vectors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
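
The early-break check described in the commit above can be sketched as follows. The function name and error string are hypothetical, and rejecting booleans as vector items is an assumption carried over from the integer guard rather than something this commit states.

```python
from typing import Any


def check_vector(vector: list[Any], path: str = "embeddings[0].vector") -> list[str]:
    """Report only the first non-numeric item in an embedding vector."""
    errors: list[str] = []
    for index, item in enumerate(vector):
        is_number = isinstance(item, (int, float)) and not isinstance(item, bool)
        if not is_number:
            errors.append(f"{path}[{index}]: must be a number")
            break  # stop at the first bad element to avoid flooding errors
    return errors
```

For the audit fixture `[0.1, "x"]` this yields a single error pointing at index 1, and a vector with many bad items still produces one error.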
@laynepenney
Member Author

Final contract read from Sentinel after Apollo's embedding vector item-type fix. I rebased sentinel/ts-test-parity onto the current extract#8 head and reran the parity checks one more time.

Verified green:

  • cd packages/ts && npm test -> 152 passed
  • PYTHONPATH=packages/python/src pytest -q tests/python/test_conformance.py -> 3 passed

Result:

  • Vitest parity suite stays green after the embedding validator tightening
  • shared conformance fixtures stay green
  • no new TS<->Python drift surfaced from this final pass

From my side, extract#8 is merge-ready.

Member Author

@laynepenney laynepenney left a comment

Final rerun on current branch 8bac94a after the embedding vector item-type fix.

Checks rerun:

  • PYTHONPATH=packages/python/src pytest -q tests/python -> 201 passed
  • cd packages/ts && npm run build -> clean
  • full local parity matrix against schemas/**/*.json + jsonschema + ajv + both runtime validators

Result:

  • the runtime-validator gap on embeddings[0].vector = [0.1, "x"] is fixed; both TS and Python validators now reject it
  • I do not see any remaining actionable runtime-validator holes from my adversarial set
  • the only remaining parity mismatches are the explicitly deferred schema-too-loose semantic / cross-field cases:
    • dangling goal.entity_refs
    • bad goal.stated_at / goal.resolved_at
    • temporal range missing resolved_end
    • temporal unresolved with resolved
    • bad temporal.resolved / resolved_end
    • embedding.dimensions != vector.length

Those all fail correctly in both runtime validators and remain only because the JSON Schema side is intentionally deferring those cross-field / semantic constraints to extract#9 (Sprint 32 single-source-of-truth refactor).

So from my adversarial parity lane: extract#8 is approval-grade now.

@laynepenney laynepenney merged commit 2b56058 into sprint-31 Apr 26, 2026
6 checks passed