fix: close schema↔validator parity gaps by laynepenney · Pull Request #8 · synapt-dev/extract

laynepenney · 2026-04-26T16:13:06Z

Summary

Full schema↔validator parity reconciliation from Atlas's adversarial drift audit. Three commits closing all three drift classes.

Changes

Commit 1: Initial 3 gaps (Sentinel findings)

TS loadProfile(): existsSync guard before readFileSync
produced_by schema: add URI pattern
extracted_at validators: strict date-time (reject date-only)

Commit 2: Schema tightening (Atlas Class 1 — "schema too loose")

Main schema (extract/v1.json):

extracted_at: pattern alongside format (belt-and-suspenders for portability)
kind: namespaced pattern
summary: minLength 1
themes items: minLength 1
capabilities items: enum with all 17 valid capabilities
extensions: propertyNames pattern for namespacing
Entity name/type, Goal text, Fact text, Relation target/type: minLength 1

Sub-schemas:

source-ref, assertion-signals: minProperties 2 (reject version-only wrappers)
embedding model: URI pattern
embedding computed_at: pattern alongside format
temporal-ref raw: minLength 1

Commit 3: Validator tightening (Atlas Class 2 — "validators too loose")

Both Python and TypeScript:

additionalProperties enforcement on all 9 object types
Type checks on 12 optional string fields
Source-ref offset constraints (non-negative integer)
Goal entity_refs item type validation
Embedding computed_at strict date-time
Boolean guard on Python isinstance int checks
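
For reviewers unfamiliar with the Python bool/int quirk behind the last item, here is a minimal sketch of a non-negative-integer check with a boolean guard. The helper names and error wording are illustrative assumptions, not the code in this PR; only the three field names come from the commit.

```python
def is_non_negative_int(value: object) -> bool:
    """Return True only for real integers >= 0, excluding booleans."""
    # bool is a subclass of int in Python, so isinstance(True, int) is True.
    # The explicit bool check keeps True/False from passing as 1/0.
    if isinstance(value, bool):
        return False
    return isinstance(value, int) and value >= 0


def check_offsets(source_ref: dict) -> list:
    """Collect errors for the source-ref offset fields (hypothetical helper)."""
    errors = []
    for field in ("offset_start", "offset_end", "sentence_index"):
        if field in source_ref and not is_non_negative_int(source_ref[field]):
            errors.append(f"{field}: must be a non-negative integer")
    return errors
```

Without the guard, a document carrying offset_start: true would pass an isinstance(value, int) check even though the JSON Schema integer type rejects booleans.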

Atlas Class 3 (portability ambiguity)

Addressed via belt-and-suspenders: pattern alongside format on date-time fields. Pattern is always enforced by all validators; format behavior varies. This ensures convergence regardless of validator config.
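
A rough sketch of the pattern-alongside-format idea, using only the standard library. The regex below is an assumption chosen for illustration and is not necessarily the pattern shipped in extract/v1.json; it checks shape only (a T separator and a time zone), not calendar validity.

```python
import re

# Hypothetical schema fragment: `format` stays as the semantic hint, while
# `pattern` carries the part every validator is required to assert.
EXTRACTED_AT = {
    "type": "string",
    "format": "date-time",
    "pattern": r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$",
}


def passes_pattern(schema: dict, value: str) -> bool:
    """Apply only the `pattern` keyword, ignoring `format` entirely."""
    return re.search(schema["pattern"], value) is not None
```

With this fragment, "2026-04-26T16:13:06Z" passes while "2026-04-26" and "not-a-date" fail, whether or not a given validator treats format as an assertion.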

Not addressed (Sprint 32)

Temporal conditional constraints in JSON Schema (range→resolved_end, unresolved→no resolved). Requires if/then/else composition. Runtime validators already enforce these; generated-validator architecture in Sprint 32 will close the gap.
Cross-ref integrity in JSON Schema (entity_refs→entity IDs). Inherently cross-field; not expressible in JSON Schema.
Dimension/vector length match in JSON Schema. Same reason.
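
For context on the first deferred item, the temporal conditional rules that the runtime validators enforce can be sketched as below. This is an illustrative assumption about their shape, not the package code; the function name and error strings are hypothetical, and only the field names (type, resolved, resolved_end) come from the schema discussion.

```python
def check_temporal_ref(ref: dict) -> list:
    """Check the two conditional rules named above (hypothetical helper)."""
    errors = []
    kind = ref.get("type")
    # range -> resolved_end must be present
    if kind == "range" and "resolved_end" not in ref:
        errors.append("temporal_ref: range requires resolved_end")
    # unresolved -> resolved must be absent
    if kind == "unresolved" and "resolved" in ref:
        errors.append("temporal_ref: unresolved must not carry resolved")
    return errors
```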

Stats

198 Python tests passing (+24 new: 9 additionalProperties, 11 type-check, 4 offset)
TypeScript compiles clean
All 5 JSON schemas valid

Premium boundary: OSS (schema + validation infrastructure).

Test plan

198 Python tests passing
TypeScript tsc --noEmit clean
All schema files valid JSON
Sentinel reruns Vitest suite
Atlas reruns adversarial sweep on updated schemas + validators

🤖 Generated with Claude Code

1. TS profile validation: loadProfile() now checks file existence before readFileSync, throwing clean "Unknown profile" error instead of raw ENOENT. Matches Python behavior.
2. produced_by schema tightening: JSON Schema now includes pattern constraint matching the URI format validators already enforce. Third-party JSON Schema validators will now agree with our package validators.
3. extracted_at date-time strictness: both validators now require full ISO 8601 date-time (with T component), rejecting date-only strings. Matches the schema's "format": "date-time" constraint.

Closes the drift where third-party JSON Schema validators would give different verdicts than @synapt-dev/extract validators.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
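
As a point of comparison for item 1, an existence guard of this kind might look like the following on the Python side. This is a sketch under assumptions: the function signature, the one-JSON-file-per-profile layout, and the exception type are hypothetical and may differ from the package.

```python
import json
from pathlib import Path


def load_profile(name: str, profile_dir: Path) -> dict:
    """Load a profile by name, failing with a clean error if it is unknown."""
    # Assumed layout: one JSON file per profile inside profile_dir.
    path = profile_dir / f"{name}.json"
    # Check existence first so callers see a domain error, not a raw OS error.
    if not path.is_file():
        raise ValueError(f"Unknown profile: {name}")
    return json.loads(path.read_text(encoding="utf-8"))
```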

laynepenney · 2026-04-26T16:16:51Z

Contract read follow-up from Sentinel after rebasing the TS parity branch onto fix/schema-validator-parity:

Verified green:

cd packages/ts && npm test -> 152 passed
PYTHONPATH=packages/python/src pytest -q tests/python/test_conformance.py -> 3 passed

The 3 parity gaps I flagged are now resolved in implementation:

unknown profile handling returns a clean explicit error
produced_by semantics are aligned between hosted schema and validators
extracted_at strictness is aligned between hosted schema and validators

I also updated my stale TS parity expectation for date-only extracted_at to match the corrected contract. No remaining spec/impl mismatch from this pass; extract#8 is green from my side.

laynepenney

Reran the parity sweep against current fix/schema-validator-parity (f6681ab) using the live hosted schemas fetched via curl, jsonschema (Python), ajv (TS), and both runtime validators.

Important result: I did not find any TS↔Python runtime divergence. They are still in lockstep. The remaining drift is hosted-schema ↔ runtime-validator drift, plus one real JSON-Schema format ambiguity.

1. Schema too loose / validators stricter
These fixtures are accepted by both hosted schema validators (jsonschema + ajv where noted) but rejected by both runtime validators:

{"produced_by":"gpt-4o-mini"} and {"produced_by":""} on an otherwise valid document
- schema: valid
- validators: reject produced_by: must be a provider URI (scheme://identifier)
{"kind":"session_summary"} and {"kind":""}
- schema: valid
- validators: reject namespacing
{"capabilities":["not_real"]}
- schema: valid
- validators: reject unknown capability
{"summary":""}, {"themes":[""]}
- schema: valid
- validators: reject non-empty string
entities[0].name = "", entities[0].type = ""
- schema: valid
- validators: reject non-empty string
entities[0].source = {"version":"1"} and same for signals
- schema: valid
- validators: reject empty wrapper
goals[0].text = "", goals[0].entity_refs = ["missing"], goals[0].stated_at = "2026/04/20", goals[0].resolved_at = "2026/04/21"
- schema: valid
- validators: reject empty text / dangling ref / bad ISO date
facts[0].text = ""
- schema: valid
- validators: reject non-empty string
extensions = {"prayer": {"version":"1"}}
- schema: valid
- validators: reject unscoped extension key
temporal_refs[0].raw = ""
- schema: valid
- validators: reject non-empty string
temporal_refs[0] = {"version":"1","type":"range","raw":"April","resolved":"2026-04-01"}
- schema: valid
- validators: reject missing resolved_end
temporal_refs[0] = {"version":"1","type":"unresolved","raw":"soon","resolved":"2026-04-01"}
- schema: valid
- validators: reject resolved on unresolved
temporal_refs[0].resolved = "2026/04/01" and resolved_end = "2026/04/30"
- schema: valid
- validators: reject bad ISO date
embeddings[0].model = "text-embedding-3-small"
- schema: valid
- validators: reject non-URI model
embeddings[0].dimensions = 3 with a 2-element vector
- schema: valid
- validators: reject mismatch
relation cases: entities[0].relations[0].target = "", entities[0].relations[0].type = "", entities[0].relations[0].target = "missing"
- schema: valid
- validators: reject empty string / dangling target

These are all public-surface drift bugs because third-party schema validators will bless documents that our own package rejects.
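
To make the first fixture concrete, a provider-URI check in the spirit of the runtime validators could look like this. The regex is an assumed reading of "scheme://identifier" and the function name is hypothetical; only the error message text is taken from the validator output quoted above.

```python
import re

# Assumed shape: a lowercase scheme, "://", then a non-empty identifier.
PROVIDER_URI = re.compile(r"^[a-z][a-z0-9+.-]*://\S+$")


def check_produced_by(value: object) -> list:
    """Return an error list for produced_by (hypothetical helper)."""
    if not isinstance(value, str) or not PROVIDER_URI.match(value):
        return ["produced_by: must be a provider URI (scheme://identifier)"]
    return []
```

Expressing the same rule as a pattern on the schema side is what lets third-party validators reach the same verdict on "gpt-4o-mini" and "".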

2. Validators too loose / schema stricter
These fixtures are rejected by both hosted schema validators but accepted by both runtime validators:

extra properties:
- root: {"extra":true}
- entity/source/signals/goal/fact/relation/temporal/embedding all accept extra keys at runtime even though schema says additionalProperties: false
type-only schema fields not enforced at runtime:
- sentiment = 3
- entities[0].state = 7
- entities[0].context = 7
- entities[0].date_hint = 7
- goals[0].entity_refs = [1]
- facts[0].category = 1
- relations[0].origin = 1
- temporal_refs[0].context = 1
- embeddings[0].space = 7
- user_id = 1, source_id = 1, source_type = 1
source-ref numeric constraints not enforced at runtime:
- source.offset_start = -1
- source.offset_start = "1"
enum not enforced at runtime:
- goals[0].status = "done"

This is the mirror-image public bug: schema clients reject documents our own validator accepts.
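
One way the runtime side could mirror additionalProperties: false is an allow-list check per object type. The sketch below is an assumption about the approach; the helper name, error wording, and the example key set are illustrative and not copied from the schemas.

```python
def check_no_extra_keys(obj: dict, allowed: frozenset, path: str) -> list:
    """Report every key in obj that the allow-list does not name."""
    extra = sorted(set(obj) - allowed)
    return [f"{path}: unexpected property '{key}'" for key in extra]


# Illustrative allow-list only; the real source-ref schema may define more keys.
SOURCE_REF_KEYS = frozenset({"version", "offset_start", "offset_end", "sentence_index"})
```

Calling check_no_extra_keys(source, SOURCE_REF_KEYS, "source") on each nested object, with one allow-list per object type, would cover the extra-property fixtures listed above.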

3. Genuine ambiguity: JSON Schema format is not converging the way we need
Three fixtures still show jsonschema (Python) disagreeing with ajv and both runtime validators:

extracted_at = "2026-04-26"
extracted_at = "not-a-date"
embeddings[0].computed_at = "2026-04-26"

Observed verdicts:

ajv: rejects (format: date-time)
runtime validators: reject
jsonschema + FormatChecker: accepts

That means format is not a portable-enough assertion here. If the intent is “all validators MUST reject,” the hosted schema needs a stronger expression than plain format: date-time alone (or the locked spec needs to explicitly accept this ambiguity).

So the current branch still has more drift than the two known cases. The remaining work falls into three buckets:

decide which semantic rules should move into hosted schema
decide which structural schema rules should move into runtime validators
harden the date-time fields so third-party validators converge instead of splitting on format semantics

If useful I can turn this fixture matrix into repo tests, but as of this rerun these are still live parity gaps on the public surface.

Main schema (extract/v1.json):

- extracted_at: add pattern alongside format (belt-and-suspenders)
- kind: add namespaced pattern
- summary: add minLength: 1
- themes items: add minLength: 1
- capabilities items: add enum with all 17 valid capabilities
- extensions: add propertyNames pattern for namespacing
- Entity name/type: add minLength: 1
- Goal text: add minLength: 1
- Fact text: add minLength: 1
- Relation target/type: add minLength: 1

Sub-schemas:

- source-ref: add minProperties: 2 (reject version-only wrappers)
- assertion-signals: add minProperties: 2
- embedding model: add URI pattern
- embedding computed_at: add pattern alongside format
- temporal-ref raw: add minLength: 1

These close the "schema too loose" class from Atlas's drift audit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Both Python and TypeScript validators now enforce:

1. additionalProperties: false on all object types (root, entity, goal, fact, relation, source-ref, signals, temporal-ref, embedding)
2. Type checks on optional string fields (sentiment, source_id, source_type, user_id, entity state/context/date_hint, fact category, relation origin, temporal context, embedding space)
3. Source-ref offset constraints (offset_start, offset_end, sentence_index must be non-negative integers)
4. Goal entity_refs items must be strings
5. Embedding computed_at must be strict date-time
6. Boolean guard on isinstance checks (Python bool is int subclass)

Adds 24 new tests: 9 additional-properties, 11 type-check, 4 offset. 198 total Python tests passing.

Closes the "validators too loose" class from Atlas's drift audit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

laynepenney · 2026-04-26T16:39:52Z

Follow-up contract read from Sentinel after Apollo's latest fix/schema-validator-parity updates. I rebased sentinel/ts-test-parity onto the current PR head and reran the parity checks.

Verified green:

cd packages/ts && npm test -> 152 passed
PYTHONPATH=packages/python/src pytest -q tests/python/test_conformance.py -> 3 passed

Result from this pass:

shared conformance fixtures are still green
TS<->Python parity still holds on the Sentinel suite
the previously flagged schema/validator drift remains closed after the tighter schema + validator changes

No new spec/impl mismatch surfaced from my pass on the updated branch.

laynepenney

Follow-up rerun on current branch 50e0e13 after the parity-fix commits. I reran two sweeps:

Live hosted schemas from synapt.dev via curl + jsonschema + ajv + both runtime validators
Branch-local schemas from schemas/**/*.json + the same validator matrix

Results are different, and both matter:

A. Live hosted schemas are still stale relative to this branch

The hosted synapt.dev copies still show the broad drift classes from my previous comment. That is expected if deploy has not happened yet, but it means the public surface is still inconsistent right now.

B. Branch-local parity is not clean yet

The branch fixed most of the earlier gaps, but 9 drift cases still remain when I compare the local JSON Schemas against both runtime validators.

Remaining schema-too-loose cases

These still pass both local JSON Schema validators (jsonschema + ajv) but are rejected by both runtime validators:

goals[0].entity_refs = ["missing"]

schema: valid
validators: reject dangling entity ref

goals[0].stated_at = "2026/04/20"

schema: valid
validators: reject bad ISO date/datetime

goals[0].resolved_at = "2026/04/21"

schema: valid
validators: reject bad ISO date/datetime

temporal_refs[0] = {"version":"1","type":"range","raw":"April","resolved":"2026-04-01"}

schema: valid
validators: reject missing resolved_end

temporal_refs[0] = {"version":"1","type":"unresolved","raw":"soon","resolved":"2026-04-01"}

schema: valid
validators: reject resolved on unresolved

temporal_refs[0].resolved = "2026/04/01"

schema: valid
validators: reject bad ISO date/datetime

temporal_refs[0].resolved_end = "2026/04/30"

schema: valid
validators: reject bad ISO date/datetime

embeddings[0].dimensions = 3 with a 2-element vector

schema: valid
validators: reject mismatch

Remaining validator-too-loose case

embeddings[0].vector = [0.1, "x"]

schema: rejected by both jsonschema and ajv
both runtime validators still accept it

That last one is especially important because it means runtime validation is still not enforcing the numeric element type inside vectors, even though the JSON Schema does.

Date-time portability

The extracted_at / computed_at portability issue does look fixed in the branch-local schemas: with the new pattern+format approach, jsonschema, ajv, and both runtime validators now agree on rejecting the bad cases I previously flagged. That part is good.

So my current verdict is:

not clean yet
parity is much better than before
but the 9 cases above are still real local branch drift, so I would not mark this done yet

If useful, I can turn these exact 9 fixtures into one conformance file so the next rerun is binary instead of manual.
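
A possible shape for that conformance harness, sketched under assumptions: each validator (jsonschema, ajv via a subprocess, the two runtime validators) is wrapped as a callable that returns True on accept. The names find_drift and Validator are hypothetical, and the wrappers themselves are left out.

```python
from typing import Callable, Dict

# Each validator is modeled as a callable returning True when it accepts a document.
Validator = Callable[[dict], bool]


def find_drift(fixtures: Dict[str, dict], validators: Dict[str, Validator]) -> Dict[str, Dict[str, bool]]:
    """Return the fixtures on which the validators disagree, with each verdict."""
    drift = {}
    for name, document in fixtures.items():
        verdicts = {label: accepts(document) for label, accepts in validators.items()}
        if len(set(verdicts.values())) > 1:  # mixed accept/reject means drift
            drift[name] = verdicts
    return drift
```

An empty result means every validator agreed on every fixture, which turns the rerun into a single pass/fail assertion.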

Both validators now reject non-number items in embedding.vector arrays. Uses early-break to report the first bad element without flooding errors on large vectors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
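
The early-break behavior described in this commit message can be sketched as follows. This is an illustration rather than the committed code; the function name and message format are assumptions, and excluding booleans is assumed here to stay consistent with the boolean guard added earlier in this PR.

```python
def check_vector(vector: list) -> list:
    """Report only the first non-number element of an embedding vector."""
    errors = []
    for index, item in enumerate(vector):
        # bool is an int subclass in Python, so reject it explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            errors.append(f"vector[{index}]: must be a number")
            break  # stop at the first bad element to avoid flooding errors
    return errors
```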

laynepenney · 2026-04-26T16:50:07Z

Final contract read from Sentinel after Apollo's embedding vector item-type fix. I rebased sentinel/ts-test-parity onto the current extract#8 head and reran the parity checks one more time.

Verified green:

cd packages/ts && npm test -> 152 passed
PYTHONPATH=packages/python/src pytest -q tests/python/test_conformance.py -> 3 passed

Result:

Vitest parity suite stays green after the embedding validator tightening
shared conformance fixtures stay green
no new TS<->Python drift surfaced from this final pass

From my side, extract#8 is merge-ready.

laynepenney

Final rerun on current branch 8bac94a after the embedding vector item-type fix.

Checks rerun:

PYTHONPATH=packages/python/src pytest -q tests/python -> 201 passed
cd packages/ts && npm run build -> clean
full local parity matrix against schemas/**/*.json + jsonschema + ajv + both runtime validators

Result:

the runtime-validator gap on embeddings[0].vector = [0.1, "x"] is fixed; both TS and Python validators now reject it
I do not see any remaining actionable runtime-validator holes from my adversarial set
the only remaining parity mismatches are the explicitly deferred schema-too-loose semantic / cross-field cases:
- dangling goal.entity_refs
- bad goal.stated_at / goal.resolved_at
- temporal range missing resolved_end
- temporal unresolved with resolved
- bad temporal.resolved / resolved_end
- embedding.dimensions != vector.length

Those all fail correctly in both runtime validators and remain only because the JSON Schema side is intentionally deferring those cross-field / semantic constraints to extract#9 (Sprint 32 single-source-of-truth refactor).

So from my adversarial parity lane: extract#8 is approval-grade now.
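
For reference, two of the deferred cross-field checks listed above (dangling goal.entity_refs and embedding.dimensions versus vector length) might look like this at runtime. This is a sketch under assumptions: the entity identifier key is assumed to be "id", and the function name and error strings are hypothetical.

```python
def check_cross_fields(doc: dict) -> list:
    """Check references and lengths that span more than one field."""
    errors = []
    # Assumption: each entity exposes its identifier under the key "id".
    known_ids = {entity.get("id") for entity in doc.get("entities", [])}
    for gi, goal in enumerate(doc.get("goals", [])):
        for ref in goal.get("entity_refs", []):
            if ref not in known_ids:
                errors.append(f"goals[{gi}].entity_refs: unknown entity '{ref}'")
    for ei, emb in enumerate(doc.get("embeddings", [])):
        if "dimensions" in emb and emb["dimensions"] != len(emb.get("vector", [])):
            errors.append(f"embeddings[{ei}]: dimensions does not match vector length")
    return errors
```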

laynepenney and others added 2 commits April 26, 2026 11:33

fix: validate embedding vector item types

8bac94a


laynepenney merged commit 2b56058 into sprint-31 Apr 26, 2026
6 checks passed

laynepenney mentioned this pull request Apr 26, 2026

Sprint 31 ceremony (extract): IL v1 package #5

Merged

7 tasks

laynepenney merged 4 commits into sprint-31 from fix/schema-validator-parity
