v3.2: Guidance on searching and evaluating schemas #4743

Open
wants to merge 7 commits into
base: v3.2-dev

Conversation

handrews
Member

@handrews handrews commented Jun 21, 2025

NOTE 1: This is intended to clarify requirements that already exist but have never been well-defined, both by making certain things required and stating clearly that other things are not. It is particularly relevant in light of the Encoding Object changes, although the vaguely-defined behavior predates the new features.

Some OAS features casually state that they depend on the type of data being examined, or are implicitly ambiguous about how the data should be parsed.

This section attempts to provide some guidance and limits, requiring only that implementations follow the unambiguous, statically deterministic keywords $ref and allOf.

It also provides for just validating the data (when possible) and using the actual in-memory type when a schema is too complex to analyze statically.
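As a rough illustration of what a search limited to `$ref` and `allOf` could look like, here is a minimal sketch. The function names `resolve_local_ref` and `search_schema` are hypothetical, only same-document references are handled, and percent-decoding of the URI fragment is skipped; none of this is taken from the spec text itself.

```python
from typing import Any, List, Optional, Set


def resolve_local_ref(document: Any, ref: str) -> Any:
    """Resolve a same-document reference such as '#/components/schemas/Pet'."""
    if ref != "#" and not ref.startswith("#/"):
        raise ValueError(f"only same-document JSON Pointer references are handled: {ref!r}")
    node = document
    for token in ref[1:].split("/")[1:]:
        # JSON Pointer unescaping: '~1' is '/', '~0' is '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def search_schema(document: Any, schema: Any, keyword: str,
                  _seen: Optional[Set[int]] = None) -> List[Any]:
    """Collect values of `keyword` reachable by following only $ref and allOf."""
    seen = set() if _seen is None else _seen
    found: List[Any] = []
    # Boolean schemas carry no keywords; already-visited schemas guard against cycles.
    if not isinstance(schema, dict) or id(schema) in seen:
        return found
    seen.add(id(schema))
    if keyword in schema:
        found.append(schema[keyword])
    if "$ref" in schema:
        target = resolve_local_ref(document, schema["$ref"])
        found += search_schema(document, target, keyword, seen)
    for subschema in schema.get("allOf", []):
        found += search_schema(document, subschema, keyword, seen)
    return found
```

The search deliberately ignores `anyOf`, `oneOf`, `if`, and similar keywords, since their outcome depends on the instance. A caller that gets an empty or conflicting result could fall back to validating the data and using its in-memory type.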

One use of this is breaking apart schemas to use them with mixed binary and JSON-compatible data, and a new section has been added to address that.

Finally, a typo in a related section was fixed.

  • schema changes are included in this pull request
  • schema changes are needed for this pull request but not done yet
  • no schema changes are needed for this pull request

@handrews handrews added this to the v3.2.0 milestone Jun 21, 2025
@handrews handrews requested a review from a team as a code owner June 21, 2025 01:12
@handrews handrews requested a review from a team as a code owner June 21, 2025 01:12
@handrews handrews added the media and encoding Issues regarding media type support and how to encode data (outside of query/path params) label Jun 21, 2025
@handrews
Member Author

@karenetheridge while I have your attention, do you think this is fine where it is or should it go under the Schema Object somewhere? I really could not decide.

@handrews handrews marked this pull request as draft June 22, 2025 03:57
@handrews
Member Author

handrews commented Jun 22, 2025

I'm putting this in draft because based on @karenetheridge's feedback I'm going to rework it fairly substantially, but it's still of use when understanding how it fits with the other related PRs.

The effect of the rewrite should be the same, but I think the wording and organization will be significantly different. It's clear that the different use cases here need to be separated out and clarified. I think this ended up being a bit oddly abstract because of how I tried to split things up into PRs that don't conflict.

Move things under the Schema Object, organize by use case and
by the point in the process at which things occur, and link
directly from more parts of the spec so that the parts in
the Schema Object section can stay more focused.
@handrews
Member Author

I have added a commit that almost totally rewrites this; you probably just want to review the whole thing and not look at the per-commit diff, as it will be a mess. The new version:

  • Puts most things under the Schema Object
  • Organizes use cases by the point in the process they occur relative to schema evaluation
  • Links from elsewhere in the spec so that we do not need to include quite as much in the main part of the text

I do not think that has changed anything substantial, but it's essentially a new PR now.

@handrews handrews marked this pull request as ready for review June 22, 2025 22:08
@handrews
Member Author

@karenetheridge I'm going to mark various threads as resolved since the text is now so different that they are confusing. Please do not take that to mean I'm dismissing open questions; just restart whatever is needed with comments on the new text, or as new top-level comments. Apologies for the inconvenience.

handrews and others added 2 commits June 27, 2025 15:06
Co-authored-by: Karen Etheridge <ether@cpan.org>
Also clarify that there is no one set list of keywords to search
for, but rather each use case defines what is relevant.
@handrews
Member Author

@karenetheridge I trimmed back the multi-valued type requirements as from our discussion I just see too many ways it can go wrong. Now it's just "if you have [X, "null"] treat it like X" and everything else is optional guidance. How does that sit with you?
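For readers following along, the trimmed-back rule might be sketched as below. The helper name `effective_type` is hypothetical, and accepting `"null"` in either position of the two-element array is an assumption rather than something stated in this thread.

```python
from typing import Any, Optional


def effective_type(type_value: Any) -> Optional[str]:
    """Return the single type to treat the data as, or None when undetermined."""
    if isinstance(type_value, str):
        return type_value
    if isinstance(type_value, list) and len(type_value) == 2:
        non_null = [t for t in type_value if t != "null"]
        # [X, "null"] (in either order, by assumption) is treated like X.
        if len(non_null) == 1:
            return non_null[0]
    # Anything else is left to optional, implementation-specific handling.
    return None
```

Returning `None` for every other multi-valued `type` keeps the required behavior narrow and leaves the rest as optional guidance.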

@handrews
Member Author

handrews commented Jul 9, 2025

@karenetheridge I'm marking various threads resolved as I think subsequent commits addressed them, and it's a lot of at least somewhat outdated discussion for folks to have to read through before tomorrow's call. Please feel free to re-raise anything that is still not addressed.

Member

@karenetheridge karenetheridge left a comment

one small edit (not a change introduced by you, but still an improvement I think).

@@ -2599,6 +2601,10 @@ Note that JSON Schema Draft 2020-12 does not require an `x-` prefix for extensio
The [`format` keyword (when using default format-annotation vocabulary)](https://www.ietf.org/archive/id/draft-bhutton-json-schema-validation-01.html#section-7.2.1) and the [`contentMediaType`, `contentEncoding`, and `contentSchema` keywords](https://www.ietf.org/archive/id/draft-bhutton-json-schema-validation-01.html#section-8.2) define constraints on the data, but are treated as annotations instead of being validated directly.
Extended validation is one way that these constraints MAY be enforced.

In addition to extended validation, annotations are the most effective way to determine whether these keywords impact the type and structure of the fully parsed data.
For example, formats such as `int64` can be applied to JSON strings, as JSON numbers have limitations that make large integers non-portable.
If annotation collection is not available, implementations MUST perform a [schema search](#searching-schemas) for these keywords, and MUST document the limitations this imposes.
Contributor

Suggested change
- If annotation collection is not available, implementations MUST perform a [schema search](#searching-schemas) for these keywords, and MUST document the limitations this imposes.
+ If annotation collection is not available, implementations MUST perform a [schema search](#searching-schemas) for these keywords, and SHOULD document the limitations this imposes.

Member

@karenetheridge karenetheridge Jul 10, 2025

(removed, commented in wrong section)

Contributor

@lornajane lornajane left a comment

Minor suggestions from the TSC call

For example, if `foo` had the schema `{"type": "string", "format": "int64"}`, the data structure used for validation would still be the same, but the application will need to convert the string `"42"` to the 64-bit integer `42`.
Similarly, the `content*` keywords can indicate further structure within a string.
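The `int64` case above could be handled along these lines once the `format` annotation is known, whether from annotation collection or from a schema search. This is only a sketch: the function name `parse_annotated_value` is hypothetical, and the strict decimal pattern and range check are assumptions about what an application might want.

```python
import re
from typing import Any

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1
# Assumed lexical form: optional minus sign, no leading zeros, decimal digits only.
_DECIMAL_INTEGER = re.compile(r"-?(0|[1-9][0-9]*)")


def parse_annotated_value(value: Any, schema: dict) -> Any:
    """Convert a validated string to an int when the schema says format: int64."""
    if (
        isinstance(value, str)
        and schema.get("type") == "string"
        and schema.get("format") == "int64"
    ):
        if not _DECIMAL_INTEGER.fullmatch(value):
            raise ValueError(f"not a decimal integer: {value!r}")
        number = int(value)
        if not INT64_MIN <= number <= INT64_MAX:
            raise ValueError(f"out of range for int64: {value!r}")
        return number
    # No relevant annotation: the parsed data keeps its validated form.
    return value
```

Validation still sees the JSON string; only the application-facing value changes.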

Implementations MUST either use [annotation collection](#extended-validation-with-annotations) to gather this information, or perform a [schema search](#searching-schemas), and MUST document which approach it implements.
Contributor

Suggested change
- Implementations MUST either use [annotation collection](#extended-validation-with-annotations) to gather this information, or perform a [schema search](#searching-schemas), and MUST document which approach it implements.
+ Implementations MUST either use [annotation collection](#extended-validation-with-annotations) to gather this information, or perform a [schema search](#searching-schemas), and SHOULD document which approach it implements.

Member

As discussed in the meeting, if implementations don't do this, what would they do instead? If there isn't anything they can do, then I think the MUST would stand.

Member Author

I really did not expect this PR to get hung up on a debate about how much to require implementations to document their behavior. Which I thought would be thoroughly non-controversial. Why would we not want them to do so?

So... I have no idea. I want everyone else to resolve their differences around documentation requirements so it doesn't hang up this PR, that's my opinion on the matter.


Implementations MUST document which strategy or strategies they use, as well as any known limitations.

##### Searching Schemas
Contributor

Question about moving this section a little further up the document, who has thoughts?

1. Use a placeholder value, on the assumption that no assertions will apply to the binary data and no conditional schema keywords will cause the schema to treat the placeholder value differently (e.g. a part that could be either plain text or binary might behave unexpectedly if a string is used as a binary placeholder, as it would likely be treated as plain text and subject to different subschemas and keywords).
2. Perform [schema searches](#searching-schemas) to find the appropriate keywords (`properties`, `prefixItems`, etc.) in order to break up the subschemas and apply them separately to binary and JSON-compatible data.
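The second strategy might look roughly like the following sketch, which pairs each top-level property value with the subschemas found for it and separates binary values from JSON-compatible ones. The function name `split_by_binary` is hypothetical, binary data is assumed to arrive as `bytes` or `bytearray`, and only `properties` reachable directly or through `allOf` is searched (no `$ref` resolution or `prefixItems` handling).

```python
from typing import Any, Dict, List, Tuple

Part = Dict[str, Tuple[Any, List[Any]]]


def split_by_binary(schema: Any, instance: Dict[str, Any]) -> Tuple[Part, Part]:
    """Pair each property value with its subschemas, split by binary vs JSON-compatible."""
    subschemas: Dict[str, List[Any]] = {}
    pending = [schema]
    while pending:
        current = pending.pop()
        if not isinstance(current, dict):
            continue  # boolean schemas have no properties to search
        for name, subschema in current.get("properties", {}).items():
            subschemas.setdefault(name, []).append(subschema)
        pending.extend(current.get("allOf", []))

    json_part: Part = {}
    binary_part: Part = {}
    for name, value in instance.items():
        target = binary_part if isinstance(value, (bytes, bytearray)) else json_part
        target[name] = (value, subschemas.get(name, []))
    return json_part, binary_part
```

The JSON-compatible part can then be validated normally, while the binary part is checked only against keywords that make sense for it, such as `contentMediaType`.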

Implementations MUST document which strategy or strategies they use, as well as any known limitations.
Contributor

Suggested change
- Implementations MUST document which strategy or strategies they use, as well as any known limitations.
+ Implementations SHOULD document which strategy or strategies they use, as well as any known limitations.

Co-authored-by: Karen Etheridge <ether@cpan.org>
@handrews
Member Author

@lornajane @karenetheridge @duncanbeevers Can y'all sort out what we should be doing on documentation requirements and why? I have no idea why MUST requirements around documenting behavior are controversial, but all I really care about is that this does not hang up this PR. It sounds like @karenetheridge is disagreeing on one? I just want a broadly applicable rule that tells me what to do here.

Labels
media and encoding Issues regarding media type support and how to encode data (outside of query/path params) schema-object
3 participants