Add JSON schema support (v2) #5029

Merged
merged 42 commits into from
Feb 21, 2023

Conversation

dmontagu
Contributor

@dmontagu dmontagu commented Feb 9, 2023

fix #4666

Things left to resolve:

  • URLs Created an issue for this: JSON schema: URL #5095

    • Should minLength/maxLength be set on AnyUrl? (Class was moved from pydantic to pydantic_core)
    • Are we eliminating stricturl? Is the same functionality ported to pydantic_core in some manner?
      • What is the migration path?
    • Tentative plan: If it's possible to achieve the same functionality, drop stricturl and add notes to the migration guide
  • Json type

    • Is it okay to change Json[int]'s schema from {'type': 'number'} to {'type': 'string', 'format': 'json-string'}?
    • Change has been made
  • Callable type Created an issue for this: JSON schema: Callable #5096

    • How should we handle JSON schema generation for fields with type callable? (See, e.g., tests.test_schema.test_callable_type)
    • Tentative plan: Not sure.
  • Decimal Relevant issues: Decimal JSON encoder is lossy #1511 and v2 JSON schema — parsing vs. serialization #5072

    • What should the Decimal schema be?
    • Tentative plan: Need to come to a consensus; Decimal JSON encoder is lossy #1511 may be relevant
      • Note: although serialization and parsing are different, I think for FastAPI/OpenAPI purposes it is important that the JSON schema be compatible with the serialization type. (Or else generated clients will produce type errors if an API returns a Decimal.) Happy to discuss this more.
  • __pydantic_modify_json_schema__

    • Do we retain backwards compatibility with the name __modify_schema__, or force migration to __pydantic_modify_json_schema__?
      • Plan: Backwards compatible with __modify_schema__; new syntax in __pydantic_json_schema__
      • Do signature introspection for passing kwargs
    • Do we support any args/kwargs for __pydantic_modify_json_schema__ besides the schema itself?
      • Do we pass any info to it from FieldInfo? The CoreSchema? A "source" object? (Could be useful when working with custom generics)
      • Tentative plan:
        • Because this isn't overriding anything, we can just do signature introspection for different sets of arguments
        • Need to decide how to handle this on enums and fields. I think it's a plain classmethod on enums with similar signature to what it is for models.
  • 'definitions' -> '$defs'

    • Should we change the key 'definitions' to '$defs' everywhere? This is in line with JSON schema 2020-12, but it would be nice to confirm this won't break any important tooling (e.g., openapi-generator).
      • Change has been made (see tests.test_schema.test_schema_with_refs; openapi doesn't actually use the definitions key anyway, and FastAPI will change out the $defs key as appropriate.)
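For readers who want to see what the 'definitions' -> '$defs' change means for an existing schema, here is a minimal sketch using plain dicts. The helper name `to_2020_12_defs` is made up for illustration and is not part of pydantic; it assumes refs use the `#/definitions/` prefix and that definitions live at the top level of the schema.

```python
from __future__ import annotations

from typing import Any

OLD_PREFIX = '#/definitions/'
NEW_PREFIX = '#/$defs/'


def _fix_refs(node: Any) -> Any:
    """Return a copy of `node` with every '$ref' string pointing at '#/$defs/'."""
    if isinstance(node, dict):
        fixed = {}
        for key, value in node.items():
            if key == '$ref' and isinstance(value, str) and value.startswith(OLD_PREFIX):
                fixed[key] = NEW_PREFIX + value[len(OLD_PREFIX):]
            else:
                fixed[key] = _fix_refs(value)
        return fixed
    if isinstance(node, list):
        return [_fix_refs(item) for item in node]
    return node


def to_2020_12_defs(schema: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: rename the top-level 'definitions' key to '$defs'."""
    converted = _fix_refs(schema)  # new dicts, the input is left untouched
    if 'definitions' in converted:
        converted['$defs'] = converted.pop('definitions')
    return converted


old_style = {
    'properties': {'pet': {'$ref': '#/definitions/Pet'}},
    'definitions': {'Pet': {'type': 'object'}},
}
new_style = to_2020_12_defs(old_style)
```

Only the top-level key is renamed, so a model field that happens to be called `definitions` inside `properties` is not affected.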

@dmontagu dmontagu changed the title JSON schema support Add JSON schema support (v2) Feb 9, 2023
@dmontagu dmontagu marked this pull request as ready for review February 11, 2023 20:36
@dmontagu
Contributor Author

@samuelcolvin there's still a lot of work to be done before the JSON schema work is "finished", but I think I am starting to add too much to reasonably review. I think it would be better to stop new work on this branch, and just get everything it adds reviewed and merged before continuing.

There are already enough decisions in here that may be controversial that I think it's probably a mistake to continue building on them before establishing more thorough agreement.

I will be happy to remove things from this branch (such as my possibly-half-baked implementation of handling discriminator), even if it means more tests fail / etc., so that we can merge whatever fraction of it seems acceptable for now. This will make it easier to keep up-to-date, and make reviewing future improvements easier as well.

@tiangolo
Member

tiangolo commented Feb 21, 2023

@dmontagu ah, very clever!

And good point that subclassing GenerateJsonSchema might work (from the comment in the other PR). Although it's true that doing it in FastAPI would not allow users to intuitively subclass it themselves.

I think this idea of including errors in the schema with a flag should work. My only fear is including two types of things in the same object (schema and errors), but at the same time, this would only happen when enabling the flag, so whoever does that (me) would have to know what they are doing and know what to expect, so I think that might be enough.

And this is probably the simplest solution: it doesn't affect anyone else or change the return type annotations much.


I was also thinking about an alternative implementation separating the errors and typing it with @overload, but thinking about it, your solution of including it in the schema would naturally preserve the exact spot of the error without needing any other tricks. So, the more I think about it, the more I get convinced your idea is better than what I was thinking originally. 🧠


Not sure if you would rather keep the conversation here in a central place, or there in the other PR to avoid my extra noise here, let me know if you would prefer to switch over there!

@samuelcolvin
Member

Field - example, const, regex

I agree we should move to remove/change them but provide backwards compatibility with a deprecation warning. I guess we should also allow **kwargs and raise a warning for that too, and add those **kwargs to the JSON Schema for backwards compatibility.

Config.schema_extra

I guess we should support it, with a warning

default values

I guess we should use the same logic as for fields with a type that can't be defined in JSON - e.g. Callable and IsSubClass - I don't really like UserWarning, but since we use it already I'm happy to keep it.

Update: @dmontagu I don't see this code, did we decide to remove it?

__pydantic_modify_json_schema__

I'm actually inclined here to change the behaviour and name to def __pydantic_json_schema__(cls, schema: Dict[str, Any], info: Info) -> Dict[str, Any], i.e. replace the slightly odd "modify" behaviour with a much more obvious "take the value, return the new value" signature that matches what we're doing with pydantic_core.

I know this is another change, however:

  • the method name has changed, we can continue to support the current behaviour with __modify_schema__ albeit with a warning
  • we can raise a warning or error if __pydantic_modify_json_schema__ "returns None" and hence avoid mistakes
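To make the two bullets above concrete, here is a rough sketch of how a caller could dispatch between the two hook styles. `apply_schema_hook` and the three example classes are hypothetical, the `info` argument from the proposed signature is left out to keep it short, and raising `TypeError` on a `None` return is just one of the "warning or error" options.

```python
from __future__ import annotations

import warnings
from typing import Any, Dict


def apply_schema_hook(cls: type, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical dispatcher: prefer the new hook, fall back to the old one."""
    schema = dict(schema)  # work on a copy so the caller's dict is not mutated
    new_hook = getattr(cls, '__pydantic_json_schema__', None)
    if new_hook is not None:
        result = new_hook(schema)
        if result is None:
            # Catches the common mistake of mutating in place and forgetting to return.
            raise TypeError(f'{cls.__name__}.__pydantic_json_schema__ must return the new schema')
        return result
    old_hook = getattr(cls, '__modify_schema__', None)
    if old_hook is not None:
        warnings.warn(
            '__modify_schema__ is deprecated, use __pydantic_json_schema__ instead',
            DeprecationWarning,
            stacklevel=2,
        )
        old_hook(schema)  # old style: mutates the schema in place, returns nothing
    return schema


class NewStyle:
    @classmethod
    def __pydantic_json_schema__(cls, schema: Dict[str, Any]) -> Dict[str, Any]:
        return {**schema, 'title': 'New'}


class OldStyle:
    @classmethod
    def __modify_schema__(cls, schema: Dict[str, Any]) -> None:
        schema['title'] = 'Old'


class Broken:
    @classmethod
    def __pydantic_json_schema__(cls, schema: Dict[str, Any]):
        schema['title'] = 'Oops'  # forgot to return the schema
```

The point of the return-value style is that the `Broken` case fails loudly instead of silently producing an unexpected schema.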

Decimal

My preference would be to add a config setting for JSON Schema generation, something like mode: Literal['validation', 'serialisation'] or 'input' | 'output' which indicates whether we're building a JSON schema for what's required for MyModel(**data) vs. my_model.model_json() etc.

Then the JSON Schema for validation should be number, but should be string for serialisation where we're going to default to returning a string as per #1511.
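As a sketch of the mode idea for Decimal only: the function name `decimal_json_schema` and the `JsonSchemaMode` alias are invented here, and the two return values simply restate the preference described above.

```python
from __future__ import annotations

from typing import Any, Dict, Literal

# Hypothetical alias; the spelling follows the suggestion in this comment.
JsonSchemaMode = Literal['validation', 'serialisation']


def decimal_json_schema(mode: JsonSchemaMode) -> Dict[str, Any]:
    """Return the JSON Schema for a Decimal field under the given mode."""
    if mode == 'validation':
        # What MyModel(**data) would accept.
        return {'type': 'number'}
    if mode == 'serialisation':
        # What the JSON output would contain if Decimal defaults to a string.
        return {'type': 'string'}
    raise ValueError(f'unknown mode: {mode!r}')
```

A real implementation would presumably thread the mode through the whole generator rather than a single function.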

@tiangolo errors feature request

My solution would be this:

  • always collect the errors/warnings on the instance of GenerateJsonSchema, use a method on GenerateJsonSchema to create the error/warning, and another method to construct the JSON Schema value to use in these cases (both so they can be customised fairly easily)
  • Change the default code to generate, then warn/raise an error if any errors exist; this would allow FastAPI to do something different

So

s = schema_generator(by_alias=by_alias, ref_template=ref_template).generate(cls.__pydantic_core_schema__)

Becomes (by default)

schema_generator = schema_generator_cls(by_alias=by_alias, ref_template=ref_template)
s = schema_generator.generate(cls.__pydantic_core_schema__)
if schema_generator.errors:
    raise PydanticInvalidForJsonSchema(...)

Then FastAPI can have its own logic to generate JSON Schema with very little duplication.
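A self-contained sketch of that flow, with toy stand-ins: `SketchGenerator`, `InvalidForJsonSchema`, `make_error`, `error_placeholder` and the dict-based "core schema" are all hypothetical and only mimic the shape of the proposal, not the real GenerateJsonSchema.

```python
from __future__ import annotations

from typing import Any, Dict, List

CoreSchema = Dict[str, Any]  # toy stand-in for a pydantic_core schema


class InvalidForJsonSchema(Exception):
    """Stand-in for PydanticInvalidForJsonSchema."""


class SketchGenerator:
    def __init__(self) -> None:
        self.errors: List[str] = []  # always collected on the instance

    def make_error(self, core_schema: CoreSchema) -> str:
        # Overridable: how an error is described.
        return f"cannot generate a JSON schema for core type {core_schema['type']!r}"

    def error_placeholder(self, core_schema: CoreSchema) -> Dict[str, Any]:
        # Overridable: what goes into the schema where generation failed.
        return {}

    def generate(self, core_schema: CoreSchema) -> Dict[str, Any]:
        kind = core_schema['type']
        if kind == 'int':
            return {'type': 'integer'}
        if kind == 'str':
            return {'type': 'string'}
        if kind == 'model':
            props = {name: self.generate(field) for name, field in core_schema['fields'].items()}
            return {'type': 'object', 'properties': props}
        self.errors.append(self.make_error(core_schema))
        return self.error_placeholder(core_schema)


def default_json_schema(core_schema: CoreSchema) -> Dict[str, Any]:
    """Default behaviour: generate first, then raise if anything was collected."""
    generator = SketchGenerator()
    schema = generator.generate(core_schema)
    if generator.errors:
        raise InvalidForJsonSchema('; '.join(generator.errors))
    return schema


core = {'type': 'model', 'fields': {'id': {'type': 'int'}, 'callback': {'type': 'callable'}}}
lenient = SketchGenerator()
partial_schema = lenient.generate(core)  # a framework could inspect lenient.errors itself
```

A framework that wants different behaviour would call `generate` directly and decide what to do with `errors`, which is where the "very little duplication" comes from.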

@tiangolo migration tool

see #5013

Member

@samuelcolvin samuelcolvin left a comment


I think this is looking awesome.

I'm in favour of merging it asap and creating separate issues for outstanding problems & discussions.

@tiangolo
Member

Thanks for the replies to all the points @samuelcolvin!

All looks good, and agreed it makes sense in subsequent PRs.

Field() with **kwargs

About Field() supporting **kwargs, I would like a way to make mypy/editors complain about them, so that developers can avoid using things that are not supported while writing the code and not only at runtime with a warning.

I was playing around and found a way to achieve both, I think. This works at runtime (with a deprecation warning), but shows an error in editors. It's a bit of extra code duplication because of the overload, but I guess it's the cost of backwards compatibility. 😅

Field() implementation
@overload
def Field(
    default: Any = Undefined,
    *,
    default_factory: typing.Callable[[], Any] | None = None,
    alias: str = None,
    title: str = None,
    description: str = None,
    examples: list[Any] = None,
    exclude: typing.AbstractSet[int | str] | typing.Mapping[int | str, Any] | Any = None,
    include: typing.AbstractSet[int | str] | typing.Mapping[int | str, Any] | Any = None,
    gt: float = None,
    ge: float = None,
    lt: float = None,
    le: float = None,
    multiple_of: float = None,
    allow_inf_nan: bool = None,
    max_digits: int = None,
    decimal_places: int = None,
    min_items: int = None,
    max_items: int = None,
    min_length: int = None,
    max_length: int = None,
    frozen: bool = None,
    pattern: str = None,
    discriminator: str = None,
    repr: bool = True,
    strict: bool | None = None,
    json_schema_extra: dict[str, Any] | None = None,
) -> Any:
    ...

def Field(
    default: Any = Undefined,
    *,
    default_factory: typing.Callable[[], Any] | None = None,
    alias: str = None,
    title: str = None,
    description: str = None,
    examples: list[Any] = None,
    exclude: typing.AbstractSet[int | str] | typing.Mapping[int | str, Any] | Any = None,
    include: typing.AbstractSet[int | str] | typing.Mapping[int | str, Any] | Any = None,
    gt: float = None,
    ge: float = None,
    lt: float = None,
    le: float = None,
    multiple_of: float = None,
    allow_inf_nan: bool = None,
    max_digits: int = None,
    decimal_places: int = None,
    min_items: int = None,
    max_items: int = None,
    min_length: int = None,
    max_length: int = None,
    frozen: bool = None,
    pattern: str = None,
    discriminator: str = None,
    repr: bool = True,
    strict: bool | None = None,
    json_schema_extra: dict[str, Any] | None = None,
    **kwargs: Any,
) -> Any:
    """
    Used to provide extra information about a field, either for the model schema or complex validation. Some arguments
    apply only to number fields (``int``, ``float``, ``Decimal``) and some apply only to ``str``.

    :param default: since this is replacing the field's default, its first argument is used
      to set the default, use ellipsis (``...``) to indicate the field is required
    :param default_factory: callable that will be called when a default value is needed for this field
      If both `default` and `default_factory` are set, an error is raised.
    :param alias: the public name of the field
    :param title: can be any string, used in the schema
    :param description: can be any string, used in the schema
    :param examples: can be any list of json-encodable data, used in the schema
    :param exclude: exclude this field while dumping.
      Takes same values as the ``include`` and ``exclude`` arguments on the ``.dict`` method.
    :param include: include this field while dumping.
      Takes same values as the ``include`` and ``exclude`` arguments on the ``.dict`` method.
    :param gt: only applies to numbers, requires the field to be "greater than". The schema
      will have an ``exclusiveMinimum`` validation keyword
    :param ge: only applies to numbers, requires the field to be "greater than or equal to". The
      schema will have a ``minimum`` validation keyword
    :param lt: only applies to numbers, requires the field to be "less than". The schema
      will have an ``exclusiveMaximum`` validation keyword
    :param le: only applies to numbers, requires the field to be "less than or equal to". The
      schema will have a ``maximum`` validation keyword
    :param multiple_of: only applies to numbers, requires the field to be "a multiple of". The
      schema will have a ``multipleOf`` validation keyword
    :param allow_inf_nan: only applies to numbers, allows the field to be NaN or infinity (+inf or -inf),
        which is a valid Python float. Default True, set to False for compatibility with JSON.
    :param max_digits: only applies to Decimals, requires the field to have a maximum number
      of digits within the decimal. It does not include a zero before the decimal point or trailing decimal zeroes.
    :param decimal_places: only applies to Decimals, requires the field to have at most a number of decimal places
      allowed. It does not include trailing decimal zeroes.
    :param min_items: only applies to lists, requires the field to have a minimum number of
      elements. The schema will have a ``minItems`` validation keyword
    :param max_items: only applies to lists, requires the field to have a maximum number of
      elements. The schema will have a ``maxItems`` validation keyword
    :param min_length: only applies to strings, requires the field to have a minimum length. The
      schema will have a ``minLength`` validation keyword
    :param max_length: only applies to strings, requires the field to have a maximum length. The
      schema will have a ``maxLength`` validation keyword
    :param frozen: a boolean. When True, the field raises a TypeError if the field is
      assigned on an instance.  The BaseModel Config must set validate_assignment to True
    :param pattern: only applies to strings, requires the field match against a regular expression
      pattern string. The schema will have a ``pattern`` validation keyword
    :param discriminator: only useful with a (discriminated a.k.a. tagged) `Union` of sub models with a common field.
      The `discriminator` is the name of this common field to shorten validation and improve generated schema
    :param repr: show this field in the representation
    :param json_schema_extra: extra dict to be merged with the JSON Schema for this field
    :param strict: enable or disable strict parsing mode
    """
    current_json_schema_extra: dict[str, Any] | None = None
    if kwargs:
        warnings.warn(
            'Arbitrary Field keywords (**kwargs) have been deprecated, to extend the '
            'generated JSON Schema use the new dict parameter json_schema_extra, the '
            f'invalid keyword arguments are: {kwargs}',
            DeprecationWarning,
        )
        current_json_schema_extra = kwargs.copy()
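    # Explicit json_schema_extra entries take precedence over deprecated **kwargs;
    # the kwargs are kept even when json_schema_extra is not given.
    if json_schema_extra is not None:
        current_json_schema_extra = {**(current_json_schema_extra or {}), **json_schema_extra}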
    return FieldInfo.from_field(
        default,
        default_factory=default_factory,
        alias=alias,
        title=title,
        description=description,
        examples=examples,
        exclude=exclude,
        include=include,
        gt=gt,
        ge=ge,
        lt=lt,
        le=le,
        multiple_of=multiple_of,
        allow_inf_nan=allow_inf_nan,
        max_digits=max_digits,
        decimal_places=decimal_places,
        min_items=min_items,
        max_items=max_items,
        min_length=min_length,
        max_length=max_length,
        frozen=frozen,
        pattern=pattern,
        discriminator=discriminator,
        repr=repr,
        json_schema_extra=current_json_schema_extra,
        strict=strict,
    )
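The runtime half of this trick (accept `**kwargs`, warn, fold them into `json_schema_extra`) can be tried in isolation. The helper `merge_extra` below is a hypothetical reduction of the Field() body and is not pydantic code; it assumes explicit `json_schema_extra` entries should win over the deprecated keywords.

```python
from __future__ import annotations

import warnings
from typing import Any


def merge_extra(json_schema_extra: dict[str, Any] | None = None, **kwargs: Any) -> dict[str, Any] | None:
    """Fold deprecated arbitrary keywords into json_schema_extra, with a warning."""
    merged: dict[str, Any] | None = None
    if kwargs:
        warnings.warn(
            f'Arbitrary keywords are deprecated, use json_schema_extra instead: {kwargs}',
            DeprecationWarning,
            stacklevel=2,
        )
        merged = dict(kwargs)
    if json_schema_extra is not None:
        # Explicit entries override deprecated keywords with the same name.
        merged = {**(merged or {}), **json_schema_extra}
    return merged


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    only_kwargs = merge_extra(example='x')
    both = merge_extra({'example': 'y'}, example='x', deprecated=True)
    only_extra = merge_extra({'a': 1})
```

Only the two calls that pass extra keywords emit a `DeprecationWarning`; the type-checker half of the trick relies on the overload, which hides `**kwargs` from editors.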

errors feature request

I think that should work too, and it could probably be another PR on top. The important thing is that the current approach doesn't make that impossible to achieve/add later. 🎉

@dmontagu
Contributor Author

dmontagu commented Feb 21, 2023

@samuelcolvin

My preference would be to add a config setting for JSON Schema generation, something like mode: Literal['validation', 'serialisation'] or 'input' | 'output' which indicates whether we're building a JSON schema for what's required for MyModel(**data) vs. my_model.model_json() etc.

Then the JSON Schema for validation should be number, but should be string for serialisation where we're going to default to returning a string as per #1511.

There is an important consideration here that I discuss in #5072, which is that when generating clients based on an OpenAPI spec (one of the biggest benefits of FastAPI imo, and one of the main reasons I have used it for years), models are frequently used both as inputs (so will be "validated"), and as outputs (so will be "serialized"). And unless you make two separate models (which comes with its own issues), they will share a schema. So it's not uncommon that you'll have to have one schema for both inputs and outputs, and therefore need to resolve this.

And even if we were okay with creating two separate schemas for the model based on whether it is an "input" or an "output" of the API — and I am not sure we should be, at least in the context of FastAPI — there's still the issue that we'd probably want a single generator instance to produce the "output" format in some places and the "input" format in others, within the same schema (at least that's the closest to how FastAPI does it today, I think). So I'm thinking it may make more sense to somehow make it an annotation on the core schema (that would be set by FastAPI/similar), as opposed to on the generator.

I'm not sure, but I'd be inclined to postpone addressing this in this PR and hopefully have some discussion in #5072

@dmontagu
Contributor Author

dmontagu commented Feb 21, 2023

I was playing around and found a way to achieve both, I think.

@tiangolo Yeah this was my plan for how to do this. We can remove the overload down the line after people have had time to migrate, and it will still cause mypy/IDE errors for them now.

👍

@dmontagu
Contributor Author

dmontagu commented Feb 21, 2023

@samuelcolvin

from your suggestion to @tiangolo's feature request:

schema_generator = schema_generator_cls(by_alias=by_alias, ref_template=ref_template)
s = schema_generator.generate(cls.__pydantic_core_schema__)
if schema_generator.errors:
    raise PydanticInvalidForJsonSchema(...)

Then FastAPI can have its own logic to generate JSON Schema with very little duplication.

The main shortcoming I see with this is providing context about where the error was raised. Right now we don't have a good system for understanding the "path" that was taken through the crazy recursive function calls to produce the error, which I think may be necessary for the kind of errors @tiangolo wants to show. I am not opposed to implementing that logic in principle, but that logic would end up reflecting the structure of the CoreSchema more than the JsonSchema (when they differ anyway), which I could imagine leading to some confusion. (And would require some logic..) Maybe you've thought of a good way to do it.

Another downside to always deferring error raising is that it makes it a lot more annoying to debug when you want it to raise, and see where it was raised. If we add a deferred-error-collection mode I would suggest we still retain the ability to raise immediately (through one choice of an 'errors' kwarg or similar) so that we can easily get a stack trace to where the exception was raised when desired.
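To make the trade-off concrete, here is a toy sketch of an `errors` switch that either raises at the failing node or collects `(path, message)` pairs. Everything in it is hypothetical (`generate_with_errors`, `SchemaGenerationError`, the dict-based core schema), and the path it tracks follows the toy core schema, which is exactly the CoreSchema-vs-JsonSchema mismatch mentioned above.

```python
from __future__ import annotations

from typing import Any, Dict, List, Literal, Tuple

CoreSchema = Dict[str, Any]  # toy stand-in for a pydantic_core schema


class SchemaGenerationError(Exception):
    """Hypothetical error type for this sketch."""


def generate_with_errors(
    core_schema: CoreSchema,
    *,
    errors: Literal['raise', 'collect'] = 'raise',
) -> Tuple[Dict[str, Any], List[Tuple[str, str]]]:
    collected: List[Tuple[str, str]] = []

    def walk(node: CoreSchema, path: Tuple[str, ...]) -> Dict[str, Any]:
        kind = node['type']
        if kind == 'int':
            return {'type': 'integer'}
        if kind == 'model':
            props = {name: walk(child, path + (name,)) for name, child in node['fields'].items()}
            return {'type': 'object', 'properties': props}
        location = '.'.join(path) or '<root>'
        message = f'unsupported core type {kind!r}'
        if errors == 'raise':
            # Raising here keeps a stack trace that points at the failing node.
            raise SchemaGenerationError(f'{location}: {message}')
        collected.append((location, message))
        return {}  # placeholder left at the exact spot of the failure

    return walk(core_schema, ()), collected


nested = {
    'type': 'model',
    'fields': {
        'a': {'type': 'int'},
        'b': {'type': 'model', 'fields': {'callback': {'type': 'callable'}}},
    },
}
schema, problems = generate_with_errors(nested, errors='collect')
```

With `errors='raise'` (the default here) the same input fails immediately, which is the debugging-friendly behaviour worth retaining.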

Either way, I think let's create a separate issue for this. (Actually I might just open a PR making the change I suggested above that at least retains information about where the error is within the final json schema, at least then the alternative is made concrete.)

@dmontagu dmontagu merged commit 73373c3 into pydantic:main Feb 21, 2023
@samuelcolvin
Member

Thanks so much for this @dmontagu, amazing to have this merged.

On the error stuff, my instinct is that the traceback might not be that useful anyway, we could even store the traceback with the errors if really necessary, but I'm also not that bothered about it.

It might be easier to just add a kwarg config setting to GenerateJsonSchema to do all the things with errors that @tiangolo wants, rather than aiming for absolute flexibility and making a horrible API.

@dmontagu
Contributor Author

@Julian I've done some work on JSON schema for discriminated unions in #5051; I know it's more OpenAPI than JSON schema, so not sure if it's something you can help with, but I would appreciate any insight there. (And thanks for your offer to look at the JSON schema stuff, whether or not you can help in this particular case.)

Successfully merging this pull request may close these issues.

V2: JSON Schema
6 participants