Allow customizing core schema generation by making `GenerateSchema` public #6737

adriangb · 2023-07-18T14:36:41Z

The idea here is to make a more constrained version of schema generation customization that we can expand over time.

Selected Reviewer: @lig

cloudflare-pages · 2023-07-18T14:38:31Z

Deploying with Cloudflare Pages

Latest commit:	`6daf1c8`
Status:	✅ Deploy successful!
Preview URL:	https://c13c4805.pydantic-docs2.pages.dev
Branch Preview URL:	https://customize-str-schema.pydantic-docs2.pages.dev

adriangb · 2023-07-18T17:38:41Z

@vitalik I'm curious what you think of this approach. It is more restrictive than arbitrary type replacement but also pretty powerful and easy to understand.

vitalik · 2023-07-18T19:05:29Z

Hi @adriangb thank you for looking into this

Yeah, a few tests on my projects seem to work.

Maybe you should also include an example in the docs on how to achieve v1 behaviour globally:

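from pydantic import BaseModel
from pydantic.v1.validators import str_validator
from pydantic_core import core_schema

# SchemaGenerator is the customization class proposed at this stage of the PR.


class LaxStrGenerator(SchemaGenerator):
    def str_schema(self):
        return core_schema.no_info_plain_validator_function(str_validator)


BaseModel.model_config['schema_generator'] = LaxStrGenerator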

...

adriangb · 2023-07-19T08:57:52Z

Questions to think about:

Will this allow customizing generic containers (lists, dicts, etc)? I think this is something that #6535 (Add a replace_types option to ConfigDict) cannot handle.
Will this allow customizing serialization (json_encoders replacement)? You can return a CoreSchema with serialization, e.g. {**super().str_schema(), 'serialization': ...}.
Can this be used to handle unknown types (np.array)? I think we can easily add an unknown_type() method where you return a CoreSchema or call super().unknown_type() which will eventually error.
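
To make the serialization answer above concrete, here is a minimal sketch of the `{**super().str_schema(), 'serialization': ...}` pattern. It uses a stand-in base class and plain dicts rather than pydantic's real generator; `BaseGenerator`, `MaskedStrGenerator` and `mask` are hypothetical names, and the shape of the `serialization` entry is an assumption modelled on a function-based serializer schema.

```python
from typing import Any, Dict


class BaseGenerator:
    """Stand-in for the schema generator; not pydantic's real class."""

    def str_schema(self) -> Dict[str, Any]:
        # A core schema is a plain dict; this is the default string schema.
        return {'type': 'str'}


def mask(value: str) -> str:
    """Hypothetical serializer that hides the string contents."""
    return '*' * len(value)


class MaskedStrGenerator(BaseGenerator):
    def str_schema(self) -> Dict[str, Any]:
        # Keep every key produced by the parent and add a serialization entry.
        # The entry's layout is an assumption made for this sketch.
        return {
            **super().str_schema(),
            'serialization': {'type': 'function-plain', 'function': mask},
        }


schema = MaskedStrGenerator().str_schema()
```

Validation keys come from the parent untouched, so the override only has to describe what differs.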

adriangb · 2023-07-20T08:15:21Z

After talking with @dmontagu yesterday we came to the conclusion that this approach is promising, but there was a lot of overlap between the new SchemaGenerator and GenerateSchema. The new SchemaGenerator was also too simplistic, e.g. it didn't deal with Annotated or forward references. So I reworked them into a single thing by making GenerateSchema more private, cleaning it up a bit, and adding public overrideable methods like str_schema().

I added methods for some generic types (list_schema and such) and unknown types (a made up NDarray) to prove the idea. @dmontagu suggested adding handling for json_encoders in this PR since it's fundamental that the solution can handle that. I still have to do that but am taking it into account, I think it's doable.

Some open questions:

Are these the right APIs for public methods? I was thinking maybe we should always have (obj, origin, args, annotations) as the thing being passed around. At least that'd allow removing some duplication between match_type and _match_generic_type (I kept the latter private in case we want to tweak this).
Given the above, how does this interact with __prepare_pydantic_annotations__? Is there a world where this replaces it?

adriangb · 2023-07-20T08:15:58Z

@vitalik More feedback from your perspective is appreciated :)

adriangb · 2023-07-20T09:56:59Z

@dmontagu I've added json_encoders support, although I didn't really end up using the rest of this PR for that at all, so maybe it can be its own PR if you think the implementation looks good.

adriangb · 2023-07-20T14:42:14Z

tests/test_generics.py

-    assert type(m.tuple_field) is tuple
-    assert type(m.long_tuple_field) is tuple
+    assert type(m.tuple_field) is CustomTuple
+    assert type(m.long_tuple_field) is CustomLongTuple


Several bugs / inconsistencies are getting fixed by this PR

adriangb · 2023-07-20T14:42:48Z

please review

lig · 2023-07-21T08:50:36Z

Could you please add Fix #6045 into the PR description?

lig

Looks nice 👍🏻

Also, this opens up some possibilities for testing, as schema_generator basically becomes a dependency injection point.

dmontagu · 2023-07-21T15:12:15Z

pydantic/_internal/_generate_schema.py

@@ -317,6 +466,8 @@ def _generate_schema_for_type(
            if metadata_schema:
                self._add_js_function(metadata_schema, metadata_js_function)

+        _add_custom_serialization_from_json_encoders(self._json_encoders, obj, schema)


Should this line go above the metadata_js_function stuff immediately prior? It seems if this line is going to modify the serialization, it would make sense to generate the JSON schema after that modification to schema has been performed.

Does it matter? We're editing the CoreSchema here directly while the js func won't get executed until later. Or maybe I'm misunderstanding. Is there a test case we can add?
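
The reply's reasoning rests on ordinary Python reference semantics: a callable registered early still sees later in-place changes to the same dict. A small sketch of that behaviour follows; it is not pydantic's actual code path, and `make_deferred_reader` and the placeholder serialization value are made up for illustration.

```python
from typing import Any, Callable, Dict


def make_deferred_reader(schema: Dict[str, Any]) -> Callable[[], Dict[str, bool]]:
    # The closure keeps a reference to the same dict object, not a copy.
    def read() -> Dict[str, bool]:
        return {'sees_serialization': 'serialization' in schema}

    return read


schema: Dict[str, Any] = {'type': 'str'}
reader = make_deferred_reader(schema)  # registered first
schema['serialization'] = {'type': 'placeholder'}  # schema mutated afterwards
result = reader()  # executed later, so it observes the mutation
```

If the function copied the schema at registration time instead, the order of the two statements would matter, which is what a test case could check.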

dmontagu · 2023-07-21T15:14:10Z

pydantic/_internal/_generate_schema.py

+    if json_encoders is None:
+        return
+    # Check the class type and its superclasses for a matching encoder
+    for base in (obj, obj.__class__.__mro__[:-1]):


Is the :-1 here just to prevent object from being a base that gets checked? Any reason not to allow people to have an encoder for object? Lol.

I suppose I just copied that from V1
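
For reference, a rough sketch of an MRO-based encoder lookup in the V1 style discussed here, assuming the lookup is keyed on the annotated type itself. `find_encoder` is a hypothetical helper, not the function from the diff; the `[:-1]` slice is what excludes `object`.

```python
from datetime import date, datetime
from typing import Any, Callable, Dict, Optional, Type

JsonEncoders = Dict[Type[Any], Callable[[Any], Any]]


def find_encoder(
    json_encoders: Optional[JsonEncoders], tp: Type[Any]
) -> Optional[Callable[[Any], Any]]:
    if json_encoders is None:
        return None
    # Walk the class and then its bases; [:-1] drops the trailing `object`.
    for base in tp.__mro__[:-1]:
        encoder = json_encoders.get(base)
        if encoder is not None:
            return encoder
    return None


encoders: JsonEncoders = {date: lambda d: d.isoformat()}
# datetime subclasses date, so the date encoder is picked up for it.
datetime_encoder = find_encoder(encoders, datetime)
# An encoder registered for `object` is never reached because of the slice.
object_only = find_encoder({object: repr}, int)
```

Dropping the slice would let an `object` encoder act as a catch-all, which is the behaviour the question above is asking about.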

dmontagu · 2023-07-21T15:15:37Z

pydantic/_internal/_generate_schema.py

+JsonEncoders = Dict[Type[Any], Callable[[Any], Any]]
+
+
+def _add_custom_serialization_from_json_encoders(


Does it make sense for this to be a method of GenerateSchema? I'm okay either way.

Don't think it matters too much; happy to change it.

samuelcolvin

Overall I think this looks great. Good work!

I think after this is merged (doesn't need to be today), GenerateSchema should move to a public module, probably a new module pydantic/generate_schema.py

samuelcolvin · 2023-07-21T14:07:05Z

pydantic/_internal/_generate_schema.py

+    def _arbitrary_types(self) -> bool:
+        return self._config_wrapper.arbitrary_types_allowed
+
+    def literal_schema(self, values: list[Any]) -> CoreSchema:


please add docstrings for all these, ideally linking to the core_schema methods.

samuelcolvin · 2023-07-21T15:08:26Z

pydantic/_internal/_generate_schema.py

+TUPLE_TYPES: list[type] = [tuple, typing.Tuple]
+LIST_TYPES: list[type] = [list, typing.List, collections.abc.MutableSequence]
+SET_TYPES: list[type] = [set, typing.Set, collections.abc.MutableSet]
+FROZEN_SET_TYPES: list[type] = [frozenset, typing.FrozenSet, collections.abc.Set]
+DICT_TYPES: list[type] = [dict, typing.Dict, collections.abc.MutableMapping, collections.abc.Mapping]


can we make these sets?

Looks like we can, but that's not always the case. Types don't promise to be hashable and since it's an O(1) lookup either way I tend to go for a list/tuple for these sorts of things.

Case in point: https://github.com/pydantic/pydantic/actions/runs/5630806077/job/15257133423?pr=6737#step:8:174

Reverting, sorry Samuel.
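
A small illustration of why a list can be safer than a set here, using a made-up unhashable class. `EqOnlyMeta` and `Unhashable` are hypothetical and are not the type that failed in the linked CI run.

```python
class EqOnlyMeta(type):
    """Metaclass defining __eq__ without __hash__, which makes its classes unhashable."""

    def __eq__(cls, other: object) -> bool:
        return cls is other


class Unhashable(metaclass=EqOnlyMeta):
    pass


CANDIDATE_TYPES = [list, Unhashable]

# Membership in a list only needs identity or ==, so this works.
found_in_list = Unhashable in CANDIDATE_TYPES

# Building a set needs hash(), which fails for this class.
try:
    {list, Unhashable}
    set_error = None
except TypeError as exc:
    set_error = exc
```

The same failure would occur on `obj in SOME_SET` when `obj` is unhashable, even if every member of the set is hashable.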

samuelcolvin · 2023-07-21T15:11:48Z

pydantic/_internal/_generate_schema.py

+        elif obj in DICT_TYPES:
+            return self.dict_schema(*(self._get_args_resolving_forward_refs(obj) or (Any, Any)))
+        elif isinstance(obj, TypeAliasType):
+            return self._type_alias_type_schema(obj)


again should _type_schema be public?

I'm somewhat arbitrarily deciding what to make public or not. I picked the ones that I felt were important, simple or proved that something works (e.g. lists). No reason to make everything public now, let's do it a bit at a time as we understand the use cases.

samuelcolvin · 2023-07-21T15:15:57Z

tests/test_config.py

@@ -633,3 +641,26 @@ class Child(Parent):
        model_config: ConfigDict = {'str_to_lower': True}

    assert Child.model_config == {'extra': 'allow', 'str_to_lower': True}
+
+
+def test_json_encoders_model() -> None:


should we hunt the v1 docs and copy tests that used json_encoders? I feel like we had a lot and they probably covered a lot of weird behaviour.

I copied several of them already.

samuelcolvin · 2023-07-21T15:17:45Z

please update.

dmontagu · 2023-07-21T15:48:01Z

tests/test_types_typeddict.py

+def test_schema_generator() -> None:
+    class LaxStrGenerator(GenerateSchema):
+        def str_schema(self) -> CoreSchema:
+            return core_schema.no_info_plain_validator_function(str)


Not sure if this specifically is intended to be the solution for people who want to override the string validation, but if it is, we should make sure the JSON schema generation works. Right now, you get:

>>> ta.json_schema()
pydantic.errors.PydanticInvalidForJsonSchema: Cannot generate a JsonSchema for core_schema.PlainValidatorFunctionSchema ({'type': 'no-info', 'function': <class 'str'>})

That'd be up to them to return a CoreSchema that can generate a JSON schema. I'm not sure there's much we can do.

dmontagu · 2023-07-21T15:52:37Z

pydantic/_internal/_generate_schema.py

@@ -610,50 +791,30 @@ def _generate_schema(self, obj: Any) -> core_schema.CoreSchema:  # noqa: C901

        if _typing_extra.origin_is_union(origin):
            return self._union_schema(obj)
-        elif issubclass(origin, typing.Tuple):  # type: ignore[arg-type]
+        elif origin in TUPLE_TYPES:


now that we are doing so much less issubclass and so much more origin in ..., it might make sense to refactor this into a single dict lookup rather than a bunch of if-elses. Doesn't need to happen in this PR, but I wanted to point it out
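
One possible shape for the single-lookup refactor suggested above, as a sketch only. `ORIGIN_TO_KIND` and `kind_for` are hypothetical names, and the string values are labels for this example rather than pydantic method names.

```python
import collections.abc
import typing
from typing import Any, Dict

# Hypothetical dispatch table from origin type to a schema "kind".
ORIGIN_TO_KIND: Dict[Any, str] = {
    tuple: 'tuple',
    list: 'list',
    collections.abc.MutableSequence: 'list',
    set: 'set',
    collections.abc.MutableSet: 'set',
    frozenset: 'frozenset',
    collections.abc.Set: 'frozenset',
    dict: 'dict',
    collections.abc.MutableMapping: 'dict',
    collections.abc.Mapping: 'dict',
}


def kind_for(tp: Any) -> str:
    # get_origin() maps typing.List[int] to list; bare classes fall back to themselves.
    origin = typing.get_origin(tp) or tp
    return ORIGIN_TO_KIND.get(origin, 'unknown')
```

A dict lookup hashes its key, so this approach assumes every origin passed in is hashable; the list-based membership checks in the diff do not have that requirement.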

dmontagu · 2023-07-21T15:54:27Z

pydantic/_internal/_generate_schema.py

-            return core_schema.any_schema()
+        return self.match_type(obj)
+
+    def match_type(self, obj: Any) -> core_schema.CoreSchema:  # noqa: C901


is this method intended to be public? I know it says we'll evolve this below, but it might make sense to say something to the effect of "if you override this, know that it may be updated in future versions" or similar

dmontagu

Looks good to me. My only concern is making sure that we are clear about what we want to guarantee in terms of breaking changes to GenerateSchema; it feels like there could be bugs in the future where we want to make breaking changes to some of the non-leading-_ methods, not sure if you agree with that but if you do maybe it makes sense to document this more explicitly (even if we don't change the leading-_-ness).

(Also please address Samuel's feedback before merging.)

samuelcolvin · 2023-07-21T17:33:17Z

also mypy tests are failing.

adriangb · 2023-07-22T13:04:00Z

Since we're now passing obj to the methods, maybe we could go back to the issubclass checks in the matching part and then error if it's a subclass in the implementations? That would make it easier for users to implement support for subclasses of supported types (e.g. custom dict, etc.).

We could also arrange this in a hierarchy of methods that roughly reflects the type hierarchy, e.g. match_type -> collection_schema -> mapping_schema -> dict_schema, failing along the way: if it's a custom mapping, mapping_schema throws an error, so if you want to support it you just have to override that one method and call super() for everything else.

Something like this:

from inspect import isclass
from typing import Any, Collection, Dict, Mapping, get_args, get_origin

from pydantic_core import CoreSchema, core_schema


class GenerateSchema:
    def match_type(self, tp: Any) -> CoreSchema:
        if isclass(tp):
            if issubclass(tp, Collection):
                return self.collection_schema(tp, tp)

        origin = get_origin(tp)
        if isclass(origin):
            if issubclass(origin, Collection):
                return self.collection_schema(tp, origin)
        # handle non classes like ForwardRef
        raise NotImplementedError

    def collection_schema(self, tp: Any, origin: Any) -> CoreSchema:
        if issubclass(tp, Mapping):
            return self.mapping_schema(tp, origin)
        raise NotImplementedError('Subclass to support other collection types')

    def mapping_schema(self, tp: Any, origin: Any) -> CoreSchema:
        if origin is not None:
            args = get_args(tp)
            assert len(args) == 2, 'Expected Mapping to have two generic arguments'
            key_type, value_type = args
        else:
            key_type = value_type = Any
        if issubclass(tp, Dict):
            return self.dict_schema(tp, origin, key_type, value_type)
        raise NotImplementedError('Subclass to support other mapping types')

    def dict_schema(self, tp: Any, origin: Any, key_type: Any, value_type: Any) -> CoreSchema:
        if origin is not dict:
            raise NotImplementedError('Subclass to support custom dict subclasses')
        return core_schema.dict_schema(self.match_type(key_type), self.match_type(value_type))
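
To show the "override one method and super() everything else" idea end to end, here is a stdlib-only variant of the sketch above. It is an illustration under several assumptions: it returns plain dicts instead of real CoreSchemas, skips the collection_schema layer, reduces leaf types to their name, and dispatches on the origin because `issubclass` rejects parameterized generics such as `Dict[str, int]`. `SketchGenerator`, `OrderedDictGenerator` and the `'cls'` key are hypothetical.

```python
from collections import OrderedDict
from collections.abc import Mapping
from inspect import isclass
from typing import Any, Dict, get_args, get_origin


class SketchGenerator:
    """Stand-in for the proposed hierarchy; not pydantic's GenerateSchema."""

    def match_type(self, tp: Any) -> Dict[str, Any]:
        origin = get_origin(tp) or tp
        if isclass(origin) and issubclass(origin, Mapping):
            return self.mapping_schema(tp, origin)
        # Leaf types are reduced to their name to keep the sketch short.
        return {'type': getattr(tp, '__name__', repr(tp))}

    def mapping_schema(self, tp: Any, origin: Any) -> Dict[str, Any]:
        # Unparameterized mappings default to Any keys and values.
        key_type, value_type = get_args(tp) or (Any, Any)
        if issubclass(origin, dict):
            return self.dict_schema(tp, origin, key_type, value_type)
        raise NotImplementedError('Subclass to support other mapping types')

    def dict_schema(self, tp: Any, origin: Any, key_type: Any, value_type: Any) -> Dict[str, Any]:
        if origin is not dict:
            raise NotImplementedError('Subclass to support custom dict subclasses')
        return {
            'type': 'dict',
            'keys_schema': self.match_type(key_type),
            'values_schema': self.match_type(value_type),
        }


class OrderedDictGenerator(SketchGenerator):
    def dict_schema(self, tp: Any, origin: Any, key_type: Any, value_type: Any) -> Dict[str, Any]:
        if origin is OrderedDict:
            # Reuse the plain dict handling, then record the concrete class.
            schema = super().dict_schema(tp, dict, key_type, value_type)
            return {**schema, 'cls': OrderedDict}
        return super().dict_schema(tp, origin, key_type, value_type)


plain = SketchGenerator().match_type(Dict[str, int])
ordered = OrderedDictGenerator().match_type(OrderedDict[str, int])
```

The base class rejects `OrderedDict[str, int]` with NotImplementedError, while the subclass supports it by overriding only `dict_schema`.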

adriangb · 2023-07-25T12:06:46Z

After some internal discussion, we came to the conclusion that exposing GenerateSchema publicly is the right thing to do, but we need to do some more refactoring work internally before we can really commit to the APIs. So we're going to just make str_schema() public for now and keep the rest of the methods private.

adriangb · 2023-07-25T12:32:40Z

please review

adriangb force-pushed the customize-str-schema branch from a36fbaa to b045d64 Compare July 20, 2023 10:09

adriangb marked this pull request as ready for review July 20, 2023 14:42

pydantic-hooky bot added the ready for review label Jul 20, 2023

pydantic-hooky bot assigned lig Jul 20, 2023

lig mentioned this pull request Jul 21, 2023

🚧 Move type replacing into GenerateSchema #6790

Closed

5 tasks

lig assigned samuelcolvin Jul 21, 2023

lig requested a review from samuelcolvin July 21, 2023 08:50

lig approved these changes Jul 21, 2023

Kludex mentioned this pull request Jul 21, 2023

Appropriate replacement for json_encoders #6726

Closed

1 task

dmontagu reviewed Jul 21, 2023

samuelcolvin reviewed Jul 21, 2023

pydantic-hooky bot added awaiting author revision and removed ready for review labels Jul 21, 2023

pydantic-hooky bot assigned adriangb and unassigned lig and samuelcolvin Jul 21, 2023

dmontagu reviewed Jul 21, 2023

dmontagu approved these changes Jul 21, 2023

adriangb mentioned this pull request Jul 22, 2023

Method to annotate __get_pydantic_core_schema__ on arbitrary classes/types #6801

Closed

13 tasks

adriangb force-pushed the customize-str-schema branch from eb4f6e3 to c3e2b75 Compare July 22, 2023 13:01

adriangb force-pushed the customize-str-schema branch from eea980c to 63c7ff5 Compare July 25, 2023 11:59

adriangb force-pushed the customize-str-schema branch 2 times, most recently from 2fc538e to 4da8b58 Compare July 25, 2023 12:16

pydantic-hooky bot added ready for review and removed awaiting author revision labels Jul 25, 2023

pydantic-hooky bot assigned lig and unassigned adriangb Jul 25, 2023

samuelcolvin changed the title ~~Allow customizing core schema generation~~ Allow customizing core schema generation by making GenerateSchema public Jul 25, 2023

adriangb added 2 commits July 25, 2023 15:57

Allow customizing core schema generation for strings

f61ba23

Add docstring

b76dc2b

samuelcolvin force-pushed the customize-str-schema branch from a4a7c7d to b76dc2b Compare July 25, 2023 14:58

samuelcolvin enabled auto-merge (squash) July 25, 2023 14:58

pre-commit 🤦

6daf1c8

samuelcolvin merged commit f8c081e into main Jul 25, 2023
46 checks passed

samuelcolvin deleted the customize-str-schema branch July 25, 2023 15:16

samuelcolvin mentioned this pull request Aug 9, 2023

Add a new "base64url" option for ser_json_bytes #7000

Closed

13 tasks
