Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow customizing core schema generation by making GenerateSchema public #6737

Merged
merged 3 commits into from Jul 25, 2023

Conversation

adriangb
Copy link
Member

@adriangb adriangb commented Jul 18, 2023

The idea here is to make a more constrained version of schema generation customization that we can expand over time.

Fix #6045

EDIT by @Kludex: Closes #6726

Selected Reviewer: @lig

@cloudflare-pages
Copy link

cloudflare-pages bot commented Jul 18, 2023

Deploying with  Cloudflare Pages  Cloudflare Pages

Latest commit: 6daf1c8
Status: ✅  Deploy successful!
Preview URL: https://c13c4805.pydantic-docs2.pages.dev
Branch Preview URL: https://customize-str-schema.pydantic-docs2.pages.dev

View logs

@adriangb
Copy link
Member Author

@vitalik I'm curious what you think of this approach. It is more restrictive than arbitrary type replacement but also pretty powerful and easy to understand.

@vitalik
Copy link

vitalik commented Jul 18, 2023

Hi @adriangb thank you for looking into this

yeah, few tests on my projects seems to work

Maybe you should also include some example to docs on how to achieve v1 behaviour globally :

from pydantic.v1.validators import str_validator


class LaxStrGenerator(SchemaGenerator):
    def str_schema(self):
        return core_schema.no_info_plain_validator_function(str_validator)


BaseModel.model_config['schema_generator'] = LaxStrGenerator

...

@adriangb
Copy link
Member Author

adriangb commented Jul 19, 2023

Questions to think about:

  • Will this allow customizing generic containers (lists, dicts, etc)? I think this is something that Add a replace_types option to ConfigDict #6535 can not handle.
  • Will this allow customizing serialization (json_encoders replacement)? You can return a CoreSchema with serialization, e.g. {**super().str_schema(), 'serialization': ...}.
  • Can this be used to handle unknown types (np.array)? I think we can easily add an unknown_type() method where you return a CoreSchema or call super().unknown_type() which will eventually error.

@adriangb
Copy link
Member Author

adriangb commented Jul 20, 2023

After talking with @dmontagu yesterday we came to the conclusion that this approach is promising, but there was a lot of overlap between the new SchemaGenerator and GenerateSchema. The new SchemaGenerator was also too simplistic, e.g. it didn't deal with Annotated or forward references. So I reworked them into a single thing by making GenerateSchema more private, cleaning it up a bit, and adding public overrideable methods like str_schema().

I added methods for some generic types (list_schema and such) and unknown types (a made up NDarray) to prove the idea. @dmontagu suggested adding handling for json_encoders in this PR since it's fundamental that the solution can handle that. I still have to do that but am taking it into account, I think it's doable.

Some open questions:

  1. Are these the right APIs for public methods? I was thinking maybe we should always have (obj, origin, args, annotations) as the thing being passed around. At least that'd allow removing some duplication between match_type and _match_generic_type (I kept the latter private in case we want to tweak this).
  2. Given the above, how does this interact with __prepare_pydantic_annotations__? Is there a world where this replaces it?

@adriangb
Copy link
Member Author

@vitalik More feedback from your perspective is appreciated :)

@adriangb
Copy link
Member Author

adriangb commented Jul 20, 2023

@dmontagu I've added json_encoders support, although I didn't really end up using the rest of this PR for that at all, so maybe it can be it's own PR if you think the implementation looks good.

assert type(m.tuple_field) is tuple
assert type(m.long_tuple_field) is tuple
assert type(m.tuple_field) is CustomTuple
assert type(m.long_tuple_field) is CustomLongTuple
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several bugs / inconsistencies are getting fixed by this PR

@adriangb adriangb marked this pull request as ready for review July 20, 2023 14:42
@adriangb
Copy link
Member Author

please review

@lig
Copy link
Contributor

lig commented Jul 21, 2023

Could you please add Fix #6045 into the PR description?

Copy link
Contributor

@lig lig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice 👍🏻

Also, this opens up some possibilities for testing as schema_generator basically becomes a dependency injection.

@@ -317,6 +466,8 @@ def _generate_schema_for_type(
if metadata_schema:
self._add_js_function(metadata_schema, metadata_js_function)

_add_custom_serialization_from_json_encoders(self._json_encoders, obj, schema)
Copy link
Contributor

@dmontagu dmontagu Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this line go above the metadata_js_function stuff immediately prior? It seems if this line is going to modify the serialization, it would make sense to generate the JSON schema after that modification to schema has been performed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter? We're editing the CoreSchema here directly while the js func won't get executed until later. Or maybe I'm misunderstanding. Is there a test case we can add?

if json_encoders is None:
return
# Check the class type and its superclasses for a matching encoder
for base in (obj, obj.__class__.__mro__[:-1]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the :-1 here just to prevent object from being a base that gets checked? Any reason not to allow people to have an encoder for object? Lol.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I just copied that from V1

JsonEncoders = Dict[Type[Any], Callable[[Any], Any]]


def _add_custom_serialization_from_json_encoders(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for this to be a method of GenerateSchema? I'm okay either way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don’t think it matters too much happy to change it

Copy link
Member

@samuelcolvin samuelcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks great. Good work!

I think after this is merged (doesn't need to be today), that GenerateSchema should more to a public module, probably a new module pydantic/generate_schema.py

pydantic/_internal/_generate_schema.py Outdated Show resolved Hide resolved
def _arbitrary_types(self) -> bool:
return self._config_wrapper.arbitrary_types_allowed

def literal_schema(self, values: list[Any]) -> CoreSchema:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add docstrings for all these, ideally linking to the core_schema methods.

pydantic/_internal/_generate_schema.py Outdated Show resolved Hide resolved
Comment on lines +100 to +107
TUPLE_TYPES: list[type] = [tuple, typing.Tuple]
LIST_TYPES: list[type] = [list, typing.List, collections.abc.MutableSequence]
SET_TYPES: list[type] = [set, typing.Set, collections.abc.MutableSet]
FROZEN_SET_TYPES: list[type] = [frozenset, typing.FrozenSet, collections.abc.Set]
DICT_TYPES: list[type] = [dict, typing.Dict, collections.abc.MutableMapping, collections.abc.Mapping]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make these sets?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can, but that's not always the case. Types don't promise to be hashable and since it's an O(1) lookup either way I tend to go for a list/tuple for these sorts of things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pydantic/_internal/_generate_schema.py Outdated Show resolved Hide resolved
elif obj in DICT_TYPES:
return self.dict_schema(*(self._get_args_resolving_forward_refs(obj) or (Any, Any)))
elif isinstance(obj, TypeAliasType):
return self._type_alias_type_schema(obj)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again should _type_schema be public?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat arbitrarily deciding what to make public or not. I picked the ones that I felt were important, simple or proved that something works (e.g. lists). No reason to make everything public now, let's do it a bit at a time as we understand the use cases.

pydantic/_internal/_generate_schema.py Outdated Show resolved Hide resolved
@@ -633,3 +641,26 @@ class Child(Parent):
model_config: ConfigDict = {'str_to_lower': True}

assert Child.model_config == {'extra': 'allow', 'str_to_lower': True}


def test_json_encoders_model() -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we hunt the v1 docs and copy tests that used json_encoders? I feel like we had a lot and they probably covered a lot of weird behaviour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied several of them already.

@samuelcolvin
Copy link
Member

please update.

def test_schema_generator() -> None:
class LaxStrGenerator(GenerateSchema):
def str_schema(self) -> CoreSchema:
return core_schema.no_info_plain_validator_function(str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this specifically is intended to be the solution for people who want to override the string validation, but if it is, we should make sure the JSON schema generation works. Right now, you get:

>>> ta.json_schema()
pydantic.errors.PydanticInvalidForJsonSchema: Cannot generate a JsonSchema for core_schema.PlainValidatorFunctionSchema ({'type': 'no-info', 'function': <class 'str'>})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd be up to them to return a CoreSchema that can generate a JSON schema. I'm not sure there's much we can do.

@@ -610,50 +791,30 @@ def _generate_schema(self, obj: Any) -> core_schema.CoreSchema: # noqa: C901

if _typing_extra.origin_is_union(origin):
return self._union_schema(obj)
elif issubclass(origin, typing.Tuple): # type: ignore[arg-type]
elif origin in TUPLE_TYPES:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that we are doing so much less issubclass and so much more origin in ..., it might make sense to refactor this into a single dict lookup rather than a bunch of if-elses. Doesn't need to happen in this PR, but I wanted to point it out

return core_schema.any_schema()
return self.match_type(obj)

def match_type(self, obj: Any) -> core_schema.CoreSchema: # noqa: C901
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this method intended to be public? I know it says we'll evolve this below, but it might make sense to say something to the effect of "if you override this, know that it may be updated in future versions" or similar

Copy link
Contributor

@dmontagu dmontagu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. My only concern is making sure that we are clear about what we want to guarantee in terms of breaking changes to GenerateSchema; it feels like there could be bugs in the future where we want to make breaking changes to some of the non-leading-_ methods, not sure if you agree with that but if you do maybe it makes sense to document this more explicitly (even if we don't change the leading-_-ness).

(Also please address Samuel's feedback before merging.)

@samuelcolvin
Copy link
Member

also mypy tests are failing.

@adriangb
Copy link
Member Author

adriangb commented Jul 22, 2023

Since we're now passing obj to the methods, maybe we could go back to the issubclass checks in the matching part and then error if it's a subclass in the implementations? That would make it easier for users to implement support for subclasses of supported types (e.g. custom dict, etc.).

We could also arrange this in a hierarchy of methods that ~ reflects the type hierarchy, e.g. match_types -> collection_schema -> mapping_schema -> dict_schema, where it fails along the way e.g. if it's a custom mapping mapping_schema throws an error so you if you want to make it supported you just have to override that one method and super() everything else.

Something like this:

from inspect import isclass
from typing import Any, Collection, Dict, Mapping, get_args, get_origin

from pydantic_core import CoreSchema, core_schema


class GenerateSchema:
    def match_type(self, tp: Any) -> CoreSchema:
        if isclass(tp):
            if issubclass(tp, Collection):
                return self.collection_schema(tp, tp)

        origin = get_origin(tp)
        if isclass(origin):
            if issubclass(origin, Collection):
                return self.collection_schema(tp, origin)
        # handle non classes like ForwardRef
        raise NotImplementedError

    def collection_schema(self, tp: Any, origin: Any) -> CoreSchema:
        if issubclass(tp, Mapping):
            return self.mapping_schema(tp, origin)
        raise NotImplementedError('Subclass to support other collection types')

    def mapping_schema(self, tp: Any, origin: Any) -> CoreSchema:
        if origin is not None:
            args = get_args(tp)
            assert len(args) == 2, 'Expected Mapping to have two generic arguments'
            key_type, value_type = args
        else:
            key_type = value_type = Any
        if issubclass(tp, Dict):
            return self.dict_schema(tp, origin, key_type, value_type)
        raise NotImplementedError('Subclass to support other mapping types')

    def dict_schema(self, tp: Any, origin: Any, key_type: Any, value_type: Any) -> CoreSchema:
        if origin is not dict:
            raise NotImplementedError('Subclass to support custom dict subclasses')
        return core_schema.dict_schema(self.match_type(key_type), self.match_type(value_type))

@adriangb
Copy link
Member Author

After some internal discussion, we came to the conclusion that exposing GenerateSchema publicly is the right thing to do, but we need to do some more refactoring work internally before we can really commit to the APIs. So we're going to just make string_schema() public for now and keep the rest of the methods private.

@adriangb adriangb force-pushed the customize-str-schema branch 2 times, most recently from 2fc538e to 4da8b58 Compare July 25, 2023 12:16
@adriangb
Copy link
Member Author

please review

@pydantic-hooky pydantic-hooky bot assigned lig and unassigned adriangb Jul 25, 2023
@samuelcolvin samuelcolvin changed the title Allow customizing core schema generation Allow customizing core schema generation by making GenerateSchema public Jul 25, 2023
@samuelcolvin samuelcolvin merged commit f8c081e into main Jul 25, 2023
46 checks passed
@samuelcolvin samuelcolvin deleted the customize-str-schema branch July 25, 2023 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Appropriate replacement for json_encoders V2 - Validating int->str
5 participants