Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V2 Generic Improvements #5351

Closed
2 of 3 tasks
samuelcolvin opened this issue Apr 3, 2023 · 9 comments
Closed
2 of 3 tasks

V2 Generic Improvements #5351

samuelcolvin opened this issue Apr 3, 2023 · 9 comments
Assignees

Comments

@samuelcolvin
Copy link
Member

samuelcolvin commented Apr 3, 2023

  • generic dataclasses
  • a TypedDict to gather these
  • as @dmontagu mentioned, generic tests looks slow - is this indicative of generics being slow? Can we sort it by caching core-schema if there are no generic arguments?
@dmontagu
Copy link
Contributor

dmontagu commented Apr 4, 2023

Generic tests benchmarking:

This script was written based on tests.test_generics.test_generics_work_with_many_parametrized_base_models:

import time
from typing import TypeVar, Generic
from pydantic import BaseModel


def run(n_models: int):
    t0 = time.time()

    T = TypeVar('T')
    C = TypeVar('C')

    class A(BaseModel, Generic[T, C]):
        x: T
        y: C

    class B(A[int, C], BaseModel, Generic[C]):
        pass

    models = []
    for i in range(n_models):
        class M(BaseModel):
            pass

        M.__name__ = f'M{i}'
        models.append(M)

    generics = []
    for m in models:
        Working = B[m]
        generics.append(Working)

    t1 = time.time()
    print(f"{t1 - t0:.2f} seconds elapsed")


run(3000)

On my computer, on commit af73124 (immediately before the schema building refactor), the runtime is ~1.38 seconds.

On the current main branch, I get ~2.40 seconds.

(I will do the same check for validation time shortly, though I suspect the impact is smaller.)

@dmontagu
Copy link
Contributor

dmontagu commented Apr 4, 2023

Benchmarking script for validating with generic models:

import time
from pathlib import Path
from typing import Generic, TypeVar, Union

from pydantic import BaseModel

T = TypeVar('T')

# Based on tests.test_generics.test_generic_recursive_models

class GenericModel1(BaseModel, Generic[T]):
    ref: 'GenericModel2[T]'

    model_config = dict(undefined_types_warning=False)


class GenericModel2(BaseModel, Generic[T]):
    ref: Union[T, GenericModel1[T]]

    model_config = dict(undefined_types_warning=False)


GenericModel1.model_rebuild()
GenericModelStr = GenericModel1[str]

class RecursiveModelStr1(BaseModel):
    ref: 'RecursiveModelStr2'

    model_config = dict(undefined_types_warning=False)


class RecursiveModelStr2(BaseModel):
    ref: Union[str, RecursiveModelStr1]

    model_config = dict(undefined_types_warning=False)

RecursiveModelStr1.model_rebuild()
RecursiveModelStr = RecursiveModelStr1

def run1(n_models: int):
    t0 = time.time()
    for _ in range(n_models):
        # RecursiveModelStr.model_validate(dict(ref=dict(ref=dict(ref=dict(ref='123')))))
        # GenericModelStr.model_validate(dict(ref=dict(ref=dict(ref=dict(ref='123')))))
        GenericModel1[str].model_validate(dict(ref=dict(ref=dict(ref=dict(ref='123')))))
    t1 = time.time()
    print(f"{t1 - t0:.2f} seconds elapsed")


run1(1000000)

# ~1.89s if you use a non-generic recursive model
# ~1.90s if you use a generic model and cache (i.e., don't re-call __class_getitem__)
# ~2.70s if you use a generic model and don't cache (i.e., re-call __class_getitem__)

# ~1.88s on pre-refactor (commit af73124) if you cache generic
# 3.25s on pre-refactor (commit af73124) if you don't cache generic

Good news is that it seems to be highly performant. The __class_getitem__ has also been sped up after the refactor, though it is still pretty slow relative to validation (yay pydantic-core performance). I wonder if there's more low-hanging fruit there, though obviously would advise people not be calling __class_getitem__ all the time.

@dmontagu
Copy link
Contributor

dmontagu commented Apr 4, 2023

I'll note that we can get an ~10% speedup in the previous test (where you don't "cache" the parametrized generic type) if we change pydantic._internal._generics.get_cached_generic_type_early to not compute the early_cache_key every time, and have a code path that caches for the exact types passed. This seems pretty significant given how much else is happening besides __class_getitem__, though admittedly it would require you to be using generics in a pretty hot path for it to have a macroscopic impact.

Putting an @lru_cache on the __class_getitem__ (requires some minor tweaks to get it to work, namely, making a module-level function that gets the decorator and is immediately called by __class_getitem__) gets another 10% on top of that, bringing the runtime down to ~2.1s vs. ~1.9s when using the cached type (and not even calling __class_getitem__ at all). Ideally we'd be able to get closer to the @lru_cache runtime, I'll see if there's a way to do this with weak keys.


Edit: after more investigation, I don't think there's going to be an "easy" way to create a WeakKey version of @lru_cache. It does seem there is room for performance improvements here, but I think it's probably not worth worrying about given you can always cache the created type, and the cache lookups are still pretty fast.

I will note that we can get a ~2% speedup on __class_getitem__ by creating aliases for _GENERIC_TYPE_CACHES.get etc. (reducing the number of getattr calls I guess). But I'm not sure the 2% speed improvement is worth it anyway considering how unlikely this is to ever be a hot path, and the "better" solution of caching the type avoids it anyway.

@dmontagu
Copy link
Contributor

dmontagu commented Apr 4, 2023

I found a way to speed things up to only ~10-15% slower than the pre-refactor for building generic models, and all tests pass, but I'm a bit suspicious. @samuelcolvin I think you'll probably know whether the change is actually safe, or maybe we can create a test that fails with the change.

I think if it's a safe change though, I'd be happy to consider the performance regression "resolved". There may well still be room for further improvements but I didn't see anything super obvious digging into flamegraphs etc; whereas the regression was obvious (double the number of calls of model_rebuild).

@dmontagu dmontagu changed the title V2 Generic Improovements V2 Generic Improvements Apr 6, 2023
@caniko
Copy link
Contributor

caniko commented Apr 24, 2023

I am getting a feeling that you are providing minimal support for generics. Why is that? Generics are such an important part of OOP, and my best OOP experience in Python is with pydantic. This is such a shame...

My feeling comes from this in the V2 alpha post:
While it does not raise an error at runtime yet, subclass checks for parametrized generics should no longer be used. These will result in TypeErrors and we can't promise they will work forever. However, it will be okay to do subclass checks against non-parametrized generic models

Why? Any chances for you to change your mind on this topic? To me it is a no-brainer to just write a test and make sure this never breaks. Could you at least provide the reason for making this decision? Clearly many use generics with pydantic; this seems like an odd feature to exclude.

Disclaimer
Some of my projects that use Pydantic use this functionality. This is detrimental to my biggest pydantic package that should be released next year where I do in fact check if certain models that are parametrized are subclasses of another. The use of generics in these models improve the developer and UI implementation experience by several orders of magnitude.

I hope you will thoroughly reconsider, and thank you nonetheless.


Any news on variadic generics?

@dmontagu
Copy link
Contributor

dmontagu commented Apr 24, 2023

Hi @caniko,

I am getting a feeling that you are providing minimal support for generics.

I am disappointed to hear you feel that way, I agree generics are important and have put in a lot of effort to try to improve the experience of using them in v2. My goal is that they feel more reliable/robust in v2, and are even more encouraged for use than they were in v1.

While it does not raise an error at runtime yet, subclass checks for parametrized generics should no longer be used. These will result in TypeErrors and we can't promise they will work forever. However, it will be okay to do subclass checks against non-parametrized generic models

Why? Any chances for you to change your mind on this topic? To me it is a no-brainer to just write a test and make sure this never breaks.

There are a handful of problems with this:

  • Right now, we return a concrete subclass from __class_getitem__. This isn't the behavior of standard typing generics, and causes problems with some interactions with standard typing generics. For example, this currently isn't working: List[GenericPydanticModel[T]][int]. More generally, we have copies of the typing module functions in many places with minor logic changes to handle pydantic types specially, for maintainability reasons I would really like to eliminate this (for it to work perfectly with typing it will require some changes to typing, but the typing module maintainers have indicated a willingness to make the changes we need).

  • Currently in v2, I am careful to keep the class hierarchy from growing when re-parametrizing models. As a result, reparametrizing doesn't preserve subclass relationships. So:

    from typing import Generic, TypeVar
    
    from pydantic import BaseModel
    
    T = TypeVar("T")
    S = TypeVar("S")
    
    class MyGeneric(BaseModel, Generic[T, S]):
        x: T
        y: S
    
    
    MyPartialGeneric = MyGeneric[int, S]
    MyConcreteGeneric = MyPartialGeneric[str]
    print(issubclass(MyConcreteGeneric, MyPartialGeneric))
    #> False

    Also, IDEs, etc. complain about trying to do this. Really, I don't want to be committed to a specific behavior here because I think there will be ways to improve the ergonomics and consistency with standard typing module by making changes that would break this behavior.

  • Checking subclass relationships for concrete generic models is fairly difficult to do, especially in a way that is consistent with type-checkers. In particular, consider the following:

    class CovariantModel(BaseModel, Generic[T_co]):
        x: T_co
    
    class InvariantModel(BaseModel, Generic[T]):
        x: list[T]
    
    class ContravariantModel(BaseModel, Generic[T_contra]):
        x: Callable[[T_contra], None]
    
    class MultiTypevarModel(BaseModel, Generic[T, S, R]):
        x: T
        y: list[S]
        z: Callable[[R], None]

    What should be the result of the following calls?

    • issubclass(CovariantModel[Any], CovariantModel[int])
    • issubclass(CovariantModel[int], CovariantModel[Any])
    • issubclass(CovariantModel[int], CovariantModel[float])
    • issubclass(CovariantModel[datetime.datetime], CovariantModel[datetime.date])
    • Each of the above for InvariantModel, and ContravariantModel
    • issubclass(MultiTypevarModel[int, S, R], MultiTypevarModel[T, int, R])
    • issubclass(MultiTypevarModel[int, Any, Any], MultiTypevarModel[Any, str, Any])
    • (various other permutations of the MultiTypevarModel)

    The typing module has largely eluded the need to make a decision on this by just disallowing checks for subclass relationships between different parametrized models; I think if we tried to make these subclass checks, there is no specific choice of behavior we could make that wouldn't have reasonable arguments against it.

To be clear, we intend to always support issubclass(MyGenericModel[int, str], MyGenericModel), and that will always work.

If you need to check subclass relationships between two concrete generics, I would suggest checking if their __origin__s are subclasses, and if so, you can implement whatever logic you like for comparing their __args__. I think right now this data lives on the __pydantic_generic_metadata__ attribute, but I think we may change this before the official v2 release to be more consistent with the standard typing module.

@dmontagu
Copy link
Contributor

dmontagu commented Apr 24, 2023

Also, re: variadic generics, it would be helpful if you share a unit test (ideally representative of real-world usage) that you'd like to pass, that will make it easier to work on. Might make sense to make a different issue for it and/or open a PR with just the failing unit test.

@caniko
Copy link
Contributor

caniko commented Apr 25, 2023

Thank you @dmontagu, this was very informative.

To be clear, we intend to always support issubclass(MyGenericModel[int, str], MyGenericModel), and that will always work.

I can relax now.

Might make sense to make a different issue for it and/or open a PR with just the failing unit test.

I can make a PR for it when I have downtime.

@dmontagu
Copy link
Contributor

I'm going to close this issue and open new issues for handling:

  • variadic generics
  • generic dataclasses

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
Development

No branches or pull requests

3 participants