Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to using __dict__ instead of __values__ (up to 14x speedup, lower model size) #711

Closed
Bobronium opened this issue Aug 3, 2019 · 4 comments

Comments

@Bobronium
Copy link
Contributor

Bobronium commented Aug 3, 2019

Question | Feature Request

Recently I've been researching how __dict__ and __slots__ works and discovered that if we inherit from class with __slots__ and don't declare __slots__ in child class, __dict__ will be created automatically.

Also I found out that object.__dict__ uses less memory than ordinal dict, as it shares keys between instances (PEP-0412)

Moreover I found an advice to use __dict__ and __slots__ together in cases when fast access to attributes declared in __slots__ needed.

So combining these researches I replaced
__slots__ = ('__values__', '__fields_set__')
with
__slots__ = ('__dict__', '__fields_set__')
and renamed usages of __values__ in BaseModel (yes, that's exactly the only thing I've made) also removed __getattr__, as it's become redundant.

To see up-to-date changes, visit #712

This lowered model size a bit, but what is most interesting, increased attribute access up to 14 times (average speed up is 10x)!

All tests are passed without any changes, so as far as I know this won't break any existing code except for one that uses __values__ attribute directly. But I guess that can be solved by adding a property:

@property
def __values__(self):
    # throw deprecation warning here
    return self.__dict__

Here's a little test showing a difference:

import datetime
import functools
from sys import getsizeof
from timeit import timeit
from types import ModuleType, FunctionType
from typing import Optional, List

from gc import get_referents

from pydantic import BaseModel


class Article(BaseModel):
    id: int
    label: str
    text: Optional[str]


class User(BaseModel):
    id: int
    login: str
    email: str
    password: str
    first_name: str
    last_name: Optional[str] = None
    articles: Optional[List[Article]] = None

    register_date: datetime.datetime
    last_login: Optional[datetime.datetime] = None


user = User(
    id=42,
    login="mylogin",
    email="email@example.com",
    password="hashed secret password",
    first_name="Some",
    last_name="User",
    articles=[Article(id=1, label="foo"), Article(id=2, label="bar")],
    register_date=datetime.datetime.now(),
    last_login=datetime.datetime.now()
)

globs = {'user': user}
tmt = functools.partial(timeit, globals=globs, number=1000000)


BLACKLIST = type, ModuleType, FunctionType


def get_size(obj):
    """
    sum size of object & members.
    https://stackoverflow.com/a/30316760
    """
    seen_ids = set()
    size = 0
    objects = [obj]
    while objects:
        need_referents = []
        for obj in objects:
            if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                seen_ids.add(id(obj))
                size += getsizeof(obj)
                need_referents.append(obj)
        objects = get_referents(*need_referents)
    return size


print(f'''
has __dict__: {hasattr(user, '__dict__')}
uses __dict__: {not hasattr(user, '__values__')}
has __values__: {hasattr(user, '__values__')}

getsize(user): {getsizeof(user)} bytes
get_size(user): {get_size(user)} bytes
''')

print(f'''\
getattr:
    user.id: {tmt('user.id')}
    user.articles: {tmt('user.articles')}
    user.articles[0].label: {tmt('user.articles[0].label')}
''')

print('getsizeof:')

try:
    print(f'''\
    user.__dict__: {getsizeof(user.__dict__)} bytes
    user.articles[0].__dict__: {getsizeof(user.articles[0].__dict__)} bytes
    user.articles[1].__dict__: {getsizeof(user.articles[1].__dict__)} bytes
''')
except AttributeError:
    print('   __dict__: skipped')

try:
    print(f'''\
    user.__values__: {getsizeof(user.__values__)} bytes
    user.articles[0].__values__: {getsizeof(user.articles[0].__values__)} bytes
    user.articles[1].__values__: {getsizeof(user.articles[1].__values__)} bytes
''')
except AttributeError:
    print('    __values__: skipped')
...

Results with __values__ used in BaseModel:

has __dict__: True
uses __dict__: False
has __values__: True

getsize(user): 72 bytes
get_size(user): 3620 bytes

getattr:
    user.id: 0.5301102200000969
    user.articles: 0.4485138009995353
    user.articles[0].label: 0.8063585330000933

getsizeof:
    user.__dict__: 112 bytes
    user.articles[0].__dict__: 112 bytes
    user.articles[1].__dict__: 112 bytes

    user.__values__: 368 bytes
    user.articles[0].__values__: 240 bytes
    user.articles[1].__values__: 240 bytes

Result with __dict__ used instead of __values__ (upd. results without getattr):

has __dict__: True
uses __dict__: True
has __values__: False

getsize(user): 64 bytes
get_size(user): 3484 bytes

getattr:
    user.id: 0.03707944100096938
    user.articles: 0.03184107699962624
    user.articles[0].label: 0.06509907500003465

getsizeof:
    user.__dict__: 368 bytes
    user.articles[0].__dict__: 240 bytes
    user.articles[1].__dict__: 240 bytes

    __values__: skipped
@Bobronium Bobronium changed the title Switch to using __dict__ instead of __values__ (up to 10x speedup, lower model size) Switch to using __dict__ instead of __values__ (up to 14x speedup, lower model size) Aug 3, 2019
@dmontagu
Copy link
Contributor

dmontagu commented Aug 3, 2019

This is very interesting! I tried running the benchmark, and it looks like there is no change, but I think the benchmark only relates to parsing speed (basically no Model.__getattr__s involved).

I added the following code to the benchmark to see what it looked like in a slightly more realistic usage scenario:

                    passed, result = test.validate(case)
                    # New here:
                    if passed:
                        for _ in range(200):
                            if result.location is not None:
                                getattr(result.location, "latitude")

If I remove the contribution of the parsing part, I got an approximately 10x speedup on your branch relative to master, whether cythonized or not. As far as I can tell, there was no regression in parsing speed.

However, if I replaced the above getattr with just result.dict() or result.json(), I got no speed up. I'm not familiar enough with the internals to clearly understand why this is the case.

At any rate, given how common an operation attribute access is in practice, this seems like an awesome improvement!


There is another big benefit to this (at least for us PyCharm users 😅): by removing the custom __getattr__, PyCharm now raises warnings when you try to access attributes that don't exist! This is currently handled properly by mypy, but PyCharm doesn't respect the @no_type_check decorator for this purpose. I raised this issue with PyCharm a few months ago and got this response:

Reproduced with mypy but it is not clear why __getattr__ should be ignored. PEP 484 contains the following description of no_type_check decorator:

Functions with the @no_type_check decorator should be treated as having no annotations.

In my opinion, this does not mean that decorated functions should be ignored entirely including their existence.

So it sounds like they have no intention of changing the behavior. Just the fact that this change fixes the lack of getattr warnings in PyCharm alone would be enough to get me excited about this change.

Overall, 👍👍

@Bobronium
Copy link
Contributor Author

Bobronium commented Aug 3, 2019

I noticed that BaseModel.dict speed didn't change too.

I guess that is because we acessing __values__ once or twice per one model and getting it directly with object.__getattribute__, as it stored in __slots__).

And yeah, glad to hear it solves one of PyCharm issues 😄.

@samuelcolvin
Copy link
Member

This looks awesome. Thank you very much.

I'll review the code in the PR but reply to the principle here.

I don't think we need to add the __values__ property, that's something internal which I doubt anyone is used, just clear description in HISTORY should be enough.

@samuelcolvin
Copy link
Member

fixed by #712

alexdrydew pushed a commit to alexdrydew/pydantic that referenced this issue Dec 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants