Switch to using dict instead of values (up to 14x speedup, lower model size) #711

Bobronium · 2019-08-03T14:09:38Z

Question | Feature Request

Recently I've been researching how __dict__ and __slots__ works and discovered that if we inherit from class with __slots__ and don't declare __slots__ in child class, __dict__ will be created automatically.

Also I found out that object.__dict__ uses less memory than ordinal dict, as it shares keys between instances (PEP-0412)

Moreover I found an advice to use __dict__ and __slots__ together in cases when fast access to attributes declared in __slots__ needed.

So combining these researches I replaced
__slots__ = ('__values__', '__fields_set__')
with
__slots__ = ('__dict__', '__fields_set__')
and renamed usages of __values__ in BaseModel ~~(yes, that's exactly the only thing I've made)~~ also removed __getattr__, as it's become redundant.

To see up-to-date changes, visit #712

This lowered model size a bit, but what is most interesting, increased attribute access up to 14 times (average speed up is 10x)!

All tests are passed without any changes, so as far as I know this won't break any existing code except for one that uses __values__ attribute directly. But I guess that can be solved by adding a property:

@property
def __values__(self):
    # throw deprecation warning here
    return self.__dict__

Here's a little test showing a difference:

import datetime
import functools
from sys import getsizeof
from timeit import timeit
from types import ModuleType, FunctionType
from typing import Optional, List

from gc import get_referents

from pydantic import BaseModel


class Article(BaseModel):
    id: int
    label: str
    text: Optional[str]


class User(BaseModel):
    id: int
    login: str
    email: str
    password: str
    first_name: str
    last_name: Optional[str] = None
    articles: Optional[List[Article]] = None

    register_date: datetime.datetime
    last_login: Optional[datetime.datetime] = None


user = User(
    id=42,
    login="mylogin",
    email="email@example.com",
    password="hashed secret password",
    first_name="Some",
    last_name="User",
    articles=[Article(id=1, label="foo"), Article(id=2, label="bar")],
    register_date=datetime.datetime.now(),
    last_login=datetime.datetime.now()
)

globs = {'user': user}
tmt = functools.partial(timeit, globals=globs, number=1000000)


BLACKLIST = type, ModuleType, FunctionType


def get_size(obj):
    """
    sum size of object & members.
    https://stackoverflow.com/a/30316760
    """
    seen_ids = set()
    size = 0
    objects = [obj]
    while objects:
        need_referents = []
        for obj in objects:
            if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                seen_ids.add(id(obj))
                size += getsizeof(obj)
                need_referents.append(obj)
        objects = get_referents(*need_referents)
    return size


print(f'''
has __dict__: {hasattr(user, '__dict__')}
uses __dict__: {not hasattr(user, '__values__')}
has __values__: {hasattr(user, '__values__')}

getsize(user): {getsizeof(user)} bytes
get_size(user): {get_size(user)} bytes
''')

print(f'''\
getattr:
    user.id: {tmt('user.id')}
    user.articles: {tmt('user.articles')}
    user.articles[0].label: {tmt('user.articles[0].label')}
''')

print('getsizeof:')

try:
    print(f'''\
    user.__dict__: {getsizeof(user.__dict__)} bytes
    user.articles[0].__dict__: {getsizeof(user.articles[0].__dict__)} bytes
    user.articles[1].__dict__: {getsizeof(user.articles[1].__dict__)} bytes
''')
except AttributeError:
    print('   __dict__: skipped')

try:
    print(f'''\
    user.__values__: {getsizeof(user.__values__)} bytes
    user.articles[0].__values__: {getsizeof(user.articles[0].__values__)} bytes
    user.articles[1].__values__: {getsizeof(user.articles[1].__values__)} bytes
''')
except AttributeError:
    print('    __values__: skipped')
...

Results with `values` used in BaseModel:

has __dict__: True
uses __dict__: False
has __values__: True

getsize(user): 72 bytes
get_size(user): 3620 bytes

getattr:
    user.id: 0.5301102200000969
    user.articles: 0.4485138009995353
    user.articles[0].label: 0.8063585330000933

getsizeof:
    user.__dict__: 112 bytes
    user.articles[0].__dict__: 112 bytes
    user.articles[1].__dict__: 112 bytes

    user.__values__: 368 bytes
    user.articles[0].__values__: 240 bytes
    user.articles[1].__values__: 240 bytes

Result with `dict` used instead of `values` (upd. results without getattr):

has __dict__: True
uses __dict__: True
has __values__: False

getsize(user): 64 bytes
get_size(user): 3484 bytes

getattr:
    user.id: 0.03707944100096938
    user.articles: 0.03184107699962624
    user.articles[0].label: 0.06509907500003465

getsizeof:
    user.__dict__: 368 bytes
    user.articles[0].__dict__: 240 bytes
    user.articles[1].__dict__: 240 bytes

    __values__: skipped

The text was updated successfully, but these errors were encountered:

dmontagu · 2019-08-03T19:35:45Z

This is very interesting! I tried running the benchmark, and it looks like there is no change, but I think the benchmark only relates to parsing speed (basically no Model.__getattr__s involved).

I added the following code to the benchmark to see what it looked like in a slightly more realistic usage scenario:

                    passed, result = test.validate(case)
                    # New here:
                    if passed:
                        for _ in range(200):
                            if result.location is not None:
                                getattr(result.location, "latitude")

If I remove the contribution of the parsing part, I got an approximately 10x speedup on your branch relative to master, whether cythonized or not. As far as I can tell, there was no regression in parsing speed.

However, if I replaced the above getattr with just result.dict() or result.json(), I got no speed up. I'm not familiar enough with the internals to clearly understand why this is the case.

At any rate, given how common an operation attribute access is in practice, this seems like an awesome improvement!

There is another big benefit to this (at least for us PyCharm users 😅): by removing the custom __getattr__, PyCharm now raises warnings when you try to access attributes that don't exist! This is currently handled properly by mypy, but PyCharm doesn't respect the @no_type_check decorator for this purpose. I raised this issue with PyCharm a few months ago and got this response:

Reproduced with mypy but it is not clear why __getattr__ should be ignored. PEP 484 contains the following description of no_type_check decorator:

Functions with the @no_type_check decorator should be treated as having no annotations.

In my opinion, this does not mean that decorated functions should be ignored entirely including their existence.

So it sounds like they have no intention of changing the behavior. Just the fact that this change fixes the lack of getattr warnings in PyCharm alone would be enough to get me excited about this change.

Overall, 👍👍

Bobronium · 2019-08-03T20:14:41Z

I noticed that BaseModel.dict speed didn't change too.

I guess that is because we acessing __values__ once or twice per one model and getting it directly with object.__getattribute__, as it stored in __slots__).

And yeah, glad to hear it solves one of PyCharm issues 😄.

samuelcolvin · 2019-08-05T09:28:53Z

This looks awesome. Thank you very much.

I'll review the code in the PR but reply to the principle here.

I don't think we need to add the __values__ property, that's something internal which I doubt anyone is used, just clear description in HISTORY should be enough.

samuelcolvin · 2019-08-05T15:08:11Z

fixed by #712

Bobronium mentioned this issue Aug 3, 2019

Rename BaseModel.__values__ to BaseModel.__dict__ #712

Merged

4 tasks

Bobronium changed the title ~~Switch to using __dict__ instead of __values__ (up to 10x speedup, lower model size)~~ Switch to using __dict__ instead of __values__ (up to 14x speedup, lower model size) Aug 3, 2019

samuelcolvin added the Performance label Aug 5, 2019

samuelcolvin closed this as completed Aug 5, 2019

Bobronium mentioned this issue Apr 8, 2020

Question: add private attribute #655

Closed

alexdrydew pushed a commit to alexdrydew/pydantic that referenced this issue Dec 23, 2023

fix stubtest issue post-merge (pydantic#711)

54daab7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to using dict instead of values (up to 14x speedup, lower model size) #711

Switch to using dict instead of values (up to 14x speedup, lower model size) #711

Bobronium commented Aug 3, 2019 •

edited

dmontagu commented Aug 3, 2019 •

edited

Bobronium commented Aug 3, 2019 •

edited

samuelcolvin commented Aug 5, 2019

samuelcolvin commented Aug 5, 2019

Switch to using __dict__ instead of __values__ (up to 14x speedup, lower model size) #711

Switch to using __dict__ instead of __values__ (up to 14x speedup, lower model size) #711

Comments

Bobronium commented Aug 3, 2019 • edited

Question | Feature Request

To see up-to-date changes, visit #712

Here's a little test showing a difference:

Results with __values__ used in BaseModel:

Result with __dict__ used instead of __values__ (upd. results without getattr):

dmontagu commented Aug 3, 2019 • edited

Bobronium commented Aug 3, 2019 • edited

samuelcolvin commented Aug 5, 2019

samuelcolvin commented Aug 5, 2019

Switch to using dict instead of values (up to 14x speedup, lower model size) #711

Switch to using dict instead of values (up to 14x speedup, lower model size) #711

Bobronium commented Aug 3, 2019 •

edited

Results with `values` used in BaseModel:

Result with `dict` used instead of `values` (upd. results without getattr):

dmontagu commented Aug 3, 2019 •

edited

Bobronium commented Aug 3, 2019 •

edited