Py2: binary class is large and tracked by gc; implement in C? #53

Closed
jamadden opened this issue Nov 12, 2019 · 0 comments · Fixed by #54
Instances of zodbpickle.binary on CPython 2.7 are at least 32 bytes larger than the equivalent bytes/str object:

>>> import sys
>>> from zodbpickle import binary
>>> sys.getsizeof(binary(''))
69
>>> sys.getsizeof('')
37

They are also tracked by the garbage collector, where bytes (which are known to be immutable) are not:

>>> import gc
>>> b = binary('')
>>> gc.collect(); gc.collect()
0
0
>>> gc.is_tracked(b)
True
>>> gc.is_tracked('')
False

Adding __slots__ = () changes none of this. (16 bytes of the overhead would be for the two GC pointers, another 8 for the __dict__ pointer, if present. I can't explain the final 8. Perhaps alignment? Perhaps the char* is no longer stored at the end of the object when subclassed so there's an extra pointer involved? I haven't looked into it.)
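The same effect is easy to reproduce with any pure-Python subclass of a built-in string type. This sketch uses a bytes subclass on Python 3 (where the mechanics are the same: heap types get the GC flag unconditionally); the class name Binary is just a stand-in for zodbpickle.binary, and the exact byte counts vary by CPython version and platform:

```python
import gc
import sys

class Binary(bytes):
    """Stand-in for zodbpickle.binary: a trivial bytes subclass."""
    __slots__ = ()  # does not remove the GC header or shrink the instance

plain = b''
sub = Binary(b'')

# The subclass instance carries extra per-object overhead
# (GC header plus any subclass layout changes)...
print(sys.getsizeof(plain), sys.getsizeof(sub))

# ...and, unlike the built-in type, is tracked by the collector.
print(gc.is_tracked(plain))
print(gc.is_tracked(sub))
```

On a 64-bit CPython, at least 16 bytes of the difference is the GC header that sys.getsizeof includes for tracked objects.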

This adds up surprisingly quickly because ZODB uses zodbpickle.binary to store OIDs. They get turned into str in some cases, but in ghosts you can see the binary objects:

>>> import persistent
>>> import ZODB
>>> db = ZODB.DB(None)
>>> with db.transaction() as c:
...     c.root.key = persistent.Persistent()
...
>>> with db.transaction() as c:
...     type(c.root.key._p_oid)
...
<type 'str'>
>>> db.cacheMinimize()
>>> with db.transaction() as c:
...     type(c.root.key._p_oid)
...
<class 'zodbpickle.binary'>

In one application, binary was the largest type of object tracked by the GC by an order of magnitude (according to objgraph):

binary                                1141836
LOBucket                              316823
tuple                                 282777
LLBucket                              236532
dict                                  233084
list                                  159828
function                              124778

That's about a 35MB difference in memory used compared to str. Even worse, because all those objects are tracked by the GC, collection times increase nearly 7x (the relative impact diminishes as other objects are added, but the constant cost remains):

$ python -m pyperf timeit \
    -s "strs = [str(i) for i in range(1141836)]; import gc" \
    "gc.collect()"
.....................
Mean +- std dev: 10.5 ms +- 0.9 ms
$ python -m pyperf timeit \
    -s "from zodbpickle import binary; strs = [binary(i) for i in range(1141836)]; import gc" \
    "gc.collect()"
.....................
Mean +- std dev: 69.8 ms +- 3.0 ms
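The slowdown can also be sketched without pyperf. This rough Python 3 reproduction (Binary is again a hypothetical stand-in for zodbpickle.binary, and the object count is much smaller than the application's) times full collections while a large list of tracked versus untracked objects is alive; the absolute numbers are illustrative only:

```python
import gc
import time

class Binary(bytes):
    """Stand-in for zodbpickle.binary on Python 3."""
    __slots__ = ()

def collect_time(objs):
    # Average a few full collections while `objs` keeps everything alive.
    start = time.perf_counter()
    for _ in range(5):
        gc.collect()
    return (time.perf_counter() - start) / 5

# Plain bytes are not tracked, so they add nothing to collection cost.
untracked = [b'%d' % i for i in range(100_000)]
print('bytes   :', collect_time(untracked))
del untracked

# Subclass instances are all tracked and must be traversed every collection.
tracked = [Binary(b'%d' % i) for i in range(100_000)]
print('subclass:', collect_time(tracked))
```

The constant per-collection cost scales with the number of tracked objects, which is why the pyperf numbers above diverge so sharply.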

I don't know of a way to solve these problems in Python, but I'm guessing/hoping it should be pretty simple to solve them by implementing binary using a C extension.
