Py2: binary class is large and tracked by gc; implement in C? #53

Closed
jamadden opened this issue Nov 12, 2019 · 0 comments · Fixed by #54
Instances of zodbpickle.binary on CPython 2.7 are at least 32 bytes larger than the equivalent bytes/str object:

>>> import sys
>>> from zodbpickle import binary
>>> sys.getsizeof(binary(''))
69
>>> sys.getsizeof('')
37

They are also tracked by the garbage collector, where bytes (which are known to be immutable) are not:

>>> import gc
>>> b = binary('')
>>> gc.collect(); gc.collect()
0
0
>>> gc.is_tracked(b)
True
>>> gc.is_tracked('')
False

Adding __slots__ = () changes none of this. (16 bytes of the overhead would be for the two GC pointers, another 8 for the __dict__ pointer, if present. I can't explain the final 8. Perhaps alignment? Perhaps the char* is no longer stored at the end of the object when subclassed so there's an extra pointer involved? I haven't looked into it.)
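The same effect is easy to reproduce with any pure-Python subclass of a built-in string type. This sketch uses a bytes subclass on Python 3 (where the mechanics are the same: heap types get the GC flag unconditionally); the class name Binary is just a stand-in for zodbpickle.binary, and the exact byte counts vary by CPython version and platform:

```python
import gc
import sys

class Binary(bytes):
    """Stand-in for zodbpickle.binary: a trivial bytes subclass."""
    __slots__ = ()  # does not remove the GC header or shrink the instance

plain = b''
sub = Binary(b'')

# The subclass instance carries extra per-object overhead
# (GC header plus any subclass layout changes)...
print(sys.getsizeof(plain), sys.getsizeof(sub))

# ...and, unlike the built-in type, is tracked by the collector.
print(gc.is_tracked(plain))
print(gc.is_tracked(sub))
```

On a 64-bit CPython, at least 16 bytes of the difference is the GC header that sys.getsizeof includes for tracked objects.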

This adds up surprisingly quickly because ZODB uses zodbpickle.binary to store OIDs. They get turned into str in some cases, but in ghosts you can see the binary objects:

>>> import persistent
>>> import ZODB
>>> db = ZODB.DB(None)
>>> with db.transaction() as c:
...     c.root.key = persistent.Persistent()
...
>>> with db.transaction() as c:
...     type(c.root.key._p_oid)
...
<type 'str'>
>>> db.cacheMinimize()
>>> with db.transaction() as c:
...     type(c.root.key._p_oid)
...
<class 'zodbpickle.binary'>

In one application, binary was the largest type of object tracked by the GC by an order of magnitude (according to objgraph):

binary                                1141836
LOBucket                              316823
tuple                                 282777
LLBucket                              236532
dict                                  233084
list                                  159828
function                              124778

That's about a 35MB difference in memory used compared to str. Even worse, because all those objects are tracked by the GC, collection times increase nearly 7x (the relative impact diminishes as other objects are added, but the constant cost remains):

$ python -m pyperf timeit \
    -s "strs = [str(i) for i in range(1141836)]; import gc" \
    "gc.collect()"
.....................
Mean +- std dev: 10.5 ms +- 0.9 ms
$ python -m pyperf timeit \
    -s "from zodbpickle import binary; strs = [binary(i) for i in range(1141836)]; import gc" \
    "gc.collect()"
.....................
Mean +- std dev: 69.8 ms +- 3.0 ms
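The slowdown can also be sketched without pyperf. This rough Python 3 reproduction (Binary is again a hypothetical stand-in for zodbpickle.binary, and the object count is much smaller than the application's) times full collections while a large list of tracked versus untracked objects is alive; the absolute numbers are illustrative only:

```python
import gc
import time

class Binary(bytes):
    """Stand-in for zodbpickle.binary on Python 3."""
    __slots__ = ()

def collect_time(objs):
    # Average a few full collections while `objs` keeps everything alive.
    start = time.perf_counter()
    for _ in range(5):
        gc.collect()
    return (time.perf_counter() - start) / 5

# Plain bytes are not tracked, so they add nothing to collection cost.
untracked = [b'%d' % i for i in range(100_000)]
print('bytes   :', collect_time(untracked))
del untracked

# Subclass instances are all tracked and must be traversed every collection.
tracked = [Binary(b'%d' % i) for i in range(100_000)]
print('subclass:', collect_time(tracked))
```

The constant per-collection cost scales with the number of tracked objects, which is why the pyperf numbers above diverge so sharply.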

I don't know of a way to solve these problems in Python, but I'm guessing/hoping it should be pretty simple to solve them by implementing binary using a C extension.
