Iterating on a zip keeps objects alive longer than expected (test_itertools sometimes leaks references) #86702
12:07:06 vstinner@apu$ ./python -m test test_itertools -R 3:3 -F -j4
0:02:03 load avg: 4.24 [ 2/1] test_itertools failed (2 min 3 sec) -- running: test_itertools (2 min 3 sec), test_itertools (2 min 3 sec)
test_itertools leaked [32, 32, 32] references, sum=96
Kill <TestWorkerProcess #1 running test=test_itertools pid=159366 time=2 min 3 sec> process group
== Tests result: FAILURE ==
1 test OK.
1 test failed:
Total duration: 2 min 3 sec
Leak seen on the AMD64 RHEL7 Refleaks 3.x buildbot. There was a recent itertools change; I have no idea if it's related: commit cc061d0
I wanted to bisect this test before annoying too many people with this random issue. So far, I managed to simplify test_itertools to:

def test_main(verbose=None):
    test_classes = (
        TestBasicOps,
        TestVariousIteratorArgs,
        TestGC,
    )
    support.run_unittest(*test_classes)
You can get a list of all tests with --list-cases and then run them in separate processes.
Thanks. I'll work on it this evening. If this is new, pairwise() is the most likely cause.
I can reproduce the leak with the attached test_zip.py by running:

$ ./python -m test -R 3:3 test_zip_leak -j4 -m test.test_zip_leak.TestGC.test_permutations -m test.test_zip_leak.TestGC.test_zip

The problem is that the builtin zip class keeps an internal tuple alive. See zip_next() and the zipobject.result member (a tuple).
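For illustration, a minimal sketch of the recycling that zip_next() performs (this relies on a CPython implementation detail, and the variable names are mine): once the caller's reference to the previous result dies, the same tuple object is handed back.

z = zip([1, 2], "ab")
first = next(z)
addr = id(first)
del first                   # only zip's internal reference remains
second = next(z)
print(id(second) == addr)   # True on CPython: the result tuple was recycled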
Raymond Hettinger:
I reproduced the issue without involving the itertools module. It seems to be a zip_next() micro-optimization which keeps an object alive longer than expected. This micro-optimization is not new; it's just really hard to trigger the issue. It seems to depend on whether or not the zip object is deleted. The builtin zip type implements a traverse function which visits Py_VISIT(lz->result). The GC is supposed to be able to break such a cycle, no?
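For contrast, a minimal sketch of the "normal" case (my own example, assuming no collection runs before next() fills the result tuple): while the tuple is still tracked, the GC can break the cycle just fine.

import gc
import weakref

gc.disable()                          # keep automatic collections from interfering
container = []
ref = weakref.ref(container)
it = zip([container], [container])
container.append(it)                  # cycle: container -> it -> result -> container
next(it)                              # fills the still-tracked result tuple
del container, it
gc.collect()
print(ref() is None)                  # True: the cycle was collected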
In zip_new(), the lz->result tuple is *not* always tracked by the GC.
Adding Brandt because he recently worked on the zip() code.
It looks like the GC untracks the None-filled result tuple in untrack_tuples, and it's never re-tracked again. This can also happen if it's filled with atomic values on an early iteration and the GC visits it. Perhaps a simple fix is to call _PyObject_GC_TRACK on the result tuple inside of each zip_next call if it's not tracked anymore:

if (!_PyObject_GC_IS_TRACKED(result)) {
    _PyObject_GC_TRACK(result);
}
_PyTuple_MaybeUntrack(result);  // Worth it?
return result;

Although I'm not sure how much of a win we're getting from the result tuple reuse here - maybe it's easier to just not do that.
Simple demo:

>>> import gc
>>> gc.disable()
>>> z = zip([[]])
>>> gc.is_tracked(next(z))
True
>>> z = zip([[]])
>>> gc.collect()
0
>>> gc.is_tracked(next(z))
False
Some quick benchmarks on a normal build with CPU isolation, tuned with pyperf. No PGO/LTO.

$ ./python -m pyperf timeit --rigorous --setup 'i = (None,) * 10_000_000' 'for _, _ in zip(i, i): pass'  # Current master.
.........................................
Mean +- std dev: 279 ms +- 11 ms

$ ./python -m pyperf timeit --rigorous --setup 'i = (None,) * 10_000_000' 'for _, _ in zip(i, i): pass'  # With above fix.
.........................................
Mean +- std dev: 369 ms +- 20 ms

$ ./python -m pyperf timeit --rigorous --setup 'i = (None,) * 10_000_000' 'for _, _ in zip(i, i): pass'  # With above fix (no _PyTuple_MaybeUntrack).
.........................................
Mean +- std dev: 284 ms +- 17 ms

$ ./python -m pyperf timeit --rigorous --setup 'i = (None,) * 10_000_000' 'for _, _ in zip(i, i): pass'  # With no tuple reuse.
.........................................
Mean +- std dev: 526 ms +- 51 ms

Note that this example reuses the result tuple for *every* iteration.
It looks like Victor's original issue is unrelated to zip, though. That test run is clean after adding the same fix to:
...all of which use the same tuple-recycling speed trick. If my zip PR looks good, I can create another one fixing these as well.
(By the way, I didn't know that -F runs the tests forever... so I was waiting *almost* forever for it to finish!)
It appears enumerate is affected as well.
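A sketch of the same probe applied to enumerate, which preallocates and recycles a (None, None) result tuple (assumes a build without the fix; on a fixed build the second print is True):

import gc

gc.disable()
e = enumerate([[]])
print(gc.is_tracked(next(e)))   # True: the result tuple is still tracked

e = enumerate([[]])
gc.collect()                    # untracks the (None, None) result tuple
print(gc.is_tracked(next(e)))   # False on a build without the fix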
Why are there cycles at all? In the normal case there should not be a cycle, and the result tuple should be destroyed when the iteration ends.
Extract of Brandt's PR:

// The GC may have untracked this result tuple if its elements were all
// untracked. Since we're recycling it, make sure it's tracked again:
if (!_PyObject_GC_IS_TRACKED(result)) {
    _PyObject_GC_TRACK(result);
}

I would like to understand why the tuple is no longer tracked, whereas PyTuple_New() returns a tuple which is tracked. Using gdb, I found that gc_collect_main() calls untrack_tuples(young), which untracks all tuples of the young generation. I understand that (when the issue happens):
Problem: the tuple is no longer tracked, whereas its content has changed, and so the newly filled tuple might be part of a reference cycle. Since the tuple is not tracked, the GC can no longer break a reference cycle involving the zip object's internal tuple.

Example of code where the zip tuple is untracked before zip_next() is called on the zip object:

def test_product(self):
    gc.set_threshold(5)
    pools = [(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
             ('a', 'b', 'c'),
             (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
             (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)]
    indices = [0, 2, 10, 11]
    print(indices)
    print(pools)
    list(None for a, b in zip(pools, indices))
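The untrack_tuples() behavior described above can be observed directly from Python (a minimal sketch; the variable names are mine): a collection untracks tuples whose elements are all untracked, while tuples holding tracked objects stay tracked.

import gc

gc.disable()
holds_tracked = ([],)            # contains a GC-tracked list
atomic = tuple([1, "a"])         # built at run time; contains only untracked values
print(gc.is_tracked(holds_tracked), gc.is_tracked(atomic))   # True True
gc.collect()
print(gc.is_tracked(holds_tracked), gc.is_tracked(atomic))   # True False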
test_zip_leak2.py: simplified example to trigger the GC issue. Copy it to Lib/test/ and reproduce the leak with:
Extract of the code: container = []
iterator = zip([container], [container])
# untrack the internal zip->result=(None, None)
gc.collect()
# create a reference cycle: container -> iterator -> container
container.append(iterator)
next(iterator)
# zip->result=(container, container) del container, iterator # Try to break the reference cycle
gc.collect() |
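One way to observe the leak from pure Python (a sketch built on the extract above; the weakref is my addition): on a build without the fix, the container survives the final collection.

import gc
import weakref

container = []
ref = weakref.ref(container)              # lets us check whether container was freed
iterator = zip([container], [container])
gc.collect()                              # untracks the internal zip->result=(None, None)
container.append(iterator)                # cycle: container -> iterator -> result -> container
next(iterator)                            # zip->result=(container, container), still untracked
del container, iterator
gc.collect()
print(ref() is not None)                  # True on a build without the fix: the cycle leaked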
Serhiy Storchaka:
test_itertools creates a reference cycle *on purpose*. See the simplified code in test_zip_leak2.py; it comes from test_itertools.
I'm adding Pablo and Tim, who love GC puzzles.
The zip bug is quite old and has been exercised by test_itertools for a very long time. Suddenly, it started to fail on more and more Refleaks buildbots. It may be because the GC must be triggered at a very specific point, and the default GC threshold triggers the issue with the latest test_itertools changes.
Well, I don’t think the GC is really doing anything wrong here... untracking these sorts of tuples likely results in a noticeable reduction in collection times. This code is definitely testing the limits of what is “okay” for us to do with a tuple, and it’s breaking the GC’s (reasonable) assumptions. I kept digging around, and it appears dict.items() is affected, too (which is a bit scarier). The same fix still works, though.
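A sketch of the same probe for the dict items iterator, which also preallocates and recycles a (None, None) result tuple (assumes a build without the fix):

import gc

gc.disable()
d = {1: []}
it = iter(d.items())
gc.collect()                    # untracks the preallocated result tuple
print(gc.is_tracked(next(it)))  # False on a build without the fix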
functools.reduce looks affected, too.
I added some comments in the PR regarding the possibility of forcing the tracking in the visiting function, to redirect the cost to GC passes instead of paying it when calling next(). But there is another option: not untracking tuples that have refcount == 1.
Actually, this may not be a good idea either: running test_list shows that there are 14786 tuples with refcount == 1 that would have been untracked. That untracking is a non-trivial saving for the GC passes that we would lose if we did this. Performance-wise, I think the best alternative is to force the tracking before visiting in tp_traverse.
Yep, I agree that this approach is better.

I do like the idea of not having to fix this in ~10 different places, but I'm not sure this is enough. The GC visit may happen when the refcount of the result tuple is > 1, but it may drop back to 1 before __next__ is called again:

z = zip([42, []])
item = next(z)
gc.collect()  # Refcount of item == 2, untracked by GC.
del item      # Refcount of item == 1, ready to be recycled.
next(z)       # This should be tracked, but won't be.
Good point. Also, it is possible that a more complex cycle ends up holding more than one reference to that result, so indeed that is not enough.
Thinking about this more, I'm a bit hesitant to put the re-tracking code in the traverse function (mostly stemming from my lack of knowledge about our GC algorithm). I have no idea how problematic tracking and untracking objects *during* collections can be.

Consider the following reference chain:

z -> r -> o -> r

Here, z is a zip iterator, r is its result tuple, and o is some object. Let's say r was untracked in the previous collection (I believe untrack_tuples happens *after* all of the traverse functions are called).

I'm worried that the GC may not be able to detect the cycle if it visits o, then z (which *then* starts tracking r), then r. Or, if this group is determined to be unreachable, untrack_tuples untracks r *before* the cycle is cleaned up. That seems like it could be a problem. Or, if z's refcount drops to zero and it's cleaned up, its traverse function may *never* be called, which leaves the untracked r -> o -> r cycle.

I guess I'm just not sure how resilient the GC is to these sorts of things. Re-tracking the result tuple in __next__ sidesteps all of these potential issues, with no significant performance impact. So I think I might slightly prefer that.
This is a real problem, indeed. We would need to add the tracking to the tp_dealloc of the zip object as well.
It is indeed tricky: the most problematic part of the "surprise tracking" is the validity of the pointers, but the key here is that the traverse function will be called BEFORE (in update_refs()) the pointers start to become invalid (because of the bit reuse).
It is not: untrack_tuples is called over the reachable set after the gc knows what part of the linked list is isolated garbage. If the group is in the unreachable set, then untrack_tuples won't touch it.
This is not a problem IMO: if the GC does not see the cycle in that collection, it will in the next ones. What will happen is that at the beginning the GC does not see r, so it iterates over the linked list of the generation being collected, decrementing the gc refcounts. When it reaches z, it calls the traverse function of z, which tracks r AND decrements its gc refcount, but also adds r to the end of the linked list of the young generation. This means that (assuming we are collecting the middle generation) the cycle will not be seen in this collection, but in a future collection where the young generation is collected, r will be visited normally (notice that the GC can see r without needing to visit z first). Also, notice that untrack_tuples() is called over the reachable set, but r is in the young generation, so it won't be untracked. And if we are collecting the younger generation, then the cycle will be found and cleaned up nicely.
I forgot to mention the conclusion: my conclusion is that although the GC is resilient to these shenanigans (there was also extensive validation of the algorithm), I still prefer to track it on the next() call, because it is technically the right thing to do: a tuple has been mutated, so it must be (potentially) tracked, because it needs to behave "AS IF" the tuple was being created from scratch.
Thanks for that detailed explanation, Pablo. If nobody objects to the current zip fix, I'll merge it and move on to similar PRs for the other affected stuff:
Not sure if it is more ergonomic, but I would prefer to fix this in one PR so the buildbots don't test the partial fixes. Also, it is easier to revert if something goes wrong with the approach, as that would affect all of them. That being said, if you prefer multiple ones, I am also fine :)
Yeah, I'm fine with that.
Adding:
Would it be possible to push a quick & dirty workaround for test_itertools? I'm getting more and more buildbot Refleaks failures. For example, always skip TestGC.test_zip() for now. This issue seems quite complex, and I would prefer to have time to think about it and design a proper fix, rather than having to work under pressure.
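A minimal sketch of the kind of temporary workaround described here (hypothetical; the decorator placement and message are mine, not an actual change):

import unittest

class TestGC(unittest.TestCase):

    @unittest.skip("gh-86702: zip's recycled result tuple can escape GC tracking")
    def test_zip(self):
        ...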