Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-101438: Avoid reference cycle in ElementTree.iterparse. #114269

Merged
merged 5 commits into from
Jan 23, 2024

Conversation

colesbury
Copy link
Contributor

@colesbury colesbury commented Jan 18, 2024

Refactor IterParseIterator to avoid a reference cycle between the iterator() function and the IterParseIterator() instance. This leads to more prompt clean-up of the "source" file if the returned iterator is not exhausted and not otherwise part of a reference cycle.

This also avoids a test failure in the GC implementation for the free-threaded build (#114262): if the "source" file is finalized before the iterator() generator, a ResourceWarning is issued leading to a failure in test_iterparse() in test_xml_etree.py. In theory, this warning can occur in the default build as well, but that is much less likely to occur because it would require an unlucky scheduling of the GC between creation of the generator and the file object in order to change the order of finalization.

Refactor IterParseIterator to avoid a reference cycle between the
iterator() function and the IterParseIterator() instance. This leads to
more prompt clean-up of the "source" file if the returned iterator is
not exhausted and not otherwise part of a reference cycle.

This also avoids a test failure in the GC implementation for the
free-threaded build: if the "source" file is finalized before the
"iterator()" generator, a ResourceWarning is issued leading to a
failure in test_iterparse(). In theory, this warning can occur in
the default build as well, but is much less likely because it would
require an unlucky scheduling of the GC between creation of the generator
and the file object in order to change the order of finalization.
@serhiy-storchaka
Copy link
Member

serhiy-storchaka commented Jan 19, 2024

A class with __next__() method adds an overhead. See #69824, it made benchmarks 2 times slower.

Please test how this change affects the results of the xml_etree_iterparse benchmark in https://github.com/python/pyperformance.

This avoids the `__next__` wrapper and the `root` property, both of
which had a performance impact on the iterparse benchmark in
bm_xml_etree.
@colesbury
Copy link
Contributor Author

Thanks for the pointer @serhiy-storchaka. It had about a 15% regression. I rewrote it to avoid the regression. On my machine, I'm seeing 1.37-1.38 seconds for ten iterations of the iterparse benchmark (both before and after this PR).

@colesbury
Copy link
Contributor Author

@serhiy-storchaka, would you be able to review this?

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test?

There is an existing test

        # Not exhausting the iterator still closes the resource (bpo-43292)
        with warnings_helper.check_no_resource_warning(self):
            it = iterparse(TESTFN)
            del it

It was purposed to test that source.close() is called, but it seems that it was not called at all.

Perhaps running it in a loop can test this issue. And this code can be moved into a separate method.

Lib/xml/etree/ElementTree.py Outdated Show resolved Hide resolved
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@colesbury
Copy link
Contributor Author

@serhiy-storchaka, I don't think we can do better than the current test. That test does catch the behavior. See https://github.com/python/cpython/actions/runs/7617715570/job/20747338297 for the test failure.

@colesbury
Copy link
Contributor Author

I don't think a loop will do much either. The order of finalization is not specified, but in practice it is pretty deterministic. A loop will make the test slower, but not much more reliable.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@serhiy-storchaka serhiy-storchaka enabled auto-merge (squash) January 23, 2024 19:52
@serhiy-storchaka
Copy link
Member

Thank you for your contribution @colesbury. If you look at the history of this code, you will see that it is a continuous fight against a reference loops. So I'm especially grateful for this fix.

@serhiy-storchaka serhiy-storchaka merged commit ce01ab5 into python:main Jan 23, 2024
29 checks passed
@miss-islington-app
Copy link

Thanks @colesbury for the PR, and @serhiy-storchaka for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 23, 2024
…honGH-114269)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
(cherry picked from commit ce01ab5)

Co-authored-by: Sam Gross <colesbury@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 23, 2024
…honGH-114269)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
(cherry picked from commit ce01ab5)

Co-authored-by: Sam Gross <colesbury@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Jan 23, 2024

GH-114499 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 bug and security fixes label Jan 23, 2024
@bedevere-app
Copy link

bedevere-app bot commented Jan 23, 2024

GH-114500 is a backport of this pull request to the 3.11 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.11 only security fixes label Jan 23, 2024
serhiy-storchaka pushed a commit that referenced this pull request Jan 23, 2024
…-114269) (GH-114499)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
(cherry picked from commit ce01ab5)

Co-authored-by: Sam Gross <colesbury@gmail.com>
serhiy-storchaka pushed a commit that referenced this pull request Jan 23, 2024
…-114269) (GH-114500)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
(cherry picked from commit ce01ab5)

Co-authored-by: Sam Gross <colesbury@gmail.com>
@colesbury colesbury deleted the gh-101438-iterparse branch January 24, 2024 17:28
Comment on lines +1258 to +1260
it = IterParseIterator()
wr = weakref.ref(it)
del IterParseIterator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious, why previously both iterator and IterParseIterator names were deleted, but now only IterParseIterator? And what is the purpose of this statement in the first place? I was thinking that iterator.__closure__ stores references to these objects; therefore, unnecessary references should be deleted. However, as per my checks, closure stores only referenced variables inside; pullparser, close_source and wr in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed that it.root = None was deleted. This fact is not documented, but this may still cause unintended errors on the user side if they use root.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right about the it.root = None. I did not intend a behavioral change here, so it seems like a good idea to add it back.

I don't think the del statements matter one way or the other. They look like they break a cycle, but not really, but they also are harmless.

colesbury added a commit to colesbury/cpython that referenced this pull request Jan 30, 2024
Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was
initialized with the root attribute as None. This restores the previous
behavior.
serhiy-storchaka pushed a commit that referenced this pull request Jan 31, 2024
…H-114755)

Prior to gh-114269, the iterator returned by ElementTree.iterparse was
initialized with the root attribute as None. This restores the previous
behavior.
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 31, 2024
…ute (pythonGH-114755)

Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was
initialized with the root attribute as None. This restores the previous
behavior.
(cherry picked from commit 66f95ea)

Co-authored-by: Sam Gross <colesbury@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 31, 2024
…ute (pythonGH-114755)

Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was
initialized with the root attribute as None. This restores the previous
behavior.
(cherry picked from commit 66f95ea)

Co-authored-by: Sam Gross <colesbury@gmail.com>
serhiy-storchaka pushed a commit that referenced this pull request Jan 31, 2024
…bute (GH-114755) (GH-114798)

Prior to gh-114269, the iterator returned by ElementTree.iterparse was
initialized with the root attribute as None. This restores the previous
behavior.
(cherry picked from commit 66f95ea)

Co-authored-by: Sam Gross <colesbury@gmail.com>
serhiy-storchaka pushed a commit that referenced this pull request Jan 31, 2024
…bute (GH-114755) (GH-114799)

Prior to gh-114269, the iterator returned by ElementTree.iterparse was
initialized with the root attribute as None. This restores the previous
behavior.
(cherry picked from commit 66f95ea)

Co-authored-by: Sam Gross <colesbury@gmail.com>
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
…honGH-114269)

The iterator returned by ElementTree.iterparse() may hold on to a file
descriptor. The reference cycle prevented prompt clean-up of the file
descriptor if the returned iterator was not exhausted.
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
…ute (pythonGH-114755)

Prior to pythongh-114269, the iterator returned by ElementTree.iterparse was
initialized with the root attribute as None. This restores the previous
behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants