Memory leak in xml.etree.ElementTree.iterparse #79683
When given XML that would raise a ParseError, xml.etree.ElementTree.iterparse leaks memory if parsing is stopped before the ParseError is raised. Example:

import gc
from io import StringIO
import xml.etree.ElementTree as etree

import objgraph


def parse_xml():
    # Malformed on purpose: the stray </ROOT> would raise a ParseError.
    xml = """
<LEVEL1>
</LEVEL1>
</ROOT>
"""
    parser = etree.iterparse(StringIO(initial_value=xml))
    for _, elem in parser:
        if elem.tag == 'LEVEL1':
            # Stop before the queued ParseError is raised.
            break


def run():
    parse_xml()
    gc.collect()
    # Any Element still alive after a full collection has leaked.
    uncollected_elems = objgraph.by_type('Element')
    print(uncollected_elems)
    objgraph.show_backrefs(uncollected_elems, max_depth=15)


if __name__ == "__main__":
    run()

Output: see this gist, which has an image showing the objects that are retained in memory: https://gist.github.com/grokcode/f89d5c5f1831c6bc373be6494f843de3
I wrote the attached run.py, which confirms a leak using tracemalloc:
$ python3 run.py
1 calls: 15.3B / call (total: 15.3 kB)
100 calls: 15.3B / call (total: 1527.7 kB)
1000 calls: 15.3B / call (total: 15265.0 kB)
Oops, there was a typo; you should read kB instead of B:
1 calls: 15.3 kB / call (total: 15.3 kB)
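The attached run.py is not reproduced in this thread. A minimal sketch of how such a measurement could be made with tracemalloc is shown below. The helper names `parse_xml` and `measure` are assumptions for illustration, not the contents of the actual script, and on an interpreter that already contains the fix the reported growth should be close to zero.

```python
import gc
import tracemalloc
import xml.etree.ElementTree as etree
from io import StringIO

# Same malformed document as in the report: the stray </ROOT> is an error.
XML = "\n<LEVEL1>\n</LEVEL1>\n</ROOT>\n"


def parse_xml():
    # Stop iterating before the queued ParseError would be raised.
    for _, elem in etree.iterparse(StringIO(XML)):
        if elem.tag == "LEVEL1":
            break


def measure(ncalls):
    """Return the change in traced memory (bytes) across ncalls calls (hypothetical helper)."""
    gc.collect()
    tracemalloc.start()
    try:
        before = tracemalloc.get_traced_memory()[0]
        for _ in range(ncalls):
            parse_xml()
        gc.collect()
        after = tracemalloc.get_traced_memory()[0]
    finally:
        tracemalloc.stop()
    return after - before


if __name__ == "__main__":
    for n in (1, 100, 1000):
        total = measure(n)
        print(f"{n} calls: {total / n / 1024:.1f} kB / call "
              f"(total: {total / 1024:.1f} kB)")
```

Calling gc.collect() before the second reading matters: memory that the cycle collector can still reclaim is not a leak, so only what survives a full collection is counted.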
The problem was that the garbage collector failed to detect a reference cycle containing a TreeBuilder.
Oops, my PR 11169 used the wrong issue number: bpo-35257 instead of bpo-35502. Anyway, I closed it; the change is too complex.

IMHO the root issue is the handling of the SyntaxError exception in XMLPullParser.feed(). I wrote a fix, but I don't have the bandwidth to write a unit test checking that the reference cycle is broken.

commit 9f3354d36a89d7898bdb631e5119cc37e9a74840 (fix_etree_leak)
diff --git a/Lib/xml/etree/ElementTree.py b/Lib/xml/etree/ElementTree.py
index c1cf483cf5..f17c52541b 100644
--- a/Lib/xml/etree/ElementTree.py
+++ b/Lib/xml/etree/ElementTree.py
@@ -1266,6 +1266,8 @@ class XMLPullParser:
             try:
                 self._parser.feed(data)
             except SyntaxError as exc:
+                # bpo-35502: Break reference cycle
+                #exc.__traceback__ = None
                 self._events_queue.append(exc)
 
     def _close_and_return_root(self):

I don't see any behavior difference in XMLPullParser.read_events(), which re-raises the exception:

        events = self._events_queue
        while events:
            event = events.popleft()
            if isinstance(event, Exception):
                raise event
            else:
                yield event

PR 11170 is also a nice enhancement (it fixes treebuilder_gc_traverse()), but maybe we should also prevent creating reference cycles in the first place?
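To illustrate why a queued exception can form a reference cycle, here is a minimal sketch that does not use ElementTree at all. The `Holder` class and `freed_without_gc` function are hypothetical stand-ins, not the real XMLPullParser. The exception's traceback refers to the frame of `feed()`, whose local `self` refers back to the object that holds the queue.

```python
import gc
import weakref
from collections import deque


class Holder:
    """Hypothetical stand-in for an object that queues exceptions it catches."""

    def __init__(self):
        self.events = deque()

    def feed(self, clear_traceback):
        try:
            raise SyntaxError("simulated parse error")
        except SyntaxError as exc:
            if clear_traceback:
                # Drop the link exc -> traceback -> frame -> self.
                exc.__traceback__ = None
            self.events.append(exc)


def freed_without_gc(clear_traceback):
    """Return True if a Holder is freed by reference counting alone."""
    was_enabled = gc.isenabled()
    gc.disable()  # make sure only reference counting can free the object
    try:
        holder = Holder()
        ref = weakref.ref(holder)
        holder.feed(clear_traceback)
        del holder
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycle left behind


if __name__ == "__main__":
    print("traceback kept:   ", freed_without_gc(False))
    print("traceback cleared:", freed_without_gc(True))
```

With the traceback kept, the object can only be reclaimed by the cycle collector. With it cleared, reference counting alone is enough. This sketch assumes CPython's reference-counting behavior.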
It is not easy to avoid reference cycles if we use a generator function. And a generator function is much faster than an implementation as a class with a __next__ method. We need to access the iterator object from the code of the generator function, and this creates a cycle.
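The cycle described here can be sketched as follows. `IterParseLike` is a hypothetical class, only loosely modelled on that description and not the actual iterparse implementation. The iterator object stores a generator, and the generator body refers back to the iterator object through a closure.

```python
import gc
import weakref


class IterParseLike:
    """Hypothetical iterator whose generator needs access to the iterator itself."""

    def __init__(self, items):
        self.root = None

        def iterator():
            for item in items:
                yield item
            # The generator sets an attribute on the iterator object,
            # so it must hold a reference to it: self -> _gen -> self.
            self.root = "finished"

        self._gen = iterator()

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._gen)


def survives_refcounting(items):
    """Return (alive after del, freed after gc.collect()) for one iterator."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        it = IterParseLike(items)
        next(it)  # consume one item, then abandon the iterator
        ref = weakref.ref(it)
        del it
        alive_after_del = ref() is not None
        gc.collect()
        return alive_after_del, ref() is None
    finally:
        if was_enabled:
            gc.enable()
```

Because of the cycle, an abandoned iterator is reclaimed only when the cycle collector runs, which is why the collector must be able to traverse every object taking part in the cycle.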
This ticket looks like it's done for 3.7/3.8. Can it be closed?
The 3.6 branch no longer accepts bugfixes.