Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks #87726
bpo-43483 was posted as a "bug" but retracted. Though the problem is real, it is tricky to declare an UNSPECIFIED behavior to be a bug. See that issue page for more discussion and a test case. A brief overview is repeated here.

SCENARIO - XML PARSING LOSES DATA (or not)

The parsing attempts to capture text consisting of very tiny quoted strings. A typical content line reads something like this:

    <p>Colchuck</p>

The parser implements a scheme presented at various tutorial Web sites, using two member functions:

    # Note the name attribute of the current tag group
    def element_handler(self, tagname, attrs):
        self.CurrentTag = tagname

    # Record the content from each "p" tag when encountered
    def characters(self, content):
        if self.CurrentTag == "p":
            self.name = content

After parsing, the captured content is read back:

    > print(parser.name)

But then, after successfully extracting content from perhaps hundreds of thousands of XML tag sets in this way, the parsing suddenly "drops" a few characters of content:

    > print(parser.name)

While this problem was observed with a SAX parser, it can affect expat parsers as well. It affects 32-bit and 64-bit implementations the same, over several major releases of the Python 3 system.

SPECIFIED BEHAVIOR (or not)

The "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (and many prior versions) states:

-----------

If it happens that the content is delivered in two chunks instead of one, the characters() method shown above overwrites the first part of the text with the second part, and some content seems lost. This completely explains the observed behavior.

EXPECTED BEHAVIOR (or not)

Even though the behavior is unspecified, users can have certain expectations about what a reasonable parser should do. Among these:

-- EFFICIENCY: the parser should do simple things simply, and complicated things as simply as possible.

The design can be considered "poor" if thorough testing cannot identify what the actual behaviors are going to be, because those behaviors are rare and unpredictable. The obvious "simple thing," from the user perspective, is that the parser should return each tiny text string as one tiny text chunk. In fact, this is precisely what it does... 99.999% of the time. But then, suddenly, it doesn't.

One hypothesis is that when the parsing scan of raw input text reaches the end of a large internal text buffer, it is easier from the implementer's perspective to flush any text remaining in the old buffer prior to fetching a new one, even if that produces a fragmented chunk with only a couple of characters.

IMPROVEMENTS REQUIRED

Review the code to determine whether the text buffer scenario is in fact the primary cause of inconsistent behavior.
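For context, the data loss in the tutorial-style handler is avoidable on the application side regardless of how the parser chunks text: accumulate chunks in characters() and join them at the element boundary. A minimal sketch (handler and element names are illustrative, and the XML input is a made-up example):

```python
import xml.sax

class NameHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current_tag = ""
        self.buffer = []       # accumulates possibly-fragmented chunks
        self.names = []        # one complete string per <p> element

    def startElement(self, tagname, attrs):
        self.current_tag = tagname
        self.buffer = []       # reset at the element boundary

    def characters(self, content):
        if self.current_tag == "p":
            self.buffer.append(content)   # append, never overwrite

    def endElement(self, tagname):
        if tagname == "p":
            self.names.append("".join(self.buffer))
        self.current_tag = ""

handler = NameHandler()
xml.sax.parseString(b"<root><p>Colchuck</p><p>Dragontail</p></root>", handler)
print(handler.names)
```

This pattern is correct whether a "p" element's text arrives in one chunk or several, since markup always terminates a contiguous character-data run.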
Modify the data handling to defer delivery of content fragments that are small, carrying over a small amount of previously scanned text so that small contiguous text chunks are recombined rather than reported as multiple fragments. If the length of the content text to carry over is greater than some configurable xml.sax.handler.ContiguousChunkLength, the parser can go ahead and deliver it as a fragment.

DOCUMENTING THE IMPROVEMENTS

Strictly speaking: none required. Undefined behaviors are undefined, whether consistent or otherwise. But after the improvements are implemented, it would be helpful to modify the documentation to expose the new performance guarantees, making users more aware of the possible hazards. For example, a new description in the "xml.sax.handler" page might read as follows:

-----------
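The proposed carry-over behavior can also be prototyped today at the application level, without touching the parser. The sketch below is a hypothetical wrapper handler, not an existing API; "ContiguousChunkLength" does not exist in xml.sax.handler, so the chunk_length parameter stands in for that proposed setting:

```python
import xml.sax.handler

class CoalescingHandler(xml.sax.handler.ContentHandler):
    """Wraps another ContentHandler and merges small adjacent text chunks."""

    def __init__(self, inner, chunk_length=4096):
        super().__init__()
        self.inner = inner
        self.chunk_length = chunk_length   # stand-in for the proposed setting
        self._pending = []

    def _flush(self):
        if self._pending:
            self.inner.characters("".join(self._pending))
            self._pending = []

    def characters(self, content):
        self._pending.append(content)
        # Large accumulations are delivered immediately, as proposed above.
        if sum(len(c) for c in self._pending) > self.chunk_length:
            self._flush()

    def startElement(self, name, attrs):
        self._flush()                      # markup ends a contiguous run
        self.inner.startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        self.inner.endElement(name)

# Demonstration: simulate a fragmented delivery and observe one merged chunk.
class Recorder(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

recorder = Recorder()
coalescer = CoalescingHandler(recorder)
coalescer.startElement("p", {})
coalescer.characters("Colch")   # first fragment of a split run
coalescer.characters("uck")     # second fragment of the same run
coalescer.endElement("p")
print(recorder.chunks)
```

Because the wrapper only delays delivery until the next markup event or until chunk_length is exceeded, its memory use stays bounded, which speaks to the denial-of-service concern raised below about unbounded buffering.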
I believe this is a non-bug but documented, by-design, 20-year-old Expat behavior. Collecting all character data before calling a handler can easily become a denial-of-service vulnerability, so "fixing" this could make things worse.
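For users who hit this with the low-level expat interface, a bounded form of coalescing already exists: xml.parsers.expat parsers expose a buffer_text attribute (with the buffer capacity controlled by buffer_size). When buffer_text is true, contiguous character data is buffered internally and delivered in as few CharacterDataHandler calls as possible. A minimal sketch:

```python
import xml.parsers.expat

chunks = []

def char_data(data):
    chunks.append(data)

parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True          # coalesce contiguous character data
parser.CharacterDataHandler = char_data
parser.Parse("<p>Colchuck</p>", True)
print(chunks)
```

Because the buffer has a fixed capacity, this avoids the unbounded accumulation that makes "collect everything first" a denial-of-service hazard.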