Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks #87726

ridgerat · 2021-03-19T18:39:19Z

BPO	43560
Nosy	@ridgerat

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-03-19.18:39:18.829>
labels = ['expert-XML', '3.8', 'type-feature', '3.7', '3.9']
title = 'Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks'
updated_at = <Date 2021-03-19.18:39:18.829>
user = 'https://github.com/ridgerat'

bugs.python.org fields:

activity = <Date 2021-03-19.18:39:18.829>
actor = 'ridgerat1611'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['XML']
creation = <Date 2021-03-19.18:39:18.829>
creator = 'ridgerat1611'
dependencies = []
files = []
hgrepos = []
issue_num = 43560
keywords = []
message_count = 1.0
messages = ['389108']
nosy_count = 1.0
nosy_names = ['ridgerat1611']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue43560'
versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

ridgerat · 2021-03-19T18:39:18Z

bpo-43483 was posted as a "bug" but retracted. Though the problem is real, it is tricky to declare an UNSPECIFIED behavior to be a bug. See that issue page for more discussion and a test case. A brief overview is repeated here.

SCENARIO - XML PARSING LOSES DATA (or not)

The parsing attempts to capture text consisting of very tiny quoted strings. A typical content line reads something like this:

<p>Colchuck</p>

The parser implements a scheme presented at various tutorial Web sites, using two member functions.

   # Note the name attribute of the current tag group
   def element_handler(self, tagname, attrs) :
       self.CurrentTag = tagname      

   # Record the content from each "p" tag when encountered
   def characters(self, content):
       if self.CurrentTag == "p":
           self.name = content

...

> print(parser.name)
"Colchuck"

But then, after successfully extracting content from perhaps hundreds of thousands of XML tag sets in this way, the parsing suddenly "drops" a few characters of content.

> print(parser.name)
"lchuck"

While this problem was observed with a SAX parser, it can affect expat parsers as well. It affects 32-bit and 64-bit implementations the same, over several major releases of the Python 3 system.

SPECIFIED BEHAVIOR (or not)

The "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python Standard Library (and many prior versions) states:

-----------
ContentHandler.characters(content) -- The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks...
-----------

If it happens that the content is delivered in two chunks instead of one, the characters() method shown above overwrites the first part of the text with the second part, and some content seems lost. This completely explains the observed behavior.

EXPECTED BEHAVIOR (or not)

Even though the behavior is unspecified, users can have certain expectations about what a reasonable parser should do. Among these:

-- EFFICIENCY: the parser should do simple things simply, and complicated things as simply as possible
-- CONSISTENCY: the parser behavior should be repeatable and dependable

The design can be considered "poor" if thorough testing cannot identify what the actual behaviors are going to be, because those behaviors are rare and unpredictable.

The obvious "simple thing," from the user perspective, is that the parser should return each tiny text string as one tiny text chunk. In fact, this is precisely what it does... 99.999% of the time. But then, suddenly, it doesn't.

One hypothesis is that when the parsing scan of raw input text reaches the end of a large internal text buffer, it is easier from the implementer's perspective to flush any text remaining in the old buffer prior to fetching a new one, even if that produces a fragmented chunk with only a couple of characters.

IMPROVEMENTS REQUIRED

Review the code to determine whether the text buffer scenario is in fact the primary cause of inconsistent behavior. Modify the data handling to defer delivery of content fragments that are small, carrying over a small amount of previously scanned text so that small contiguous text chunks are recombined rather than reported as multiple fragments. If the length of the content text to carry over is greater than some configurable xml.sax.handler.ContiguousChunkLength, the parser can go ahead and deliver it as a fragment.

DOCUMENTING THE IMPROVEMENTS

Strictly speaking: none required. Undefined behaviors are undefined, whether consistent or otherwise. But after the improvements are implemented, it would be helpful to modify documentation to expose the new performance guarantees, making users more aware of the possible hazards. For example, a new description in the "xml.sax.handler" page might read as follows:

-----------
ContentHandler.characters(content) -- The Parser will call this method to report chunks of character data. In general, character data may be reported as a single chunk or as sequence of chunks; but character data sequences with fewer than xml.sax.handler.ContiguousChunkLength characters, when uninterrupted any other xml.sax.handler.ContentHandler event, are guaranteed to be delivered as a single chunk...
-----------

hartwork · 2024-02-11T14:47:58Z

I believe this a non-bug but documented, by-design, 20 years old Expat behavior. Collecting all character data before calling a handler can become a denial-of-service vulnerability easily, so "fixing" this can make things worse.

ridgerat mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes topic-XML type-feature A feature request or enhancement labels Mar 19, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks #87726

Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks #87726

ridgerat mannequin commented Mar 19, 2021

ridgerat mannequin commented Mar 19, 2021

hartwork commented Feb 11, 2024

Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks #87726

Modify SAX/expat parsing to avoid fragmentation of already-tiny content chunks #87726

Comments

ridgerat mannequin commented Mar 19, 2021

ridgerat mannequin commented Mar 19, 2021

hartwork commented Feb 11, 2024