Only skip XML white-space when scanning for tag-start. #75
This PR fixes an issue with back-tracking when one has to match either an XML element or text content.
referenced this pull request
Feb 3, 2016
Of course: given a situation where we can expect either an element or content, i.e. both
so one might use:
Which for both the above cases will produce
However: if the content is
(There are further details on issue #74, where I reported this bug before fixing it. Sorry for the inbox spam.)
Well, the tests prove that it fixes a class of errors.
The underlying issue is that we transform entities before we consume chunks of content. As a result, the line I changed in the tag parser, which consumed all white-space, also consumed encoded white-space. It is more correct to scan past XML white-space only.
You are correct that this is only a partial fix. I mean to come back to this at the weekend and try to address the fact that this would still consume encoded new-lines and line-feeds, so that bug remains.
(PS: back-tracking is probably the wrong term; there is no back-tracking in the true sense. What is happening is that one parser consumes content that the next parser should consume.)
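To make the distinction concrete, here is a minimal sketch in Python (for illustration only; it is not the library's code, and the function names are hypothetical). It assumes entities have already been decoded, as described above, and compares skipping all white-space with skipping only the four XML white-space characters.

```python
from html import unescape

# The XML specification treats only these four characters as white-space.
XML_WHITESPACE = " \t\r\n"


def skip_any_whitespace(text: str) -> str:
    """Old behaviour (sketch): drop every leading white-space character."""
    return text.lstrip()


def skip_xml_whitespace(text: str) -> str:
    """New behaviour (sketch): drop only leading XML white-space."""
    return text.lstrip(XML_WHITESPACE)


# Entities are decoded *before* the scanner sees the content.
nbsp_content = unescape("&#160;value")     # "\xa0value"
newline_content = unescape("&#10;value")   # "\nvalue"

# An encoded non-breaking space survives only the XML-aware scan.
print(repr(skip_any_whitespace(nbsp_content)))
print(repr(skip_xml_whitespace(nbsp_content)))

# The remaining bug: an encoded line feed decodes to a real XML
# white-space character, so it is still swallowed.
print(repr(skip_xml_whitespace(newline_content)))
```

The last call shows why the fix is partial: once `&#10;` has been decoded, the scanner cannot tell it apart from a literal line feed.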
Having taken a bit of a look at the parser it is clear that this is a bug due to the fact that
Philosophically, I feel the second option is the correct one: the decode phase erases semantics which are vital to the correct interpretation of a document. Practically, the first one is the correct choice, since it can be implemented without changing other packages or breaking the API of this or any other module.
Actually - I realise there is a third (superior) option: deal with this at the token level. Adjust the tokeniser to skip leading and trailing XML whitespace inside elements and not emit events for it (which is kind of what it does now, or at least all you can rely on it to do). At this point the above bug is fixed (since leading XML space is skipped, but encoded elements are always emitted as
Then as a further step, add a field to
This will not require API breakage (even with the new field on
I will prepare a PR for you to review.
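As a rough sketch of the token-level idea, again in Python for illustration: the tokeniser below works on the raw (undecoded) content, trims literal XML white-space at the edges, and always emits character references as their own events. The event names `"text"` and `"entity"`, and the function `tokenise_content`, are assumptions made for this example and do not correspond to the library's actual event types.

```python
import re
from html import unescape

XML_WHITESPACE = " \t\r\n"

# Numeric character references such as &#10; or &#x20;
_CHAR_REF = re.compile(r"&#(?:x[0-9A-Fa-f]+|[0-9]+);")


def tokenise_content(raw: str) -> list:
    """Split raw element content into ("text", s) / ("entity", s) events."""
    events = []
    pos = 0
    for match in _CHAR_REF.finditer(raw):
        if match.start() > pos:
            events.append(("text", raw[pos:match.start()]))
        # Encoded characters are never trimmed, whatever they decode to.
        events.append(("entity", unescape(match.group())))
        pos = match.end()
    if pos < len(raw):
        events.append(("text", raw[pos:]))

    # Trim literal XML white-space at the leading and trailing edges only.
    if events and events[0][0] == "text":
        events[0] = ("text", events[0][1].lstrip(XML_WHITESPACE))
    if events and events[-1][0] == "text":
        events[-1] = ("text", events[-1][1].rstrip(XML_WHITESPACE))

    # Do not emit events for text that was nothing but white-space.
    return [event for event in events if event != ("text", "")]


print(tokenise_content("  &#10;value  "))
print(tokenise_content(" \n "))
```

Because the decision is made before decoding, an encoded line feed is kept as an event while the literal white-space around it is dropped.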