Only skip XML white-space when scanning for tag-start. #75
Conversation
Caused by confusion between XML white-space and white-space.
I'm having a lot of difficulty understanding what the actual bug report is here, or how this actually fixes the bug. Can you clarify?
Of course: given a situation where we can expect either an element or content, i.e. both, one might use:
Which for both the above cases will produce However: if the content is (there are further details on issue #74, where I reported this bug before fixing it; sorry for the inbox spam).
I'm not seeing how the described failure with back-tracking is actually being solved here. I'm not opposed to the code change you've made; it makes sense, but I don't think it's fixing the real bug.
Well, the tests prove that it fixes a class of errors. The underlying issue is that we are transforming entities before we consume chunks of content. This means that the line in the tag parser I changed (which consumed all white-space) also consumed the encoded white-space. It is clearly more correct to only scan past XML white-space. You are right that this is just a partial fix. I mean to come back to this at the weekend and try to address the fact that this would still consume encoded new-lines and carriage returns, so that bug remains. (PS: back-tracking is probably the wrong term here; there is no back-tracking in the true sense. What is happening is that one parser is consuming content that the next parser should consume.)
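To make the ordering problem concrete, here is a small hedged sketch (not the project's actual code; `skip_xml_ws` is a hypothetical stand-in for the tag parser's white-space scan). XML white-space is exactly the four characters of production S in XML 1.0. If entities are decoded before the scanner runs, an encoded space such as `&#x20;` becomes a literal space and is silently skipped along with real white-space:

```python
# XML white-space is exactly these four characters (XML 1.0, production S).
XML_WS = {" ", "\t", "\r", "\n"}

def skip_xml_ws(text, pos):
    """Advance past literal XML white-space only."""
    while pos < len(text) and text[pos] in XML_WS:
        pos += 1
    return pos

# The ordering bug: decoding entities *before* the tag scanner runs
# erases the distinction between literal and encoded white-space.
raw = "  &#x20;<a/>"
decoded = raw.replace("&#x20;", " ")  # entity decode happens too early

print(skip_xml_ws(raw, 0))      # 2 -> stops at the entity, as it should
print(skip_xml_ws(decoded, 0))  # 3 -> the encoded space is consumed too
```

The same loss of information applies to `&#xA;` and `&#xD;`, which is why encoded new-lines and carriage returns remain affected by the partial fix.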
Having taken a closer look at the parser, it is clear that this bug arises because Options:
Philosophically, I feel the second option is the correct one: the decode phase erases semantics which are vital to the correct interpretation of a document. Practically, the first is the correct choice, since it can be implemented without changing other packages or breaking the API of this or any other module. Thoughts, comments?
Actually, I realise there is a third (superior) option: deal with this at the token level. Adjust the tokeniser to skip leading and trailing XML white-space inside elements and not emit events for it (which is roughly what it does now, or at least all you can rely on it to do). At this point the above bug is fixed (since leading XML space is skipped, but encoded elements are always emitted as Then, as a further step, add a field to This will not require API breakage (even with the new field on I will prepare a PR for you to review.
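A minimal sketch of what this token-level approach might look like, under stated assumptions: `tokenize` and `decode_entities` are hypothetical names, the event shapes are invented for illustration, and a real tokeniser would stream rather than split eagerly. The key ordering is that trimming of XML white-space happens before entity decoding, so an encoded space inside a text node survives:

```python
import re

XML_WS = " \t\r\n"

def decode_entities(text):
    """Decode hexadecimal character references such as &#x20;."""
    return re.sub(r"&#x([0-9a-fA-F]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

def tokenize(xml):
    """Toy tokeniser emitting ("tag", raw) and ("text", value) events.
    Text nodes that are pure XML white-space produce no event; mixed
    text nodes are trimmed of leading/trailing XML white-space, and
    entities are decoded only *after* trimming."""
    events = []
    for part in re.split(r"(<[^>]+>)", xml):
        if not part:
            continue
        if part.startswith("<"):
            events.append(("tag", part))
        else:
            trimmed = part.strip(XML_WS)
            if trimmed:  # pure white-space emits no event
                events.append(("text", decode_entities(trimmed)))
    return events

print(tokenize("<a>  hello  </a>"))
# [('tag', '<a>'), ('text', 'hello'), ('tag', '</a>')]
print(tokenize("<a>  &#x20;hi  </a>"))
# [('tag', '<a>'), ('text', ' hi'), ('tag', '</a>')]
```

Because the encoded space is still an entity when trimming happens, it is preserved in the emitted text event, while literal surrounding white-space is dropped, which is exactly the distinction the decode-first pipeline loses.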
PR #76 implements the first stage of this improved tokenisation.
Superseded by #76.
I failed to replicate my issue (annoyingly), but it seemed worthwhile to include a test for this behaviour anyway. Feel free to ignore, but you may wish to include it. This PR fixes an issue with back-tracking when one has to match either an XML element or text content.