
Only skip XML white-space when scanning for tag-start. #75

Closed
wants to merge 3 commits

Conversation

@alexkalderimis (Contributor)

Annoyingly, I failed to replicate my issue, but it seemed worthwhile to include a test for this behaviour anyway. Feel free to ignore this, but you may wish to include it.

This PR fixes an issue with back-tracking when one has to match either an XML element or text content.

The bug is caused by confusion between XML white-space (space, tab, carriage return, and line feed) and general Unicode white-space (as matched by Data.Char.isSpace).
@alexkalderimis changed the title from "This is a test for #74" to "Only skip XML white-space when scanning for tag-start." on Feb 3, 2016
@snoyberg (Owner) commented Feb 4, 2016

I'm having a lot of difficulty understanding what the actual bug report is here, or how this actually fixes the bug. Can you clarify?

@alexkalderimis (Contributor, Author)

Of course: given a situation where we can expect either an element or content, i.e. both <outer><inner>x</inner></outer> and <outer>x</outer> are valid, attempting to use choose will trim leading escaped white-space from the content.

So one might use:

nestedOrBare = tagNoAttr "outer" $ choose [tagNoAttr "inner" content, contentMaybe]

For both of the cases above this will produce "x".

However, if the content is <outer>&#160;x</outer> this will produce "x" rather than the correct value of "\160x", because the first alternative of choose scans forward using isSpace.
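
Putting that together as a minimal, self-contained sketch (illustrative only, not code from this PR; it assumes a recent xml-conduit and conduit API):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (join)
import qualified Data.ByteString.Lazy as L
import Data.Conduit (ConduitT, runConduit, (.|))
import Data.Text (Text)
import Data.XML.Types (Event)
import Text.XML.Stream.Parse

-- Accepts both <outer><inner>x</inner></outer> and <outer>x</outer>.
-- tagNoAttr wraps its result in Maybe, hence the nested Maybe.
nestedOrBare :: ConduitT Event o IO (Maybe (Maybe Text))
nestedOrBare = tagNoAttr "outer" $
  choose [tagNoAttr "inner" content, contentMaybe]

run :: L.ByteString -> IO (Maybe Text)
run input = join <$> runConduit (parseLBS def input .| nestedOrBare)

main :: IO ()
main = do
  print =<< run "<outer><inner>x</inner></outer>"  -- Just "x", as expected
  print =<< run "<outer>x</outer>"                 -- Just "x", as expected
  -- Expected Just "\160x"; the isSpace-based skip in the tag parser
  -- consumes the decoded &#160; first, so this prints Just "x":
  print =<< run "<outer>&#160;x</outer>"
```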

(There are further details in issue #74, where I reported this bug before fixing it - sorry for the inbox spam.)

@snoyberg (Owner) commented Feb 5, 2016

I'm not seeing how the back-tracking failure you describe is actually being solved here. I'm not opposed to the code change you've made, and it makes sense, but I don't think it fixes the real bug.

@alexkalderimis (Contributor, Author)

Well, the tests prove that it fixes a class of errors.

The underlying issue is that we transform entities before we consume chunks of content. This means that the line in the tag parser I changed (which consumed all white-space) also consumed the encoded white-space. It is clearly more correct to scan past only XML white-space.
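
For concreteness, this is the distinction in question (isXmlSpace is an illustrative name, not necessarily the identifier used in the patch):

```haskell
import Data.Char (isSpace)

-- XML white-space is exactly these four characters (the S production
-- in the XML specification):
isXmlSpace :: Char -> Bool
isXmlSpace c = c == ' ' || c == '\t' || c == '\r' || c == '\n'

-- By contrast, Data.Char.isSpace '\xA0' is True, so a skip based on
-- isSpace also consumes a decoded &#160; (non-breaking space):
demo :: (Bool, Bool)
demo = (isSpace '\xA0', isXmlSpace '\xA0')  -- (True, False)
```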

You are correct that this is only a partial fix. I intend to come back to this at the week-end and address the fact that this change would still consume encoded new-lines and line-feeds; that bug remains.

(PS: back-tracking is probably the wrong term; there is no back-tracking in the true sense here. What is happening is that one parser consumes content that the next parser should consume.)

@alexkalderimis (Contributor, Author)

Having taken a closer look at the parser, it is clear that this bug arises because decodeXmlEntities runs before any of the Event consumers have access to the emitted events. This erases the distinction between plain text content and encoded values, changing the semantics.
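
To illustrate, here is a sketch of the event stream a consumer sees for <outer>&#160;x</outer> once entities have been decoded (the exact chunking of content events may differ):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.XML.Types

-- The &#160; has already become plain text by the time consumers run,
-- indistinguishable from a literal non-breaking space:
events :: [Event]
events =
  [ EventBeginElement "outer" []
  , EventContent (ContentText "\160x")
  , EventEndElement "outer"
  ]
```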

Options:

  • Pass the decoder to the parsing primitives, making them responsible for managing the distinction. This is a rather heavy-weight and error-prone option.
  • Extend the definition of Content from = ContentText Text | ContentEntity Text by adding another constructor, DecodedEntity Text, thus preserving the information erased by decodeEntities (sketched below). The downside is that this obviously involves a change to Data.XML.Types and would be a major backwards-incompatible change.
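
A sketch of the second option (hypothetical: a DecodedEntity constructor does not exist in Data.XML.Types today, and adding it is exactly the breaking change described above):

```haskell
import Data.Text (Text)

-- Hypothetical replacement for Data.XML.Types.Content:
data Content
    = ContentText Text     -- literal character data
    | ContentEntity Text   -- an unresolved entity reference
    | DecodedEntity Text   -- NEW: text produced by entity decoding,
                           -- kept distinct from literal text
```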

Philosophically, I feel the second option is the correct one: the decode phase erases semantics that are vital to the correct interpretation of a document. Practically, the first is the correct choice, since it can be implemented without changing other packages or breaking the API of this or any other module.

Thoughts, comments?

@alexkalderimis (Contributor, Author)

Actually - I realise there is a third (superior) option: deal with this at the token level. Adjust the tokeniser to skip leading and trailing XML white-space inside elements and not emit events for it (which is roughly what it does now, or at least all you can rely on it to do). At that point the above bug is fixed, since leading XML white-space is skipped but encoded entities are always emitted as Content.

Then, as a further step, add a field to ParseSettings such as preserveAllWhiteSpace, which allows all white-space in elements to be emitted as content events. Some applications require white-space to be preserved as-is, and this should be available as a setting.

This will not require breaking the API (even with the new field on ParseSettings, since its only public constructor is def), and can be implemented solely in this package.
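
A sketch of the intended usage (psPreserveAllWhiteSpace is hypothetical, named after the existing ps* fields; because ParseSettings values are only ever built from def plus record update, existing call sites keep compiling when a field is added):

```haskell
import Text.XML.Stream.Parse (ParseSettings, def)

-- Hypothetical: the new field would default to False in def, so only
-- applications that opt in see the extra white-space content events.
settings :: ParseSettings
settings = def { psPreserveAllWhiteSpace = True }  -- hypothetical field
```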

I will prepare a PR for you to review.

@alexkalderimis (Contributor, Author)

PR #76 implements the first stage of this improved tokenisation.

@alexkalderimis (Contributor, Author)

Superseded by #76.
