
Only skip XML white-space when scanning for tag-start. #75

Closed
wants to merge 3 commits

Conversation

@alexkalderimis (Contributor)

Annoyingly, I failed to replicate my issue, but it seemed worthwhile to include a test for this behaviour anyway. Feel free to ignore this, but you may wish to include it.

This PR fixes an issue with back-tracking when one has to match either an XML element or text content.

The bug is caused by confusion between XML white-space (space, tab, carriage return, and line feed) and general Unicode white-space (as matched by Data.Char.isSpace).
@alexkalderimis changed the title from "This is a test for #74" to "Only skip XML white-space when scanning for tag-start." on Feb 3, 2016
@snoyberg (Owner) commented Feb 4, 2016

I'm having a lot of difficulty understanding what the actual bug report is here, or how this actually fixes the bug. Can you clarify?

@alexkalderimis (Contributor, Author)

Of course: given a situation where we can expect either an element or content, i.e. both <outer><inner>x</inner></outer> and <outer>x</outer> are valid, attempting to use choose will trim leading escaped white-space from the content.

So one might use:

nestedOrBare = tagNoAttr "outer" $ choose [tagNoAttr "inner" content, contentMaybe]

For both of the cases above this will produce "x".

However, if the content is <outer>&#160;x</outer> this will produce "x" rather than the correct value of "\160x", because the first alternative of choose scans forward using isSpace.
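
Putting that together as a minimal, self-contained sketch (illustrative only, not code from this PR; it assumes a recent xml-conduit and conduit API):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (join)
import qualified Data.ByteString.Lazy as L
import Data.Conduit (ConduitT, runConduit, (.|))
import Data.Text (Text)
import Data.XML.Types (Event)
import Text.XML.Stream.Parse

-- Accepts both <outer><inner>x</inner></outer> and <outer>x</outer>.
-- tagNoAttr wraps its result in Maybe, hence the nested Maybe.
nestedOrBare :: ConduitT Event o IO (Maybe (Maybe Text))
nestedOrBare = tagNoAttr "outer" $
  choose [tagNoAttr "inner" content, contentMaybe]

run :: L.ByteString -> IO (Maybe Text)
run input = join <$> runConduit (parseLBS def input .| nestedOrBare)

main :: IO ()
main = do
  print =<< run "<outer><inner>x</inner></outer>"  -- Just "x", as expected
  print =<< run "<outer>x</outer>"                 -- Just "x", as expected
  -- Expected Just "\160x"; the isSpace-based skip in the tag parser
  -- consumes the decoded &#160; first, so this prints Just "x":
  print =<< run "<outer>&#160;x</outer>"
```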

(There are further details in issue #74, where I reported this bug before fixing it - sorry for the inbox spam.)

@snoyberg (Owner) commented Feb 5, 2016

I'm not seeing how the back-tracking failure you describe is actually being solved here. I'm not opposed to the code change you've made, and it makes sense, but I don't think it fixes the real bug.

@alexkalderimis (Contributor, Author)

Well, the tests prove that it fixes a class of errors.

The underlying issue is that we transform entities before we consume chunks of content. This means that the line in the tag parser I changed (which consumed all white-space) also consumed the encoded white-space. It is clearly more correct to scan past only XML white-space.
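
For concreteness, this is the distinction in question (isXmlSpace is an illustrative name, not necessarily the identifier used in the patch):

```haskell
import Data.Char (isSpace)

-- XML white-space is exactly these four characters (the S production
-- in the XML specification):
isXmlSpace :: Char -> Bool
isXmlSpace c = c == ' ' || c == '\t' || c == '\r' || c == '\n'

-- By contrast, Data.Char.isSpace '\xA0' is True, so a skip based on
-- isSpace also consumes a decoded &#160; (non-breaking space):
demo :: (Bool, Bool)
demo = (isSpace '\xA0', isXmlSpace '\xA0')  -- (True, False)
```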

You are correct that this is only a partial fix. I intend to come back to this at the week-end and address the fact that this change would still consume encoded new-lines and line-feeds; that bug remains.

(PS: back-tracking is probably the wrong term; there is no back-tracking in the true sense here. What is happening is that one parser consumes content that the next parser should consume.)

@alexkalderimis (Contributor, Author)

Having taken a closer look at the parser, it is clear that this bug arises because decodeXmlEntities runs before any of the Event consumers have access to the emitted events. This erases the distinction between plain text content and encoded values, changing the semantics.
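
To illustrate, here is a sketch of the event stream a consumer sees for <outer>&#160;x</outer> once entities have been decoded (the exact chunking of content events may differ):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.XML.Types

-- The &#160; has already become plain text by the time consumers run,
-- indistinguishable from a literal non-breaking space:
events :: [Event]
events =
  [ EventBeginElement "outer" []
  , EventContent (ContentText "\160x")
  , EventEndElement "outer"
  ]
```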

Options:

  • Pass the decoder to the parsing primitives, making them responsible for managing the distinction. This is a rather heavy-weight and error-prone option.
  • Extend the definition of Content from = ContentText Text | ContentEntity Text by adding another constructor, DecodedEntity Text, thus preserving the information erased by decodeEntities (sketched below). The downside is that this obviously involves a change to Data.XML.Types and would be a major backwards-incompatible change.
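
A sketch of the second option (hypothetical: a DecodedEntity constructor does not exist in Data.XML.Types today, and adding it is exactly the breaking change described above):

```haskell
import Data.Text (Text)

-- Hypothetical replacement for Data.XML.Types.Content:
data Content
    = ContentText Text     -- literal character data
    | ContentEntity Text   -- an unresolved entity reference
    | DecodedEntity Text   -- NEW: text produced by entity decoding,
                           -- kept distinct from literal text
```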

Philosophically, I feel the second option is the correct one: the decode phase erases semantics that are vital to the correct interpretation of a document. Practically, the first is the correct choice, since it can be implemented without changing other packages or breaking the API of this or any other module.

Thoughts, comments?

@alexkalderimis (Contributor, Author)

Actually - I realise there is a third (superior) option: deal with this at the token level. Adjust the tokeniser to skip leading and trailing XML white-space inside elements and not emit events for it (which is roughly what it does now, or at least all you can rely on it to do). At that point the above bug is fixed, since leading XML white-space is skipped but encoded entities are always emitted as Content.

Then, as a further step, add a field to ParseSettings such as preserveAllWhiteSpace, which allows all white-space in elements to be emitted as content events. Some applications require white-space to be preserved as-is, and this should be available as a setting.

This will not require breaking the API (even with the new field on ParseSettings, since its only public constructor is def), and can be implemented solely in this package.
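
A sketch of the intended usage (psPreserveAllWhiteSpace is hypothetical, named after the existing ps* fields; because ParseSettings values are only ever built from def plus record update, existing call sites keep compiling when a field is added):

```haskell
import Text.XML.Stream.Parse (ParseSettings, def)

-- Hypothetical: the new field would default to False in def, so only
-- applications that opt in see the extra white-space content events.
settings :: ParseSettings
settings = def { psPreserveAllWhiteSpace = True }  -- hypothetical field
```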

I will prepare a PR for you to review.

@alexkalderimis (Contributor, Author)

PR #76 implements the first stage of this improved tokenisation.

@alexkalderimis (Contributor, Author)

Superseded by #76.
