
Whitespace sensitive tokenisation #76

@alexkalderimis
Contributor

alexkalderimis commented Feb 6, 2016

This is the PR referred to in #75. It improves tokenisation to handle leading XML whitespace consistently. The benefit of this is that entities (such as &#x20;) will never be treated as whitespace at the tokenisation level. Thus it is now possible to have content <a>&#x20;<b/></a> which will cause the following parser to fail:

bInsideA = tagNoAttr "a" (tagNoAttr "b" (return ()) >> return "Found")

whereas previously it would have succeeded, erroneously consuming the encoded space.

This has been updated to include configurable whitespace preservation and more spec-compliant tag-name parsing.
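To make the entity-versus-whitespace distinction concrete, here is a minimal, self-contained sketch (a hypothetical helper, not xml-conduit's actual code) showing that once a character reference such as &#x20; has been decoded, the resulting Char is indistinguishable from a literal space:

```haskell
import Data.Char (chr)
import Numeric (readHex)

-- Hypothetical helper: decode a hexadecimal character reference
-- such as "&#x20;" into the character it denotes.
decodeHexRef :: String -> Maybe Char
decodeHexRef ('&':'#':'x':rest)
  | not (null rest), last rest == ';'
  , [(n, "")] <- readHex (init rest) = Just (chr n)
decodeHexRef _ = Nothing

main :: IO ()
main = do
  -- The decoded entity is exactly the literal space character, so a
  -- later whitespace-skipping pass cannot tell them apart.
  print (decodeHexRef "&#x20;" == Just ' ')  -- True
  print (decodeHexRef "&#x41;")              -- Just 'A'
```

This is why the PR moves the decision into the tokeniser: after decoding, the information that a space was written as an entity is gone.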

@alexkalderimis

Owner Author

alexkalderimis commented on xml-conduit/test/main.hs in 4cb2e95 Feb 3, 2016

This fails, which indicates that this is probably a back-tracking issue.

alexkalderimis added some commits Feb 3, 2016

Fixed back-tracking issue.
Caused by confusion between XML white-space and general (isSpace) white-space.
Consider whitespace when tokenising
This change improves tokenisation by handling whitespace as a semantic
feature. Instead of relying on the `tag` combinator to skip leading
whitespace, this is done during tokenisation.

A significant change here is that we need to now take context into
account when parsing a sequence of text content, since the leading node
may skip leading whitespace, whereas subsequent ones need to preserve
it.  This is handled by changing the type of parseToken to return a list
of tokens, so that consecutive text nodes can be all handled together,
with a different setting of the preserve-leading-whitespace flag on the
tail versus the head. The downside of this is that the event positions
will be less accurate (it is the aim of future work to improve this).

Another area for further work is the fact that for `<node> xyz </node>`
we will strip the leading space but keep the trailing one (yielding
"xyz "). This does not break any tests (the aim here is simply to
introduce the new behaviour while preserving API compatibility), but it
should in future be handled consistently, yielding "xyz".
  where
-   input = L.concat
+   input = L.unlines

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

I changed this to unlines to (a) test the handling of new-lines within the document, and (b) make the error messages easier to read (since they have useful line numbers instead of a single unhelpful column number).

isXMLSpace :: Char -> Bool
isXMLSpace '\x20' = True
isXMLSpace '\x09' = True
isXMLSpace '\x0D' = True
isXMLSpace '\x0A' = True
isXMLSpace _      = False

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

I changed these to hex literals so that it exactly matches the spec laid out above.

  newline :: Parser ()
- newline = ((char '\r' >> char '\n') <|> char '\n') >> return ()
+ newline = endOfLine <|> (void $ char '\r')

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

endOfLine already matches \n and \r\n.
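As a sketch of the equivalence relied on here (a hypothetical recogniser over String, not the attoparsec combinators): the new definition accepts "\n" or "\r\n" via endOfLine, plus a bare "\r" via the second branch.

```haskell
-- Hypothetical recogniser mirroring endOfLine <|> void (char '\r'):
-- returns the remaining input on success, Nothing on failure.
-- "\r\n" must be tried before bare "\r".
newline :: String -> Maybe String
newline ('\r':'\n':rest) = Just rest
newline ('\n':rest)      = Just rest
newline ('\r':rest)      = Just rest
newline _                = Nothing

main :: IO ()
main = do
  print (newline "\r\nrest")  -- Just "rest"
  print (newline "\nrest")    -- Just "rest"
  print (newline "\rrest")    -- Just "rest"
  print (newline "rest")      -- Nothing
```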

  parseEntity <|> parseText'
  where
    parseEntity = do
      char' '&'
-     t <- takeWhile1 (/= ';')
+     t <- takeWhile1 (/= ';') <?> "not a semicolon"

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

takeWhile produces terrible error messages ("failed satisfy"); naming the parsers makes them more meaningful.

@@ -474,7 +484,7 @@ parseToken de = (char '<' >> parseLt) <|> TokenContent <$> parseContent de False
(parseEntity >>= \e -> parseEntities (front . (e:))) <|>
(char '<' >> parseEntities front) <|>
(skipWhile (\t -> t /= ']' && t /= '<') >> parseEntities front)
parseEntity = try $ do

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

try is a no-op in attoparsec, so it can safely be removed.

where
trailingWS = takeWhile1 isXMLSpace >> return []

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

We need this final catch-all to ensure that we can parse a full document.

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 6, 2016

I note that this currently breaks html-conduit; looking at this now. (It would be really awesome if you could enable Travis builds for PRs.)

Add psPreserveWhiteSpace
This is required to get the html-conduit tests to pass, and is generally
desirable.
@alexkalderimis

Owner Author

alexkalderimis commented on html-conduit/test/main.hs in 125d1ea Feb 6, 2016

shouldBe is nicer, since it is much clearer which is the expectation and which is the result.

alexkalderimis added some commits Feb 7, 2016

Handle whitespace after CDATA
This identifies and fixes an issue with whitespace following CDATA.
Handle trailing whitespace as per leading whitespace
This applies the same rules as applied to leading whitespace to trailing
whitespace.
, psPreserveWhiteSpace :: Bool
-- ^ Whether we should preserve literal XML whitespace within nodes.
--
-- Default: False

@snoyberg

snoyberg Feb 10, 2016

Owner

This is a breaking change in behavior without a good reason. Why do we want to no longer preserve whitespace, and instead change the semantics of a document by default?

@alexkalderimis

alexkalderimis Feb 10, 2016

Author Contributor

If we preserve whitespace by default, then using the streaming tag combinators becomes more difficult, since they will have to consume whitespace, and it is impossible to know whether it is literal XML whitespace (and thus reasonable to skip) or a decoded XML entity (and thus illegal to skip).

This is not my favoured route, for this very reason, as mentioned on the issue, but it does limit the scope of changes.

Thanks for taking some time to review the changes.

@snoyberg

Owner

snoyberg commented Feb 10, 2016

I'm sorry, but each PR is making this issue more confusing to me. If there's a backtracking issue, we should fix the backtracking. If there's inconsistent whitespace handling, that should be fixed. But this seems like a very opinionated change in default behavior which will break many existing codebases in very bad ways. As an example, consider the following XHTML snippet, which will now be parsed incorrectly:

<p>Hello, my name is <b>Inigo Montoya</b></p>

The very necessary space before the <b> would be stripped, which is not at all what is desired.
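The concern can be sketched with plain string functions (illustrative only; isXMLSpace here is a standalone re-definition, not xml-conduit's): stripping XML whitespace from the edge of the text node adjacent to <b> deletes a space the document needs.

```haskell
import Data.List (dropWhileEnd)

-- XML's definition of whitespace: space, tab, CR, LF.
isXMLSpace :: Char -> Bool
isXMLSpace c = c `elem` " \t\r\n"

main :: IO ()
main = do
  let before = "Hello, my name is "  -- text node preceding <b>
      inner  = "Inigo Montoya"       -- content of <b>
  -- If trailing XML whitespace were stripped from the first text
  -- node, the rendered text would lose the necessary space:
  putStrLn (dropWhileEnd isXMLSpace before ++ inner)
  -- prints "Hello, my name isInigo Montoya"
```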

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 10, 2016

Yes, that is a breaking change. I am indeed happy to do the reverse by default, but that would then leave the parser in the present situation where it silently consumes whitespace when attempting to parse a tag.

I note that all your existing tests continue to pass.

(To clarify, I was incorrect to initially suggest this was an issue with back-tracking; it is entirely an issue with tokenisation and the semantics of the decoded tokens, i.e. whitespace handling.)

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 10, 2016

I've had a bit of a think about the approach here, and really I'd like your input on the best way to proceed. This approach was taken to see how far I could get working solely within the API boundaries of this one module. I concede it has knock-on effects and involves a breaking set of changes, which is far from ideal. This is, as you are rightly feeling, a significant change in behaviour to address a small issue (the consumption of non-literal whitespace). Very much a sledgehammer to crack a nut, really.

I still hold that the correct way to address this problem is by addressing the representation of tokens within Data.XML.Types, or failing that within Text.XML.Stream.Token. I am prepared to see what the approach would look like to add a new token constructor, since this is not a public API, unlike the Content type from Data.XML.Types. Please let me know if you think that would be a more suitable approach to take and I can prepare an alternate pull request.

Alternatively, you could re-open and re-assess the smaller PR (#74), which contained a very small, non-API-breaking change that addressed the majority of the issues (if not all of them).

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 10, 2016

And I'll just leave this here to help you understand the issue: I added diagnostic tests to a clean branch based on master; 4 of these tests fail (at lines 170, 172, 174 and 176).

alexkalderimis@8bac1ee

@snoyberg

Owner

snoyberg commented Feb 11, 2016

I'm going to tell you the same concern again: you've raised an error indicating that there's a backtracking bug, and have submitted (a lot of) changed code that doesn't touch any backtracking logic. Each time I ask about this, you answer with more whitespace discussion. I do not believe you're looking at the real bug, or all of the comments about backtracking were incorrect. Either way, I do not feel comfortable with any of the proposed steps for moving forward.


@alexkalderimis

Contributor Author

alexkalderimis commented Feb 11, 2016

Thanks again for your time in looking at this. I feel I should just lay out my reasoning why this is not a back-tracking issue, and instead a tokenisation one.

The current back-tracking logic for when we fail to match a tag cannot be substantially improved. It can be found here: https://github.com/snoyberg/xml/blob/master/xml-conduit/Text/XML/Stream/Parse.hs#L674. Its logic can be summarised as:

while next token is entirely whitespace:
  drop next token

This works fine, as long as there is only one whitespace token in the stream. If there are multiple whitespace tokens, all but the last are dropped (as demonstrated by alexkalderimis/xml@8bac1ee). It is possible to have multiple adjacent tokens in the stream because once an entity has been parsed (which happens prior to tokenisation) it is indistinguishable from ordinary text, even though it is semantically distinct.

This means that <tag>   </tag> (three literal spaces) has one content token inside it (a single Text of length 3, namely "\x20\x20\x20"), while <tag>&#x20; &#x20;</tag> has 3 whitespace tokens, each a single space character ("\x20"). The distinction between entities and plain content has been erased.

This is why this is not a back-tracking issue. Even if dropWS was amended to either drop all consecutive white-space tokens or none it would still be incorrect since it would be impossible to know if it is skipping a decoded entity or a literal space character.
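The argument above can be illustrated with a toy token type (hypothetical, for illustration only; this is not xml-conduit's Token): after entity decoding, the two documents produce different streams, and a dropWS that skips whitespace tokens cannot tell decoded entities from literal spaces.

```haskell
-- Toy model of the token stream (hypothetical).
data Token = Text String deriving (Eq, Show)

-- A token is droppable if it is entirely XML whitespace.
isAllWS :: Token -> Bool
isAllWS (Text s) = all (`elem` " \t\r\n") s

-- Summary of the back-tracking logic: drop tokens while they are
-- entirely whitespace.
dropWS :: [Token] -> [Token]
dropWS = dropWhile isAllWS

literal, withEntities :: [Token]
literal      = [Text "   "]                   -- <tag>   </tag>
withEntities = [Text " ", Text " ", Text " "] -- <tag>&#x20; &#x20;</tag>

main :: IO ()
main = do
  print (dropWS literal)       -- []
  print (dropWS withEntities)  -- [] -- decoded entities skipped too
```

Both streams are emptied, even though the second contains decoded entities that are semantically significant content: the information needed to skip only the literal spaces no longer exists at this stage.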

I acknowledge that this first attempt has significant drawbacks, and indeed I am perfectly happy to find an alternative solution. I feel I understand the issue well and have attempted to explain it, in code and in comments. If you feel I have not succeeded in that attempt I am sorry - feel free to close this PR in any case. If you do want me to work towards an acceptable solution I would be more than happy to contribute.

@snoyberg

Owner

snoyberg commented Feb 14, 2016

Even if dropWS was amended to either drop all consecutive white-space tokens or none it would still be incorrect since it would be impossible to know if it is skipping a decoded entity or a literal space character.

This is where I'm confused, as it seems like the direct solution. What's the reason this isn't an option? Why do we care if it's not possible to distinguish between a decoded entity and a literal space character?

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 14, 2016

For the simple reason that <outer>&#x20;<inner/></outer> is clearly not the same as <outer> <inner/></outer>: we encode characters when we want them treated as significant content, not as indentation.

@snoyberg

Owner

snoyberg commented Feb 15, 2016

This is not correct according to the XML specification; the two are semantically identical. It seems like the problem is that you're trying to get XML to do something it's not designed to do.


@alexkalderimis

Contributor Author

alexkalderimis commented Feb 15, 2016

In that case you might want to close the ticket. I would still suggest you merge #75, though, which at least corrects the scanning over whitespace to only scan past XML space, and not the wider definition of space that isSpace tests for.


snoyberg added a commit that referenced this pull request Feb 15, 2016

@snoyberg snoyberg closed this Feb 15, 2016
