
Whitespace sensitive tokenisation #76

Conversation

@alexkalderimis (Contributor)

This is the PR referred to in #75. It improves tokenisation to handle leading XML white-space consistently. The benefit of this is that entities (such as &#x20;) will never be treated as whitespace at the tokenisation level. Thus it is now possible to have content <a>&#x20;<b/></a>, which will cause the following parser to fail:

bInsideA = tagNoAttr "a" (tagNoAttr "b" (return "Found"))

Previously it would have succeeded, erroneously consuming the encoded space.

This has been updated to include configurable white-space preservation and more spec-compliant tag-name parsing.

Caused by confusion between XML white-space and the wider definition of white-space that isSpace accepts.
This change improves tokenisation by handling whitespace as a semantic
feature. Instead of relying on the `tag` combinator to skip leading
whitespace, this is done during tokenisation.

A significant change here is that we now need to take context into
account when parsing a sequence of text content, since the leading node
may skip leading whitespace, whereas subsequent ones need to preserve
it. This is handled by changing the type of parseToken to return a list
of tokens, so that consecutive text nodes can all be handled together,
with a different setting of the preserve-leading-whitespace flag on the
tail versus the head. The downside of this is that the event positions
will be less accurate (improving this is the aim of future work).

Another area for further work is the fact that for `<node> xyz </node>`
we will strip the leading space but keep the trailing one (yielding
"xyz "). This does not break any tests (the aim of this change is simply to
introduce this change while preserving API compatibility) but should in
the future be handled consistently, yielding "xyz".
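The asymmetry just described can be sketched as a toy model in plain Haskell (the names `xmlSpace` and `stripLeadingWS` are illustrative, not definitions from the patch):

```haskell
-- XML's S production: space, tab, carriage return, newline.
xmlSpace :: Char -> Bool
xmlSpace c = c `elem` " \t\r\n"

-- Toy model of the behaviour described above: leading XML whitespace
-- in a text node is stripped at tokenisation time, trailing whitespace
-- is kept, so the content of "<node> xyz </node>" comes out as "xyz ".
stripLeadingWS :: String -> String
stripLeadingWS = dropWhile xmlSpace
```

A consistent treatment, as noted above, would strip both ends and yield "xyz".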
 where
-  input = L.concat
+  input = L.unlines
@alexkalderimis (Contributor, Author)

I changed this to unlines (a) to test handling of new-lines within the document, and (b) so that the error messages are easier to read (since they have useful line numbers instead of a single unhelpful column number).

@alexkalderimis (Contributor, Author)

I note that this currently breaks html-conduit: looking at this now. (It would be really awesome if you could enable Travis builds for PRs.)

This is required to get html-conduit tests to pass, and is generally
desirable.
This identifies and fixes an issue with whitespace following CDATA.
This applies the same rules as applied to leading whitespace to trailing
whitespace.
, psPreserveWhiteSpace :: Bool
-- ^ Whether we should preserve literal XML whitespace within nodes.
--
-- Default: False
@snoyberg (Owner)

This is a breaking change in behavior without a good reason. Why do we want to no longer preserve whitespace, and instead change the semantics of a document by default?

@alexkalderimis (Contributor, Author)

If we preserve whitespace by default, then using streaming tag combinators becomes more difficult, since they will have to consume whitespace, and it is impossible to know whether it is literal XML whitespace (and thus reasonable to skip) or a decoded XML entity (and thus illegal to skip).

This is not my favoured route, for this very reason, as mentioned on the issue, but it does limit the scope of changes.

Thanks for taking some time to review the changes.

@snoyberg (Owner)

I'm sorry, but each PR is making this issue more confused to me. If there's a backtracking issue, we should fix the backtracking. If there's inconsistent whitespace handling, that should be fixed. But this seems like a very opinionated change in default behavior which will break many existing codebases in very bad ways. As an example, consider the following XHTML snippet, which will now be parsed incorrectly:

<p>Hello, my name is <b>Inigo Montoya</b></p>

The very necessary space before the <b> would be stripped, which is not at all what is desired.

@alexkalderimis (Contributor, Author)

Yes, that is a breaking change. I am indeed happy to do the reverse by default, but that would then leave the parser in the present situation where it silently consumes whitespace when attempting to parse a tag.

I note that all your existing tests continue to pass.

(To clarify, I was incorrect to initially suggest this was an issue with back-tracking; it is entirely an issue with tokenisation and the semantics of the decoded tokens, i.e. whitespace handling.)

@alexkalderimis (Contributor, Author)

I've had a bit of a think about the approach here, and really I'd like your input on the best way to proceed. This approach was taken to see how far I could get working solely within the API boundaries of this one module. I concede it has knock-on effects and involves a breaking set of changes, which is far from ideal. This is, as you are rightly feeling, a significant change in behaviour to address a small issue (the consumption of non-literal white-space): very much a sledge-hammer to crack a nut.

I still hold that the correct way to address this problem is by addressing the representation of tokens within Data.XML.Types, or failing that within Text.XML.Stream.Token. I am prepared to see what the approach would look like to add a new token constructor, since this is not a public API, unlike the Content type from Data.XML.Types. Please let me know if you think that would be a more suitable approach to take and I can prepare an alternate pull request.

Alternatively, you could reopen and re-assess the smaller PR (#74), which contained a very small and non-API-breaking change that addressed the majority of issues (if not all of them).

@alexkalderimis (Contributor, Author)

And I'll just leave this here to help you understand the issue: I added diagnostic tests to a clean branch based on master; 4 of these tests fail (at lines 170, 172, 174 and 176).

alexkalderimis@8bac1ee

@snoyberg (Owner)

I'm going to tell you the same concern again: you've raised an error indicating that there's a backtracking bug, and have submitted (a lot of) changed code that doesn't touch any backtracking logic. Each time I ask about this, you answer with more whitespace discussion. I do not believe you're looking at the real bug, or all of the comments about backtracking were incorrect. Either way, I do not feel comfortable with any of the proposed steps for moving forward.


@alexkalderimis (Contributor, Author)

Thanks again for your time in looking at this. I feel I should just lay out my reasoning why this is not a back-tracking issue, and instead a tokenisation one.

The current back-tracking logic for when we fail to match a tag cannot be substantially improved. It can be found here: https://github.com/snoyberg/xml/blob/master/xml-conduit/Text/XML/Stream/Parse.hs#L674. Its logic can be summarised as:

while next token is entirely whitespace:
  drop next token

This works fine as long as there is only one whitespace token in the stream. If there are multiple whitespace tokens, all but the last are dropped (as demonstrated by alexkalderimis/xml@8bac1ee). It is possible to have multiple adjacent tokens in the stream because once an entity has been parsed (which happens prior to tokenisation) it is indistinguishable from ordinary text, even though it is semantically distinct.

This means that <tag> </tag> has one content token inside it (a single Text of length 3, namely "\x20\x20\x20"), while <tag>&#x20; &#x20;</tag> has 3 whitespace tokens, all of which are the same single space character ("\x20"). The distinction between entities and plain content has been erased.
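The erasure can be seen with a toy decoder in plain Haskell (illustrative only; the real tokeniser lives in Text.XML.Stream.Token and handles far more than hexadecimal character references):

```haskell
import Data.Char (chr)

-- Decode hexadecimal character references such as "&#x20;" into their
-- literal characters, leaving everything else untouched.
decode :: String -> String
decode ('&':'#':'x':rest) =
  case break (== ';') rest of
    (ds, ';':rest') -> chr (read ("0x" ++ ds)) : decode rest'
    _               -> '&' : '#' : 'x' : decode rest
decode (c:cs) = c : decode cs
decode []     = []

-- decode "&#x20; &#x20;" and decode "   " both yield "   ": after
-- decoding, nothing marks which spaces came from entity references.
```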

This is why this is not a back-tracking issue. Even if dropWS was amended to either drop all consecutive white-space tokens or none it would still be incorrect since it would be impossible to know if it is skipping a decoded entity or a literal space character.
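A list-based sketch makes this concrete (a toy Token type and a list instead of a conduit stream; dropAllWS is the hypothetical "drop all consecutive whitespace tokens" variant, not code from this library):

```haskell
import Data.Char (isSpace)

-- Illustrative token type; the real one lives in Text.XML.Stream.Token.
newtype Token = TokenText String deriving (Eq, Show)

-- Hypothetical variant that drops every leading all-whitespace token.
dropAllWS :: [Token] -> [Token]
dropAllWS (TokenText t : rest) | all isSpace t = dropAllWS rest
dropAllWS toks = toks

-- The content of "<tag>&#x20; &#x20;</tag>" arrives as three
-- single-space text tokens; dropAllWS removes all of them, with no way
-- to spare the two that were written as entity references.
```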

I acknowledge that this first attempt has significant drawbacks, and indeed I am perfectly happy to find an alternative solution. I feel I understand the issue well and have attempted to explain it, in code and in comments. If you feel I have not succeeded in that attempt, I am sorry; feel free to close this PR in any case. If you do want me to work towards an acceptable solution, I would be more than happy to contribute.

@snoyberg (Owner)

Even if dropWS was amended to either drop all consecutive white-space tokens or none it would still be incorrect since it would be impossible to know if it is skipping a decoded entity or a literal space character.

This is where I'm confused, as it seems like the direct solution. What's the reason this isn't an option? Why do we care if it's not possible to distinguish between a decoded entity and a literal space character?

@alexkalderimis (Contributor, Author)

For the simple reason that <outer>&#x20;<inner/></outer> is clearly not the same as <outer> <inner/></outer>: we encode characters when we want them treated as significant content, not as indentation.

@snoyberg (Owner)

This is not correct according to the XML specification; the two are identical semantically. It seems like the problem is that you're trying to get XML to do something it's not designed to do.


@alexkalderimis (Contributor, Author)

In that case you might want to close the ticket. I would still suggest you merge #75 though, which at least corrects the scanning over white-space to only scan past XML space, and not the wider definition of space that isSpace tests for.

