Whitespace sensitive tokenisation #76
Conversation
Caused by confusion between literal XML white-space and white-space decoded from entities.
This change improves tokenisation by handling whitespace as a semantic feature. Instead of relying on the `tag` combinator to skip leading whitespace, this is done during tokenisation.

A significant change here is that we now need to take context into account when parsing a sequence of text content, since the leading node may skip leading whitespace, whereas subsequent ones need to preserve it. This is handled by changing the type of `parseToken` to return a list of tokens, so that consecutive text nodes can all be handled together, with a different setting of the preserve-leading-whitespace flag on the tail versus the head. The downside of this is that the event positions will be less accurate (it is the aim of future work to improve this).

Another area for further work is the fact that for `<node> xyz </node>` we will strip the leading space but keep the trailing one (yielding "xyz "). This does not break any tests (the aim of this change is simply to introduce this behaviour while preserving API compatibility), but it should in future be handled consistently, yielding "xyz".
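The head-versus-tail handling described above can be sketched as follows. This is an illustrative toy only; the names (`Token`, `tokeniseRun`, `mkText`) are hypothetical and are not xml-conduit's actual API:

```haskell
import Data.Char (isSpace)

data Token = TextTok String | TagTok String
  deriving (Eq, Show)

-- Strip leading white-space only when the preserve flag is off.
mkText :: Bool -> String -> Token
mkText preserve s
  | preserve  = TextTok s
  | otherwise = TextTok (dropWhile isSpace s)

-- Tokenise a run of text fragments: the head may have its leading
-- white-space stripped, but every subsequent fragment must preserve
-- it, since its "leading" white-space is interior to the run.
tokeniseRun :: Bool -> [String] -> [Token]
tokeniseRun _ [] = []
tokeniseRun preserveHead (t:ts) =
  mkText preserveHead t : map (mkText True) ts
```

For example, `tokeniseRun False ["  hello", " world"]` strips the leading spaces of the first fragment only, yielding `[TextTok "hello", TextTok " world"]`.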
```diff
 where
-    input = L.concat
+    input = L.unlines
```
I changed this to `unlines` to a) test handling of new-lines within the document, and b) make the error messages easier to read (since they then have useful line numbers instead of a single unhelpful column number).
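A minimal illustration of the difference (shown here with the `Prelude` versions of `concat` and `unlines`, rather than the lazy-text `L.` variants used in the test):

```haskell
-- The same fragments joined two ways: `concat` produces a single line,
-- so any parse error can only report a column; `unlines` keeps real
-- line breaks, so errors can carry useful line numbers.
fragments :: [String]
fragments = ["<a>", "<b/>", "</a>"]

asOneLine :: String
asOneLine = concat fragments   -- "<a><b/></a>"

asLines :: String
asLines = unlines fragments    -- "<a>\n<b/>\n</a>\n"
```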
I note that this currently breaks
This is required to get the html-conduit tests to pass, and is generally desirable.
This identifies and fixes an issue with whitespace following CDATA.
This applies the same rules as applied to leading whitespace to trailing whitespace.
```haskell
, psPreserveWhiteSpace :: Bool
-- ^ Whether we should preserve literal XML whitespace within nodes.
--
-- Default: False
```
This is a breaking change in behavior without a good reason. Why do we want to no longer preserve whitespace, and instead change the semantics of a document by default?
If we preserve whitespace by default, then using streaming tag combinators becomes more difficult: they will have to consume whitespace, and it is impossible to know whether that whitespace is literal XML whitespace (and thus reasonable to skip) or a decoded XML entity (and thus illegal to skip).
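The distinction can be sketched as follows; `decodeEntity` here is a hypothetical stand-in, not the library's actual decoder:

```haskell
-- Sketch of why early entity decoding loses information: once an
-- entity reference has been decoded into plain text, its result is
-- indistinguishable from a literal space character.
decodeEntity :: String -> Maybe Char
decodeEntity "#x20" = Just ' '
decodeEntity "#xA"  = Just '\n'
decodeEntity _      = Nothing

-- A literal space and a decoded "&#x20;" compare equal, so a later
-- stage that skips white-space cannot tell them apart.
literal, fromEntity :: String
literal    = " "
fromEntity = maybe "" (:[]) (decodeEntity "#x20")
```

Since `literal == fromEntity`, any whitespace-skipping done after decoding must treat both identically, which is exactly the problem described above.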
This is not my favoured route, for this very reason, as mentioned on the issue, but it does limit the scope of changes.
Thanks for taking some time to review the changes.
I'm sorry, but each PR is making this issue more confusing to me. If there's a backtracking issue, we should fix the backtracking. If there's inconsistent whitespace handling, that should be fixed. But this seems like a very opinionated change in default behavior which will break many existing codebases in very bad ways. As an example, consider the following XHTML snippet, which will now be parsed incorrectly: `<p>Hello, my name is <b>Inigo Montoya</b></p>`. The very necessary space before the
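The concern above can be demonstrated with a toy document model (illustrative only; `Node`, `renderText`, and `strip` are hypothetical names, not xml-conduit's types):

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- A toy node type; renderText flattens the text content, ignoring markup.
data Node = Text String | Elem String [Node]

renderText :: Node -> String
renderText (Text s)    = s
renderText (Elem _ ns) = concatMap renderText ns

-- Naively stripping white-space around every text node...
strip :: Node -> Node
strip (Text s)    = Text (dropWhileEnd isSpace (dropWhile isSpace s))
strip (Elem n ns) = Elem n (map strip ns)

-- ...destroys the inter-word space before the <b> element:
para :: Node
para = Elem "p" [Text "Hello, my name is ", Elem "b" [Text "Inigo Montoya"]]
```

Here `renderText para` gives `"Hello, my name is Inigo Montoya"`, while `renderText (strip para)` gives `"Hello, my name isInigo Montoya"`, with the significant space lost.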
Yes, that is a breaking change. I am indeed happy to do the reverse by default, but that would then leave the parser in the present situation, where it silently consumes whitespace when attempting to parse a tag. I note that all your existing tests continue to pass. (To clarify: I was incorrect to initially suggest this was an issue with back-tracking; it is entirely an issue with tokenisation and the semantics of the decoded tokens, i.e. whitespace handling.)
I've had a bit of a think about the approach here, and I'd really like your input on the best way to proceed. This approach was taken to see how far I could get working solely within the API boundaries of this one module. I concede it has knock-on effects and involves a breaking set of changes, which is far from ideal. This is, as you are rightly feeling, a significant change in behaviour to address a small issue (the consumption of non-literal white-space) - very much a sledgehammer to crack a nut, really. I still hold that the correct way to address this problem is by changing the representation of tokens.

Alternatively, you can re-open and re-assess the smaller PR (#74), which contained a very small and non-API-breaking change that addressed the majority of the issues (if not all of them).
And I'll just leave this here to help you understand the issue: I added diagnostic tests to a clean branch based on master; 4 of these tests fail (at lines 170, 172, 174 and 176).
I'm going to tell you the same concern again: you've raised an error

On Wed, Feb 10, 2016 at 10:03 PM, Alex Kalderimis notifications@github.com
Thanks again for your time in looking at this. I feel I should just lay out my reasoning as to why this is not a back-tracking issue, but instead a tokenisation one. The current back-tracking logic for when we fail to match a tag cannot be substantially improved. It can be found here: https://github.com/snoyberg/xml/blob/master/xml-conduit/Text/XML/Stream/Parse.hs#L674. Its logic can be summarised as:
This works fine as long as there is only one whitespace token in the stream; if there are multiple whitespace tokens, all but the last are dropped (as demonstrated by alexkalderimis/xml@8bac1ee). It is possible to have multiple adjacent whitespace tokens in the stream because once an entity has been parsed (which happens prior to tokenisation) it is indistinguishable from ordinary text, even though it is semantically distinct. This is why this is not a back-tracking issue.

I acknowledge that this first attempt has significant drawbacks, and I am perfectly happy to find an alternative solution. I feel I understand the issue well and have attempted to explain it, in code and in comments. If you feel I have not succeeded in that attempt, I am sorry - feel free to close this PR in any case. If you do want me to work towards an acceptable solution I would be more than happy to contribute.
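The failure mode described here can be sketched as follows. This is a hypothetical simplification for illustration, not the actual code in `Parse.hs`:

```haskell
-- A token stream containing white-space/text tokens and opening tags.
data Tok = WS String | Open String
  deriving (Eq, Show)

-- On a failed tag match, only the most recently skipped token is pushed
-- back, so when the stream contains several adjacent white-space tokens
-- (e.g. a literal space followed by a decoded "&#x20;"), all but the
-- last are silently lost.
tryTag :: String -> [Tok] -> Either [Tok] [Tok]
-- Right rest   on a successful match
-- Left  stream on failure, after pushing back only the last skipped token
tryTag name = go Nothing
  where
    go _    (WS w : rest) = go (Just (WS w)) rest
    go kept toks@(Open n : rest)
      | n == name = Right rest
      | otherwise = Left (maybe toks (: toks) kept)
    go kept toks  = Left (maybe toks (: toks) kept)
```

With the stream `[WS " ", WS " ", Open "b"]`, a failed `tryTag "a"` restores only `[WS " ", Open "b"]`: one whitespace token has vanished, which is the data loss described above.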
This is where I'm confused, as it seems like the direct solution. What's the reason this isn't an option? Why do we care if it's not possible to distinguish between a decoded entity and a literal space character?
For the simple reason that
This is not correct according to the XML specification, the two are

On Sun, Feb 14, 2016, 11:22 PM Alex Kalderimis notifications@github.com
In that case you might want to close the ticket. I would still suggest you

On 15 February 2016 at 04:59, Michael Snoyman notifications@github.com
This is the PR referred to in #75. It improves tokenisation to handle leading XML white-space consistently. The benefit of this is that entities (such as `&#x20;`) will never be handled at the tokenisation level as whitespace. Thus it is now possible to have content such as `<a>&#x20;<b/></a>`, which will now cause a parser that does not expect text content to fail - whereas previously it would have succeeded, erroneously consuming the encoded space.
This has been updated to include configurable white-space preservation and more spec compliant tag-name parsing.