
Whitespace sensitive tokenisation #76

@alexkalderimis
Contributor

alexkalderimis commented Feb 6, 2016

This is the PR referred to in #75. It improves tokenisation to handle leading XML whitespace consistently. The benefit of this is that entities (such as &#x20;) will never be treated as whitespace at the tokenisation level. Thus it is now possible to have content <a>&#x20;<b/></a> which will cause the following parser to fail:

bInsideA = tagNoAttr "a" (tagNoAttr "b" (return ()) >> return "Found")

whereas previously it would have succeeded, erroneously consuming the encoded space.

This has been updated to include configurable whitespace preservation and more spec-compliant tag-name parsing.
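To make the entity-versus-whitespace distinction concrete, here is a minimal, self-contained sketch (a hypothetical helper, not xml-conduit's actual code) showing that once a character reference such as &#x20; has been decoded, the resulting Char is indistinguishable from a literal space:

```haskell
import Data.Char (chr)
import Numeric (readHex)

-- Hypothetical helper: decode a hexadecimal character reference
-- such as "&#x20;" into the character it denotes.
decodeHexRef :: String -> Maybe Char
decodeHexRef ('&':'#':'x':rest)
  | not (null rest), last rest == ';'
  , [(n, "")] <- readHex (init rest) = Just (chr n)
decodeHexRef _ = Nothing

main :: IO ()
main = do
  -- The decoded entity is exactly the literal space character, so a
  -- later whitespace-skipping pass cannot tell them apart.
  print (decodeHexRef "&#x20;" == Just ' ')  -- True
  print (decodeHexRef "&#x41;")              -- Just 'A'
```

This is why the PR moves the decision into the tokeniser: after decoding, the information that a space was written as an entity is gone.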

@alexkalderimis

Owner Author

alexkalderimis commented on xml-conduit/test/main.hs in 4cb2e95 Feb 3, 2016

This fails, which indicates that this is probably a back-tracking issue.

alexkalderimis added some commits Feb 3, 2016

Fixed back-tracking issue.
Caused by confusion between XML white-space and general (isSpace) white-space.
Consider whitespace when tokenising
This change improves tokenisation by handling whitespace as a semantic
feature. Instead of relying on the `tag` combinator to skip leading
whitespace, this is done during tokenisation.

A significant change here is that we need to now take context into
account when parsing a sequence of text content, since the leading node
may skip leading whitespace, whereas subsequent ones need to preserve
it.  This is handled by changing the type of parseToken to return a list
of tokens, so that consecutive text nodes can be all handled together,
with a different setting of the preserve-leading-whitespace flag on the
tail versus the head. The downside of this is that the event positions
will be less accurate (it is the aim of future work to improve this).

Another area for further work is the fact that for `<node> xyz </node>`
we will strip the leading space but keep the trailing one (yielding
"xyz "). This does not break any tests (the aim here is simply to
introduce the new behaviour while preserving API compatibility), but it
should in future be handled consistently, yielding "xyz".
  where
-   input = L.concat
+   input = L.unlines

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

I changed this to unlines to (a) test the handling of new-lines within the document, and (b) make the error messages easier to read (since they have useful line numbers instead of a single unhelpful column number).

isXMLSpace :: Char -> Bool
isXMLSpace '\x20' = True
isXMLSpace '\x09' = True
isXMLSpace '\x0D' = True
isXMLSpace '\x0A' = True
isXMLSpace _      = False

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

I changed these to hex literals so that it exactly matches the spec laid out above.

  newline :: Parser ()
- newline = ((char '\r' >> char '\n') <|> char '\n') >> return ()
+ newline = endOfLine <|> (void $ char '\r')

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

endOfLine already matches \n and \r\n.
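As a sketch of the equivalence relied on here (a hypothetical recogniser over String, not the attoparsec combinators): the new definition accepts "\n" or "\r\n" via endOfLine, plus a bare "\r" via the second branch.

```haskell
-- Hypothetical recogniser mirroring endOfLine <|> void (char '\r'):
-- returns the remaining input on success, Nothing on failure.
-- "\r\n" must be tried before bare "\r".
newline :: String -> Maybe String
newline ('\r':'\n':rest) = Just rest
newline ('\n':rest)      = Just rest
newline ('\r':rest)      = Just rest
newline _                = Nothing

main :: IO ()
main = do
  print (newline "\r\nrest")  -- Just "rest"
  print (newline "\nrest")    -- Just "rest"
  print (newline "\rrest")    -- Just "rest"
  print (newline "rest")      -- Nothing
```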

  parseEntity <|> parseText'
  where
    parseEntity = do
      char' '&'
-     t <- takeWhile1 (/= ';')
+     t <- takeWhile1 (/= ';') <?> "not a semicolon"

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

takeWhile produces terrible error messages ("failed satisfy"); naming the parsers makes them more meaningful.

@@ -474,7 +484,7 @@ parseToken de = (char '<' >> parseLt) <|> TokenContent <$> parseContent de False
(parseEntity >>= \e -> parseEntities (front . (e:))) <|>
(char '<' >> parseEntities front) <|>
(skipWhile (\t -> t /= ']' && t /= '<') >> parseEntities front)
parseEntity = try $ do

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

try is a no-op in attoparsec, so it can safely be removed.

where
trailingWS = takeWhile1 isXMLSpace >> return []

@alexkalderimis

alexkalderimis Feb 6, 2016

Author Contributor

We need this final catch-all to ensure that we can parse a full document.

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 6, 2016

I note that this currently breaks html-conduit; looking at this now. (It would be really awesome if you could enable Travis builds for PRs.)

Add psPreserveWhiteSpace
This is required to get the html-conduit tests to pass, and is generally
desirable.
@alexkalderimis

Owner Author

alexkalderimis commented on html-conduit/test/main.hs in 125d1ea Feb 6, 2016

shouldBe is nicer, since it is much clearer which is the expectation and which is the result.

alexkalderimis added some commits Feb 7, 2016

Handle whitespace after CDATA
This identifies and fixes an issue with whitespace following CDATA.
Handle trailing whitespace as per leading whitespace
This applies the same rules as applied to leading whitespace to trailing
whitespace.
, psPreserveWhiteSpace :: Bool
-- ^ Whether we should preserve literal XML whitespace within nodes.
--
-- Default: False

@snoyberg

snoyberg Feb 10, 2016

Owner

This is a breaking change in behavior without a good reason. Why do we want to no longer preserve whitespace, and instead change the semantics of a document by default?

@alexkalderimis

alexkalderimis Feb 10, 2016

Author Contributor

If we preserve whitespace by default, then using the streaming tag combinators becomes more difficult, since they will have to consume whitespace, and it is impossible to know whether it is literal XML whitespace (and thus reasonable to skip) or a decoded XML entity (and thus illegal to skip).

This is not my favoured route, for this very reason, as mentioned on the issue, but it does limit the scope of changes.

Thanks for taking some time to review the changes.

@snoyberg

Owner

snoyberg commented Feb 10, 2016

I'm sorry, but each PR is making this issue more confusing to me. If there's a backtracking issue, we should fix the backtracking. If there's inconsistent whitespace handling, that should be fixed. But this seems like a very opinionated change in default behavior which will break many existing codebases in very bad ways. As an example, consider the following XHTML snippet, which will now be parsed incorrectly:

<p>Hello, my name is <b>Inigo Montoya</b></p>

The very necessary space before the <b> would be stripped, which is not at all what is desired.
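The concern can be sketched with plain string functions (illustrative only; isXMLSpace here is a standalone re-definition, not xml-conduit's): stripping XML whitespace from the edge of the text node adjacent to <b> deletes a space the document needs.

```haskell
import Data.List (dropWhileEnd)

-- XML's definition of whitespace: space, tab, CR, LF.
isXMLSpace :: Char -> Bool
isXMLSpace c = c `elem` " \t\r\n"

main :: IO ()
main = do
  let before = "Hello, my name is "  -- text node preceding <b>
      inner  = "Inigo Montoya"       -- content of <b>
  -- If trailing XML whitespace were stripped from the first text
  -- node, the rendered text would lose the necessary space:
  putStrLn (dropWhileEnd isXMLSpace before ++ inner)
  -- prints "Hello, my name isInigo Montoya"
```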

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 10, 2016

Yes, that is a breaking change. I am indeed happy to do the reverse by default, but that would then leave the parser in the present situation where it silently consumes whitespace when attempting to parse a tag.

I note that all your existing tests continue to pass.

(To clarify, I was incorrect to initially suggest this was an issue with back-tracking; it is entirely an issue with tokenisation and the semantics of the decoded tokens, i.e. whitespace handling.)

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 10, 2016

I've had a bit of a think about the approach here, and really I'd like your input on the best way to proceed. This approach was taken to see how far I could get working solely within the API boundaries of this one module. I concede it has knock-on effects and involves a breaking set of changes, which is far from ideal. This is, as you are rightly feeling, a significant change in behaviour to address a small issue (the consumption of non-literal whitespace). Very much a sledgehammer to crack a nut, really.

I still hold that the correct way to address this problem is by addressing the representation of tokens within Data.XML.Types, or failing that within Text.XML.Stream.Token. I am prepared to see what the approach would look like to add a new token constructor, since this is not a public API, unlike the Content type from Data.XML.Types. Please let me know if you think that would be a more suitable approach to take and I can prepare an alternate pull request.

Alternatively, you could re-open and re-assess the smaller PR (#74), which contained a very small, non-API-breaking change that addressed the majority of the issues (if not all of them).

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 10, 2016

And I'll just leave this here to help you understand the issue: I added diagnostic tests to a clean branch based on master; 4 of these tests fail (at lines 170, 172, 174 and 176).

alexkalderimis@8bac1ee

@snoyberg

Owner

snoyberg commented Feb 11, 2016

I'm going to tell you the same concern again: you've raised an error indicating that there's a backtracking bug, and have submitted (a lot of) changed code that doesn't touch any backtracking logic. Each time I ask about this, you answer with more whitespace discussion. I do not believe you're looking at the real bug, or all of the comments about backtracking were incorrect. Either way, I do not feel comfortable with any of the proposed steps for moving forward.


@alexkalderimis

Contributor Author

alexkalderimis commented Feb 11, 2016

Thanks again for your time in looking at this. I feel I should just lay out my reasoning why this is not a back-tracking issue, and instead a tokenisation one.

The current back-tracking logic for when we fail to match a tag cannot be substantially improved. It can be found here: https://github.com/snoyberg/xml/blob/master/xml-conduit/Text/XML/Stream/Parse.hs#L674. Its logic can be summarised as:

while next token is entirely whitespace:
  drop next token

This works fine, as long as there is only one whitespace token in the stream. If there are multiple whitespace tokens, all but the last are dropped (as demonstrated by alexkalderimis/xml@8bac1ee). It is possible to have multiple adjacent tokens in the stream because once an entity has been parsed (which happens prior to tokenisation) it is indistinguishable from ordinary text, even though it is semantically distinct.

This means that <tag>   </tag> (three literal spaces) has one content token inside it (a single Text of length 3, namely "\x20\x20\x20"), while <tag>&#x20; &#x20;</tag> has 3 whitespace tokens, each a single space character ("\x20"). The distinction between entities and plain content has been erased.

This is why this is not a back-tracking issue. Even if dropWS was amended to either drop all consecutive white-space tokens or none it would still be incorrect since it would be impossible to know if it is skipping a decoded entity or a literal space character.
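The argument above can be illustrated with a toy token type (hypothetical, for illustration only; this is not xml-conduit's Token): after entity decoding, the two documents produce different streams, and a dropWS that skips whitespace tokens cannot tell decoded entities from literal spaces.

```haskell
-- Toy model of the token stream (hypothetical).
data Token = Text String deriving (Eq, Show)

-- A token is droppable if it is entirely XML whitespace.
isAllWS :: Token -> Bool
isAllWS (Text s) = all (`elem` " \t\r\n") s

-- Summary of the back-tracking logic: drop tokens while they are
-- entirely whitespace.
dropWS :: [Token] -> [Token]
dropWS = dropWhile isAllWS

literal, withEntities :: [Token]
literal      = [Text "   "]                   -- <tag>   </tag>
withEntities = [Text " ", Text " ", Text " "] -- <tag>&#x20; &#x20;</tag>

main :: IO ()
main = do
  print (dropWS literal)       -- []
  print (dropWS withEntities)  -- [] -- decoded entities skipped too
```

Both streams are emptied, even though the second contains decoded entities that are semantically significant content: the information needed to skip only the literal spaces no longer exists at this stage.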

I acknowledge that this first attempt has significant drawbacks, and indeed I am perfectly happy to find an alternative solution. I feel I understand the issue well and have attempted to explain it, in code and in comments. If you feel I have not succeeded in that attempt I am sorry - feel free to close this PR in any case. If you do want me to work towards an acceptable solution I would be more than happy to contribute.

@snoyberg

Owner

snoyberg commented Feb 14, 2016

Even if dropWS was amended to either drop all consecutive white-space tokens or none it would still be incorrect since it would be impossible to know if it is skipping a decoded entity or a literal space character.

This is where I'm confused, as it seems like the direct solution. What's the reason this isn't an option? Why do we care if it's not possible to distinguish between a decoded entity and a literal space character?

@alexkalderimis

Contributor Author

alexkalderimis commented Feb 14, 2016

For the simple reason that <outer>&#x20;<inner/></outer> is clearly not the same as <outer> <inner/></outer>: we encode characters when we want them treated as significant content, not as indentation.

@snoyberg

Owner

snoyberg commented Feb 15, 2016

This is not correct according to the XML specification; the two are semantically identical. It seems like the problem is that you're trying to get XML to do something it's not designed to do.


@alexkalderimis

Contributor Author

alexkalderimis commented Feb 15, 2016

In that case you might want to close the ticket. I would still suggest you merge #75, though, which at least corrects the scanning over whitespace to only scan past XML space, and not the wider definition of space that isSpace tests for.


snoyberg added a commit that referenced this pull request Feb 15, 2016

@snoyberg snoyberg closed this Feb 15, 2016
