-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove error highlight for dashes in comments and CDATA sections #49
Conversation
@sorbits friendly ping 🙂 |
Hmm, I admit I don't really know what to think about this one. Both the HTML 4 and 5 spec say the grammar is correct, only this recent change in WhatWG suggests otherwise. I admit I haven't been super following HTML5 politics recently, what is the current relationship between W3C and WhatWG? |
It wasn't correct per either HTML4 or HTML5 before the recent change. HTML4 used SGML rules which toggle comment on/off for each double dash.
HTML5 doesn't have that. A comment can end with As is argued in the whatwg issue, making double dash an error mostly serves to confuse people about something that is no longer a problem. I think syntax highlighting and conformance checking should be helpful. 😊 As to W3C HTML, they usually copy changes from whatwg but are lagging behind a bit. HTH |
We do not highlight comments according to the SGML standard because few (if any) current browser support it. But at the same time, we do highlight markup that is a violation of the spec as invalid, because when you write HTML in TextMate, it’s nice to be made aware that you’re currently using syntax that could cause problems (given that it is using a part of the specification that no browser has/will implement). In the WhatWG’s introduction of this new comment rule, did they do any compatibility testing? I wrote a HTML parser many many years ago, and back then my own findings (based on web sites at the time) was that the safest way to parse comments was using So for example |
Yes, extensive compat testing was done. Spaces before the > was dropped because it broke more than it fixed. The whatwg spec matches what browsers implement. How comments can end has been stable for many years now. ☺ |
Browsers are of course only a small subset of the software that parse
HTML.
Is there any public notes on this compatibility testing?
…On 4 Dec 2016, at 15:00, Simon Pieters wrote:
Yes, extensive compat testing was done. Spaces before the > was
dropped because it broke more than it fixed.
The whatwg spec matches what browsers implement. How comments can end
has been stable for many years now. ☺
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#49 (comment)
|
"Small"? This has been interoperably implemented in all major browser engines for about a decade. What can be a more convincing study? |
(Everything is public, I can dig up relevant emails maybe tomorrow.) |
Small in that there are 3 major browser engines, but there must be a lot
more HTML parsers out there which are written independently of these
browser engines.
When you say that extensive testing was done, I assume it was based on
looking at real-world documents, not real world parsers.
Any parser that follows the current HTML 5 specification will not like
the two dashes inside a comment, in fact today I used DOM PDF, so I took
a look at their parser, and my reading is that they indeed treat it as
an invalid state:
https://github.com/dompdf/dompdf/blob/master/lib/html5lib/Tokenizer.php#L1360
We probably come at this from different perspectives, my perspective is
that TextMate is used to write HTML documents, and we want to help the
user not write obviously bad markup, which is why we highlight a lone
ampersand or two dashes in comments, because these could cause problems
(depending on what software works on this document).
I assume you’re viewing this more as a “the grammar is a parser,
i.e. user agent, and thus it should be compatible with as many documents
as possible”. Given that it is used on GitHub for highlighting, I can
sort of understand that argument, although it does bother me that the
HTML 5 specification currently does *not* agree with WhatWG about this
issue.
When I asked about public notes I was thinking about discussion about
document conformance versus user agent conformance. To me it would have
made more sense if they said that user agents must parse comments as
specified (i.e. no special meaning given to `--`), but document authors
should not use double-dashes in comments.
…On 4 Dec 2016, at 19:41, Simon Pieters wrote:
"Small"?
This has been interoperably implemented in all major browser engines
for about a decade. What can be a more convincing study?
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#49 (comment)
|
Maybe we are talking about different things. 😆 When I referred to compat testing, I meant about what is Web compatible way to close a comment. i.e., in reply to:
Double dash being an error or not has no Web compat impact, because it does not change how comments are parsed, only what is flagged as an error.
Well, both. Parsers in browsers were studied when designing the HTML parser in the HTML spec, and real-world documents were also studied. And of course, when it is implemented and shipped in a browser, you get wide-scale compatibility testing of real-world documents.
That is what I want, too. In which scenario is double dash in a comment a problem? A few pointers for the history: |
I don’t think I understand this question. It is a violation of the HTML 4 and (X)HTML 5 standard. Any validator should therefore complain about this (I tested one, and it did complain). Furthermore, parsers written to strictly conform to the standard may also complain, in my previous comment I pointed to the DOMPDF parser as an example of one. |
I mean, in which scenario does using a double dash lead to a problem for the Web developer? That earlier versions of HTML said that it was invalid is not in itself a practical problem -- it's typically the other way around, the standard declares something invalid because violating it can lead to problems. But in this case the problems are gone, as far as I can tell, which is why the WHATWG standard was updated to reflect that.
Which one did you test? DOMPDF uses html5lib, which does not abort parsing on a parse error. |
How can you and I know? The W3C standard for comments is here and says that it is not allowed, hence it is not unreasonable to think that some (strict) software will consider it an error and act accordingly. The DOMPDF I pointed to has this code:
This PR is because WhatWG, a separate organization, with a standard that is now diverging from the official W3C recommendation, has relaxed the rules for comments. I think WhatWG’s rules makes sense, and as a parser implementar, I would definitely follow them, but as a document author, I would follow the robustness principle and be conservative with what I send, and thus I would be hesitant to include characters in my HTML that is a violation of the W3C standard, hence I feel there is value in keeping this rule a bit longer. Why do you think, that it is important to remove the rule? Also, would anyone who relies on XML tools to process their XHTML not run into a problem with double-hyphens in comments? |
We can never be 100% sure of course, but in my experience, HTML parsers are typically designed to be non-fatal upon finding a syntax error.
Yes, and it is non-fatal, AFAICT.
Yes. I would suggest they use XML syntax highlighting instead, since HTML has various things that are allowed but are not allowed in XML (like optional tags, unquoted attribute values, lowercase " I would be fine with waiting a bit longer before removing the check for double dash if you think it's too early. I don't think it's the most important thing, but I think it is not necessary today to draw the developer's attention to double-hyphen in a comment. |
This is right except I disagree with which one is official. 😉 The WHATWG is what browser vendors treat as official as a general policy. The W3C are creating forks and are lagging behind in copying upstream changes. |
@@ -203,12 +203,6 @@ | |||
<key>name</key> | |||
<string>constant.other.inline-data.html</string> | |||
</dict> | |||
<dict> | |||
<key>match</key> | |||
<string>(\s*)(?!--|>)\S(\s*)</string> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this causes incorrect highlighting in GitHub's syntax highlighting here
It appears that http://w3c.github.io/html/syntax.html#comments has adopted the whatwg change by now. |
OK, then I think we should accept this change. Do you mind squashing your two commits into one? As your second one seems to be a fix of the first. You make a reference to CDATA, but I do not see how your commits are changing anything related to this, as CDATA is just another option after |
I thought removed lines 206 to 211 applied to CDATA sections, but maybe I misunderstood. Happy to squash. |
<!-- in comments is still invalid. Also end fix how comments can end. Fixes textmate#45. The check for CDATA sections was bogus to begin with, as far as I can tell. They have always allowed dashes and > per spec.
e5e482b
to
3c52e85
Compare
I have squashed. |
Ping 😊 |
ping |
I've been working on a major revamp to the HTML grammar for the last many weeks, this will be part of that. Had expected it to be done quicker but adding in various things like SVG and MathML took a while. It may just get pushed up in the next week or so or there may be a test branch first. It's mostly done at this point I just need to do a final test pass of each bundle that uses the HTML grammar to rule out any side effects of the changes. There are quite a few so that's a good bit of work. |
Imported as 3161d11 with some minor changes, thanks! |
Fixes #45.
The check for CDATA sections was bogus to begin with, as far as I
can tell. They have always allowed dashes and > per spec.