Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove error highlight for dashes in comments and CDATA sections #49

Closed
wants to merge 1 commit into from

Conversation

zcorpan
Copy link
Contributor

@zcorpan zcorpan commented Nov 22, 2016

Fixes #45.

The check for CDATA sections was bogus to begin with, as far as I
can tell. They have always allowed dashes and > per spec.

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 2, 2016

@sorbits friendly ping 🙂

@infininight
Copy link
Member

Hmm, I admit I don't really know what to think about this one. Both the HTML 4 and 5 spec say the grammar is correct, only this recent change in WhatWG suggests otherwise. I admit I haven't been super following HTML5 politics recently, what is the current relationship between W3C and WhatWG?

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 3, 2016

It wasn't correct per either HTML4 or HTML5 before the recent change. HTML4 used SGML rules which toggle comment on/off for each double dash.

<!-- on -- --> this is in a comment -- -- on -- >

HTML5 doesn't have that. A comment can end with --> or --!> (is invalid though); not -- >. So this PR fixes that aspect.

As is argued in the whatwg issue, making double dash an error mostly serves to confuse people about something that is no longer a problem. I think syntax highlighting and conformance checking should be helpful. 😊

As to W3C HTML, they usually copy changes from whatwg but are lagging behind a bit.

HTH

@sorbits
Copy link
Member

sorbits commented Dec 4, 2016

It wasn't correct per either HTML4 or HTML5 before the recent change. HTML4 used SGML rules which toggle comment on/off for each double dash

We do not highlight comments according to the SGML standard because few (if any) current browser support it.

But at the same time, we do highlight markup that is a violation of the spec as invalid, because when you write HTML in TextMate, it’s nice to be made aware that you’re currently using syntax that could cause problems (given that it is using a part of the specification that no browser has/will implement).

In the WhatWG’s introduction of this new comment rule, did they do any compatibility testing?

I wrote a HTML parser many many years ago, and back then my own findings (based on web sites at the time) was that the safest way to parse comments was using <!--.*--.*> (here . matches newlines).

So for example <!-- in -- out -- in -- out --> would not be a problem, but something like this would be: <!-- in -- out -- <em>in</em> -- out --> but not <!-- <em>in</em> -->.

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 4, 2016

Yes, extensive compat testing was done. Spaces before the > was dropped because it broke more than it fixed.

The whatwg spec matches what browsers implement. How comments can end has been stable for many years now. ☺

@sorbits
Copy link
Member

sorbits commented Dec 4, 2016 via email

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 4, 2016

"Small"?

This has been interoperably implemented in all major browser engines for about a decade. What can be a more convincing study?

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 4, 2016

(Everything is public, I can dig up relevant emails maybe tomorrow.)

@sorbits
Copy link
Member

sorbits commented Dec 4, 2016 via email

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 4, 2016

Maybe we are talking about different things. 😆

When I referred to compat testing, I meant about what is Web compatible way to close a comment. i.e., in reply to:

I wrote a HTML parser many many years ago, and back then my own findings (based on web sites at the time) was that the safest way to parse comments was using <!--.--.> (here . matches newlines).

Double dash being an error or not has no Web compat impact, because it does not change how comments are parsed, only what is flagged as an error.

When you say that extensive testing was done, I assume it was based on
looking at real-world documents, not real world parsers.

Well, both. Parsers in browsers were studied when designing the HTML parser in the HTML spec, and real-world documents were also studied. And of course, when it is implemented and shipped in a browser, you get wide-scale compatibility testing of real-world documents.

we want to help the
user not write obviously bad markup, which is why we highlight a lone
ampersand or two dashes in comments, because these could cause problems
(depending on what software works on this document).

That is what I want, too.

In which scenario is double dash in a comment a problem?

A few pointers for the history:
https://ln.hixie.ch/?start=1137799947&count=1
https://lists.w3.org/Archives/Public/public-whatwg-archive/2006Jan/0209.html
https://lists.w3.org/Archives/Public/public-whatwg-archive/2007Jun/0295.html
https://lists.w3.org/Archives/Public/public-whatwg-archive/2008Sep/0005.html
https://lists.w3.org/Archives/Public/public-html/2009Jun/0191.html
https://www.w3.org/Bugs/Public/show_bug.cgi?id=10578

@sorbits
Copy link
Member

sorbits commented Dec 5, 2016

In which scenario is double dash in a comment a problem?

I don’t think I understand this question. It is a violation of the HTML 4 and (X)HTML 5 standard. Any validator should therefore complain about this (I tested one, and it did complain).

Furthermore, parsers written to strictly conform to the standard may also complain, in my previous comment I pointed to the DOMPDF parser as an example of one.

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 5, 2016

I mean, in which scenario does using a double dash lead to a problem for the Web developer? That earlier versions of HTML said that it was invalid is not in itself a practical problem -- it's typically the other way around, the standard declares something invalid because violating it can lead to problems. But in this case the problems are gone, as far as I can tell, which is why the WHATWG standard was updated to reflect that.

Any validator should therefore complain about this (I tested one, and it did complain).

Which one did you test?

DOMPDF uses html5lib, which does not abort parsing on a parse error.

@sorbits
Copy link
Member

sorbits commented Dec 5, 2016

I mean, in which scenario does using a double dash lead to a problem for the Web developer?

How can you and I know? The W3C standard for comments is here and says that it is not allowed, hence it is not unreasonable to think that some (strict) software will consider it an error and act accordingly.

The DOMPDF I pointed to has this code:

$this->emitToken(array(
    'type' => self::PARSEERROR,
    'data' => 'unexpected-dash-after-double-dash-in-comment'
));

This PR is because WhatWG, a separate organization, with a standard that is now diverging from the official W3C recommendation, has relaxed the rules for comments.

I think WhatWG’s rules makes sense, and as a parser implementar, I would definitely follow them, but as a document author, I would follow the robustness principle and be conservative with what I send, and thus I would be hesitant to include characters in my HTML that is a violation of the W3C standard, hence I feel there is value in keeping this rule a bit longer.

Why do you think, that it is important to remove the rule?

Also, would anyone who relies on XML tools to process their XHTML not run into a problem with double-hyphens in comments?

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 5, 2016

How can you and I know?

We can never be 100% sure of course, but in my experience, HTML parsers are typically designed to be non-fatal upon finding a syntax error.

The DOMPDF I pointed to has this code:

Yes, and it is non-fatal, AFAICT.

Also, would anyone who relies on XML tools to process their XHTML not run into a problem with double-hyphens in comments?

Yes. I would suggest they use XML syntax highlighting instead, since HTML has various things that are allowed but are not allowed in XML (like optional tags, unquoted attribute values, lowercase "<!doctype", and so on).

I would be fine with waiting a bit longer before removing the check for double dash if you think it's too early. I don't think it's the most important thing, but I think it is not necessary today to draw the developer's attention to double-hyphen in a comment.

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 5, 2016

WhatWG, a separate organization, with a standard that is now diverging from the official W3C recommendation

This is right except I disagree with which one is official. 😉 The WHATWG is what browser vendors treat as official as a general policy. The W3C are creating forks and are lagging behind in copying upstream changes.

@@ -203,12 +203,6 @@
<key>name</key>
<string>constant.other.inline-data.html</string>
</dict>
<dict>
<key>match</key>
<string>(\s*)(?!--|&gt;)\S(\s*)</string>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zcorpan
Copy link
Contributor Author

zcorpan commented Mar 16, 2017

It appears that http://w3c.github.io/html/syntax.html#comments has adopted the whatwg change by now.

@sorbits
Copy link
Member

sorbits commented Jul 12, 2017

It appears that http://w3c.github.io/html/syntax.html#comments has adopted the whatwg change by now.

OK, then I think we should accept this change.

Do you mind squashing your two commits into one? As your second one seems to be a fix of the first.

You make a reference to CDATA, but I do not see how your commits are changing anything related to this, as CDATA is just another option after <!, i.e. we can have comments <!-- … --> or we can have character data <![CDATA[ … ]]>, and in TextMate’s grammar the two rules just share the begin for <! but it does not mean that the dash rules for comments is active in CDATA (AFAICT).

@zcorpan
Copy link
Contributor Author

zcorpan commented Aug 14, 2017

I thought removed lines 206 to 211 applied to CDATA sections, but maybe I misunderstood.

Happy to squash.

<!-- in comments is still invalid. Also end fix how comments can end.

Fixes textmate#45.

The check for CDATA sections was bogus to begin with, as far as I
can tell. They have always allowed dashes and > per spec.
@zcorpan
Copy link
Contributor Author

zcorpan commented Aug 25, 2017

I have squashed.

@zcorpan
Copy link
Contributor Author

zcorpan commented Dec 24, 2017

Ping 😊

@zcorpan
Copy link
Contributor Author

zcorpan commented Jun 19, 2018

ping

@infininight
Copy link
Member

I've been working on a major revamp to the HTML grammar for the last many weeks, this will be part of that. Had expected it to be done quicker but adding in various things like SVG and MathML took a while. It may just get pushed up in the next week or so or there may be a test branch first.

It's mostly done at this point I just need to do a final test pass of each bundle that uses the HTML grammar to rule out any side effects of the changes. There are quite a few so that's a good bit of work.

@infininight
Copy link
Member

Imported as 3161d11 with some minor changes, thanks!

@zcorpan zcorpan deleted the html-comment-dashes branch October 29, 2018 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants