Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML character reference processing is inconsistent #488

Closed
gibson042 opened this issue Oct 5, 2022 · 4 comments
Closed

HTML character reference processing is inconsistent #488

gibson042 opened this issue Oct 5, 2022 · 4 comments

Comments

@gibson042
Copy link
Contributor

#481 introduced processing of named character references, but is inconsistent with the HTML spec (cf. https://html.spec.whatwg.org/multipage/parsing.html#tokenization and specifically https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state )—it matches without a terminal semicolon but also fails to handle numeric references or the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you, and the inner logic lowercasing ≤ (which is its own bug and should be <) and & only with a terminal semicolon is inconsistent with it being optional in the initial match and also inconsistent with the lack of e.g. AMp; in https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references .

I think it should instead be either unconditionally strict1 or else better aligned with HTML2, and in either case should probably also be updated for numeric references and correct case handling.

Footnotes

  1. e.g.,

    --- src/formatter/text.ts
    +++ src/formatter/text.ts
    @@ -9,1 +9,1 @@
    -  text = text.replace(/&[a-zA-Z0-9]+;?/g, m => {
    +  text = text.replace(/&[a-zA-Z0-9]+;/g, m => {
    
  2. e.g.,

    --- src/formatter/text.ts
    +++ src/formatter/text.ts
    @@ -11,2 +11,4 @@
    -    if ({}.hasOwnProperty.call(entities, m) && entities[m] !== null) {
    -      return entities[m];
    +    const prefix = Array(m.length).fill().map((_, n) => n ? m.slice(0, -n) : m)
    +      .find(prefix => ({}).hasOwnProperty.call(entities, prefix) && entities[prefix] !== null);
    +    if (prefix) {
    +      return entities[prefix] + m.slice(prefix.length);
    
@bakkot
Copy link
Contributor

bakkot commented Oct 5, 2022

matches without a terminal semicolon

Only for legacy entities, which is consistent with HTML. The list of entities matched is taken directly from the HTML spec.

also fails to handle numeric references

I don't think we have any uses of that form to check if the decision to use that form was deliberate (e.g. the numeric value of the code point is relevant to the reader), so I didn't implement it. If there's some usages out there which you think are definitely not deliberate, I'd be happy to accept a PR adding that.

the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you

Omitting that case was a deliberate decision, which I stand by. Rather than formatting those, I would prefer to have a lint rule for unknown entity-like things which will warn on that case.

and the inner logic lowercasing ≤ (which is its own bug and should be <) and & only with a terminal semicolon

Fixed in #489, thanks.

and also inconsistent with the lack of e.g. AMp;

I don't think this is literally ever going to come up, so I don't want to spend time implementing it.

@gibson042
Copy link
Contributor Author

I don't think we have any uses of that form to check if the decision to use that form was deliberate (e.g. the numeric value of the code point is relevant to the reader), so I didn't implement it. If there's some usages out there which you think are definitely not deliberate, I'd be happy to accept a PR adding that.

I think it makes as much sense to convert numeric references as it does to convert named references, so I'll try to submit that.

the prefix matching that renders e.g. I'm &notit; I tell you as I'm ¬it; I tell you

Omitting that case was a deliberate decision, which I stand by. Rather than formatting those, I would prefer to have a lint rule for unknown entity-like things which will warn on that case.

and also inconsistent with the lack of e.g. AMp;

I don't think this is literally ever going to come up, so I don't want to spend time implementing it.

Perfect, that's the kind of context I was looking for. It would incline me towards requiring the semicolon for replacement and catching mistakes by linting (i.e., the first suggestion above). WDYT?

@bakkot
Copy link
Contributor

bakkot commented Oct 5, 2022

It would incline me towards requiring the semicolon for replacement and catching mistakes by linting

I prefer errors to be automatically fixed when possible, which only the formatter is capable of at the moment. &amp is unambiguously supposed to be &, so it would be nice for the tooling to fix it rather than just pointing it out. This is not true of e.g. &notit;, where it's not clear what the user intended, so that needs to be a lint rule.

(It would be nice to make the linter support --fix, but that's a bunch more work.)

So I'm inclined to stick with the current strategy unless there's some improvement to the user experience from the strategy you propose.

@gibson042
Copy link
Contributor Author

Works for me.

@gibson042 gibson042 closed this as not planned Won't fix, can't repro, duplicate, stale Oct 6, 2022
gibson042 added a commit to gibson042/ecmarkup that referenced this issue Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants