Clarify scope of document in preface #10
Comments
It does actually define utf-8 in section 9 though. |
True! If that is an exact equivalent of 10646 then why duplicate it? |
10646 allows variation in how errors are handled, as indicated in a note. The utf-8 encoder is duplicated for completeness' sake. |
If it's only for completeness and isn't intended to replace the definition then it should be included non-normatively and the normative defining reference should be clearly included. Clarifications or modifications made in this spec only should (obviously) be normative. If it is intended to replace or modify the definition normatively then that rings big alarm bells. Specifying more precisely error handling does make sense but doesn't change the basic definition. |
The utf-8 decoder is intended to replace the 10646 definition since we don't want to have the variation in error handling. Given that we already do that, I don't see much point in not also defining the encoder. |
The general problem here is that forking and changing specs creates confusion and incompatibility. If you're not changing the encoder then it's really helpful to make that clear by including it only informatively (where "helpful" is not a strong enough word). Then the world can know that existing encoders built on 10646 don't need to be rebuilt and are intended to remain compatible. If I've understood correctly you're not really changing the decoder model either, just restricting and clarifying the error model and including a reference algorithm. In that case the delta relative to 10646 needs to be included too. |
I would be happy to add more notes or clarify something, PRs welcome if you're in a hurry, but this document has had review by some of the authors of 10646 and they did not consider anything problematic. I don't think I would want to mark any of the existing algorithms non-normative since having everything related to encodings be self-contained seems extremely useful. |
I'm not in a huge hurry, I just like to make things neat and tidy. Marking the encoding algorithm non-normative would be helpful - it doesn't stop it from being self-contained for reference within the document. I suppose it would be okay to keep it normative but add a statement along the lines of "This is [identical to|based on] what is specified in 10646". |
For the intended audience of this document, it actually is the canonical definition of utf-8. Adding notes that it matches other specifications doesn't seem like it would hurt, but note that utf-8 is not the only encoding that is also defined elsewhere. |
@Ms2ger good point - everywhere there's a 'copy and include' pattern the original (and presumably definitive + already implemented) should be referenced. If there's a reverse engineering scenario to deal with a closed or non-existent "standard" then it makes sense to define something here. It looks like the intended audience is anyone creating a new UA, maintaining an existing UA, or defining new protocols and formats. The obvious danger is that existing protocols, formats and content will break - this spec is clearly taking pains to avoid that situation, so it would be worth making the derivations clear and obvious, as well as any deltas. |
Most existing specifications for the encodings in this document are a mess. That is why this document was created. This document mostly resulted from reverse engineering implementations, coupled with improvements around possible XSS attacks, and end-of-file handling. I'm happy to add a note for the utf-8 encoder, but I've no intent of putting effort into the other encodings as that mostly seems like busywork. I might accept PRs, though it depends on the specifics. |
Okay, that sounds like a reasonable way forward. Thanks! |
How do you want to appear in the acknowledgments? As nigelmegitt? |
[blushes] "Nigel Megitt", please, if it's warranted at all. |
The Preface claims:
Isn't that formalised in ISO/IEC 10646:2014 and Unicode? I'm not suggesting that the contents of this document aren't useful, just that they don't define utf-8.
I suggest resolving this issue by: