Clarify scope of document in preface #10

Closed
nigelmegitt opened this issue Oct 20, 2015 · 14 comments

@nigelmegitt

The Preface claims:

"… this specification … defines … the utf-8 encoding."

Isn't that formalised in ISO/IEC 10646:2014 and Unicode? I'm not suggesting that the contents of this document aren't useful, just that they don't define utf-8.

I suggest resolving this issue by:

  1. Removing "(and defines)" from the preface.
  2. Adding a reference to ISO/IEC 10646 to the References section and referring to it in the definition of utf-8.

@annevk
Member

annevk commented Oct 20, 2015

It does actually define utf-8 in section 9 though.

@nigelmegitt
Author

True! If that is an exact equivalent of 10646 then why duplicate it?

@annevk
Copy link
Member

annevk commented Oct 20, 2015

10646 allows variation in how errors are handled, as indicated in a note. The utf-8 encoder is duplicated for completeness' sake.
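
To illustrate the kind of variation in error handling mentioned here, a minimal Python sketch follows. It uses Python's built-in codec error handlers as stand-ins for "different conforming decoders"; they are not the algorithm from this specification, and the sample bytes and the `try_strict` helper are made up for illustration.

```python
# Ill-formed input: the byte 0xFF can never appear in well-formed UTF-8.
data = b"a\xffb"


def try_strict(raw: bytes) -> str:
    """Decode strictly; report a marker string instead of raising (hypothetical helper)."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "<error>"


# Three policies for the same bytes, three different results.
strict_result = try_strict(data)                          # refuses the input
replace_result = data.decode("utf-8", errors="replace")   # substitutes U+FFFD
ignore_result = data.decode("utf-8", errors="ignore")     # silently drops the byte

print(repr(strict_result), repr(replace_result), repr(ignore_result))
```

All three behaviours can be defended as "handling the error", which is why pinning down a single one matters for interoperability.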

@nigelmegitt
Author

If it's only for completeness and isn't intended to replace the definition then it should be included non-normatively and the normative defining reference should be clearly included. Clarifications or modifications made only in this spec should (obviously) be normative.

If it is intended to replace or modify the definition normatively then that rings big alarm bells. Specifying error handling more precisely does make sense but doesn't change the basic definition.

@annevk
Member

annevk commented Oct 20, 2015

The utf-8 decoder is intended to replace the 10646 definition since we don't want to have the variation in error handling. Given that we already do that, I don't see much point in not also defining the encoder.

@nigelmegitt
Author

The general problem here is that forking and changing specs creates confusion and incompatibility.

If you're not changing the encoder then it's really helpful to make that clear by including it only informatively (where "helpful" is not a strong enough word). Then the world can know that existing encoders built on 10646 don't need to be rebuilt and are intended to remain compatible.
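
As a rough illustration of why an unchanged encoder leaves existing implementations compatible, here is a minimal Python sketch of the standard UTF-8 bit layout for a single Unicode scalar value. The function name `utf8_encode_scalar` is hypothetical, and this is not the text of either specification's algorithm; it is only compared against Python's built-in encoder as an independent implementation.

```python
def utf8_encode_scalar(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative sketch)."""
    # Surrogates and out-of-range values are not scalar values.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Boundary values of each length class, plus a few ordinary characters.
samples = [0x24, 0x7F, 0x80, 0xA2, 0x7FF, 0x800, 0x20AC, 0xFFFF, 0x10000, 0x10348, 0x10FFFF]
matches = all(utf8_encode_scalar(cp) == chr(cp).encode("utf-8") for cp in samples)
print(matches)
```

For valid scalar values there is only one well-formed byte sequence, so two encoders written from different documents should agree byte for byte; the room for divergence is on the decoding side, with ill-formed input.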

If I've understood correctly you're not really changing the decoder model either, just restricting and clarifying the error model and including a reference algorithm. In that case the delta relative to 10646 needs to be included too.

@annevk
Member

annevk commented Oct 20, 2015

I would be happy to add more notes or clarify something, PRs welcome if you're in a hurry, but this document has had review by some of the authors of 10646 and they did not consider anything problematic.

I don't think I would want to mark any of the existing algorithms non-normative since having everything related to encodings be self-contained seems extremely useful.

@nigelmegitt
Author

I'm not in a huge hurry, just like to make things neat and tidy.

Marking the encoding algorithm non-normative would be helpful - it doesn't stop it from being self contained for reference within the document. I suppose it would be okay to keep it normative but add a statement along the lines of "This is [identical to|based on] what is specified in 10646".

@Ms2ger
Member

Ms2ger commented Oct 20, 2015

For the intended audience of this document, it actually is the canonical definition of utf-8. Adding notes that it matches other specifications doesn't seem like it would hurt, but note that utf-8 is not the only encoding that is also defined elsewhere.

@nigelmegitt
Author

@Ms2ger good point - everywhere there's a 'copy and include' pattern the original (and presumably definitive + already implemented) should be referenced.

If there's a reverse engineering scenario to deal with a closed or non-existent "standard" then it makes sense to define something here.

It looks like the intended audience is anyone creating a new UA, maintaining an existing UA, or defining new protocols and formats. The obvious danger is that existing protocols, formats and content will break - this spec is clearly taking pains to avoid that situation, so it would be worth making the derivations clear and obvious, as well as any deltas.

@annevk
Member

annevk commented Oct 20, 2015

Most existing specifications for the encodings in this document are a mess. That is why this document was created. This document mostly resulted from reverse engineering implementations, coupled with improvements around possible XSS attacks, and end-of-file handling.
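
To make the end-of-file point concrete, here is a small Python sketch of a streaming decode where the input stops in the middle of a multi-byte sequence. It uses Python's `codecs` incremental decoder with the `replace` error handler as an assumption for illustration; it is not the decoder defined in this document, and `decode_chunks` is a hypothetical helper.

```python
import codecs


def decode_chunks(chunks, errors="replace"):
    """Decode a stream of byte chunks as UTF-8, flushing at end of input."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    out = []
    for chunk in chunks:
        # An incomplete trailing sequence is buffered, not reported yet.
        out.append(decoder.decode(chunk))
    # final=True signals end of file: any buffered partial sequence
    # must now be resolved by the error handler.
    out.append(decoder.decode(b"", final=True))
    return "".join(out)


# U+20AC EURO SIGN is E2 82 AC; split across chunks it still decodes cleanly.
complete = decode_chunks([b"\xe2\x82", b"\xac"])

# Here the stream ends after E2 82, so the decoder has to decide what to emit.
truncated = decode_chunks([b"a\xe2\x82"])

print(repr(complete), repr(truncated))
```

A decoder that forgets the final flush would silently drop the truncated bytes, which is one way implementations can diverge at end of file.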

I'm happy to add a note for the utf-8 encoder, but I've no intent of putting effort into the other encodings as that mostly seems like busywork. I might accept PRs, though it depends on the specifics.

@nigelmegitt
Author

Okay, that sounds like a reasonable way forward. Thanks!

@annevk
Member

annevk commented Oct 20, 2015

How do you want to appear in the acknowledgments? As nigelmegitt?

@nigelmegitt
Author

[blushes] "Nigel Megitt", please, if it's warranted at all.

@annevk closed this as completed in adb5f84 on Nov 4, 2015