Tout est Terrible. Endian problems with original RFC 4122 + case problems #119

Closed
safinaskar opened this issue Jul 14, 2022 · 8 comments

@safinaskar

safinaskar commented Jul 14, 2022

In 2013 I reported this erratum against the original UUID RFC (RFC 4122): https://www.rfc-editor.org/errata/eid3546 . In short, big endian / little endian are not used consistently in the original RFC, so (in my opinion) it is impossible to create a working translator from the binary UUID form to the textual one and vice versa based on the RFC text alone. That problem still has not been fixed. So I propose replacing the original UUID RFC with a newer version instead of merely adding new UUID types. Obviously, all existing errata ( https://www.rfc-editor.org/errata_search.php?rfc=4122 ) should be considered when creating this new version.

Back in 2013 I tried to write my own UUID C library based on the RFC text alone, and I failed precisely because of that endianness problem.

The original RFC also has a second problem: some implementations generate textual UUIDs in uppercase, others in lowercase. This complicates textual comparison of UUIDs and sometimes leads to bugs. My proposal: textual UUIDs should always be generated in lower case (i.e. consuming uppercase UUIDs is OK, but generating them is not). This proposal also justifies replacing the original RFC. If you are not convinced, please read this text: "Tout est Terrible" ( https://ferd.ca/tout-est-terrible.html ). The author talks about various problems and how they complicate writing ordinary programs, including uppercase/lowercase UUIDs. (Okay, if you are still not convinced, please at least mandate lowercase UUIDs for new types only.)

@kyzer-davis
Contributor

@safinaskar, I have actually reviewed your erratum for the original RFC 4122 regarding big endian and little endian, and I am fairly certain the current Draft 04 has no ambiguity on that topic as it pertains to the items in this specific draft.

Please let me know if you feel otherwise, but I tried to go out of my way to provide concise verbiage around this topic.

My proposal: textual UUIDs should always be generated in lower case

This is a pretty good problem to raise, and the topic of "change the text encodings or provide alternatives" is being worked on under
https://github.com/uuid6/new-uuid-encoding-techniques-ietf-draft

Also CCing @ben221199, who has been reviewing the original RFC 4122 to give the v1 through v5 topics from that RFC a fresh coat of paint.

@ben221199

I see I'm mentioned. The standard in this repository will add UUIDv6 and only that (as far as I know). This standard will update RFC 4122, because it adds version 6 to RFC 4122.


I'm also working on another standard. That standard will deprecate RFC 4122 (and also this standard, I think), because it will redefine what was already in RFC 4122.

The idea is that the standard redefines all existing UUID variants (Apollo, NCS, Microsoft, Full-UUID, etc.) and versions (v1-v6) and directly registers the variants and versions in an IANA registry, so that there will be no doubt about them. If you (@safinaskar) think that RFC 4122 (and its errata) aren't fully correct, that standard is the place to fix it, as I think I want to register different serialization types too.

Also, I want to post a timeline here, because I think this could be the order in which to release these standards (although the second and third could also be swapped):

[RFC 4122] -> [RFC xxxx: UUIDv6] (this repository) -> [RFC xxxx: UUID + IANA] (my standard) -> [RFC xxxx: UUIDv7 and UUIDv8]

Hope this explanation helps a bit.

@safinaskar
Author

I'm also working on another standard

Cool, thanks!

@peterbourgon

peterbourgon commented Aug 13, 2022

There lies the risk. By default, most strings in most languages have comparison operators that are case sensitive. This means that these 3 UUIDs, despite being identical in their authentic binary representation, wouldn't compare as equal as strings.

Well, of course not? The dashed-hex form of a UUID is one of infinitely many encodings of the actual UUID, which is a specific sequence of 16 bytes. The UUID is the bytes, not the string — all encoded forms of a UUID must be decoded before they can be compared.
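A minimal sketch of the "decode before comparing" point, assuming Python's standard `uuid` module as the decoder. The helper name `same_uuid` is hypothetical and introduced only for illustration; the sample value is the one used later in this thread.

```python
import uuid

def same_uuid(a: str, b: str) -> bool:
    """Compare two textual UUIDs by their decoded 16-byte value (hypothetical helper)."""
    return uuid.UUID(a).bytes == uuid.UUID(b).bytes

lower = "210fc7bb-8186-39ac-48a4-c6afa2f1581a"
upper = "210FC7BB-8186-39AC-48A4-C6AFA2F1581A"

# Plain string comparison is case sensitive, so the two spellings differ.
strings_equal = (lower == upper)

# Decoding first removes the case difference.
values_equal = same_uuid(lower, upper)

# The standard library always re-encodes in lowercase.
canonical = str(uuid.UUID(upper))

print(strings_equal, values_equal, canonical)
```

Under this approach the case used by the producer stops mattering, at the cost of decoding every value before it is compared.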

In short, big endian / little endian are not used consistently in the original RFC, so (in my opinion) it is impossible to create a working translator from the binary UUID form to the textual one and vice versa based on the RFC text alone.

Is a sequence of bytes represented as ASCII hexadecimal not invariant to endianness? Given a dashed-hex string representation of a UUID, e.g. 210fc7bb-8186-39ac-48a4-c6afa2f1581a, is the actual value not unambiguously

0x21 0x0f 0xc7 0xbb
0x81 0x86 0x39 0xac
0x48 0xa4 0xc6 0xaf
0xa2 0xf1 0x58 0x1a

?
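The reading proposed above, where the hex digits map to bytes strictly left to right, can be sketched as follows. This is only an illustration of that one interpretation; `dashed_hex_to_bytes` is a hypothetical name, and the cross-check against Python's `uuid` module shows only that the standard library follows the same reading.

```python
import uuid

def dashed_hex_to_bytes(text: str) -> bytes:
    """Decode a dashed-hex UUID string, reading digit pairs left to right."""
    hex_digits = text.replace("-", "")
    if len(hex_digits) != 32:
        raise ValueError("expected 32 hexadecimal digits")
    return bytes.fromhex(hex_digits)

text = "210fc7bb-8186-39ac-48a4-c6afa2f1581a"
raw = dashed_hex_to_bytes(text)

# The same 16 bytes as listed above, in the same order.
expected = bytes([
    0x21, 0x0f, 0xc7, 0xbb,
    0x81, 0x86, 0x39, 0xac,
    0x48, 0xa4, 0xc6, 0xaf,
    0xa2, 0xf1, 0x58, 0x1a,
])

matches_listing = (raw == expected)
matches_stdlib = (raw == uuid.UUID(text).bytes)
print(matches_listing, matches_stdlib)
```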

@safinaskar
Author

@peterbourgon, in 2013 I tried to write a C program that converts a UUID from its binary representation to ASCII (or vice versa) based on RFC 4122, and I failed because of endianness problems. So the text is ambiguous. Unfortunately I don't remember the exact details, i.e. I don't remember which parts of the standard caused problems. But you can still see my erratum and one editor's response to it in the RFC errata tracker ( https://www.rfc-editor.org/errata/eid3546 ): he acknowledges that problems exist.

@ben221199

I can imagine that GUID can do some strange things. Everything is normal, except when you are Microsoft:

Bytes:
image
GUID in Data Inspector:
image

This is HxD, which only uses Microsoft's GUID encoding, not the normal UUID encoding, so you don't get the expected 00112233-4455-6677-8899-AABBCCDDEEFF. In this case, all the fields in the first half of the GUID are little-endian and all the fields in the second half are big-endian.

Wikipedia mentions it too:

The binary encoding of UUIDs varies between systems. Variant 1 UUIDs, nowadays the most common variant, are encoded in a big-endian format. For example, 00112233-4455-6677-8899-aabbccddeeff is encoded as the bytes 00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff.[9][10]

Variant 2 UUIDs, historically used in Microsoft's COM/OLE libraries, use a mixed-endian format, whereby the first three components of the UUID are little-endian, and the last two are big-endian. For example, 00112233-4455-6677-c899-aabbccddeeff is encoded as the bytes 33 22 11 00 55 44 77 66 c8 99 aa bb cc dd ee ff.[11][12] See the section on Variants for details on why the '88' byte becomes 'c8' in Variant 2.

Note that this quote mentions variants in combination with encoding and COM/OLE. I think this text is partly written by an idiot.
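The two byte layouts can be compared with a short sketch, assuming Python's standard `uuid` module, whose `bytes` attribute gives the big-endian layout and whose `bytes_le` attribute gives the mixed-endian layout. The helper `swap_guid_fields` is a hypothetical name used only here, and the example uses the 8899 value from the comment above rather than Wikipedia's c899 one.

```python
import uuid

def swap_guid_fields(b: bytes) -> bytes:
    """Reverse the first three fields (4, 2 and 2 bytes); keep the last 8 bytes as-is."""
    if len(b) != 16:
        raise ValueError("expected 16 bytes")
    return b[0:4][::-1] + b[4:6][::-1] + b[6:8][::-1] + b[8:16]

u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

big_endian = u.bytes        # 00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff
mixed_endian = u.bytes_le   # 33 22 11 00 55 44 77 66 88 99 aa bb cc dd ee ff

# The swap is its own inverse, so it converts in both directions.
assert swap_guid_fields(big_endian) == mixed_endian
assert swap_guid_fields(mixed_endian) == big_endian

# Reading the plain byte sequence 00 11 22 ... ff as a mixed-endian GUID
# yields a different textual value from the expected one.
misread = uuid.UUID(bytes_le=big_endian)
print(misread)
```

A reader and a writer that disagree on which layout is in use will therefore disagree on the first three fields while agreeing on the last eight bytes.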

@peterbourgon

Wow! You learn something new, and frequently terrible, every day.

@kyzer-davis
Contributor

Topic moved; it is covered as part of the errata work at https://github.com/ietf-wg-uuidrev/rfc4122bis
Archiving this to clean up the issue tracker.


4 participants