Discussion: Alternate Text Encoding Methods (Crockford's Base32, etc) #3

Open
fabiolimace opened this issue Mar 3, 2020 · 79 comments
Labels
Discussion Open Ended Conversations

Comments

@fabiolimace

I think the 'base32hex' encoding, described in section 7 of RFC-4648, should be used instead of creating a new one. I know that the algorithm is the same with a different alphabet, but 'base32hex' already exists for the same purpose.

@bradleypeabody

The reason I wanted to change the base32 alphabet is that implementation (particularly as a db primary key) is simpler when UUIDs always sort exactly the same as they do when treated as raw bytes. This holds true for the binary format I'm proposing, but I think it would be even better if it were also true for the text format(s). For some applications sequence is critical, and having the text format sort in one sequence and the binary format in another has no upside, only downsides. In my opinion this outweighs the benefit of using the same alphabet that has been used in the past.

@douglascrockford 's base32 (https://www.crockford.com/base32.html) actually looks like a better alternative to what I've laid out in the draft and has the same properties. I'm considering updating the draft to use that.

@fabiolimace
Author

fabiolimace commented Mar 7, 2020

I understand why you wanted to change the order of the 'base32' alphabet from A-Z2-7 to 2-7A-Z. However, RFC-4648 also describes another alphabet called 'base32hex' that has this char sequence: 0-9A-V. According to section 7 of the document, "One property with this alphabet, which the base64 and base32 alphabets lack, is that encoded data maintains its sort order when the encoded data is compared bit-wise."
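The sort-order property quoted above can be checked with a small Python sketch. It relies only on the standard `base64.b32encode` function and derives the 'base32hex' text by swapping alphabets; the helper names and the sample byte values are illustrative assumptions, not anything taken from the draft.

```python
import base64

# The two RFC 4648 alphabets; index i in one corresponds to index i in the other.
B32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
B32HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"
TO_HEX = str.maketrans(B32, B32HEX)


def to_base32(data: bytes) -> str:
    """Standard 'base32' text (A-Z2-7), with '=' padding."""
    return base64.b32encode(data).decode("ascii")


def to_base32hex(data: bytes) -> str:
    """Same bits as to_base32, rendered with the 'base32hex' alphabet."""
    return to_base32(data).translate(TO_HEX)


# Five 16-byte values, listed in ascending raw-byte order (hypothetical samples).
samples = [bytes([b]) * 16 for b in (0x00, 0x10, 0x80, 0xD0, 0xFF)]

by_base32 = sorted(samples, key=to_base32)
by_base32hex = sorted(samples, key=to_base32hex)

print("base32 keeps byte order:   ", by_base32 == samples)
print("base32hex keeps byte order:", by_base32hex == samples)
```

With 'base32', the digits 2-7 stand for the highest values but sort before the letters in ASCII, so the values starting with 0xD0 and 0xFF jump to the front. With 'base32hex' the text order and the byte order agree.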

I agree that Crockford base32 is much better than 'base32' and 'base32hex'. In addition to the property of keeping the sort order, it avoids the confusion of similar characters and prevents accidental obscenity. I am not a native English speaker, but I know some obscene words that can be formed with the letter 'U'. There are other extremely obscene words with 'U' in Romance languages like Italian, Spanish, French and Portuguese. There is no need to mention them here.

Before opening this issue, I was thinking of suggesting Crockford encoding, but I was afraid it would not be approved, since 'base32hex' already exists for the sort order problem.

I vote for @douglascrockford 's base32 too.

P.S.: another point is that Crockford's encoding is adopted by other projects like the ULID specification.

@bradleypeabody

Okay, great. Unless some better idea comes along I'll update the spec with Crockford's base32 version the next time I do a round of changes. (I was not aware of the difference between 'base32' and 'base32hex' - thanks for pointing that out.)

I think the key idea here is to balance and weigh the concept of "don't invent something new if the existing standard solves the problem" against "use the best tool for the job". I'm sure the question of why not use RFC 4648 will come up again during the standardization process, but as it stands now I think the argument in favor of using something that sorts correctly ("best tool for the job") is stronger than the "use the existing standard" argument.

@fabiolimace
Author

Great! I have no objections. As far as I'm concerned, this issue is solved.

@fabiolimace fabiolimace changed the title from "Use base32hex alphabet as alternate text format" to "Use base32hex alphabet as alternative text format" Dec 21, 2020
@fabiolimace
Author

fabiolimace commented Dec 21, 2020

@bradleypeabody, have you seen this IETF draft that proposes a compact representation for UUIDs?

The draft aims to update the RFC-4122 to include a new textual representation for UUIDs encoded to base64url or base32. The new representation is temporarily called "UUID-NCName".

@bradleypeabody

Interesting - I will definitely look into it further and see if I can connect up with the author to collaborate.

@bradleypeabody

I wrote to Dorian Taylor (the draft author) about the possibility of collaborating. We'll see where that discussion goes.

@fabiolimace
Author

Great!

Should we open this issue again to show that there is work in progress? Or is it better to leave it closed?

@bradleypeabody

Good point, reopening it.

@bradleypeabody bradleypeabody reopened this Apr 9, 2021
@daegalus

I just wanted to add my voice that we should consider base32 encoding for sure, or base58 encoding, which just adds both cases of letters: 123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz, which doesn't include 0s and Os. And it will make UUIDs fairly short.

If we do use base32, we should require it to be lowercase, since certain characters are easier to distinguish in most fonts when they are lowercase.

I do also hope you merge with the NCName draft. The UUID format is a bit verbose and long for most use cases, so a different encoding would be great.

@bradleypeabody

The current proposal does not include an alternate encoding, but let's see how the discussion goes with the IETF. Personally I am in favor of documenting Crockford's base32 (https://www.crockford.com/base32.html) as a valid alternative encoding.

@kyzer-davis kyzer-davis added the Out of Scope Not part of the Draft label Jan 31, 2022
@kyzer-davis kyzer-davis reopened this Feb 22, 2022
@kyzer-davis kyzer-davis added Discussion Open Ended Conversations and removed Out of Scope Not part of the Draft labels Feb 22, 2022
@fabiolimace
Author

fabiolimace commented Feb 23, 2022

In this comment I will list my concerns about the base-32 format.

What algorithm will be used: the one described in RFC-4648 or a general modulo division?

RFC-4648 algorithm is more appropriate for a variable-length bitstream, for example, a variable-length UUID. But it can only be used with some bases (radix): 16, 32 and 64.

The generic modulo algorithm is useful to encode integers such as int, long, i32, u32, i64, u64 etc. It can also be used for (almost) any base (radix): 2, 8, 10, 16, 32, 36, 58, 62, 64 and so on. There are some projects on GitHub that aim to be universal base-n encoders.

So if we choose RFC-4648, we can only encode to base-16, base-32 and base-64. But we can't encode to base-58 or base-62, for example.

If we choose modulo division, it will not be easy in some languages to encode UUIDs due to the maximum sizes of primitive types. The implementer will have to use big numbers, like BigInteger in Java, and make some optimizations if efficiency is needed.

The RFC-4648 algorithm is much faster than the modulo division algorithm, but it produces a completely different result. If you want to generate an RFC-4648 output using modulo division, you must pad the input with bits to complete a group of 40 bits.

How to represent UUID as a URN using base-32?

Can we just concatenate the prefix urn:uuid: with the base-32 string like this example?

urn:uuid:01BX5ZZKBKACTAV9WEVGEMMVRY

Is the alphabet used to encode a UUID UPPERCASE or lowercase?

The Crockford alphabet is UPPERCASE encoded, just like the alphabet in RFC-4648. Will base-32 UUIDs also be encoded in uppercase? I think it must be lowercase to be consistent with the current canonical string representation. Either way, this needs to be present in the draft. Although decoders are case insensitive, if a database field contains mixed cases, the sort order can be compromised: UPPERCASE values come before lowercase in the "C" collation.

Reminder: ULIDs are UPPERCASE.

Are there other reasons to use Crockford's alphabet?

Crockford's alphabet is very easy on the eyes. But the 'flexibility' to decode 'I', 'O' and 'L' as 'ONE', 'ZERO' and 'ONE', respectively, can cause a small problem: it maps more than one string to a single UUID. For example, these 2 strings can exist as different primary keys in a database: 'OLBX5ZZKBKACTAV9WEVGEMMVRY' and '01BX5ZZKBKACTAV9WEVGEMMVRY'. This breaks the uniqueness and sort order promise of UUIDs.
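A minimal Python sketch of the aliasing concern, assuming the decoder applies Crockford's substitutions (O to 0, I and L to 1). The `normalize` helper is hypothetical; it only shows that two distinct strings collapse to the same canonical text, so a database would have to store the normalized form if it wanted one key per UUID.

```python
# Crockford's 32 symbols (uppercase form): no I, L, O or U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Decoder-side aliases described by Crockford: O -> 0, I -> 1, L -> 1.
ALIASES = str.maketrans("OIL", "011")


def normalize(text: str) -> str:
    """Return the canonical uppercase spelling of a Crockford base32 string.

    Hypothetical helper: upper-cases the input, applies the aliases and
    rejects any symbol that is still outside the alphabet.
    """
    canonical = text.upper().translate(ALIASES)
    for symbol in canonical:
        if symbol not in ALPHABET:
            raise ValueError(f"invalid Crockford base32 symbol: {symbol!r}")
    return canonical


a = "OLBX5ZZKBKACTAV9WEVGEMMVRY"
b = "01BX5ZZKBKACTAV9WEVGEMMVRY"

print(a == b)                        # the raw strings differ
print(normalize(a) == normalize(b))  # but they name the same value
```

If the raw strings were used as primary keys, both rows could coexist even though a lenient decoder would map them to the same 128 bits.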

Also, a UUID encoded with the Crockford base-32 alphabet can be confused with a ULID.

Calculate the size of variable-length UUIDs in base-32

The formula for calculating the size of a base-32 variable length UUID MUST be included in the draft. We know that a base 32 encoded UUID is 26 characters long (no padding), but we will have to calculate this for each UUID length other than the default length of 128 bits.
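One candidate for that formula, assuming 5 bits per character and no '=' padding, is the bit length divided by 5 and rounded up. The function name and the non-128-bit lengths below are hypothetical examples, not values from the draft.

```python
def base32_length(bits: int) -> int:
    """Characters needed to encode `bits` bits at 5 bits per character, no padding."""
    if bits < 0:
        raise ValueError("bit length must be non-negative")
    return (bits + 4) // 5  # integer form of ceil(bits / 5)


for bits in (128, 160, 256):
    print(bits, "bits ->", base32_length(bits), "characters")
```

For the default 128-bit UUID this gives 26, which matches the length stated above.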

About the optional check symbol

The check character at the end of a Crockford-encoded UUID could be a problem with the variable-length UUID. Is the last character part of the UUID or just a check symbol?

About NCName representation

I think it would be cool to join forces with the proposal of the NCName format. Its author did a great job in my opinion. The algorithm looks a bit complicated at first, but there are good reasons for it.

If it is not possible to work together, maybe it's better not to compete.

EDIT: the base-32 and base-64 alphabets, 'A-Z2-7' and 'A-Za-z0-9', respectively, used by the NCName proposal do not produce lexicographically sortable strings. So it is not desirable to use them with UUIDv6 and UUIDv7. It breaks the promise of ascending order.

Finally...

If I could, I would vote to leave base-32 out of scope for the many questions that may arise.

@kyzer-davis kyzer-davis changed the title from "Use base32hex alphabet as alternative text format" to "Discussion: Alternate Text Encoding Methods (Crockford's Base32, etc)" Feb 23, 2022
@kyzer-davis
Contributor

Re-opened and linking thread from Draft 03 PR:
uuid6/uuid6-ietf-draft#58 (comment)

@daegalus

daegalus commented Feb 23, 2022

I think we should stay modern and use lowercase, regardless of what is chosen.

Also for base32 we can always use our own alphabet. There are many alternatives, I wrote a base32 library for Dart and implemented multiple variants, including the option for a custom alphabet.

Example:

  • StandardRFC4648 - the default standard encoding
    • ABCDEFGHIJKLMNOPQRSTUVWXYZ234567
    • Padded with =
  • base32Hex
    • 0123456789ABCDEFGHIJKLMNOPQRSTUV
    • Padded with =
  • crockford
    • 0123456789ABCDEFGHJKMNPQRSTVWXYZ
    • Not Padded
  • z-base-32
    • ybndrfg8ejkmcpqxot1uwisza345h769
    • Not Padded
  • geohash
    • 0123456789bcdefghjkmnpqrstuvwxyz
    • Padded with =

Notice that some of them are lowercase.

We could also consider Base58.

If you want, I can write up some quick code to encode using the various bases to post as examples to see what we feel we like best.

Since I have UUID6/7/8 (draft 1) implemented in my dart uuid library and a base32 library I created for an OTP library, I can spit out examples pretty quick. I also have a local Base58 library and base64 is built in.

@fabiolimace
Author

fabiolimace commented Feb 23, 2022

About base-32

As for the Crockford alphabet, to adopt a subset of it, we have to list the restrictions in the document: (1) all lowercase letters; (2) no = padding; (3) RFC-4648 algorithm (assuming variable-length UUID); (4) with bit pads to form groups of 40 bits (assuming RFC-4648 algorithm); (5) do not decode the letters 'Oo', 'Ii' and 'Ll'; (6) do not confuse with ULID (!).

If the base32hex alphabet defined in RFC-4648 is adopted (don't confuse it with base32), the only restrictions are: (1) all lowercase letters; (2) no = padding. Much easier to explain.

If the main reason for the adoption of the Crockford alphabet is human friendliness, then that goal can also be achieved with GEOHASH, which is already lowercase and prohibits 'a', 'i', 'l' and 'o'.

The other base-32 alphabets are not lexicographically sortable.

About base-58 and base-62

I was also considering base-58 and base-62 as they are the same size as base-64 strings (no pads). But they MIX uppercase and lowercase letters. It's a problem because one of the reasons for this draft is the lexicographical order. A typical default collation is 'C', which sorts all uppercase letters BEFORE all lowercase letters. It works differently than other collations like 'en_US'. Below is an example using PostgreSQL. Note that the sort orders at the end of the code block are VERY different. The conclusion is that if you want a consistent sort order on the varchar key, don't mix uppercase and lowercase letters.

----------------------------------------------------
-- CREATING TABLES WITH DIFFERENT COLLATIONS
----------------------------------------------------

-- Standard Collation (ASCII order)
create temporary table tmp_collate_c (
	name character varying(255) COLLATE "C"
);

-- Collation for American English ('natural' order?)
create temporary table tmp_collate_us (
	name character varying(255) COLLATE "en_US"
);

----------------------------------------------------
-- INSERTING VALUES
----------------------------------------------------

insert into tmp_collate_c values ('AAAAAA');
insert into tmp_collate_c values ('BBBBBB');
insert into tmp_collate_c values ('CCCCCC');
insert into tmp_collate_c values ('aaaaaa');
insert into tmp_collate_c values ('bbbbbb');
insert into tmp_collate_c values ('cccccc');
insert into tmp_collate_c values ('AAAaaa');
insert into tmp_collate_c values ('BBBbbb');
insert into tmp_collate_c values ('CCCccc');
insert into tmp_collate_c values ('aaaAAA');
insert into tmp_collate_c values ('bbbBBB');
insert into tmp_collate_c values ('cccCCC');

insert into tmp_collate_us values ('AAAAAA');
insert into tmp_collate_us values ('BBBBBB');
insert into tmp_collate_us values ('CCCCCC');
insert into tmp_collate_us values ('aaaaaa');
insert into tmp_collate_us values ('bbbbbb');
insert into tmp_collate_us values ('cccccc');
insert into tmp_collate_us values ('AAAaaa');
insert into tmp_collate_us values ('BBBbbb');
insert into tmp_collate_us values ('CCCccc');
insert into tmp_collate_us values ('aaaAAA');
insert into tmp_collate_us values ('bbbBBB');
insert into tmp_collate_us values ('cccCCC');

----------------------------------------------------
-- THE RESULTS
----------------------------------------------------

select * from tmp_collate_c order by 1;

AAAAAA
AAAaaa
BBBBBB
BBBbbb
CCCCCC
CCCccc
aaaAAA
aaaaaa
bbbBBB
bbbbbb
cccCCC
cccccc

select * from tmp_collate_us order by 1;

aaaaaa
aaaAAA
AAAaaa
AAAAAA
bbbbbb
bbbBBB
BBBbbb
BBBBBB
cccccc
cccCCC
CCCccc
CCCCCC
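The same effect can be reproduced outside the database with a short Python sketch. Plain `sorted()` compares code points, which matches the "C" collation output above. The `key=str.lower` variant is only an assumed stand-in for a case-insensitive collation and does not reproduce the exact en_US tie-breaking shown above.

```python
# The twelve values inserted in the SQL example, in insertion order.
values = [
    "AAAAAA", "BBBBBB", "CCCCCC",
    "aaaaaa", "bbbbbb", "cccccc",
    "AAAaaa", "BBBbbb", "CCCccc",
    "aaaAAA", "bbbBBB", "cccCCC",
]

# Code-point order: every uppercase letter sorts before every lowercase letter.
byte_order = sorted(values)

# Case-insensitive order (stand-in only): letters are grouped regardless of case.
folded_order = sorted(values, key=str.lower)

print(byte_order)
print(folded_order)
print("same order:", byte_order == folded_order)
```

The two orders disagree as soon as uppercase and lowercase letters are mixed, which is the reason given above for avoiding mixed-case alphabets in a sortable key.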

@bradleypeabody

bradleypeabody commented Feb 24, 2022

The conclusion is that if you want a consistent sort order on the varchar key, don't mix uppercase and lowercase letters.

I agree. With how common case-insensitivity is in real-world applications, I don't think it would be wise to use a text format that relies on case sensitivity for proper ordering (one of the key things UUIDv6 and 7 address).

As for the Crockford alphabet, to adopt a subset of it, we have to list the restrictions in the document: (1) all lowercase letters; (2) no = padding; (3) RFC-4648 algorithm (assuming variable-length UUID); (4) with bit pads to form groups of 40 bits (assuming RFC-4648 algorithm); (5) do not decode the letters 'Oo', 'Ii' and 'Ll'; (6) do not confuse with ULID (!).

I agree but I think the simplicity is we just say something to the effect of:

"Encode every five bits with this alphabet: 0123456789abcdefghjkmnpqrstvwxyz If you don't have 5 bits for the last character then fill with zeros. There is no padding. When decoding discard final bits that don't form a complete byte. Encoding SHOULD emit lower case letters as per the alphabet shown. When decoding, implementations MUST accept either upper or lower case letters. Implementations MAY choose to interpret additional characters per "Crockford base32" or otherwise, as long as it doesn't conflict with the above alphabet."

It's terse and I'm sure will need to be expanded a bit, but I believe it does cover everything and makes it so applications can just use an existing Crockford base32 library for the encoding or just make something with that specific alphabet.
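As a rough illustration of the rules quoted above, here is a minimal Python sketch. The function names `encode` and `decode` are hypothetical, and the optional Crockford aliases (the MAY clause) are deliberately left out, so any character outside the 32-symbol alphabet is rejected.

```python
import uuid

ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"
DECODE = {symbol: value for value, symbol in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Encode every five bits with ALPHABET; zero-fill the last group; no padding."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 5)  # fill the final group with zeros
    return "".join(ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5))


def decode(text: str) -> bytes:
    """Accept upper or lower case; drop trailing bits that do not form a byte."""
    try:
        bits = "".join(f"{DECODE[symbol]:05b}" for symbol in text.lower())
    except KeyError as exc:
        raise ValueError(f"invalid symbol: {exc.args[0]!r}") from None
    usable = len(bits) - len(bits) % 8  # discard the incomplete final byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


u = uuid.UUID("01234567-0123-0123-0123-012345678abc")
text = encode(u.bytes)
print(text)
print(uuid.UUID(bytes=decode(text.upper())) == u)
```

A 128-bit UUID becomes 26 characters (130 bits with two zero fill bits), and decoding drops those two bits again.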

(6) do not confuse with ULID (!).

You will notice that if we use 48 bits for the timestamp and move to the var-ver field then ULID is actually compatible both in text and binary format with UUIDv7, with the exception of the requirement of setting var-ver at byte 9 to 0xE7. While this is not vital to the success of the spec, I don't think it's a bad thing either.

@bradleypeabody

And I'll also add here that I do think we should just pick one additional text format that is as useful as possible (specifically: as compact as possible while being case insensitive, and when treated as raw ASCII bytes sort the same as the raw UUID bytes, and it should be easy and reliable to distinguish from the old hex+dashes format). We don't want a proliferation of various possible formats, because we would like whatever ParseUUIDFromString() functions get written to be simple and unambiguous.

Keeping in mind one of the key goals here: If you were using this as an ID for a database record, and you did SELECT id FROM x;, what would you want to see come back for that ID column? The existing hex format with dashes is just unnecessarily long - that's really the key thing we'd be addressing with this.

Also in some cases there may not be great binary encoding/decoding support built into the database and someone might end up just storing UUIDs in text format as a string in the id field, in which case the length of the text format has a direct impact on size and performance, so it "matters" in that case and is not just about aesthetics and preference.

@fabiolimace
Author

fabiolimace commented Feb 24, 2022

You will notice that if we use 48 bits for the timestamp and move to the var-ver field then ULID is actually compatible both in text and binary format with UUIDv7

I disagree on this point. Even with the same timestamp size and the same charset, the base 32 strings will be different because the ULID specification uses modulo operations to encode the 128 bits to 26 characters. See the table below:

| algorithm | 01234567-0123-0123-0123-012345678abc | ffffffff-ffff-ffff-ffff-ffffffffffff |
|-----------|--------------------------------------|--------------------------------------|
| RFC-4648  | 04hmasr14c0j609304hmaswaqg           | zzzzzzzzzzzzzzzzzzzzzzzzzw           |
| Modulo    | 014d2pe09304hg28r14d2pf2nw           | 7zzzzzzzzzzzzzzzzzzzzzzzzz           |
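The table can be reproduced with a short Python sketch that treats the UUID as a 128-bit integer. The helper names are hypothetical. The only difference between the two rows is where the two spare bits go: the modulo form leaves them as leading zeros, while the RFC-4648 style appends them as trailing zeros, which is the same as shifting the integer left by two before dividing.

```python
import uuid

ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"
LENGTH = 26  # 26 characters * 5 bits = 130 bits, two more than a UUID has


def to_base32_digits(value: int, length: int) -> str:
    """Repeated division by 32, emitting exactly `length` digits."""
    digits = []
    for _ in range(length):
        value, digit = divmod(value, 32)
        digits.append(ALPHABET[digit])
    return "".join(reversed(digits))


def encode_modulo(u: uuid.UUID) -> str:
    """The 128-bit integer written in base 32 (spare bits are leading zeros)."""
    return to_base32_digits(u.int, LENGTH)


def encode_bitstream(u: uuid.UUID) -> str:
    """RFC-4648 style grouping from the left (spare bits are trailing zeros)."""
    return to_base32_digits(u.int << 2, LENGTH)


for text in ("01234567-0123-0123-0123-012345678abc",
             "ffffffff-ffff-ffff-ffff-ffffffffffff"):
    u = uuid.UUID(text)
    print(text, encode_bitstream(u), encode_modulo(u))
```

Both forms are 26 characters long and both sort in the same order as the underlying integers, but they are not interchangeable: a decoder has to know which alignment was used.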

@bradleypeabody

@fabiolimace Wow, I totally missed that. You're right. That is really strange - it seems like they prioritized breaking the string representation up into a specific timestamp part and random part, which seems like an odd choice over the simplicity of just assembling the raw bytes and then encoding the whole thing. They also waste a couple of bits (10 characters is 50 bits, not 48).

Okay, yeah well forget ULID then. I do think the rest of the approach is sound though.

@daegalus

Ok, I've been avoiding chiming in, because like I said previously I tend to hold unpopular opinions that I tend to try and ignore for the sake of playing nice with the general programming community and work. But I will try to be more detailed and explain some thoughts.

  1. I think @broofa has a good point on what we are trying to achieve and what we are doing with this spec. Are we just extending the old RFC or trying to truly modernize UUID for the future?

If we are trying to just extend, we should honestly drop encodings, the extension stuff, the LUID stuff, etc. We should stick strictly to adding the new UUID versions and their benefits.

If we are truly trying to modernize UUID, we need to take a stronger approach to it and stop being wishy-washy on stuff and just treat this as a new spec that we call UUID Modern or UUID 2.0 or something. The reason UUIDs are falling out of favor is partly because of sorting and such, but also because of encoding and format. The 8-4-4-4-12 encoding is old, inefficient and messy, especially because the dashes make things much harder to parse and all the version/variant information stays static. With a new encoding, that info is there, but encoded into a consistent string like everything else.

  2. It is my opinion that if library authors choose not to implement it because they are conservative, new libraries will be made and gain popularity. Just because a library isn't supporting it doesn't mean there is no desire to use it by other people, and someone will come in and build it.

Here are my unpopular opinions, and then I will modify them to be more palatable for everyone:

  • Take a hard-line stance and make the changes we want to see, without worrying about backwards compatibility of encoding

  • We want to make an efficient, future-proof RFC. While it's impossible to truly future-proof, between UUIDv8 and supporting any base encoding, or multiple, we can make this last.

  • By that regard, we should require Base2-128 or even 256 support, not just make it optional. It's not hard to implement and there are fast, simple implementations that work well.

Now, I understand there are many conservative developers that will not like this. So maybe a balanced approach would be easier to stomach.

  • Split the RFC into 2. One includes UUID6,7,8 similar to the first few drafts. No new encodings, no 7E or LUID.

  • In the new modern UUID RFC, add extensions, LUID, force new encodings and jettison the old encoding.

Though this will cause confusion and issues. So maybe stick to just 1 RFC, make a hard-line decision on one new encoding or pair.

  • Choose 1 pair or 1 encoding, period. Make it mandatory but don't allow any others so that library developers have more streamlined work when updating their libraries to support the new encoding.

I dunno, I personally still think we should do only 1 RFC, with all the changes forced, toss any annoying backwards compatibility and go with it. I personally am ready to do the work to update my library. But again, I am a sea of unpopular opinions in development.

Also apologies for poor formatting, wrote this on my phone.

@LiosK

LiosK commented Mar 12, 2022

Okay, the multiple alternatives idea is not popular. I'm not really attached to this idea, but let me elaborate on it again as my point still seems to be misread. I would propose something like the following text:


Text Format

The canonical text representation of UUID is 8-4-4-4-12 and every library MUST support this format. Additionally, libraries SHOULD support the Base-36 format and Base-62 format (see Appendix XX for detailed algorithm and digit sets). The format used by a UUID string can be detected by the length of textual representation.


In this way the spec can define unambiguous data exchange formats (setting variable-length stuff aside) that all the implementations should follow. I have exactly the same opinion as @broofa, quoted below.

A Standard should provide a canonical means by which systems interact. The one thing a Standard must do, and do well, is eliminate any ambiguity and uncertainty about how a system should behave.

We're on the same page I believe.

@bradleypeabody

Just one other idea of how we might approach this I wanted to throw out there:

What if we associated the new combined var-ver field (the 111b variant) with the Crockford base 32 text format? I.e. "The canonical text format for UUIDs with variant 10b (versions <= 6) remains 8-4-4-4-12 as per RFC4122. For the new variant 1110b specified in this document (versions 7 and 8) the canonical text format is Crockford base 32."

This would leave existing implementations as-is but also allow us to switch to Crockford base 32 format for v7 and v8, and have only one canonical format for a given version. Implementation burden associated with the new versions would be present but minimal.

(@LiosK I think it's fine to mention the idea of encodings in other bases - my opinion is that SHOULD is too strong and alternate encodings are a MAY, but regardless I think a core concept is that since UUIDs should be "as opaque as possible" it also stands to reason that plenty of applications will want to just generate UUIDs and treat them as regular strings from there on, in which case the encoding is solely a matter of application-specific requirements.)

@martinheidegger

One thing about Crockford's format is that it is not an RFC and also has edge cases that are not well defined. (I had a conversation about this here, referencing a much longer article.) It may be necessary to prepare a proper Crockford-compatible base32 RFC to reference.

@daegalus

daegalus commented Mar 12, 2022

I think when we refer to Crockford's base32 we only mean his encoding alphabet and order, not the algorithm. I think we refer to the RFC4648 algorithm but just use Crockford's alphabet choice. At least that's how I have it implemented in the Dart base32 library and many other base32 libraries I've seen.

@LiosK

LiosK commented Mar 12, 2022

@bradleypeabody,

my opinion is that SHOULD is too strong

You're right! My bad: I implicitly meant "libraries SHOULD" by "implementations SHOULD". I adjusted the previous comment to clarify this. The updated text is a little bit awkward as spec text, but it illustrates my intention well.

UUIDs should be "as opaque as possible"

I agree with this to some extent, but

plenty of applications will want to just generate UUIDs and treat them as regular strings from there on, in which case the encoding is solely a matter of application-specific requirements

I don't necessarily agree with this. A standard is helpful only when it coordinates multiple implementations to interact and work together. In this sense, the alternative encodings recommended in the spec are not application-specific things. If the spec clearly defines some encodings, then many libraries will implement the encodings in common. In this way, the spec can help multiple libraries and applications talk with each other using the new encodings.

@ben221199

Maybe we could reintroduce the legacy UUID format, next to the 8-4-4-4-12 format, so:
34dc23469000.0d.00.00.7c.5f.00.00.00 (legacy) and 8a885d04-1ceb-11c9-9fe8-08002b104860.
This format is specific for Variant#0 and leaves out the reserved field.

Other encodings seem to be out of scope for me, because I think the decision to use other "encodings" is up to the developer. The only requirement should be that there is space for 128 bits.

@broofa

broofa commented Mar 12, 2022

What if we associated the new combined var-ver field (the 111b variant) with the Crockford base 32 text format?

@bradleypeabody Is the implication here that applications should not use 8-4-12 with the new versions? I don't see users responding well to that. Too many existing DB columns / function signatures / UI widgets are set up for 8-4-12.

@broofa

broofa commented Mar 12, 2022

Having fun on a weekend morning ....

From the PR4 draft (emphasis mine):

Where required, UUIDs defined by this specification and [RFC4122] MAY be encoded utilizing new techniques such as, but not limited to, Base32, Base36, or Base64. Applications MAY also utilize other encoding techniques such as modulo division or alternate alphabets such as Crockford's base32

Based on the above, I believe the following would all be valid encodings of 08a0c2eb-57c8-4bc5-ae66-1e7e39fd1d99:

  • ⢙⠝⣽⠹⡾⠞⡦⢮⣅⡋⣈⡗⣫⣂⢠⠈
  • 𓊉𓐋𓎕𓇅𓊄𓇌𓁊𓁪𓈧𓂷𓎆𓐙𓀄
  • 🂹🃖🂱🃟🃙🃘🃇🂧🃆🃙🂺🃑🂫🃁🃜🂵🃋🂫🂬🃈🂨
  • ̸̷̩̲̦͈̹̞̓̓͑̓ͮ̎͗͌̃́͡
  • 𒐱𒐖𒐿𒐬𒐽𒑰𒑂𒑚𒐺𒑧𒑨𒑐𒐀𒐲𒑝𒐭𒑲𒑛

See https://codepen.io/broofa/pen/ZEJKWOQ?editors=0010 for details.

@bradleypeabody

bradleypeabody commented Mar 12, 2022

In reply to a few recent posts:

@broofa

Too many existing DB columns / function signatures / UI widgets are set up for 8-4-12.

Unfortunately, you're probably very right about this. The question would then be if it's too much to ask for implementations doing the work to implement the new var-ver field to be updated. Probably it is, but I just wanted to throw it out there.

If this is generally the case, then my position would stay with the idea of keeping 8-4-4-4-12 as the standard, defining one additional format and call it "compact" or something like that, and then also just mention that people can do whatever they want for their own use cases, just don't expect it to be included in every UUID library.

We would basically be telling library authors that their toString() stays the same, we recommend adding compactString(), and if they do then also update parse() to accept either one, and if they don't like that then there's no law against using whatever encoding is convenient for their own application, it just won't be in the UUID library.

@LiosK

A standard is helpful only when it coordinates multiple implementations to interact and work together. In this sense, the alternative encodings recommended in the spec are not application-specific things.

I agree but I don't follow how having a bunch of variation in the encoding aligns with this idea (and reading this back again I realize I might be conflating your latest proposal with an earlier one). But overall, here's the thing I don't understand: Let's say we had a database that implemented several different text formats for UUID, including base32 and base64, and then you have an application which is using these generated IDs. Your application inserts a new record, which generates a new UUID (let's say for this example the database generated the UUID, although a similar problem exists regardless of where it is created). Now the app does "SELECT ID FROM ...". What format does the application expect to be returned? The only answers I can think of that make any sense are:

  • A single, specific, standard predefined format (e.g. 8-4-4-4-12).
  • It doesn't matter because the application will just use the opaque string.
  • Some predetermined format, e.g. specified as an option using a database feature designed for this. In this case, from my perspective, this is "application specific", because the only way to get this right is for the database and the application to agree ahead of time on which encoding was chosen.

Does it really help to list out a number of different possible formats? If I am a database vendor that is implementing this, how do I choose which of these various possible encodings I should actually spend time on and implement? And why is this the database vendor's problem, as opposed to just letting applications use their own language's encoding functions? Most languages already implement things like base32, base64 and others, so if the application already has to decide ahead of time which encoding to use, would they not be free to use whatever is convenient and available to them? (Like why is base36, mentioned in your text, somehow better than base64?) What if what is useful to me or available is not one of the ones that we thought to mention in the spec? Is that now "less valid" than choosing one of the options you've outlined? I just think there is way too much variation here to expect UUID library authors and databases, etc. to include all of this encoding in each implementation. And even if we just pick the two that you suggest, base36 and base62, in a way that could easily end up being worse because now everyone making a UUID implementation will feel compelled, but not required, to implement these, and users will probably end up using base32 or base64 anyway, just because it's more convenient and familiar, and they won't know what encodings are available on the other side (whatever app we're worried about reading these UUIDs).

I think whatever recommendation is provided for this needs to help answer things like "I am a UUID library author, what encoding(s) should I implement vs leave to the user to figure out?" and the same thing for database vendors, etc.

And I don't understand how we can say that base36 and base62 are somehow a better answer than base32 and base64 or base32hex, or any number of other possible encodings. Is there a specific analysis here that makes base36 and base62 the optimum choice?

My goal with championing Crockford base32 has been to just introduce one format that is significantly more useful than 8-4-4-4-12, and (hopefully) to do it in a way that doesn't break existing implementations. I didn't pick Crockford base32 just because I liked it or for one or two specific reasons - my analysis as to why it has the highest utility is outlined above (works in many places - email, DNS, file names, URLs, case insensitive, plus "less swear words", and it can be reliably distinguished from 8-4-4-4-12 regardless of length, and probably a few points above I'm forgetting).

Anyway, sorry the above turned into a bit of rant. To reel it in, my specific feedback is, given this text:

The canonical text representation of UUID is 8-4-4-4-12 and every implementation MUST support this format. Additionally, implementations SHOULD support the Base-36 format and Base-62 format (see Appendix XX for detailed algorithm). The format used by a UUID string can be detected by the length of textual representation.

Why two formats, and why those two formats specifically? Every format we add is more variation and means people will have to figure out which format was chosen. Telling people to use the length is, I think, not a good idea, because we're leaving this open ended: what happens if someone uses base32 or base32hex? You can't tell the difference between those from the length. And these two base36 and base62 are not common formats, so every library implementor will have to figure out how to deal with this (whereas Crockford base32 is pretty common).
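The length concern can be made concrete with a little arithmetic. The following is a minimal sketch, assuming every encoding represents the full 128-bit value at a fixed width with no padding characters; the names `BASES`, `encoded_length` and `formats_by_length` are hypothetical and not taken from any draft.

```python
# Candidate encodings and their alphabet sizes. base32, base32hex and
# Crockford base32 all use 32 symbols; only the alphabet differs.
BASES = {
    "base16 (no dashes)": 16,
    "base32": 32,
    "base32hex": 32,
    "crockford32": 32,
    "base36": 36,
    "base62": 62,
    "base64 (unpadded)": 64,
}


def encoded_length(base: int, bits: int = 128) -> int:
    """Smallest number of digits that can hold any `bits`-bit value."""
    max_value = (1 << bits) - 1
    length = 0
    while max_value > 0:
        max_value //= base
        length += 1
    return length


def formats_by_length() -> dict:
    """Group the candidate encodings by their fixed text length."""
    groups = {}
    for name, base in BASES.items():
        groups.setdefault(encoded_length(base), []).append(name)
    return groups


if __name__ == "__main__":
    for length, names in sorted(formats_by_length().items()):
        print(length, names)
```

Under these assumptions Base-36 and Base-62 have distinct lengths, but any length shared by more than one entry cannot be resolved by length alone.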

@daegalus

Are we just extending the old RFC or trying to truly modernize UUID for the future?

I think we need to work this back and forth against the proposal instead of trying to answer this by itself. In an ideal world we would do both with one RFC. If this cannot be accomplished, then we can discuss other options.

@ben221199

I think we need to work this back and forth against the proposal instead of trying to answer this by itself. In an ideal world we would do both with one RFC. If this cannot be accomplished, then we can discuss other options.

This draft SHOULD "update" RFC 4122 by adding new versions, like UUIDv6, etc. I have started another draft apart from this repository that describes literally everything about UUID, also things that were not in RFC 4122, but SHOULD have been added back then. That draft will also describe UUIDv6 AFTER publication of this repository (uuid6-ietf-draft) and will "obsolete" RFC 4122. I think that is the best way. So, for this draft here, focus on the main purpose of this draft: new versions. Not the other "out-of-scope" shit. I will make an issue in the near future where I will sum up some things.

@bradleypeabody

@ben221199 I think that regardless of how it is organized we should collaborate on the work. The current draft here definitely has some things that various people think are "out of scope" which I think are "necessary in order to actually make UUIDs useful for modern applications, so if we're not addressing it why are we doing all this work here". I look forward to your separate description of what you're referring to.

@ben221199

@bradleypeabody Yeah, collaboration is fine, but I think we should separate things that "update" the spec and things that "obsolete" the spec. I think I will write the first draft of what I have in mind and will show it then. I hope it will be somewhat clear then :)

@LiosK

LiosK commented Mar 13, 2022

@bradleypeabody,

having a bunch of variation in the encoding

list out a number of different possible formats

I've never meant this. The any-base any-alphabet code I posted was just for illustration of the algorithm. At this stage, I would propose only two specific, concrete, unambiguous alternative formats: a case-insensitive but longer format and a shorter but case-sensitive format. Under this scenario, the spec can easily provide a concrete and unambiguous way to encode and parse multiple UUID formats. Applications that go for other application-specific encodings might face application-specific hassles but that's not what the standard should (or can) address.

Accordingly, most of the questions you raised are not really relevant to my point, and the ones that are relevant apply equally to your single Crockford32 alternative approach. What does a database return to SELECT ID FROM ... if the spec defines the canonical 8-4-4-4-12 and the alternative Crockford32? Currently, database vendors can only return 8-4-4-4-12 because there is no other consensus format to follow, but if the spec defines Crockford32 in addition, then the vendors can provide an option to return Crockford32 and can accept Crockford32 as a valid input in INSERT statements. Ultimately, the choice of format is application-specific, but the spec can help provide multiple options in making such a choice, and that is the only sensible way in which the spec can advocate Crockford32 to the real world.

To put it differently, if the spec adds one alternative Base-36 format, then UUID libraries will generally support a Base36String() method and as a result an application will be able to choose the Base-36 format for its application-specific ID. And if the spec adds one more alternative Base-62 format, then UUID libraries will generally support Base62String() and an application can choose this, too. Does the latter Base-62 do a lot of incremental harm? This is my point. Adding one alternative changes a lot of things already, but adding one more does not change as many things as the first alternative, while it can possibly double the number of customers served. IMO the cost-benefit profiles of the "canonical 8-4-4-4-12 + one alternative" approach and the "canonical 8-4-4-4-12 + two alternatives" approach are quite similar.
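As an illustration of what such methods could look like, here is a minimal sketch. It assumes the UUID is treated as one 128-bit unsigned integer, zero-padded to a fixed width (25 digits for Base-36, 22 for Base-62), and it assumes alphabets in ASCII order so that text sorts like the binary value. The alphabets and the names `base36_string`, `base62_string`, `encode` and `decode` are hypothetical; the thread does not fix any of them.

```python
import uuid

# Assumed alphabets, both in ascending ASCII order.
BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def encode(u: uuid.UUID, alphabet: str, width: int) -> str:
    """Encode the 128-bit integer value as a fixed-width string."""
    n = u.int
    digits = []
    for _ in range(width):
        n, remainder = divmod(n, len(alphabet))
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


def decode(text: str, alphabet: str) -> uuid.UUID:
    """Inverse of encode(); raises ValueError on bad characters or overflow."""
    n = 0
    for ch in text:
        n = n * len(alphabet) + alphabet.index(ch)
    return uuid.UUID(int=n)


def base36_string(u: uuid.UUID) -> str:
    return encode(u, BASE36, 25)


def base62_string(u: uuid.UUID) -> str:
    return encode(u, BASE62, 22)


if __name__ == "__main__":
    u = uuid.UUID("0622e7c2-d01a-7d65-8fe3-f57b68c36204")
    print(base36_string(u), base62_string(u))
```

The two methods share one code path, which is roughly the "incremental cost" argument made above: the second format adds an alphabet and a width, not a second algorithm.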

Also, the key problem I want to solve is that currently it is not a trivial task at all for an application to go for a compact UUID format. A modern application can be implemented using multiple languages (e.g. JS/Swift/Kotlin for frontend, Go for backend, SQL for database, Python for log analysis) and it is not an easy job to find right Base-X libraries for all of these languages involved. As you've observed, UUID is generally used as opaque string but I would suspect this practice is just a forced choice due to the difficulty to switch to a right encoding depending on contexts.

Is there a specific analysis here that makes base36 and base62 the optimum choice?

Actually, I don't have a strong opinion over specific bases or alphabets, so in the proposed text I would actually mean Base-36ish and Base-62ish, but a lot of people have misread this notation as the any-base any-alphabet approach, and thus I tried to be specific there.

That said, I personally believe that Base-36 is better than Base-32 from a library user's point of view just because Base-36 saves one more character, and Base-62 is better than Base-64 just because Base-62 uses alphanumeric characters only. These two benefits are sufficient reasons to push library authors to harder work, but I admit it's just a matter of view point.

And these two base36 and base62 are not common formats, so every library implementor will have to figure out how to deal with this (whereas Crockford base32 is pretty common).

This is a slightly different topic, but I think it's an implicit assumption we've been making. I've observed that most ULID implementations implement their own Crockford32 encoder/decoder instead of relying on an external library. Based on this observation and from my experience implementing base32hex in many languages in my personal project, I'd speculate that Base-X encoding/decoding is such a small function that library authors feel hesitant to add another external dependency. Therefore, availability of existing implementations might not be an important factor in choosing a right encoding/decoding algorithm, because library authors are likely to implement it from scratch anyway.

@ben221199,

I think updating RFC 4122 doesn't mean that no addition of features is permitted. We can add a lot of things to RFC 4122 without obsoleting any of its components. Only a Sith deals in absolutes. We have a bunch of options in between.

@bradleypeabody

@LiosK Okay and thanks for taking the time to write that up and clarify. I think I understand where you're coming from.

In terms of a specific proposal, it sounds like you're proposing two things: 1. the bases you've selected and 2. having two different alternate encodings (as opposed to just one). Just to state my concerns on these two:

Regarding base36 vs base32 crockford, my main concern is just what is required to implement and how familiar this will/won't be for developers. Crockford base32 is a lot more well-known and well supported, and can be adapted easily from any base32 encoder. I think this makes the bar lower in terms of what hoops developers will have to jump through to implement, and I think this is an important concern.

I'd speculate that Base-X encoding/decoding is such a small function that library authors feel hesitant to add another external dependency

Yeah it's hard to say and varies from language to language. In Go for example there is a base32 encoder/decoder which supports custom alphabets in the standard library, so there's essentially zero benefit to writing your own. I think the same is true of Python. But I realize the situation is different in JS and probably some other languages too.
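For Python specifically, a rough sketch of the "adapt an existing base32 encoder" route follows. To my knowledge the standard `base64` module exposes fixed base32 alphabets rather than a custom-alphabet option, so this sketch adds one character-substitution step on top of `b32encode`. It assumes the 16 bytes are encoded left to right as in RFC 4648 (a ULID-style layout that treats the value as one integer would give different text). The function names are hypothetical.

```python
import base64
import uuid

RFC4648 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

_ENCODE_TABLE = str.maketrans(RFC4648, CROCKFORD)

# Decoding map: Crockford symbol -> RFC 4648 symbol with the same value.
_DECODE_MAP = dict(zip(CROCKFORD, RFC4648))
# Crockford's spec decodes O as 0, and I and L as 1.
_DECODE_MAP.update({"O": "A", "I": "B", "L": "B"})


def crockford_encode(u: uuid.UUID) -> str:
    """26-character Crockford base32 text for a UUID."""
    std = base64.b32encode(u.bytes).decode("ascii").rstrip("=")
    return std.translate(_ENCODE_TABLE)


def crockford_decode(text: str) -> uuid.UUID:
    """Case-insensitive inverse of crockford_encode()."""
    if len(text) != 26:
        raise ValueError("expected 26 characters")
    try:
        std = "".join(_DECODE_MAP[ch] for ch in text.upper())
    except KeyError as exc:
        raise ValueError(f"invalid Crockford base32 character: {exc.args[0]!r}") from None
    # 16 bytes encode to 26 symbols plus 6 '=' padding characters.
    return uuid.UUID(bytes=base64.b32decode(std + "======"))


if __name__ == "__main__":
    u = uuid.UUID("0622e7c2-d01a-7d65-8fe3-f57b68c36204")
    print(crockford_encode(u))
```

The decoder maps symbols explicitly rather than translating and passing the rest through, because I, L, O and U are valid RFC 4648 symbols and would otherwise be decoded silently with the wrong value.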

Regarding adding two more encodings instead of one, my concern here is making it so library authors can still have a single Parse() function which can reliably deduce the format from the input. Yes, I know that you can use the length to do this for base36 vs base62, but it seems brittle in the face of UUID Long (yes I know a bunch of people hate this idea and it's a separate subject - please yell at me/others on the appropriate thread about that one). There is also a bit of a "slippery slope" aspect to each additional encoding that is added - "why not support base32 Crockford AND base36, since we can use the length to distinguish", and so on. After a decision is made here, it will have to be defended against future reviewers and the IETF.

If we pick just one additional format that is easy for people to implement, the whole thing becomes pretty simple. The String() method in people's libraries stays the same (8-4-4-4-12). A CompactString() or similar is added. And Parse() is updated to accommodate the output of CompactString() in addition to String(). (I feel like there's a better name than "compact"...). In the database example from earlier, the database vendor would need to provide an option to send the result back in "compact" format, but at least any Parse() function that is updated to match the spec can reliably receive either text format - I think this property goes a long way toward making this approach work.
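The three-method shape described here could look roughly like the sketch below. It assumes the compact format is 26-character Crockford base32 over the 16 bytes in RFC 4648 bit order, and it tells the two formats apart by the hyphen positions of 8-4-4-4-12. The names `to_string`, `compact_string` and `parse` are hypothetical stand-ins for String(), CompactString() and Parse(); for brevity the parser is strict and does not accept Crockford's I/L/O aliases.

```python
import base64
import uuid

_STD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
_CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def to_string(u: uuid.UUID) -> str:
    """Canonical 8-4-4-4-12 text, unchanged from existing libraries."""
    return str(u)


def compact_string(u: uuid.UUID) -> str:
    """Assumed compact form: 26 Crockford base32 characters."""
    std = base64.b32encode(u.bytes).decode("ascii").rstrip("=")
    return std.translate(str.maketrans(_STD, _CROCKFORD))


def parse(text: str) -> uuid.UUID:
    """Accept either text format and deduce which one was given."""
    if len(text) == 36 and all(text[i] == "-" for i in (8, 13, 18, 23)):
        return uuid.UUID(text)
    upper = text.upper()
    if len(upper) == 26 and all(ch in _CROCKFORD for ch in upper):
        std = upper.translate(str.maketrans(_CROCKFORD, _STD))
        return uuid.UUID(bytes=base64.b32decode(std + "======"))
    raise ValueError("unrecognized UUID text format")


if __name__ == "__main__":
    u = uuid.UUID("0622e7c2-d01a-7d65-8fe3-f57b68c36204")
    assert parse(to_string(u)) == parse(compact_string(u)) == u
```

With only one alternative format, the dispatch in `parse` has exactly two branches, which is the simplicity this comment is arguing for.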

If we could deprecate the old format, that in some ways would be ideal, but I do think it's too big of a change and could easily harm adoption more than it helped by declaring every existing implementation as deprecated - as opposed to an improvement that can be implemented by library authors as time allows. Deprecating things in RFC4122 I suspect will also reduce the odds of this new spec making it to an RFC.

@ben221199

@LiosK

I think updating RFC 4122 doesn't mean that no addition of features is permitted. We can add a lot of things to RFC 4122 without obsoleting any of its components. Only a Sith deals in absolutes. We have a bunch of options in between.

When updating an RFC, it adds things to the specification. Take a look at RFC 3501 for IMAP 4rev1. It describes the whole IMAP protocol. It is UPDATED by some other RFCs, that describe extensions, and it is OBSOLETED by one other RFC, that describes the whole IMAP protocol again, but IMAP 4rev2. I think we should do something similar here.

@LiosK

LiosK commented Mar 13, 2022

@bradleypeabody,

I understand your point. I don't mean to stick to my idea and I'm willing to follow the consensus. I've just believed that the two alternatives approach is one viable option that the new RFC can take and wanted it to be put on the table correctly understood.

A couple of technical points:

it seems brittle in the face of UUID Long

I believe alternative formats can work with the variable-length strategy if designed carefully, but they definitely constrain each other. They are in a trade-off relationship; we might be forced to make an exclusive choice. We have to carefully choose which group of customers to serve, and as a result I am against the variable-length idea.

Crockford base32 is a lot more well-known and well supported

Base-62 is tricky, but JS/Java/Rust have native encoding/decoding support for Base-36 and Python has native decoding support. JS/Rust/Python do not even require an import statement. We can also find some hidden efforts by Swift developers. In JS, for example, Base-36 requires literally zero effort because the following one line works perfectly:

0x0622e7c2_d01a_7d65_8fe3_f57b68c36204n.toString(36).padStart(25, "0");

Base-36 tends to be implemented as part of BigInt operations and thus it's more widely available than you believe (than Crockford32 I guess). Base-36 sounds more prominent to me based on your argument.

Anyway, Base-X encoding/decoding needs only a hundred lines of code and I just feel like working for library users rather than authors.

@ben221199,

We can add new versions to RFC 4122 without obsoleting it, and similarly we can add new text formats to RFC 4122 without obsoleting it, can't we?

@ben221199

@LiosK

We can add new versions to RFC 4122 without obsoleting it, and similarly we can add new text formats to RFC 4122 without obsoleting it, can't we?

Yes, we can add new versions without obsoleting RFC 4122. So, we are "updating" it. New text formats should be possible too, but we should think carefully about their purpose, of course.

The draft I'm writing in another repository at the moment is planned to "obsolete" RFC 4122, because the specification describes already existing variants and versions AND could/will (maybe) have new ones that are described in this one.

Here is a post that explains it a little bit and ironically points to RFCs defining the terms: https://stackoverflow.com/questions/32873577/whats-the-difference-between-obsoletes-and-updates-in-rfcs

@kyzer-davis
Contributor

kyzer-davis commented Mar 14, 2022

Cross-posting my comment from the variable length UUID thread for visibility.


Group, great discussion, this is why I author the proposed RFC text!

Converting from GitHub threads to "RFC Speak" always drives great conversations and uncovers things that may not have been considered. PR uuid6/uuid6-ietf-draft#85's text has been written in a way that I can easily remove E Variant, Alt Encoding, and UUID Long or transpose that XML structure to an alternate Draft that focuses on these topics.

That being said, we have a few engineering challenges with these sections but I am confident this group will be able to derive a great solution! I reviewed the last comments and I will summarize a few of the topics as usual.


Signaling UUID Alt Encoding Method(s)

[..snip..] If System A creates a UUID / UUID Long, encodes it with a random method and sends it to System B as urn:uuid:<encoded_uuid> how does system B determine how to decode that UUID?

  • Shared Knowledge Systems: We all know we are steering away from shared systems.
  • draft-multiformats-multibase: This is another Draft RFC. Until IANA adopts it I cannot in good faith adopt it into this draft.
  • Pick one base encoding: Everybody is on the same page and encode/decode works perfectly fine.
    • Downsides: Everybody has an opinion on this matter and usually has a reason for why their base is the best base. Possible solution: Pick one and steer all others to v8/v8E?
  • Extend UUID's URN: Append the base along with urn:uuid: thus urn:uuid:<base>:<encoded_uuid> such as
    urn:uuid:base32:3JT57U1QAH859QLSQGSJJ0H7CT

[..other topics from comment truncated.. see the original for more...]


General URN Author note, I would need to do a deep dive on RFC8141 to ensure any potential URN proposals are valid syntax before authoring text.

Editor Note: Must include text about assuming urn:uuid: implicitly equals urn:uuid:base16hd:128: for backwards compatibility reasons. base16hd - Base16 UUID with Hex + Dashes.
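To make the URN idea easier to discuss, here is a minimal parsing sketch. It assumes colon-separated fields after `urn:uuid:` and handles the three shapes that appear above: the plain form, `<base>:<encoded_uuid>`, and `<base>:<bits>:<encoded_uuid>`. The label `base16hd` and the default length of 128 come from the editor note; everything else (the `UuidUrn` type, the function name, treating the length as optional) is a hypothetical illustration and has not been checked against RFC 8141 syntax.

```python
from typing import NamedTuple, Optional


class UuidUrn(NamedTuple):
    base: str             # encoding label, e.g. "base32" or "base16hd"
    bits: Optional[int]   # declared length in bits, if the URN carries one
    value: str            # the encoded UUID text, not yet decoded


def parse_uuid_urn(urn: str) -> UuidUrn:
    """Split an extended UUID URN into its fields without decoding it."""
    prefix = "urn:uuid:"
    if not urn.lower().startswith(prefix):
        raise ValueError("not a urn:uuid: URN")
    fields = urn[len(prefix):].split(":")
    if len(fields) == 1:
        # Legacy form: implicitly hex with dashes, 128 bits.
        return UuidUrn("base16hd", 128, fields[0])
    if len(fields) == 2:
        return UuidUrn(fields[0], None, fields[1])
    if len(fields) == 3:
        return UuidUrn(fields[0], int(fields[1]), fields[2])
    raise ValueError("too many fields in UUID URN")


if __name__ == "__main__":
    print(parse_uuid_urn("urn:uuid:base32:3JT57U1QAH859QLSQGSJJ0H7CT"))
```

A receiver would then pick a decoder based on `base`, which is what removes the need for shared out-of-band knowledge between System A and System B.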

@sergeyprokhorenko

I am convinced that JSON is a much more flexible and convenient message format than URN. See #2

@kyzer-davis kyzer-davis added the Out of Scope Not part of the Draft label Mar 16, 2022
@kyzer-davis
Contributor

kyzer-davis commented Mar 16, 2022

Announcement

I had a great discussion with @bradleypeabody and this topic has officially been marked out of scope for Draft 03 (and any future draft.)
The XML text is retained and over the next few weeks I will author a separate Draft 00 which includes this topic specifically.

For now please focus on the technical challenges proposed by my previous comment: #3

Edit: To further clarify, Draft 03 will cover UUIDv6 through v8 + Max UUID. The new Draft 00 will cover E Variant, Alternate Encoding and UUID Long. These are two drafts that cover different topics, so implementations may choose what they want to support, e.g. an implementation supports RFC8675309 for v7 but not RFC123456789 for alt encodings.

@kyzer-davis
Contributor

Group,

Following the previous announcement I have drafted up a new RFC Draft document to cover this topic.

Since the discussion threads are in this repo I decided to create a new folder for the topic.
Github Pages picks up the folder nicely and thus Draft 00 of "New UUID Encoding Techniques" can be found here:


Additionally, I have authored an "Extended UUID URN namespace" for conveying encoding type and length of a UUID to other applications defined by this document. I still have more research on URNs to do but I feel confident enough the proposed URN is backwards compliant with RFC4122 and also compatible with RFC8141.


Personal Note: I do think "inclusive vs exclusive" is the best way to ensure that we benefit as many folks as possible. If we can solve the problem of conveying the encoding type (and length for UUID Long) then I believe we give implementers a tremendous amount of power they did not have before.

That being said, as always, I look forward to your responses bringing up issues, problems, caveats, and things to think about based on your point of view!

@daegalus

Based on the most recent post about the draft on Hacker News, an alternate encoding is the most commonly mentioned desire. https://news.ycombinator.com/item?id=31715119
