Discussion: Variable Length UUIDs | UUID Long #2
Comments
I think the more common name for "variable-length UUID" is "string" :) |
Throwing my 2 cents in:
Length
Q: Are we sticking with 128 bits or are we introducing variable length?
Sources:
|
No. Not enough real-world use cases, evidence of need, or experience among contributors to merit including this. |
Agreed. But this is also kind of the whole point, see next.
Let's break it down like this: We're making a new UUID spec. One of the whole points of this is to make it useful for database identifiers. I'm hoping that we will see things such as SQL statements like So the question becomes: is there a use case for supporting
If this is really the case and it's not needed, then fine, maybe it's a moot point. I admittedly haven't studied the collision probabilities carefully enough to have a firm opinion on it. But I just don't see how we can say that one degree of collision resistance is "good enough for everyone", that's the point I'm stuck on. Maybe an idea here would be to get a list put together of collision probabilities at various lengths. I still don't know how to convert "1 in (some huge number)" into "good enough for most applications" - I'm open to ideas on how to evaluate this. The other questions brought up, while valid, do have straightforward answers:
It's not, the thing holding the value is responsible for knowing its length.
Entropy/collision resistance
No. Just like lengths in most systems are specified in bytes, we would limit to byte-aligned lengths.
Add one more hyphen after the 16th byte and then no more in the ensuing digits.
It's just a shorter form with less entropy. I'm not stuck on this one, if we added the ability to make longer UUIDs but not shorter ones, that would at least handle the "is this enough entropy" concern, which is the primary one.
It depends on the protocol using the UUID. If there is an existing protocol/format which expects to transmit a UUID as exactly 16 bytes, well, then yeah, that's not going to work with variable length. But a lot of transports already have a separate encoding mechanism for the length. (I.e. if it's in a field in the database then the database already keeps track of how many bytes that field is, since it would be a string or a blob; JSON has delimiters, msgpack has a length number, and so on.) So again, I understand why some applications wouldn't need or want variable length UUIDs and couldn't use them. But is there really a reason to explicitly make NewUUID(20) "invalid", even though databases could easily handle this because they have a means of storing the length outside of the actual value?
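The text form suggested a few answers up (one more hyphen after the 16th byte, then none in the remaining digits) can be sketched as follows. This is only an illustration of that suggestion, not part of any spec; `format_long` and `parse_long` are hypothetical names, and lowercase hex is an assumption.

```python
def format_long(raw: bytes) -> str:
    """Render 16 or more bytes as 8-4-4-4-12 hex, plus one extra group for the tail."""
    if len(raw) < 16:
        raise ValueError("need at least 16 bytes")
    h = raw.hex()
    # Standard 8-4-4-4-12 grouping for the first 16 bytes.
    parts = [h[0:8], h[8:12], h[12:16], h[16:20], h[20:32]]
    if len(raw) > 16:
        # One more hyphen after the 16th byte, then no more.
        parts.append(h[32:])
    return "-".join(parts)


def parse_long(text: str) -> bytes:
    """Inverse of format_long; the length comes from the string itself."""
    return bytes.fromhex(text.replace("-", ""))
```

Note that the string carries its own length here, which is the point made above about transports that already delimit values.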
Not our problem. RFC4122 is fixed length and that's fine. UUIDv8 allows people to do whatever they want. I think there is enough flexibility here for people to do what they need. If someone needs some specific variable length UUID that isn't UUIDv7, they just use UUIDv8. |
+1 to the opinions that the variable length stuff should NOT be in scope for the new RFC. But if it were:
I don't think the shorter UUID is feasible. Short ID implementations usually utilize application-specific shared knowledge to ensure uniqueness, while the UUID specs are not designed to do so. It is unlikely to be possible to ensure practical collision resistance by simply shortening the 128-bit UUID versions. UUID is not an all-in-one ID standard but just a universally unique ID standard; it would just be confusing to include a spec that will never be universally unique. IMO, the longer UUID spec is not necessary until the 128-bit length is proven to be insufficient and many applications seek longer IDs. At that time a new standard will be necessary to coordinate many applications and libraries, but in the meantime, an application, if concerned about collisions, can simply extend the UUID on its own, for example, by utilizing the following struct:
struct MyUniqueId {
  uint8_t uuid[16];
  uint8_t ext_entropy[8];
  uint8_t machine_id[8];
};
I have never seen this kind of approach employed to append extra entropy, but I think it is common to add a type or namespace tag at the application level to discriminate UUIDs (e.g. It would be valuable if the new RFC includes a section that identifies the source of collision resistance and its limitations and suggests the possible approaches to mitigate such limitations. |
I am of the opinion that the variable length UUID should be out of scope. @LiosK, I couldn’t agree with you more. |
Understood on the above points, and that a fair number of people dislike this idea. How about this as an approach that would significantly simplify the whole thing, remove variable length from the scope of UUID implementations, and still provide guidance to those applications that absolutely must have more entropy because 128 bits is not enough? How about text such as this for the spec:
UUID as a Source of Opaque Bytes or Strings
For many applications, UUIDs are treated as opaque (i.e. never parsed) sequences of bytes. Or the text form is used as an opaque string of characters. The purpose is solely to uniquely identify a resource, and its exact form is not otherwise relevant. In these cases, if an application wishes to emit longer random byte sequences or strings in order to add more entropy, there is nothing stopping it from doing so. Such outputs are not covered by this specification, and as such, library and application implementors are not obligated to support them. If used, applications should take care not to call such values "UUIDs" as they are not. E.g. the more generic term "ID" or simply "unique string" would be more appropriate.
In this way, nobody reading the spec and trying to implement a UUID library is obligated to do anything with this at all. UUIDs are not variable length. Parsers (which I believe should be minimal anyway) don't have to deal with it if they don't want to, and so on. However, those people who are like "No you don't understand, I absolutely must have more entropy for my application and I don't want to invent something new" can just turn the UUID generator into a generator for whatever length they want. This handles my concern and removes the complexity that would have been added if "variable length" were specified as a feature.
It also means that a database implementor could make something like "NewID(20)" work and output something in the form of UUIDv7 plus more bytes, if they felt it necessary and they wouldn't have a "non-standard implementation", they are just doing what the above section says. |
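As a rough illustration of what a "NewID(20)"-style generator could look like, here is a minimal sketch. The name `new_id` is hypothetical. The first 16 bytes follow the UUIDv7 layout as later published in RFC 9562 (48-bit millisecond timestamp, version nibble 7, variant bits 10); that layout is an assumption here, since the draft under discussion at the time may have differed. Any bytes beyond 16 are plain random bytes, and per the proposed text the result would be an "ID", not a UUID.

```python
import os
import time


def new_id(length: int = 16) -> bytes:
    """Return `length` bytes: a UUIDv7-style prefix followed by extra random bytes."""
    if length < 16:
        raise ValueError("length must be at least 16 bytes")
    ts_ms = time.time_ns() // 1_000_000
    # 6 bytes of big-endian millisecond timestamp, then 10 random bytes.
    head = bytearray(ts_ms.to_bytes(6, "big") + os.urandom(10))
    head[6] = (head[6] & 0x0F) | 0x70  # version nibble = 7
    head[8] = (head[8] & 0x3F) | 0x80  # variant bits = 10
    # Anything past 16 bytes is extra entropy only.
    return bytes(head) + os.urandom(length - 16)
```

With `length=16` the output is a plain UUIDv7-shaped value; with `length=20` a consumer that only understands 16-byte UUIDs can still read the first 16 bytes, provided it knows the total length from somewhere else.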
@bradleypeabody Such verbiage serves no purpose other than to confuse things. It's just saying that UUIDs can be embedded as part of a larger data structure and that such data structures should not be considered UUIDs. This has always been the case. I think it's generally understood that all standards can be (ab)used in this way should implementors choose to do so. There's no need to encourage it by actually talking about it. If we're going to say longer-form UUIDs are out of scope we should just omit all discussion of them. [Edit: ... other than to say, "Longer UUIDs were considered by the authors but were not deemed to be worth addressing at this time."] |
This question of "so I'm implementing a 'make me an ID' function, what should I do?" is quite literally the use case that drove this project in the first place. And on this same line, "Should we allow longer IDs for applications that need more entropy/collision resistance" is a very relevant question that has come up many, many times. Here on GitHub and also in prior discussions on the IETF mailing list (if someone is interested I can try to go dig up the links to these things). You will notice there is an overarching theme to the current draft where concerns that are only applicable in certain cases were removed from the actual specification and instead turned into a short section to discuss the topic, as guidance to the implementor. In my mind this falls into that category. Even if longer values are "not a UUID", it is still a relevant concern to people reading the document. Because the goal is not just to "make a new UUID format", but also to answer the question "how should I generate unique IDs in a distributed environment". I doubt I will be able to convince you personally of the importance of this, but hopefully the above makes it clear that this subject is in fact relevant to other use cases and users. And considering it adds no implementation requirements, and clearly addresses a concern that I've been asked about easily a dozen times in the course of doing all this, I think it's relevant enough to include. If you have ideas of how to improve the wording, suggestions are welcome. |
Seems to me that if you talk about UUID, you talk about 128 bits. Imagine the work needed to update all the software that just made a 128-bit field. Also for me, variable-length UUID is out of scope. If you want a UUID with variable length, it would be better to create a new type of identifier. For example, let's introduce the UOID (Universal Object ID). Everyone talking about UUID knows about 128 bits and everybody talking about UOID knows that it follows a different standard that could possibly have a variable-length version. |
@bradleypeabody My concern is that we risk sending a mixed message. For example, in the proposed text you say:
... which I agree with 100%. But your explanatory text seems to suggest otherwise (emphasis mine):
A UUID is 128 bits. If an ID is longer or shorter, or composed in a way that does not comply with the specification, it should not be considered standard, or even "not non-standard".
I actually quite like this section. Nor am I averse to addressing the "should IDs be longer" topic there. As you say, they're a much-discussed topic. I just think it should be done in a way that is unambiguous about the fact longer ids are not part of this standard.
"Since you asked..." 😄
Longer UUIDs, Composition With Other Data
There has been much debate on the utility of longer forms of ids that provide additional uniqueness guarantees, or that allow for encoding additional information. Such ids are deliberately not addressed by this specification, as it is felt the requirements are too specialized to be effectively addressed at this time. That said, no prohibition is made that prevents applications from using a longer form of ID that combines a UUID with other data. Such constructs should be considered non-standard, however, and care should be taken not to refer to them as "UUIDs". In such cases, applications are encouraged to use a more generic term such as "ID" or "unique string", or invent a new term so as to avoid confusion with standard, 128-bit UUIDs. |
@broofa You have destroyed the main purpose of this specification: Introduce new UUIDs which make good database keys. Good database keys should be long enough to contain the metadata, but not at the expense of the random part. Otherwise, developers will have to accompany the UUID with additional fields, which is bad for the database architecture. A length of 160 bits is absolutely necessary. By the way, 160 is divisible by 5, which is convenient for Crockford base32. Your fundamentalist position will result in just one more standard that many will have to give up. It seems that the bad example of UUID taught you nothing. ULID and similar identifiers appeared due to the fact that the authors of the standard did not pay attention to the needs of the database architecture. Competing standards will emerge that will bypass your artificial length limitation. Is this what you want? By the way, I don't like the very term Variable Length UUID. It seems to imply that you can lengthen or shorten the UUID. But it's not. I prefer the more precise term UUID variants of various lengths. |
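The divisibility remark can be checked directly: base32 carries 5 bits per character, so 160 bits fill exactly 32 characters, while 128 bits need 26 characters with 2 bits left over. A minimal sketch, using the standard Crockford base32 alphabet; the function names are hypothetical, and placing the leftover zero bits on the most significant side is an assumption of this sketch.

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U


def padding_bits(n_bytes: int) -> int:
    """Bits of padding needed to reach a multiple of 5 bits."""
    return -(n_bytes * 8) % 5


def crockford_encode(raw: bytes) -> str:
    """Encode bytes as Crockford base32, zero-padding on the most significant side."""
    n_chars = (len(raw) * 8 + 4) // 5  # ceil(bits / 5)
    value = int.from_bytes(raw, "big")
    # Emit 5-bit groups from most significant to least significant.
    return "".join(
        CROCKFORD[(value >> (5 * i)) & 0x1F] for i in reversed(range(n_chars))
    )
```

For 20-byte input `padding_bits` returns 0 and the output is 32 characters; for 16-byte input it returns 2 and the output is 26 characters.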
Personal attacks are not warranted. Nor do they help convince people of your argument.
I'm not convinced this has been established. For example, a database with 3.3 quadrillion v4 UUIDs has a one-in-a-million chance of collision (P = 0.0001%). In real world terms, such a data Now before anyone cries foul over how contrived such examples are... trust me, I know. "What if you need P=0.0000000001%?"... "Such databases exist!!" My point is not that longer IDs aren't needed. It's (ironically) that we have to contrive use cases for them. We're not the only ones. To my mind, this is clear evidence that we don't understand the problem space well enough to be authoring a specification for them. For example, @sergeyprokhorenko, why are we so convinced 160 bits is sufficient? Why not 256? Or 1024? In 2005, when the RFC was authored, the idea of a system capable of guessing hashes at a rate of 24 x 10^18 / second was just idle speculation. The stuff of science fiction. But Bitcoin was invented just 3 years later, and was requiring exactly that level of computation in 2018 when the article in that last link was written. And now, four years later, that hash rate has increased another order of magnitude. This is why I've been so persistent in my resistance to changing the variant, by the way. I foresee much bigger, sweeping changes to what UUIDs look like in the future. As much as I'd like to contribute to authoring a spec that will last into the next century, I simply don't have the hubris to believe what we're doing here will come anywhere close to that. To my mind, we should limit our efforts to the problem space we understand. Meaning, primarily, providing a form of ID that fits within the parameters within which most of the UUID-using community operates, with easily identifiable (and justifiable) improvements. |
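Collision figures like these can be sanity-checked with the usual birthday approximation, P ≈ 1 - exp(-n(n-1) / 2N), where N is the number of possible random values. A minimal sketch, assuming 122 random bits for a v4 UUID (128 minus 4 version bits and 2 variant bits) and, for longer IDs, treating every bit as random, which overstates formats that spend bits on timestamps or metadata. The function names are hypothetical.

```python
import math


def collision_probability(n_ids: float, random_bits: int) -> float:
    """Approximate P(at least one collision) among n_ids uniformly random IDs."""
    space = 2.0 ** random_bits
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


def ids_for_probability(p: float, random_bits: int) -> float:
    """Approximate number of IDs at which the collision probability reaches p."""
    space = 2.0 ** random_bits
    return math.sqrt(2.0 * space * -math.log1p(-p))


if __name__ == "__main__":
    # ID counts needed to reach a one-in-a-million collision chance.
    for bits in (122, 160, 256):
        print(f"{bits} random bits: {ids_for_probability(1e-6, bits):.3g} IDs")
```

This only answers "how many IDs for a given probability"; it says nothing about what probability is acceptable, which is the open question in this thread.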
Couldn't agree more. |
@broofa You are confusing database table keys with cryptographically strong access tokens. They have completely different purposes. Database developers don't care about key guessability at all, and in most cases they get by with auto-increment. The length of 160 bits comes from real practical needs for metadata in keys. I am a systems analyst with many years of experience in many of the world's leading banks and companies, and unlike you, I don't have to invent these needs. I see them firsthand. And I would be sorry for my time wasted discussing a useless standard. If, as you yourself admit, you do not understand the problem, just trust a professional. You are also trying to save on key lengths while degrading the database architecture. But a greedy man pays twice or even more. You refer to the non-existent experience of using UUIDs as database keys in abstract community. This argument is worthless. I actually have experience with UUIDs as database keys. It was very convenient, very slow, and there was a severe lack of metadata in keys. |
@bradleypeabody I like your idea of describing prospect IDs for databases as a combination of UUID + other data in one field of a database table. I can offer the following concise wording: The surrogate key MAY be a concatenation of the UUID followed by an additional random part and metadata. By default, its length is 160 bits. |
@sergeyprokhorenko Interesting. The discussion so far has primarily focused on the extra entropy to guarantee collision resistance. Subsuming metadata under a surrogate key sounds like a very different story. What do the world's leading banks and companies exactly do with such a data structure? What kind of metadata do they embed? What kind of data items do such keys refer to? Why do they not use a composite key? |
Am I? From RFC4122, §6:
|
@LiosK It's not new in this discussion. See example I would suggest multiple types of metadata:
They use auto-increment, or UUID v4, or non-standard time-based UUIDs like in Laravel + checksum, or surrogate keys that have the following structure: operation type code or other code + date + sequence
They use composite keys very extensively and suffer greatly from this. |
I'm not sure you can make a generalized statement like this. Some database folks care about guessability, some don't.
I don't think that is the main purpose of this specification. If it is, where is it stated? |
@peterbourgon Here is the proof |
That indicates that "good database keys" is a goal, but it doesn't indicate that it's the main purpose of the specification. |
Is there any other reason for generating monotonic UUIDs? If you look at the selected prototypes of this specification, then you will no longer have doubts. |
There are tons of reasons to generate [monotonic] UUIDs that have nothing to do with databases. I've used ULIDs in many situations where they were simply record locators in files, or file names. |
How is monotonic behavior useful in such cases? I.e. as opposed to just using a v4 UUID (or ids of similar ilk.) |
The progenitor of oklog/ulid is oklog/oklog, which leverages ULIDs to assign (roughly) monotonic identifiers to each ingested log record. Those IDs establish a deterministic global order which is leveraged as an invariant at many points throughout the system, including as names for segment files (LO-HI.txt) which thereby become self-describing as to their contents, and naturally sortable. |
Group,
|
@kyzer-davis I missed your comment from a few days ago with the proposed text. My apologies for not paying attention. I will admit this whole UUID Long section caught me by surprise. I felt (hoped) that we'd settled this as being out of scope. Be that as it may, we've officially crossed my threshold for what does / does not qualify as an extension to RFC4122. Between "UUID Long" and the proposed alternate text encodings (#2), I can't support the spec in the proposed form. The number of possible permutations for UUID forms is too large to qualify as a "Standard". We're just enumerating how a variable number of bytes may be represented in a variety of encodings. This is not helpful. It will cause more problems than it solves. For example, if a system receives a UUID Long encoded in an unknown base, how are the base and length determined? Is there even a canonical solution to that? (I believe @bradleypeabody has raised this concern previously.) If we're going to continue down this path we need to simplify things to bring this back to the level of specificity that I believe a Standard demands. We should create a new RFC that drops the 8-4-4-4-12 notation, pick one (and one only!) new encoding, I guess adopt the new variant, and deprecate 4122 altogether. |
The complexity of the tool has to correspond to the actual complexity of the subject area. Developers cannot get by with a stone ax for all occasions. The variety of formats is caused by a variety of needs, and is not a thoughtless combination of possible components. |
@sergeyprokhorenko: Please convert this UUID to its 8-4-4-4-12 hex form: [Edit: screwed up the encoding 😦 ]
This should be, must be, a trivial problem if the spec is well-formed. If it's non-trivial (which I obviously think it is or I wouldn't be posing this question) then I assert this effort has gone astray and we need to rethink what exactly it is we're trying to accomplish here.
|
It's not a UUID at all, because the encoding is not Crockford's base32. There are many transcoding libraries out there, so leave that up to the developers. |
Well, a UUID doesn't have to be encoded in Crockford base32 to be a UUID — any encoding is fine, as long as the (decoded) bytes satisfy the relevant specification. But, with that said, 8-4-4-4-12 hex is already an encoding, so if you base64 encode that I guess you've gone one encoding too many 😉 However I do agree with the underlying point made by @broofa. Variable width types are enormously less efficient to parse than fixed width types: parsing one requires two phases and at least one conditional, and parsing a sequence of them is O(n) rather than O(1). |
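The parsing cost difference can be made concrete. A minimal sketch, assuming a hypothetical wire layout in which each variable-width value is preceded by a one-byte length; nothing in the thread specifies such a layout, and all names are illustrative. Fixed-width values can be located by arithmetic, while length-prefixed values must be walked one by one.

```python
UUID_LEN = 16


def nth_fixed(buf: bytes, i: int) -> bytes:
    """O(1): the i-th 16-byte value sits at a computable offset."""
    start = i * UUID_LEN
    if i < 0 or start + UUID_LEN > len(buf):
        raise IndexError("index out of range")
    return buf[start:start + UUID_LEN]


def encode_prefixed(values: list[bytes]) -> bytes:
    """Concatenate values, each preceded by a one-byte length (assumed layout)."""
    out = bytearray()
    for v in values:
        if len(v) > 255:
            raise ValueError("value too long for a one-byte length prefix")
        out.append(len(v))
        out += v
    return bytes(out)


def nth_prefixed(buf: bytes, i: int) -> bytes:
    """O(n): every earlier length header must be read to find the i-th value."""
    if i < 0:
        raise IndexError("index out of range")
    pos = 0
    for _ in range(i):
        if pos >= len(buf):
            raise IndexError("index out of range")
        pos += 1 + buf[pos]  # phase 1: read the length, then skip the body
    if pos >= len(buf):
        raise IndexError("index out of range")
    n = buf[pos]
    value = buf[pos + 1:pos + 1 + n]  # phase 2: read the body
    if len(value) != n:
        raise ValueError("truncated value")
    return value
```

The two-phase read and the bounds checks in `nth_prefixed` are the "two phases and at least one conditional" mentioned above.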
@peterbourgon Nobody talks about dynamically variable width, which is different in different records of the table. It's just that for different purposes there should be UUIDs of different widths. |
So uuid6/uuid6-ietf-draft#85 defines UUID Long as
Which means that parsing a UUID Long requires the consumer to know how many additional bytes have been appended by the producer. As far as I can see, that makes a UUID Long a type with variable width. Typically the approach for types like this is to encode the width as a preamble or header to each value. But if that's not the case here, and the number of additional bytes is expected to be communicated out-of-band somehow, then UUID Long by itself isn't actually useful, because it doesn't contain enough information to enable end-to-end communication. You would need to declare that a field is of type "UUID Long 10", or "UUID Long 48", which is equally well expressed as "UUIDvN plus 10 bytes" or "UUIDvN plus 48 bytes" etc. |
It's right.
This is not true. All parties need to know the approved length of the UUID before use of the information system. The length should not change from time to time. There is no need to report the length of each UUID.
Yes, that's right. |
If UUID Long must be combined with a width in order to be usable, then there's no reason for UUID Long to exist. |
@peterbourgon, There is no logic in these words |
@sergeyprokhorenko I think @peterbourgon's point is that if all participants have to agree on the width of a UUID in order for it to be usable, then they can just as easily agree on other, better ways to share whatever information is encoded in the extra bytes. (E.g. composite data structure or separate DB column). UUIDs should be completely self-contained, with no need to consult any outside source (other than this spec) to understand how they are created, used, or parsed. This is what gives them their utility and uniqueness. As soon as you require applications to coordinate, to share other information about them, this specification quickly stops being useful. [Edit to add: And if the extra bytes are strictly there to ensure uniqueness, which is the only actually-valid reason for extending a UUID imho, then we're just looping back to the debate about what should / should not be addressed in a new RFC.] |
Right. I can't say a column is of type |
I suggest wording into section "DBMS and Database Considerations", which will probably suit everyone. It replaces UUID Long.
|
Group, great discussion, this is why I author the proposed RFC text! Converting from GitHub threads to "RFC Speak" always drives great conversations and uncovers things that may not have been considered. PR uuid6/uuid6-ietf-draft#85's text has been written in a way that I can easily remove That being said, we have a few engineering challenges with these sections but I am confident this group will be able to derive a great solution! I reviewed the last comments and I will summarize a few of the topics as usual.
Signaling UUID Alt Encoding Method(s)
@broofa makes a good point on this topic: how do we determine the method to unpack this? I believe the point is also relevant for regular UUID + alt encoding. If System A creates a UUID / UUID Long, encodes it with a random method and sends it to System B as
Signaling UUID Long Length
I'll be honest, this one slipped my mind and should have been included with my discussions here.
General URN Author note, I would need to do a deep dive on RFC8141 to ensure any potential URN proposals are valid syntax before authoring text. Editor Note: Must include text about assuming |
I wouldn't hardcode the signaling data identifiers into a UUID or into a message, but instead use links, like the links to imported libraries and functions. In addition, signaling data may only be needed when forwarding messages (similar to checksums). They should not be stored in the database, as they greatly increase database lookup time, especially if the signal data precedes the UUID in the key. A JSON-like envelope would be useful, in which the UUID will be sent. Inside this envelope will be checksums and information about the encoding. At the same time, some metadata must be stored in the database key along with the UUID, and cannot be assigned to separate JSON fields. |
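One way to picture the envelope idea is the sketch below. Everything about it is an assumption made for illustration: the field names (`id`, `encoding`, `length`, `crc32`), the choice of CRC-32 as the checksum, and the two supported encodings are hypothetical and do not come from any draft.

```python
import base64
import json
import zlib


def wrap(raw_id: bytes, encoding: str = "base64url") -> str:
    """Put an ID into a JSON envelope with encoding, length and checksum fields."""
    if encoding == "hex":
        text = raw_id.hex()
    elif encoding == "base64url":
        text = base64.urlsafe_b64encode(raw_id).decode("ascii")
    else:
        raise ValueError(f"unsupported encoding: {encoding}")
    return json.dumps({
        "id": text,
        "encoding": encoding,         # tells the receiver how to decode "id"
        "length": len(raw_id),        # decoded length in bytes
        "crc32": zlib.crc32(raw_id),  # checksum over the decoded bytes
    })


def unwrap(envelope: str) -> bytes:
    """Decode the envelope and verify length and checksum."""
    msg = json.loads(envelope)
    if msg["encoding"] == "hex":
        raw = bytes.fromhex(msg["id"])
    elif msg["encoding"] == "base64url":
        raw = base64.urlsafe_b64decode(msg["id"])
    else:
        raise ValueError(f"unsupported encoding: {msg['encoding']}")
    if len(raw) != msg["length"] or zlib.crc32(raw) != msg["crc32"]:
        raise ValueError("length or checksum mismatch")
    return raw
```

Only the raw bytes returned by `unwrap` would be stored as the database key; the signaling fields live in the message, which matches the separation described above.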
Announcement
I had a great discussion with @bradleypeabody and this topic has officially been marked For now please focus on the technical challenges proposed by my previous comment: #2 Edit: To further clarify, Draft 03 will cover UUIDv6 through v8 + Max UUID. The new Draft 00 will cover E Variant, Alternate Encoding and UUID Long. Two drafts that cover different topics so implementations may choose what they want to support, i.e. an implementation supports RFC8675309 for v7 but not RFC123456789 for alt encodings. |
Group, Following the previous announcement I have drafted a new RFC Draft document to cover this topic. Since the discussion threads are in this repo I decided to create a new folder for the topic.
Additionally, I have authored an "Extended UUID URN namespace" for conveying encoding type and length of a UUID to other applications defined by this document. I still have more research on URNs to do but I feel confident enough the proposed URN is backwards compliant with RFC4122 and also compatible with RFC8141. |
It is curious to me why you chose
over
in the |
I toyed with both for a while in draft 00. Final: And for placing Lastly, this allows for future extension if need be. Say for example somebody wanted to extend further and describe Edit: The variable length of the |
All things Variable Length UUIDs!
Biggest Question up front: Should this be in scope for this RFC Draft?
If so as per @broofa