Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not sure I like the new spec's "increment within same ms" change + more thoughts #3

Closed
RobThree opened this issue Aug 4, 2016 · 16 comments

Comments

@RobThree
Copy link

RobThree commented Aug 4, 2016

You changed the spec to include:

To ensure sortability for IDs generated within the same millisecond, implementations should remember the most recently generated randomness string and increment the most significant character that is not the last character in the encoding.

I don't think you understand the impact of this change; it brings a whole lot of hurt into your world. For instance, you can't rely on wall-clock-time (which you fell for). Unless you can get a reliable, monotonic, strictly increasing timesource (which is (near?) impossible AFAIK in JS) you'll run into trouble later. Why? Because: wall-clock-time goes back (DST) / stands still (leap seconds are sometimes implemented as such) / does other weird stuff (and there's even more). Trust me, I know 😉 Sidenote: some systems may not even have 'ms-resolution' timers, something to consider too.

There's 80 bits of random data; I think it's safe to assume (even with the birthday paradox in mind) that the chance on a collision (within any, random, millisecond) is similar to the chance of a GUID collision. However, that would be if the ULID's would be spread evenly across time (which would, in effect, make a ULID similar or even equal to a GUID). So, given that ULID's are typically 'clustered' in a certain timespan the chance on a collision goes up. Worst-case would be a DST change + some other 'random event' that would cause the same 'millisecond'-point-in-time to exist (say) 3 or 4 times. The chance on a collision would increase a lot but (I haven't done the math yet) would still suffice for most purposes unless you're Google maybe and generating many millions or even billions of ULID's p/ms.

So I guess my advice is to leave that part out of the spec (and, let's be honest, this method is no beauty either (it could be improved but wouldn't be worth the trouble IMHO)) and keep implementations much simpler that way. You could compare a ULID it with a v4 UUID with the only difference that the first 6 bytes (48 bits) are reserved for a timestamp for better ordering. However, it does mean that, within the millisecond, the ordering is 'random' so that the spec may need to be more clear about it / reflect that fact.

Besides time, by the way, this tiny change in the spec also requires you to keep state between generating ULID's (so you'll need to keep the generator-object around for example), introduces chances for race conditions (multithreading; N threads generating ULID's will, without proper locking etc. inevitably generate exactly the same ULID's) and all other sorts of things that either need to be taken for granted (which will bite you in the *ss sooner than later) or need to be worked out / specified.

For inspiration about some more of the problems you might run into you might want to look at Twitter's Snowflake (retired but still available) or have a look at my "Snowflake-alike" IdGen. Both are 64 bit (actually 63*) equivalents of ulid.

* Come to think of it: You might also want to steer clear of the sign-bit; when systems would order on the (internal) 128 bit representation / byte-array (for 'speed') the order may 'break' as soon as the sign bit is set to 1. Twitter thought of it way before it happened but I think it's a good idea. Sacrificing 1 bit out of 128 should't be that much of a problem (though it does halve your "ulid-space" from 2^128 to 2^127).

I was/am doing a ulid .Net port (waaay more extensive than fvilvers', with all due respect!) and also implemented the binary form, created methods to convert to/from UUID's (since ULID's and UUID's are both 128 bit you can easily convert one to the other and vice versa). I also implemented a Parse(...) method and some others and will publish on GitHub very soon have put it on GitHub. There are some things I thought of / came across which might need 'agreement' and be put in the spec:

  • ulid's are, IMHO, case-insensitive (e.g. NOT case-sensitive) We seem to agree on that; missed it in your README 😛
  • We may want to allow / think about / specify a(n optional?) separator for readability. No { and } mess like (G/U)UID's but something like ttttt-ttttt-rrrrrr-rrrrr-rrrrr (lengths 5-5-6-5-5) might make ulid's more readable. Other formats/separators are ofcourse a possibility (as well as none of this ofcourse; I'm just bringing my ideas to the table)
  • You/we may want to decide on pronunciation and document it: "you-lid"? "you-el-i-dee"? "uhl-id"? Something else? You/we want to avoid the GIF's "Jif" v.s. "Gif" wars
  • Besides the pronunciation, same goes for a "standardized" writing. I catch myself writing ulid, ULID, UlId, ... so that may have to be standardized too
  • I had more... but can't think of it now. Will add here / open new issue when I think of it.

So far, for now, another $0.02 of mine 😝


Edit: FYI; I just released V1.0 of NUlid. 🎉 Maybe you can add it to your list of community contributed ports?

@waaghals
Copy link

waaghals commented Aug 4, 2016

Firstly, Dôh. I've just spent the last two hour or so implementing a proper port to .Net not knowing one already existed. I pushed what I have, but I'll just abandon mine because yours is much more complete.

I'm unsure about leaving out the 'increment within same ms' as this will guaranty that the id's are always monotonically increasing. However I was thinking of creating a generator for this instead of implementing it in the Ulid Type itself so it does not have that responsibility. This way it could be left out of the spec, while still be available for consumers which need this guaranty.

The race conditions are solvable within C#, but I'm not so confident for Javascript because it does not have a notion of locking. However Firebase has been successfully using a JavaScript approach for some time now.

You/we may want to decide on pronunciation and document it: "you-lid"? "you-el-i-dee"? "uhl-id"?

To late, I already started calling the "you-lid" 😜

@RobThree
Copy link
Author

RobThree commented Aug 4, 2016

However I was thinking of creating a generator for this instead of implementing it in the Ulid Type itself so it does not have that responsibility. This way it could be left out of the spec, while still be available for consumers which need this guaranty.

Mine already has that possibility; you could implement an IUlidRng that generates increasing values within the same millisecond. However, as pointed out, this requires you to keep state, tip-toe around datetime (wall-clock or not) issues, think about locking etc. and, if not very carefully implemented, is a disaster waiting to happen whereas random (with 80 bits of entropy) is safe enough for most purposes; however ordering will only be reliable to the ms resolution, below that ordering would be random. Either way is fine with me, by the way, but having seen the other implementations (which also don't implement this increasing-within-the-same-timeslot) I'd favor the 'old' (or current by others implementations) behavior.

The race conditions are solvable within C#

They'll be solvable in most, if not any, language (JS is usually run in a single-threaded environment anyway AFAIK). 😉 My point was that it introduces complexity (state, races, ....) for little gain (unless you really want your ordering to be guaranteed to even within the ms-range. If that's the case then I agree; use a custom "Rng" (thinking of providing one myself) but it shouldn't be part of the spec.

Also: If we'd choose the "increasing" route; Why increase by 1? Why not increase with a random amount every call? If we increase by 1 the ULID's would become 'enumerable' again (meaning: if a ULID appears in a URL for example then I could try +1 or -1 to explore the system). One of the nice properties of GUID/UUID's is that they aren't sequential and 'hard to guess' (although that doesn't mean it's secure or you should rely on that fact for security purposes!). Care would have to be taken if a lot of ULID's were generated in the same ms to handle overflow of the 80bit value (which could (more) easily happen if the increments aren't 1 but, say, rand(0..1000000) and the initial 80 bits are close to the overflow value) but other than that I'd opt for random increments instead of 1. Ofcourse, the increment itself (be it a 'fixed' value of 1, 42, ... or a 'random' (clamped) value) could be another property of the generator.

To late, I already started calling the "you-lid" 😜

That makes two of us! 🎉


Edit: I have written a (proposed) "sequential ULID RNG" that should work with NUlid. You can find the source here. It's not done yet (see line 68), though, since I need to implement addition (and handle overflow/wraparound) over a 10-byte integer. I haven't decided on how to implement that (easy route would be to use System.Numerics's BigInteger but a) it's slow as hell for this specific purpose and b) I believe it's not available in .Net core it is!). As soon as I've decided I will implement and include the SequentialUlidRng in the next release.

In essence it's a 'wrapper' for a given RNG. It will use any IUlidRng to create random values but will increment the last value when called within the same millisecond by either a) a fixed value like 1, 42, ... or b) a random value between min and max.

Feedback and ideas are appreciated 👍

@rafaelsales
Copy link

I also don't like the current approach for that issue, it's too flaky.

In Ruby I'm able to get time up to nano, and then increasing the time section of the ULID improves that.

But in JS, not sure

@savonarola
Copy link
Contributor

savonarola commented Aug 11, 2016

Hello!

It isn't quite clear how we can increase the most significant character. As far as I understand this, starting, for example, from "01ASXJXJJF 0T7FMQ0S7B9K1D6T", we should generate the following:

01ASXJXJJF 0T7FMQ0S7B9K1D6T
01ASXJXJJF 1T7FMQ0S7B9K1D6T
...
01ASXJXJJF ZT7FMQ0S7B9K1D6T
01ASXJXJJF ZU7FMQ0S7B9K1D6T
01ASXJXJJF ZV7FMQ0S7B9K1D6T
...
01ASXJXJJF ZZ7FMQ0S7B9K1D6T
01ASXJXJJF ZZ8FMQ0S7B9K1D6T
...
01ASXJXJJF ZZZZZZZZZZZZZZZZ

(space added to visually separate time part from random one)
If we act like this, we will be able to generate at most 32*16 = 512 ULIDs within a millisecond, and it's very few.

Have I misunderstood something? In my opinion we should better increase the 16-char random tail as if it were an 16-digit 32-based integer.

@alizain
Copy link
Collaborator

alizain commented Aug 11, 2016

Okay so clearly the change in spec was not as well thought out as it should have been. I admit, it was made in (hopefully) uncharacteristic haste. While the twitter/snowflake implementation gets around that by remembering the highest last seen time, I agree that it's not worth enforcing the complicating on implementations. I'd be happy to roll it back if everyone agrees

@alizain
Copy link
Collaborator

alizain commented Aug 11, 2016

Come to think of it: You might also want to steer clear of the sign-bit; when systems would order on the (internal) 128 bit representation / byte-array (for 'speed') the order may 'break' as soon as the sign bit is set to 1. Twitter thought of it way before it happened but I think it's a good idea. Sacrificing 1 bit out of 128 should't be that much of a problem (though it does halve your "ulid-space" from 2^128 to 2^127).

@RobThree, so if I understand it correctly, this may have an affect on the binary representation, but not the string encoded one. Also, if everything is stored as unsigned_ints, would this problem still hold?

@alizain
Copy link
Collaborator

alizain commented Aug 11, 2016

We may want to allow / think about / specify a(n optional?) separator for readability. No { and } mess like (G/U)UID's but something like ttttt-ttttt-rrrrrr-rrrrr-rrrrr (lengths 5-5-6-5-5) might make ulid's more readable. Other formats/separators are ofcourse a possibility (as well as none of this ofcourse; I'm just bringing my ideas to the table)

@RobThree, I don't have any particular preference for or against separators. It could be made optional for better readability. Would you like to open another issue to discuss in? This one is getting kind of cramped 😁

@alizain
Copy link
Collaborator

alizain commented Aug 11, 2016

@RobThree @waaghals I prefer "you-lid" for pronunciation as well 🎉 , and have no preference between all-caps or no-caps when written. However, I'm not a big fan of "Ulid" 😟

@alizain
Copy link
Collaborator

alizain commented Aug 11, 2016

@waaghals AFAIK node has no notion of locking because it is single-threaded. All code in the JS implementation of ULID is also completely synchronous

@alizain
Copy link
Collaborator

alizain commented Aug 11, 2016

If we act like this, we will be able to generate at most 32*16 = 512 ULIDs within a millisecond, and it's very few.

@savonarola, you're quite right, it was a hasty change on my part. It has now been reverted

@alizain
Copy link
Collaborator

alizain commented Aug 11, 2016

@RobThree NUlid has been added to the readme 🎉

@RobThree
Copy link
Author

RobThree commented Aug 13, 2016

@savonarola

It isn't quite clear how we can increase the most significant character.

We should never 'increase' anythin in the base32 space. Base32 is just a representation of the actual (6 or 10 byte) value. It could've just as well been base 10 but base32 has other advantages. So increasing anything should be done as you would increase any other value. The only difference is that we're used to work with 1, 2, 4 or 8 bytes (byte, word, dword, long or whatever you call it in your preferred language) and this time we happen to work with 6 (time) or 10 ('random') byte sized values. The resulting base32 representation will follow from that automatically.

In my opinion we should better increase the 16-char random tail as if it were an 16-digit 32-based integer.

We should only increment the "random" part (if increments are decided upon at all; be it increments of 1 or fixed "X" values or random values as per my earlier suggestion) and take care of not 'overflowing' into the time part of the ulid; that would seriously affect the reliability of knowing when a ulid was generated; one of it's key 'selling points' to begin with. So only affect the 10 "random" bytes, leave the 6 "time" bytes in tact; throw an overflowexception or something similar if the increment causes the 'random part' to overflow; not using the bits from the "time" part to overflow into.

I was personally thinking along the following lines: when creating an "initial random value" (so first value in a 'fresh timeslot / ms') don't use the first (leftmost) byte; leave it at 0. Use the other 9 for random value. This will give you 256 'overflows' of the 9 other bytes before you'll come close to even thinking about touching the 'time part' of the ulid. This way you can still increment by 1, X or <random_value> a lot of times and still gives you enough entropy (the other 9 bytes) to avoid most collisions.

Long story short: I still like the 'simplicity' of a ulid and still think it's a nice idea (which I've had several times in my career as well) but I also think that we're moving towards a snowflake-alike thing in the end if we don't watch out for feature creep; we're (already) reinventing the wheel a bit. If ulid's are to become widely adopted then it should have simplicity (not too much stuff encoded, not too hard to implement), readability (yay base32) and a very tiny chance of collisions (so the 'best' feature of a UUID/GUID).

@alizain

Also, if everything is stored as unsigned_ints, would this problem still hold?

Yes; some systems don't have a notion of signed/unsigned which is what not-using the MSB is attempting to prevent in the first place. It messes up the sorting if you use the actual (16 byte) value because of the 'wraparound' because of the SIGNED(note: not UNsigned)-ness.

I don't have any particular preference for or against separators. It could be made optional for better readability.

Sure; no problem. Simplest is you just find 26 chars [0-9ABCDEFGHJKMNPQRSTVWXYZ] and filter out any allowed separators (like .,-/_: for example) so any of the following would be valid:

01ARYZ6S41-DEAD-BEEF-DEAD-BEEF
01ARYZ6S41-DEADBE-EFDEA-DBEEF
01AR:YZ6S:41DE:ADBE:EFDEAD:BEEF
01AR-YZ6S_41/DEAD/BEEF:DEAD.BEEF

However, that would severely impact the recognizability of a ulid (though I exaggerated the above example greatly with wildly varying notations and way too many allowed separators). Coming up with / "standardizing" one (maybe 2) ways to write ulid's (and preferably enforcing those, and only those, 'standardized upon' ways) besides them being one big 26-char string is probably a good idea.

NUlid has been added to the readme 🎉

Yay! 🎉

@waaghals
Copy link

I was thinking of using the last two (or more) LSB as incrementation space. This way there would be less randomness sacrificed. However this would yield in very similar Ulids when a collision occurs. (Not a issue for me) This would allow for 65536 increments of one within the same millisecond.

I was personally thinking along the following lines: when creating an "initial random value" (so first value in a 'fresh timeslot / ms') don't use the first (leftmost) byte; leave it at 0. Use the other 9 for random value. This will give you 256 'overflows' of the 9 other bytes before you'll come close to even thinking about touching the 'time part' of the ulid. This way you can still increment by 1, X or a lot of times and still gives you enough entropy (the other 9 bytes) to avoid most collisions.

I really like this approach, first I though you where sacrificing the MSB of randomness as a counter to allow 256 collisions within each millisecond, but this is not the case. This approach allows for random increments which makes the chances of a collision lower. However I think a full byte might be to large in most cases.

I think most of the time a single overflow bit would be sufficient. It would be really nice if it was possible to indicate where the overflow bits are. This way the number of overflow bits can be decided at runtime for the best possible fit. (i.e sacrifice randomness to prevent collisions, or vice-versa) The increment can then also be variable because a larger overflow space allows for much larger random increments)

The Ulids could be formatted as follows (! is the overflow separator):
01AN4Z07BY-!79KA1307SR9X4MV3 (no overflow bit, high change of overflow, very random)
01AN4Z07BY-7!9KA1307SR9X4MV3 (5 bits of overflow space)
01AN4Z07BY-79!KA1307SR9X4MV3 (10 bits of overflow space)

It might also be possible to use overflow spaces other then a multiple of five by introducing a extra character.

I just thought of this so not a lot of thought work went into it.

Thoughts?

@RobThree
Copy link
Author

RobThree commented Aug 19, 2016

overflow separator

I think it's overcomplicating things. We should reserve at least one, but preferably a few more bits to 'overflow into' when incrementing random amounts within the same ms. If an overflow happens you'll be able to determine it did because one or more of the bits designated as 'overflow bit' will be non-zero. No need for separator(s). If the number of bits, let's take 4 as a compromise, are 'fixed' then there's no need to complicate a ulid with "configurable/customizable overflow configuration".

@alizain
Copy link
Collaborator

alizain commented Jan 30, 2017

Sorry for the lack of activity on my part, is there still interest in properly implementing incrementation within the same ms?

@alizain
Copy link
Collaborator

alizain commented Mar 30, 2017

Closing issue, feel free to re-open!

@alizain alizain closed this as completed Mar 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants