
(Very Small) Chance of Out-of-Order IDs May Cause Problems #5228

Closed
aschmitz opened this issue Oct 5, 2017 · 7 comments

Comments

aschmitz (Contributor) commented Oct 5, 2017

I'm filing an issue to take this out of a Mastodon conversation and have a place to discuss the actual issue.

Brief background: status IDs (since #4801) contain a timestamp plus 16 bits of a counter with a random base. Because the counter is taken mod 2**16, if multiple statuses occur in the same millisecond there is a small chance of the counter wrapping, producing an "out of order" ID: a client that requests statuses during that millisecond may never see the newer status, because it has a lower ID than one the client has already seen.
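For illustration, here is a minimal sketch of that layout (the bit positions and names are illustrative assumptions, not the exact implementation): the millisecond timestamp in the high bits and the 16-bit counter in the low bits, so a wrap within one millisecond hands the newer status the smaller ID.

import time

SEQ_BITS = 16  # low bits holding the randomly based counter described above

def make_id(timestamp_ms, counter):
    # Pack a millisecond timestamp and a 16-bit counter into one ID (illustrative layout).
    return (timestamp_ms << SEQ_BITS) | (counter % 2**SEQ_BITS)

# Two statuses created in the same millisecond, with the counter sitting at 2**16 - 1:
ts = int(time.time() * 1000)
first = make_id(ts, 0xFFFF)       # counter at its maximum value
second = make_id(ts, 0xFFFF + 1)  # counter wraps around to 0
print(second < first)             # True: the newer status gets the lower ID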

This is extraordinarily unlikely in practice. Some quick math (throw it in https://sagecell.sagemath.org/ if you want to change numbers):

# Borrowed from https://gist.github.com/noskill/8640130
def poisson(lmbda, x):
    return e ** (-lmbda) * lmbda ** x / factorial(x)

statuses = 1000000
time_in_ms = 1000 * 60 * 60 * 24 * 30

rate = statuses / time_in_ms

at_most_one = poisson(rate, 0) + poisson(rate, 1)

two_in_same_ms_odds = 1 - at_most_one

two_in_same_ms_hours = (1 / two_in_same_ms_odds) / 1000 / 60 / 60

print "Hours between two statuses in the same millisecond: %f" % two_in_same_ms_hours

max_count = 100
covered = two_in_same_ms_odds
freq_problems = 0
for count in range(2, max_count):
    rate_at_count = poisson(rate, count)
    covered += rate_at_count
    # For our purposes, the "starting" low 16 bits are random,
    # regardless of method. If we have 2 statuses in the same ms
    # there is one "bad" starting value (2**16-1). If there are
    # 3 in the same ms, there are 2 bad starting values, etc.
    odds_of_problem = (count - 1) / 2**16
    freq_problems += rate_at_count * odds_of_problem

problems_ms = 1 / freq_problems
problems_days = problems_ms / 1000 / 60 / 60 / 24
problems_years = problems_days / 365.25

print "Years between wrapped status IDs in same millisecond: %f" % problems_years
Hours between two statuses in the same millisecond: 3.733440
Years between wrapped status IDs in same millisecond: 27.908198

Note that this relies on some wildly unrealistic assumptions: that you see a million statuses per 30 days in your timeline (which is absurdly high), and that they are uniformly distributed over the day (which is definitely not the case, but possibly offsets the high status count, and it makes the math much easier).

Under these assumptions, if operations take no time (the worst case), you never refresh the whole feed, and you only request statuses with IDs higher than the latest one you've seen, you'll miss a status roughly once every 27.9 years, which isn't too bad. (And if that status would notify you because you were tagged in it, you'll see it anyway.)

I see a few options, and would be open to more:

  1. Do nothing. Ignore the issue and hope it doesn't seriously bite anyone. This isn't an option I like, but I'll throw it out in case anyone really dislikes the others.
  2. Reset the counter to zero every millisecond. This is practically intractable, because it would add a pretty massive amount of overhead to something centralized (probably the database, but possibly a round-trip to Redis instead... which may well take more than a millisecond anyway, if Redis is on a different machine than the app server).
  3. Return all statuses in the milliseconds covered by the user's request (that is, if a user requests a max_id of 123456789, instead return all statuses with IDs up to 123456789 | 0xffff, to include any statuses that could have occurred in that millisecond; see the sketch after this list). This is easy enough to implement, but requires clients to deduplicate statuses they receive.
  4. Simply don't return statuses from the most recent millisecond when users request them. This artificially delays statuses, but only by a very tiny amount, with network latency almost certainly dominating any calculations here. This may be the least impactful option, although I'm not 100% sure how to do it off the top of my head.
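A rough sketch of option 3 (the function and parameter names here are hypothetical, not the actual pagination code):

SEQ_MASK = 0xFFFF  # low 16 bits of the ID, i.e. the counter portion

def widen_max_id(max_id):
    # Extend max_id to the very end of its millisecond, so every status generated
    # in that millisecond is included even if its counter wrapped (option 3 above).
    return max_id | SEQ_MASK

# Hypothetical usage in a timeline query; clients must deduplicate, since some of
# these statuses may already have been delivered on a previous request:
# statuses = fetch_statuses(up_to_id=widen_max_id(requested_max_id))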
nightpool (Member) commented Oct 5, 2017 via email

nightpool (Member) commented Oct 5, 2017 via email

clworld (Contributor) commented Oct 5, 2017

I've found that out-of-order arrival of status IDs already happens with non-snowflake sequential IDs.
That situation seems more common than multiple statuses arriving within the same millisecond.
test
Without fixing that, I prefer option 1, because some statuses are already lost via the REST API.

Gargron (Member) commented Oct 5, 2017

I have a few general comments about the snowflake IDs, and want to put them down here. Maybe I merged too soon - but I was getting impatient and didn't want to hold up a new contributor too long either.

  • The timestamp_id() function in Postgres uses the vanilla now() epoch, but we don't need to store any dates prior to when StatusNet was first created. I think that's 2010, but just to be safe we could take 2009. Going forward, that would give us a lot more storage capacity.
  • I think the md5 salt in the sequence in the timestamp_id() function is over-engineered and unnecessary. It solves a problem I do not consider to be a problem, and running md5 on every ID generation is surely not good for performance either, even if md5 is the cheapest of the hashing functions.
  • I think using the full sequence number rather than only the last two bytes is okay. Like that solution, we could even take the sequence modulo 1024 or some similarly limited number. Because timestamp_id() is only ever called for millisecond-precision timestamps, that would give a throughput of 1024 (or whatever) new IDs per millisecond, which I think is enough. And when the modulo spills over, Twitter's snowflake implementation, for example, simply waits for the next millisecond to tick over so the sequence can start from scratch (see the sketch after the links below).

Formspring's flake generator implementation: https://github.com/formspring/flake/blob/master/flake.py
Twitter's snowflake generator implementation: https://github.com/twitter/snowflake/blob/b3f6a3c6ca8e1b6847baa6ff42bf72201e2c2231/src/main/scala/com/twitter/service/snowflake/IdWorker.scala
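A minimal Python sketch of the approach in the last point above, under the stated assumptions (a per-millisecond sequence capped at 1024, spinning until the next millisecond when it overflows; the class and constant names are illustrative, and the real change would live in Postgres):

import time

SEQ_BITS = 10          # 1024 IDs per millisecond, as suggested above
SEQ_MAX = 1 << SEQ_BITS

class SnowflakeSketch:
    # Illustrative in-process generator; not the Mastodon implementation.
    def __init__(self):
        self.last_ms = -1
        self.seq = 0

    def next_id(self):
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            self.seq += 1
            if self.seq >= SEQ_MAX:
                # Sequence exhausted for this millisecond: wait for the clock to tick over.
                while now_ms <= self.last_ms:
                    now_ms = int(time.time() * 1000)
                self.seq = 0
        else:
            self.seq = 0
        self.last_ms = now_ms
        return (now_ms << SEQ_BITS) | self.seq

gen = SnowflakeSketch()
print(gen.next_id(), gen.next_id())  # strictly increasing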

aschmitz (Contributor, Author) commented Oct 5, 2017

@nightpool I think we'd only have the "rounded off milliseconds" problem with backdated statuses, in which case they'd be inserted "out of order" anyway, so I'm not immediately concerned about it. We could (and probably should?) pick a random millisecond if the milliseconds on the timestamp are 0, but I don't think it's a major concern in terms of ordering, because they'll already be behind the current IDs (at least after #5211 gets merged).

@clworld Have you seen this happening in practice? I certainly believe that it could happen, but I don't know how common it is. If it's already common enough, I'd just stick with option 1 and say "yeah, it could be unfortunate, but it's so rare that you'll drop statuses for other reasons anyway a lot more often".

@Gargron In order:

  1. Let's assume we can only go up to 63 bits, because we're using signed 64-bit ints (which I think we are, in Postgres). That gives (2**(63-16)/1000/60/60/24/365) ~= 4463 years of IDs without running out (the arithmetic is spelled out at the end of this comment). Sure, we could buy back roughly 40 years by using a nonstandard epoch, but I figured I'd just go with the Unix epoch for consistency.

  2. To an extent, it solves a problem I'd like to avoid, but I guess for some purposes it could feel over-engineered. I benchmarked it with Postgres on a single core of a midrange CPU and didn't find it to take significant amounts of time compared to other status creation actions. (In terms of MD5 being weak, I agree, but it's reasonably fast and included in all supported versions of Postgres and has no publicly known preimage attacks, making it secure enough for the intended purpose.)

  3. I'm not entirely sure what you're saying here. If you're suggesting a sequence number within the millisecond, Postgres provides no way to do this (without storing it in a temporary table at a significant level of overhead), and doing it in the app servers poses a different distribution problem (one that would be a lot more difficult to wrangle). Otherwise, you'll still have something that wraps at some point when you take it modulo some number. (I actually like the idea of a sequence number that restarts every millisecond on the millisecond, but I can't find a good way to do it.)
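For reference, the capacity arithmetic from point 1, spelled out:

# 63 usable bits in a signed 64-bit integer, minus 16 bits of counter,
# leaves 47 bits of millisecond timestamps.
ms_capacity = 2 ** (63 - 16)
years = ms_capacity / 1000 / 60 / 60 / 24 / 365
print("%.0f years of IDs from the chosen epoch" % years)  # ~4463 years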

Gargron (Member) commented Oct 5, 2017

@aschmitz

> Postgres provides no way to do this (without storing it in a temporary table at significant levels of overhead)

But isn't that what nextval() does? Take nextval(), like you already do, modulo 1024; if the result is 0, it means we wrapped around, so wait a millisecond and try again. Shouldn't that be possible?
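A rough sketch of that idea in Python (nextval() here is a stand-in for Postgres's sequence function; in reality this logic would sit inside the ID-generating SQL function):

import itertools, time

_counter = itertools.count(1)
def nextval():
    # Stand-in for nextval() on a Postgres sequence.
    return next(_counter)

def timestamp_id_sketch():
    # Combine the millisecond timestamp with nextval() modulo 1024; a result of 0
    # means the sequence wrapped, so wait out the current millisecond and retry.
    while True:
        now_ms = int(time.time() * 1000)
        seq = nextval() % 1024
        if seq != 0:
            return (now_ms << 10) | seq
        time.sleep(0.001)  # let the next millisecond tick over before retrying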

aschmitz (Contributor, Author) commented Oct 6, 2017

@Gargron Ah, I see what you're saying: you'd just wait when the sequence wraps, not reset it to 0 each millisecond. That's not compatible with what I'm looking for (it makes it easy to tell whether two statuses with nearby, but not identical, timestamps were posted in sequence), but I can stick with my generator on my instance and we can swap in a different one for general use if you feel it's necessary, so it doesn't bother me all that much. (I'd like everyone else to get the marginal benefit too, but it's better than nothing, particularly on many-user instances.)

I can try to look into something like that this weekend if necessary. (In particular, the waiting for fractional milliseconds part will be interesting.) I don't actually see any significant advantage to doing that rather than what we have now, but if you'd rather not live without it, I can look into what'd be required.

If there were an atomic check-and-set in Postgres, this would be trivial, but alas I'm not aware of any.

Gargron closed this as completed Apr 8, 2018