
Non-Serial ("Snowflake") IDs #4801

Merged 20 commits into tootsuite:master on Oct 4, 2017

Conversation

@aschmitz (Contributor) commented Sep 4, 2017

Overview

This is a rather involved change, with the goal of moving IDs in all Mastodon tables away from the existing serial IDs to less-predictable IDs. While #1059 is open for this as the general idea of using "Snowflake" IDs (to borrow Twitter's term), it's not 100% clear what the different motivations for that issue are. My motivation is purely to hide the total number of entries in each table. (In particular, for single-user instances, this makes it significantly more difficult to determine the number of followers-only statuses, followed/associated accounts, etc.) However, this also implements most of the requests from #1059, with the exception of unique status IDs across the entire federation, which cannot be done in 64 bits (the limit of the integer type used by Postgres, and what we have to stick with unless we want even more massive changes).

The IDs generated by this scheme are guaranteed to be unique on a given database (as you'd expect), and are also well-ordered to within 1ms. (That is, multiple rows created within a millisecond may appear with out-of-order IDs, but any rows created more than a millisecond apart will have IDs with the ordering you would expect: the first one has a lower ID. In fact, within a given millisecond, most IDs will be in order, but there is a chance that some entries will appear "before" the first group, if many rows are inserted within the same millisecond.) This is done on the database side of the equation, to avoid depending on the time synchronization of (possibly multiple) Rails app servers. (Database servers, if separated from the app server, will tend to be a single server or at the very least have well-synchronized clocks if multiple masters are somehow involved.)
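The layout behind this scheme (top 6 bytes of millisecond-resolution time, bottom 2 bytes of per-millisecond sequence, per the commit messages below) can be sketched in plain Ruby. The function names here are illustrative only; the real generator is a Postgres function defined in `lib/tasks/db.rake`:

```ruby
# Illustrative sketch of the snowflake-style layout: 48 bits of
# millisecond timestamp, 16 bits of per-millisecond sequence.
# Names are hypothetical; the real generator lives in Postgres.
def timestamp_id(ms_since_epoch, sequence)
  raise ArgumentError, 'sequence must fit in 16 bits' unless sequence < (1 << 16)
  (ms_since_epoch << 16) | sequence
end

def decompose(id)
  { ms: id >> 16, seq: id & 0xFFFF }
end

id = timestamp_id(1_504_483_200_000, 7) # Sep 4, 2017 00:00:00 UTC, 8th row that ms
decompose(id) # => { ms: 1504483200000, seq: 7 }
```

Because the timestamp occupies the high bits, any two IDs minted in different milliseconds compare in chronological order, while IDs within one millisecond are only ordered by sequence assignment.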

Probably-Breaking Change

The API now returns strings for IDs rather than integers. There is a good chance that this will break some API clients, although more testing is probably required. (To my surprise, Tusky handled this change with no trouble.) The rationale is that JavaScript cannot precisely parse JSON numbers wider than 53 bits, and we intend to use 64 bits (or at the very least, currently 57 bits). While we could go the Twitter route and provide both a semantically-correct integer and a JavaScript-usable string, it appears as though serving only the string is likely to lead to fewer issues in the long term. I'm open to exposing IDs as both strings and integers if that decision appears more prudent, however. (Ruby can output this with no problem.) More rationale for this decision is in the commit message for 86bf2c3.
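The 53-bit limit can be demonstrated without JavaScript: IEEE double-precision floats (what JS `Number` and Redis zset scores use) have a 53-bit significand, so larger integers are silently rounded. A quick Ruby illustration:

```ruby
# Integers above 2**53 cannot all be represented exactly as doubles,
# so a 57-bit ID survives a round-trip as a string but not as a double.
id = (1 << 56) + 1       # a 57-bit ID, like the new scheme produces

as_double = id.to_f.to_i # what a JS client treating IDs as numbers sees
as_string = id.to_s.to_i # what a string-based client sees

as_double == id # => false (precision silently lost)
as_string == id # => true
```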

Internal Changes

  • A number of database columns that hold IDs have been changed from int to bigint. This may be a long migration for larger sites, although it ensures that all IDs consistently use the non-serial generator.
  • A post-migration hook has been added to ensure that all id columns use the new ID generator. (This is done as a Rake task to ensure that it is not accidentally skipped, and to keep the burden light for creators of new tables.)
  • The Redis feeds for users now store both a score and value of a status ID as the sorted set entry, leaving reblog deduplication to a separate sorted set. This code has new tests to ensure it works as expected, and is documented in 9cbfe0a. Note that existing feeds will temporarily appear with reblogged statuses shown but not as reblogs. This was believed to be preferable to rebuilding all feeds, although code for doing so could be implemented. (As new updates are provided to feeds, they will be handled appropriately.)
  • Finally, several locations that modify user feeds have been restructured to move all feed-changing logic into FeedManager. See the commit message in 9cbfe0a for more information.
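The reworked feed storage above can be modeled with two structures per feed: the feed set itself, whose member (not just whose lossy double score) identifies the status, and a tracking set recording which reblogged statuses are already represented. A minimal in-memory sketch, with hypothetical names rather than Mastodon's actual FeedManager code:

```ruby
# Minimal in-memory model of the reworked feed storage. The real code
# uses Redis sorted sets (ZADD/ZREM); FeedModel and its fields are
# illustrative stand-ins.
class FeedModel
  def initialize
    @feed = {}            # member (status id) => score; stands in for the feed zset
    @reblog_tracker = {}  # reblogged status id => id of the item shown in the feed
  end

  # Returns true if the status was actually added to the visible feed.
  def push(status_id, reblog_of_id: nil)
    if reblog_of_id
      # Only the first boost of a given status is shown; later boosts
      # are deduplicated via the tracking set.
      return false if @reblog_tracker.key?(reblog_of_id)
      @reblog_tracker[reblog_of_id] = status_id
    end
    @feed[status_id] = status_id # the value itself identifies the status
    true
  end

  def ids
    @feed.keys.sort.reverse # newest first, with exact integer ordering
  end
end
```

This mirrors only the deduplication idea; the real implementation also handles unpushing and the MAX_ITEMS trimming discussed in review below.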

I know that this is a rather intense set of patches, so I'm happy to work with the maintainers to get it to a mergeable state. Please feel free to make any comments about changes that should be made and I can try to make them. (Alternatively, this branch is open to edits from maintainers, if there are minor changes.)

Closes #1059.

aschmitz added 4 commits Aug 17, 2017
* Use non-serial IDs

This change makes a number of nontrivial tweaks to the data model in
Mastodon:

* All IDs are now 8 byte integers (rather than mixed 4- and 8-byte)
* IDs are now assigned as:
  * Top 6 bytes: millisecond-resolution time from epoch
  * Bottom 2 bytes: serial (within the millisecond) sequence number
  * See /lib/tasks/db.rake's `define_timestamp_id` for details, but
    note that the purpose of these changes is to make it difficult to
    determine the number of objects in a table from the ID of any
    object.
* The Redis sorted set used for the feed will have values used to look
  up toots, rather than scores. This is almost always the same as the
  existing behavior, except in the case of boosted toots. This change
  was made because Redis stores scores as double-precision floats,
  which cannot store the new ID format exactly. Note that this doesn't
  cause problems with sorting/pagination, because ZREVRANGEBYSCORE
  sorts lexicographically when scores are tied. (This will still cause
  sorting issues when the ID gains a new significant digit, but that's
  extraordinarily uncommon.)

Note a couple of tradeoffs have been made in this commit:

* lib/tasks/db.rake is used to enforce many/most column constraints,
  because this commit seems likely to take a while to bring upstream.
  Enforcing a post-migrate hook is an easier way to maintain the code
  in the interim.
* Boosted toots will appear in the timeline as many times as they have
  been boosted. This is a tradeoff due to the way the feed is saved in
  Redis at the moment, but will be handled by a future commit.

This would effectively close Mastodon's #1059, as it is a
snowflake-like system of generating IDs. However, given how involved
the changes were simply within Mastodon, it may have unexpected
interactions with some clients, if they store IDs as doubles
(or as 4-byte integers). This was a problem that Twitter ran into with
their "snowflake" transition, particularly in JavaScript clients that
treated IDs as JS integers, rather than strings. It therefore would be
useful to test these changes at least in the web interface and popular
clients before pushing them to all users.
* Fix JavaScript interface with long IDs

Somewhat predictably, the JS interface handled IDs as numbers, which in
JS are IEEE double-precision floats. This loses some precision when
working with numbers as large as those generated by the new ID scheme,
so we instead handle them here as strings. This is relatively simple,
and doesn't appear to have caused any problems, but should definitely
be tested more thoroughly than the built-in tests. Several days of use
appear to support this working properly.

BREAKING CHANGE:

The major(!) change here is that IDs are now returned as strings by the
REST endpoints, rather than as integers. In practice, relatively few
changes were required to make the existing JS UI work with this change,
but it will likely hit API clients pretty hard: it's an entirely
different type to consume. (The one API client I tested, Tusky, handles
this with no problems, however.)

Twitter ran into this issue when introducing Snowflake IDs, and decided
to instead introduce an `id_str` field in JSON responses. I have opted
to *not* do that, and instead force all IDs to 64-bit integers
represented by strings in one go. (I believe Twitter exacerbated their
problem by rolling out the changes three times: once for statuses, once
for DMs, and once for user IDs, as well as by leaving an integer ID
value in JSON. As they said, "If you’re using the `id` field with JSON
in a Javascript-related language, there is a very high likelihood that
the integers will be silently munged by Javascript interpreters. In most
cases, this will result in behavior such as being unable to load or
delete a specific direct message, because the ID you're sending to the
API is different than the actual identifier associated with the
message." [1]) However, given that this is a significant change for API
users, alternatives or a transition time may be appropriate.

1: https://blog.twitter.com/developer/en_us/a/2011/direct-messages-going-snowflake-on-sep-30-2011.html
* Restructure feed pushes/unpushes

This was necessary because the previous behavior used Redis zset scores
to identify statuses, but those are IEEE double-precision floats, so we
can't actually use them to identify all 64-bit IDs. However, it leaves
the code in a much better state for refactoring reblog handling /
coalescing.

Feed-management code has been consolidated in FeedManager, including:

* BatchedRemoveStatusService no longer directly manipulates feed zsets
* RemoveStatusService no longer directly manipulates feed zsets
* PrecomputeFeedService has moved its logic to FeedManager#populate_feed

(PrecomputeFeedService largely made lots of calls to FeedManager, but
didn't follow the normal adding-to-feed process.)

This has the effect of unifying all of the feed push/unpush logic in
FeedManager, making it much more tractable to update it in the future.

Due to some additional checks that must be made during, for example,
batch status removals, some Redis pipelining has been removed. It does
not appear that this should cause significantly increased load, but if
necessary, some optimizations are possible in batch cases. These were
omitted in the pursuit of simplicity, but a batch_push and batch_unpush
would be possible in the future.

Tests were added to verify that pushes happen under expected conditions,
and to verify reblog behavior (both on pushing and unpushing). In the
case of unpushing, this includes testing behavior that currently leads
to confusion such as Mastodon's #2817, but this codifies that the
behavior is currently expected.
* Rubocop fixes

I could swear I made these changes already, but I must have lost them
somewhere along the line.
@nightpool (Collaborator) commented Sep 4, 2017

If you want to conceal the velocity of all of your followed accounts (not sure what the threat model for this is but I'll take it as given) why not just create a bot that makes private messages to itself at random intervals? Or makes (X - statuses in last hour) private posts every hour?

It seems like there are vastly simpler ways to solve this problem

@aschmitz (Author) commented Sep 4, 2017

There are obviously other ways to conceal that information to an extent. Covering the magnitude is more difficult by doing that: to provide good concealment, I'd still end up with very large numbers sooner or later, which would eventually lead to the rest of these changes being necessary (at the very least, the 4-byte ints to 8-byte bigint changes, although the full range of status IDs in the UI is an obvious thing to want to support anyway). So, I'd rather try to get these fixes in sooner rather than later.

@nightpool (Collaborator) commented Sep 4, 2017

This seems like a huge change for no perceivable benefit to >90% of instances.

Also, if StreamEntry IDs are still sequential then you're leaking that information ANYWAY.

@aschmitz (Author) commented Sep 4, 2017

I'll agree that it's a large change, although I think there are reasonable justifications behind it. In particular, the two underlying problems, the lingering 4-byte IDs in some tables and the ints-as-doubles issues in Redis and JavaScript, both feel like technical debt that will be bumped up against sooner or later. I'd argue that it's much easier to plan to fix them now, and thoroughly test before rolling out any changes, than in an "oh whoops, now what?" situation for a large instance later.

I'm definitely recommending carefully considering and testing what might go on with these changes, and I'm happy to help to the extent that I can. I made this PR not to get it merged immediately, but to try to see what can be done about some of these things in the near-to-medium term.

(In this PR, all IDs (including StreamEntry IDs) are converted to non-serial IDs, to avoid any problems like that, but yes, that's one of the reasons that just incrementing some counters on a random/timed basis is a risky solution. 😉)

@Gargron (Member) commented Sep 4, 2017

@nightpool A couple comments:

  • ActivityPub makes no use of StreamEntry and we recently changed embed URLs to be based on status, not StreamEntry. StreamEntry is a legacy from the earliest days and it's on its way out - one day we'll be able to get rid of it
  • If this PR does what I think it does, the real benefit is not unguessability of database size (it's a small perk though I don't really see a threat model here), it's that we'll finally be able to backload old statuses without messing up chronology. This is what I'd been wishing for since I opened that Snowflake IDs issue

Most id columns have already been bigserial/bigint for future-proofness, but there's a good point here about the fact that if we keep putting them into JSON integers, one day, on any instance, it's going to cause trouble with integer overflow. So that change needs to happen one way or another.

Now, about the timing. This is a quite big PR so it will take a while for me to read through. We're also in the middle of the 1.6 release candidate phase, and I don't think such a big set of changes should go into 1.6 at this point, it might break something in subtle ways and delay our release even longer. However as we plan to bump to 2.0 soon after, this might be a good fit for that. Stringifying IDs in the API can be one of the breaking changes allowed by a major version bump.

@aschmitz (Author) commented Sep 5, 2017

Being able to properly backfill statuses without messing up ordering is also a good point: the importer may have to twiddle a few of the lower bits at times to avoid duplicate IDs, but that can be done easily enough (particularly in batch backfill jobs).
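Backdating an ID for an imported status would amount to generating the same layout from the status's original timestamp, retrying with a different low word on collision, as described above. A hypothetical sketch (the names and retry policy are illustrative, not this PR's code):

```ruby
require 'securerandom'

# Hypothetical backdated-ID generator: derive the top 48 bits from the
# status's original creation time (in ms) and randomize the low 16 bits,
# retrying if the candidate collides with an already-used ID.
def backdated_id(created_at_ms, taken)
  8.times do
    candidate = (created_at_ms << 16) | SecureRandom.random_number(1 << 16)
    return candidate unless taken.include?(candidate)
  end
  raise 'could not find a free ID within this millisecond'
end
```

Because the high bits come from the historical timestamp, a backfilled status sorts into the timeline at its original position rather than at import time.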

I'd agree that this makes sense as more of a major version bump feature, particularly with the stringified IDs in the API: I have no particular need to get this in to 1.6, so if 2.0 is not far off, that seems like a good target.

(I expect that there will be unrelated database changes between now and then, and I can rebase this branch if/as necessary.)

A Collaborator left a review comment:

Since 1.6.1 is out, it would be better to merge as soon as possible, to leave enough time for review (unless a 1.6.2 release is planned).

I made some comments, though they are not blocking merging.

@@ -7,6 +7,9 @@ class FeedManager

MAX_ITEMS = 400

# Must be less than MAX_ITEMS or the tracking sets will grow forever

@akihikodaki (Collaborator) commented Sep 18, 2017:
That should be in spec.

# either action is appropriate.
def add_to_feed(timeline_type, account, status)
timeline_key = key(timeline_type, account.id)
reblog_key = "#{timeline_key}:reblogs"

@akihikodaki (Collaborator) commented Sep 18, 2017:

`reblog_key` should be a method, for consistency.

@@ -0,0 +1,18 @@
class IdsToBigints < ActiveRecord::Migration[5.1]
def change

@akihikodaki (Collaborator) commented Sep 18, 2017:

It would be better if this were reversible (currently it is not, because of the regex).

@aschmitz (Author) commented Sep 18, 2017:
Given that we currently have a mix of bigint and integer types, the only way to do this is to list them all explicitly. If that's desirable, I can do it. (Mostly, I did it this way so I could move the migration around to catch the latest columns before this gets merged. Now that it's actually getting merged, explicitly naming all the integer _id columns is fine, if lengthy.)

aschmitz added 3 commits Sep 18, 2017
* Address review comments

This addresses the first two comments from review of this feature:

#4801 (comment)
#4801 (comment)

This adds an optional argument to FeedManager#key, the subtype of feed
key to generate. It also tests to ensure that FeedManager's settings are
such that reblogs won't be tracked forever.
* Hardcode IdToBigints migration columns

This addresses a comment during review:
#4801 (comment)

This means we'll need to make sure that all _id columns going forward
are bigints, but that should happen automatically in most cases.
@aschmitz (Author) commented Sep 18, 2017

Thanks for the review! I've made changes to address those comments. I do have one more fix for the JavaScript UI where I missed a numeric cast, so please don't merge this until I get that sorted out this evening.

@nightpool dismissed akihikodaki’s stale review Sep 18, 2017

see comment by aschmitz above

@Gargron (Member) commented Sep 18, 2017

What is the benefit of using timestamp-based IDs for accounts and literally all other tables? Can we limit that part to statuses only? Most of the tables aren't even exposed to the outside.

@aschmitz (Author) commented Sep 18, 2017

@Gargron (Member) commented Sep 18, 2017

Oh but I don't mean the bigint upgrade. In fact, bigint is default for primary IDs starting with Rails 5.1 (but Mastodon tables have been created before that). So that upgrade is good. But we only need chronological sorting for statuses. I'd like to minimize the surface area of this change.

@akihikodaki (Collaborator) commented Sep 18, 2017

By the way, if other tables do not get snowflake IDs, should their IDs stay numeric in JSON, or should they be stringified at the same time status IDs are?
In my opinion, they should be stringified as well, for consistency and to ease future changes.

@Gargron (Member) commented Sep 18, 2017

Yes, all primary IDs should be bigint and all bigints should be stringified in the API; I do not argue with that, I am in favour of it.

@aschmitz (Author) commented Sep 18, 2017

@akihikodaki (Collaborator) commented Sep 19, 2017

I do not think the IDs of other tables really matter (both snowflake and good old serial IDs are fine), but I will leave some arguments here:
Snowflake IDs assume the objects are ordered chronologically, but there is no reason to adopt that order in other tables; they may end up arranged differently. We cannot really foresee that, so introducing snowflake IDs there does not necessarily broaden the possibilities for future features.

aschmitz added 2 commits Sep 19, 2017
These should be the last two. They were found by using eslint to flag
any plain casts to JavaScript numbers. (Some such casts are
legitimate, but these were not.)

Adding the following to .eslintrc.yml will identify casts to numbers:

~~~
  no-restricted-syntax:
  - warn
  - selector: UnaryExpression[operator='+'] > :not(Literal)
    message: Avoid the use of unary +
  - selector: CallExpression[callee.name='Number']
    message: Casting with Number() may coerce string IDs to numbers
~~~

The remaining three casts appear legitimate: two casts to array indices,
one in a server to turn an environment variable into a number.
@aschmitz (Author) commented Sep 19, 2017

"Normal" serial numbers are of course ordered chronologically as well, and it's not clear why you'd benefit from any other order for IDs that must be unique but determinable in advance. That's probably the nature of unforeseen future features, though.

I do still think that using snowflake IDs for everything has benefits I've outlined above, but given that the change can be done in a relatively simple post-migration hook that shouldn't conflict with anything else, I'd be fine adding that hook for my own instance, and won't fight over it here. (As long as we're agreed on bigint IDs and stringified IDs in JSON, that shouldn't run into any issues.)

If you think it's best to restrict those changes to a few tables, let me know, and I can rework the code to do so. A configurable flag might be nicer, but I suspect not of interest(?).

(Separately, I've found and fixed what I believe to be the last two Number/string issues in the JS UI in a05351e.)

@aschmitz (Author) commented Sep 19, 2017

One last tradeoff to consider: in the initial PR, I had said "Note that existing feeds will temporarily appear with reblogged statuses shown but not as reblogs. This was believed to be preferable to rebuilding all feeds, although code for doing so could be implemented." I believe we can have a migration to "fix" this (that is, not change whether old reblogs are displayed as reblogs): it will have to visit all feeds in Redis, but it would not have to hit the database beyond getting a list of user accounts. If this is desirable, let me know, and I can write it up.

I believe once the snowflakiness of non-Status tables and this are resolved, this should be ready to merge, although I'm always open to additional review.

accounts = Account.where(id: account_ids).select('id')
# .where doesn't guarantee that our results are in the same order
# we requested them, so return the "right" order to the requestor.
@accounts = accounts.index_by(&:id).values_at(*account_ids)

@krainboltgreene (Member) commented Sep 19, 2017:

We should be ordering in SQL as much as possible.

@aschmitz (Author) commented Sep 19, 2017:

In general, I agree. However, the API for this function allows users to supply an arbitrarily-ordered set of IDs, and the tests imply they're expected to be returned in the same order (although I can't find any docs saying one way or the other). Doing this ordering in SQL would require synthesizing a CASE to sort by, which seems unlikely to be cleaner or faster.

In practice, this should be roughly O(n) (to the extent that you believe hashes are really O(1) insertion/lookup), with very small values of n. (Also in practice, the endpoint currently doesn't actually guarantee you get entries back in the same order you requested them, so we could just toss this change.)

If you think we should generate the CASE and do it in SQL, I can write that code, it just felt unnecessary here.
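The reorder-in-Ruby approach from the snippet above is easy to see in miniature; here Account is stood in for by a simple Struct, and the ActiveSupport `index_by(&:id)` idiom is spelled out in plain Ruby:

```ruby
# Re-ordering query results to match the caller-supplied ID order, as
# in the snippet under discussion. Account is a stand-in Struct.
Account = Struct.new(:id)

requested_ids = [30, 10, 20]
rows = [Account.new(10), Account.new(20), Account.new(30)] # arbitrary DB order

by_id   = rows.map { |r| [r.id, r] }.to_h   # index_by(&:id) in ActiveSupport
ordered = by_id.values_at(*requested_ids)

ordered.map(&:id) # => [30, 10, 20]
```

One hash build plus one lookup per requested ID, hence the roughly O(n) claim above; the SQL alternative would synthesize an ORDER BY CASE over the same ID list.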

@nightpool (Collaborator) commented Sep 20, 2017:

My understanding is that our code doesn't depend on this anywhere, and it was just ease of use for the tests. I can ask app developers though.

@krainboltgreene (Member) commented Sep 19, 2017

I would love to see the frontend int -> string diff as a standalone PR.

aschmitz added 2 commits Oct 3, 2017
No functional changes.
conn = ActiveRecord::Base.connection

# Make sure we don't already have a `timestamp_id` function.
unless conn.execute(

@akihikodaki (Collaborator) commented Oct 3, 2017:

You should do:

unless conn.execute(<<~SQL)
  SELECT …;
SQL

@aschmitz (Author) commented Oct 3, 2017:

TIL.

@Gargron (Member) commented Oct 3, 2017

Not to push too hard, but can we get some example backfilling code, maybe as a separate PR based on this branch? Unless remote statuses get the correct sorting, there is not enough benefit to justify such a huge change.

@aschmitz (Author) commented Oct 3, 2017

Sure. Are there multiple places the backfilling is being done these days, or is it all in one place already?

@Gargron (Member) commented Oct 3, 2017

Grep for `Status.create` in app/lib/activitypub and app/lib/ostatus; I believe those are the only places.

@Gargron referenced this pull request Oct 3, 2017 (14 of 14 tasks complete)
A Collaborator left a review comment:

I meant the previous comment to apply to all heredocs.

@aschmitz (Author) commented Oct 4, 2017

D'oh. Yes, just a moment. (I also have code for @Gargron's request, but I'll do the heredoc fixes first.) I also want to fix `rake db:setup` with `RAILS_ENV=development`, where Rake appears to try to handle the test and development databases separately and skips declaring the `timestamp_id` function in one of them.

aschmitz added 2 commits Oct 4, 2017
This matches the behavior in Rails'
ActiveRecord::Tasks::DatabaseTasks.each_current_configuration, which
would otherwise break `rake db:setup` in development.

It also moves some functionality out to a library, which will be a good
place to put additional related functionality in the near future.
@aschmitz (Author) commented Oct 4, 2017

I'm going to claim Rails is behaving badly in doing internal, private things to handle loading the schema for test and development simultaneously in development, but 07baf2f should fix that either way.

Code Climate gives new reports: Rubocop flags `define_timestamp_id` as too long (most of it being comments, and almost all of the rest multiline SQL queries), and Brakeman claims there may be SQL injection (because it contains the string `table_name`? Or because it inlines `SecureRandom.hex(16)`?). As far as I'm concerned, both are ignorable, so I can add the appropriate directives, but let me know if you think they should be addressed some other way.

I'll submit a WIP PR to generically handle backdating timestamp IDs shortly, once I move stuff around into the new library.

A Collaborator left a review comment:

Thank you for a month of contributions.

@Gargron merged commit 468523f into tootsuite:master on Oct 4, 2017
1 of 2 checks passed: codeclimate (2 issues to fix); continuous-integration/travis-ci/pr (the Travis CI build passed)
Gargron added a commit that referenced this pull request Oct 8, 2017
Regression from #4801
Gargron added a commit that referenced this pull request Oct 8, 2017
Regression from #4801
rutan added a commit to rutan/mastodon that referenced this pull request Oct 11, 2017
rutan added a commit to rutan/mastodon that referenced this pull request Oct 11, 2017
* Use non-serial IDs

This change makes a number of nontrivial tweaks to the data model in
Mastodon:

* All IDs are now 8 byte integers (rather than mixed 4- and 8-byte)
* IDs are now assigned as:
  * Top 6 bytes: millisecond-resolution time from epoch
  * Bottom 2 bytes: serial (within the millisecond) sequence number
  * See /lib/tasks/db.rake's `define_timestamp_id` for details, but
    note that the purpose of these changes is to make it difficult to
    determine the number of objects in a table from the ID of any
    object.
* The Redis sorted set used for the feed will have values used to look
  up toots, rather than scores. This is almost always the same as the
  existing behavior, except in the case of boosted toots. This change
  was made because Redis stores scores as double-precision floats,
  which cannot store the new ID format exactly. Note that this doesn't
  cause problems with sorting/pagination, because ZREVRANGEBYSCORE
  sorts lexicographically when scores are tied. (This will still cause
  sorting issues when the ID gains a new significant digit, but that's
  extraordinarily uncommon.)

Note a couple of tradeoffs have been made in this commit:

* lib/tasks/db.rake is used to enforce many/most column constraints,
  because this commit seems likely to take a while to bring upstream.
  Enforcing a post-migrate hook is an easier way to maintain the code
  in the interim.
* Boosted toots will appear in the timeline as many times as they have
  been boosted. This is a tradeoff due to the way the feed is saved in
  Redis at the moment, but will be handled by a future commit.

This would effectively close Mastodon's tootsuite#1059, as it is a
snowflake-like system of generating IDs. However, given how involved
the changes were simply within Mastodon, it may have unexpected
interactions with some clients, if they store IDs as doubles
(or as 4-byte integers). This was a problem that Twitter ran into with
their "snowflake" transition, particularly in JavaScript clients that
treated IDs as JS integers, rather than strings. It therefore would be
useful to test these changes at least in the web interface and popular
clients before pushing them to all users.

* Fix JavaScript interface with long IDs

Somewhat predictably, the JS interface handled IDs as numbers, which in
JS are IEEE double-precision floats. This loses some precision when
working with numbers as large as those generated by the new ID scheme,
so we instead handle them here as strings. This is relatively simple,
and doesn't appear to have caused any problems, but should definitely
be tested more thoroughly than the built-in tests. Several days of use
appear to support this working properly.

BREAKING CHANGE:

The major(!) change here is that IDs are now returned as strings by the
REST endpoints, rather than as integers. In practice, relatively few
changes were required to make the existing JS UI work with this change,
but it will likely hit API clients pretty hard: it's an entirely
different type to consume. (The one API client I tested, Tusky, handles
this with no problems, however.)

Twitter ran into this issue when introducing Snowflake IDs, and decided
to instead introduce an `id_str` field in JSON responses. I have opted
to *not* do that, and instead force all IDs to 64-bit integers
represented by strings in one go. (I believe Twitter exacerbated their
problem by rolling out the changes three times: once for statuses, once
for DMs, and once for user IDs, as well as by leaving an integer ID
value in JSON. As they said, "If you’re using the `id` field with JSON
in a Javascript-related language, there is a very high likelihood that
the integers will be silently munged by Javascript interpreters. In most
cases, this will result in behavior such as being unable to load or
delete a specific direct message, because the ID you're sending to the
API is different than the actual identifier associated with the
message." [1]) However, given that this is a significant change for API
users, alternatives or a transition time may be appropriate.

1: https://blog.twitter.com/developer/en_us/a/2011/direct-messages-going-snowflake-on-sep-30-2011.html

* Restructure feed pushes/unpushes

This was necessary because the previous behavior used Redis zset scores
to identify statuses, but those are IEEE double-precision floats, so we
can't actually use them to identify all 64-bit IDs. However, it leaves
the code in a much better state for refactoring reblog handling /
coalescing.
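A quick sketch of the underlying limitation (illustrative, not code from this change): two distinct 64-bit IDs can collapse to the same double, so scores cannot serve as identifiers.

~~~javascript
// Why Redis zset *scores* (IEEE doubles) can't identify all 64-bit
// status IDs: adjacent large IDs collapse to one double value.
const a = 9007199254740993n; // 2^53 + 1
const b = 9007199254740992n; // 2^53

// As scores, both would be stored as the same double:
console.log(Number(a) === Number(b)); // true: indistinguishable

// As zset *members* (strings), they remain distinct:
console.log(a.toString() === b.toString()); // false
~~~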

Feed-management code has been consolidated in FeedManager, including:

* BatchedRemoveStatusService no longer directly manipulates feed zsets
* RemoveStatusService no longer directly manipulates feed zsets
* PrecomputeFeedService has moved its logic to FeedManager#populate_feed

(PrecomputeFeedService largely made lots of calls to FeedManager, but
didn't follow the normal adding-to-feed process.)

This has the effect of unifying all of the feed push/unpush logic in
FeedManager, making it much more tractable to update it in the future.

Due to some additional checks that must be made during, for example,
batch status removals, some Redis pipelining has been removed. It does
not appear that this should cause significantly increased load, but if
necessary, some optimizations are possible in batch cases. These were
omitted in the pursuit of simplicity, but a batch_push and batch_unpush
would be possible in the future.

Tests were added to verify that pushes happen under expected conditions,
and to verify reblog behavior (both on pushing and unpushing). In the
case of unpushing, this includes testing behavior that currently leads
to confusion such as Mastodon's tootsuite#2817, but this codifies that the
behavior is currently expected.

* Rubocop fixes

I could swear I made these changes already, but I must have lost them
somewhere along the line.

* Address review comments

This addresses the first two comments from review of this feature:

tootsuite#4801 (comment)
tootsuite#4801 (comment)

This adds an optional argument to FeedManager#key, the subtype of feed
key to generate. It also tests to ensure that FeedManager's settings are
such that reblogs won't be tracked forever.

* Hardcode IdToBigints migration columns

This addresses a comment during review:
tootsuite#4801 (comment)

This means we'll need to make sure that all _id columns going forward
are bigints, but that should happen automatically in most cases.

* Additional fixes for stringified IDs in JSON

These should be the last two. They were identified by using eslint to
find any plain casts to JavaScript numbers. (Some such casts are
legitimate, but these were not.)

Adding the following to .eslintrc.yml will identify casts to numbers:

~~~
  no-restricted-syntax:
  - warn
  - selector: UnaryExpression[operator='+'] > :not(Literal)
    message: Avoid the use of unary +
  - selector: CallExpression[callee.name='Number']
    message: Casting with Number() may coerce string IDs to numbers
~~~

The remaining three casts appear legitimate: two are casts to array
indices, and one, in server code, turns an environment variable into a
number.
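As a hypothetical illustration (the `statusId` variable is invented), both patterns those rules flag coerce a string ID into a lossy JS number:

~~~javascript
// Hypothetical examples of code the eslint rules above would flag.
const statusId = '9007199254740993';

const viaUnary = +statusId;         // flagged: unary + on a non-literal
const viaNumber = Number(statusId); // flagged: Number() cast

console.log(viaUnary === viaNumber);        // true: both coerce the same way
console.log(String(viaUnary) === statusId); // false: precision was lost
~~~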

* Only implement timestamp IDs for Status IDs

Per discussion in tootsuite#4801, this is only being merged for Status
IDs at this point. We do this in a migration, as there is no longer a
use for a post-migration hook. We keep the initialization of the
timestamp_id function as a Rake task, as it is also needed after
db:schema:load (db/schema.rb doesn't store Postgres functions).
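The scheme packs a millisecond-resolution timestamp into the top 6 bytes of each ID and a per-millisecond sequence number into the bottom 2. A minimal JavaScript sketch of that layout (illustrative only; the real implementation is the `timestamp_id` Postgres function defined via lib/tasks/db.rake):

~~~javascript
// Illustrative sketch of the ID layout only, not the actual Postgres
// timestamp_id function.
function timestampId(epochMs, seq) {
  // Upper 48 bits: milliseconds since epoch; lower 16 bits: sequence.
  return (BigInt(epochMs) << 16n) | (BigInt(seq) & 0xffffn);
}

const id = timestampId(Date.UTC(2017, 9, 4), 1); // 2017-10-04, seq 1

// The timestamp and sequence can be unpacked again:
console.log((id >> 16n) === BigInt(Date.UTC(2017, 9, 4))); // true
console.log((id & 0xffffn) === 1n); // true
~~~

Because the timestamp occupies the high bits, IDs created more than a millisecond apart sort in creation order, while the count of rows in the table stays hidden.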

* Change internal streaming payloads to stringified IDs as well

This is equivalent to 591a9af from
tootsuite#5019, with an extra change for the addition to FeedManager#unpush.

* Ensure we have a status_id_seq sequence

Apparently this is not a given when specifying a custom ID function,
so now we ensure it gets created. This uses the generic version of this
function to more easily support adding additional tables with timestamp
IDs in the future, although it would be possible to cut this down to a
less generic version if necessary. It is only run during db:schema:load
or the relevant migration, so the overhead is extraordinarily minimal.

* Transition reblogs to new Redis format

This provides a one-way migration to transition old Redis reblog entries
into the new format, with a separate tracking entry for reblogs.

It is not invertible because doing so could (if timestamp IDs are used)
require a database query for each status in each user's feed, which
would likely be a significant toll on major instances.

* Address review comments from @akihikodaki

No functional changes.

* Additional review changes

* Heredoc cleanup

* Run db:schema:load hooks for test in development

This matches the behavior in Rails'
ActiveRecord::Tasks::DatabaseTasks.each_current_configuration, which
would otherwise break `rake db:setup` in development.

It also moves some functionality out to a library, which will be a good
place to put additional related functionality in the near future.
rutan added a commit to rutan/mastodon that referenced this pull request Oct 11, 2017
takayamaki added a commit to takayamaki/mastodon that referenced this pull request Oct 12, 2017
takayamaki added a commit to takayamaki/mastodon that referenced this pull request Oct 12, 2017
takayamaki added a commit to takayamaki/mastodon that referenced this pull request Oct 12, 2017
cobodo pushed a commit to cobodo/mastodon that referenced this pull request Oct 20, 2017
cobodo pushed a commit to cobodo/mastodon that referenced this pull request Oct 20, 2017