Add #cache_key to ActiveRecord::Relation. #20884

Merged 1 commit on Aug 2, 2015

afcapel
Contributor

@afcapel afcapel commented Jul 14, 2015

The Rails caching guide suggests creating a cache key for collections of objects using the collection size and the timestamp of the last updated record. We could automate the process by baking this logic into ActiveRecord::Relation.

    @users = User.where("name like ?", "%Alberto%")
    @users.cache_key
    => "/users/query-5942b155a43b139f2471b872ac54251f-3-20150714212107656125000"

The end result is that you can have cache keys for any relation, and they will expire when any of the records matching the relation's query changes.
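
A minimal plain-Ruby sketch of how such a key could be assembled, assuming the three parts visible in the example output above: an MD5 digest of the SQL, the record count, and the newest timestamp. The helper name `sketch_cache_key`, the `Row` struct, and the exact timestamp format are illustrative assumptions, not the code from this PR.

```ruby
require "digest"

# Hypothetical stand-in for a loaded record; not part of ActiveRecord.
Row = Struct.new(:id, :updated_at)

# Hypothetical helper: builds "<table>/query-<digest>-<count>-<timestamp>".
def sketch_cache_key(table_name, sql, rows)
  query_signature = Digest::MD5.hexdigest(sql)
  newest = rows.map(&:updated_at).compact.max
  key = "#{table_name}/query-#{query_signature}-#{rows.size}"
  # Assumed format: UTC timestamp down to nanoseconds, digits only.
  key += "-#{newest.getutc.strftime('%Y%m%d%H%M%S%9N')}" if newest
  key
end

rows = [
  Row.new(1, Time.utc(2015, 7, 14, 21, 21, 7)),
  Row.new(2, Time.utc(2015, 7, 13, 9, 0, 0)),
  Row.new(3, Time.utc(2015, 7, 12, 9, 0, 0))
]
sql = %q(SELECT "users".* FROM "users" WHERE (name like '%Alberto%'))
key = sketch_cache_key("users", sql, rows)
puts key
```

Touching any matching row, adding one, or removing one changes the timestamp or count part, while a different query changes the digest part.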

#
# Product.where("name like ?", "%Game%").cache_key(:last_reviewed_at)
def cache_key(timestamp_column = :updated_at)
query_signature = Digest::MD5.hexdigest(to_sql)

Contributor

  1. Perhaps use SHA over MD5?
  2. The count and max logic, select count(*), max(updated_at), could be written manually so that it only needs one SQL statement.
  3. What if the table has no updated_at column?
  4. Do we need to expose the internal logic of sql+count+updated? If not, we could use a single hash function (Digest::SHA1.hexdigest("#{to_sql}-#{count}-#{timestamp}")) and return a shorter string.
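
To make point 4 concrete, here is a rough sketch of the single-digest alternative next to the three-part key. The SQL, count, and timestamp values are made up for illustration; in the PR they would come from the relation.

```ruby
require "digest"

# Illustrative values only; in the PR these come from the relation.
sql       = %q(SELECT "users".* FROM "users" WHERE (name like '%Alberto%'))
count     = 3
timestamp = "20150714212107656125000"

# Readable variant: each component stays visible in the key.
readable_key = "users/query-#{Digest::MD5.hexdigest(sql)}-#{count}-#{timestamp}"

# Opaque variant: one SHA1 digest over all three components.
payload    = "#{sql}-#{count}-#{timestamp}"
opaque_key = "users/#{Digest::SHA1.hexdigest(payload)}"

puts readable_key.length
puts opaque_key.length
```

The opaque key is shorter, but it no longer shows which component changed when two keys differ.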

Contributor Author

@egilburg thanks for your feedback, here are my thoughts:

  1. Although MD5 can have some issues for cryptography, I think it is fine for fingerprinting. It's somewhat less CPU intensive, and it's what the Asset Pipeline uses for fingerprints.
  2. That's a good idea, although I'll have to check if I can create the query in a DB-independent manner.
  3. It will fail hard with a missing column exception. I think that's the right behaviour.
  4. The string length is not an issue, AFAICT. The average key length is ~75 chars (depending on the model name), while memcached keys can be up to 250 bytes. I'd prefer to optimise for the readability of the key.

Contributor

If the key is used in a client-side query or in HTML source code, this will leak the total record count of the whole DB table, which may be a privacy issue in multi-tenant systems.

Contributor

Presumably it'd only leak the count of the query you're displaying, which would be scoped to the tenant.

@egilburg
Contributor

I also guess it doesn't detect changes in tables pulled in via includes, right? Probably a good idea to document that caveat.

@afcapel
Contributor Author

afcapel commented Jul 15, 2015

@egilburg again, I'm trying to follow the same patterns as ActiveRecord::Base#cache_key, which won't change automatically when a dependent record changes unless you make that record touch the parent object. You need to understand the principles of Russian doll caching, but I think that explanation belongs in the Rails caching guide, not in this particular method's documentation.

#
# Under the hood this triggers two SQL queries:
#
# SELECT COUNT(*) FROM "products" WHERE (name like '%Cosmic Encounter%')

Contributor

Why is this query needed? Is there any case in which the count can change, without the updated_at of the newest record changing? (Aside from deliberately bypassing this?)

Contributor Author

@sgrif if you delete a record other than the last one, the collection will change but the newest updated_at will be the same.
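
A small plain-Ruby illustration of that point, with a hypothetical `Row` struct standing in for database rows: removing a record that is not the newest leaves the maximum updated_at untouched, so only the count reveals the change.

```ruby
# Hypothetical stand-in for database rows; not ActiveRecord.
Row = Struct.new(:id, :updated_at)

rows = [
  Row.new(1, Time.utc(2015, 7, 10)),
  Row.new(2, Time.utc(2015, 7, 12)),
  Row.new(3, Time.utc(2015, 7, 14)) # newest record
]

before = [rows.size, rows.map(&:updated_at).max]

# Delete a record other than the newest one.
remaining = rows.reject { |row| row.id == 1 }
after = [remaining.size, remaining.map(&:updated_at).max]

puts "max unchanged: #{before[1] == after[1]}"
puts "count changed: #{before[0] != after[0]}"
```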

Contributor

Ah, right. We should still condense this down to a single query though (and change COUNT(*) to COUNT(updated_at) so we only need to access a single column).

Contributor Author

@sgrif That makes sense, I'll prepare the change.

Member

COUNT(updated_at) actually needs to read all the values for that column... COUNT(*) only counts the rows, so it does not.

@bogdan
Contributor

bogdan commented Jul 18, 2015

This method requires an index on the updated_at column. That requirement sometimes makes it unusable if the table is large enough. This feature is definitely unusable in my project.

I am not sure Rails should include such features. You may rely on it when your DB is small, but in 1-2 years you will have to make it smarter or choose another caching strategy.

@sgrif
Contributor

sgrif commented Jul 18, 2015

That's a fair concern. This method is impossible to override as defined. Perhaps it should delegate to a class method that users can override as desired.

@sgrif
Contributor

sgrif commented Jul 18, 2015

I'd also like to see some more thorough tests for this behavior. Asserting that the hash of the query does change based on the SQL performed (maybe asserting that WHERE 1=1 changes the cache key). I'd also like to see explicit tests for the count portion, with and without a scope to ensure it matches that of the relation.

@afcapel
Contributor Author

afcapel commented Jul 18, 2015

@bogdan I don't see how that is different from any AR method querying a column other than id, for instance User.where(email: "..."). You have to add indexes to the columns you are going to scan in big tables, but that concern is left to the developer, as it depends on many factors that Rails cannot foresee.

@sgrif
Contributor

sgrif commented Jul 18, 2015

@afcapel Regardless, the user needs to be able to change this behavior if they desire. We shouldn't be locking them into one implementation, the same way as they can override it for a single value. I think the simplest solution is to have the method be ActiveRecord::Base.collection_cache_key(collection, timestamp_column = :updated_at) and have the implementation on Relation be nothing more than

def cache_key(timestamp_column = :updated_at)
  @klass.collection_cache_key(self, timestamp_column)
end
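
The delegation proposed above can be sketched in plain Ruby. `FakeRelation`, `BaseModel`, `Product`, and `Invoice` are hypothetical stand-ins rather than ActiveRecord classes, and the key formats are invented; the point is only that the relation forwards to a class-level hook that a model may replace.

```ruby
# Plain-Ruby sketch; every class here is a hypothetical stand-in.
class FakeRelation
  attr_reader :klass, :rows

  def initialize(klass, rows)
    @klass = klass
    @rows = rows
  end

  # The relation does nothing but delegate to its model class.
  def cache_key(timestamp_column = :updated_at)
    @klass.collection_cache_key(self, timestamp_column)
  end
end

class BaseModel
  # Default strategy: record count plus newest timestamp.
  def self.collection_cache_key(collection, timestamp_column = :updated_at)
    newest = collection.rows.map { |row| row[timestamp_column] }.compact.max
    "#{name.downcase}s/#{collection.rows.size}-#{newest}"
  end
end

class Product < BaseModel; end

class Invoice < BaseModel
  # A model can swap in its own strategy without touching the relation.
  def self.collection_cache_key(collection, _timestamp_column = :updated_at)
    "invoices/custom-#{collection.rows.map { |row| row[:id] }.join(',')}"
  end
end

rows = [{ id: 1, updated_at: 10 }, { id: 2, updated_at: 20 }]
product_key = FakeRelation.new(Product, rows).cache_key
invoice_key = FakeRelation.new(Invoice, rows).cache_key
puts product_key
puts invoice_key
```

With this shape, a project with very large tables could replace the default strategy for one model and leave the others untouched.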

@afcapel
Contributor Author

afcapel commented Jul 18, 2015

@sgrif I've updated the PR with your feedback, moving collection_cache_key to AR::Base, adding more tests around the query cache, and ensuring only a single query is triggered. Please let me know what you think.

@sgrif
Contributor

sgrif commented Jul 18, 2015

I have some more comments but I'm traveling the rest of the day, will comment further tonight.

size, timestamp = if collection.loaded?
  [collection.size, collection.collect(&timestamp_column).compact.max]
else
  column_type = collection.klass.type_for_attribute(timestamp_column.to_s)

Contributor

collection.klass is always self

Contributor Author

Not in cases in which you call ActiveRecord::Base.collection_cache_key(collection) like this one. It's a bit annoying, but since a class method is public, I think we need to support that kind of call.

Contributor

I do not think we should support calling this method in any case other than a relation calling this method on self.klass

Contributor

In fact, I would even go so far as to say it's worth mentioning in the documentation that you should not call this method directly.

@sgrif
Contributor

sgrif commented Jul 20, 2015

@afcapel Did you exclude the suggestion about changing collection.klass to implicit self + documenting "this method should not be called directly" because you feel strongly against that suggestion? It's completely fine if that's the case, I would just like to confirm that you feel strongly that is not the right direction to go, and not an oversight.

In either case, can you either implement that suggestion, or confirm you feel strongly it is incorrect, and then squash & rebase?

# You can also pass a custom timestamp column to fetch the timestamp of the
# last updated record.
#
# Product.where("name like ?", "%Game%").cache_key(:last_reviewed_at)

Contributor

We need to document that the user can override this behavior by implementing self.collection_cache_key on the class

@afcapel
Contributor Author

afcapel commented Jul 20, 2015

@sgrif oh, no I just missed the comments, but I think it's a good idea to make that method as "private" as possible. I'll make the change, squash and rebase.


# Generates a cache key for the records in the given collection.
# See <tt>ActiveRecord::Relation#cache_key</tt> for details.
def collection_cache_key(collection = all, timestamp_column = :updated_at)

Contributor

I would actually prefer to mark this method as # :nodoc: since it only serves as a hook for Relation#cache_key

@sgrif
Contributor

sgrif commented Jul 20, 2015

❤️ I just left 2 more comments about docs, besides that, this is good to merge. Ping me once updated. (The wife wants to head to the movies so I might not get to it tonight)

@afcapel afcapel force-pushed the ar-relation-cache-key branch from e1120a7 to 476e3f5 Compare July 20, 2015 00:48
@afcapel
Contributor Author

afcapel commented Jul 20, 2015

@sgrif nice, the PR is already updated. Thanks for your help!

#
# You can customize the strategy to generate the key on a per model basis
# overriding ActiveRecord::Base#collection_cache_key.
def cache_key(timestamp_column = :updated_at)

Contributor

For clarity, perhaps rename timestamp_column to cache_key_column. For example, tables without a timestamp but with an auto-incrementing id could use the id column by defining:

def cache_key
  super(:id)
end

On the other hand, this would introduce a discrepancy with the definition of collection_cache_key. So another alternative would be to extract a class method cache_key_column, which returns :updated_at by default and which the user could override.

Contributor Author

Sorry, I'm not sure if cache_key_column is a better name. To me def cache_key(cache_key_column) sounds like we will fetch the key from a column, which is not exactly right.

@afcapel
Contributor Author

afcapel commented Jul 31, 2015

@sgrif any updates on this? Please let me know if you think it needs any more changes.

@sgrif sgrif merged commit 476e3f5 into rails:master Aug 2, 2015
sgrif added a commit that referenced this pull request Aug 2, 2015
Add #cache_key to ActiveRecord::Relation.
sgrif added a commit that referenced this pull request Aug 2, 2015
PostgreSQL is strict about the usage of `DISTINCT` and `ORDER BY`, which
one of the tests demonstrated. The order clause is never going to be
relevant in the query we're performing, so let's just remove it
entirely.
@swalkinshaw
Contributor

@afcapel before I create a new issue, I just want to bring up a potential problem here.

Right now the cache key is a combination of 3 factors: a) SQL query digest, b) record count, c) max timestamp. This covers most situations, but there are a few where it doesn't. The easiest use case to explain it with is any query involving limit.

Pseudo query: Post.order(:title).limit(5). Let's say the 5 records returned are:

[
  { title: 'A', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'B', updated_at: 2015-08-05 04:14:42 +0000 },
  { title: 'C', updated_at: 2015-08-08 04:14:42 +0000 },
  { title: 'D', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'E', updated_at: 2015-08-05 04:14:42 +0000 }
]

Now let's delete the post with title 'B' and re-run the query Post.order(:title).limit(5):

[
  { title: 'A', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'C', updated_at: 2015-08-08 04:14:42 +0000 },
  { title: 'D', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'E', updated_at: 2015-08-05 04:14:42 +0000 },
  { title: 'F', updated_at: 2015-08-05 04:14:42 +0000 }
]

Notice that the 3 factors of the cache key all remain the same.

a) same query digest
b) same count of 5
c) same max updated_at (the newest one is still there)
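
The collision can be reproduced in plain Ruby with the two result sets listed above. The `three_factor_key` helper, the `Post` struct, and the sample SQL string are illustrative assumptions that only mimic the three factors; they are not the Rails implementation.

```ruby
require "digest"

# Hypothetical stand-in for loaded records; not ActiveRecord.
Post = Struct.new(:title, :updated_at)

# Mimics the three factors: query digest, count, newest timestamp.
def three_factor_key(sql, rows)
  newest = rows.map(&:updated_at).max
  "posts/query-#{Digest::MD5.hexdigest(sql)}-#{rows.size}-#{newest.to_i}"
end

sql = %q(SELECT "posts".* FROM "posts" ORDER BY "posts"."title" ASC LIMIT 5)
day = ->(d) { Time.utc(2015, 8, d, 4, 14, 42) }

before_delete = [
  Post.new("A", day.(4)), Post.new("B", day.(5)), Post.new("C", day.(8)),
  Post.new("D", day.(4)), Post.new("E", day.(5))
]
after_delete = [
  Post.new("A", day.(4)), Post.new("C", day.(8)), Post.new("D", day.(4)),
  Post.new("E", day.(5)), Post.new("F", day.(5))
]

key_before = three_factor_key(sql, before_delete)
key_after = three_factor_key(sql, after_delete)
puts "same key: #{key_before == key_after}"
puts "same rows: #{before_delete.map(&:title) == after_delete.map(&:title)}"
```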

We ran into this issue when implementing relation cache keys in our app, and we found the only reliable solution was to add a 4th factor: a representation of all the relation's id attributes.

In Postgres we accomplished it using md5(array_agg(id || '-' || updated_at)::text) for non-loaded queries. And for already loaded ones:

results = collection.map { |item| [item.id, item.updated_at].join('-') }
Digest::SHA1.hexdigest(results.join)

You'll also note that we skipped taking only the max updated_at and used all of them, just as we did with id.
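
A self-contained sketch of the loaded-collection variant, applied to the limit example from earlier in this comment. The `Post` struct, the ids, and the `rows_signature` name are hypothetical, and the "," separator between pairs is an assumption added here so that adjacent pairs cannot run together.

```ruby
require "digest"

# Hypothetical stand-in for loaded records; not ActiveRecord.
Post = Struct.new(:id, :title, :updated_at)

# Fourth factor for a loaded collection: digest of every id/updated_at pair.
# The "," separator is an assumption added to keep adjacent pairs distinct.
def rows_signature(rows)
  pairs = rows.map { |row| "#{row.id}-#{row.updated_at.to_i}" }
  Digest::SHA1.hexdigest(pairs.join(","))
end

day = ->(d) { Time.utc(2015, 8, d, 4, 14, 42) }

before_delete = [
  Post.new(1, "A", day.(4)), Post.new(2, "B", day.(5)), Post.new(3, "C", day.(8)),
  Post.new(4, "D", day.(4)), Post.new(5, "E", day.(5))
]
# Post "B" is gone and "F" has moved into the LIMIT 5 window.
after_delete = before_delete.reject { |post| post.title == "B" } +
               [Post.new(6, "F", day.(5))]

sig_before = rows_signature(before_delete)
sig_after = rows_signature(after_delete)
puts "signatures differ: #{sig_before != sig_after}"
```

Count and max updated_at are identical for both sets, yet the signature differs because the set of ids changed.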

@manuelmeurer
Contributor

I agree with @swalkinshaw, ran into the same situation where max updated_at, collection size and SQL query digest are not "sufficiently unique".

@christos
Contributor

christos commented Sep 4, 2015

I've had the same issues as @swalkinshaw and @manuelmeurer using this approach. I've created PR #21503 with a more complete solution similar to @swalkinshaw. See the PR description for more details.

Assuming we are correct, @sgrif, if this PR goes into Rails 5 it could cause a lot of cache debugging headaches for people. Can you either review #21503 or revert this PR?

jgraichen added a commit to openmensa/openmensa that referenced this pull request May 9, 2021
Rails has its own #cache_key implementation for collections introduced
in Rails 5: rails/rails#20884

The key is different and not tuned to postgresql, but allows us to drop
extra code/patches.
9 participants