Add #cache_key to ActiveRecord::Relation. #20884
Conversation
```ruby
#
#   Product.where("name like ?", "%Game%").cache_key(:last_reviewed_at)
def cache_key(timestamp_column = :updated_at)
  query_signature = Digest::MD5.hexdigest(to_sql)
```
- Perhaps use SHA over MD5?
- The logic `select count(*), max(updated_at)` could be written manually to only need one SQL statement.
- What if the table has no `updated_at` column?
- Do we need to expose the internal logic of sql+count+updated? If not, we could use a single hash function (`Digest::SHA1("#{to_sql}-#{count}-#{timestamp}")`) and have a shorter string to return.
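The single-digest alternative suggested here can be sketched in plain Ruby. This is only an illustration of the idea: `compact_cache_key` is a hypothetical helper, not Rails API, and the inputs are assumed.

```ruby
require 'digest'

# Hypothetical helper illustrating the suggestion: fold the SQL, the record
# count, and the max timestamp into a single SHA1, so the key stays short
# no matter how long the query is.
def compact_cache_key(sql, count, timestamp)
  Digest::SHA1.hexdigest("#{sql}-#{count}-#{timestamp}")
end

sql = %q(SELECT "products".* FROM "products" WHERE (name like '%Game%'))
key = compact_cache_key(sql, 5, Time.utc(2015, 8, 8, 4, 14, 42))
puts key.length  # always 40 hex characters, regardless of the query's length
```

Any change to the SQL, the count, or the timestamp produces a different digest, which is what makes the combined hash usable as a cache key.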
@egilburg thanks for your feedback, here are my thoughts:
- Although MD5 can have some issues for cryptography I think it is fine for fingerprinting. It's somewhat less CPU intensive and it's what the Asset Pipeline uses for fingerprints.
- That's a good idea, although I'll have to check if I can create the query in a DB independent manner.
- It will fail hard with a missing column exception. I think that's the right behaviour.
- The string length isn't an issue, AFAICT. The average key length is ~75 chars (depending on the model name), while memcached keys can be up to 255 chars. I'd prefer to optimise for the readability of the key.
If used in client-side queries or HTML source code, this will leak the total record count of the whole DB table, which may be a privacy issue in multi-tenant systems.
Presumably it'd only leak the count of the query you're displaying, which would be scoped to the tenant.

I also guess it doesn't detect changes on

@egilburg again, I'm trying to follow the same patterns as
```ruby
#
# Under the hood this triggers two SQL queries:
#
#   SELECT COUNT(*) FROM "products" WHERE (name like '%Cosmic Encounter%')
```
Why is this query needed? Is there any case in which the count can change without the `updated_at` of the newest record changing? (Aside from deliberately bypassing this?)
@sgrif if you delete a record other than the last one, the collection will change but the newest `updated_at` will be the same.
Ah, right. We should still condense this down to a single query though (and change `COUNT(*)` to `COUNT(updated_at)` so we only need to access a single column).
@sgrif That makes sense, I'll prepare the change.
`COUNT(updated_at)` actually needs to read all the values for that column... `COUNT(*)` only counts the rows, so it does not.
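The distinction being discussed is standard SQL semantics: `COUNT(*)` counts rows, while `COUNT(column)` counts only non-NULL values. It can be illustrated with plain Ruby arrays standing in for table rows (an illustration only, not Rails code):

```ruby
# Hashes stand in for table rows; a nil timestamp plays the role of NULL.
rows = [
  { id: 1, updated_at: Time.utc(2015, 8, 4) },
  { id: 2, updated_at: nil },                  # NULL timestamp
  { id: 3, updated_at: Time.utc(2015, 8, 8) }
]

count_star   = rows.size                                # like COUNT(*)
count_column = rows.count { |r| !r[:updated_at].nil? }  # like COUNT(updated_at)

puts count_star    # => 3
puts count_column  # => 2
```

So the two forms only agree when the timestamp column has no NULLs, which is why the choice between them matters for a cache key.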
This method requires an index on

I am not sure if Rails can include such features. You may rely on it when your DB is small, but in 1-2 years you should make it smarter or choose another caching strategy.
That's a fair concern. This method is completely unoverrideable as defined. Perhaps it should delegate to a class method that users can override as desired.

I'd also like to see some more thorough tests for this behavior. Asserting that the hash of the query does change based on the SQL performed (maybe asserting that

@bogdan I don't see how that is different from any AR method querying a column other than
@afcapel Regardless, the user needs to be able to change this behavior if they desire. We shouldn't be locking them into one implementation, the same way as they can override it for a single value. I think the simplest solution is to have the method be

```ruby
def cache_key(timestamp_column = :updated_at)
  @klass.collection_cache_key(self, timestamp_column)
end
```
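The delegation pattern being proposed can be sketched with stand-in classes. All names here are hypothetical (this is not ActiveRecord code); the point is only that the relation forwards key generation to a class-level hook, which any model can override.

```ruby
require 'digest'

# Stand-in base class: the default key strategy lives in a class method.
class FakeBase
  def self.collection_cache_key(collection, timestamp_column)
    "#{name.downcase}s/query-#{Digest::MD5.hexdigest(collection.to_sql)}"
  end
end

# Stand-in relation: delegates to the class hook, as proposed above.
class FakeRelation
  def initialize(klass, sql)
    @klass, @sql = klass, sql
  end

  attr_reader :sql
  alias to_sql sql

  def cache_key(timestamp_column = :updated_at)
    @klass.collection_cache_key(self, timestamp_column)
  end
end

class Game < FakeBase; end

class Product < FakeBase
  # A model overrides the hook to swap in its own strategy.
  def self.collection_cache_key(collection, timestamp_column)
    "products/custom-#{timestamp_column}"
  end
end

puts FakeRelation.new(Game, "SELECT 1").cache_key
puts FakeRelation.new(Product, "SELECT 1").cache_key  # => products/custom-updated_at
```

The relation itself stays implementation-free: every model picks up the default from the base class unless it defines its own hook.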
@sgrif I've updated the PR with your feedback, moving

I have some more comments but I'm traveling the rest of the day, will comment further tonight.
```ruby
size, timestamp = if collection.loaded?
  [collection.size, collection.collect(&timestamp_column).compact.max]
else
  column_type = collection.klass.type_for_attribute(timestamp_column.to_s)
```
`collection.klass` is always `self`
Not in cases in which you call `ActiveRecord::Base.collection_cache_key(collection)` like this one. It's a bit annoying, but since the class method is public I think we need to support those kinds of calls.
I do not think we should support calling this method in any case other than a relation calling this method on `self.klass`.
In fact, I would even go so far as to say it's worth mentioning in the documentation that you should not call this method directly.
@afcapel Did you exclude the suggestion about changing

In either case, can you either implement that suggestion, or confirm you feel strongly it is incorrect, and then squash & rebase?
```ruby
# You can also pass a custom timestamp column to fetch the timestamp of the
# last updated record.
#
#   Product.where("name like ?", "%Game%").cache_key(:last_reviewed_at)
```
We need to document that the user can override this behavior by implementing `self.collection_cache_key` on the class.
@sgrif oh, no, I just missed the comments, but I think it's a good idea to make that method as "private" as possible. I'll make the change, squash and rebase.
```ruby
# Generates a cache key for the records in the given collection.
# See <tt>ActiveRecord::Relation#cache_key</tt> for details.
def collection_cache_key(collection = all, timestamp_column = :updated_at)
```
I would actually prefer to mark this method as `# :nodoc:` since it only serves as a hook for `Relation#cache_key`.
❤️ I just left 2 more comments about docs; besides that, this is good to merge. Ping me once updated. (The wife wants to head to the movies so I might not get to it tonight)
Force-pushed from e1120a7 to 476e3f5.
@sgrif nice, the PR is already updated. Thanks for your help!
```ruby
#
# You can customize the strategy to generate the key on a per model basis
# overriding ActiveRecord::Base#collection_cache_key.
def cache_key(timestamp_column = :updated_at)
```
For clarity perhaps rename `timestamp_column` to `cache_key_column`. For example, tables without a timestamp but with an autosequence id could use the id column, by defining:

```ruby
def cache_key
  super(:id)
end
```

On the other hand, this would introduce a discrepancy with the definition of `collection_cache_key`. So another alternative would be to extract a class method `cache_key_column`, which returns `:updated_at` by default and which the user could override.
Sorry, I'm not sure if `cache_key_column` is a better name. To me `def cache_key(cache_key_column)` sounds like we will fetch the key from a column, which is not exactly right.
@sgrif any updates on this? Please let me know if you think it needs any more changes.
PostgreSQL is strict about the usage of `DISTINCT` and `ORDER BY`, which one of the tests demonstrated. The order clause is never going to be relevant in the query we're performing, so let's just remove it entirely.
@afcapel before I created a new issue I just want to bring up a potential problem here. Right now the cache key is a combination of 3 factors: a) SQL query digest, b) record count, c) max timestamp. This does cover most situations, but there are a few where it doesn't. The easiest use case to explain it is any query involving

Pseudo query:

```ruby
[
  { title: 'A', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'B', updated_at: 2015-08-05 04:14:42 +0000 },
  { title: 'C', updated_at: 2015-08-08 04:14:42 +0000 },
  { title: 'D', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'E', updated_at: 2015-08-05 04:14:42 +0000 }
]
```

Now let's delete the post with title 'B' and re-run the query:

```ruby
[
  { title: 'A', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'C', updated_at: 2015-08-08 04:14:42 +0000 },
  { title: 'D', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'E', updated_at: 2015-08-05 04:14:42 +0000 },
  { title: 'F', updated_at: 2015-08-05 04:14:42 +0000 }
]
```

Notice that the 3 factors of the cache key all remain the same. a) same query digest

We ran into this issue implementing relation cache keys in our app; we found the only reliable solution was to add a 4th factor: a representation of all the relation's

In Postgres we accomplished it using

```ruby
results = collection.flat_map { |item| [item.id, item.updated_at].join('-') }
Digest::SHA1.hexdigest(results.join)
```

You'll also note we also just skipped only taking the max
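The failure mode described above can be reproduced with plain hashes standing in for records (a sketch of the commenter's approach, not the Rails implementation). Both collections share the same size and the same max `updated_at`, so a count+timestamp key cannot tell them apart, while a digest over every id/timestamp pair can.

```ruby
require 'digest'

# The collection before the deletion (record ids are assumed for the sketch).
before = [
  { id: 1, title: 'A', updated_at: Time.utc(2015, 8, 4, 4, 14, 42) },
  { id: 2, title: 'B', updated_at: Time.utc(2015, 8, 5, 4, 14, 42) },
  { id: 3, title: 'C', updated_at: Time.utc(2015, 8, 8, 4, 14, 42) },
  { id: 4, title: 'D', updated_at: Time.utc(2015, 8, 4, 4, 14, 42) },
  { id: 5, title: 'E', updated_at: Time.utc(2015, 8, 5, 4, 14, 42) }
]
# 'B' deleted; 'F' slides into the limited result window.
after = [
  { id: 1, title: 'A', updated_at: Time.utc(2015, 8, 4, 4, 14, 42) },
  { id: 3, title: 'C', updated_at: Time.utc(2015, 8, 8, 4, 14, 42) },
  { id: 4, title: 'D', updated_at: Time.utc(2015, 8, 4, 4, 14, 42) },
  { id: 5, title: 'E', updated_at: Time.utc(2015, 8, 5, 4, 14, 42) },
  { id: 6, title: 'F', updated_at: Time.utc(2015, 8, 5, 4, 14, 42) }
]

# The 4th factor: a digest over every record's id and timestamp.
def row_digest(collection)
  results = collection.flat_map { |item| [item[:id], item[:updated_at]].join('-') }
  Digest::SHA1.hexdigest(results.join)
end

# Count and max timestamp are identical...
puts before.size == after.size                                                      # true
puts before.map { |r| r[:updated_at] }.max == after.map { |r| r[:updated_at] }.max  # true
# ...but the per-row digest differs, so the cache key would change.
puts row_digest(before) != row_digest(after)                                        # true
```

The trade-off is that computing this digest requires reading the id and timestamp of every row in the relation, rather than a single aggregate query.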
I agree with @swalkinshaw; I ran into the same situation where max updated_at, collection size, and SQL query digest are not "sufficiently unique".
I've had the same issues as @swalkinshaw and @manuelmeurer using this approach. I've created PR #21503 with a more complete solution, similar to @swalkinshaw's. See the PR description for more details. Assuming we are correct, @sgrif, if this PR goes into Rails 5 it could cause a lot of cache debugging headaches for people. Can you either review #21053 or revert this PR?
Rails has its own #cache_key implementation for collections introduced in Rails 5: rails/rails#20884 The key is different and not tuned to postgresql, but allows us to drop extra code/patches.
The Rails caching guide suggests creating a cache key for collections of objects using the collection size and the timestamp of the last updated record. We could automate the process by baking this logic into `ActiveRecord::Relation`. The end result is that you can have cache keys for any relation that will expire when any of the records matching the relation query changes.
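As a rough sketch of the key shape discussed in this thread, the three ingredients (query digest, size, latest timestamp) can be combined like this. The helper name and exact layout are assumptions for illustration; the format Rails settled on may differ.

```ruby
require 'digest'
require 'time'

# Hypothetical helper combining the three ingredients discussed in this
# thread: a digest of the SQL, the collection size, and the latest
# timestamp. Not the actual Rails implementation.
def relation_cache_key(table, sql, size, max_updated_at)
  "#{table}/query-#{Digest::MD5.hexdigest(sql)}-#{size}-#{max_updated_at.utc.iso8601}"
end

key = relation_cache_key(
  'products',
  %q(SELECT "products".* FROM "products" WHERE (name like '%Game%')),
  5,
  Time.utc(2015, 8, 8, 4, 14, 42)
)
puts key  # e.g. products/query-<32-char md5>-5-2015-08-08T04:14:42Z
```

A key in this shape stays human-readable (table name, size, and timestamp are visible) while the unbounded part, the SQL, is compressed into a fixed-length digest.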