Add #cache_key to ActiveRecord::Relation. #20884

Merged
merged 1 commit into from Aug 2, 2015

@afcapel
Contributor

afcapel commented Jul 14, 2015

The Rails caching guide suggests creating a cache key for a collection of objects from the collection size and the timestamp of the most recently updated record. We could automate the process by baking this logic into ActiveRecord::Relation.

    @users = User.where("name like ?", "%Alberto%")
    @users.cache_key
    => "/users/query-5942b155a43b139f2471b872ac54251f-3-20150714212107656125000"

The end result is that any relation can have a cache key that expires when any of the records matching the relation's query changes.
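The key layout described above can be sketched in plain Ruby without a database. This is only an illustration under stated assumptions: `Record`, `collection_key`, and the hard-coded SQL string are hypothetical, and the timestamp format is assumed from the 23-digit suffix of the sample key.

```ruby
require "digest"

# Hypothetical stand-in for a database row.
Record = Struct.new(:id, :updated_at)

# Builds "<prefix>/query-<digest>-<count>-<newest timestamp>" from a SQL
# string and the records it matched. The timestamp part is omitted when no
# record carries one.
def collection_key(prefix, sql, records, timestamp_column = :updated_at)
  signature = Digest::MD5.hexdigest(sql)
  key = "#{prefix}/query-#{signature}-#{records.size}"
  newest = records.map { |record| record.public_send(timestamp_column) }.compact.max
  # Assumed format: 14 date/time digits followed by 9 nanosecond digits.
  newest ? "#{key}-#{newest.utc.strftime('%Y%m%d%H%M%S%9N')}" : key
end

sql = %q{SELECT "users".* FROM "users" WHERE (name like '%Alberto%')}
users = [
  Record.new(1, Time.utc(2015, 7, 14, 21, 0, 0)),
  Record.new(2, Time.utc(2015, 7, 14, 21, 21, 7))
]

key_before = collection_key("users", sql, users)
users[0].updated_at = Time.utc(2015, 7, 15, 9, 0, 0) # simulate an update
key_after = collection_key("users", sql, users)
```

Updating any matching record moves the newest timestamp forward, so `key_after` differs from `key_before` and the cached fragment would be regenerated.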

Outdated diff: activerecord/lib/active_record/relation.rb
#
# Product.where("name like ?", "%Game%").cache_key(:last_reviewed_at)
def cache_key(timestamp_column = :updated_at)
query_signature = Digest::MD5.hexdigest(to_sql)

@egilburg

egilburg Jul 14, 2015

Contributor
  1. Perhaps use SHA over MD5?
  2. The logic select count(*), max(updated_at) could be written manually to only need one sql statement.
  3. What if table has no updated_at column?
  4. Do we need to expose internal logic of sql+count+updated? If not, we could use a single hash function (Digest::SHA1("#{to_sql}-#{count}-#{timestamp}")) and have a shorter string to return.

@afcapel

afcapel Jul 15, 2015

Contributor

@egilburg thanks for your feedback, here are my thoughts:

  1. Although MD5 has known weaknesses for cryptography, I think it is fine for fingerprinting. It's somewhat less CPU intensive and it's what the Asset Pipeline uses for fingerprints.
  2. That's a good idea, although I'll have to check whether I can create the query in a DB-independent manner.
  3. It will fail hard with a missing column exception. I think that's the right behaviour.
  4. The string length is not an issue, AFAICT. The average key length is ~75 chars (depending on the model name), while memcached keys can be up to 250 bytes. I'd prefer to optimise for the readability of the key.

@egilburg

egilburg Jul 17, 2015

Contributor

If used in a client-side query or HTML source code, this will leak the total record count of the whole DB table, which may be a privacy issue in multi-tenant systems.

@sgrif

sgrif Jul 18, 2015

Member

Presumably it'd only leak the count of the query you're displaying, which would be scoped to the tenant.

Outdated diff: activerecord/lib/active_record/relation.rb
# last updated record.
#
# Product.where("name like ?", "%Game%").cache_key(:last_reviewed_at)
def cache_key(timestamp_column = :updated_at)

@egilburg

egilburg Jul 14, 2015

Contributor

The name is a bit misleading: it may be the reason you want to use this data, but it is not what the data actually is (e.g. it doesn't refer to any actual cache stored by Rails). Not really sure what would be a better name, though...

@afcapel

afcapel Jul 15, 2015

Contributor

I'm following the same name as ActiveRecord::Base#cache_key as it is essentially the same idea behind it.

@egilburg

egilburg Jul 14, 2015

Contributor

I also guess it doesn't detect changes on tables pulled in via includes, right? Probably a good idea to document that caveat.

@afcapel

afcapel Jul 15, 2015

Contributor

@egilburg again, I'm trying to follow the same patterns as ActiveRecord::Base#cache_key, which won't change automatically when a dependent record changes unless you make that record touch the parent object. You need to understand the principles of Russian doll caching, but I think that explanation belongs in the Rails caching guide, not in this particular method's documentation.

Outdated diff: activerecord/lib/active_record/relation.rb
#
# Under the hood this triggers two SQL queries:
#
# SELECT COUNT(*) FROM "products" WHERE (name like '%Cosmic Encounter%')

@sgrif

sgrif Jul 17, 2015

Member

Why is this query needed? Is there any case in which the count can change, without the updated_at of the newest record changing? (Aside from deliberately bypassing this?)

@afcapel

afcapel Jul 18, 2015

Contributor

@sgrif if you delete a record other than the last one, the collection will change but the newest updated_at will be the same.
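A tiny plain-Ruby illustration of that point, using hypothetical in-memory rows rather than a real table: deleting an older record leaves the newest timestamp untouched, so only the count reveals the change.

```ruby
# Hypothetical stand-in for a database row.
Row = Struct.new(:id, :updated_at)

rows = [
  Row.new(1, Time.utc(2015, 7, 1)),
  Row.new(2, Time.utc(2015, 7, 10)),
  Row.new(3, Time.utc(2015, 7, 18)) # newest record
]

# Returns the two data-dependent parts of the key: [count, newest timestamp].
def fingerprint(rows)
  [rows.size, rows.map(&:updated_at).max]
end

before = fingerprint(rows)
rows.delete_if { |row| row.id == 1 } # delete a record other than the newest
after = fingerprint(rows)

max_unchanged = before[1] == after[1]
count_changed = before[0] != after[0]
```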

@sgrif

sgrif Jul 18, 2015

Member

Ah, right. We should still condense this down to a single query though (and change COUNT(*) to COUNT(updated_at) so we only need to access a single column).

@afcapel

afcapel Jul 18, 2015

Contributor

@sgrif That makes sense, I'll prepare the change.

@matthewd

matthewd Jul 18, 2015

Member

COUNT(updated_at) actually needs to read all the values for that column... COUNT(*) only counts the rows, so it does not.

Outdated diff: activerecord/lib/active_record/relation.rb
query_signature = Digest::MD5.hexdigest(to_sql)
key = "#{klass.model_name.cache_key}/query-#{query_signature}-#{size}"
if timestamp = maximum(timestamp_column)

@sgrif

sgrif Jul 17, 2015

Member

Does this execute a query if the relation is loaded?

@afcapel

afcapel Jul 18, 2015

Contributor

Yes, it does. I considered looping through the collection if it is loaded, but decided not to because it doesn't seem very useful to cache collections that are already loaded. I could add the condition, if people think it is needed.

@sgrif

sgrif Jul 18, 2015

Member

I could see cases where a partial is cached, and re-used in a context where the relation had already been loaded. Since it's trivially easy for us to add a max_by(&timestamp_column), I think we should save the query in that scenario.

@bogdan

bogdan Jul 18, 2015

Contributor

This method requires an index on the updated_at column. That requirement sometimes makes it unusable if the table is large enough. This feature is definitely unusable in my project.

I am not sure Rails should include such features. You may rely on it while your DB is small, but in 1-2 years you would have to make it smarter or choose another caching strategy.

@sgrif

sgrif Jul 18, 2015

Member

That's a fair concern. This method is completely unoverrideable as defined. Perhaps it should delegate to a class method that the users can override as desired.

@sgrif

sgrif Jul 18, 2015

Member

I'd also like to see some more thorough tests for this behavior. Asserting that the hash of the query does change based on the SQL performed (maybe asserting that WHERE 1=1 changes the cache key). I'd also like to see explicit tests for the count portion, with and without a scope to ensure it matches that of the relation.

@afcapel

afcapel Jul 18, 2015

Contributor

@bogdan I don't see how that is different from any AR method querying a column other than id, for instance User.where(email: "..."). You have to add indexes to columns in big tables that you are going to scan, but that concern is left to the developer, as it depends on many factors that Rails cannot foresee.

@sgrif

sgrif Jul 18, 2015

Member

@afcapel Regardless, the user needs to be able to change this behavior if they desire. We shouldn't be locking them into one implementation, the same way as they can override it for a single value. I think the simplest solution is to have the method be ActiveRecord::Base.collection_cache_key(collection, timestamp_column = :updated_at) and have the implementation on Relation be nothing more than

def cache_key(timestamp_column = :updated_at)
  @klass.collection_cache_key(self, timestamp_column)
end

@afcapel

afcapel Jul 18, 2015

Contributor

@sgrif I've updated the PR with your feedback, moving collection_cache_key to AR::Base, adding more tests around the query cache and ensuring only one query is triggered. Please let me know what you think.

@sgrif

sgrif Jul 18, 2015

Member

I have some more comments but I'm traveling the rest of the day, will comment further tonight.

Outdated diff: activerecord/lib/active_record/collection_cache_key.rb
size, timestamp = if collection.loaded?
[collection.size, collection.collect(&timestamp_column).compact.max]
else
column_type = collection.klass.type_for_attribute(timestamp_column.to_s)

@sgrif

sgrif Jul 19, 2015

Member

collection.klass is always self

@afcapel

afcapel Jul 19, 2015

Contributor

Not in cases in which you call ActiveRecord::Base.collection_cache_key(collection) like this one. It's a bit annoying, but since a class method is public I think we need to support that kind of call.

@sgrif

sgrif Jul 19, 2015

Member

I do not think we should support calling this method in any case other than a relation calling this method on self.klass

@sgrif

sgrif Jul 19, 2015

Member

In fact, I would even go so far as to say it's worth mentioning you should not call this method directly in the documentation.

Outdated diff: activerecord/lib/active_record/collection_cache_key.rb
column_type = collection.klass.type_for_attribute(timestamp_column.to_s)
column = "#{connection.quote_table_name(collection.table_name)}.#{connection.quote_column_name(timestamp_column)}"
result = collection.select("COUNT(*) AS size", "MAX(#{column}) AS timestamp").to_a.first

@sgrif

sgrif Jul 19, 2015

Member

This builds intermediate AR objects, which is unnecessary here. We should be passing the AST directly to the connection adapter, where we have a specific method to call for a query which returns a single row.

Outdated diff: activerecord/lib/active_record/collection_cache_key.rb
query_signature = Digest::MD5.hexdigest(collection.to_sql)
key = "#{collection.model_name.cache_key}/query-#{query_signature}"
size, timestamp = if collection.loaded?

@sgrif

sgrif Jul 19, 2015

Member

I'd rather skip the intermediate array object, and just do:

if collection.loaded?
  size = ...
  timestamp = ...
else
  size = ...
  timestamp = ...
end

test "cache_key for empty relation" do
developers = Developer.where(name: "Non Existent Developer")
assert_match(/\Adevelopers\/query-(\h+)-0\Z/, developers.cache_key)
end

@sgrif

sgrif Jul 19, 2015

Member

This is great, but can we add a test case that shows that the count portion is a specific expected value for a cache key, both with and without a where clause?

@afcapel

afcapel Jul 20, 2015

Contributor

I've added these, to check each part of the cache key. Is that ok?

if timestamp
"#{key}-#{size}-#{timestamp.utc.to_s(cache_timestamp_format)}"
else
"#{key}-#{size}"

@sgrif

sgrif Jul 19, 2015

Member

Is there a case where we would have no value for timestamp, but size is some value other than 0?

@afcapel

afcapel Jul 19, 2015

Contributor

The timestamp column could have NULL values. In that case, it probably wouldn't be a good idea to create a cache key like this, but at least the method behaves somewhat better in that situation.

Outdated diff: activerecord/lib/active_record/collection_cache_key.rb
column = "#{connection.quote_table_name(collection.table_name)}.#{connection.quote_column_name(timestamp_column)}"
result = collection.select("COUNT(*) AS size", "MAX(#{column}) AS timestamp").to_a.first
[result.size, column_type.deserialize(result.timestamp)]

@sgrif

sgrif Jul 19, 2015

Member

I really don't like that we're reaching into the internals of the type system here, but I'm not sure I have a better idea at the moment. Just wanted to point out that this makes me uncomfortable.

@afcapel

afcapel Jul 19, 2015

Contributor

I see what you mean. Once I tried to make only one query, I found myself having to handle low-level details, like column deserialization. It feels like it should be easier, although that's the best I could come up with. Happy to accept suggestions to improve it, though.

@sgrif

sgrif Jul 19, 2015

Member

Long term my goal is to have the connection adapter handle all primitives (e.g. all types from the DB that have a canonical Ruby representation, so that includes timestamp but excludes point). But that's a long way off. If I think of something I'll let you know.

Outdated diff: activerecord/lib/active_record/collection_cache_key.rb
key = "#{collection.model_name.cache_key}/query-#{query_signature}"
size, timestamp = if collection.loaded?
[collection.size, collection.collect(&timestamp_column).compact.max]

@sgrif

sgrif Jul 19, 2015

Member

Do you think this would be better written as: collection.max_by(&timestamp_column).public_send(timestamp_column)

assert_match /\Adevelopers\/query-(\h+)-(\d+)-(\d+)\Z/, developers.cache_key
/\Adevelopers\/query-(\h+)-(\d+)-(\d+)\Z/ =~ developers.cache_key

@afcapel

afcapel Jul 20, 2015

Contributor

assert_match doesn't set the global match variables.
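A small sketch of why the explicit `=~` line is needed, assuming the assertion runs its match inside its own method. The helper below is a hypothetical stand-in, not Minitest's implementation. In Ruby the match variables (`$~`, `$1`, ...) are local to the method frame where the match ran, so the caller has to match again to read the captures.

```ruby
# Hypothetical stand-in for an assertion that matches internally.
def helper_match(pattern, string)
  !pattern.match(string).nil?
end

def capture_demo
  key = "developers/query-5942b155a43b139f2471b872ac54251f-3-20150714212107656125000"
  pattern = /\Adevelopers\/query-(\h+)-(\d+)-(\d+)\z/

  helper_match(pattern, key)
  seen_after_helper = $~ # the helper's match stayed in the helper's frame

  pattern =~ key         # match again in this frame to populate $1..$3
  [seen_after_helper, $1, $2, $3]
end

seen_after_helper, digest, count, timestamp = capture_demo
```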

@sgrif

sgrif Jul 20, 2015

Member

@afcapel Did you exclude the suggestion about changing collection.klass to implicit self + documenting "this method should not be called directly" because you feel strongly against that suggestion? It's completely fine if that's the case, I would just like to confirm that you feel strongly that is not the right direction to go, and not an oversight.

In either case, can you either implement that suggestion, or confirm you feel strongly it is incorrect, and then squash & rebase?

# You can also pass a custom timestamp column to fetch the timestamp of the
# last updated record.
#
# Product.where("name like ?", "%Game%").cache_key(:last_reviewed_at)

@sgrif

sgrif Jul 20, 2015

Member

We need to document that the user can override this behavior by implementing self.collection_cache_key on the class
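The override hook can be sketched without Rails. Everything below is a simplified, hypothetical stand-in (`FakeBase`, `FakeRelation`, `Article`, `AuditLog`, and the key formats are made up); only the shape of the delegation from the relation to a class method follows the thread.

```ruby
require "digest"

# Hypothetical stand-in for ActiveRecord::Base with a default strategy.
class FakeBase
  def self.collection_cache_key(collection, timestamp_column = :updated_at)
    newest = collection.records.map { |row| row[timestamp_column] }.compact.max
    signature = Digest::MD5.hexdigest(collection.to_sql)
    "#{name.downcase}/query-#{signature}-#{collection.records.size}-#{newest}"
  end
end

# Hypothetical stand-in for ActiveRecord::Relation.
class FakeRelation
  attr_reader :klass, :to_sql, :records

  def initialize(klass, to_sql, records)
    @klass, @to_sql, @records = klass, to_sql, records
  end

  # The relation only delegates, so each model class can swap the strategy.
  def cache_key(timestamp_column = :updated_at)
    klass.collection_cache_key(self, timestamp_column)
  end
end

class Article < FakeBase; end

# A model that replaces the default strategy with its own (made-up) scheme.
class AuditLog < FakeBase
  def self.collection_cache_key(collection, _timestamp_column = :updated_at)
    "audit_logs/v#{collection.records.size}"
  end
end

rows = [{ updated_at: 20150720 }, { updated_at: 20150718 }]
default_key = FakeRelation.new(Article, "SELECT * FROM articles", rows).cache_key
custom_key  = FakeRelation.new(AuditLog, "SELECT * FROM audit_logs", rows).cache_key
```

Because the relation never computes the key itself, a model with an expensive or missing timestamp column can choose a different strategy without touching the relation code.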

@afcapel

afcapel Jul 20, 2015

Contributor

@sgrif oh, no I just missed the comments, but I think it's a good idea to make that method as "private" as possible. I'll make the change, squash and rebase.

Outdated diff: activerecord/lib/active_record/collection_cache_key.rb
# Generates a cache key for the records in the given collection.
# See <tt>ActiveRecord::Relation#cache_key</tt> for details.
def collection_cache_key(collection = all, timestamp_column = :updated_at)

@sgrif

sgrif Jul 20, 2015

Member

I would actually prefer to mark this method as # :nodoc: since it only serves as a hook for Relation#cache_key

@sgrif

sgrif Jul 20, 2015

Member

❤️ I just left 2 more comments about docs, besides that, this is good to merge. Ping me once updated. (The wife wants to head to the movies so I might not get to it tonight)

@afcapel

afcapel Jul 20, 2015

Contributor

@sgrif nice, the PR is already updated. Thanks for your help!

#
# You can customize the strategy to generate the key on a per model basis
# overriding ActiveRecord::Base#collection_cache_key.
def cache_key(timestamp_column = :updated_at)

@egilburg

egilburg Jul 20, 2015

Contributor

for clarity perhaps rename timestamp_column to cache_key_column. For example tables without timestamp but which have autosequence id could use the id column, by defining:

def cache_key
  super(:id)
end

On the other hand, this would introduce a discrepancy with the definition of collection_cache_key. So another alternative would be to extract a class method cache_key_column, which returns :updated_at by default and which the user could override.

@afcapel

afcapel Jul 31, 2015

Contributor

Sorry, I'm not sure if cache_key_column is a better name. To me def cache_key(cache_key_column) sounds like we will fetch the key from a column, which is not exactly right.

@afcapel

afcapel Jul 31, 2015

Contributor

@sgrif any updates on this? Please let me know if you think it needs any more changes.

@sgrif sgrif merged commit 476e3f5 into rails:master Aug 2, 2015

sgrif added a commit that referenced this pull request Aug 2, 2015

Merge pull request #20884
Add #cache_key to ActiveRecord::Relation.

sgrif added a commit that referenced this pull request Aug 2, 2015

Fix test failures caused by #20884
PostgreSQL is strict about the usage of `DISTINCT` and `ORDER BY`, which
one of the tests demonstrated. The order clause is never going to be
relevant in the query we're performing, so let's just remove it
entirely.

@jonatack

jonatack Aug 2, 2015

Contributor

First off, great idea behind this PR!

Question: calling #cache_key on an empty AR relation, I'm seeing a NoMethodError: undefined method updated_at for nil:NilClass from activerecord/lib/active_record/collection_cache_key.rb:10:in public_send.

Is this expected behavior?

@jonatack

jonatack Aug 2, 2015

Contributor

For example, when I reproduce the cache_key for empty relation test below with the query developers = Developer.where(name: "Non Existent Developer"), I'm getting the error.
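The reported error is consistent with the `max_by(...).public_send(...)` form suggested earlier in the review, although the merged code is not shown here, so treat that link as an assumption. A plain-Ruby sketch with hypothetical helper names: `max_by` returns nil for an empty collection, and calling the timestamp method on nil raises NoMethodError.

```ruby
# Hypothetical helpers; neither is taken from the Rails source.
def newest_unsafe(collection, column)
  collection.max_by(&column).public_send(column) # nil.public_send on empty input
end

def newest_safe(collection, column)
  collection.map(&column).compact.max # nil on empty input, no exception
end

Developer = Struct.new(:name, :updated_at)
empty = []
loaded = [Developer.new("David", Time.utc(2015, 8, 1)),
          Developer.new("Jamis", Time.utc(2015, 8, 2))]

error = begin
  newest_unsafe(empty, :updated_at)
  nil
rescue NoMethodError => e
  e
end

safe_on_empty  = newest_safe(empty, :updated_at)
safe_on_loaded = newest_safe(loaded, :updated_at)
```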

@swalkinshaw

swalkinshaw Aug 8, 2015

Contributor

@afcapel before I create a new issue I just want to bring up a potential problem here.

Right now the cache key is a combination of 3 factors: a) SQL query digest, b) record count, c) max timestamp. This covers most situations, but there are a few where it doesn't. The easiest case to explain is any query involving limit.

Pseudo query: Post.order(:title).limit(5). Let's say the 5 records returned are:

[
  { title: 'A', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'B', updated_at: 2015-08-05 04:14:42 +0000 },
  { title: 'C', updated_at: 2015-08-08 04:14:42 +0000 },
  { title: 'D', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'E', updated_at: 2015-08-05 04:14:42 +0000 }
]

Now let's delete the post with title 'B' and re-run the query Post.order(:title).limit(5):

[
  { title: 'A', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'C', updated_at: 2015-08-08 04:14:42 +0000 },
  { title: 'D', updated_at: 2015-08-04 04:14:42 +0000 },
  { title: 'E', updated_at: 2015-08-05 04:14:42 +0000 },
  { title: 'F', updated_at: 2015-08-05 04:14:42 +0000 }
]

Notice that the 3 factors of the cache key all remain the same.

a) same query digest
b) same count of 5
c) same max updated_at (the newest one is still there)
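
The collision described above can be reproduced with a small plain-Ruby sketch. `Post`, `three_factor_key`, `day`, and the SQL string are assumptions made for illustration and do not come from the PR; the key only mimics the three factors (query digest, count, max timestamp).

```ruby
require "digest"

# Hypothetical stand-in for a post row.
Post = Struct.new(:title, :updated_at)

# Key built from the three factors: query digest, count, newest timestamp.
def three_factor_key(sql, posts)
  newest = posts.map(&:updated_at).max
  "#{Digest::MD5.hexdigest(sql)}-#{posts.size}-#{newest.strftime('%Y%m%d%H%M%S')}"
end

# Helper for the timestamps used in the example above (August 2015).
def day(number)
  Time.utc(2015, 8, number, 4, 14, 42)
end

SQL = "SELECT posts.* FROM posts ORDER BY title LIMIT 5" # assumed query text

BEFORE_DELETE = [
  Post.new("A", day(4)), Post.new("B", day(5)), Post.new("C", day(8)),
  Post.new("D", day(4)), Post.new("E", day(5))
]

# Post "B" is deleted, so "F" moves into the limited window.
AFTER_DELETE = BEFORE_DELETE.reject { |post| post.title == "B" } +
               [Post.new("F", day(5))]

# The result sets differ, yet the keys are identical.
puts three_factor_key(SQL, BEFORE_DELETE) == three_factor_key(SQL, AFTER_DELETE) # prints true
```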

We ran into this issue while implementing relation cache keys in our app, and we found the only reliable solution was to add a 4th factor: a representation of all the relation's id attributes.

In Postgres we accomplished it using md5(array_agg(id || '-' || updated_at)::text) for non-loaded queries. And for already loaded ones:

require 'digest'
results = collection.map { |item| [item.id, item.updated_at].join('-') }
Digest::SHA1.hexdigest(results.join)

You'll note that we also stopped taking only the max updated_at and instead used every record's timestamp, just as we did with id.
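
As a rough illustration of the approach for already loaded records, the sketch below digests every id and timestamp and shows that a change in membership changes the result. `PostRow` and `member_digest` are hypothetical names, and the comma separator and nanosecond timestamp format are choices made for this sketch rather than part of the snippet above.

```ruby
require "digest"

# Hypothetical stand-in for a loaded record.
PostRow = Struct.new(:id, :updated_at)

# Digest over every member's id and timestamp, so a change in membership
# alters the result even when count and newest timestamp stay the same.
def member_digest(rows)
  parts = rows.map do |row|
    "#{row.id}-#{row.updated_at.getutc.strftime('%Y%m%d%H%M%S%N')}"
  end
  # The explicit separator keeps adjacent entries from running together.
  Digest::SHA1.hexdigest(parts.join(","))
end

stamp = Time.utc(2015, 8, 5, 4, 14, 42)
with_b = [PostRow.new(1, stamp), PostRow.new(2, stamp)]
with_f = [PostRow.new(1, stamp), PostRow.new(6, stamp)]

# Same count and same timestamps, but different ids.
puts member_digest(with_b) == member_digest(with_f) # prints false
```

The trade-off is that every row's id and timestamp must be read to compute the key, whereas count and max need only a single aggregate query.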

@manuelmeurer

manuelmeurer Aug 19, 2015

Contributor

I agree with @swalkinshaw; I ran into the same situation, where max updated_at, collection size, and SQL query digest are not "sufficiently unique".

@christos

christos Sep 4, 2015

Contributor

I've had the same issues as @swalkinshaw and @manuelmeurer using this approach. I've created PR #21503 with a more complete solution, similar to @swalkinshaw's. See the PR description for more details.

@sgrif, assuming we are correct, if this PR goes into Rails 5 it could cause a lot of cache-debugging headaches for people. Can you either review #21503 or revert this PR?

@metaskills metaskills referenced this pull request Mar 21, 2016

Merged

Gem version 0.1.0 #1
