Use collection ids and timestamps to generate truly unique collection cache keys #21503


Closed
wants to merge 4 commits into from

Conversation

christos
Contributor

@christos christos commented Sep 4, 2015

Pull request #20884 introduced collection cache keys to help cache partials for entire collections, provided the collection members update their timestamp columns when they change.

However, as highlighted here and here, the solution does not create a unique cache key identifying the collection contents in the following two scenarios.

1 - Collection count SQL is not compatible with limit()

The collection size SQL generated by the original PR's code for Developer.where(name: "David").limit(1) is incorrect:

SELECT COUNT(*) AS size, MAX("developers"."updated_at") AS timestamp 
  FROM "developers"
  WHERE "developers"."name" = 'David' 
  LIMIT 1

The size returned by the above query is not what the LIMIT specifies, but the count of all developers whose name is David: the LIMIT applies to the rows returned by the aggregate query (a single row), not to the rows being counted. This is expected SQL behaviour, by the way.
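Not part of this PR, but for illustration: one conventional way to make the aggregates respect the LIMIT is to aggregate over a subquery, so the LIMIT is applied before COUNT(*) and MAX(...) run. A sketch only (the `limited` alias is hypothetical):

```sql
-- Sketch only: apply the LIMIT inside a subquery, then aggregate over it,
-- so size reflects the limited result set rather than all matching rows.
SELECT COUNT(*) AS size, MAX(limited."updated_at") AS timestamp
  FROM (
    SELECT "developers"."updated_at"
      FROM "developers"
     WHERE "developers"."name" = 'David'
     LIMIT 1
  ) AS limited
```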

2 - Replacing an old record in the collection does not invalidate the cache key

This has already been explained by @swalkinshaw here and the following test fails against master.

test "collection_cache_key changes when old collection members are replaced" do
  project = Project.create
  project.developers.create(updated_at: 2.hours.ago, name: "anonymous")
  project.developers.create(updated_at: 4.hours.ago, name: "eponymous")

  key1 = project.developers.collection_cache_key

  project.developers.where(name: "eponymous").destroy_all
  project.developers.create(updated_at: 5.hours.ago, name: "anonymous")

  key2 = project.developers.collection_cache_key

  assert_not_equal key2, key1
end

The solution proposed by @swalkinshaw here is what I have been using successfully in production, but in the form of an application helper called cache_collection. The helper mirrors the cache view helper, accepting a collection instead of a record.

This PR implements the same solution as my cache_collection helper, but using the collection_cache_key approach from #20884.

I've made a few modifications.

I've removed the SQL query signature digest, because the query contents (i.e. ids and timestamps) can uniquely identify a collection with records whose timestamp columns are properly updated when the records change.

Instead of plucking just the collection ids and the maximum timestamp, I chose to pluck all timestamps, taking advantage of the fact that the pluck method fires just one query for both ids and timestamps.

I've also switched to a SHA256 digest to minimise the probability of a collision. My understanding is that, given the large size of the digested content, a collision is very improbable.
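For illustration, a minimal plain-Ruby sketch of how such a digested key could be built (no Rails here: the `developers` prefix and the sample rows stand in for the model name and the result of `pluck(primary_key, timestamp_column)`):

```ruby
require "digest"

# Stand-in for collection.pluck(primary_key, timestamp_column):
# each entry is [id, updated_at]; the sample rows are made up.
plucked = [
  [1, "2015-09-04 10:00:00"],
  [2, "2015-09-04 11:30:00"],
]

# Flatten ids and timestamps into one signature string and digest it.
# Any change to membership, order, or a timestamp changes the digest.
unique_signature = plucked.flatten.join("-")
cache_key = "developers/collection-digest-#{Digest::SHA256.hexdigest(unique_signature)}"

puts cache_key
```

The digest keeps the key a fixed length no matter how large the collection grows, while remaining sensitive to every id and timestamp in it.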

Personally, I prefer the view helper approach, as it makes it easier to see that asking for a collection cache key will incur a database request.

Having said that, and seeing as #20884 also incurs a database query, I think this more complete solution will save people some headaches when they encounter the above scenarios where the original PR would fail to provide a correct cache key.

If the Rails core team is happy with my implementation in this PR, then let me know and I will update the method documentation as well.

@rails-bot

r? @carlosantoniodasilva

(@rails-bot has picked a reviewer for you, use r? to override)

@swalkinshaw
Contributor

@christos this looks like a good solution. Much simpler as well 👍


size = result["size"]
timestamp = column_type.deserialize(result["timestamp"])
unique_signature = collection.unscope(:order).pluck(primary_key, timestamp_column).flatten.join("-")
Contributor


Just a thought: Perhaps freeze the two "-" strings while you're at it, since master targets Ruby 2.2+? For what it's worth, the same string is frozen throughout ActiveSupport::Inflector.

Contributor Author


@jonatack Done.

@christos
Contributor Author

Is there any interest in this being merged?

r? @sgrif this is related to a feature you merged back in July.

@manuelmeurer
Contributor

👍 Let's get this merged!

@sgrif
Contributor

sgrif commented Oct 23, 2015

I'm at a conference but I have this pinned and will look at it soon.

On Fri, Oct 23, 2015, 1:55 PM Manuel Meurer notifications@github.com wrote:

👍 Let's get this merged!


Reply to this email directly or view it on GitHub
#21503 (comment).

@christos
Contributor Author

@sgrif Ping?


size = result["size"]
timestamp = column_type.deserialize(result["timestamp"])
unique_signature = collection.unscope(:order).pluck(primary_key, timestamp_column).flatten.join("-".freeze)
Contributor


What do you mean by PostgreSQL order failures? If the scope has an order then the cache key query has to preserve it, otherwise it won't match the same set of records.

[3] pry(main)> Post
=> Post(id: integer, published_at: datetime, created_at: datetime, updated_at: datetime)
[4] pry(main)> 20.times { |i| Post.create!(published_at: Time.current - 1.hour * i) }
[5] pry(main)> Post.count
   (0.1ms)  SELECT COUNT(*) FROM "posts"
=> 20
[6] pry(main)> Post.order(published_at: :desc).limit(10).pluck(:id, :updated_at)
   (0.3ms)  SELECT  "posts"."id", "posts"."updated_at" FROM "posts" ORDER BY "posts"."published_at" DESC LIMIT 10
=> [[1, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [2, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [3, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [4, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [5, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [6, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [7, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [8, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [9, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [10, Wed, 25 Nov 2015 21:40:23 UTC +00:00]]
[7] pry(main)> Post.order(published_at: :asc).limit(10).pluck(:id, :updated_at)
   (0.2ms)  SELECT  "posts"."id", "posts"."updated_at" FROM "posts" ORDER BY "posts"."published_at" ASC LIMIT 10
=> [[20, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [19, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [18, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [17, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [16, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [15, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [14, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [13, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [12, Wed, 25 Nov 2015 21:40:23 UTC +00:00],
 [11, Wed, 25 Nov 2015 21:40:23 UTC +00:00]]
[8] pry(main)> Post.order(published_at: :desc).limit(10).cache_key
   (0.2ms)  SELECT  "posts"."id", "posts"."updated_at" FROM "posts" LIMIT 10
=> "posts/collection-digest-7f2daae36b230c64b72b28cafd70ddbd07d0f6369fb46e3155cc90ebcdad5058"
[9] pry(main)> Post.order(published_at: :asc).limit(10).cache_key
   (0.1ms)  SELECT  "posts"."id", "posts"."updated_at" FROM "posts" LIMIT 10
=> "posts/collection-digest-7f2daae36b230c64b72b28cafd70ddbd07d0f6369fb46e3155cc90ebcdad5058"

Contributor Author


@hdabrows Well spotted.

You can see the Postgres failure in this build

The real culprit is pluck which, for an association that goes through a join table, generates a SELECT DISTINCT query whose ORDER BY expressions do not all appear in the select list, as Postgres requires.

ActiveRecord::StatementInvalid: PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: ... "developers_projects"."project_id" = $1 ORDER BY developers...
                                                             ^
: SELECT DISTINCT "developers"."id", "developers"."updated_at" FROM "developers" INNER JOIN "developers_projects" ON "developers"."id" = "developers_projects"."developer_id" WHERE "developers_projects"."project_id" = $1 ORDER BY developers.name desc, developers.id desc

To be honest, I have no idea how to pluck what I need here without falling back to executing the entire query without pluck.

Any ideas?



@christos, I opened a pull request to try to fix this: #28075
I am not sure if this is a good idea.

@kipcole9

Looks like this change would also fix the bug that results in an exception from trying to generate a cache key on an empty unloaded collection:

        query = collection
          .select("COUNT(*) AS size", "MAX(#{column}) AS timestamp")
          .unscope(:order)
        result = connection.select_one(query)
        size = result["size"]    # exception when result is nil
        timestamp = column_type.deserialize(result["timestamp"])  # exception when result is nil
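A hedged plain-Ruby sketch of such a guard (no Rails here; `result` stands in for what `connection.select_one` returns, and `cache_key_parts` is a hypothetical helper, not anything in the PR):

```ruby
# Hypothetical guard: select_one can return nil for an empty, unloaded
# collection, so treat nil as "no rows" instead of raising NoMethodError.
def cache_key_parts(result)
  return [0, nil] if result.nil?
  [result["size"].to_i, result["timestamp"]]
end

p cache_key_parts(nil)                                         # => [0, nil]
p cache_key_parts("size" => "2", "timestamp" => "2015-09-04")  # => [2, "2015-09-04"]
```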


size = result["size"]
timestamp = column_type.deserialize(result["timestamp"])
unique_signature = collection.pluck(primary_key, timestamp_column).flatten.join("-".freeze)
Contributor


@christos now that you've removed the need for unscope(:order), these statements are identical.

@cantonic

This is what I have implemented in a project inspired by this PR:

module CollectionCacheKey
  def collection_cache_key(collection = all, timestamp_column = :updated_at)
    model_signature = collection.model_name.cache_key

    if collection.loaded?
      unique_signature = collection.pluck(primary_key, timestamp_column).to_s
    else
      unique_signature = collection.unscope(:order).pluck(primary_key, timestamp_column).to_s
    end

    "#{model_signature}/collection-digest-#{unique_signature}"
  end
end

Note that I have used to_s instead of flatten.join("-".freeze) for stringifying the unique_signature.

I found out through a quick benchmark that this is the fastest way to generate a unique cache key.

Even if you end up with less elegant cache keys like foo/collection-digest-[1, 2, 3] instead of foo/collection-digest-1-2-3, I think performance should trump style in this case, since caching is about maximising speed.
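A quick sketch of that comparison in plain Ruby (the sample data is made up, and the exact timings will vary by machine and Ruby version):

```ruby
require "benchmark"

# Made-up stand-in for collection.pluck(primary_key, timestamp_column).
pairs = Array.new(1_000) { |i| [i, "2015-09-04 10:00:#{format('%02d', i % 60)}"] }

# Compare the two stringification strategies discussed above.
Benchmark.bm(14) do |x|
  x.report("to_s:")         { 1_000.times { pairs.to_s } }
  x.report("flatten.join:") { 1_000.times { pairs.flatten.join("-") } }
end
```

Both forms are deterministic for the same pluck result, so either yields a stable cache key; the trade-off is purely speed versus key readability.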

@prathamesh-sonpatki prathamesh-sonpatki added this to the 5.0.0 milestone Jan 16, 2016
@sgrif
Contributor

sgrif commented Feb 11, 2016

@christos Are you still interested in working on this? This PR has failing tests which haven't been addressed, and you have a conditional where both branches are identical that needs to be removed.

@sgrif sgrif removed this from the 5.0.0 milestone Feb 23, 2016
@christos
Contributor Author

christos commented Mar 6, 2016

@sgrif I am stuck as per this comment

I see you came across the problem in the initial PR and you fixed it by removing the order scope, using unscope(:order).

You later realised that it could mess up the IDs/timestamps when combining a pluck with unscope, limit, and offset.

I'll have another go at creating failing tests for when unscope(:order) is called, but I am not sure how to resolve the issue without loading the entire collection instead of using pluck.

@bquorning
Contributor

bquorning commented Jan 23, 2017

I am stuck as per this comment

That link leads nowhere now, but I assume you got stuck on the failing Postgres tests mentioned in the “outdated diff” comment #21503 (comment):

ActiveRecord::StatementInvalid: PG::InvalidColumnReference:
  ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list

So, calling pluck(primary_key, timestamp_column) changes the selected columns from * to id, updated_at. But since the query needs ordering by name and id, and the select is distinct, the query must select name as well. It doesn’t, and Postgres complains.

Isn’t Arel to blame here? Why does Arel generate a query that can’t be performed against Postgres?
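Not what this PR does, but for illustration, one way to make that query valid for Postgres is to also select the ordered columns and discard them afterwards; since id is already in the select list, adding name does not change which rows are distinct. A hedged SQL sketch of the failing query above:

```sql
-- Sketch only: include every ORDER BY expression in the select list so the
-- SELECT DISTINCT is valid; the extra "name" column is fetched and ignored.
SELECT DISTINCT "developers"."id", "developers"."updated_at", "developers"."name"
  FROM "developers"
 INNER JOIN "developers_projects"
    ON "developers"."id" = "developers_projects"."developer_id"
 WHERE "developers_projects"."project_id" = $1
 ORDER BY "developers"."name" DESC, "developers"."id" DESC
```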

@bquorning
Contributor

@sgrif Perhaps you know the answer to that question? (from my comment above)

Isn’t Arel to blame here? Why does Arel generate a query that can’t be performed against Postgres?

@feliperaul
Contributor

Do we have any update on this?

@talpava

talpava commented Sep 16, 2019 via email


@rails-bot

rails-bot bot commented Dec 17, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Thank you for your contributions.

@rails-bot rails-bot bot added the stale label Dec 17, 2019
@SleeplessByte
Contributor

I don't think it has been resolved?

@rails-bot rails-bot bot removed the stale label Dec 18, 2019
@rails-bot

rails-bot bot commented Mar 17, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Thank you for your contributions.

@rails-bot rails-bot bot added the stale label Mar 17, 2020
@rails-bot rails-bot bot closed this Mar 24, 2020