Support distributed preloading calls #45724

Draft · wants to merge 2 commits into main
Conversation

jonathanhefner (Member)
This PR is split into two commits.

The 1st commit allows records from already-loaded relations to be used when performing post-hoc preloads of downstream associations.

The 2nd commit modifies the Preloader to track preloaded associations in buckets, and, when preloading a downstream association, to preload that association for all owner records from the same bucket.

The goal of this PR is to allow individual `includes` and `preload` calls to be placed alongside the code that depends on them. This makes it easier to add a preload when needed or remove a preload when it is no longer needed.

Consider the following code:

```ruby
def render_posts(posts)
  posts.includes(:comments).each do |post|
    post.comments.each do |comment|
      # ...
    end
  end
end

def render_authors(authors)
  authors.includes(:posts).each do |author|
    render_posts(author.posts)
  end
end

render_authors(Author.all)
```

Before the 1st commit, with N authors each having one post, the code would execute 2N + 2 queries. For example, with 5 authors:

```
SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 1]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 11]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 2]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 12]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 3]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 13]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 4]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 14]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 15]]
```

After the 1st commit, the code executes N + 2 queries:

```
SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 11]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 12]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 13]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 14]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 15]]
```

After the 2nd commit, the code executes 3 queries:

```
SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 11], ["post_id", 12], ["post_id", 13], ["post_id", 14], ["post_id", 15]]
```

Related: #45161, #45413, #45231. The primary difference in this PR is that users must still opt in to preloading via `includes` or `preload`. The intent is to allow fine-grained control over when preloading happens, but to make cascading automatic.

Prior to this commit, calling `includes` or `preload` on an
already-loaded relation or collection proxy would return a new relation
that ignored the loaded records.  This commit passes the loaded records
along to the new relation so that it can use them when preloading an
association.

Consider the following code:

  ```ruby
  posts.each do |post|
    # ...
  end

  # ...some intervening code...

  posts.includes(:comments).each do |post|
    post.comments.each do |comment|
      # ...
    end
  end
  ```

Before this commit, the code would execute 3 queries:

  ```
  SELECT "posts".* FROM "posts"
  SELECT "posts".* FROM "posts"
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 1], ["post_id", 2], ["post_id", 3], ["post_id", 4], ["post_id", 5]]
  ```

After this commit, the code executes only 2 queries:

  ```
  SELECT "posts".* FROM "posts"
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 1], ["post_id", 2], ["post_id", 3], ["post_id", 4], ["post_id", 5]]
  ```
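The mechanism described above can be sketched in plain Ruby. This is an illustration only: `FakeRelation` and its methods are hypothetical names, not the actual `ActiveRecord::Relation` internals. The key idea is that a relation spawned via `includes` inherits the already-loaded records instead of discarding them.

```ruby
# Hypothetical sketch of passing loaded records along to a spawned relation.
class FakeRelation
  attr_reader :records, :preload_values

  def initialize(records, loaded: false)
    @records = records          # records already fetched, if any
    @loaded = loaded
    @preload_values = []        # associations queued for preloading
  end

  def loaded?
    @loaded
  end

  # Spawning a new relation keeps the loaded records, so a later preload
  # can operate on them without re-querying the base table.
  def includes(*associations)
    spawned = FakeRelation.new(@records, loaded: @loaded)
    spawned.preload_values.concat(associations)
    spawned
  end
end

posts = FakeRelation.new([:post_a, :post_b], loaded: true)
with_comments = posts.includes(:comments)
with_comments.records        # => [:post_a, :post_b] — no second posts query
```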

This commit modifies the `Preloader` to track preloaded associations in
buckets, and, when preloading a downstream association, to preload that
association for all owner records from the same bucket.

The goal of this change is to allow individual `includes` and `preload`
calls to be placed alongside the code that depends on them.  This makes
it easier to add a preload when needed or remove a preload when it is no
longer needed.

Consider the following code:

  ```ruby
  def render_posts(posts)
    posts.includes(:comments).each do |post|
      post.comments.each do |comment|
        # ...
      end
    end
  end

  def render_authors(authors)
    authors.includes(:posts).each do |author|
      render_posts(author.posts)
    end
  end

  render_authors(Author.all)
  ```

Before this commit, with N authors each having one post, the code would
execute N + 2 queries.  For example, with 5 authors:

  ```
  SELECT "authors".* FROM "authors"
  SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 11]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 12]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 13]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 14]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 15]]
  ```

After this commit, the code executes only 3 queries:

  ```
  SELECT "authors".* FROM "authors"
  SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 11], ["post_id", 12], ["post_id", 13], ["post_id", 14], ["post_id", 15]]
  ```
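The bucket mechanism can be sketched in plain Ruby. This is illustrative only: `BucketedPreloader` and its method names are hypothetical, not the actual `Preloader` implementation. Records that were preloaded together share a bucket, so a later preload triggered by any one record can run for every record in that bucket with a single query.

```ruby
# Hypothetical sketch of bucket tracking for cascading preloads.
class BucketedPreloader
  def initialize
    @bucket_for = {}  # record => array of records loaded together
  end

  # Called after a batch of records is preloaded together.
  def track(records)
    records.each { |record| @bucket_for[record] = records }
  end

  # Given one owner record, return all owners whose downstream association
  # should be preloaded in the same query (falling back to just that record).
  def owners_to_preload(record)
    @bucket_for.fetch(record, [record])
  end
end

preloader = BucketedPreloader.new
preloader.track([:post_1, :post_2, :post_3])
preloader.owners_to_preload(:post_2)  # => [:post_1, :post_2, :post_3]
```

In Active Record terms, preloading `comments` for `:post_2` would then issue one `WHERE "comments"."post_id" IN (...)` query covering all three posts.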
@ghiculescu (Member) commented Aug 1, 2022

Very cool idea. I think the final SQL output is similar to what you'd get with goldiloader, without any includes needed. I know goldiloader isn't part of Rails (and it's great to see this in Rails!), but it would be good to confirm whether these changes are compatible with it.

@jonathanhefner (Member, Author)

@ghiculescu Thanks! 😃 If my understanding is correct, I think #45413 is closer to goldiloader.

In this PR, I wanted to take a more conservative approach, to try to avoid situations where deeply nested code must defensively disable the feature for fear of loading too much data. Granted, that can still happen with this PR, but my hope is that restricting the feature to explicit (distributed) chains of includes / preload calls will greatly reduce the risk.

@gwincr11 (Contributor)

Yeah, glad to see another approach to this. I took a shot at this style of API before I submitted my proposed approach, and I'd like to share the feedback I received.

Many people thought the ergonomics of having to opt in to the include were not great, since preloading is more often the preferred mode, and they would rather have an opt-out.

The tricky thing, I think, is balancing the shortest code to write against the mistakes we care about most. Large data loads could be a problem, but which problem is our API optimized to fix? Should it load more eagerly to prevent N+1 queries, or should it optimize for large data flows and make fixing N+1 queries an opt-in?

@gwincr11 (Contributor)

I do wonder if perhaps a new API is needed that does not yet exist. The concern we are working with here is fairly well understood: when you may be loading a lot of records directly from the database, we use the batch API. What if associations allowed a batch-like method to be called?

Something like:

`post.comments_in_batches.each`

This would let the user limit the number of records brought back from the database in a familiar way, while still allowing dynamic includes to be the default. It also provides a familiar API for working with large data sets.
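The proposed `comments_in_batches` is hypothetical and does not exist in Rails; the closest existing API is `in_batches` on relations. A plain-Ruby simulation of the suggested ergonomics, using `each_slice` in place of batched database queries, might look like:

```ruby
# Plain-Ruby simulation of the proposed (hypothetical) comments_in_batches.
# In real Active Record this would delegate to in_batches on the
# association's relation; here each_slice stands in for batched queries.
class Post
  attr_reader :comments

  def initialize(comments)
    @comments = comments
  end

  def comments_in_batches(of: 1000)
    return enum_for(:comments_in_batches, of: of) unless block_given?
    comments.each_slice(of) { |batch| yield batch }
  end
end

post = Post.new((1..5).to_a)
post.comments_in_batches(of: 2).to_a  # => [[1, 2], [3, 4], [5]]
```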
