Support distributed preloading calls #45724

Draft · wants to merge 2 commits into main
Conversation

jonathanhefner (Member)
This PR is split into two commits.

The 1st commit allows records from already-loaded relations to be used when performing post-hoc preloads of downstream associations.

The 2nd commit modifies the Preloader to track preloaded associations in buckets, and, when preloading a downstream association, to preload that association for all owner records from the same bucket.

The goal of this PR is to allow individual `includes` and `preload` calls to be placed alongside the code that depends on them. This makes it easier to add a preload when needed or remove a preload when it is no longer needed.

Consider the following code:

```ruby
def render_posts(posts)
  posts.includes(:comments).each do |post|
    post.comments.each do |comment|
      # ...
    end
  end
end

def render_authors(authors)
  authors.includes(:posts).each do |author|
    render_posts(author.posts)
  end
end

render_authors(Author.all)
```

Before the 1st commit, with N authors each having one post, the code would execute 2N + 2 queries. For example, with 5 authors:

```
SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 1]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 11]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 2]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 12]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 3]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 13]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 4]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 14]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 15]]
```

After the 1st commit, the code executes N + 2 queries:

```
SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 11]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 12]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 13]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 14]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 15]]
```

After the 2nd commit, the code executes 3 queries:

```
SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 11], ["post_id", 12], ["post_id", 13], ["post_id", 14], ["post_id", 15]]
```

Related: #45161, #45413, #45231. The primary difference in this PR is that users must still opt in to preloading via `includes` or `preload`. The intent is to allow fine-grained control over when preloading happens, but to make cascading automatic.

Prior to this commit, calling `includes` or `preload` on an
already-loaded relation or collection proxy would return a new relation
that ignored the loaded records.  This commit passes the loaded records
along to the new relation so that it can use them when preloading an
association.

Consider the following code:

  ```ruby
  posts.each do |post|
    # ...
  end

  # ...some intervening code...

  posts.includes(:comments).each do |post|
    post.comments.each do |comment|
      # ...
    end
  end
  ```

Before this commit, the code would execute 3 queries:

  ```
  SELECT "posts".* FROM "posts"
  SELECT "posts".* FROM "posts"
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 1], ["post_id", 2], ["post_id", 3], ["post_id", 4], ["post_id", 5]]
  ```

After this commit, the code executes only 2 queries:

  ```
  SELECT "posts".* FROM "posts"
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 1], ["post_id", 2], ["post_id", 3], ["post_id", 4], ["post_id", 5]]
  ```
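The mechanism described above can be sketched in plain Ruby. This is an illustration only: `FakeRelation` and its methods are hypothetical names, not the actual `ActiveRecord::Relation` internals. The key idea is that a relation spawned via `includes` inherits the already-loaded records instead of discarding them.

```ruby
# Hypothetical sketch of passing loaded records along to a spawned relation.
class FakeRelation
  attr_reader :records, :preload_values

  def initialize(records, loaded: false)
    @records = records          # records already fetched, if any
    @loaded = loaded
    @preload_values = []        # associations queued for preloading
  end

  def loaded?
    @loaded
  end

  # Spawning a new relation keeps the loaded records, so a later preload
  # can operate on them without re-querying the base table.
  def includes(*associations)
    spawned = FakeRelation.new(@records, loaded: @loaded)
    spawned.preload_values.concat(associations)
    spawned
  end
end

posts = FakeRelation.new([:post_a, :post_b], loaded: true)
with_comments = posts.includes(:comments)
with_comments.records        # => [:post_a, :post_b] — no second posts query
```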

This commit modifies the `Preloader` to track preloaded associations in
buckets, and, when preloading a downstream association, to preload that
association for all owner records from the same bucket.

The goal of this change is to allow individual `includes` and `preload`
calls to be placed alongside the code that depends on them.  This makes
it easier to add a preload when needed or remove a preload when it is no
longer needed.

Consider the following code:

  ```ruby
  def render_posts(posts)
    posts.includes(:comments).each do |post|
      post.comments.each do |comment|
        # ...
      end
    end
  end

  def render_authors(authors)
    authors.includes(:posts).each do |author|
      render_posts(author.posts)
    end
  end

  render_authors(Author.all)
  ```

Before this commit, with N authors each having one post, the code would
execute N + 2 queries.  For example, with 5 authors:

  ```
  SELECT "authors".* FROM "authors"
  SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 11]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 12]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 13]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 14]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 15]]
  ```

After this commit, the code executes only 3 queries:

  ```
  SELECT "authors".* FROM "authors"
  SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
  SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 11], ["post_id", 12], ["post_id", 13], ["post_id", 14], ["post_id", 15]]
  ```
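The bucket mechanism can be sketched in plain Ruby. This is illustrative only: `BucketedPreloader` and its method names are hypothetical, not the actual `Preloader` implementation. Records that were preloaded together share a bucket, so a later preload triggered by any one record can run for every record in that bucket with a single query.

```ruby
# Hypothetical sketch of bucket tracking for cascading preloads.
class BucketedPreloader
  def initialize
    @bucket_for = {}  # record => array of records loaded together
  end

  # Called after a batch of records is preloaded together.
  def track(records)
    records.each { |record| @bucket_for[record] = records }
  end

  # Given one owner record, return all owners whose downstream association
  # should be preloaded in the same query (falling back to just that record).
  def owners_to_preload(record)
    @bucket_for.fetch(record, [record])
  end
end

preloader = BucketedPreloader.new
preloader.track([:post_1, :post_2, :post_3])
preloader.owners_to_preload(:post_2)  # => [:post_1, :post_2, :post_3]
```

In Active Record terms, preloading `comments` for `:post_2` would then issue one `WHERE "comments"."post_id" IN (...)` query covering all three posts.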
@ghiculescu (Member) commented Aug 1, 2022

Very cool idea. I think the final SQL output is similar to what you'd get with goldiloader, without any includes needed. I know goldiloader isn't part of Rails (and it's great to see this in Rails!), but it would be good to confirm whether these changes are compatible with it.

@jonathanhefner (Member, Author)

@ghiculescu Thanks! 😃 If my understanding is correct, I think #45413 is closer to goldiloader.

In this PR, I wanted to take a more conservative approach, to try to avoid situations where deeply nested code must defensively disable the feature for fear of loading too much data. Granted, that can still happen with this PR, but my hope is that restricting the feature to explicit (distributed) chains of includes / preload calls will greatly reduce the risk.

@gwincr11 (Contributor)

Yeah, glad to see another approach to this. I took a shot at this style of API before I submitted my proposed approach, and I'd like to share the feedback I received.

Many people thought the ergonomics of having to opt in to the include were not great, since preloading is more often the preferred mode, and they would rather have an opt-out.

The tricky thing, I think, is balancing the shortest code to write against the mistakes we care about most. Large data loads could be a problem, but which problem is our API optimized to fix? Should it load more eagerly to prevent N+1 queries, or should it optimize for large data flows and make fixing N+1 queries an opt-in?

@gwincr11 (Contributor)

I do wonder if perhaps a new API is needed that does not yet exist. The concern we are working with here is fairly well understood: when you may be loading a lot of records directly from the database, we use the batch API. What if associations allowed a batch-like method to be called?

Something like:

`post.comments_in_batches.each`

This would let the user limit the number of records brought back from the database in a familiar way, while still allowing dynamic includes to be the default. It also provides a familiar API for working with large data sets.
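The proposed `comments_in_batches` is hypothetical and does not exist in Rails; the closest existing API is `in_batches` on relations. A plain-Ruby simulation of the suggested ergonomics, using `each_slice` in place of batched database queries, might look like:

```ruby
# Plain-Ruby simulation of the proposed (hypothetical) comments_in_batches.
# In real Active Record this would delegate to in_batches on the
# association's relation; here each_slice stands in for batched queries.
class Post
  attr_reader :comments

  def initialize(comments)
    @comments = comments
  end

  def comments_in_batches(of: 1000)
    return enum_for(:comments_in_batches, of: of) unless block_given?
    comments.each_slice(of) { |batch| yield batch }
  end
end

post = Post.new((1..5).to_a)
post.comments_in_batches(of: 2).to_a  # => [[1, 2], [3, 4], [5]]
```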
