-
Notifications
You must be signed in to change notification settings - Fork 21.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Relation#load_async to schedule the query on the background thread pool #41372
Conversation
44d6dab
to
e733187
Compare
085482a
to
a80388d
Compare
[].freeze | ||
else | ||
relation = join_dependency.apply_column_aliases(relation) | ||
@_join_dependency = join_dependency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very happy with having to store this in an instance variable for later use, but apply_join_dependency
can sometimes trigger a query on its own, so if we were to call it twice, we'd generate an extra query in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a test for this case? Or can you elaborate a bit more what you mean by it "can sometimes trigger a query on its own"? I don't follow what scenarios this might happen in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is exactly one test case: https://buildkite.com/rails/rails/builds/74758#40f90c22-7273-4458-a0db-71ec3fbf3638/1055-1064
The extra query is triggered here:
klass.connection.distinct_relation_for_primary_key(relation) |
Which mean that in such case load_async
would actually trigger a query in the main thread.
There might be a way to refactor this out, but I chose to prioritize a smaller more comprehensible PR.
# Post.where(published: true).defer # => #<ActiveRecord::Relation> | ||
def defer | ||
unless loaded? || scheduled? | ||
return self if connection.current_transaction.joinable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasoning here is that some shared code might attempt to use .defer
, and sometime might be called from inside a transaction, and sometimes outside.
So have have to make an API decision of either
- Querying immediately
- Just do nothing
I believe the later make more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new name (which I agree with), I think it's probably better that load_async
will still load
even if it can't do it async.
In the async-possible case, the resulting relation here will behave like it was already loaded (by blocking if necessary to make that true)... I can't think of specific examples, but my gut says an un-loaded relation will behave less similarly in some edge cases.
That said, I can also see use cases where a knowing caller might want to say "start an async query for this if you can, because I think it's going to be used later... but if you'd need to block, then we'll just wait and see whether it's actually needed". (Association preload and "all Relation-shaped instance variables that get copied from a controller into a view context" being variously-reasonable examples there.) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can find arguments for both behaviors too.
I think the most common case this will happen, will be in tests using transactional fixtures. So I'm thinking we should chose the solution that will make this fallbacks behave closer to production, in term of assert_queries
and similar test assertions.
Using this reasoning, I'd say you are right and we should fallback by either straight calling load
, or maybe by calling exec_main_query(async: false)
, as to simulate the query being performed, but the records not being instantiated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I pushed 2042a2914535551f165378e11da19fef98ee7fbb, which does call exec_main_query(async: false)
if we're inside a transaction.
The pro is that the behavior is much closer to a query that would be properly scheduled asynchronously, as the records instantiation is still delayed.
The con is that it open the door for some weird behavior, as the main query is executed inside the transaction, but if you have preloads or something like that, and you iterate on the relation outside the transaction, they might be performed outside.
So maybe falling back to load
might be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not possible to do an async query then we should go through the regular load path rather than a new future result path that isn't actually future. This makes the feature easier to debug and harder to make accidental mistakes.
I made this mistake in my demo app by calling to_a
on the load_async
(ie Dog.where(name: "fido #{i}").load_async.to_a
). From a user experience perspective it was pretty confusing to debug why it said it was async but it was not actually async.
Additionally I made another mistake that revealed a bug with pluck
and load_async
. If you do Dog.where(name: "fido #{i}").load_async.pluck(:id)
it results in duplicate queries. I now know this is not the correct way to use the feature but it's an easy mistake to make that results in incorrect behavior.
Dog Load (0.3ms) SELECT `dogs`.* FROM `dogs` WHERE `dogs`.`name` = 'fido 1'
Dog Pluck (0.2ms) SELECT `dogs`.`id` FROM `dogs` WHERE `dogs`.`name` = 'fido 1'
In the non-async code path (ie Dog.where(name: "fido #{i}")load.pluck(:login)
) Rails knows what to do with an already loaded relation and we don't get duplicate queries (see https://github.com/rails/rails/blob/main/activerecord/lib/active_record/relation/calculations.rb#L184-L186)
Dog Load (0.3ms) SELECT `dogs`.* FROM `dogs` WHERE `dogs`.`name` = 'fido 1'
If we go the previous path for relations that can't be loaded async, we'll have less edge cases to worry about. I think the mistake I made with pluck and to_a is an easy one to make and so in that case we should make sure the app behaves as it previously did rather than having potentially surprising non-async async behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pluck
is a nice find:
def pluck(*column_names)
if loaded? && all_attributes?(column_names)
return records.pluck(*column_names)
end
Since load_async
doesn't set @loaded = true
but @scheduled = true
, pluck doesn't know it could use the future.
I either need to go over all Relation
methods to make sure they handle both loaded?
and scheduled?
states, or we could merge both states together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this mistake in my demo app by calling to_a on the load_async [...] it was pretty confusing to debug why it said it was async but it was not actually async.
I think this is my fault, that part of the design was explained in the initial proof of concept, but I didn't mention it in the first actual PR.
Then wehn you do use the relation, either the query is completed and it just uses the results, or it is being executed and it waits for it to complete, or the thread pool was busy and it simply execute it in the foreground.
That third case is important, because of the way Ruby thread scheduling works, unless you do other unrelated IOs between the time you did call load_async
and the time you actually need the result, it's entirely possible that the query have been sitting idle in the work queue without even having been sent to the database.
We could of course wait for the thread pool to execute it, but that would be a waste because at this stage we know that the main thread will be blocked on that result, and that it's free to execute the query itself.
So yes, it's very easy to misuse load_async
, and we should be on the lookout for making easier to understand what's happening, but I do believe that behavior is the current one.
Now if the database clients had true native async APIs, we could send the query and return a response promise/future, but that's unlikely to happen any time soon.
Let me know if that behavior is still unclear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: pluck
Now load_async
marks the Relation
as loaded?
, which fixes pluck
and similar methods that behave differently based on wether the Relation was loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something else we could do it to call Thread.pass
right after we enqueue the query, this would increase the chance that the query get performed asynchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So interestingly enough, calling Thread.pass
caused the test of that fallback behavior to fail, which proves that it's actually effective at producing the behavior you expected in this simple scenario.
a80388d
to
901c4e6
Compare
@@ -33,7 +41,7 @@ def complete(asynchronous_queries_tracker) | |||
attr_reader :current_session | |||
|
|||
def initialize | |||
@current_session = nil | |||
@current_session = NoSession |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some thought, I think I make sense to have defer
work by default. It's good to cancel queries when we know we exited the context in which they are useful, but if we're in rails console
or just a random rails runner
script, there's no reason to not allow it.
[].freeze | ||
else | ||
relation = join_dependency.apply_column_aliases(relation) | ||
@_join_dependency = join_dependency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a test for this case? Or can you elaborate a bit more what you mean by it "can sometimes trigger a query on its own"? I don't follow what scenarios this might happen in.
022015d
to
56ab574
Compare
I've put more thoughts in the In #40037 I made it held by the But now I don't think it's really worth the tradeoff. First the vast majority of Rails applications out there have a single database, so for them, a per pool executor or a global one don't matter. Then for apps using horizontal sharding (like Shopify), you can have a large number of pools to similar shards, but you only ever query a single one during a request cycle. If you have say 30 shards, you'd end up with 30 executor with X threads each, most of them sitting idle the vast majority of the time, and I think we should rather keep the number of threads low, as I don't quite trust the scheduler to properly handle hundreds of idle threads. |
9f23296
to
2042a29
Compare
While testing out rails#41372 in a demo application I noticed that the `ASYNC` portion wasn't getting added to the log. I tracked it down to this method not passing `async: true` to `execute_query` so the `sql` subscriber was always getting `async: false` for the payload. Before: ``` (0.3ms) SELECT * FROM dogs where name = 'fido 1' (0.1ms) SELECT * FROM dogs where name = 'fido 2' (0.1ms) SELECT * FROM dogs where name = 'fido 3' ``` After: ``` ASYNC (0.3ms) SELECT * FROM dogs where name = 'fido 1' ASYNC (0.2ms) SELECT * FROM dogs where name = 'fido 2' ASYNC (0.2ms) SELECT * FROM dogs where name = 'fido 3' ``` I also verified that this fixes `Dog Load` to `ASYNC Dog Load` in the new PR. It will be easier to add a test for this functionality when we merge rails#41372.
753aefe
to
d0fc4a8
Compare
… thread pool This is built on previous async support added to Adapter#select_all.
d0fc4a8
to
2a90104
Compare
@eileencodes code squashed, rebased, and about to be green on CI! |
Thanks!
|
rails#41372 Added `ActiveRecord::AsynchronousQueriesTracker::NullSession` which replaced a use of `nil` in `AsynchronousQueriesTracker`. This commit changes the `finalize_session` method to match that change from `nil` and properly handle cases where it is called with a `NullSession` present.
records.empty? | ||
else | ||
!exists? | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious - why we dont want to use guard clause in that case?
# | ||
# Post.where(published: true).load_async # => #<ActiveRecord::Relation> | ||
def load_async | ||
unless loaded? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of guard clause will simplify most of the conditional statements in the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion but we prefer to not use guard clauses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a place where the core team explains why they don't want to use guard clauses? I did a quick search, but wasn't able to find anything and would love to hear the reasons.
Followup: rails#41372 Something we knew we'd need when we implemented `Relation#load_async` but that we chose to delay to have time to think about it. Right now, Active Record async support is limited to "collection results", but among the not so fast queries that would benefit from asynchronicity you often find aggregates (e.g. `count`, `sum`, etc) as well as hand crafted `find_by_sql` queries. `load_async` was easy to add as an API, because `Relation` acts as a collection so it was trivial to simply block whenever it was iterated while retaining total API compatibility. For aggregates and `find_by_sql`, we have no other choice but to return something different in async mode, with its own API. This proof of concept showcase what this API looks like for `Relation#count`: ```ruby Post.where(published: true).count # => 2 promise = Post.where(published: true).async.count # => #<ActiveRecord::Promise status=pending> promise.value # => 2 ``` This API should be applicable to all aggregate methods, as well as all methods returning a single record, or anything other than a `Relation`.
Followup: rails#41372 Something we knew we'd need when we implemented `Relation#load_async` but that we chose to delay to have time to think about it. Right now, Active Record async support is limited to "collection results", but among the not so fast queries that would benefit from asynchronicity you often find aggregates (e.g. `count`, `sum`, etc) as well as hand crafted `find_by_sql` queries. `load_async` was easy to add as an API, because `Relation` acts as a collection so it was trivial to simply block whenever it was iterated while retaining total API compatibility. For aggregates and `find_by_sql`, we have no other choice but to return something different in async mode, with its own API. This proof of concept showcase what this API looks like for `Relation#count`: ```ruby Post.where(published: true).count # => 2 promise = Post.where(published: true).count(async: true) # => #<ActiveRecord::Promise status=pending> promise.value # => 2 ``` This API should be applicable to all aggregate methods, as well as all methods returning a single record, or anything other than a `Relation`.
Related to #load_async queries added in rails#40037 rails#41372.
This is a followup to #40037 (see that PR for complete context)
Relation#load_async
schedules the query on the background thread pool, and returnsself
.