-
Notifications
You must be signed in to change notification settings - Fork 203
Open
Labels
featureNew feature or requestNew feature or requestinstrumentationkeepEnsures stale-bot keeps this issue/PR openEnsures stale-bot keeps this issue/PR open
Description
ActiveRecord
7.1 introduced async query methods1, which leverages concurrency primitives via Promise
API to defer executing queries.
We would like to enrich ActiveRecord
related spans to know whether or not they were executed from the async context.
Additional Notes
The sql.active_record
notification payload includes a async
attribute:
https://github.com/rails/rails/blob/616d3a7675c18ae2d38d3116de47e2e7308cfbd8/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb#L1135
We may be able to enrich the ActiveRecord
span and amend it as a Shared DB attribute for the driver itself however we do not currently use notifications in this instrumentation.
Worst case scenario, we monkey patch existing async
methods, concurrency primitives.
Footnotes
kaylareopelle
Metadata
Metadata
Assignees
Labels
featureNew feature or requestNew feature or requestinstrumentationkeepEnsures stale-bot keeps this issue/PR openEnsures stale-bot keeps this issue/PR open
Type
Projects
Milestone
Relationships
Development
Select code repository
Activity
bensheldon commentedon Oct 31, 2024
Thank you for opening this! 🙇🏻
It would also be great if we could include the
lock_wait
attribute too, which represents how much time is spent in the foreground waiting for the query to complete (in a true async query, it would be 0.0ms):https://github.com/rails/rails/blob/df9912116b29423b5bde3771b9829447a07d3f6c/activerecord/lib/active_record/log_subscriber.rb#L23-L24
github-actions commentedon Dec 1, 2024
👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the
keep
label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.zvkemp commentedon Dec 13, 2024
I can work on this.
kaylareopelle commentedon Jan 10, 2025
Thanks, @zvkemp! I've assigned this issue to you.
zvkemp commentedon Jan 15, 2025
@arielvalentin I have a few clarifying questions on this.
The current ActiveRecord::Instrumentation patches _query_by_sql, which happens before any decisions about asynchrony are made (the caller may request the query should be async, but can't require it). If
async: true
is passed, and the other conditions are met, the query asks to be added as a scheduled task to a thread pool (either global or per-connection, depending on ActiveRecord's config). The thread pool may not be running or the queue may be full, in which case the fallback action is executed (running the task on the calling thread, so not async). All of the above happens within the existingUser query
span (let's call this SpanA). If the task scheduling succeeded, SpanA measures the time it took to build and enqueue the query; if not, it measures the entire query execution.The second wrinkle is in retrieving the result from the promise — even if it is successfully scheduled, if the task is still not enqueued by the time the original thread requests the result, the query is executed synchronously. At this point Span A is finished (so it's not permitted to change its name or other attributes), so the execution should be wrapped in a new span. If the task has started, but hasn't returned a result yet, the caller thread waits until it can acquire the tasks's mutex (indicating the thread pool is done with it), and records the
lock_wait
time.So onto the question of how to enrich SpanA — in the case that it was actually async, there should probably be:
execute_or_wait
tracking thelock_wait
time (SpanC). This would have and end_timestamp of no less then SpanB's end_timestamp, but could start before SpanB (if the thread pool is very busy, and hasn't started on this task yet — should SpanC be the parent of SpanB in this case?)The parentage of SpanB is a somewhat tricky problem — its fairly straightforward to wrap the task using SpanA as a parent, though because of asynchrony, SpanB may be started after SpanA is finished (is this even valid? The spec doesn't seem to prohibit it), like this:
In the case where the task was scheduled but not picked up yet:
^ In this case, SpanB is still a child of SpanA, though in reality it would be more correct to make it a child of SpanC. I'm not sure how that re-parenting would work though, unless we delay deciding on a span context until the query is actually fired.
arielvalentin commentedon Jan 15, 2025
@zvkemp these are all great points and thank you for the illustrations because they highlight the complexity around how our currently implementation does not work well with concurrent or parallel workflows. open-telemetry/opentelemetry-ruby#1766
Knowing that AR et.al. internals are using Concurrent Ruby is what prompted me to add its instrumentation as a dependency for Rails.
The Concurrent Ruby instrumentation attempts to do that but uses OTel objects instead of using the propagator API.
I do not have an answer here for this except that it may mean we need to model this differently, i.e. potentially having multiple spans, and/or implement this a way that uses context propagation differently between threads and fibers, which then requires the backend to stich the graph together properly, but we would not be able to enrich parent spans like this issue would like.
zvkemp commentedon Jan 16, 2025
I'll further add that
lock_wait
doesn't tell the whole story of why you might have waited; heavy load on the global thread pool can cause this as well, but time spent in the queue isn't tracked.So here's a clearer picture of the 5 scenarios that can happen using
load_async
:lock_wait == 0
, async):lock_wait > 0
, async):^ this is the only scenario where ActiveRecord would have reported a non-zero/non-nil lock_wait time - the
query was successfully scheduled, and it started executing (but didn't finish) before its result
was requested.
lock_wait == nil
, not async):(to_a was called before a worker became available, so the executor was pre-empted). This may also happen if
to_a
is called immediately, before the executor has any chance to spawn a worker or assign it work (e.g.User.all.load_async.to_a
). The call toThread#pass
is supposed to prevent this (but I'm not sure why that's even necessary, as you'll be waiting the same amount of time anyway).lock_wait == 0
)(executor executes the task inline instead of enqueuing it).
async
argument is ignored.So
lock_wait
isn't even reported in 4 out of 5 scenarios, and in the one where it is, you're not really waiting to acquire the lock (I mean, you are, but only to determine when the executor has finished). Thelock_wait
time is equal to the time between callingto_a
and whenever the query finishes. I think the three spans shown in all of these examples are worth including;lock_wait
on its own— probably not.As to the question of how to tell whether the span was async — we can certainly annotate that it was requested to be asynchronous (meaning that under some conditions it probably would be), but short of comparing thread object_ids, it's not really possible to reliably tell based on the callsite or arguments whether the query was executed in the request thread or in the executor's thread pool. I think there are some improvements to be made here, the main one being that currently, async loads (scenarios 1 or 2) have a span that only wraps the scheduler, not the query itself, so the traces are misleading at best.