[ActiveRecord] Add Attributes for Async Queries #1219

New issue

Open

#1351

Open

[ActiveRecord] Add Attributes for Async Queries#1219

#1351

Assignees

Labels

featureinstrumentationkeep

arielvalentin

opened

on Oct 31, 2024

Contributor

ActiveRecord 7.1 introduced async query methods¹, which leverages concurrency primitives via Promise API to defer executing queries.

We would like to enrich ActiveRecord related spans to know whether or not they were executed from the async context.

Additional Notes

The sql.active_record notification payload includes a async attribute:
https://github.com/rails/rails/blob/616d3a7675c18ae2d38d3116de47e2e7308cfbd8/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb#L1135

We may be able to enrich the ActiveRecord span and amend it as a Shared DB attribute for the driver itself however we do not currently use notifications in this instrumentation.

Worst case scenario, we monkey patch existing async methods, concurrency primitives.

https://github.com/rails/rails/pull/44446 ↩

added

Contributor

Thank you for opening this! 🙇🏻

It would also be great if we could include the lock_wait attribute too, which represents how much time is spent in the foreground waiting for the query to complete (in a true async query, it would be 0.0ms):

https://github.com/rails/rails/blob/df9912116b29423b5bde3771b9829447a07d3f6c/activerecord/lib/active_record/log_subscriber.rb#L23-L24

github-actionsbot

Contributor

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

added

added

and removed

Contributor

I can work on this.

assigned

Contributor

Thanks, @zvkemp! I've assigned this issue to you.

zvkemp

Contributor

We would like to enrich ActiveRecord related spans to know whether or not they were executed from the async context

@arielvalentin I have a few clarifying questions on this.

The current ActiveRecord::Instrumentation patches _query_by_sql, which happens before any decisions about asynchrony are made (the caller may request the query should be async, but can't require it). If async: true is passed, and the other conditions are met, the query asks to be added as a scheduled task to a thread pool (either global or per-connection, depending on ActiveRecord's config). The thread pool may not be running or the queue may be full, in which case the fallback action is executed (running the task on the calling thread, so not async). All of the above happens within the existing User query span (let's call this SpanA). If the task scheduling succeeded, SpanA measures the time it took to build and enqueue the query; if not, it measures the entire query execution.

The second wrinkle is in retrieving the result from the promise — even if it is successfully scheduled, if the task is still not enqueued by the time the original thread requests the result, the query is executed synchronously. At this point Span A is finished (so it's not permitted to change its name or other attributes), so the execution should be wrapped in a new span. If the task has started, but hasn't returned a result yet, the caller thread waits until it can acquire the tasks's mutex (indicating the thread pool is done with it), and records the lock_wait time.

So onto the question of how to enrich SpanA — in the case that it was actually async, there should probably be:

a span indicating the task was scheduled (SpanA, with a different name or attributes)
a new span that wraps the async task (let's call this SpanB)
a new span wrapping execute_or_wait tracking the lock_wait time (SpanC). This would have and end_timestamp of no less then SpanB's end_timestamp, but could start before SpanB (if the thread pool is very busy, and hasn't started on this task yet — should SpanC be the parent of SpanB in this case?)

The parentage of SpanB is a somewhat tricky problem — its fairly straightforward to wrap the task using SpanA as a parent, though because of asynchrony, SpanB may be started after SpanA is finished (is this even valid? The spec doesn't seem to prohibit it), like this:

|----SpanA------|
                        |-SpanB------------------|
                          |-----SpanC (lock_wait)--|

In the case where the task was scheduled but not picked up yet:

|----SpanA------|
                      |-----SpanC (lock_wait)-----|
                        |-SpanB------------------|

^ In this case, SpanB is still a child of SpanA, though in reality it would be more correct to make it a child of SpanC. I'm not sure how that re-parenting would work though, unless we delay deciding on a span context until the query is actually fired.

arielvalentin

ContributorAuthor

@zvkemp these are all great points and thank you for the illustrations because they highlight the complexity around how our currently implementation does not work well with concurrent or parallel workflows. open-telemetry/opentelemetry-ruby#1766

Knowing that AR et.al. internals are using Concurrent Ruby is what prompted me to add its instrumentation as a dependency for Rails.
The Concurrent Ruby instrumentation attempts to do that but uses OTel objects instead of using the propagator API.

I do not have an answer here for this except that it may mean we need to model this differently, i.e. potentially having multiple spans, and/or implement this a way that uses context propagation differently between threads and fibers, which then requires the backend to stich the graph together properly, but we would not be able to enrich parent spans like this issue would like.

zvkemp

Contributor

I'll further add that lock_wait doesn't tell the whole story of why you might have waited; heavy load on the global thread pool can cause this as well, but time spent in the queue isn't tracked.

So here's a clearer picture of the 5 scenarios that can happen using load_async:

The happy path– short query time and no waiting (lock_wait == 0, async):

|--schedule-query--|
                     |-query-|
                                # call .to_a
                                |-execute-or-wait-| (returns quickly)

long query time (lock_wait > 0, async):

|--schedule-query--|
                     |--query-fires-here---------------|
                          # call .to_a
                          |--execute-or-wait------------|

^ this is the only scenario where ActiveRecord would have reported a non-zero/non-nil lock_wait time - the
query was successfully scheduled, and it started executing (but didn't finish) before its result
was requested.

busy executor (all workers busy, queue has slots available, lock_wait == nil, not async):

|--schedule-query--|
                       # call .to_a
                       |--execute_or_wait---------|
                        |--query-fires-here------|

(to_a was called before a worker became available, so the executor was pre-empted). This may also happen if to_a is called immediately, before the executor has any chance to spawn a worker or assign it work (e.g. User.all.load_async.to_a). The call to Thread#pass is supposed to prevent this (but I'm not sure why that's even necessary, as you'll be waiting the same amount of time anyway).

blocked executor: (queue is full, not async, lock_wait == 0)

|--schedule-query---------------|
       |--query-fires-here--|
                                   # call.to_a
                                   |-execute-or-wait-| (returns quickly)

(executor executes the task inline instead of enqueuing it).

There's an additional scenario where the executor is not configured, in which case the async argument is ignored.

So lock_wait isn't even reported in 4 out of 5 scenarios, and in the one where it is, you're not really waiting to acquire the lock (I mean, you are, but only to determine when the executor has finished). The lock_wait time is equal to the time between calling to_a and whenever the query finishes. I think the three spans shown in all of these examples are worth including; lock_wait on its own— probably not.

As to the question of how to tell whether the span was async — we can certainly annotate that it was requested to be asynchronous (meaning that under some conditions it probably would be), but short of comparing thread object_ids, it's not really possible to reliably tell based on the callsite or arguments whether the query was executed in the request thread or in the executor's thread pool. I think there are some improvements to be made here, the main one being that currently, async loads (scenarios 1 or 2) have a span that only wraps the scheduler, not the query itself, so the traces are misleading at best.

zvkemp

linked a pull request that will close this issue

on Jan 17, 2025

feat: track ActiveRecord async queries #1351

to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

zvkemp

Labels

featureinstrumentationkeep

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

feat: track ActiveRecord async queriesopen-telemetry/opentelemetry-ruby-contrib

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ActiveRecord] Add Attributes for Async Queries #1219

Additional Notes

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

[ActiveRecord] Add Attributes for Async Queries #1219

Description

Additional Notes

Footnotes

Activity

bensheldon commented on Oct 31, 2024

github-actions commented on Dec 1, 2024

zvkemp commented on Dec 13, 2024

kaylareopelle commented on Jan 10, 2025

zvkemp commented on Jan 15, 2025

arielvalentin commented on Jan 15, 2025

zvkemp commented on Jan 16, 2025

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions