
Allow for frequent sampling of selected threads #4225


Open
wants to merge 12 commits into base: main

Conversation

lachmatt
Contributor

@lachmatt lachmatt commented May 26, 2025

Why & What

PoC for #4227

Modifies the continuous profiler's code to allow for frequent sampling of selected threads.

Samples for selected threads are accumulated on the native side in a new buffer, similar to the one used for allocation samples.
The buffer is periodically read and reset from the managed side, and the samples are exported by a thread shared with the continuous profiler.

Selective sampling is implemented by adding another mode to the continuous profiler sampler.
Both sampling modes can be enabled independently, or at the same time. When both are enabled, the restriction is that the larger sampling interval (i.e. continuous sampling) must be a multiple of the smaller sampling interval (i.e. selective sampling). This simplifies the implementation of sampling at different frequencies.
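The "larger interval is a multiple of the smaller one" restriction means a single timer loop can drive both modes: the sampler wakes at the selective interval and takes a continuous-profiling pass every N-th tick. A minimal sketch of that idea (the names here are illustrative, not the PR's actual identifiers):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: with the continuous interval constrained to be a
// multiple of the selective interval, one wake-up loop can drive both modes.
struct SamplerTick
{
    uint32_t selective_interval_ms;  // smaller interval, drives the wake-ups
    uint32_t continuous_interval_ms; // must be a multiple of the above
    uint64_t tick = 0;

    // Returns true when the current wake-up should also take a
    // continuous-profiling pass over all threads.
    bool ContinuousPassDue()
    {
        const uint64_t ratio = continuous_interval_ms / selective_interval_ms;
        return tick++ % ratio == 0;
    }
};
```

With a 100 ms selective interval and a 500 ms continuous interval, every fifth wake-up performs a full continuous pass, and every wake-up samples the selected threads.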

The functionality added in this PR would allow plugins to implement trace-centric sampling.
The remaining functionality for such a feature could be implemented in plugins, e.g. by customizing the TracerProvider created by the autoinstrumentation using existing hooks.

The open question is how best to expose to plugins the API for starting/stopping sampling of a given thread. For now, the methods are called using reflection.

TODO in this PR:

  • Implement a test that verifies the behavior when the 2 sampling modes (continuous, selective) are enabled at the same time (currently verified manually).

Preferably in a separate PR:

For the sake of simplicity, for now I took some shortcuts in the implementation:

  • when both sampling modes are enabled by a plugin, the continuous sampling interval is expected to be larger than, and a multiple of, the selective sampling interval. If that's not the case, sampling initialization fails.
  • when both sampling modes are enabled and the export interval or timeout differ, the higher values are used.

The higher timeout is probably preferable, as it is more permissive.
The higher export interval might also be preferable: for selected-thread samples it means more samples read in a single batch, possibly more samples exported in a single request, possibly more efficient encoding, and possibly less need for additional batching in the exporter.
On the other hand, less frequent reads might result in some data loss (e.g. when native buffers fill up while waiting to be read and exported). Let me know if you think these decisions should be changed.
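The first shortcut implies a cross-check of the two intervals at initialization. A sketch of what that validation could look like (illustrative names; the PR performs this check on the managed side, in C#, so this is only an analogue of the logic):

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>

// Illustrative sketch of the interval validation described above: if both
// modes are enabled, the continuous interval must be larger than, and a
// multiple of, the selective interval; otherwise initialization fails.
void ValidateIntervals(std::optional<uint32_t> continuous_ms,
                       std::optional<uint32_t> selective_ms)
{
    if (!continuous_ms || !selective_ms)
    {
        return; // only one mode enabled, nothing to cross-check
    }
    if (*selective_ms == 0 || *continuous_ms <= *selective_ms ||
        *continuous_ms % *selective_ms != 0)
    {
        throw std::invalid_argument(
            "continuous interval must be a larger multiple of the selective interval");
    }
}
```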

Tests

Included in the PR: a simple test that verifies that the number of captured samples and the time difference between consecutive samples are in the expected range.

Checklist

  • CHANGELOG.md is updated.
  • Documentation is updated.
  • New features are covered by tests.

@lachmatt lachmatt changed the title [WIP] allow for frequent sampling of selected threads Allow for frequent sampling of selected threads Jun 11, 2025
@lachmatt lachmatt marked this pull request as ready for review June 11, 2025 14:47
@lachmatt lachmatt requested a review from a team as a code owner June 11, 2025 14:47
@RassK
Contributor

RassK commented Jun 19, 2025

Some thoughts:

  • Can you update the status: which TODOs are you going to address in this PR?
  • Can you add docs (overview, a visual schema of how it tracks the thread)?
  • Use a low rate by default, but let users choose one that works for long traces and the overhead.

@lachmatt
Contributor Author

Some thoughts:

  • Can you update the status: which TODOs are you going to address in this PR?

Done.

  • Can you add docs (overview, a visual schema of how it tracks the thread)?

I briefly described it here. I will create a new draft PR with the implementation in a plugin and link to it here.

Comment on lines +84 to +86
static std::mutex selective_sampling_buffer_lock = std::mutex();
static std::vector<unsigned char>* selective_sampling_buffer = new std::vector<unsigned char>();

Contributor

Is there a specific reason for using a pointer and allocating selective_sampling_buffer with new?
Using a plain std::vector<unsigned char> instead of a pointer might simplify the code.
Using new and delete introduces extra complexity and manual memory management.
Would it make sense to simplify this by using a local object?

Contributor Author

The main reason for the current approach was consistency with the existing code.
If we were to use a different approach, I think it would be beneficial to modify the code related to allocation buffers as well, and I didn't want to do that in this PR.

Comment on lines +213 to +237
// TODO: deduplicate
int32_t SelectiveSamplingConsumeAndReplaceBuffer(int32_t len, unsigned char* buf)
{
    if (len <= 0 || buf == nullptr)
    {
        trace::Logger::Warn("Unexpected 0/null buffer to SelectiveSamplingConsumeAndReplaceBuffer");
        return 0;
    }
    std::vector<unsigned char>* to_use = nullptr;
    {
        std::lock_guard<std::mutex> guard(selective_sampling_buffer_lock);
        to_use = selective_sampling_buffer;
        selective_sampling_buffer = new std::vector<unsigned char>();
        selective_sampling_buffer->reserve(kSamplesBufferDefaultSize);
    }
    if (to_use == nullptr)
    {
        return 0;
    }
    const size_t to_use_len = std::min(to_use->size(), static_cast<size_t>(len));
    memcpy(buf, to_use->data(), to_use_len);
    delete to_use;
    return static_cast<int32_t>(to_use_len);
}

Contributor

I noticed the buffer is replaced by allocating a new vector and deleting the old one each time.
Would it make sense to simplify this by swapping the buffer with a local object, then using clear() and resize() on the original vector to avoid frequent heap allocations?
This could avoid the overhead of frequent heap allocations, which using new and delete currently introduces.
Was there a particular reason for the current approach?
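The swap-and-reuse variant suggested above might look roughly like this (a sketch with illustrative names, not the PR's code; the scratch vector keeps its capacity between reads, so steady-state reads allocate nothing):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Sketch of the reviewer's suggestion: keep a plain vector (no new/delete)
// and swap its contents into a reusable scratch vector under the lock.
static std::mutex g_buffer_lock;
static std::vector<unsigned char> g_sample_buffer;

int32_t ConsumeSamples(unsigned char* buf, int32_t len)
{
    if (len <= 0 || buf == nullptr)
    {
        return 0;
    }
    static std::vector<unsigned char> scratch; // reused; retains its capacity
    scratch.clear();
    {
        std::lock_guard<std::mutex> guard(g_buffer_lock);
        std::swap(scratch, g_sample_buffer); // O(1), no heap allocation
    }
    const size_t n = std::min(scratch.size(), static_cast<size_t>(len));
    std::memcpy(buf, scratch.data(), n);
    return static_cast<int32_t>(n);
}
```

After the swap, the producer-side vector carries the scratch vector's old capacity, so appends on the sampling thread also avoid reallocating once a steady state is reached.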

Contributor Author

See #4225 (comment).
I'd expect the overhead to be acceptable, assuming the managed-side code consumes the buffer rarely (e.g. every 0.5 s).


The current code avoids copying inside the lock, but heap allocations are far more expensive than copying. I recommend copying the sample buffer to buf directly within the lock and avoiding allocations altogether—copying is fast and essentially lock free, while heap allocations are not.
Even better, if this is a single-consumer single-producer (SCSP) scenario, we could use a ring buffer for a completely lock-free solution. Please share your thoughts.

Contributor Author

@lachmatt lachmatt Jul 9, 2025

The current code avoids copying inside the lock, but heap allocations are far more expensive than copying. I recommend copying the sample buffer to buf directly within the lock and avoiding allocations altogether—copying is fast and essentially lock free, while heap allocations are not.

Sounds like a good idea.
As mentioned above, considering this is expected to be called somewhat rarely by the managed thread that reads the buffer, I think this could be changed in a follow-up PR. Let me know if you disagree and consider this a blocker for merging this PR.

Even better, if this is a single-consumer single-producer (SCSP) scenario, we could use a ring buffer for a completely lock-free solution. Please share your thoughts.

I'd prefer to investigate such optimizations in a separate PR.

Contributor

Let's link an issue to these comments if a separate follow-up is expected 👍

Contributor Author

Here is my revised suggestion:
• Defer lock-free ring buffer optimization for now.
• Perform buffer copying under the lock.
• Do not allocate a new vector for every read; reuse the existing buffer. Allocate it once as part of initialization.
• Use std::unique_ptr<unsigned char[]> for raw buffer management instead of std::vector, which is unnecessary for this use case.
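Taken together, the bullets above might be sketched as follows (illustrative names and a hypothetical wrapper class, not the PR's code): a raw buffer allocated once at initialization via std::unique_ptr<unsigned char[]>, with the copy performed under the lock and no per-read allocation.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <memory>
#include <mutex>

// Sketch of the revised suggestion: fixed-capacity raw buffer, allocated
// once; copying happens under the lock; no new/delete per read.
class SampleBuffer
{
public:
    explicit SampleBuffer(size_t capacity)
        : capacity_(capacity), data_(std::make_unique<unsigned char[]>(capacity))
    {
    }

    // Producer side: append bytes, dropping anything that does not fit.
    void Append(const unsigned char* src, size_t n)
    {
        std::lock_guard<std::mutex> guard(lock_);
        const size_t to_copy = std::min(n, capacity_ - size_);
        std::memcpy(data_.get() + size_, src, to_copy);
        size_ += to_copy;
    }

    // Consumer side: copy out under the lock and reset, no allocation.
    int32_t ConsumeInto(unsigned char* dst, int32_t len)
    {
        if (len <= 0 || dst == nullptr)
        {
            return 0;
        }
        std::lock_guard<std::mutex> guard(lock_);
        const size_t n = std::min(size_, static_cast<size_t>(len));
        std::memcpy(dst, data_.get(), n);
        size_ = 0;
        return static_cast<int32_t>(n);
    }

private:
    std::mutex lock_;
    size_t capacity_;
    size_t size_ = 0;
    std::unique_ptr<unsigned char[]> data_;
};
```

The trade-off relative to the current diff is that the lock is held for the duration of the memcpy, which the earlier comment argues is still cheaper than per-read heap allocation.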

Comment on lines +777 to +804

localBuf.StartSelectedThreadsBatch();
for (auto thread_id : selected_sampling_threads_set)
{
    prof->stats_.num_threads++;
    thread_span_context spanContext = thread_span_context_map[thread_id];
    auto found = prof->managed_tid_to_state_.find(thread_id);

    if (found != prof->managed_tid_to_state_.end() && found->second != nullptr)
    {
        localBuf.StartSampleForSelectedThread(found->second, spanContext);
    }
    else
    {
        auto unknown = ThreadState();
        localBuf.StartSampleForSelectedThread(&unknown, spanContext);
    }

    HRESULT snapshotHr =
        info12->DoStackSnapshot(thread_id, &FrameCallback, COR_PRF_SNAPSHOT_DEFAULT, &dssp, nullptr, 0);
    if (FAILED(snapshotHr))
    {
        trace::Logger::Debug("DoStackSnapshot failed. HRESULT=0x", std::setfill('0'), std::setw(8), std::hex,
                             snapshotHr);
    }
    localBuf.EndSample();
}
localBuf.EndSelectedThreadsBatch();
Contributor

The increment of prof->stats_.num_threads inside the loop is redundant, since the loop iterates exactly over selected_sampling_threads_set. Instead of incrementing the counter on each iteration, it would be more efficient and clearer to set prof->stats_.num_threads once, either before or after the loop, using the size of selected_sampling_threads_set. This avoids unnecessary increments.
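The suggested change amounts to a one-line replacement of the per-iteration increment. A self-contained sketch (Stats and the set's element type are illustrative stand-ins, not the PR's actual types):

```cpp
#include <cstdint>
#include <unordered_set>

// Sketch: derive the counter from the set size once, instead of
// incrementing it on every loop iteration.
struct Stats
{
    int num_threads = 0;
};

void CountSelectedThreads(Stats& stats,
                          const std::unordered_set<std::uintptr_t>& selected)
{
    stats.num_threads += static_cast<int>(selected.size());
}
```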

Comment on lines +899 to +913
unsigned int GetSleepTime(const ContinuousProfiler* const prof)
{
    // Assumption is continuous profiling interval is bigger and multiple of selective sampling interval.
    // If both are enabled, we need to wake every smaller interval.
    if (prof->selectedThreadsSamplingInterval.has_value())
    {
        return prof->selectedThreadsSamplingInterval.value();
    }
    if (prof->threadSamplingInterval.has_value())
    {
        return prof->threadSamplingInterval.value();
    }
    // Shouldn't ever happen.
    return 0;
}
Contributor

The comment assumes the continuous interval is larger and a multiple of the selective one. Would it make sense to return the smaller of the two intervals instead, to avoid relying on this assumption?

Also, instead of returning 0 in the “Shouldn’t ever happen” case, maybe consider throwing an exception, adding an assert, or using std::unreachable() to catch unexpected states earlier. What do you think?
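The first suggestion, returning the smaller of the two enabled intervals, might look like this (a sketch with free-function parameters instead of the profiler's fields; names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Sketch of the reviewer's suggestion: return the smaller of the enabled
// intervals, so the function no longer relies on the "larger is a multiple
// of smaller" assumption.
unsigned int GetSleepTimeMs(std::optional<uint32_t> continuous_ms,
                            std::optional<uint32_t> selective_ms)
{
    if (continuous_ms && selective_ms)
    {
        return std::min(*continuous_ms, *selective_ms);
    }
    if (selective_ms)
    {
        return *selective_ms;
    }
    // 0 here corresponds to the "shouldn't ever happen" branch; managed-side
    // validation is expected to prevent it.
    return continuous_ms.value_or(0);
}
```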

Contributor Author

The assumptions described here are validated on the managed side when the plugin method is called.
I find the current version more readable, but I can change it as suggested if others agree.

As for the case when 0 is returned: this situation won't be allowed by the managed-side validation (an invalid sampling-interval configuration will result in sampling not being started at all).
This function is called only on the sampling thread; if 0 is ever returned, a warning is logged and the thread exits.
