
Allow for frequent sampling of selected threads #4225


Open
wants to merge 12 commits into base: main

Conversation

lachmatt
Contributor

@lachmatt lachmatt commented May 26, 2025

Why & What

PoC for #4227

Modifies the continuous profiler's code to allow for frequent sampling of selected threads.

Samples for selected threads are accumulated on the native side in a new buffer, similar to the one used for allocation samples.
The buffer is periodically read and reset from the managed side, and the samples are exported by a thread shared with the continuous profiler.

Selective sampling is implemented by adding another mode to the continuous profiler sampler.
Both sampling modes can be enabled independently, or at the same time. When both are enabled, the restriction is that the larger sampling interval (i.e. continuous sampling) must be a multiple of the smaller sampling interval (i.e. selective sampling). This simplifies the implementation of sampling at different frequencies.
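The "larger interval is a multiple of the smaller one" restriction means a single timer loop can drive both modes: the sampler wakes at the selective interval and takes a continuous-profiling pass every N-th tick. A minimal sketch of that idea (the names here are illustrative, not the PR's actual identifiers):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: with the continuous interval constrained to be a
// multiple of the selective interval, one wake-up loop can drive both modes.
struct SamplerTick
{
    uint32_t selective_interval_ms;  // smaller interval, drives the wake-ups
    uint32_t continuous_interval_ms; // must be a multiple of the above
    uint64_t tick = 0;

    // Returns true when the current wake-up should also take a
    // continuous-profiling pass over all threads.
    bool ContinuousPassDue()
    {
        const uint64_t ratio = continuous_interval_ms / selective_interval_ms;
        return tick++ % ratio == 0;
    }
};
```

With a 100 ms selective interval and a 500 ms continuous interval, every fifth wake-up performs a full continuous pass, and every wake-up samples the selected threads.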

The functionality added in this PR would allow plugins to implement trace-centric sampling.
The remaining functionality for such a feature could be implemented in plugins, e.g. by customizing the TracerProvider created by the autoinstrumentation using existing hooks.

The open question is how best to expose to plugins the API for starting/stopping sampling of a given thread. For now, the methods are called using reflection.

TODO in this PR:

  • Implement a test that verifies the behavior when the 2 sampling modes (continuous, selective) are enabled at the same time (currently verified manually).

Preferably in a separate PR:

For the sake of simplicity, for now I took some shortcuts in the implementation:

  • when both sampling modes are enabled by a plugin, the continuous sampling interval is expected to be larger than, and a multiple of, the selective sampling interval. If that's not the case, sampling initialization fails.
  • when both sampling modes are enabled and the export interval or timeout differ, the higher values are used.

The higher timeout is probably preferable, as it is more permissive.
The higher export interval might also be preferable: for selected-thread samples it means more samples read in a single batch, possibly more samples exported in a single request, possibly more efficient encoding, and possibly less need for additional batching in the exporter.
On the other hand, less frequent reads might result in some data loss (e.g. when native buffers fill up while waiting to be read and exported). Let me know if you think these decisions should be changed.
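The first shortcut implies a cross-check of the two intervals at initialization. A sketch of what that validation could look like (illustrative names; the PR performs this check on the managed side, in C#, so this is only an analogue of the logic):

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>

// Illustrative sketch of the interval validation described above: if both
// modes are enabled, the continuous interval must be larger than, and a
// multiple of, the selective interval; otherwise initialization fails.
void ValidateIntervals(std::optional<uint32_t> continuous_ms,
                       std::optional<uint32_t> selective_ms)
{
    if (!continuous_ms || !selective_ms)
    {
        return; // only one mode enabled, nothing to cross-check
    }
    if (*selective_ms == 0 || *continuous_ms <= *selective_ms ||
        *continuous_ms % *selective_ms != 0)
    {
        throw std::invalid_argument(
            "continuous interval must be a larger multiple of the selective interval");
    }
}
```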

Tests

Included in the PR: a simple test that verifies that the number of captured samples and the time difference between consecutive samples are in the expected range.

Checklist

  • CHANGELOG.md is updated.
  • Documentation is updated.
  • New features are covered by tests.

@lachmatt lachmatt changed the title [WIP] allow for frequent sampling of selected threads Allow for frequent sampling of selected threads Jun 11, 2025
@lachmatt lachmatt marked this pull request as ready for review June 11, 2025 14:47
@lachmatt lachmatt requested a review from a team as a code owner June 11, 2025 14:47
@RassK
Contributor

RassK commented Jun 19, 2025

Some thoughts:

  • Can you update the status: which TODOs are you going to address in this PR?
  • Can you add docs (overview, a visual schema of how it tracks the thread)?
  • Use a low rate by default, but let users choose one that works for long traces and the overhead.

@lachmatt
Contributor Author

Some thoughts:

  • Can you update the status: which TODOs are you going to address in this PR?

Done.

  • Can you add docs (overview, a visual schema of how it tracks the thread)?

I briefly described it here. I will create a new draft PR with the implementation in a plugin and link to it here.

Comment on lines +84 to +86
static std::mutex selective_sampling_buffer_lock = std::mutex();
static std::vector<unsigned char>* selective_sampling_buffer = new std::vector<unsigned char>();

Contributor

Is there a specific reason for using a pointer and allocating selective_sampling_buffer with new?
Using a plain std::vector<unsigned char> instead of a pointer might simplify the code.
Using new and delete introduces extra complexity and manual memory management.
Would it make sense to simplify this by using a local object?

Contributor Author

The main reason for the current approach was consistency with the existing code.
If we were to use a different approach, I think it would be beneficial to modify the code related to allocation buffers as well, and I didn't want to do that in this PR.

Comment on lines +213 to +237
// TODO: deduplicate
int32_t SelectiveSamplingConsumeAndReplaceBuffer(int32_t len, unsigned char* buf)
{
    if (len <= 0 || buf == nullptr)
    {
        trace::Logger::Warn("Unexpected 0/null buffer to SelectiveSamplingConsumeAndReplaceBuffer");
        return 0;
    }
    std::vector<unsigned char>* to_use = nullptr;
    {
        std::lock_guard<std::mutex> guard(selective_sampling_buffer_lock);
        to_use = selective_sampling_buffer;
        selective_sampling_buffer = new std::vector<unsigned char>();
        selective_sampling_buffer->reserve(kSamplesBufferDefaultSize);
    }
    if (to_use == nullptr)
    {
        return 0;
    }
    const size_t to_use_len = std::min(to_use->size(), static_cast<size_t>(len));
    memcpy(buf, to_use->data(), to_use_len);
    delete to_use;
    return static_cast<int32_t>(to_use_len);
}

Contributor

I noticed the buffer is replaced by allocating a new vector and deleting the old one each time.
Would it make sense to simplify this by swapping the buffer with a local object, then using clear() and resize() on the original vector to avoid frequent heap allocations?
This could avoid the overhead of frequent heap allocations, which using new and delete currently introduces.
Was there a particular reason for the current approach?
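The swap-and-reuse variant suggested above might look roughly like this (a sketch with illustrative names, not the PR's code; the scratch vector keeps its capacity between reads, so steady-state reads allocate nothing):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Sketch of the reviewer's suggestion: keep a plain vector (no new/delete)
// and swap its contents into a reusable scratch vector under the lock.
static std::mutex g_buffer_lock;
static std::vector<unsigned char> g_sample_buffer;

int32_t ConsumeSamples(unsigned char* buf, int32_t len)
{
    if (len <= 0 || buf == nullptr)
    {
        return 0;
    }
    static std::vector<unsigned char> scratch; // reused; retains its capacity
    scratch.clear();
    {
        std::lock_guard<std::mutex> guard(g_buffer_lock);
        std::swap(scratch, g_sample_buffer); // O(1), no heap allocation
    }
    const size_t n = std::min(scratch.size(), static_cast<size_t>(len));
    std::memcpy(buf, scratch.data(), n);
    return static_cast<int32_t>(n);
}
```

After the swap, the producer-side vector carries the scratch vector's old capacity, so appends on the sampling thread also avoid reallocating once a steady state is reached.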

Contributor Author

See #4225 (comment).
I'd expect the overhead to be acceptable, assuming the managed-side code consumes the buffer rarely (e.g. every 0.5 s).


The current code avoids copying inside the lock, but heap allocations are far more expensive than copying. I recommend copying the sample buffer to buf directly within the lock and avoiding allocations altogether—copying is fast and essentially lock free, while heap allocations are not.
Even better, if this is a single-consumer single-producer (SCSP) scenario, we could use a ring buffer for a completely lock-free solution. Please share your thoughts.

Contributor Author

@lachmatt lachmatt Jul 9, 2025

The current code avoids copying inside the lock, but heap allocations are far more expensive than copying. I recommend copying the sample buffer to buf directly within the lock and avoiding allocations altogether—copying is fast and essentially lock free, while heap allocations are not.

Sounds like a good idea.
As mentioned above, considering this is expected to be called somewhat rarely by the managed thread that reads the buffer, I think this could be changed in a follow-up PR. Let me know if you disagree and consider this a blocker for merging this PR.

Even better, if this is a single-consumer single-producer (SCSP) scenario, we could use a ring buffer for a completely lock-free solution. Please share your thoughts.

I'd prefer to investigate such optimizations in a separate PR.

Contributor

Let's link an issue to these comments if a separate follow-up is expected 👍

Contributor Author

Here is my revised suggestion:
• Defer lock-free ring buffer optimization for now.
• Perform buffer copying under the lock.
• Do not allocate a new vector for every read; reuse the existing buffer. Allocate it once as part of initialization.
• Use std::unique_ptr<unsigned char[]> for raw buffer management instead of std::vector, which is unnecessary for this use case.
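Taken together, the bullets above might be sketched as follows (illustrative names and a hypothetical wrapper class, not the PR's code): a raw buffer allocated once at initialization via std::unique_ptr<unsigned char[]>, with the copy performed under the lock and no per-read allocation.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <memory>
#include <mutex>

// Sketch of the revised suggestion: fixed-capacity raw buffer, allocated
// once; copying happens under the lock; no new/delete per read.
class SampleBuffer
{
public:
    explicit SampleBuffer(size_t capacity)
        : capacity_(capacity), data_(std::make_unique<unsigned char[]>(capacity))
    {
    }

    // Producer side: append bytes, dropping anything that does not fit.
    void Append(const unsigned char* src, size_t n)
    {
        std::lock_guard<std::mutex> guard(lock_);
        const size_t to_copy = std::min(n, capacity_ - size_);
        std::memcpy(data_.get() + size_, src, to_copy);
        size_ += to_copy;
    }

    // Consumer side: copy out under the lock and reset, no allocation.
    int32_t ConsumeInto(unsigned char* dst, int32_t len)
    {
        if (len <= 0 || dst == nullptr)
        {
            return 0;
        }
        std::lock_guard<std::mutex> guard(lock_);
        const size_t n = std::min(size_, static_cast<size_t>(len));
        std::memcpy(dst, data_.get(), n);
        size_ = 0;
        return static_cast<int32_t>(n);
    }

private:
    std::mutex lock_;
    size_t capacity_;
    size_t size_ = 0;
    std::unique_ptr<unsigned char[]> data_;
};
```

The trade-off relative to the current diff is that the lock is held for the duration of the memcpy, which the earlier comment argues is still cheaper than per-read heap allocation.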

Comment on lines +777 to +804

localBuf.StartSelectedThreadsBatch();
for (auto thread_id : selected_sampling_threads_set)
{
    prof->stats_.num_threads++;
    thread_span_context spanContext = thread_span_context_map[thread_id];
    auto found = prof->managed_tid_to_state_.find(thread_id);

    if (found != prof->managed_tid_to_state_.end() && found->second != nullptr)
    {
        localBuf.StartSampleForSelectedThread(found->second, spanContext);
    }
    else
    {
        auto unknown = ThreadState();
        localBuf.StartSampleForSelectedThread(&unknown, spanContext);
    }

    HRESULT snapshotHr =
        info12->DoStackSnapshot(thread_id, &FrameCallback, COR_PRF_SNAPSHOT_DEFAULT, &dssp, nullptr, 0);
    if (FAILED(snapshotHr))
    {
        trace::Logger::Debug("DoStackSnapshot failed. HRESULT=0x", std::setfill('0'), std::setw(8), std::hex,
                             snapshotHr);
    }
    localBuf.EndSample();
}
localBuf.EndSelectedThreadsBatch();
Contributor

The increment of prof->stats_.num_threads inside the loop is redundant, since the loop iterates exactly over selected_sampling_threads_set. Instead of incrementing the counter on each iteration, it would be more efficient and clearer to set prof->stats_.num_threads once, either before or after the loop, using the size of selected_sampling_threads_set. This avoids unnecessary increments.
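The suggested change amounts to a one-line replacement of the per-iteration increment. A self-contained sketch (Stats and the set's element type are illustrative stand-ins, not the PR's actual types):

```cpp
#include <cstdint>
#include <unordered_set>

// Sketch: derive the counter from the set size once, instead of
// incrementing it on every loop iteration.
struct Stats
{
    int num_threads = 0;
};

void CountSelectedThreads(Stats& stats,
                          const std::unordered_set<std::uintptr_t>& selected)
{
    stats.num_threads += static_cast<int>(selected.size());
}
```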

Comment on lines +899 to +913
unsigned int GetSleepTime(const ContinuousProfiler* const prof)
{
    // Assumption is continuous profiling interval is bigger and multiple of selective sampling interval.
    // If both are enabled, we need to wake every smaller interval.
    if (prof->selectedThreadsSamplingInterval.has_value())
    {
        return prof->selectedThreadsSamplingInterval.value();
    }
    if (prof->threadSamplingInterval.has_value())
    {
        return prof->threadSamplingInterval.value();
    }
    // Shouldn't ever happen.
    return 0;
}
Contributor

The comment assumes the continuous interval is larger and a multiple of the selective one. Would it make sense to return the smaller of the two intervals instead, to avoid relying on this assumption?

Also, instead of returning 0 in the “Shouldn’t ever happen” case, maybe consider throwing an exception, adding an assert, or using std::unreachable() to catch unexpected states earlier. What do you think?
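The first suggestion, returning the smaller of the two enabled intervals, might look like this (a sketch with free-function parameters instead of the profiler's fields; names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Sketch of the reviewer's suggestion: return the smaller of the enabled
// intervals, so the function no longer relies on the "larger is a multiple
// of smaller" assumption.
unsigned int GetSleepTimeMs(std::optional<uint32_t> continuous_ms,
                            std::optional<uint32_t> selective_ms)
{
    if (continuous_ms && selective_ms)
    {
        return std::min(*continuous_ms, *selective_ms);
    }
    if (selective_ms)
    {
        return *selective_ms;
    }
    // 0 here corresponds to the "shouldn't ever happen" branch; managed-side
    // validation is expected to prevent it.
    return continuous_ms.value_or(0);
}
```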

Contributor Author

The assumptions described here are validated on the managed side when the plugin method is called.
I find the current version more readable, but I can change it as suggested if others agree.

As for the case when 0 is returned: this situation won't be allowed by the managed-side validation (an invalid sampling-interval configuration will result in sampling not being started at all).
This function is called only on the sampling thread; if 0 is ever returned, a warning is logged and the thread exits.
