Support asynchronous stack walk on Linux/macOS #13171

Closed
@k15tfu

Description

Originally posted by @davmason in https://github.com/dotnet/coreclr/issues/25676#issuecomment-512555568

@k15tfu In some other issue I mentioned that the only supported DoStackSnapshot scenarios are either after calling ICorProfilerInfo10::SuspendRuntime or doing a synchronous stack walk on the same thread.

The docs that you are referring to were written for Windows a long time ago, and we don't have the support for asynchronous stack walking yet on Linux. It is something I would like to get working in a future release, but for now I don't think there is a solution that allows this to work as you are attempting on Linux.

[...]

If you can tell us why you are trying to get a stack from a signal handler, we might be able to provide a work around that avoids the problem.

Originally posted by @k15tfu in https://github.com/dotnet/coreclr/issues/25676#issuecomment-513143826

@davmason

[...]

or doing a synchronous stack walk on the same thread

Did you mean doing it not only on the same thread, but also from within one of the profiler callbacks? Just to be on the same page =)

and we don't have the support for asynchronous stack walking yet on Linux.

Does it mean it may not work, or it definitely doesn't work on Linux? Because I don't see other critical issues at the moment (except one more issue I'll report a bit later). Do you know other issues I will face? =)

If you can tell us why you are trying to get a stack from a signal handler, we might be able to provide a work around that avoids the problem.

I am writing a sampling profiler. On Windows I use SuspendThread() and GetThreadContext() and then unwind the stack. But on Linux I can neither suspend a thread nor get another thread's context, so such profilers are usually implemented on Linux using signals. Honestly, I would rather not use ICorProfilerInfo10::SuspendRuntime(), because I'm sure it will affect the whole app much more significantly than signal handlers.

[...]

Originally posted by @janvorli in https://github.com/dotnet/coreclr/issues/25676#issuecomment-513160482

Btw, it seems to me that on Linux, you can simulate the SuspendThread / GetThreadContext using a signal handler in an async signal safe manner. Basically the way I've mentioned in a response to another issue that you've created.

  • From one thread, send a signal to all other threads
  • In the signal handler for that signal, save the signal context in a way that allows the thread initiating the "suspension" to read it later (e.g., keep a lock-free global stack of references to the contexts, with a singly linked list as the underlying structure), and then synchronize with the initiating thread using a pipe or some other async-signal-safe mechanism (if using a pipe, send a byte into the pipe and then read from it; the initiating thread writes into the pipe when it is done walking the stacks)
  • When the initiating thread has finished reading from the pipes of all the threads, it knows all threads are waiting. So it can read the contexts they have captured, and it is safe to start walking their stacks.
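
The steps above can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not code from the runtime or from any profiler: every name in it (`ctx_node`, `suspend_handler`, `suspend_threads`, `resume_threads`, `demo_suspend_two_threads`) is hypothetical, `SIGUSR1` is an arbitrary choice of signal, and two shared pipes stand in for the per-thread pipes described above. It only shows how to park threads and collect their signal contexts; it does not unwind anything and does not address the stub issue discussed in this thread.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <unistd.h>

/* All names here are hypothetical; SIGUSR1 is an arbitrary choice. */
typedef struct ctx_node {
    ucontext_t *uc;              /* signal context of one parked thread */
    struct ctx_node *next;
} ctx_node;

static _Atomic(ctx_node *) g_contexts;  /* head of the lock-free stack */
static int g_parked_pipe[2];            /* handlers -> initiator: "I am parked" */
static int g_resume_pipe[2];            /* initiator -> handlers: "continue" */
static atomic_int g_stop;               /* demo only: tells workers to exit */

static void suspend_handler(int sig, siginfo_t *info, void *uc) {
    (void)sig; (void)info;
    int saved_errno = errno;
    ctx_node node;                      /* lives on this stack while parked */
    node.uc = (ucontext_t *)uc;
    node.next = atomic_load(&g_contexts);
    /* Lock-free push; on failure node.next is refreshed with the current head. */
    while (!atomic_compare_exchange_weak(&g_contexts, &node.next, &node)) {}
    char b = 1;
    while (write(g_parked_pipe[1], &b, 1) < 0 && errno == EINTR) {}
    while (read(g_resume_pipe[0], &b, 1) < 0 && errno == EINTR) {}
    errno = saved_errno;
}

static int suspend_init(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = suspend_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (pipe(g_parked_pipe) != 0 || pipe(g_resume_pipe) != 0) return -1;
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Signals each thread, waits until all of them are parked, returns the count. */
static int suspend_threads(const pthread_t *threads, int n) {
    int signaled = 0;
    char b;
    for (int i = 0; i < n; i++)
        if (pthread_kill(threads[i], SIGUSR1) == 0) signaled++;
    for (int i = 0; i < signaled; i++)
        while (read(g_parked_pipe[0], &b, 1) < 0 && errno == EINTR) {}
    return signaled;
}

/* Clears the list first (its nodes live on handler stacks), then releases. */
static void resume_threads(int parked) {
    char b = 0;
    atomic_store(&g_contexts, NULL);
    for (int i = 0; i < parked; i++)
        while (write(g_resume_pipe[1], &b, 1) < 0 && errno == EINTR) {}
}

static void *spin_worker(void *arg) {
    (void)arg;
    while (!atomic_load(&g_stop)) usleep(1000);
    return NULL;
}

/* Demo: park two workers, count the captured contexts, then resume them. */
int demo_suspend_two_threads(void) {
    pthread_t t[2];
    int seen = 0;
    if (suspend_init() != 0) return -1;
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, spin_worker, NULL);
    int parked = suspend_threads(t, 2);
    for (ctx_node *n = atomic_load(&g_contexts); n; n = n->next)
        if (n->uc != NULL) seen++;      /* a real profiler would unwind from n->uc */
    resume_threads(parked);
    atomic_store(&g_stop, 1);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return seen;
}
```

The list nodes live on the parked threads' signal stacks, so the initiating thread has to finish with the contexts and clear the list before it writes to the resume pipe. The handler restricts itself to `write`, `read`, and atomic operations, on the assumption that the pointer-sized atomics are lock-free on the target platform.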

Originally posted by @davmason in https://github.com/dotnet/coreclr/issues/25676#issuecomment-513405905

or doing a synchronous stack walk on the same thread

Did you mean doing it not only on the same thread, but also from within one of the profiler callbacks? Just to be on the same page =)

It should be done within the profiler callback to be safe. It would generally work anywhere, but you might run into the same stub issue if not within a profiler callback.

and we don't have the support for asynchronous stack walking yet on Linux.

Does it mean it may not work, or it definitely doesn't work on Linux? Because I don't see other critical issues at the moment (except one more issue I'll report a bit later). Do you know other issues I will face? =)

It's completely untested on our end. This might be the only issue or it might not be. I don't know of any issues already, but I would not be surprised to find out there are more.

I am writing a sampling profiler. On Windows I use SuspendThread() and GetThreadContext() and then unwind the stack. But on Linux I can neither suspend a thread nor get another thread's context, so such profilers are usually implemented on Linux using signals. Honestly, I would rather not use ICorProfilerInfo10::SuspendRuntime(), because I'm sure it will affect the whole app much more significantly than signal handlers.

ICorProfilerInfo10::SuspendRuntime was intended to be used by sampling profilers. It should not add much overhead at all; we use a very similar approach internally when doing sampling for EventPipe (see https://github.com/dotnet/coreclr/blob/master/src/vm/sampleprofiler.cpp#L169-L208). ICorProfilerInfo10 is basically just a shim that calls ThreadSuspend::SuspendEE. If you have evidence that this approach is too much overhead I would love to hear it.

Originally posted by @k15tfu in https://github.com/dotnet/coreclr/issues/25676#issuecomment-514527091

@janvorli Yes, we surely can emulate it like this. I remember that we have already discussed it in another issue, thanks again for that =) But it doesn't help me avoid this problem, because the root cause is that I use the context taken from the signal handler (I simply cannot get the context any other way), and the thread can be stopped at an arbitrary moment.

[...]

@davmason

In some other issue I mentioned that the only supported DoStackSnapshot scenarios are either after calling ICorProfilerInfo10::SuspendRuntime or doing a synchronous stack walk on the same thread.

I'm not sure what "the only supported scenarios" means. The others are not supported yet (but you are ready to work on it), or they are prohibited and will never be supported on Linux?

It's completely untested on our end. This might be the only issue or it might not be. I don't know of any issues already, but I would not be surprised to find out there are more.

Okay, I get you, I'll let you know if I find anything else.

ICorProfilerInfo10 is basically just a shim that calls ThreadSuspend::SuspendEE. If you have evidence that this approach is too much overhead I would love to hear it.

We do not stop all managed threads at the same time. On the contrary, we stop each thread in turn, take a snapshot, resume it, and then repeat for the remaining threads. So from this point of view, using OS-level signals to temporarily "interrupt" a thread (as the system does all the time during scheduling) and ask it to do some work for us seems more effective than stopping the whole CLR runtime.

Originally posted by @davmason in https://github.com/dotnet/coreclr/issues/25676#issuecomment-515629550

I'm not sure what "the only supported scenarios" means. The others are not supported yet (but you are ready to work on it), or they are prohibited and will never be supported on Linux?

3.0 is past the point where I can introduce new features, so all I can say at this point is that for 3.0 the only scenarios that we have tested and can support fully are the ones I mentioned. Other ones (like the signal scenario you are attempting) will need some level of code changes and testing, and will have to be evaluated against everything else for future releases. Planning for 5.0 hasn't happened yet, so I do not intend to say they will be forbidden, but I can't guarantee that they will be fully supported either. It seems like a reasonable feature request; whether or not it happens will depend on the cost and what other work happens in the 5.0 timeframe.

We do not stop all managed threads at the same time. On the contrary, we stop each thread in turn, take a snapshot, resume it, and then repeat for the remaining threads. So from this point of view, using OS-level signals to temporarily "interrupt" a thread (as the system does all the time during scheduling) and ask it to do some work for us seems more effective than stopping the whole CLR runtime.

It probably is more performant to do sampling the way you are attempting, but you have to weigh that against the difficulty of the work and the number of workarounds necessary. I am trying to be as honest as possible that you will run into issues and we won't be able to work around some of them.

Labels

area-Diagnostics-coreclr, enhancement (Product code improvement that does NOT require public API changes/additions)