Add code to walk the list of thread states #13

eklitzke · 2016-08-25T01:36:11Z

This is how you do it:

each interpreter has a field called tstate_head which is the head of a linked list of thread states
each thread state has a field called next which is a pointer to the next thread state
the last thread state has a null pointer for next

Likewise:

each thread state has a back reference to the interpreter state in a field called interp
there's a single linked list of interpreters whose head is the static symbol interp_head

In the single-threaded case there is one interpreter and one thread state.

In the generic case the way you enumerate the thread states is:

locate _PyThreadState_Current
follow interp up to the interpreter
follow tstate_head to the first thread state (which will always be the same as _PyThreadState_Current for a single-threaded program, but could be different for a multi-threaded program)
follow the next field until NULL is encountered

It should be fine to ignore the multiple interpreter thing, no one really uses that feature nowadays.

The text was updated successfully, but these errors were encountered:

ogrisel · 2016-10-18T12:25:15Z

Related: it would be great to have a way to generate one flamegraph svg file per active thread in a multithreaded python program.

eklitzke · 2016-10-31T05:15:10Z

I gave a few more details about this in #41 .

This is probably the next major feature I'll add to Pyflame (after fixing #34, which has a PR by me out). However, I'm going to spend most of November traveling. If someone wants to surprise me with a PR to fix this in the interim that would be great, otherwise I'll probably get some time in December.

batterseapower · 2016-11-15T22:57:22Z

Did some initial work on this at batterseapower@36ba9e1

Right now it doesn't seem to be able to find interp_head, which is necessary if we want to do anything interesting (PyThreadState_Current is NULL if the GIL is not held). This should be fixable because e.g. nm -a knows about interp_head.

batterseapower · 2016-11-15T23:44:32Z

Fixed that, but PyThreadState.frame seems to always be null in the cases I looked at.

jamespic · 2017-02-13T18:10:12Z

@batterseapower That makes sense. PyThreadState_Current is only non-null when something holds the GIL. Threads set it when they acquire the GIL, based on a value held in a thread-local. If I get some time, I'll look at the feasibility of traversing thread-locals.

jamespic · 2017-02-19T14:23:40Z

I've got some further work on jamespic@db3feaa.

@batterseapower I think you were actually on the right lines. What you were doing should have worked. I thought I was getting null frames, until I realised that I was running against a debug build (since normal builds don't expose interp_head), which had a different PyObject layout. There was also a bug in your "is current" code that was throwing away perfectly good frames. With a couple of tweaks I was able to get your code profiling non-GIL frames.

Meanwhile, I also managed to get it profiling non-GIL frames on non-debug builds (builds without interp_head exposed), at least on AMD64. Whilst the variable itself isn't exposed, the PyInterpreterState_Head function is. Using some evil hacks stolen from https://github.com/eklitzke/ptrace-call-userspace (thanks @eklitzke) I was able to get it calling this function via ptrace.

My apologies for the scrappiness of the code - it's literally the first thing I've ever written in C++.

eklitzke · 2017-02-28T01:48:19Z

I have a version that copies the code into a new page: 2ac20de . This still crashes the Python process, but it seems like it might crash less frequently? This version also leaks a page per pyflame invocation (it's just a POC). Hopefully I can take a look again tomorrow, if not you can try this approach if you like.

jamespic · 2017-02-28T08:33:00Z

Leaking a page isn't too terrible if we only do it once per invocation - and I don't think there's anywhere we can safely munmap it from. Looking at your code, I think it should be thread safe (apologies for the earlier comment rambling about SIGSTOP - I hadn't grokked your code yet, and wasn't aware you could safely change `rip` in ptraced code, so thought we'd still have to inject a `JMP` somewhere). I'll try and find some time to investigate later today.

jamespic · 2017-02-28T22:43:34Z

I've managed to reproduce locally. It's still not thread safe. The initial memory overwrite for mmap is the issue now - there's nothing to stop another thread hitting the dirty code, so if code gets overwritten in a hot library (when I reproduced the issue, it was in libc, in do_futex_wait), there's a high probability of something else hitting it.

I'll look at stopping threads whilst code is in an unsafe state.

eklitzke · 2017-02-28T23:25:42Z

There are a couple of answers here on how to pause all threads: http://stackoverflow.com/questions/18577956/how-to-use-ptrace-to-get-a-consistent-view-of-multiple-threads

Not leaking a page is pretty easy -- you just need to use the munmap(2) syscall. Actually there are some tricks to try to keep the page allocated and then find it later, but for an initial implementation mmap(2) followed by munmap(2) later on is probably fine.

jamespic · 2017-03-01T00:22:58Z

I added thread pausing in jamespic@e038c99, which seems to have solved the crashing issue for me.

eklitzke · 2017-03-01T03:05:59Z

This is looking good to me, and seems stable on my system. You might want to rung clang-format again, there was one style violation I saw in that commit which is that the style for range-based for loops should be for (auto foo : blah) not for (auto foo: blah). (Assuming clang-format knows about that.)

If you submit another PR I'll do another look over the whole diff.

Before merging:

Consider adding a note to the README about threading support. (If not, I can add it.)
Ideally the commits should be squashed to be somewhat logical. Since I first saw the PR, I already cherry-picked 189ce18 onto master. Preferably all of @batterseapower's commits would be squashed to one (or a few) logical changes, and same with yours. Basically, the branch should at least be bisectable.

Walk threads and handle non-GIL code - Implements #13

eklitzke · 2017-03-03T23:23:21Z

Thanks @jamespic for fixing this!

fillest mentioned this issue Oct 30, 2016

"idle" detalization #41

Closed

eklitzke added this to the Pyflame2 milestone Oct 31, 2016

jamespic mentioned this issue Feb 25, 2017

Profile all threads, not just the one with the GIL #52

Closed

eklitzke added a commit that referenced this issue Mar 3, 2017

Merge pull request #53 from jamespic/rebase

28c2aa7

Walk threads and handle non-GIL code - Implements #13

eklitzke closed this as completed Mar 3, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add code to walk the list of thread states #13

Add code to walk the list of thread states #13

eklitzke commented Aug 25, 2016

ogrisel commented Oct 18, 2016 •

edited

eklitzke commented Oct 31, 2016

batterseapower commented Nov 15, 2016

batterseapower commented Nov 15, 2016

jamespic commented Feb 13, 2017

jamespic commented Feb 19, 2017 •

edited

eklitzke commented Feb 28, 2017

jamespic commented Feb 28, 2017 via email •

edited

jamespic commented Feb 28, 2017

eklitzke commented Feb 28, 2017

jamespic commented Mar 1, 2017

eklitzke commented Mar 1, 2017

eklitzke commented Mar 3, 2017

Add code to walk the list of thread states #13

Add code to walk the list of thread states #13

Comments

eklitzke commented Aug 25, 2016

ogrisel commented Oct 18, 2016 • edited

eklitzke commented Oct 31, 2016

batterseapower commented Nov 15, 2016

batterseapower commented Nov 15, 2016

jamespic commented Feb 13, 2017

jamespic commented Feb 19, 2017 • edited

eklitzke commented Feb 28, 2017

jamespic commented Feb 28, 2017 via email • edited

jamespic commented Feb 28, 2017

eklitzke commented Feb 28, 2017

jamespic commented Mar 1, 2017

eklitzke commented Mar 1, 2017

eklitzke commented Mar 3, 2017

ogrisel commented Oct 18, 2016 •

edited

jamespic commented Feb 19, 2017 •

edited

jamespic commented Feb 28, 2017 via email •

edited