Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix async Python functors invoking from multiple C++ threads (#1587) #1595

Merged
merged 2 commits into from
Jun 11, 2019

Conversation

uentity
Copy link
Contributor

@uentity uentity commented Nov 6, 2018

Ensure GIL is released after functor destructor finished (not only
during functor execution as in previous implementation).

This fixes #1587

@wjakob
Copy link
Member

wjakob commented Nov 9, 2018

I'm a bit confused by the change, can you clarify? Specifically, I'm wondering if/why it is safe to std::move() the function within the lambda function. It seems to me that for lambdas with state, this operation will only work the first time. Afterwards, the lambda closure object will contain an invalid function object, which will potentially trigger undefined behavior the next time the function is called. What am I missing?

@wjakob
Copy link
Member

wjakob commented Nov 9, 2018

FWIW what's also missing here is a testcase that would have lead to breakage in the previous implemenation.

@uentity
Copy link
Contributor Author

uentity commented Nov 9, 2018

@wjakob Your first comment is absolutely right, I missed that point, s.t. all functors became single-shot (can be called only once and fail/crash 2nd time).

Fixed that in next commit. Had to introduce local "function handle" struct with custom destructor. Seems that my code is working now.

Considering testcase: I can write a Python test with statefull lambda, and it should crash Python (and I suppose we can't catch/prevent it in testing code) without this PR and work fine with it.

@uentity uentity force-pushed the fix-async-pycb branch 5 times, most recently from e619e19 to e3375b6 Compare November 9, 2018 19:51
@uentity
Copy link
Contributor Author

uentity commented Nov 9, 2018

@wjakob I've made a test case and squashed functional changes into a single commit.
Please, take a look.

[NOTE] test leads to segfault (on my machine) if run without latest commit (with previous functional version).

@uentity
Copy link
Contributor Author

uentity commented Nov 14, 2018

@wjakob there seems to be a problem with gil_scoped_acquire.
If async callbacks are launched from separate (non-main) Python thread, then following code will lead to deadlock:

~func_handle() {
	gil_scoped_acquire acq;
	function kill_f(std::move(f));
}

If I replace it with conventional Python API for acquiring GIL (straight from documentation):

~func_handle() {
	PyGILState_STATE gstate;
	gstate = PyGILState_Ensure();
	{
		function kill_f(std::move(f));
	}
	PyGILState_Release(gstate);
}

then everything works without issues.
Consider the latest couple of commits.

@davidhewitt
Copy link
Contributor

Hello @wjakob @uentity, I've authored a separate PR #1556 which resolves the deadlock problem with gil_scoped_acquire. You might wish to check that out!

Below is my summary of what I understand of the problems these PRs are trying to solve.

std::function<...> extracted from pybind11 typically is some signature like std::function<int(int)> which has no obvious association to python types and can easily be passed to portions of the codebase which are not aware to things like the GIL. People doing this then expect these objects to be thread safe, however:

  1. If C++ code spawns a thread and copies the function handle to give to that new thread, then in that new thread the function will correctly capture the GIL during calls but not during destruction. That is what this PR solves.
  2. If Python spawns a thread and calls a pybind11-defined function which takes one of these std::function types as an argument, then on the first construction of a gil_scoped_acquire object we get a deadlock. That is what Resolve threading issue with gil_scoped_acquire #1556 solves.

I would love to contribute in any way possible to help resolve these!

@uentity
Copy link
Contributor Author

uentity commented Dec 1, 2018

Hi @davidhewitt!
If you look at my last commit on this topic, then you can figure out that I was trying to remedy the consequences of problem you described in item 2. It's great that you've fixed gil_scoped_aquire in first place!

With you PR #1556 applied I can bring back my original implementation of func_handle destructor in this PR, i.e.:

~func_handle() {
	gil_scoped_acquire acq;
	function kill_f(std::move(f));
}

Very soon after I thought my issue #1587 is solved I've faced the issue of your PR #1556 and fixed it by replacing gil_scoped_acquire with Python GIL API calls (see above). But my fix isn't perfect, because if doesn't follow RAII (for example, destructor of functor f throws...).

BTW can you give any comment on comparative "heaviness" of capturing GIL solutions I noted earlier? If I look at gil_scoped_acquire implementation, I have a (deceptive?) feeling that it is "heavy" :-)

@davidhewitt
Copy link
Contributor

I understand gil_scoped_acquire is slightly more complicated because it is designed to allow python to be migrated to a different OS thread. I think because of this it's probably advisable to use gil_scoped_acquire everywhere. Plus RAII :)

NB I see that #1211 has just been merged which was exactly the same thing that I was trying to solve. So I think that gil_scoped_acquire will work properly for you in this PR if you rebase.

func_handle(const func_handle&) = default;
~func_handle() {
gil_scoped_acquire acq;
function kill_f(std::move(f));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be more clear to explicitly reset the function handle and decrease the reference count: f.release().dec_ref();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Done.

Copy link
Contributor Author

@uentity uentity Dec 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidhewitt causes strange test failure on VS with Python 3.6 on x86 platform, so reverted back to original version.
Logs: https://ci.appveyor.com/project/wjakob/pybind11/builds/20894165

@uentity
Copy link
Contributor Author

uentity commented Dec 10, 2018

@wjakob style check is failing because of non-existent URL:

$ curl -fsSL ftp://ftp.stack.nl/pub/users/dimitri/doxygen-1.8.12.linux.bin.tar.gz | tar xz
curl: (6) Could not resolve host: ftp.stack.nl

@hammer498
Copy link

I was also affected by this and your patch has unblocked me. Thank you @uentity!

@uentity
Copy link
Contributor Author

uentity commented Jun 6, 2019

@hammer498 you're welcome =)

@wjakob
Copy link
Member

wjakob commented Jun 11, 2019

This looks good to me now.

@wjakob wjakob merged commit 8760a10 into pybind:master Jun 11, 2019
wjakob pushed a commit that referenced this pull request Jun 11, 2019
…1595)

* Fix async Python functors invoking from multiple C++ threads (#1587)

Ensure GIL is held during functor destruction.

* Add async Python callbacks test that runs in separate Python thread
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash when Python lambdas (with captured state) are called from multiple C++ threads (fix inside)
4 participants