Asynchronous execution for the Node-API generator #2654

Open
wants to merge 144 commits into master

Conversation

@mmomtchev (Contributor) commented Jul 17, 2023

There are still some unit tests that haven't been updated, but this is advanced enough to start discussing it.

This implementation is quite different from the first one, which relied on lambdas. This one is C++03-compatible and has a very elegant solution to the problem of copying the local variables: all local variables in the wrapper methods are now members of a class that inherits from Napi::AsyncWorker, which allows the existing code to be reused. The three phases of an async method - synchronous parsing of the input arguments on the main thread, asynchronous execution in a worker thread, and synchronous generation of the resulting JS objects back on the main thread - are implemented as class methods. The class deletes itself once the operation is complete. This way, all existing typemaps can be used in async mode without any modification.

Async mode can be enabled either with a %feature, or it can simply replace sync mode via the -async CLI option. In the latter case, all unit tests (so far) work without changing anything besides adding await and, where needed, making the JS strict-mode compliant, since top-level await requires it.
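
A minimal sketch of the Napi::AsyncWorker pattern described above (not the PR's generated code; ExampleWorker and compute() are hypothetical names):

    // Minimal sketch of the pattern, not the generated code.
    // ExampleWorker and compute() are hypothetical names.
    #include <napi.h>

    extern double compute(double x);   // the wrapped C++ function (assumed)

    class ExampleWorker : public Napi::AsyncWorker {
      // Former local variables of the wrapper are members, so they survive
      // across the three phases.
      double arg1;
      double result;
      Napi::Promise::Deferred deferred;

    public:
      ExampleWorker(Napi::Env env, double arg1)
          : Napi::AsyncWorker(env), arg1(arg1), result(0),
            deferred(Napi::Promise::Deferred::New(env)) {}

      Napi::Promise Promise() { return deferred.Promise(); }

      // Phase 2: worker thread, no V8 access allowed here.
      void Execute() override { result = compute(arg1); }

      // Phase 3: back on the main thread, build the JS result.
      void OnOK() override { deferred.Resolve(Napi::Number::New(Env(), result)); }

      void OnError(const Napi::Error &e) override { deferred.Reject(e.Value()); }
    };

    // Phase 1: synchronous argument parsing on the main thread.
    Napi::Value ComputeAsync(const Napi::CallbackInfo &info) {
      double arg1 = info[0].As<Napi::Number>().DoubleValue();
      ExampleWorker *worker = new ExampleWorker(info.Env(), arg1);
      worker->Queue();   // the worker deletes itself once the operation completes
      return worker->Promise();
    }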

Points that should be discussed:

  • Should locking be opt-in or opt-out? Currently I am leaning towards opt-out for both sync and async methods once async is enabled, but an alternative would be something akin to the INPUT typemaps that are used with %apply. My experience is that most C++ libraries expect that their major objects' methods are not reentrant - GDAL (Dataset), ImageMagick (Image) and Eigen (Matrix), for example - so locking will have to be implemented most of the time
  • How should the Async/Sync suffixes interact with renaming? Currently, renaming is applied before the suffix
  • Maybe the async unit testing should be a separate instance enabled by a CLI option, because NAPI unit testing is now exceptionally slow

@dot-asm

AFAIK this is the first SWIG PR to be developed entirely on solar power

@dot-asm (Contributor) commented Jul 23, 2023

Not that what I say would actually matter, but I'd like to argue that it would be appropriate to leave synchronous method names as they are and add a suffix to their asynchronous counterparts, as opposed to suffixing both names. I mean, given a Class::method(), .method on the JS side would invoke the synchronous method, while say .methodAsPromise would return a promise. The rationale is that if a module already exists somewhere, the current scripts won't need modifications if the maintainer chooses to extend the module with asynchronous methods.

What I [for one?] would also appreciate is the ability to pick methods for "asynchronization" one by one, because it's not a given that it would be appropriate to "async" all methods SWIG can get its hands on. The decision should be the developer's...

Now, it might be possible to achieve this as it stands now; I haven't dived into the implementation and have only reviewed the documentation modifications so far, and I can't really deduce it from the docs... It sounds like the options are "sync and async with suffixes", "everything async without suffix", "method of choice is sync/async without suffix"...

Doc/Manual/Javascript.html (review thread, resolved)

@dot-asm (Contributor) commented Jul 23, 2023

  • Should locking be opt-in or opt-out?

I don't quite follow. Is my understanding correct that you suggest per-object locks, and the concern is that programmers would pass the same object to asynchronous methods that would execute in parallel? Well, in such a case I'd say it should be on the programmer. It's their direct responsibility to know what is appropriate, and I for one see no point in "helping" them in such a manner, because such "help" would simply serialize execution of the asynchronous method, as if it were synchronous. In other words, it would simply mask the problem from the user and create a false illusion that they know what they're doing. The keyword is "false illusion," which is a bad thing, at least in my "book." If anything, it might be more helpful to detect the race and terminate (or throw an exception).

Well, maybe I've misread, not to mention that it might be just me :-) Just in case, it should be noted that detecting a race is not universally appropriate, in the sense that it might be safe to share an object among multiple threads; the programmer should know.
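
For illustration only, a minimal sketch of the "detect and throw" idea; the per-object std::mutex member and the names are hypothetical additions, not code from the PR:

    // Illustration of the "detect and throw" idea, not code from the PR.
    // The per-object std::mutex member is a hypothetical addition.
    #include <mutex>
    #include <stdexcept>

    struct Wrapped {
      std::mutex in_use;   // hypothetical per-object guard
      void method();       // non-reentrant wrapped method (assumed)
    };

    void call_checked(Wrapped &obj) {
      // Refuse to run concurrently on the same object instead of silently
      // serializing the asynchronous calls.
      std::unique_lock<std::mutex> guard(obj.in_use, std::try_to_lock);
      if (!guard.owns_lock())
        throw std::runtime_error("object is already in use by another async call");
      obj.method();
    }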

@mmomtchev (Contributor, Author)

on the programmer. It's their direct responsibility to know what is appropriate, and I for one see no point in "helping" them in such a manner, because such "help" would simply serialize execution of the asynchronous method, as if it were synchronous.

By programmer, do you mean the end user of the module or the interface/wrapper author?

A JS programmer does not expect random crashes and memory corruption just because he didn't implement locking himself - in fact, there isn't even a standard way to implement locks in JS, since all functions execute atomically (interrupted only at await points), and there are no tools available to him for debugging such issues.

The goal of the locking mechanism is to allow concurrent execution when possible, but to serialize it when needed.

Normally, this decision should be made by the wrapper author, who should be aware whether the wrapped C++ library supports concurrent calls on the same object or function. Since almost every author will have to implement some kind of locking mechanism, I think it is best if SWIG offers a basic feature to facilitate, if not completely automate, this task.
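
A rough sketch of that goal, assuming a hypothetical per-object mutex added by the wrapper (not the PR's actual mechanism): calls on the same object are serialized, while calls on different objects still run concurrently in their worker threads.

    // Rough sketch with a hypothetical per-object mutex (not the PR's actual
    // mechanism): same-object calls are serialized, different objects run
    // concurrently.
    #include <mutex>

    struct WrappedObject {
      std::mutex lock;       // hypothetical per-object lock
      void heavy_method();   // assumed non-reentrant on the same object
    };

    // Phase 2 of the async wrapper, running in the worker thread:
    void ExecuteLocked(WrappedObject &obj) {
      std::lock_guard<std::mutex> guard(obj.lock);   // serialize per object
      obj.heavy_method();                            // concurrent across objects
    }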

@mmomtchev marked this pull request as ready for review July 25, 2023 21:21

@mmomtchev (Contributor, Author)

This should be feature-complete. Finally, I went for an opt-in approach to locking, either through a %feature or through a CLI switch (to facilitate testing). Locking can be enabled/disabled on a per-type or per-method basis.

Next step would be to test this implementation in node-magickwand.

@dot-asm (Contributor) commented Jul 30, 2023

By programmer, do you mean the end user of the module or the interface/wrapper author?

Both. The script developer has to know something; the module provider obviously has to know everything.

[the] decision should be made by the wrapper author, who should be aware if the wrapped C++ library supports concurrent calls on the same object or function.

Right. And earlier I tried to say that the author is [arguably] entitled to express the choice in the method name, the .method vs. .methodAsPromise thing. I would also challenge the choice of providing an async interface for a method that doesn't support concurrent execution. Indeed, if the execution has to be serialized, then there is no point in running it asynchronously. Well, it might still be considered beneficial if the method is slow, but in such a case it might be more appropriate to advise the user to execute it in a generic promise, as opposed to having the method return the promise.

The expectation would be that the object is instantiated in the said promise context. Of course this is just an expectation, and the user can still instantiate the object in the outer scope. In such a case one can wonder if it would be more appropriate to detect that the object is being used concurrently and throw an exception.

I feel some dissonance. It's likely rooted in the fact that I'm not thinking about JavaScript in the "right" way, but nevertheless. Take your Pi example: it's perfectly "reentrant" and requires no locking whatsoever. (I'm writing "reentrant" in quotes because normally the term applies to functions/methods, not objects.) So we have to imagine a "non-reentrant" object. That property would necessarily mean that it mutates during the method's execution, right?

Now, if serializing the execution is meaningful, does it mean that the said mutations return the object to its original state? I mean, the next method invocation surely expects the object to be in a specific state. In which case, wouldn't it be more appropriate to make the script developer create multiple objects, as opposed to the behind-the-curtains serialization? Or are we talking about a larger scratch area that is not practical to allocate on a per-object basis? Either way, my point is that it's too application-specific to paint with the same brush. And I'd argue the module provider should have the tools to express all the necessary nuances...

@mmomtchev (Contributor, Author)

Even if the underlying C++ method is not reentrant at all - because it uses static data - there would still be a significant benefit to having an async method: it does not block the event loop. The main use of Node.js is as an application server built on Express.js (or one of the many similar Express-inspired middlewares). An async method allows Node.js to serve other clients while some kind of processing, be it I/O or CPU, is running. node-magickwand cannot be used in an application server without this, as opening a file would block all other clients. Using worker_threads with a preexisting pool (launching a new thread for each operation would be prohibitive) is an alternative, but it is slower and very cumbersome to use because it imposes message passing by serialization of all objects. async/await, on the other hand, is a very practical interface.

Then there are the methods that are non-reentrant only when called on the same object - node-magickwand can apply filters to multiple images in parallel. Most modern C++ libraries fall into this category.

@mmomtchev (Contributor, Author) commented Aug 12, 2023

One problem with the current implementation is the generic throws typemaps:

%typemap(throws,noblock=1) SWIGTYPE {

These cannot construct the value that is going to be thrown (they have to throw a Napi::Value, which cannot be constructed without accessing V8). Alas, I do not have a simple solution to this problem:

  • Detecting it so that I can at least emit a warning is not easy
  • Separating the catch from the action code requires modifying the generic code in emit.cxx
  • Implementing a SWIG_NAPI_Raise which detects the null NAPI environment - just like for strings and errors - requires some very complex code, storing the catch statement in a lambda and then running it back on the main thread

PS. Found a very elegant solution.
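
For context, this is node-addon-api's stock mechanism for the same constraint (not necessarily the solution found for this PR): inside Execute() there is no valid Napi::Env, so the error is carried as plain data and becomes a JS value only back on the main thread.

    // Illustration of the constraint, not the PR's solution: no Napi::Env is
    // available inside Execute(), so the error is stored as a plain string
    // and turned into a JS value only on the main thread.
    #include <napi.h>

    class FailingWorker : public Napi::AsyncWorker {
    public:
      explicit FailingWorker(Napi::Env env) : Napi::AsyncWorker(env) {}

      void Execute() override {
        // Worker thread: constructing a Napi::Value here would be invalid.
        SetError("something went wrong");   // stores only a std::string
      }

      void OnError(const Napi::Error &e) override {
        // Main thread: the stored message can now become a real JS error.
        Napi::Error::New(Env(), e.Message()).ThrowAsJavaScriptException();
      }
    };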

@mmomtchev (Contributor, Author) commented Aug 16, 2023

Finally, I was able to implement %raise (and %catches) in an async context - I am modifying the #define SWIG_Raise before each asynchronous wrapper. Everything that I have tried seems to work, except one unit test that I modified slightly: b4220f3

https://gist.github.com/mmomtchev/0544f11b1ff104c7b5498e6aed8cbf9b

What does not work is catching a generic exception and retrieving its error message: the error message is lost when copying a generic std::exception, because an exception must be copied as its exact type. Generic asynchronous transmission of exceptions between threads is a classic C++ problem that was finally solved in C++11 with the introduction of std::exception_ptr. This, however, requires that all exception handling in SWIG be redesigned around std::exception_ptr.
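
A standalone illustration of the slicing problem and of how std::exception_ptr avoids it (plain C++, no SWIG involved):

    // Copying a caught exception as std::exception slices it and loses the
    // message; std::exception_ptr preserves the exact dynamic type.
    #include <exception>
    #include <iostream>
    #include <stdexcept>

    std::exception sliced;          // copy into the base type: message is lost
    std::exception_ptr preserved;   // C++11: keeps the exact type

    void worker() {
      try {
        throw std::runtime_error("detailed error message");
      } catch (const std::exception &e) {
        sliced = e;                            // slices to std::exception
        preserved = std::current_exception();  // std::runtime_error kept intact
      }
    }

    int main() {
      worker();
      std::cout << sliced.what() << "\n";    // generic, implementation-defined text
      try {
        std::rethrow_exception(preserved);
      } catch (const std::exception &e) {
        std::cout << e.what() << "\n";       // "detailed error message"
      }
    }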

@mmomtchev (Contributor, Author) commented Sep 8, 2023

@dot-asm @ojwb @wsfulton I just got a very good taste of the problems with automatic copying of variables in lambda functions on Windows and I am reconsidering the advanced asynchronous throwing.

The problem with asynchronous throwing is that something must cross the thread boundary. It can't be a V8 object. And it seems that it is not a very good idea to do it with the exception object either.

At least in my case, the MSVC compiler/linker/runtime does not properly allocate/destroy a captured local exception when its constructor lives on the other side of a library boundary. As you may know, on Windows shared libraries can have a separate heap, and it seems that MSVC mishandles the destruction in some cases, such as this one:
https://gist.github.com/mmomtchev/64156d9391984f7695d8556974b1611b

This particular example works for std::exception but not for a class defined in a shared library. Moving the class from the shared library into the main executable solves the problem.

@dot-asm (Contributor) commented Sep 9, 2023

https://gist.github.com/mmomtchev/64156d9391984f7695d8556974b1611b

Two problems. Without /EHsc the run-time won't unwind the stack correctly, in which case it shouldn't come as a surprise if it misses some destructors. Secondly, if you want to compile your code with /MTd, the shared library you link with had better be built for debugging as well. The fact that you attempt to link with *.lib isn't reassuring that you link with a debugging version. Well, not that it's bound to fail, but mixing debug and non-debug is crash-prone.

@mmomtchev (Contributor, Author)

Two problems. Without /EHsc run-time won't unwind stack correctly, in which case it shouldn't come as a surprise if it misses some destructors. Secondly, if you want to compile your code with /MTd, the shared library you link with is better to be built for

The standalone example is missing /EHsc, but the bindings where I have the problem do have it. Still, the problem is very consistent with an unwinding issue. I am linking with /MTd libraries as well; normally the linker refuses to combine incompatible runtimes.

I will try to check what is happening, maybe there is something more than just /EHsc or something else disables it.

@mmomtchev (Contributor, Author)

@dot-asm yes, indeed
https://learn.microsoft.com/en-us/cpp/build/reference/eh-exception-handling-model?view=msvc-170
The default exception unwinding code doesn't destroy automatic C++ objects outside of try blocks that go out of scope because of an exception. Resource leaks and undefined behavior may result when a C++ exception is thrown.

I think that ImageMagick is built without it and this is the reason why moving the class out of it fixes the problem. I will look into it, thanks.

@mmomtchev (Contributor, Author)

Alas, this fixes the isolated reproduction but not the bindings themselves - they already use /EHsc. So the isolated repro is invalid.

I managed to get ASAN running with MSVC in a Node addon and found the culprit: when creating the lambda, the copy constructor of the captured object writes past the end of the allocated area. I am baffled as to how this is possible, so I am still convinced that this is indeed a compiler problem.

@mmomtchev (Contributor, Author)

OK, this has nothing to do with SWIG: sizeof(Magick::Exception) is 64 bytes in the SWIG-generated bindings but 72 bytes in the Magick++ library itself, so it is obviously purely an MSVC-side problem that has nothing to do with exceptions, lambdas or anything else. I am still looking for the cause though.
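
Whether or not it is the actual cause here, the textbook way for two translation units to disagree about the size of the "same" class is a member guarded by a configuration macro; a small standalone illustration (SOME_OPTION is a made-up macro):

    // Not the actual ImageMagick headers; compiled with and without
    // -DSOME_OPTION, the two builds see different layouts for the same
    // class name - a classic ODR violation.
    #include <iostream>

    struct Exceptionish {
      void *message;
    #ifdef SOME_OPTION
      long extra_state;   // present in one build, absent in the other
    #endif
    };

    int main() { std::cout << sizeof(Exceptionish) << "\n"; }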

since we are telling the users to define it themselves

@mmomtchev (Contributor, Author)

OK, clang on macOS has a somewhat similar alignment problem. Because of the custom alignment that ImageMagick types use, SWIG_NAPI_AsyncError - which contains a lambda, which contains an ImageMagick exception - must be aligned on a 16-byte boundary, and for some strange reason clang does not support throwing objects that require more than 8-byte alignment.

It is an edge case but now it is obvious that this will cause even more trouble in the future. I will try to preserve the feature but remove the lambda.
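
A standalone illustration of the construct in question, throwing an over-aligned object (not SWIG code); depending on the compiler and target ABI this may be rejected with an error about the extended alignment of the exception object:

    // Throwing an over-aligned type; some toolchains refuse to compile this.
    #include <cstdio>

    struct alignas(16) OverAligned {
      double payload[2];
    };

    int main() {
      try {
        throw OverAligned{};
      } catch (const OverAligned &) {
        std::printf("alignment requirement: %zu\n", alignof(OverAligned));
      }
    }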

* No more throwing of capturing lambdas
* Uses the right C++ tool for the job
* emit_action overhaul

commit 0302ddc
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 17:39:06 2023 +0200

    deleting a hash deletes its members too

commit b4adbbe
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 17:12:12 2023 +0200

    fix double dealloc of trethrow

commit 6b88e0c
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 17:12:00 2023 +0200

    tidy up

commit af16ad4
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 17:07:17 2023 +0200

    free all the allocations

commit 36a0951
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 15:55:40 2023 +0200

    fix the special_variables test

commit 6f51103
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 15:48:46 2023 +0200

    fix variable expansion

commit 15ea779
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 13:54:55 2023 +0200

    fix %exception and restore swig_exception.i to its former glory

commit 2c4db7c
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 12:30:35 2023 +0200

    fix compilation with exceptions disabled

commit 25f3091
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 12:05:41 2023 +0200

    second attempt

    This time the code uses the right tools for the job
    Exceptions are saved in an exception_ptr
    The exception_ptr crosses the thread boundary
    The user code handler is placed in a separate function
    The exception_ptr is rethrown in the V8 thread

    Downsides:
    * Overhaul of emit_action() in emit.cxx
    * Bypassing the NAPI error handling

commit d75d839
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Fri Sep 15 17:16:08 2023 +0200

    async exceptions w/o lambdas

@mmomtchev (Contributor, Author)

I refactored the asynchronous exceptions around std::exception_ptr and now everything is much cleaner: there is a catch(...) in the background thread and a std::rethrow_exception that goes through an unmodified catch handler in the main thread.

This is a much better solution; the only downside is that I had to refactor emit_action. There is now a new emit_action_hash, which returns the separate fragments in a hash, and the old emit_action, which calls emit_action_hash but should behave as before. The algorithm itself hasn't been modified at all; there is simply a wrapper and some renamed variables.
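
The general shape of that pattern, sketched with node-addon-api (hypothetical names, not the actual generated wrapper):

    // Catch everything in the worker thread, carry it across as
    // std::exception_ptr, rethrow on the main thread where the ordinary
    // catch handlers and the V8 environment are available.
    #include <exception>
    #include <napi.h>

    extern void wrapped_action();   // the wrapped C++ call (assumed)

    class ActionWorker : public Napi::AsyncWorker {
      std::exception_ptr eptr;
    public:
      explicit ActionWorker(Napi::Env env) : Napi::AsyncWorker(env) {}

      void Execute() override {
        try {
          wrapped_action();
        } catch (...) {
          eptr = std::current_exception();   // crosses the thread boundary intact
          SetError("pending C++ exception");
        }
      }

      void OnError(const Napi::Error &) override {
        // Back on the V8 thread: rethrow and let a normal catch handler
        // translate the exact exception type into a JS error.
        try {
          std::rethrow_exception(eptr);
        } catch (const std::exception &e) {
          Napi::Error::New(Env(), e.what()).ThrowAsJavaScriptException();
        } catch (...) {
          Napi::Error::New(Env(), "unknown C++ exception").ThrowAsJavaScriptException();
        }
      }
    };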
