Asynchronous execution for the Node-API generator #2654

Open
wants to merge 144 commits into master

Conversation

@mmomtchev (Contributor) commented Jul 17, 2023

There are still some unit tests that haven't been updated, but this is advanced enough to start discussing it.

This implementation is quite different from the first one, which relied on lambdas. This one is C++03-compatible and has a very elegant solution to the problem of copying the local variables: all local variables in the wrapper methods are now members of a class that inherits from Napi::AsyncWorker, which allows the existing code to be reused. The three phases of an async method - synchronous parsing of the input arguments on the main thread, asynchronous execution in a worker thread, and synchronous generation of the resulting JS objects back on the main thread - are implemented as class methods. The class deletes itself once the operation is complete. This way, all existing typemaps can be used in async mode without any modification.

Async mode can be enabled either with a %feature, or it can simply replace sync mode via the -async CLI option. In the latter case, all unit tests (so far) work without changing anything besides adding await and, where needed, making the JS strict-mode compliant, since top-level await requires it.
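
A minimal sketch of the Napi::AsyncWorker pattern described above (not the PR's generated code; ExampleWorker and compute() are hypothetical names):

    // Minimal sketch of the pattern, not the generated code.
    // ExampleWorker and compute() are hypothetical names.
    #include <napi.h>

    extern double compute(double x);   // the wrapped C++ function (assumed)

    class ExampleWorker : public Napi::AsyncWorker {
      // Former local variables of the wrapper are members, so they survive
      // across the three phases.
      double arg1;
      double result;
      Napi::Promise::Deferred deferred;

    public:
      ExampleWorker(Napi::Env env, double arg1)
          : Napi::AsyncWorker(env), arg1(arg1), result(0),
            deferred(Napi::Promise::Deferred::New(env)) {}

      Napi::Promise Promise() { return deferred.Promise(); }

      // Phase 2: worker thread, no V8 access allowed here.
      void Execute() override { result = compute(arg1); }

      // Phase 3: back on the main thread, build the JS result.
      void OnOK() override { deferred.Resolve(Napi::Number::New(Env(), result)); }

      void OnError(const Napi::Error &e) override { deferred.Reject(e.Value()); }
    };

    // Phase 1: synchronous argument parsing on the main thread.
    Napi::Value ComputeAsync(const Napi::CallbackInfo &info) {
      double arg1 = info[0].As<Napi::Number>().DoubleValue();
      ExampleWorker *worker = new ExampleWorker(info.Env(), arg1);
      worker->Queue();   // the worker deletes itself once the operation completes
      return worker->Promise();
    }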

Points that should be discussed:

  • Should locking be opt-in or opt-out? Currently I am leaning towards opt-out for both sync and async methods once async is enabled, but an alternative would be something akin to the INPUT typemaps that are used with %apply. My experience is that most C++ libraries expect that their major objects' methods are not reentrant - GDAL (Dataset), ImageMagick (Image) and Eigen (Matrix), for example - so locking will have to be implemented most of the time
  • How should the Async/Sync suffixes interact with renaming? Currently, renaming is applied before the suffix
  • Maybe the async unit testing should be a separate instance enabled by a CLI option, because NAPI unit testing is now exceptionally slow

@dot-asm

AFAIK this is the first SWIG PR to be developed entirely on solar power

@dot-asm (Contributor) commented Jul 23, 2023

Not that what I say would actually matter, but I'd like to argue that it would be appropriate to leave synchronous method names as they are and add a suffix to their asynchronous counterparts, as opposed to suffixing both names. I mean, given a Class::method(), .method on the JS side would invoke the synchronous method, while say .methodAsPromise would return a promise. The rationale is that if a module already exists somewhere, the current scripts won't need modifications if the maintainer chooses to extend the module with asynchronous methods.

What I [for one?] would also appreciate is the ability to pick methods for "asynchronization" one by one, because it's not a given that it would be appropriate to "async" all methods SWIG can get its hands on. The decision should be the developer's...

Now, it might be possible to achieve this as it stands now; I haven't dived into the implementation and have only reviewed the documentation modifications so far, and I can't really deduce it from the docs... It sounds like the options are "sync and async with suffixes", "everything async without suffix", "method of choice is sync/async without suffix"...

Doc/Manual/Javascript.html (review thread, resolved)

@dot-asm (Contributor) commented Jul 23, 2023

  • Should locking be opt-in or opt-out?

I don't quite follow. Is my understanding correct that you suggest per-object locks, and the concern is that programmers would pass the same object to asynchronous methods that would execute in parallel? Well, in such a case I'd say it should be on the programmer. It's their direct responsibility to know what is appropriate, and I for one see no point in "helping" them in such a manner, because such "help" would simply serialize execution of the asynchronous method, as if it were synchronous. In other words, it would simply mask the problem from the user and create a false illusion that they know what they're doing. The keyword is "false illusion," which is a bad thing, at least in my "book." If anything, it might be more helpful to detect the race and terminate (or throw an exception).

Well, maybe I've misread, not to mention that it might be just me :-) Just in case, it should be noted that detecting a race is not universally appropriate, in the sense that it might be safe to share an object among multiple threads; the programmer should know.
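
For illustration only, a minimal sketch of the "detect and throw" idea; the per-object std::mutex member and the names are hypothetical additions, not code from the PR:

    // Illustration of the "detect and throw" idea, not code from the PR.
    // The per-object std::mutex member is a hypothetical addition.
    #include <mutex>
    #include <stdexcept>

    struct Wrapped {
      std::mutex in_use;   // hypothetical per-object guard
      void method();       // non-reentrant wrapped method (assumed)
    };

    void call_checked(Wrapped &obj) {
      // Refuse to run concurrently on the same object instead of silently
      // serializing the asynchronous calls.
      std::unique_lock<std::mutex> guard(obj.in_use, std::try_to_lock);
      if (!guard.owns_lock())
        throw std::runtime_error("object is already in use by another async call");
      obj.method();
    }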

@mmomtchev (Contributor, Author)

on the programmer. It's their direct responsibility to know what is appropriate, and I for one see no point in "helping" them in such a manner, because such "help" would simply serialize execution of the asynchronous method, as if it were synchronous.

By programmer, do you mean the end user of the module or the interface/wrapper author?

A JS programmer does not expect random crashes and memory corruption just because he didn't implement locking himself - in fact, there isn't even a standard way to implement locks in JS, since all functions execute atomically (interrupted only at await points), and there are no tools available to him for debugging such issues.

The goal of the locking mechanism is to allow concurrent execution when possible, but to serialize it when needed.

Normally, this decision should be made by the wrapper author, who should be aware whether the wrapped C++ library supports concurrent calls on the same object or function. Since almost every author will have to implement some kind of locking mechanism, I think it is best if SWIG offers a basic feature to facilitate, if not completely automate, this task.
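
A rough sketch of that goal, assuming a hypothetical per-object mutex added by the wrapper (not the PR's actual mechanism): calls on the same object are serialized, while calls on different objects still run concurrently in their worker threads.

    // Rough sketch with a hypothetical per-object mutex (not the PR's actual
    // mechanism): same-object calls are serialized, different objects run
    // concurrently.
    #include <mutex>

    struct WrappedObject {
      std::mutex lock;       // hypothetical per-object lock
      void heavy_method();   // assumed non-reentrant on the same object
    };

    // Phase 2 of the async wrapper, running in the worker thread:
    void ExecuteLocked(WrappedObject &obj) {
      std::lock_guard<std::mutex> guard(obj.lock);   // serialize per object
      obj.heavy_method();                            // concurrent across objects
    }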

@mmomtchev marked this pull request as ready for review July 25, 2023 21:21

@mmomtchev (Contributor, Author)

This should be feature-complete. Finally, I went for an opt-in approach to locking, either through a %feature or through a CLI switch (to facilitate testing). Locking can be enabled/disabled on a per-type or per-method basis.

Next step would be to test this implementation in node-magickwand.

@dot-asm (Contributor) commented Jul 30, 2023

By programmer, do you mean the end user of the module or the interface/wrapper author?

Both. The script developer has to know something; the module provider obviously has to know everything.

[the] decision should be made by the wrapper author, who should be aware if the wrapped C++ library supports concurrent calls on the same object or function.

Right. And earlier I tried to say that the author is [arguably] entitled to express the choice in the method name, the .method vs. .methodAsPromise thing. I would also challenge the choice of providing an async interface for a method that doesn't support concurrent execution. Indeed, if the execution has to be serialized, then there is no point in running it asynchronously. Well, it might still be considered beneficial if the method is slow, but in such a case it might be more appropriate to advise the user to execute it in a generic promise, as opposed to having the method return the promise.

The expectation would be that the object is instantiated in the said promise context. Of course this is just an expectation, and the user can still instantiate the object in the outer scope. In such a case one can wonder if it would be more appropriate to detect that the object is being used concurrently and throw an exception.

I feel some dissonance. It's likely rooted in the fact that I'm not thinking about JavaScript in the "right" way, but nevertheless. Take your Pi example: it's perfectly "reentrant" and requires no locking whatsoever. (I'm writing "reentrant" in quotes because normally the term applies to functions/methods, not objects.) So we have to imagine a "non-reentrant" object. That property would necessarily mean that it mutates during the method's execution, right?

Now, if serializing the execution is meaningful, does it mean that the said mutations return the object to its original state? I mean, the next method invocation surely expects the object to be in a specific state. In which case, wouldn't it be more appropriate to make the script developer create multiple objects, as opposed to the behind-the-curtains serialization? Or are we talking about a larger scratch area that is not practical to allocate on a per-object basis? Either way, my point is that it's too application-specific to paint with the same brush. And I'd argue the module provider should have the tools to express all the necessary nuances...

@mmomtchev (Contributor, Author)

Even if the underlying C++ method is not reentrant at all - because it uses static data - there would still be a significant benefit to having an async method: it does not block the event loop. The main use of Node.js is as an application server built on Express.js (or one of the many similar Express-inspired middlewares). An async method allows Node.js to serve other clients while some kind of processing, be it I/O or CPU, is running. node-magickwand cannot be used in an application server without this, as opening a file would block all other clients. Using worker_threads with a preexisting pool (launching a new thread for each operation would be prohibitive) is an alternative, but it is slower and very cumbersome to use because it imposes message passing by serialization of all objects. async/await, on the other hand, is a very practical interface.

Then there are the methods that are non-reentrant only when called on the same object - node-magickwand can apply filters to multiple images in parallel. Most modern C++ libraries fall into this category.

@mmomtchev (Contributor, Author) commented Aug 12, 2023

One problem with the current implementation is the generic throws typemaps:

%typemap(throws,noblock=1) SWIGTYPE {

These cannot construct the value that is going to be thrown (they have to throw a Napi::Value, which cannot be constructed without accessing V8). Alas, I do not have a simple solution to this problem:

  • Detecting it so that I can at least emit a warning is not easy
  • Separating the catch from the action code requires modifying the generic code in emit.cxx
  • Implementing a SWIG_NAPI_Raise which detects the null NAPI environment - just like for strings and errors - requires some very complex code, storing the catch statement in a lambda and then running it back on the main thread

PS. Found a very elegant solution.
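
For context, this is node-addon-api's stock mechanism for the same constraint (not necessarily the solution found for this PR): inside Execute() there is no valid Napi::Env, so the error is carried as plain data and becomes a JS value only back on the main thread.

    // Illustration of the constraint, not the PR's solution: no Napi::Env is
    // available inside Execute(), so the error is stored as a plain string
    // and turned into a JS value only on the main thread.
    #include <napi.h>

    class FailingWorker : public Napi::AsyncWorker {
    public:
      explicit FailingWorker(Napi::Env env) : Napi::AsyncWorker(env) {}

      void Execute() override {
        // Worker thread: constructing a Napi::Value here would be invalid.
        SetError("something went wrong");   // stores only a std::string
      }

      void OnError(const Napi::Error &e) override {
        // Main thread: the stored message can now become a real JS error.
        Napi::Error::New(Env(), e.Message()).ThrowAsJavaScriptException();
      }
    };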

@mmomtchev (Contributor, Author) commented Aug 16, 2023

Finally, I was able to implement %raise (and %catches) in an async context - I am modifying the #define SWIG_Raise before each asynchronous wrapper. Everything that I have tried seems to work, except one unit test that I modified slightly: b4220f3

https://gist.github.com/mmomtchev/0544f11b1ff104c7b5498e6aed8cbf9b

What does not work is catching a generic exception and retrieving its error message: the error message is lost when copying a generic std::exception, because an exception must be copied as its exact type. Generic asynchronous transmission of exceptions between threads is a classic C++ problem that was finally solved in C++11 with the introduction of std::exception_ptr. This, however, requires that all exception handling in SWIG be redesigned around std::exception_ptr.
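
A standalone illustration of the slicing problem and of how std::exception_ptr avoids it (plain C++, no SWIG involved):

    // Copying a caught exception as std::exception slices it and loses the
    // message; std::exception_ptr preserves the exact dynamic type.
    #include <exception>
    #include <iostream>
    #include <stdexcept>

    std::exception sliced;          // copy into the base type: message is lost
    std::exception_ptr preserved;   // C++11: keeps the exact type

    void worker() {
      try {
        throw std::runtime_error("detailed error message");
      } catch (const std::exception &e) {
        sliced = e;                            // slices to std::exception
        preserved = std::current_exception();  // std::runtime_error kept intact
      }
    }

    int main() {
      worker();
      std::cout << sliced.what() << "\n";    // generic, implementation-defined text
      try {
        std::rethrow_exception(preserved);
      } catch (const std::exception &e) {
        std::cout << e.what() << "\n";       // "detailed error message"
      }
    }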

@mmomtchev (Contributor, Author) commented Sep 8, 2023

@dot-asm @ojwb @wsfulton I just got a very good taste of the problems with automatic copying of variables in lambda functions on Windows and I am reconsidering the advanced asynchronous throwing.

The problem with asynchronous throwing is that something must cross the thread boundary. It can't be a V8 object. And it seems that it is not a very good idea to do it with the exception object either.

At least in my case, the MSVC compiler/linker/runtime does not properly allocate/destroy a captured local exception when its constructor lives on the other side of a library boundary. As you may know, on Windows shared libraries can have a separate heap, and it seems that MSVC mishandles the destruction in some cases, such as this one:
https://gist.github.com/mmomtchev/64156d9391984f7695d8556974b1611b

This particular example works for std::exception but not for a class defined in a shared library. Moving the class from the shared library into the main executable solves the problem.

@dot-asm (Contributor) commented Sep 9, 2023

https://gist.github.com/mmomtchev/64156d9391984f7695d8556974b1611b

Two problems. Without /EHsc the run-time won't unwind the stack correctly, in which case it shouldn't come as a surprise if it misses some destructors. Secondly, if you want to compile your code with /MTd, the shared library you link with had better be built for debugging as well. The fact that you attempt to link with *.lib isn't reassuring that you link with a debugging version. Well, not that it's bound to fail, but mixing debug and non-debug is crash-prone.

@mmomtchev (Contributor, Author)

Two problems. Without /EHsc run-time won't unwind stack correctly, in which case it shouldn't come as a surprise if it misses some destructors. Secondly, if you want to compile your code with /MTd, the shared library you link with is better to be built for

The standalone example is missing /EHsc, but the bindings where I have the problem do have it. Still, the problem is very consistent with an unwinding issue. I am linking with /MTd libraries as well; normally the linker refuses to combine incompatible runtimes.

I will try to check what is happening, maybe there is something more than just /EHsc or something else disables it.

@mmomtchev (Contributor, Author)

@dot-asm yes, indeed
https://learn.microsoft.com/en-us/cpp/build/reference/eh-exception-handling-model?view=msvc-170
The default exception unwinding code doesn't destroy automatic C++ objects outside of try blocks that go out of scope because of an exception. Resource leaks and undefined behavior may result when a C++ exception is thrown.

I think that ImageMagick is built without it and this is the reason why moving the class out of it fixes the problem. I will look into it, thanks.

@mmomtchev (Contributor, Author)

Alas, this fixes the isolated reproduction but not the bindings themselves - they already use /EHsc. So the isolated repro is invalid.

I managed to get ASAN running with MSVC in a Node addon and found the culprit: when creating the lambda, the copy constructor of the captured object writes past the end of the allocated area. I am baffled as to how this is possible, so I am still convinced that this is indeed a compiler problem.

@mmomtchev (Contributor, Author)

OK, this has nothing to do with SWIG: sizeof(Magick::Exception) is 64 bytes in the SWIG-generated bindings but 72 bytes in the Magick++ library itself, so it is obviously purely an MSVC-side problem that has nothing to do with exceptions, lambdas or anything else. I am still looking for the cause though.
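
Whether or not it is the actual cause here, the textbook way for two translation units to disagree about the size of the "same" class is a member guarded by a configuration macro; a small standalone illustration (SOME_OPTION is a made-up macro):

    // Not the actual ImageMagick headers; compiled with and without
    // -DSOME_OPTION, the two builds see different layouts for the same
    // class name - a classic ODR violation.
    #include <iostream>

    struct Exceptionish {
      void *message;
    #ifdef SOME_OPTION
      long extra_state;   // present in one build, absent in the other
    #endif
    };

    int main() { std::cout << sizeof(Exceptionish) << "\n"; }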

since we are telling the users to define it themselves

@mmomtchev (Contributor, Author)

OK, clang on macOS has a somewhat similar alignment problem. Because of the custom alignment that ImageMagick types use, SWIG_NAPI_AsyncError - which contains a lambda, which contains an ImageMagick exception - must be aligned on a 16-byte boundary, and for some strange reason clang does not support throwing objects that require more than 8-byte alignment.

It is an edge case but now it is obvious that this will cause even more trouble in the future. I will try to preserve the feature but remove the lambda.
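
A standalone illustration of the construct in question, throwing an over-aligned object (not SWIG code); depending on the compiler and target ABI this may be rejected with an error about the extended alignment of the exception object:

    // Throwing an over-aligned type; some toolchains refuse to compile this.
    #include <cstdio>

    struct alignas(16) OverAligned {
      double payload[2];
    };

    int main() {
      try {
        throw OverAligned{};
      } catch (const OverAligned &) {
        std::printf("alignment requirement: %zu\n", alignof(OverAligned));
      }
    }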

* No more throwing of capturing lambdas
* Uses the right C++ tool for the job
* emit_action overhaul

commit 0302ddc
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 17:39:06 2023 +0200

    deleting a hash deletes its members too

commit b4adbbe
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 17:12:12 2023 +0200

    fix double dealloc of trethrow

commit 6b88e0c
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 17:12:00 2023 +0200

    tidy up

commit af16ad4
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 17:07:17 2023 +0200

    free all the allocations

commit 36a0951
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 15:55:40 2023 +0200

    fix the special_variables test

commit 6f51103
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 15:48:46 2023 +0200

    fix variable expansion

commit 15ea779
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 13:54:55 2023 +0200

    fix %exception and restore swig_exception.i to its former glory

commit 2c4db7c
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 12:30:35 2023 +0200

    fix compilation with exceptions disabled

commit 25f3091
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Sat Sep 16 12:05:41 2023 +0200

    second attempt

    This time the code uses the right tools for the job
    Exceptions are saved in an exception_ptr
    The exception_ptr crosses the thread boundary
    The user code handler is placed in a separate function
    The exception_ptr is rethrown in the V8 thread

    Downsides:
    * Overhaul of emit_action() in emit.cxx
    * Bypassing the NAPI error handling

commit d75d839
Author: Momtchil Momtchev <momtchil@momtchev.com>
Date:   Fri Sep 15 17:16:08 2023 +0200

    async exceptions w/o lambdas

@mmomtchev (Contributor, Author)

I refactored the asynchronous exceptions around std::exception_ptr and now everything is much cleaner: there is a catch(...) in the background thread and a std::rethrow_exception that goes through an unmodified catch handler in the main thread.

This is a much better solution; the only downside is that I had to refactor emit_action. There is now a new emit_action_hash, which returns the separate fragments in a hash, and the old emit_action, which calls emit_action_hash but should behave as before. The algorithm itself hasn't been modified at all; there is simply a wrapper and some renamed variables.
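
The general shape of that pattern, sketched with node-addon-api (hypothetical names, not the actual generated wrapper):

    // Catch everything in the worker thread, carry it across as
    // std::exception_ptr, rethrow on the main thread where the ordinary
    // catch handlers and the V8 environment are available.
    #include <exception>
    #include <napi.h>

    extern void wrapped_action();   // the wrapped C++ call (assumed)

    class ActionWorker : public Napi::AsyncWorker {
      std::exception_ptr eptr;
    public:
      explicit ActionWorker(Napi::Env env) : Napi::AsyncWorker(env) {}

      void Execute() override {
        try {
          wrapped_action();
        } catch (...) {
          eptr = std::current_exception();   // crosses the thread boundary intact
          SetError("pending C++ exception");
        }
      }

      void OnError(const Napi::Error &) override {
        // Back on the V8 thread: rethrow and let a normal catch handler
        // translate the exact exception type into a JS error.
        try {
          std::rethrow_exception(eptr);
        } catch (const std::exception &e) {
          Napi::Error::New(Env(), e.what()).ThrowAsJavaScriptException();
        } catch (...) {
          Napi::Error::New(Env(), "unknown C++ exception").ThrowAsJavaScriptException();
        }
      }
    };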
