
Breaking Change Proposal: Move compilation_id out of the protobuf to improve performance in multi-thread/multi-isolate context #3579

Closed
3 tasks done
ntkme opened this issue Apr 21, 2023 · 25 comments · Fixed by #3599
Labels
embedded Related to the embedded Sass protocol enhancement New feature or request

Comments

@ntkme
Contributor

ntkme commented Apr 21, 2023


Today we have the wire data encoded as:

[ varint proto_length ][ proto ]

This is a proposal for a breaking change: remove the compilation_id field from the proto and replace it with a top-level session_id, as below, to serve the same purpose:

[ varint session_id ][ varint proto_length ][ proto ]

The motivation is to improve I/O performance in a multi-threaded context by offloading message encoding and decoding to different CPU cores running different threads/isolates, reducing the pressure that heavy I/O multiplexing on stdio puts on the main isolate/thread. With this change, both the host and the compiler could pass the raw proto to the thread/isolate that is working on a specific compilation, so the proto is decoded on the child thread's CPU time, leaving the main thread more cycles to purely process multiplexed I/O.

session_id should be in the range of uint32. The reasons to choose varint as its encoding instead are:

  1. In most cases it is shorter.
  2. It avoids possible confusion about the endianness of a uint32 on the wire.
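The framing and the varint size argument can be sketched as follows. This is a minimal Python illustration; `encode_varint`, `decode_varint`, and `frame` are hypothetical helper names, not part of any host API:

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def frame(session_id: int, proto: bytes) -> bytes:
    """Proposed framing: [varint session_id][varint proto_length][proto]."""
    return encode_varint(session_id) + encode_varint(len(proto)) + proto

# Point 1: for small IDs the varint is shorter than a fixed 4-byte uint32.
assert len(encode_varint(1)) == 1
assert len(encode_varint(300)) == 2
```

The receiving side reads two varints off the stream, then hands the remaining `proto_length` bytes to a worker without parsing them on the I/O thread.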
@ntkme
Contributor Author

ntkme commented Apr 21, 2023

More specifically, these can be removed and replaced with a single session_id on wire:

  • CompileRequest.id
  • VersionRequest.id
  • ProtocolError.id
  • CompileResponse.id
  • VersionResponse.id
  • LogEvent.compilation_id
  • CanonicalizeRequest.compilation_id
  • ImportRequest.compilation_id
  • FileImportRequest.compilation_id
  • FunctionCallRequest.compilation_id

These should remain, to preserve the possibility of a compiler implementation sending multiple requests of the same kind concurrently within a single compilation:

  • CanonicalizeRequest.id
  • ImportRequest.id
  • FileImportRequest.id
  • FunctionCallRequest.id
  • CanonicalizeResponse.id
  • ImportResponse.id
  • FileImportResponse.id
  • FunctionCallResponse.id

@nex3
Contributor

nex3 commented Apr 21, 2023

I was hoping we could extract this information from the serialized protobuf directly, but since field order isn't guaranteed I guess that's not safe. This is probably the next best option.

We should bear in mind the possibility of #2745 as we design this. That is, we may want to have the ability to ensure that multiple compilations take place within the same compiler thread. We could potentially track this separately, but that would complicate the compiler's ability to multiplex new compilation requests.

One possibility would be to preserve all the existing id fields and just declare (for now) that IncomingRequests must have an id field that matches their session_id.

@ntkme
Contributor Author

ntkme commented Apr 21, 2023

One possibility would be to preserve all the existing id fields and just declare (for now) that IncomingRequests must have an id field that matches their session_id.

That is a good migration path. We can feature-flag dart-sass-embedded to accept and send session_id on the wire, which developers can opt into.

@ntkme
Contributor Author

ntkme commented Apr 22, 2023

That is, we may want to have the ability to ensure that multiple compilations take place within the same compiler thread.

This is more of an issue due to the lack of shared memory across isolates in Dart. If we can find a way to share data across isolates, this can potentially be solved differently. Read-write access across isolates would be better, but even read-only shared data can work to some degree. E.g., the main isolate owns a shared cache, and all isolates can read from it. When worker isolates mutate the shared cache, they commit to a thread-local cache first; the thread-local cache can then be asynchronously sent back to the main isolate for a persisted commit to the shared cache.

@nex3
Contributor

nex3 commented Apr 25, 2023

That's true; I could imagine a situation where we want parallel compilations that still share state with one another. It seems quite tricky to handle, given that for a lot of entrypoints almost everything will be cacheable, but it's certainly not impossible.

In that case, maybe we should explicitly make the wire-level ID a compilation ID, and plan to have a separate session ID in the CompileRequest. That does raise a question for what to do with VersionRequest, but we could potentially reserve ID 0 for that. WDYT?

@ntkme
Contributor Author

ntkme commented Apr 25, 2023

I think we're having different interpretations of session_id:

  • My original interpretation of session_id is that a session is a single end-to-end "conversation": e.g., a CompileRequest to CompileResponse, or a VersionRequest to VersionResponse.
  • If I understand correctly, your interpretation is more of a worker-affinity concept. At least with the current limitations, cache affinity has to be implemented as worker (isolate) affinity in Dart.

With your interpretation of worker affinity, I think we can call the session_id a worker_id. We can even keep the current proto definition as is, and just add a top-level worker_id. We don't really need to reserve a wire ID for handling VersionRequest on the main isolate; we can just hand it off to a worker like a CompileRequest. It is a low-frequency use case where the performance cost of handing it off likely does not matter.

In other words, no changes to the current proto definition at all, just a change to the wire encoding:

[ varint worker_id ][ varint proto_length ][ proto ]

Specifically, the worker_id controls the affinity of the worker/isolate/thread: requests sent with the same worker_id will always be processed by the same worker on the compiler side.
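As a rough illustration of that affinity rule (hypothetical names; a real compiler would use Dart isolates rather than Python threads), a dispatcher could route the raw framed bytes to per-worker queues keyed by worker_id, deferring proto decoding to the worker:

```python
import queue
import threading

class Dispatcher:
    """Routes raw framed messages to per-worker queues keyed by worker_id."""

    def __init__(self) -> None:
        self._workers: dict[int, queue.Queue] = {}
        self._lock = threading.Lock()

    def dispatch(self, worker_id: int, raw_proto: bytes) -> None:
        """Hand still-encoded proto bytes to the worker for worker_id."""
        with self._lock:
            q = self._workers.get(worker_id)
            if q is None:
                # Workers are created on demand and, per the simplest
                # lifecycle option, kept until the process terminates.
                q = queue.Queue()
                self._workers[worker_id] = q
                threading.Thread(target=self._run, args=(q,), daemon=True).start()
        q.put(raw_proto)

    def _run(self, q: queue.Queue) -> None:
        while True:
            raw_proto = q.get()
            # Decode and process the proto here, on the worker's own
            # CPU time rather than the main I/O thread's.
            q.task_done()
```

Messages for the same worker_id land on the same queue, so they are always processed by the same worker, which is exactly the affinity guarantee described above.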

If we go with this explicit approach, we need to think about the lifecycle of workers:

  • They can be created on demand, but how do we terminate them?
    • Do we keep them forever until the process is terminated?
    • Do we kill a worker if it idles for too long?
    • Do we need an explicit API for managing workers?

My take is that we go with the simplest option: keep created workers until the process is terminated. Host implementations can make the judgment call on when to clean up.


However,

The real big issue with the worker-affinity idea is execution order. In this implementation we would allow enqueuing multiple CompileRequests to a single worker. Imagine a synchronous compilation A which contains a custom JS function, and the JS function starts a synchronous compilation B that asks to run on the same worker. With FIFO order we can deadlock a worker; LIFO, on the other hand, risks running back into stack overflow, which was part of the reason we landed here.

This complexity of worker affinity is why I was thinking of using session_id to represent each single end-to-end "conversation" between host and compiler. In that design, no concurrent duplicate session_ids are allowed. There is no queuing; everything runs concurrently. It is much simpler than worker affinity and avoids the queuing and execution-order issues, but obviously, supporting a shared cache would be difficult in this model.

@nex3
Contributor

nex3 commented Apr 25, 2023

I didn't mean to imply that the wire ID would be tightly semantically tied to workers. Rather, I meant that it would be tied to a single compilation—essentially the same as your idea of session_id, but not generalized to all incoming requests.

You're right that it's unlikely to be a performance issue in practice to have VersionRequest allocate its own isolate, but it's hard to predict what additional compiler requests we'll want in the future. For example, I could imagine giving the host more fine-grained control over the parallelism in the compiler via incoming requests, or the ability to monitor build metrics, in which case we'd probably want a way to easily disambiguate those lightweight requests to the central control from the heavyweight compile requests that need parallelism. Carving out a wire ID for "things that aren't compile requests" leaves that door open, while still allowing us to expand the wire ID system beyond compile requests in the future if we do find additional incoming requests that need parallelism.

To summarize, here's my proposal:

  • We add a wire ID as described.
  • We remove Compile{Request,Response}.id and *.compilation_id and just use this wire ID instead.
  • We reserve the wire ID 0 for VersionRequests and potential future compilation-independent incoming requests.
  • We forbid re-using the wire ID for an outstanding CompileRequest. This avoids the execution order issues you've described.

Semantically, the wire ID isn't any different than the compilation ID is today—it doesn't intrinsically map to workers or threads. It's just defined in such a way that a host can efficiently parallelize it as an implementation detail.
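The wire-ID rules in this proposal could be enforced host-side with something like the following sketch (`WireIdAllocator` is a hypothetical helper, not part of any official host API):

```python
class WireIdAllocator:
    """Allocates wire IDs per the proposal: ID 0 is reserved for
    VersionRequest and future compilation-independent requests, and an
    ID may not be reused while its CompileRequest is outstanding."""

    VERSION_ID = 0  # reserved for compilation-independent requests

    def __init__(self) -> None:
        self._in_flight: set[int] = set()
        self._next = 1

    def acquire(self) -> int:
        """Pick an unused non-zero wire ID for a new CompileRequest."""
        while self._next in self._in_flight or self._next == self.VERSION_ID:
            self._next = (self._next + 1) & 0xFFFFFFFF  # stay within uint32
        wire_id = self._next
        self._in_flight.add(wire_id)
        return wire_id

    def release(self, wire_id: int) -> None:
        """Call when the CompileResponse for wire_id arrives."""
        self._in_flight.discard(wire_id)

alloc = WireIdAllocator()
a = alloc.acquire()
b = alloc.acquire()
assert a != b and 0 not in (a, b)  # no reuse while outstanding, 0 reserved
```

Because an ID is only recycled after its response arrives, the execution-order concerns above cannot arise: a worker never has two compilations queued under the same ID.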

@ntkme
Contributor Author

ntkme commented Apr 25, 2023

@nex3 Your proposal sounds good to me.

@ntkme
Contributor Author

ntkme commented Apr 25, 2023

Pinging maintainers of other known host implementations for comments:

@ahabhgk (https://github.com/ahabhgk/sass-embedded-host-rust)
@bep (https://github.com/bep/godartsass)
@johnfairh (https://github.com/johnfairh/swift-sass)
@larsgrefer (https://github.com/larsgrefer/dart-sass-java)

@johnfairh
Contributor

Sounds fine to me.

Keeping in mind that 'Version' is a placeholder for 'Version and other slow-path commands that don't exist yet': would we keep VersionRequest/Response.id as today, as some [potentially] non-zero ID that is treated as part of the same 'must not overlap' namespace as the CompileRequest wire IDs?

@bep

bep commented Apr 26, 2023

I don't see any real problems on my end (godartsass) other than some work, but if it helps with performance, I'm always thumbs up.

@ntkme
Contributor Author

ntkme commented Apr 26, 2023

as some [potentially] non-zero ID that is treated as part of the same 'must not overlap' namespace as the CompilerRequest 'wire ID's?

We don’t necessarily need an id on VersionRequest itself if we reserve wire ID 0, as we don’t really expect to receive concurrent/repeated requests for checking the version, and even if that happens the response will always be the same.

Any other potential future requests would be different protobuf types. Since we only have the single wire ID 0 reserved for all of these, processing such a request would always mean first checking the type of the message received with wire ID 0 and then proceeding accordingly based on the message type.

@nex3
Contributor

nex3 commented Apr 27, 2023

I think just for consistency it makes sense to retain VersionRequest.id even though there's no real risk of getting concurrent version requests confused with one another.

@larsgrefer
Contributor

The proto decoding and encoding never seemed to be a performance bottleneck for me, but if you have problems there or see the potential for performance improvements, we can definitely look into this.

Have you already done some tests to see how much improvement is possible here?

I'm also not sure it's a good idea that the wire transfer now becomes more than just protobuf.

The Java protobuf library has dedicated support for writing and reading the currently used varint-delimited messages, which allows me to implement the actual stream communication without any custom code:

https://github.com/larsgrefer/dart-sass-java/blob/cf28c3e3b8b5c43d6c8b6b23c32988135c64de06/sass-embedded-host/src/main/java/de/larsgrefer/sass/embedded/connection/StreamConnection.java#L24-L41

So the current protocol is just "plain old" protobuf on a stream, with no additional I/O logic needed.
I only need to work with type-safe, compile-safe protobuf message objects, and the protobuf library handles the I/O for me. With this proposal we'd get a new layer of complexity, as the wire protocol would be more than just protobuf.

@ntkme
Contributor Author

ntkme commented Apr 27, 2023

The performance issue is mainly on dart-sdk/dart-sass-embedded. Here are some related issues and discussions:

sass/dart-sass#1959
dart-lang/sdk#44804
dart-lang/sdk#52121

@nex3
Contributor

nex3 commented Apr 27, 2023

Have you already done some tests to see how much improvement is possible here?

It's difficult to get meaningful test results without a real app that uses custom importers/functions and has a high degree of concurrency. It's certainly the case that double-parsing the protobufs to dispatch between workers is wasted effort.

That said, we could also potentially mitigate this by having the dispatcher iterate through the raw bytes to find the message's ID without fully parsing it. This would be more complex to implement on the compiler's side—it effectively requires us to write a protobuf "lexer"—but a single-pass traversal should still be fairly efficient, especially if (as I suspect) serializers will tend to put the ID at the beginning of the message. And, crucially, it wouldn't require a change from the current wire format. @ntkme what do you think of that possibility?

@ntkme
Contributor Author

ntkme commented Apr 27, 2023

The complexity is not only on the compiler side. For example, the Ruby host already has a proper multithreaded implementation, but it is not truly multithreaded unless running on JRuby or TruffleRuby, which do not have a global interpreter lock. In the future I would like to move to a Ractor-based model, which provides true concurrency without GIL limitations.

In the Ractor model, it faces the problem that protobuf Ruby objects are not shareable, so there is the exact same double-parsing problem. In other words, it's not only that the compiler needs to implement a "lexer" to peek into the message to find the ID; the host can face the same issue, depending on language features and limitations.

@larsgrefer
Contributor

Would it be possible or helpful to implement a container object like this?

message Container {
  uint32 id = 1;

  // Serialized InboundMessage or OutboundMessage
  bytes message = 2;
}

I'd hope that such a message would be fast and easy to encode and decode on the main dispatching thread.
The actual message bytes could then be passed to the worker threads.
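For reference, here is what that Container would look like on the wire (manual Python encoding, using the field numbers from the snippet above; `encode_container` is an illustrative helper). Note that in most protobuf runtimes, decoding such a container materializes the `message` field as a separate byte buffer:

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        out.append(b | 0x80 if value else b)
        if not value:
            return bytes(out)

def encode_container(msg_id: int, payload: bytes) -> bytes:
    """Serialize Container {uint32 id = 1; bytes message = 2;} by hand."""
    tag_id = encode_varint(1 << 3 | 0)   # field 1, wire type 0 (varint)
    tag_msg = encode_varint(2 << 3 | 2)  # field 2, wire type 2 (length-delimited)
    return (tag_id + encode_varint(msg_id)
            + tag_msg + encode_varint(len(payload)) + payload)

wire = encode_container(7, b"proto-bytes")
assert wire == b"\x08\x07\x12\x0bproto-bytes"
```

On the wire this is nearly identical to the varint-prefix proposal; the difference shows up on decode, where the embedded `message` bytes come back as a new buffer rather than a view of the input.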

@ntkme
Contributor Author

ntkme commented Apr 27, 2023

I think a container object would be at roughly the same level of implementation complexity as a wire ID.

However, in the actual protobuf implementations for most languages, this would require an extra memory copy of the bytes when decoding the container (with a wire ID we can pass the wire bytes along as is). Memory copies of large payloads can slow performance down a bit.

For example:

  • compileString() with large input
  • Custom importer with large input
  • Large css output
  • Large source map output

More specifically, it would have more impact on embedded-host-node’s legacy API and sassc-embedded’s SassC shim, where the base importer is fully emulated and all imports need to go through stdio.

@nex3
Contributor

nex3 commented Apr 28, 2023

The critical issue with parsing raw protobufs is that the order fields are serialized in is not guaranteed. So the ID could come at the beginning or it could come at the end, and the only way to find it reliably is to understand the wire format well enough to traverse it and locate the appropriate tag. That's certainly not impossible—the protobuf wire format is relatively simple—but it's something any endpoint that wants parallelism would have to implement itself, in addition to some logic for de-duplicating IDs of in-flight requests across workers.

I think ultimately it's simpler to make the transport protocol a little more complex in exchange for making the multiplexing implementations much simpler. I'd rather make it easier to build more performant hosts than less performant ones, given that we're stuck with a trade-off between the two.

We could potentially have two separate protocols to allow some hosts to keep using existing multi-message infrastructure, but honestly I'm not sure how useful that would be. Even if hosts don't have host-side parallelism, they're likely to want to take advantage of compiler-side parallelism.

@larsgrefer
Contributor

Having to implement an additional "raw" protobuf parser just to find the ID, in order to avoid the "real" protobuf parsing, doesn't make sense to me either.

If the additional varint on the wire helps the embedded compiler and/or other host implementations I'm fine with it.

I was just a bit hesitant because I don't have any performance or concurrency problems with protobuf encoding/decoding in my host implementation, so from my point of view the proposed change seemed unnecessary.

@nex3 nex3 added enhancement New feature or request embedded Related to the embedded Sass protocol labels May 17, 2023
@nex3 nex3 transferred this issue from sass/embedded-protocol May 17, 2023
@nex3
Contributor

nex3 commented May 19, 2023

The official proposal is now up at #3582.

@nex3
Contributor

nex3 commented May 19, 2023

The proposal has landed. I'm just going to give it a week for public comment because it's already been discussed in some depth here, and getting this all squared away is blocking Dart Sass releases.

nex3 added a commit that referenced this issue May 30, 2023
@nex3
Contributor

nex3 commented May 31, 2023

As I'm working on implementing this, I became a little concerned again about #2745. As the protocol stands currently, if we want to share state across compilations, the compiler will have to find a way to communicate that state across multiple isolates. For example, if we want to handle a widely-used module that's shared across many entrypoints, we'll either need to serialize each function/mixin/variable call/response or serialize the entire module each time it's loaded. This could have real benefits in terms of parallelism across many compilations with shared state, but it's not obvious that those benefits would outweigh the cross-isolate overhead relative to sequential compilations that are able to directly access the shared modules.

All that said, it's not intrinsically a breaking change to add back CompileRequest.id later on—hosts that don't set it explicitly will just look like they're setting 0 for all compilations, which will be valid since they'll have to use a different wire ID for each request anyway. This leaves open the possibility of bringing back the current compilation_id structure if we decide that that's the best way forward for #2745.

@ntkme ntkme closed this as completed May 31, 2023
@nex3 nex3 reopened this May 31, 2023
@nex3
Contributor

nex3 commented May 31, 2023

I'd like to keep this open until the implementation has landed.
