
Compare callback-based and ReadableStream-based exposure of video MediaStreamTrack media flow #69

Closed
youennf opened this issue May 7, 2021 · 23 comments

Comments

@youennf
Contributor

youennf commented May 7, 2021

MediaStreamTrack encapsulates a flow of video frames, and some applications might want to be notified when a new video frame is produced:

  • Existing applications would like to paint into a canvas for every track frame.
  • Future applications might want to manipulate raw video frames (MediaStreamTrackProcessor use cases).

It would be interesting to evaluate the pros and cons of a callback-based approach vs. a ReadableStream-based approach.

@youennf
Contributor Author

youennf commented May 7, 2021

Some early thoughts:

  • ReadableStream back-pressure mechanism does not seem to fit well with video sources: cameras tend to be push sources, canvas streams also, RTCPeerConnection streams will anyway need to decode all frames.
  • Buffering video frames is potentially not great: memory can grow easily, some sources have a limited buffer pool. It seems best to leave that ability to the web application.
  • If applications cannot keep up with the video frames, they might almost always want to process the latest frame.
  • ReadableStream teeing for video frames is not really useful but is a potential footgun.

Let's now focus on a callback-based algorithm along those lines:

  • If a new frame is generated, store it in a 'frame' slot and enqueue a task to fire an event
  • In that task, if the 'frame' slot is not null, fire the event and nullify the 'frame' slot. If the 'frame' slot is null, abort these steps.
    With this kind of algorithm, if the web app cannot keep up with the frame rate, the event mechanism will adapt automatically.
    Contrary to WebSocket, where messages might overflow the application, if too many video frames are generated for the application to keep up, frames will be dropped and the event queue will stay slim: a single event will signal the latest frame (and potentially all the dropped frames if that would be useful).
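
For illustration, a usage sketch of what such a callback-based API could look like from the application side (the 'videoframe' event name and its frame attribute are hypothetical, not a specified API):

// Hypothetical event-based delivery: only the most recent frame is delivered;
// frames produced while the handler was still busy have been dropped by the UA.
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();
const canvas = document.querySelector('canvas');
const ctx = canvas.getContext('2d');

track.addEventListener('videoframe', (event) => {
  ctx.drawImage(event.frame, 0, 0, canvas.width, canvas.height);
  event.frame.close(); // assuming WebCodecs-style VideoFrame lifetime
});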

@youennf
Contributor Author

youennf commented May 7, 2021

Another thing that makes more sense with a callback-based approach than with a ReadableStream approach is the possibility to expose a video frame object that only stays valid for the duration of the callback.
It seems it would allow some decent functionality (encode the frame, render it, convert it to RGB...) while limiting the risk of keeping big objects around.

The application could also explicitly retain the object (hence a potential memory copy?) for a longer period of time, or copy it into a BYO buffer that is later processed through WASM.
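
As a rough sketch, still assuming the hypothetical 'videoframe' event above (with `track` as in the previous sketch) and WebCodecs-style VideoFrame semantics, where clone() and copyTo() are the explicit-retention mechanisms:

let retainedFrame = null; // explicitly retained copy, owned by the application

track.addEventListener('videoframe', async (event) => {
  const frame = event.frame; // only valid for the duration of the callback

  // Keep the frame beyond the callback: explicit retention, possibly a copy.
  retainedFrame?.close();
  retainedFrame = frame.clone();

  // Or copy the pixels into an application-owned (BYO) buffer for later WASM processing.
  const buffer = new ArrayBuffer(frame.allocationSize());
  await frame.copyTo(buffer);
});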

This requires deeper analysis on whether this is a good idea.

@youennf
Contributor Author

youennf commented May 7, 2021

One advantage of ReadableStream is the possibility to use pipeTo, for instance piping to native objects.
This is for instance useful in WebRTC encoded transforms where the script transform can easily use the sframe transform.

That said, MediaStreamTrack already has pipeTo operations, for RTCPeerConnection, MediaRecorder or HTMLMediaElement for instance. It seems that additional native objects could use directly a MediaStreamTrack instead of a ReadableStream.
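
For reference, these existing "pipeTo-like" operations already consume a MediaStreamTrack directly, without any ReadableStream involved:

const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [track] = stream.getVideoTracks();

// HTMLMediaElement sink
document.querySelector('video').srcObject = new MediaStream([track]);

// RTCPeerConnection sink
const pc = new RTCPeerConnection();
pc.addTrack(track, stream);

// MediaRecorder sink
const recorder = new MediaRecorder(new MediaStream([track]));
recorder.start();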

@guidou
Contributor

guidou commented May 10, 2021

Some early thoughts:

  • ReadableStream back-pressure mechanism does not seem to fit well with video sources: cameras tend to be push sources, canvas streams also, RTCPeerConnection streams will anyway need to decode all frames.

I think back-pressure fits well with video sources. It could be used, for example, to make a screen capturer to stop producing frames if they are not being read. For camera capturers, back-pressure could be used to control a power-saving mode, for example.

  • Buffering video frames is potentially not great: memory can grow easily, some sources have a limited buffer pool. It seems best to leave that ability to the web application.
  • If applications cannot keep up with the video frames, they might almost always want to process the latest frame.
  • ReadableStream teeing for video frames is not really useful but is a potential footgun.

Let's now focus on a callback-based algorithm along those lines:

  • If a new frame is generated, store it in a 'frame' slot and enqueue a task to fire an event
  • In that task, if the 'frame' slot is not null, fire the event and nullify the 'frame' slot. If the 'frame' slot is null, abort these steps.
    With this kind of algorithm, if the web app cannot keep up with the frame rate, the event mechanism will adapt automatically.
    Contrary to WebSocket, where messages might overflow the application, if too many video frames are generated for the application to keep up, frames will be dropped and the event queue will stay slim: a single event will signal the latest frame (and potentially all the dropped frames if that would be useful).

This description looks similar to what would happen in a (stream-based) MediaStreamTrackProcessor with a maxBufferSize of 1, except that in MSTP the enqueuing of tasks on every new frame is unnecessary and does not occur.
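
For comparison, a sketch of that configuration as proposed (the { track, maxBufferSize } constructor shape follows the MSTP proposal under discussion and may change):

// `track` is a video MediaStreamTrack obtained elsewhere (e.g. getUserMedia).
const processor = new MediaStreamTrackProcessor({ track, maxBufferSize: 1 });
const reader = processor.readable.getReader();

while (true) {
  const { value: frame, done } = await reader.read();
  if (done) break;
  // Only the most recent frame is read; older unread frames were dropped
  // by the processor's internal buffer.
  frame.close();
}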

@youennf
Contributor Author

youennf commented May 11, 2021

I think back-pressure fits well with video sources. It could be used, for example, to make a screen capturer to stop producing frames if they are not being read. For camera capturers, back-pressure could be used to control a power-saving mode, for example.

In that case, the application might end up getting old content. If an application reads 1 frame per second, it might receive a 1 second old frame if the capturer stops capturing until asked to. Applications will probably often prefer the freshest content.

Or we go into a pull-based model (capture when web page asks to capture). In that case, a dedicated API would be more suitable.

As for power-saving mode, applications can already tune the frame rate to adapt the source throughput.
The UA could also decide to decrease the frame rate on its own if it sees that a lot of frames are dropped.

This description looks similar to what would happen in a (stream-based) MediaStreamTrackProcessor with a maxBufferSize of 1, except that in MSTP the enqueuing of tasks on every new frame is unnecessary and does not occur.

Right, the web application would not be able to control the exact maxBufferSize. I guess web page could provide a hint in case it knows that processing for most frames will be quick except for a few frames.

As for the enqueuing of tasks, this is spec language; implementations can do whatever they actually want.
It seems that in general, you end up with either a new microtask or a new task for each exposed frame, which hopefully should match each generated frame.

@guidou
Contributor

guidou commented May 11, 2021

I think back-pressure fits well with video sources. It could be used, for example, to make a screen capturer to stop producing frames if they are not being read. For camera capturers, back-pressure could be used to control a power-saving mode, for example.

In that case, the application might end up getting old content. If an application reads 1 frame per second, it might receive a 1 second old frame if the capturer stops capturing until asked to. Applications will probably often prefer the freshest content.

Or we go into a pull-based model (capture when web page asks to capture). In that case, a dedicated API would be more suitable.

I do not quite get your point here, but the application would always be getting the latest available frame.
Also, readable streams are a pull-based model.

As for power-saving mode, applications can already tune the frame rate to adapt the source throughput.
The UA could also decide to decrease the frame rate on its own if it sees that a lot of frames are dropped.

Of course. I'm just replying to the argument that back-pressure does not fit the MediaStreamTrack model.
The fact that you can implement a similar thing in a different way doesn't contradict this.

This description looks similar to what would happen in a (stream-based) MediaStreamTrackProcessor with a maxBufferSize of 1, except that in MSTP the enqueuing of tasks on every new frame is unnecessary and does not occur.

Right, the web application would not be able to control the exact maxBufferSize. I guess web page could provide a hint in case it knows that processing for most frames will be quick except for a few frames.

Are you talking about your proposal (fixed size of 1?) or about the proposed MediaStreamTrackProcessor (maxBufferSize configurable via the constructor)?
Either way, I think this particular aspect is orthogonal to callback-based vs stream-based. You can have largely the same type of buffering in both cases, whether it's a fixed size, a configurable immutable size, or even one that the application can change dynamically via a setter.
The only difference I see is that in your event proposal the minimum maxBufferSize is effectively 2, since you can always have one in the slot and one queued in the event loop, while with streams the minimum maximum buffer size can effectively be 1. Although small, having the ability to have a maxBufferSize of 1 is an advantage of the stream-based approach over the event-based one.
Of course, you can have an effective maxBufferSize of 1 using a different approach. Then again, you can reimplement the whole functionality of streams using a different API.

@youennf
Contributor Author

youennf commented May 11, 2021

I do not quite get your point here, but the application would always be getting the latest available frame.

What I am saying is that, similarly to MSTP, we take a model where we drop video frames if the application is not fast enough to process them. This is really a push model where sources generate a frame every delta, not a pull model where sources generate a frame 'on demand'.
Most sources are push AFAIK: audio, camera, video decoder, screen capture.

As I see it, ReadableStream can more easily decimate the frame rate of a 30 fps MediaStreamTrack by calling reader.read() once per second. As I said, other APIs exist for that (with power efficiency benefits). And this decimation can be shimmed with an event handler (unregistering/re-registering the event handler with a timer).
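
To make the comparison concrete, here is a rough sketch of both decimation styles (the 'videoframe' event remains hypothetical; processFrame stands for application-specific work; `track` is a 30 fps video MediaStreamTrack):

function processFrame(frame) { /* application-specific work */ }

// (a) Stream-based: pull explicitly once per second.
const reader = new MediaStreamTrackProcessor({ track }).readable.getReader();
setInterval(async () => {
  const { value: frame, done } = await reader.read();
  if (!done) { processFrame(frame); frame.close(); }
}, 1000);

// (b) Event-based shim: unregister the handler and re-register it on a timer.
function onFrame(event) {
  track.removeEventListener('videoframe', onFrame);
  processFrame(event.frame);
  setTimeout(() => track.addEventListener('videoframe', onFrame), 1000);
}
track.addEventListener('videoframe', onFrame);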

I am not very clear about which ReadableStream benefits Chrome is currently using in its prototype (except for worker transfer). Can you enumerate them?

Are you talking about your proposal (fixed size of 1?) or about the proposed MediaStreamTrackProcessor (maxBufferSize configurable via the constructor)?

I was not clear. I was thinking that UA could, if they want to, have an internal maxBufferSize. But it would be an implementation detail, not something that surfaces to web pages.

@guidou
Contributor

guidou commented May 11, 2021

I do not quite get your point here, but the application would always be getting the latest available frame.

What I am saying is that, similarly to MSTP, we take a model where we drop video frames if the application is not fast enough to process them. This is really a push model where sources generate a frame every delta, not a pull model where sources generate a frame 'on demand'.
Most sources are push AFAIK: audio, camera, video decoder, screen capture.

I don't see any fundamental advantage or disadvantage between pull-based and push-based approaches if both need to do similar buffering and dropping. Also, we're talking about a sink. Note that some sinks like WebAudio are pull-based too.

As I see it, ReadableStream can more easily decimate the frame rate of a 30 fps MediaStreamTrack by calling reader.read() once per second. As I said, other APIs exist for that (with power efficiency benefits). And this decimation can be shimmed with an event handler (unregistering/re-registering the event handler with a timer).

You can shim both approaches on top of the other. You can always create a stream on JS and use the event handler to make the frames available there, and you can always fire an event containing a frame whenever you read a frame from a stream.
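
Roughly, the two shims could look like this (illustrative only; the 'videoframe' event is hypothetical and onVideoFrame stands for an application-provided callback; `track` is a video MediaStreamTrack):

// (1) A ReadableStream built on top of the event-based API.
const readable = new ReadableStream({
  start(controller) {
    track.addEventListener('videoframe', (event) => {
      // Mirror the lossy model: drop frames when the consumer is behind.
      if (controller.desiredSize > 0) controller.enqueue(event.frame.clone());
    });
  }
}, new CountQueuingStrategy({ highWaterMark: 1 }));

// (2) An event-like callback driven by reading from an MSTP stream.
function onVideoFrame(frame) { /* application-provided callback */ }
const reader = new MediaStreamTrackProcessor({ track }).readable.getReader();
(async () => {
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done) break;
    onVideoFrame(frame);
    frame.close();
  }
})();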

I am not very clear about which ReadableStream benefits Chrome is currently using in its prototype (except for worker transfer). Can you enumerate them?

Here are some benefits I see in MSTP with respect to your preliminary event-based proposal. Of course, these benefits are not necessarily true for any other possible alternative:

  • Transferability is a major benefit of ReadableStreams, since it gives us an alternative that is better than canvas in many cases without having to wait for transferable tracks. We agree that we should pursue transferable tracks, but I expect it will not be a trivial project that will be available in the short term. Also, having a stream-based API to expose frames is totally compatible with transferable tracks, so it does not conflict in any way with that effort.
  • Another benefit of transferable streams over your event-based sketch is that it allows you to keep the track on the main thread and the data flow on a separate thread, which exactly matches how tracks work today. This can be important for some existing complex codebases wishing to add processing capability with minimal disruption to existing complex main-thread code dealing with tracks. (A sketch of this split follows after this list.)
  • Another benefit over your proposed event-based algorithm is that you don't need to buffer two frames, but only one. In fact you can buffer nothing and take advantage of back-pressure to drop all frames until there is a pull signal. This can be more important than it looks, because on some platforms the number of allowed camera frames may be quite low. For example, Chromium on some platforms allows only 3 camera frames. If you buffer 2 and are still processing one you are already temporarily stopping the capturer. Again, this is an advantage over the specific algorithm you sketched, not over all possible alternative approaches, including pull-based or push-based approaches.
  • Another benefit is that streams is an established API on the Web with well-understood underlying mechanisms (including transferability) and programming models. Part of this benefit is reflected as ergonomic integration with transform streams and a stream-based track source (e.g., MSTGenerator). The streaming programming model is in general a good match for when you have a stream (in the generic sense) of data to be processed, which is one of the motivating use cases.
  • Some use cases similar to the decimation example you proposed might be easier to write with streams than with an event-based solution. I see this specific case as a marginal benefit since I expect that most API styles can be shimmed on top of each other.
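
A sketch of the main-thread/worker split mentioned above (transferable streams are used; 'processing-worker.js' is just an illustrative file name; `track` is a video MediaStreamTrack):

// main.js: the track and its control surface stay on the main thread,
// only the frame flow is transferred to the worker.
const processor = new MediaStreamTrackProcessor({ track });
const worker = new Worker('processing-worker.js');
worker.postMessage({ readable: processor.readable }, [processor.readable]);

track.applyConstraints({ frameRate: 15 }); // track-level control stays here
track.onmute = () => console.log('track muted');

// processing-worker.js: consume the transferred stream off the main thread.
onmessage = async ({ data: { readable } }) => {
  const reader = readable.getReader();
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done) break;
    // ...process the frame off the main thread...
    frame.close();
  }
};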

Can you enumerate the actual benefits of an event-based approach (or any other concrete proposal) over using streams?

Are you talking about your proposal (fixed size of 1?) or about the proposed MediaStreamTrackProcessor (maxBufferSize configurable via the constructor)?

I was not clear. I was thinking that UA could, if they want to, have an internal maxBufferSize. But it would be an implementation detail, not something that surfaces to web pages.

I agree. We could amend the MSTP spec to say that if a maxBufferSize is not specified via the constructor, the UA can decide the internal maxBufferSize and even adjust it at runtime.

@youennf
Contributor Author

youennf commented May 12, 2021

I don't see any fundamental advantage or disadvantage between pull-based and push-based approaches if both need to do similar buffering and dropping. Also, we're talking about a sink. Note that some sinks like WebAudio are pull-based too.

I do not think we are talking about sinks. Sinks are renderers (or RTCRtpSender). MediaStreamTrack encapsulates a source and may be consumed by sinks.

Here are some benefits I see in MSTP

Thanks for listing these benefits.

  • Transferability is a major benefit of ReadableStreams

Transferability of ReadableStream is suboptimal compared to transferability of MediaStreamTrack since it transfers only part of the data. An application might want to react to muted/unmuted/ended. An application might want to reduce frame rate because it cannot keep up.
By transferring a ReadableStream, applications will have to split their logic into two parts.

  • Another benefit of transferable streams over your event-based sketch is that it allows you to keep the track on the main thread and the data flow on a separate thread

Can you describe the use case in more detail?
Since you agree this is a complex use case, it is fair to have slightly more complex code there, something like:

  • clone the track
  • transfer the cloned track and do processing on it in a worker
  • when wanting to change constraints in the main thread, postMessage to the worker to also apply the constraints
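
A minimal sketch of those steps, assuming MediaStreamTrack becomes transferable (it is not today) and using illustrative message shapes:

// main.js
const clone = track.clone();
const worker = new Worker('processing-worker.js');
worker.postMessage({ type: 'track', track: clone }, [clone]); // hypothetical track transfer
// Later, when the main thread wants to change constraints:
worker.postMessage({ type: 'constraints', constraints: { frameRate: 15 } });

// processing-worker.js
let workerTrack = null;
onmessage = async ({ data }) => {
  if (data.type === 'track') {
    workerTrack = data.track;
    // ...hook up frame processing on workerTrack here...
  } else if (data.type === 'constraints') {
    await workerTrack.applyConstraints(data.constraints);
  }
};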

  • Another benefit over your proposed event-based algorithm is that you don't need to buffer two frames

I am not sure to follow why UA would need to buffer two frames in either proposal. Can you elaborate?

  • Another benefit is that streams is an established API on the Web with well-understood underlying mechanisms

I disagree.
MSTP API is bigger, more complex and more difficult to learn than a simple event or promise based method.

For instance, developers will need to: create a MSTP, get a readable attribute, call getReader on it to get a reader, then call read() on it to get a chunk and then get a value attribute on the chunk.
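
Spelled out, that call sequence looks like this (the MSTP constructor shape is as proposed and may change; `track` is a video MediaStreamTrack):

const processor = new MediaStreamTrackProcessor({ track }); // 1. create an MSTP
const readable = processor.readable;                        // 2. get the readable attribute
const reader = readable.getReader();                        // 3. get a reader
const chunk = await reader.read();                          // 4. read a chunk
if (!chunk.done) {
  const frame = chunk.value;                                // 5. get the value (a VideoFrame)
  frame.close();
}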

In terms of well-understood underlying mechanisms, MSTP is the first API that would introduce lossy sources.
In particular maxBufferSize is difficult to understand when you come from WhatWG streams because in streams, sources can have a queue and a queue size but MSTP cannot use it since once a chunk is in the ReadableStream queue, the chunk will be delivered if read() is called.

Can you enumerate the actual benefits of an event-based approach (or any other concrete proposal) over using streams?

MediaStreamTrack is already a kind of a 'readable stream'.
Introducing a ReadableStream that derives from a MediaStreamTrack will end up in redundant and mismatched APIs.
For instance MediaStreamTrack.onended event vs. closed promise, or tee vs. clone.

Additionally, API surface is much smaller and easier to learn, with no loss of functionality.
And ReadableStream.tee is a footgun.

If you look at some potential benefits of ReadableStream, they do not apply here:

  • pipeTo or the ability for native APIs to consume ReadableStream: this is not used here since native APIs are consuming MediaStreamTrack which allows more optimization than consuming a ReadableStream of video frames.
  • backpressure: given the lossy nature of MSTP, backpressure is not done automatically from following the streams spec algorithms, like would be the case with, say, a ReadableStream on a WebSocket connection which would, at some point, block the sender from sending messages using TCP flow control.

@guidou
Contributor

guidou commented May 12, 2021

I don't see any fundamental advantage or disadvantage between pull-based and push-based approaches if both need to do similar buffering and dropping. Also, we're talking about a sink. Note that some sinks like WebAudio are pull-based too.

I do not think we are talking about sinks. Sinks are renderers (or RTCRtpSender). MediaStreamTrack encapsulates a source and may be consumed by sinks.

Yes, I am talking about sinks. MSTP is not a track. MSTP is a sink that, when connected to a track, makes the underlying stream of data sent by the track available as a ReadableStream. The purpose of MSTP is to allow the application to take advantage of this mechanism to create a custom sink. For example, a ReadableStream together with WebCodecs and WebTransport is a sink that is very similar to a peer connection. This is an example currently being experimented with.

Here are some benefits I see in MSTP

Thanks for listing these benefits.

  • Transferability is a major benefit of ReadableStreams

Transferability of ReadableStream is suboptimal compared to transferability of MediaStreamTrack since it transfers only part of the data. An application might want to react to muted/unmuted/ended. An application might want to reduce frame rate because it cannot keep up.

First off, it's not suboptimal at all. And second, it is orthogonal to transferring the track. If an application wants to move the whole logic to the worker, it can move the track. If an application wants to move only the data flow and keep the control logic on the main thread it can move just the ReadableStream.
Note that all MediaStreamTrack applications today work by having the control logic on the main thread and the data flow off the main thread, since that is how MediaStreamTracks work today. This is exactly the model transferable streams provides and it most likely makes it easier to migrate applications that want to add processing off the main thread without changing their existing logic, which includes listening to muted/unmuted/ended.

By transferring a ReadableStream, applications will have to split their logic into two parts.

And that is exactly what I would expect most applications will want.
If you have an existing complex codebase doing a lot of logic with tracks on the main thread (which is the only possible way today) and they just want to add funny hats, they can transfer the streams to a worker, put only the funny-hat logic there and keep their existing logic for handling the track on the main thread (e.g., connecting the track to a media element or peer connection).

  • Another benefit of transferable streams over your event-based sketch is that it allows you to keep the track on the main thread and the data flow on a separate thread

Can you describe the use case in more detail?

See above. Any existing application that just wants to do processing of the underlying frames on a worker but does not want to move the whole track-related logic to the worker. A track on a worker is a lot less useful than a track on a main thread because many of the existing platform sinks (e.g., peer connections and media elements) are not available on workers, so I would expect that many applications will prefer to keep tracks on the main thread.

Since you agree this is a complex use case, it is fair to have slightly more complex code there, something like:

  • clone the track
  • transfer the cloned track and do processing on it in a worker
  • when wanting to change constraints in the main thread, postMessage to the worker to also apply the constraints

It's not clear at all to me what you mean here because you haven't explained this together with a more complete proposal.
Let's say you clone a track and transfer the clone to the worker to add funny hats. Where do the events fire? Where do the frames with the funny hats end up? In the cloned track? In a new track (using an MSTGenerator-like object but without streams)? In all tracks connected to the source of the clone, somehow substituting the original source frames? In something that is not a track at all?
If you follow any of the first two approaches (which are the simpler ones IMO), you won't be able to connect the track to an RTCPeerConnection or media element because you can't transfer those objects to the worker. None of these issues occur when only the streams are transferred as this maintains the original model of having the track on the main window and the data flow off the main thread.

  • Another benefit over your proposed event-based algorithm is that you don't need to buffer two frames

I am not sure to follow why UA would need to buffer two frames in either proposal. Can you elaborate?

It would keep one on the slot and another one on the event handler, but you're right that this should be interpreted as only one buffered frame since the other one is already available for consumption. Somehow I misinterpreted 'firing the event' in your algorithm as 'starting a task to fire the event'. Still, with MSTP you can have zero buffering if you make frames available to the stream when the pull signal is provided and drop frames when it's not.
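
As an illustration of that zero-buffering idea, here is a JS shim of a pull-driven lossy source (in MSTP this logic would live inside the user agent; onCapturedFrame stands for a hypothetical per-frame capturer hook):

let deliverFrame = null; // resolver for the pending pull, if any

const readable = new ReadableStream({
  pull(controller) {
    // Resolve only once a frame has been delivered, so pull() is not re-invoked
    // while we wait for the capturer.
    return new Promise((resolve) => {
      deliverFrame = (frame) => { controller.enqueue(frame); resolve(); };
    });
  }
}, { highWaterMark: 0 });

// Hypothetical hook invoked by the capturer for each produced frame.
function onCapturedFrame(frame) {
  if (deliverFrame) {
    deliverFrame(frame); // a read() is pending: deliver the frame
    deliverFrame = null;
  } else {
    frame.close();       // nobody is waiting: drop it, zero frames buffered
  }
}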

  • Another benefit is that streams is an established API on the Web with well-understood underlying mechanisms

I disagree.
MSTP API is bigger, more complex and more difficult to learn than a simple event or promise based method.

For instance, developers will need to: create a MSTP, get a readable attribute, call getReader on it to get a reader, then call read() on it to get a chunk and then get a value attribute on the chunk.

I disagree that this is particularly complex or difficult, even less so if we consider that this is an already established pattern.

In terms of well-understood underlying mechanisms, MSTP is the first API that would introduce lossy sources.

For the code doing the read, sources can always be lossy since the controller in a stream can also drop chunks. Also, stream sources can be written in JS and they can be lossy as well.
I doubt that the concept of frames being potentially dropped will be surprising to developers.

In particular maxBufferSize is difficult to understand when you come from WhatWG streams because in streams, sources can have a queue and a queue size but MSTP cannot use it since once a chunk is in the ReadableStream queue, the chunk will be delivered if read() is called.

I don't find it particularly difficult to understand, and developers don't even have to use it explicitly as there will be a default behavior if maxBufferSize is not provided.

Can you enumerate the actual benefits of an event-based approach (or any other concrete proposal) over using streams?

MediaStreamTrack is already a kind of a 'readable stream'.

I don't think that's the case. You can't read data flowing through a track as you can do in any stream system. AFAICT MediaStreamTrack has never had the objective of providing stream-like functionality.
I see a track as an object that allows connecting an underlying stream of media originating from a source to some sinks. A track allows controlling how the (invisible) stream is connected between sources and sinks, but it is not itself a stream.

Introducing a ReadableStream that derives from a MediaStreamTrack will end up in redundant and mismatched APIs.
For instance MediaStreamTrack.onended event vs. closed promise, or tee vs. clone.

What do you mean by redundant and mismatched APIs? MSTP (or its readable stream) is not a track and does not intend to be a substitute for a track. It is a sink.
Saying that the ReadableStream methods are redundant is equivalent to saying that the methods or events of other sinks are redundant. Why would closing a stream be redundant but not closing a peer connection? Is the ended event of the media element redundant with the ended event of the track?
Tracks and sinks are different abstractions with different purposes.

Additionally, API surface is much smaller and easier to learn, with no loss of functionality.

Like I said before, I don't know exactly what a concrete proposal based on transferred tracks and events looks like. Thus, it's not clear to me in which object these events containing the frames are generated, or how processed frames are put back on an existing or new track (or something else entirely). This makes it completely impossible to know if it's actually easier to use or if there is loss of functionality.
If the approach is just to replace the readable stream in MSTP with an event handler, and the writable stream in MSTG with a method to write frames (with MSTG and MSTP visible only in workers), then I would say that there is a huge amount of functionality that is lost. But I'm very interested in seeing a more complete version of your proposal.

And ReadableStream.tee is a footgun.

One that you don't have to use to support any of the proposed use cases. If you want multiple readable streams for the same track, you can use multiple MSTPs.

If you look at some potential benefits of ReadableStream, they do not apply here:

  • pipeTo or the ability for native APIs to consume ReadableStream: this is not used here since native APIs are consuming MediaStreamTrack which allows more optimization than consuming a ReadableStream of video frames.

ReadableStream is not a replacement for track and is not intended to be used as one. It is a sink, so I don't really understand this concern. Yes, you can't play a ReadableStream on a media element, just like you can't play a peer connection (or an event handler or some arbitrary callback) in it.
You can still use tracks, including an MSTG (which is a track), with any API that accepts tracks.

  • backpressure: given the lossy nature of MSTP, backpressure is not done automatically from following the streams spec algorithms, like would be the case with, say, a ReadableStream on a WebSocket connection which would, at some point, block the sender from sending messages using TCP flow control.

It is correct that the use of the circular buffer in MSTP means that not all of the standard backpressure mechanism is usable, but parts of it (e.g., the pull signal) are. For example, an implementation could pause a capturer when the buffer is full and use the pull signal to restart it. Different strategies with the pull signal can be used for different capturers with different characteristics.

Note that you didn't really answer my question about the actual benefits of an event-based approach. You mainly mentioned some shortcomings you see in the stream-based approach. I'm particularly interested in knowing how the event-based approach allows doing everything in a worker while allowing connecting a track carrying the processed frames to a platform sink running on the main thread and how that is actually better than the stream-based approach.

@youennf
Contributor Author

youennf commented May 17, 2021

That is good discussion, and very important to have before we dive into API specifics.
It seems like something we should discuss either at next interim or in editor call.

Note that you didn't really answer my question about the actual benefits of an event-based approach.

I am not really talking about an event-based approach, I am merely mentioning this as an alternative amongst several others.

About potential alternative benefits, I thought I already did. I anticipate: a more lightweight and simpler API, fewer footguns, and closer alignment with the MediaStreamTrack model as defined by the mediacapture-main spec, which probably makes it easier to implement in all browsers and more extensible in the future should there be a need.

I'm particularly interested in knowing how the event-based approach allows doing everything in a worker while allowing connecting a track carrying the processed frames to a platform sink running on the main thread and how that is actually better than the stream-based approach.

I only talked about MSTP, not MSTG here, which seems to be what you are referring to.
The same question applies to the usefulness of defining a MediaStreamTrack source as a WritableStream, compared to alternatives, say an UnderlyingSink for instance.

@guidou
Contributor

guidou commented May 18, 2021

I only talked about MSTP, not MSTG here, which seems to be what you are referring to.
The same question applies in the usefulness of defining a MediaStreamTrack source as a WritableStream, compared to alternatives, say UnderlyingSink for instance.

I'm talking about what is needed to support the intended use cases, which includes MSTP and MSTG. I would assume that your concerns about streams in MSTP also apply to streams in MSTG and my reply to those concerns would be similar.
At this point, I think we have discussed enough about the pros and cons of the stream-based approach and what's left is to consider a concrete alternative proposal that can support the intended use cases so that we can make a comparison.

@youennf
Contributor Author

youennf commented May 19, 2021

Let's do a tentative conclusion first. I think we agree on the following points:

  • ReadableStream backpressure mechanism is not used in MSTP / ReadableStream enqueuing mechanism is not used in MSTP
  • There is no loss in functionality by transferring MediaStreamTrack instead of ReadableStream. There are in fact known advantages to doing the former compared to the latter.
  • Some potential footguns have been identified due to the use of ReadableStream.
  • ReadableStream is a known API but just getting a frame from a MediaStreamTrack through MSTP requires a lot of API calls. To properly compare ease of use, alternative APIs should be considered.

I can take an action to prepare alternate proposals so that we can further discuss pros and cons.

@guidou
Contributor

guidou commented May 19, 2021

Let's do a tentative conclusion first. I think we agree on the following points:

No, we don't agree on these points.

  • ReadableStream backpressure mechanism is not used in MSTP / ReadableStream enqueuing mechanism is not used in MSTP

Since we use a circular queue instead of the stream's noncircular queue, we can't use desiredSize for backpressure. However, shimming desiredSize using the circular queue and a count-queueing strategy is trivial. So backpressure is totally usable in practice.
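
A rough sketch of such a shim (the capturer controls and the per-frame hook are illustrative names, not real APIs):

// Stand-in for platform capturer controls.
const capturer = {
  requestLowerFrameRate() { /* ... */ },
  requestNormalFrameRate() { /* ... */ },
};

let onCapturedFrame; // illustrative hook called by the capturer for each frame

const readable = new ReadableStream({
  start(controller) {
    onCapturedFrame = (frame) => {
      if (controller.desiredSize <= 0) {
        frame.close();                     // consumer is behind: drop (lossy source)
        capturer.requestLowerFrameRate();  // ...and apply backpressure upstream
      } else {
        controller.enqueue(frame);
      }
    };
  },
  pull() {
    capturer.requestNormalFrameRate();     // consumer is reading again
  }
}, new CountQueuingStrategy({ highWaterMark: 3 })); // assumed circular-buffer capacity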

  • There is no loss in functionality by transferring MediaStreamTrack instead of ReadableStream. There are in fact known advantages to doing the former compared to the latter.

We do not know that these claims are true until we have a concrete alternative that uses transferable tracks.
Just making tracks transferable is not enough to support the use cases so any advantages or disadvantages are still unknown.

  • Some potential footguns have been identified due to the use of ReadableStream.
    IIRC, you mentioned that using tee(), which is neither required nor intuitive to support any intended use cases, has disadvantages. Any other "footgun", since you talked in plural?
  • ReadableStream is a known API but just getting a frame from a MediaStreamTrack through MSTP requires a lot of API calls. To properly compare ease of use, alternative APIs should be considered.

Can you clarify your definition of "a lot"?
AFAICT, it's a single call to getReader() and then one read() call per frame.
If you use piping, no explicit calls are needed apart from the initial call to pipeTo()/pipeThrough().
This doesn't look like a lot to me.

I can take an action to prepare alternate proposals so that we can further discuss pros and cons.

I do agree with this one.

@youennf
Contributor Author

youennf commented May 19, 2021

  • ReadableStream backpressure mechanism is not used in MSTP / ReadableStream enqueuing mechanism is not used in MSTP

Since we use a circular queue instead of the stream's noncircular queue, we can't use desiredSize for backpressure. However, shimming desiredSize using the circular queue and a count-queueing strategy is trivial. So backpressure is totally usable in practice.

The backpressure you are referring to is implemented by MediaStreamTrackProcessor, not ReadableStream, since the circular queue is handled by MediaStreamTrackProcessor.
This can be seen by not exposing a ReadableStream but exposing a single read method directly in MediaStreamTrackProcessor.

We do not know that these claims are true until we have a concrete alternative that uses transferable tracks.
Just making tracks transferable is not enough to support the use cases so any advantages or disadvantages are still unknown.

One proposal which I thought you were making is the following: transfer the MediaStreamTrack to a worker and use MediaStreamTrackProcessor in the worker to get the frames.
The main advantage I see with transferring track is that the code doing the processing can interact with the track to do things like decrease frame rate or decrease size, or react to muted/unmuted events...

Can you clarify your definition of "a lot"?

I was referring to https://github.com/w3c/mediacapture-extensions/issues/23#issuecomment-839659431, in particular:

create a MSTP, get a readable attribute, call getReader on it to get a reader, then call read() on it to get a chunk and then get a value attribute on the chunk.

I forgot the chunk.done check before getting chunk.value.
Compare it to an event-based API, which would be reduced to a one-liner: track.onvideoframe = (event) => { /* do processing on event.frame */ }

@guidou
Contributor

guidou commented May 19, 2021

  • ReadableStream backpressure mechanism is not used in MSTP / ReadableStream enqueuing mechanism is not used in MSTP

Since we use a circular queue instead of the stream's noncircular queue, we can't use desiredSize for backpressure. However, shimming desiredSize using the circular queue and a count-queueing strategy is trivial. So backpressure is totally usable in practice.

The backpressure you are referring to is implemented by MediaStreamTrackProcessor, not ReadableStream, since the circular queue is handled by MediaStreamTrackProcessor.

There is no practical difference, since the desiredSize value is used by the underlying source of the stream.

This can be seen by not exposing a ReadableStream but exposing a single read method directly in MediaStreamTrackProcessor.

We do not know that these claims are true until we have a concrete alternative that uses transferable tracks.
Just making tracks transferable is not enough to support the use cases so any advantages or disadvantages are still unknown.

One proposal which I thought you were making is the following: transfer the MediaStreamTrack to a worker and use MediaStreamTrackProcessor in the worker to get the frames.

I'm in favor of making the track transferable so I do support that approach. However, my proposal was to keep MSTP with the stream interface. Since this discussion is about replacing the streams interface with something else we can assume that any benefit of making the track transferable applies to both approaches.

The main advantage I see with transferring track is that the code doing the processing can interact with the track to do things like decrease frame rate or decrease size, or react to muted/unmuted events...

That's an advantage if you want to have that logic in the worker and a disadvantage if you want to have that logic on the main thread. The streams approach supports both since you can choose to transfer the track or just the stream.

Can you clarify your definition of "a lot"?

I was referring to #23 (comment), in particular:

create a MSTP, get a readable attribute, call getReader on it to get a reader, then call read() on it to get a chunk and then get a value attribute on the chunk.

I forgot the chunk.done check before getting chunk.value.

My disagreement here is that I don't see that as "a lot". It's pretty straightforward code similar to that of any stream-based system.

Compare it to an event-based API, which would be reduced to a one-liner: track.onvideoframe = (event) => { /* do processing on event.frame */ }

Using the pipeThrough approach (which I think will be more common) is similar:

const transform = new TransformStream({
  transform(frame, controller) {
    // do processing on frame
  },
});

Note that you get the frame directly instead of an event from which you have to get the frame, so it's arguably fewer calls than the event-based approach.
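
For completeness, the end-to-end pipeline in the stream-based proposal would look roughly like this (MediaStreamTrackGenerator is the proposed stream-based track source; exact shapes may change; `track` is a video MediaStreamTrack):

const processor = new MediaStreamTrackProcessor({ track });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });

const transform = new TransformStream({
  transform(frame, controller) {
    // ...produce a processed VideoFrame and enqueue it...
    controller.enqueue(frame);
  },
});

processor.readable.pipeThrough(transform).pipeTo(generator.writable);

// The generator is itself a track, so it can be handed to any platform sink.
document.querySelector('video').srcObject = new MediaStream([generator]);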

@youennf
Contributor Author

youennf commented May 19, 2021

There is no practical difference, since the desiredSize value is used by the underlying source of the stream.

The desiredSize for the MSTP readable stream is probably 0 so that the readable stream does not buffer any video frame.
That way the circular buffer is in full control and the readable stream is not doing much in practice.

  • There is no loss in functionality by transferring MediaStreamTrack instead of ReadableStream.

Oh, I see, my phrasing was ambiguous. Let's put it this way:
There is no loss in functionality by transferring MediaStreamTrack instead of transferring ReadableStream.

Do you agree with the new phrasing?

My disagreement here is that I don't see that as "a lot". It's pretty straightforward code similar to that of any stream-based system.

I think the code would be more like:
const processor = new MediaStreamTrackProcessor(track, size);
const transform = new TransformStream({
  transform(frame, controller) { /* do processing */ }
});
processor.readable.pipeThrough(transform);

This is 5 lines of code compared to 1 line.
This is 2 new objects that the developer needs to construct explicitly plus 3 implicit objects compared to zero.
This is 2 full APIs to understand vs a single event handler.

There needs to be a very good justification for such API complexity.

@guidou
Contributor

guidou commented May 20, 2021

There is no practical difference, since the desiredSize value is used by the underlying source of the stream.

The desiredSize for the MSTP readable stream is probably 0 so that the readable stream does not buffer any video frame.
That way the circular buffer is in full control and the readable stream is not doing much in practice.

If you start buffering, desiredSize becomes negative, which means the stream underlying source might want to do something to slow down the frame rate to move desiredSize back to zero. Depending on the MediaStream source, the underlying source might take different actions. For example, pause a screen capturer, adjust the frame rate of the camera, etc.

Also, it could be that there are some cases where having a positive desiredSize is acceptable. For example, if the MediaStream source is not a camera and the sink is not sensitive to latency (e.g, MediaRecorder).

All these are valid and possible uses of backpressure, which is executed by the stream underlying source. Whether you consider the underlying source as part of the stream or the processor or nothing (you say the stream is not doing much in practice) is not really important. That is how the backpressure model is intended to work in the streams model.

  • There is no loss in functionality by transferring MediaStreamTrack instead of ReadableStream.

Oh, I see, my phrasing was ambiguous. Let's put it this way:
There is no loss in functionality by transferring MediaStreamTrack instead of transferring ReadableStream.

Do you agree with the new phrasing?

I don't. You lose the ability to directly control the track from the main thread, which some applications might want to keep (tracks on main, and media off main has always been the model). You also lose the ability to send the track to a platform sink, which might be useful for some applications (e.g., show before and after effects). Yes, there are workarounds to these, but they're not for free, so some functionality is lost.

My disagreement here is that I don't see that as "a lot". It's pretty straightforward code similar to that of any stream-based system.

I think the code would be more like:
const processor = new MediaStreamTrackProcessor(track, size);
const transform = new TransformStream({
  transform(frame, controller) { /* do processing */ }
});
processor.readable.pipeThrough(transform);

This is 5 lines of code compared to 1 line.
This is 2 new objects that the developer needs to construct explicitly plus 3 implicit objects compared to zero.
This is 2 full APIs to understand vs a single event handler.

There needs to be a very good justification for such API complexity.

Your phrasing here is totally misleading. You're claiming "such API complexity" while in practice no developer would have trouble understanding the stream example. Moreover, you're making a comparison using an empty example that does nothing useful. What does your non-stream-based code look like if you have to produce output and make it available on a track on the main thread?
I doubt it will be as simple as just adding pipeTo to the last line of your stream example.
So, before you start making claims about API complexity, come up with a complete proposal.

@youennf
Contributor Author

youennf commented May 20, 2021

All these are valid and possible uses of backpressure, which is executed by the stream underlying source.

It seems we are not able to converge on this part of the discussion.
To make progress, we could define the different backpressure occurrences (for instance, application cannot keep up with frame processing vs. application not wanting to process frames) and how much each one is desired (P1, P2..) in the context of video processing.
We might also want input from others.

For example, if the MediaStream source is not a camera and the sink is not sensitive to latency (e.g, MediaRecorder).

The issue is not really about latency but about buffering.
If frames are from a buffer pool and we want no memory copies, we want to limit buffering as much as possible. Maybe we will fail there and we will need to acknowledge this in the API (memory copy under the hood if needed). There is a desire to try achieving this or be as close as possible.

I am also fuzzy about the sources that would be able to apply backpressure.
Canvas capture is in the hands of the JS application. The RTCPeerConnection decoder needs to continue decoding. Applications prefer fresh (non-outdated) frames for live sources like camera or screen capture.

in practice no developer would have trouble understanding the stream example.

I have first-hand experience of developers having difficulties understanding the TransformStream part of your example.
In particular, the impact of returning a Promise as part of the transform value. Also the impact of queuing strategies.

I would also say that streams in general make it very easy to do buffering. This is very handy in general but potentially troublesome with video frames.

What does your non-stream-based code look like if you have to produce output and make it available on a track on the main thread?

That is a fair point, I will try to come up with alternative APIs so that we can best compare full-transform use cases.
Not sure I will have time for next interim but hopefully sometime next week.

That said, this does not invalidate the comparison for use cases that do not require creating a transformed MediaStreamTrack, say doing object recognition, or applying a filter before encoding through WebCodecs.

@guidou
Contributor

guidou commented May 20, 2021

All these are valid and possible uses of backpressure, which is executed by the stream underlying source.

It seems we are not able to converge on this part of the discussion.
To make progress, we could define the different backpressure occurrences (for instance, application cannot keep up with frame processing vs. application not wanting to process frames) and how much each one is desired (P1, P2..) in the context of video processing.
We might also want input from others.

Agree. I was just objecting to the claim that backpressure does not or cannot happen.

For example, if the MediaStream source is not a camera and the sink is not sensitive to latency (e.g, MediaRecorder).

The issue is not really about latency but about buffering.
If frames are from a buffer pool and we want no memory copies, we want to limit buffering as much as possible. Maybe we will fail there and we will need to acknowledge this in the API (memory copy under the hood if needed). There is a desire to try achieving this or be as close as possible.

The example I gave said input not from a camera, since that's the more common case for a buffer pool, but I should have been more explicit. How to handle buffering is orthogonal to streams vs. no streams. We can discuss it further, but probably in a different GitHub issue, and whatever conclusion we arrive at will most likely apply to any mechanism we use to expose frames.

I am also fuzzy about the sources that would be able to apply backpressure.
Canvas capture is in the hands of the JS application. The RTCPeerConnection decoder needs to continue decoding. Applications prefer fresh (non-outdated) frames for live sources like camera or screen capture.

Since the UA knows the sources involved, I think we should be able to let the UA choose strategies (while also allowing the user some amount of control if desired). Like I said before, this is something worth discussing in a separate issue.

in practice no developer would have trouble understanding the stream example.

I have first-hand experience of developers having difficulties understanding the TransformStream part of your example.
In particular, the impact of returning a Promise as part of the transform value. Also the impact of queuing strategies.

You can always find developers that don't understand a particular concept, especially if they're unfamiliar with it. That doesn't mean it's difficult to learn or complex in general. It is not necessary to understand all the details about queuing strategies and other concepts to use streams effectively in practice, just like it wouldn't be necessary to understand any internal techniques or optimizations that might be applied to an event-based solution.

I would also say that streams in general make it very easy to do buffering. This is very handy in general but potentially troublesome with video frames.

In this case, the streams are provided by the UA, so the UA can make sure no unnecessary buffering occurs.

What does your non-stream-based code look like if you have to produce output and make it available on a track on the main thread?

That is a fair point, I will try to come up with alternative APIs so that we can best compare full-transform use cases.
Not sure I will have time for next interim but hopefully sometime next week.

That said, this does not invalidate the comparison for use cases that do not require creating a transformed MediaStreamTrack, say doing object recognition, or applying a filter before encoding through WebCodecs.

Looking forward to the full proposal.

@youennf
Contributor Author

youennf commented Aug 5, 2021

Some issues identified while drilling into how to make streams appropriate for a realtime video pipeline:
whatwg/streams#1157
whatwg/streams#1156
whatwg/streams#1155
#56

@dontcallmedom
Member

I believe the ongoing work on WHATWG Streams is expected to resolve the challenges identified in this issue - can it be closed?

dontcallmedom transferred this issue from w3c/mediacapture-extensions on Jan 4, 2022
@alvestrand
Contributor

Closing issue with the assumption that solutions in the WHATWG Streams space will solve these problems with no further modifications to the mediacapture-transform spec.
