Compare callback-based and ReadableStream-based exposure of video MediaStreamTrack media flow #69
Some early thoughts:
Let's now focus on a callback-based algorithm along those lines:
Another thing that makes more sense with a callback-based approach vs. the ReadableStream approach is the possibility to expose a video frame object that only stays valid for the duration of the callback. The application could also explicitly retain the object (hence a potential memory copy?) for a longer period of time, or in a BYO buffer that is later processed through WASM. This requires deeper analysis of whether this is a good idea.
One advantage of ReadableStream is the possibility to use pipeTo, for instance to pipe to native objects. That said, MediaStreamTrack already has pipeTo-like operations, for RTCPeerConnection, MediaRecorder or HTMLMediaElement for instance. It seems that additional native objects could directly use a MediaStreamTrack instead of a ReadableStream.
I think back-pressure fits well with video sources. It could be used, for example, to make a screen capturer stop producing frames if they are not being read. For camera capturers, back-pressure could be used to control a power-saving mode, for example.
This description looks similar to what would happen in a (stream-based) MediaStreamTrackProcessor with a maxBufferSize of 1, except that in MSTP the enqueuing of tasks on every new frame is unnecessary and does not occur.
In that case, the application might end up getting old content. If an application reads 1 frame per second, it might receive a 1-second-old frame if the capturer stops capturing until asked to. Applications will probably often prefer the freshest content. Or we go to a pull-based model (capture when the web page asks to capture); in that case, a dedicated API would be more suitable. As for power-saving mode, applications can already tune the frame rate to adapt the source throughput.
Right, the web application would not be able to control the exact maxBufferSize. I guess the web page could provide a hint in case it knows that processing for most frames will be quick except for a few frames. As for the enqueuing of tasks, this is spec language; implementations can do whatever they actually want.
I do not quite get your point here, but the application would always be getting the latest available frame.
Of course. I'm just replying to the argument that back-pressure does not fit the MediaStreamTrack model.
Are you talking about your proposal (fixed size of 1?) or about the proposed MediaStreamTrackProcessor (maxBufferSize configurable via the constructor)?
What I am saying is that, similarly to MSTP, we take a model where we drop video frames if the application is not fast enough to process them. This is really a push model where sources generate a frame every delta, not a pull model where sources generate a frame 'on demand'. As I see it, ReadableStream can more easily decimate the frame rate of a 30 fps MediaStreamTrack by calling reader.read() once per second. As I said, other APIs exist for that (with power efficiency benefits). And this decimation can be shimmed with an event handler (unregistering/reregistering the event handler with a timer). It is not clear to me which ReadableStream benefits Chrome is currently using in its prototype (except for worker transfer). Can you enumerate them?
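The latest-frame-wins behavior discussed here can be sketched in plain JS. This is a hedged stand-in, not MSTP itself: `makeLatestFrameStream` is a hypothetical helper that keeps a single slot holding only the newest pushed frame (mimicking maxBufferSize = 1), so reading slowly, say once per second, naturally decimates the frame rate:

```javascript
// Sketch (assumed semantics): a source that keeps only the newest frame,
// mimicking a MediaStreamTrackProcessor with maxBufferSize = 1.
// Reading slower than the producer then decimates the frame rate.
function makeLatestFrameStream() {
  let latest = null;   // single slot; overwritten on every push
  let notify = null;   // wakes a pending pull when a frame arrives
  return {
    // Producer side: overwrite the slot, dropping any unread frame.
    push(frame) {
      latest = frame;
      if (notify) { notify(); notify = null; }
    },
    stream: new ReadableStream({
      async pull(controller) {
        while (latest === null) await new Promise(r => (notify = r));
        controller.enqueue(latest);
        latest = null;
      }
    }, { highWaterMark: 0 })  // never buffer inside the stream itself
  };
}
```

With this shape, each reader.read() yields the most recent frame at the time of the read; frames produced in between are silently dropped, which matches the push-with-drop model described above.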
I was not clear. I was thinking that UAs could, if they want to, have an internal maxBufferSize. But it would be an implementation detail, not something that surfaces to web pages.
I don't see any fundamental advantage or disadvantage between pull-based and push-based approaches if both need to do similar buffering and dropping. Also, we're talking about a sink. Note that some sinks like WebAudio are pull-based too.
You can shim both approaches on top of the other. You can always create a stream on JS and use the event handler to make the frames available there, and you can always fire an event containing a frame whenever you read a frame from a stream.
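The two shims described here can be sketched as follows; `onframe`/`onended` are hypothetical callback names used only for illustration, not part of any spec:

```javascript
// Event handler -> ReadableStream: enqueue each delivered frame.
// `source.onframe` / `source.onended` are hypothetical callback hooks.
function streamFromFrameSource(source) {
  return new ReadableStream({
    start(controller) {
      source.onframe = frame => controller.enqueue(frame);
      source.onended = () => controller.close();
    }
  });
}

// ReadableStream -> callbacks: read frames and invoke a callback per frame.
async function firePerFrame(stream, onframe) {
  const reader = stream.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return;
    onframe(value);
  }
}
```

Composing the two round-trips a frame source through a stream and back to callbacks, which is the equivalence claimed above.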
Here are some benefits I see in MSTP with respect to your preliminary event-based proposal. Of course, these benefits are not necessarily true for any other possible alternative:
Can you enumerate the actual benefits of an event-based approach (or any other concrete proposal) over using streams?
I agree. We could amend the MSTP spec to say that if a maxBufferSize is not specified via the constructor, the UA can decide the internal maxBufferSize and even adjust it at runtime.
I do not think we are talking about sinks. Sinks are renderers (or RTCRtpSender). MediaStreamTrack encapsulates a source and may be consumed by sinks.
Thanks for listing these benefits.
Transferability of ReadableStream is suboptimal compared to transferability of MediaStreamTrack since it transfers only part of the data. An application might want to react to muted/unmuted/ended. An application might want to reduce frame rate because it cannot keep up.
Can you describe the usecase in more details?
I am not sure I follow why the UA would need to buffer two frames in either proposal. Can you elaborate?
I disagree. For instance, developers will need to: create an MSTP, get a readable attribute, call getReader on it to get a reader, then call read() on it to get a chunk, and then get a value attribute on the chunk. In terms of well-understood underlying mechanisms, MSTP is the first API that would introduce lossy sources.
MediaStreamTrack is already a kind of 'readable stream'. Additionally, the API surface is much smaller and easier to learn, with no loss of functionality. If you look at some potential benefits of ReadableStream, they do not apply here:
Yes, I am talking about sinks. MSTP is not a track. MSTP is a sink that, when connected to a track, makes the underlying stream of data sent by the track available as a ReadableStream. The purpose of MSTP is to allow the application to take advantage of this mechanism to create a custom sink. For example, a ReadableStream together with WebCodecs and WebTransport is a sink that is very similar to a peer connection. This is an example currently being experimented on.
First off, it's not suboptimal at all. And second, it is orthogonal to transferring the track. If an application wants to move the whole logic to the worker, it can move the track. If an application wants to move only the data flow and keep the control logic on the main thread it can move just the ReadableStream.
And that is exactly what I would expect most applications will want.
See above. Any existing application that just wants to do processing of the underlying frames on a worker but does not want to move the whole track-related logic to the worker. A track on a worker is a lot less useful than a track on a main thread because many of the existing platform sinks (e.g., peer connections and media elements) are not available on workers, so I would expect that many applications will prefer to keep tracks on the main thread.
It's not clear at all to me what you mean here because you haven't explained this together with a more complete proposal.
It would keep one on the slot and another one on the event handler, but you're right that this should be interpreted as only one buffered frame since the other one is already available for consumption. Somehow I misinterpreted 'firing the event' in your algorithm as 'starting a task to fire the event'. Still, with MSTP you can have zero buffering if you make frames available to the stream when the pull signal is provided and drop frames when it's not.
I disagree that this is particularly complex or difficult, even less so if we consider that this is an already established pattern.
For the code doing the read, sources can always be lossy since the controller in a stream can also drop chunks. Also, stream sources can be written in JS and they can be lossy as well.
I don't find it particularly difficult to understand, and developers don't even have to use it explicitly as there will be a default behavior if maxBufferSize is not provided.
I don't think that's the case. You can't read data flowing through a track as you can do in any stream system. AFAICT MediaStreamTrack has never had the objective of providing stream-like functionality.
What do you mean by redundant and mismatched APIs? MSTP (or its readable stream) is not a track and does not intend to be a substitute for a track. It is a sink.
Like I said before, I don't know exactly what a concrete proposal based on transferred tracks and events looks like. Thus, it's not clear to me in which object these events containing the frames are generated, or how processed frames are put back on an existing or new track (or something else entirely). This makes it completely impossible to know if it's actually easier to use or if there is loss of functionality.
One that you don't have to use to support any of the proposed use cases. If you want multiple readable streams for the same track, you can use multiple MSTPs.
ReadableStream is not a replacement for track and is not intended to be used as one. It is a sink, so I don't really understand this concern. Yes, you can't play a ReadableStream on a media element, just like you can't play a peer connection (or an event handler or some arbitrary callback) in it.
It is correct that the use of the circular buffer in MSTP means that not all of the standard backpressure mechanism is usable, but parts of it (e.g., the pull signal) are. For example, an implementation could pause a capturer when the buffer is full and use the pull signal to restart it. Different strategies with the pull signal can be used for different capturers with different characteristics. Note that you didn't really answer my question about the actual benefits of an event-based approach. You mainly mentioned some shortcomings you see in the stream-based approach. I'm particularly interested in knowing how the event-based approach allows doing everything in a worker while allowing connecting a track carrying the processed frames to a platform sink running on the main thread and how that is actually better than the stream-based approach.
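The pause-and-restart strategy sketched in this comment could look roughly like this; `capturer` is a hypothetical object exposing pause()/resume() and an ondata hook, not a real platform API:

```javascript
// Sketch: an underlying source that pauses a (hypothetical) capturer
// after each frame and uses the stream's pull signal to restart it.
function streamFromCapturer(capturer) {
  return new ReadableStream({
    start(controller) {
      capturer.ondata = frame => {
        controller.enqueue(frame);
        capturer.pause();   // stop producing until the reader pulls again
      };
    },
    pull() {
      capturer.resume();    // the reader wants more: restart capture
    }
  }, { highWaterMark: 0 }); // only capture on demand
}
```

With a high water mark of 0, pull fires only when a read is pending, so the capturer produces frames strictly on demand.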
That is good discussion, and very important to have before we dive into API specifics.
I am not really talking about an event-based approach; I am merely mentioning this as an alternative amongst several others. About potential alternative benefits, I thought I already listed them. I anticipate: more lightweight, simpler API, less of a footgun, more in line with the MediaStreamTrack model as defined by the mediacapture-main spec, which probably makes it easier to implement in all browsers and more extensible in the future should there be a need.
I only talked about MSTP here, not MSTG, which seems to be what you are referring to.
I'm talking about what is needed to support the intended use cases, which includes MSTP and MSTG. I would assume that your concerns about streams in MSTP also apply to streams in MSTG and my reply to those concerns would be similar.
Let's do a tentative conclusion first. I think we agree on the following points:
I can take an action to prepare alternate proposals so that we can further discuss pros and cons.
No, we don't agree on these points.
Since we use a circular queue instead of the stream's non-circular queue, we can't use desiredSize for backpressure directly. However, shimming desiredSize using the circular queue and a count-queuing strategy is trivial. So backpressure is totally usable in practice.
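A sketch of that shim, under assumed semantics: the ring buffer lives in the source, the stream itself never buffers (high water mark 0), and a desiredSize analogue is derived from the ring occupancy. `ringBufferedStream` and its `source.desiredSize` getter are illustrative names, not spec'd API:

```javascript
// Sketch: circular queue in the source plus a count-based desiredSize
// analogue, approximating MSTP's maxBufferSize behavior.
function ringBufferedStream(maxBufferSize) {
  const ring = [];
  let wake = null;
  const source = {
    // Producer side: drop the oldest frame when the ring is full.
    push(frame) {
      if (ring.length === maxBufferSize) ring.shift();
      ring.push(frame);
      if (wake) { wake(); wake = null; }
    },
    // Analogue of desiredSize: how many frames fit before we start dropping.
    get desiredSize() { return maxBufferSize - ring.length; }
  };
  const stream = new ReadableStream({
    async pull(controller) {
      while (ring.length === 0) await new Promise(r => (wake = r));
      controller.enqueue(ring.shift());
    }
  }, new CountQueuingStrategy({ highWaterMark: 0 }));
  return { source, stream };
}
```

When the ring is full the derived desiredSize reaches zero, which is the backpressure signal a source could react to (pause capture, lower frame rate, etc.).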
We do not know that these claims are true until we have a concrete alternative that uses transferable tracks.
Can you clarify your definition of "a lot"?
I do agree with this one.
The backpressure you are referring to is implemented by MediaStreamTrackProcessor, not ReadableStream, since the circular queue is handled by MediaStreamTrackProcessor.
One approach, which I thought you were proposing, is the following: transfer the MediaStreamTrack to a worker and use MediaStreamTrackProcessor in the worker to get the frames.
I was referring to https://github.com/w3c/mediacapture-extensions/issues/23#issuecomment-839659431, in particular:
I forgot the chunk.done check before getting chunk.value.
There is no practical difference, since the desiredSize value is used by the underlying source of the stream.
I'm in favor of making the track transferable so I do support that approach. However, my proposal was to keep MSTP with the stream interface. Since this discussion is about replacing the streams interface with something else we can assume that any benefit of making the track transferable applies to both approaches.
That's an advantage if you want to have that logic in the worker and a disadvantage if you want to have that logic on the main thread. The streams approach supports both since you can choose to transfer the track or just the stream.
My disagreement here is that I don't see that as "a lot". It's pretty straightforward code similar to that of any stream-based system.
const transform = new TransformStream({
  transform(frame, controller) {
    controller.enqueue(processFrame(frame));
  }
});
Note that you get the frame directly instead of an event from which you have to get the frame, so it's arguably fewer calls than the event-based approach.
The desiredSize for the MSTP readable stream is probably 0 so that the readable stream does not buffer any video frame.
Oh, I see, my phrasing was ambiguous. Let's put it this way: Do you agree with the new phrasing?
I think the code would be something more like this. It is 5 lines of code compared to 1 line. There needs to be a very good justification for such API complexity.
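The five-line snippet referred to here is not preserved in this extract; a hedged reconstruction, based on the steps enumerated earlier in this thread (MSTP, readable, getReader, read, value, plus the chunk.done check mentioned in a later comment), might look like:

```javascript
// Hedged reconstruction, not the original snippet. `processor` stands
// for a MediaStreamTrackProcessor (browser-only); only its `readable`
// attribute is used, so any ReadableStream works for illustration.
async function processNextFrame(processor, processFrame) {
  const reader = processor.readable.getReader(); // get a reader
  const chunk = await reader.read();             // read one chunk
  if (!chunk.done)
    processFrame(chunk.value);                   // unwrap the frame
  reader.releaseLock();
}

// Versus the hypothetical one-line callback form discussed in this issue:
//   track.onframe = event => processFrame(event.frame);
```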
If you start buffering, desiredSize becomes negative, which means the stream underlying source might want to do something to slow down the frame rate to move desiredSize back to zero. Depending on the MediaStream source, the underlying source might take different actions, for example, pause a screen capturer, adjust the frame rate of the camera, etc. Also, it could be that there are some cases where having a positive desiredSize is acceptable, for example, if the MediaStream source is not a camera and the sink is not sensitive to latency (e.g., MediaRecorder). All these are valid and possible uses of backpressure, which is executed by the stream underlying source. Whether you consider the underlying source as part of the stream, or the processor, or nothing (you say the stream is not doing much in practice) is not really important. That is how the backpressure model is intended to work in the streams model.
I don't. You lose the ability to directly control the track from the main thread, which some applications might want to keep (tracks on main, and media off main has always been the model). You also lose the ability to send the track to a platform sink, which might be useful for some applications (e.g., show before and after effects). Yes, there are workarounds to these, but they're not for free, so some functionality is lost.
Your phrasing here is totally misleading. You're claiming "such API complexity" while in practice no developer would have trouble understanding the stream example. Moreover, you're making a comparison using an empty example that does nothing useful. What does your non-stream-based code look like if you have to produce output and make it available on a track on the main thread?
It seems we are not able to converge on this part of the discussion.
The issue is not really about latency but about buffering. I am also fuzzy about which sources would be able to apply backpressure.
I have first-hand experience of developers having difficulties understanding the TransformStream part of your example. I would also say that streams in general make it very easy to do buffering. This is very handy in general but potentially troublesome with video frames.
That is a fair point; I will try to come up with alternative APIs so that we can best compare full-transform use cases. That said, this does not invalidate the comparison for use cases that do not require creating a transformed MediaStreamTrack, say doing object recognition, or applying a filter before encoding through WebCodecs.
Agree. I was just objecting to the claim that backpressure does not or cannot happen.
The example I gave assumed input not from a camera, since that's the more common case for a buffer pool, but I should have been more explicit. How to handle buffering is orthogonal to streams/not streams. We can discuss it further, but probably in a different GitHub issue, and whatever conclusion we arrive at will most likely apply to any mechanism we use to expose frames.
Since the UA knows the sources involved, I think we should be able to let the UA choose strategies (while also allowing the user some amount of control if desired). Like I said before, this is something worth discussing in a separate issue.
You can always find developers that don't understand a particular concept, especially if they're unfamiliar with it. That doesn't mean it's difficult to learn or complex in general. It is not necessary to understand all the details about queuing strategies and other concepts to use streams effectively in practice, just like it wouldn't be necessary to understand any internal techniques or optimizations that might be applied to an event-based solution.
In this case, the streams are provided by the UA, so the UA can make sure no unnecessary buffering occurs.
Looking forward to the full proposal.
Some issues identified while drilling into how to make streams appropriate for a realtime video pipeline:
I believe the ongoing work on WHATWG Streams is expected to resolve the challenges identified in this issue - can it be closed?
Closing issue with the assumption that solutions in the WHATWG Streams space will solve these problems with no further modifications to the mediacapture-transform spec.
MediaStreamTrack encapsulates a flow of video frames, and some applications might want to be notified when a new video frame arrives:
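One hypothetical shape for such a notification surface, purely for illustration; the `onframe` name and event layout are assumptions, nothing here is specified:

```javascript
// Hypothetical sketch only: a track-like object that invokes a callback
// for each new frame. `onframe` and the event's `frame` attribute are
// assumed names, not part of any spec.
class FrameNotifier {
  constructor() { this.onframe = null; }
  // The underlying source would call this for each newly captured frame.
  deliver(frame) {
    if (this.onframe) this.onframe({ frame });
  }
}
```

A page would then register something like track.onframe = event => process(event.frame).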
It would be interesting to evaluate the pros and cons of a callback-based approach vs. a ReadableStream-based approach.