Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is frame-level fanout in scope of WebRTC encoded transform? #211

Open
youennf opened this issue Oct 24, 2023 · 17 comments
Open

Is frame-level fanout in scope of WebRTC encoded transform? #211

youennf opened this issue Oct 24, 2023 · 17 comments

Comments

@youennf
Copy link
Collaborator

youennf commented Oct 24, 2023

Several proposals to extend WebRTC encoded transforms have been proposed to support frame-level fanout, aka one-ended sender streams (#207, #201, #202 for instance).

It does not seem there is consensus on whether WebRTC encoded transform is the right tool for this.
The alternative is a separate API (outside of the packet-level fanout vs. frame-level fanout discussion).
We should try to reach consensus on this question before diving in more concrete proposals.

@youennf
Copy link
Collaborator Author

youennf commented Oct 24, 2023

My personal thoughts:

  • On sender side, the encoded transform model is to transform encoded content just before packetization. It is constrained in a way so that the encoded transform works at the pace of the encoder.
  • With frame fanout, there is no UA encoder, the web application is responsible to provide both the source data and encoder directly as encoded content. In that context, the pace of the encoder is no longer in control of the UA. This is an important change and it is appropriate for the UA to know this through declarative web APIs instead of heuristics.
  • It also requires the UA to know what to do in case the sender is going over budget. Should it drop frames? Should it do buffering? Should it instruct the web applications to somehow reduce the throughput? This probably warrants an error/adaptation mechanism that WebRTC encoded transform does not require.

Given this, I would think a separate API is a better choice. It would also allows to use both APIs jointly.
I do not think the API surface would be very big and piggy backing on WebRTC encoded transform probably does not save us much here. Here is a quick illustration of what it could be:

[Exposed=DedicatedWorker]
interface RTCRtpSenderEncodedSource {
    constructor()
    readonly attribute RTCRtpSenderEncodedSourceHandle handle;
    // Need congestion API, error API and enqueue API
    undefined enqueue((EncodedAudioChunk or EncodedVideoChunk) chunk, DOMString mimeType)
};

typedef (EncodedAudioChunk or EncodedVideoChunk) RTCEncodedMediaChunkData;
dictionary RTCEncodedChunk {
    RTCEncodedChunkData data;
    DOMString mimeType;
}

[Exposed=(DedicatedWorker, Window),
Transferable]
interface RTCRtpSenderEncodedSourceHandle {
}

partial interface RTCRtpSender {
    undefined replaceTrack(RTCRtpSenderEncodedSourceHandle handle);
    readonly attribute RTCRtpSenderEncodedSourceHandle encodedSourceHandle;
}

@alvestrand
Copy link
Contributor

Funny, I was just typing this into a document for internal discussion (wanted to get some feedback before proposing this in the WG):

Establishing a WebCodec-using sender and/or receiver

When a sender (or receiver, but they’re similar enough that just the sender side is explained in detail here) is first established (through AddTransceiver, through ontrack, or through other means), the sender is in an unconfigured state; it leaves this state either by having its “transform” member set, or by starting to process RTP packets.

There is a set of classes that implement the RTCRtpScriptSink interface (introduced in PR 207). For this version, we assume the same initialization steps as for RTCRtpTransformer.

This mirrors the RTCRtpScriptTransformer interface, but has different semantics:

When applied, the sender will not initialize its encoder component. It will expect all frames to come in via the RTCRtpScriptSink object’s “writable” stream.
When applied, the sender will assume no responsibility for configuring upstream bandwidth or send signals upstream; it will make this information available through the RTCRtpScriptSink object interface only.

A similar process, using the RTCRtpScriptSource interface, applies to the RTPReceiver.

@youennf
Copy link
Collaborator Author

youennf commented Oct 25, 2023

Does it mean that a dedicated API might be a better fit than extending WebRTC encoded transform on your side?

@alvestrand
Copy link
Contributor

I don't think of those as separate APIs, rather that we'd want to have a set of building blocks that are able to be composed into the functionality we need.

I think that the current RTCEncodedTransform consists of 3-4 of those building blocks (frame source, frame sink and a hidden feedback channel), and we need to tease them further apart.

@alvestrand
Copy link
Contributor

The document I wrote up to explain the idea that parallels @youennf 's comment (which doesn't yet contain any example code or WebIDL) can be viewed here:

https://github.com/w3c/webrtc-encoded-transform/blob/webcodecs-integration-explainer/webcodecs-explainer.md

@guidou
Copy link
Collaborator

guidou commented Oct 30, 2023

I think this proposal from @youennf is a great place to start and iterate.
My main comment is that it should support RTCEncodedVideoFrames/RTCEncodedAudioFrames instead of or in addition to the WebCodec versions. We no longer need to "move" them across PCs (so no changes to the current Encoded Transform spec are needed), but it should be possible to clone a frame from an input PC (already possible), update its metadata (need a new setMetadata method) and enqueue it in an RTCRtpSenderEncodedSource.
This, together with something along the lines of @alvestrand's proposal for congestion control should be enough to satisfy all the requirements and address all the concerns presented so far, assuming we agree that a frame-level API is a good idea (which I think it is).

@guidou
Copy link
Collaborator

guidou commented Nov 13, 2023

@youennf WDYT about supporting RTCEncodedAudio/VideoFrames with RTCRtpSenderEncodedSource?
The frames wouldn't be moved from one PC to another one, but they might be copied. For example, the frame might be cloned from the stream of one of the input PCs, and enqueued on the RTCRtpSenderEncodedSource of the output PCs after invoking a setMetadata() method to adjust frame IDs to the output peer connection.

@youennf
Copy link
Collaborator Author

youennf commented Nov 14, 2023

This, together with something along the lines of @alvestrand's proposal for congestion control should be enough to satisfy all the requirements and address all the concerns presented so far, assuming we agree that a frame-level API

That is a fair summary.

it should support RTCEncodedVideoFrames/RTCEncodedAudioFrames instead of or in addition to the WebCodec versions

It might be ok as a convenience/shortcut for web developers.
The internal algorithm should ideally process WebCodecs objects so that the API surface is based on WebCodecs.

enqueued on the RTCRtpSenderEncodedSource of the output PCs after invoking a setMetadata() method to adjust frame IDs to the output peer connection

Why would we need to call setMetadata()?
This API basically replaces the encoder which does not have the notion of a frameID. Shouldn't it be the backend that computes it?

@guidou
Copy link
Collaborator

guidou commented Nov 29, 2023

enqueued on the RTCRtpSenderEncodedSource of the output PCs after invoking a setMetadata() method to adjust frame IDs to the output peer connection

Why would we need to call setMetadata()? This API basically replaces the encoder which does not have the notion of a frameID. Shouldn't it be the backend that computes it?

A use case we want to support[1] is to get frames from two or more incoming PCs that are sending the same encoded data, and forward the frames to one or more output PCs as if they were coming from a single, more reliable peer connection. This forwarding occurs at multiple points in multiple network paths between the original servers sending the data and the final recipients and it involves dropping duplicate frames. Failure of individual input peer connections should be tolerated without the next node in the path noticing it.

One of the requirements for this forwarding mechanism is to preserve the existing metadata of the source frames. In particular:

  • CSRCs. It is necessary to preserve the CSRCs of the incoming streams since they are used by the destination node for various application features.
  • Dependency descriptors and frame IDs, so that forwarding nodes can make better decisions involving scalable streams (e.g., drop temporal layers to save bandwidth). They may also be needed to do decoding correctly.
  • RTP timestamps, so that the destination node can play back frames at the right time.

The only way to satisfy these requirements is for the RTCRtpSenderEncodedSource to support taking RTCRtpEncodedVideoFrames so that this metadata can be forwarded to the next node in the path. WebCodecs encoded chunks do not have a mechanism to set any of this metadata (except for the timestamp).

The reason we need a setMetadata method is that, since the incoming frames come from multiple input PCs, it may be necessary to adjust some of the metadata of the output frame so that it properly reflects the decisions made by the forwarder. For example, frames with the same payload may have different frame IDs if they come from different servers. Thus, it must be possible to rewrite this ID for the output frame so that the next hop sees a consistent frame ID scheme.

[1] First proposed in the July meeting

@youennf
Copy link
Collaborator Author

youennf commented Dec 6, 2023

I think it makes sense for the RTC encoder to be able to provide more data than what would provide a WebCodecs encoder.
I am not sure of the existing API shape, reusing RTCRtpEncodedVideoFrame might be too much (setting PT for instance could be tedious to web developers for instance).
In general, keeping as much as possible immutability seems to be a good thing (using a constructor to mutate like WebCodecs might be better).

@alvestrand
Copy link
Contributor

If using a webcodecs encoder to encode a frame that is to be sent on an RTCPeerConnection, there has to be a step that supplies the metadata that the WebCodecs encoder does not supply. Doing that as part of the conversion step between WebCodecs encodedFrame and RTCRtpEncodedFrame seems logical. It's a small amount of data.

WRT the PT, I think we're getting to a consensus that frames should be marked with MIMEtype, and that the processing step that needs a PT should look up the PT from the MIMEtype - since this lookup is transceiver-dependent, it seems logical to do the lookup when the frame is enqueued on a packetizer, not before this.

This means, of course, that MIMEtype needs to be writable.

@guidou
Copy link
Collaborator

guidou commented Dec 6, 2023

I am not sure of the existing API shape, reusing RTCRtpEncodedVideoFrame might be too much (setting PT for instance could be tedious to web developers for instance).

It is not always necessary to set PT. For the forwarding use case, you start with a received frame that already has full metadata, including PT. You just need to update some of the metadata for the output frame.
For the forwarding use case, updating frame ID, dependencies, and rtpTimestamp should probably be enough.
I'm fine with producing the output frame with a constructor that takes the original frame and a metadata object with the fields to update. We can iterate on the right surface, restrictions, errors, etc.

@youennf
Copy link
Collaborator Author

youennf commented Dec 7, 2023

One possibility would be to first define how to plug an encoder in the current pipeline.
We then extend the API for encoder to be considered as a source, via a simple boolean at construction time for instance.

This two stage process would allow us to:

  1. Focus first on how to define encoder as a transform, its processing model, feedback API and so on.
  2. Focus second on the interactions of encoder as a source (and not as a transform) with regards to track, setParameters and so on.

Here is a rough sketch of what it could be:

[Exposed=DedicatedWorker]
interface RTCRtpSenderEncoder {
    constructor(boolean isSource)
    readonly attribute RTCRtpSenderEncoderHandle handle;
    readonly attribute ReadableStream? readable; // null if isSource is true.
    readonly attribute WritableStream writable;
    // Need congestion API, error API and so on.
};

[Exposed=(DedicatedWorker, Window),
Transferable]
interface RTCRtpSenderEncoderHandle {
    readonly boolean attribute isSource;
}

partial interface RTCRtpSender {
    readonly attribute RTCRtpSenderEncoderHandle? encoder;
}

Another approach would be to reuse the script transform/transformer approach and not rely on transferable.
For instance, it does not make sense to allow transfer a RTCRtpSenderEncoderHandle out of agent cluster.

@jan-ivar
Copy link
Member

jan-ivar commented Dec 7, 2023

Bikeshed comment: new Foo and new Bar are generally better than new Foo(isBar) if it avoids nullable.

@alvestrand
Copy link
Contributor

alvestrand commented Dec 7, 2023

Note that we don't have to define an encoder transform for the frame-level fanout use case. The source isn't an encoder for that use case.

But the topic of this issue seems to have drifted a bit.

@guidou
Copy link
Collaborator

guidou commented Dec 7, 2023

One possibility would be to first define how to plug an encoder in the current pipeline. We then extend the API for encoder to be considered as a source, via a simple boolean at construction time for instance.

That seems to address a separate issue from the frame-level fanout / forwarding case.
In this case we assume that we already have the encoded frames as RTCEncodedVideo|AudioFrames.

@dontcallmedom-bot
Copy link

This issue had an associated resolution in WebRTC December 12 2023 meeting – 12 December 2023 (RtpSender Encoded Source):

RESOLUTION: seems like a promising direction for which to see a more complete proposal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants