
Synchronize audio/video with data in WebRTC #133

Open

tidoust opened this issue Jun 26, 2018 · 6 comments

tidoust commented Jun 26, 2018

Some potential use cases for WebRTC presented during the last WebRTC WG F2F would require a mechanism to synchronize audio/video with data.

One possible approach would be to use Real-Time Text to timestamp data against the audio/video, but the question of how that stream gets synchronized with the audio/video streams remains to some extent (depending on the use case and the precision required).

Some questions to handle:

  1. What is the browser's role in the synchronization? Should it sync things up on its own? Is it enough to expose some info and knobs, such as the relationship to the performance.now() clock and the average latency of the processing pipeline, as done in the Web Audio API (sketched below)?
  2. If processing of the data needs to be done by the app, what does it mean to synchronize the streams? I.e. would triggering an event at the right time be enough when high precision is required? Or should the app rather have a way to monitor the information one way or another, through a worklet or a processing mechanism close to the rendering pipeline such as requestAnimationFrame?
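
To make question 1 concrete, here is roughly the kind of information the Web Audio API exposes today (untested sketch; how a WebRTC media pipeline would surface equivalent information is exactly the open question):

```ts
declare const audioCtx: AudioContext; // assumed: an existing AudioContext

function describeAudioClock(ctx: AudioContext) {
  // A pair of timestamps taken at (nearly) the same instant: one on the audio
  // output clock, one on the performance.now() clock.
  const { contextTime, performanceTime } = ctx.getOutputTimestamp();

  // Average latency of the processing/output pipeline, in seconds.
  const pipelineLatency = ctx.baseLatency + ctx.outputLatency;

  // With these, an app can estimate the performance.now() value at which
  // audio scheduled at context time t is actually heard.
  const audibleAt = (t: number) =>
    (performanceTime ?? 0) + (t - (contextTime ?? 0) + pipelineLatency) * 1000;

  return { contextTime, performanceTime, pipelineLatency, audibleAt };
}
```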

It is interesting to see the relationship between this need and similar needs expressed over the years by media companies. For instance, the Timing on the Web effort (#46) would help solve the first point, and was not triggered by WebRTC use cases. Also the second point was recently discussed at length in the Media & Entertainment IG (see w3c/media-and-entertainment#4).

dontcallmedom commented:

Early discussions on related topics at the June 2018 WebRTC F2F meeting

dontcallmedom commented:

The topic was mentioned during the recent media production workshop.

The overall direction is tracked as a WebRTC NV use case, but could use more direct involvement from potential users to get more momentum.

darkvertex commented:

> The overall direction is tracked as a WebRTC NV use case, but could use more direct involvement from potential users to get more momentum.

Hi! 👋 I've been lurking on here. Thought I'd share a first-hand VR-related use case where this could have been very useful:

We needed to deliver N concurrent synced video feeds from a multi-lens VR camera rig at a location with poor computational capacity (too low to live-stitch panoramas onsite). For reasons I cannot disclose, we needed to livestream 360 video with a VR camera that wasn't able to livestream a stitched 360 video natively. The workaround we decided on was to receive the individual feeds elsewhere with a more powerful computer and produce the stitched 360 monoscopic panoramic video to stream to whatever. (You can conceptualize each lens feed as an RTP/WebRTC video track.)

Our camera had 8 physical lenses horizontally, but just 4 already gave us sufficient panoramic coverage, so to save some bandwidth we only send 4. But which 4? It depends on what's visible and near which lens; maybe we want the 4 even ones or the 4 odd ones. We designed the sending software to let you pick a subset of the cameras, and we can switch which are active on a whim, mid-stream.

One approach to dynamic feed switching in WebRTC could be to prenegotiate all the tracks you could possibly need and only send video on those you consider active, but it's a little tricky to distinguish a video feed that suddenly goes inactive because it was intentionally disabled by a reconfiguration at the sender vs. one that goes inactive because of network congestion or data loss down the pipe. Renegotiating WebRTC video tracks between configuration switches is possible, but we felt it interrupted the flow considerably and added some overhead, so we didn't go with it.
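
To make the prenegotiation idea concrete, this is roughly what it looks like (simplified, untested sketch; `lensTracks` stands in for however you capture the per-lens MediaStreamTracks):

```ts
declare const lensTracks: MediaStreamTrack[]; // assumed: one track per physical lens (8 here)

const pc = new RTCPeerConnection();

// Negotiate all eight video m-lines once, up front.
const transceivers = lensTracks.map(() =>
  pc.addTransceiver("video", { direction: "sendonly" })
);

// Later, switch the active subset without renegotiating: attach a real track
// to the chosen transceivers and detach it from the rest.
async function activateLenses(activeIndices: number[]) {
  await Promise.all(
    transceivers.map((t, i) =>
      t.sender.replaceTrack(activeIndices.includes(i) ? lensTracks[i] : null)
    )
  );
}

// e.g. activateLenses([0, 2, 4, 6]) sends the four even lenses; the receiver
// still can't tell a deliberately idle track from a congested one, though.
```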

We needed the camera configuration and identity metadata to be timestamped with the video frames so it would stay in perfect sync during reconfigurations. (Feed 1 may be showing camera 0, but maybe five seconds later it's showing camera 1, for example.) Identity matters for a realtime panoramic stitch because the cameras are different perspectives in space, and the algorithm must be kept informed or the output will look wonky. Unsynced changes are not useful because they glitch the result of the 360 processing, and a WebRTC data channel (to my knowledge) cannot provide this with today's WebRTC generation.

We absolutely needed the camera identity to be in sync with the video frames. Since WebRTC data channels fell short, we simplified further and settled on a pure RTP approach. We opted to hijack the outgoing H264 bitstream and inject SEI (Supplemental Enhancement Information) NAL units with payload type 5, aka "unregistered user data", alongside the frame data in the RTP video track. You can slip small amounts of userdata (text or JSON or whatever) into the video feed this way without corrupting anything; video players safely ignore it.
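
For the curious, the injection boils down to something like this (simplified sketch of the bitstream format, not our production code; the 16-byte UUID is whatever your application picks so receivers can recognise its own messages):

```ts
const APP_UUID = new Uint8Array(16); // assumption: fill with your own app-chosen UUID bytes

function buildUserDataSei(userData: Uint8Array, uuid: Uint8Array = APP_UUID): Uint8Array {
  const payload = new Uint8Array(uuid.length + userData.length);
  payload.set(uuid, 0);
  payload.set(userData, uuid.length);

  // nal_unit_type = 6 (SEI), SEI payload type 5 (user_data_unregistered),
  // ff-coded payload size, payload bytes, then rbsp trailing bits.
  const body: number[] = [0x06, 0x05];
  let size = payload.length;
  while (size >= 255) { body.push(0xff); size -= 255; }
  body.push(size, ...payload, 0x80);

  // Emulation prevention: insert 0x03 after any 00 00 that would otherwise be
  // followed by a byte in the range 0x00..0x03.
  const escaped: number[] = [];
  let zeros = 0;
  for (const b of body) {
    if (zeros >= 2 && b <= 3) { escaped.push(0x03); zeros = 0; }
    escaped.push(b);
    zeros = b === 0 ? zeros + 1 : 0;
  }

  // Annex B start code + NAL unit, ready to splice in ahead of the slice NAL
  // units of the frame it describes.
  return Uint8Array.from([0, 0, 0, 1, ...escaped]);
}
```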

On a custom software receiver (not a regular browser) you can reassemble the H264 bitstream from the RTP track, recover the original NAL units, read the metadata, and have your video processing react accordingly, since the metadata changes in sync with the video frames.
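
The receiving side is roughly the mirror image (again a simplified, untested sketch; it assumes the user-data message is the first SEI message in its NAL unit and skips stripping 0x03 emulation-prevention bytes, which a real parser must handle):

```ts
// Yields the injected metadata bytes from each user_data_unregistered SEI
// message found in an Annex B H264 buffer.
function* extractUserDataSei(annexB: Uint8Array, uuid: Uint8Array = APP_UUID) {
  for (let i = 0; i + 4 < annexB.length; i++) {
    // Find a start code (00 00 01 or 00 00 00 01).
    if (annexB[i] !== 0 || annexB[i + 1] !== 0) continue;
    const off =
      annexB[i + 2] === 1 ? i + 3 :
      annexB[i + 2] === 0 && annexB[i + 3] === 1 ? i + 4 : -1;
    if (off < 0) continue;
    if ((annexB[off] & 0x1f) !== 6 || annexB[off + 1] !== 5) continue; // SEI, payload type 5

    // ff-coded payload size.
    let size = 0, p = off + 2;
    while (annexB[p] === 0xff) { size += 255; p++; }
    size += annexB[p++];

    const payload = annexB.subarray(p, p + size);
    if (uuid.every((b, k) => payload[k] === b)) {
      yield payload.subarray(uuid.length); // the application metadata bytes
    }
  }
}
```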


If data channels could be kept in sync with video without codec gymnastics, or if another convenient mechanism existed for a generic timestamped metadata stream, I think we might have stuck with WebRTC for our use case. (I personally would have liked that, as it could have made it easier to debug things from a web app in-browser instead of some custom standalone software.)

Ultimately, data being in sync with video is important to any kind of realtime actor with a need for a status HUD (a rough receiver-side sketch follows the examples below):

  • Imagine an FPV drone web app where you can control the drone and there's a HUD overlay showing the gyroscope data in sync with the video,
  • or one of those creepy walking robot dogs with charts graphing the servo rotations, where you can see exactly when one of them jams because the charts are in sync with the video showing you the same thing.
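
Roughly, a receiver-side pairing for such a HUD could look like this today, assuming the sender could tag each data channel message with the RTP timestamp of the matching video frame, which is exactly the part that's hard right now (hypothetical, untested sketch; `dataChannel` and `drawOverlay` are placeholders):

```ts
declare const dataChannel: RTCDataChannel;          // assumed: already open
declare const video: HTMLVideoElement;              // assumed: playing the remote track
declare function drawOverlay(gyro: unknown): void;  // hypothetical HUD renderer

// Buffer incoming metadata keyed by the RTP timestamp it was tagged with.
const pending = new Map<number, { rtpTimestamp: number; gyro: unknown }>();
dataChannel.onmessage = (e) => {
  const msg = JSON.parse(e.data);
  pending.set(msg.rtpTimestamp, msg);
};

// Paint the HUD only when the matching frame is actually presented.
function paintHud() {
  video.requestVideoFrameCallback((_now, frame) => {
    // rtpTimestamp is part of the frame metadata for WebRTC-sourced video.
    const rtp = (frame as { rtpTimestamp?: number }).rtpTimestamp;
    const msg = rtp !== undefined ? pending.get(rtp) : undefined;
    if (msg) drawOverlay(msg.gyro);
    paintHud(); // reschedule for the next presented frame
  });
}
paintHud();
```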

Sending device health and state information in perfect sync with the video feed is crucial for a trustworthy assessment of what's happening on screen with a remote entity. Being able to do this in an official and reliable capacity would be exciting!


By the way, I took a superficial look at the WebCodecs API, but it seems very focused on decoding video frames and not so much on letting me see the individual NAL units of an H264 bitstream. (To those familiar with it: is this a fair observation, or did I miss something?) It sure would be cool if it had some callbacks to access non-video/audio metadata.
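
The closest thing I found was copying the encoded bytes out and scanning them yourself, something like the following (untested sketch reusing the extractUserDataSei() helper above; it only works if the chunk payload is in Annex B form):

```ts
// EncodedVideoChunk gives access to the encoded bytes via copyTo(), but there
// is no dedicated callback for SEI or other bitstream-level metadata.
function logSeiFromChunk(chunk: EncodedVideoChunk) {
  const buf = new Uint8Array(chunk.byteLength);
  chunk.copyTo(buf);
  for (const metadata of extractUserDataSei(buf)) {
    console.log("SEI metadata bytes:", metadata);
  }
}
```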

Sorry for the wall of text. Hope it helps shed some light on why synced data could open the door to some very handy in-browser use cases.

dontcallmedom commented:

@darkvertex this is amazing input, thank you so much! Could I ask you to bring it to the WebRTC NV use case repository as a new issue (e.g. titled "detailed example of value of A/V/data sync in WebRTC")?

This would make sure others in the WebRTC WG see it, and hopefully respond with insights and questions.


chrisn commented Feb 2, 2022

> It sure would be cool if it had some callbacks to access non-video/audio metadata.

This relates to w3c/webcodecs#198, and a recent proposal to add API support for SEI events presented to M&E IG (minutes), although that focuses on the HTML video element. We're currently investigating whether this aligns with the DataCue proposal. This seems like another area where cross-group discussion could be helpful.

darkvertex commented:

> @darkvertex this is amazing input, thank you so much! Could I ask you to bring it to the WebRTC NV use case repository as a new issue (e.g. titled "detailed example of value of A/V/data sync in WebRTC")?

Thank you! I didn't know that other repo existed; thanks for pointing me to it. I made the issue:
w3c/webrtc-nv-use-cases#74
