
RTCRtpContributingSource.audioLevel not guaranteed to be in sync with audio playout #1085

Closed
taylor-b opened this issue Mar 16, 2017 · 6 comments

@taylor-b (Contributor)

My assumption is that this feature exists so that applications can show audio level UI indications for different participants of a call.

However, I don't see how this can be done in a robust manner, since, as the spec currently reads, the RTCRtpContributingSource objects are updated whenever a packet is received, not when audio is played out:

Each time an RTP packet is received, the RTCRtpContributingSource objects are updated.

Consider these situations:

  1. There is a noticeable delay between packets being received and audio playing out, due to poor network conditions, resulting in an audioLevel that's updated well in advance of audio playout; e.g., you see the volume indicator move before the speaker opens their mouth.
  2. Traffic is bursty, resulting in the audio level jumping around when there's a burst of traffic, then remaining stagnant for a while.
  3. Packets arrive out of order, and the timestamp actually decreases?

How can these problems be mitigated? Could we change "Each time an RTP packet is received" to "each time a frame of media is delivered to the MediaStreamTrack" (or whatever the right terminology there is)?

Otherwise, what can an application do? Use getStats to figure out the playout delay, and then delay updating the audio level UI for that amount of time?
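In code, that workaround might look something like the sketch below (TypeScript). The jitterBufferDelay and jitterBufferEmittedCount members come from the current 'inbound-rtp' stats dictionary and postdate this discussion, and updateLevelIndicator is a hypothetical UI hook, so treat this as illustrative only:

```ts
// Hypothetical UI hook; not part of any API.
declare function updateLevelIndicator(csrc: number, level?: number): void;

async function pollAudioLevels(receiver: RTCRtpReceiver): Promise<void> {
  // Estimate the playout delay as the average time a sample spends in the
  // jitter buffer. These stats members are from the current 'inbound-rtp'
  // dictionary and may not match what was shipping in 2017.
  let playoutDelayMs = 0;
  const stats = await receiver.getStats();
  stats.forEach((report) => {
    if (report.type === 'inbound-rtp' && report.jitterBufferEmittedCount > 0) {
      playoutDelayMs =
        (report.jitterBufferDelay / report.jitterBufferEmittedCount) * 1000;
    }
  });

  // Defer each UI update by the estimated delay so the indicator moves
  // roughly when the corresponding audio is actually played out.
  for (const source of receiver.getContributingSources()) {
    const { source: csrc, audioLevel } = source;
    setTimeout(() => updateLevelIndicator(csrc, audioLevel), playoutDelayMs);
  }
}
```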

@taylor-b (Contributor, Author)

Another option: add a method on RTCRtpReceiver to get the remote timestamp of the last frame that was played out.

An application could call getContributingSources and see a source with timestamp X, call getCurrentPlayoutTimestamp and get Y, and then wait X - Y before updating the audio level UI.

The advantages of this approach are that it's simpler from an implementation perspective, and it allows the application to get information sooner, in case that's ever desired.
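A minimal sketch of that flow, assuming a hypothetical getCurrentPlayoutTimestamp() (it does not exist in the spec) returning a value on the same clock and in the same units (milliseconds here) as RTCRtpContributingSource.timestamp, and the same hypothetical updateLevelIndicator UI hook as above:

```ts
// getCurrentPlayoutTimestamp() is hypothetical; it is not in the spec.
interface RTCRtpReceiverWithPlayout extends RTCRtpReceiver {
  getCurrentPlayoutTimestamp(): number;
}

declare function updateLevelIndicator(csrc: number, level?: number): void;

function scheduleLevelUpdates(receiver: RTCRtpReceiverWithPlayout): void {
  const playoutTs = receiver.getCurrentPlayoutTimestamp(); // Y
  for (const source of receiver.getContributingSources()) {
    // source.timestamp is X; the audio received at X will play out
    // roughly X - Y from now.
    const waitMs = Math.max(0, source.timestamp - playoutTs);
    setTimeout(
        () => updateLevelIndicator(source.source, source.audioLevel),
        waitMs);
  }
}
```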


jesup commented Mar 17, 2017

Moving the point from packet reception to packet-coming-out-of-jitter-buffer (i.e. when it's played) is straightforward, and roughly what was intended.

The only reason timestamps would make sense is if the application is polling the stats: it could then set a timeout to update the UI (and maybe switch elements around) in sync with the timestamp it got for the level. This might avoid something like: the app polls while an audio level change is still in the jitter buffer for 1 more ms, the level changes 1 ms later, and the next poll isn't for 100 ms - so the update would lag by 99 ms.

The downside of timestamps is that you'll always be setting timers to update the UI. In practice an app might apply UI changes either immediately (ahead of the change) or on the next poll/update. Applying changes immediately would probably be better, since in reality any indication you get of a change means "the level changed sometime since you last polled".

That brings us (barring crazy jitter depths or applications that poll every 10 ms) to a place where the current text actually isn't bad in practice: getting the notification 'early' compensates for the lag due to polling. Not perfect, but a partial/rough compensation. The timestamp idea would make it a more correct approximation (modulo polling frequency), but in practice it would mean every poll is followed by a timeout to update the UI, multiplied by the number of sources you're displaying.

@taylor-b (Contributor, Author)

Moving the point from packet reception to packet-coming-out-of-jitter-buffer (i.e. when it's played) is straightforward, and roughly what was intended.

So it sounds like you're in favor of this approach? Do you have any suggestion for the correct spec terminology? Since the spec has no concept of a jitter buffer, would it be accurate to say "when the RTCRtpReceiver's remote source produces a frame of media", or maybe "delivers a frame of media to the MediaStreamTrack"?

getting the notification 'early' compensates for the lag due to polling

I don't feel good about this, though; there's no guarantee that the polling lag and jitter buffer delay will always cancel each other out perfectly.

@taylor-b (Contributor, Author)

Another issue that was brought to my attention recently: this part of the description of audioLevel means that implementations are required to decode a packet and compute the audio level as soon as a packet is received:

If an RFC 6464 extension header is not present, the browser will compute the value as if it had come from RFC 6464 and use that.

Doing this would be bad for performance. Chrome currently only decodes a packet and computes the audio level (for getStats) when more data is needed for playout.
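For reference, an application that only needs a per-track level (rather than a per-CSRC one) can read the level the browser computed on the decode/playout path from getStats. The 'inbound-rtp' audioLevel and kind members assumed below are from the modern stats dictionary and postdate this discussion:

```ts
// Read the decoder-side audio level via getStats. The audioLevel member on
// 'inbound-rtp' is an assumption here (modern stats dictionary).
async function getDecodedAudioLevel(
    receiver: RTCRtpReceiver): Promise<number | undefined> {
  const stats = await receiver.getStats();
  let level: number | undefined;
  stats.forEach((report) => {
    if (report.type === 'inbound-rtp' && report.kind === 'audio') {
      level = report.audioLevel; // computed when samples are decoded for playout
    }
  });
  return level;
}
```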

taylor-b added a commit to taylor-b/webrtc-pc that referenced this issue Mar 29, 2017
Fixes w3c#1085.

May not be the correct terminology, but the intention is that the
contributing source objects are updated at playout time, such that if
an application is using them to drive an audio level UI, that UI will be
in sync with the audio played out by the browser.
@taylor-b (Contributor, Author)

Tried making a PR. I think the main question is whether we can come up with a definition of a point in time for "playout" whose interpretation isn't too ambiguous.


fippo commented Mar 29, 2017

What is the general usage model for showing the audio level -- polling the value of audioLevel inside a requestAnimationFrame to update the UI?

The alternative here would be to have the contributing source emit an event when its value changes. Which might be 50 times a second...
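As a sketch of the polling model in question (same hypothetical updateLevelIndicator hook as above):

```ts
declare function updateLevelIndicator(csrc: number, level?: number): void;

// Sample the contributing sources once per rendered frame (~60 Hz).
// A real app would likely throttle this; an audio-level UI rarely needs
// updates more often than every 50-100 ms.
function startLevelMeter(receiver: RTCRtpReceiver): void {
  function tick(): void {
    for (const source of receiver.getContributingSources()) {
      updateLevelIndicator(source.source, source.audioLevel);
    }
    requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}
```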

taylor-b added a commit to taylor-b/webrtc-pc that referenced this issue Apr 6, 2017