
Add current audio latency metric #124

Merged: 14 commits merged into w3c:main on Dec 1, 2023

Conversation

@henbos henbos commented Oct 28, 2023

Fixes #119.

Part of #96. See also follow-up #128.



henbos commented Oct 28, 2023

This is the follow-up to the audio frame counters, covering only the latency metrics (current and total).

The wording of the algorithm is tracked in #108 and needs a separate PR (edit: and here it is: #127) to refactor both audio and video in the same PR.

henbos commented Oct 28, 2023

Where the discussion left off from earlier is: #117 (comment)

An audio frame is the group of audio samples that happen at the same time, one per audio channel.

This is my understanding too.

Just poll track.stats at two points in time and take the delta delivered delay / delta delivered frames. [...]

How can you do this if you don't know if you have an average, rolling average, or something else? Besides, it's still nonsensical. If I have three audio buffers delivered, of 1024 frames each, and latencies are respectively 10 frames, 500 frames and 36 frames, it goes:

(10 + 500 + 36) / (1024 + 1024 + 1024) = 0.177734375

this number doesn't mean anything. (10 + 500 + 36) / 3 means something, that's a regular average, but it masks jitter by definition because it's an average, so it's a low pass.

My understanding is that if "current audio latency" has a meaningful definition, then so does the total audio latency, because it is the same metric except that you increment for each audio frame rather than only exposing the latest value.

It is up to the UA to estimate the latency of an audio frame. The app does not need to care as long as it is accurate.

henbos commented Oct 28, 2023

If I have three audio buffers delivered, of 1024 frames each, and latencies are respectively 10 frames, 500 frames and 36 frames, it goes:

Do you mean the first set of 1024 frames has a latency of 10 ms, the second set of 1024 frames has a latency of 500 ms, and the third set of 1024 frames has a latency of 36 ms? If so, I would expect:

deliveredFramesLatency = 1024 * 10 + 1024 * 500 + 1024 * 36 = 559104 ms
deliveredFrames = 1024 + 1024 + 1024 = 3072

Meaning the average latency per frame is 559104 / 3072 = 182 ms.

(And if buffering frames in order to send them in a batch adds latency, then that would be part of the latency measurement.)
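
For illustration, here is a minimal TypeScript sketch of the delta computation above. The counter names deliveredFrames and deliveredFramesLatency follow this discussion and are not necessarily the final attribute names on MediaStreamTrack.stats:

interface AudioStatsSnapshot {
  deliveredFrames: number;        // cumulative frames delivered to sinks
  deliveredFramesLatency: number; // cumulative per-frame latency, in ms
}

// Average latency per delivered frame between two polls of track.stats.
function averageLatencyMs(a: AudioStatsSnapshot, b: AudioStatsSnapshot): number {
  const frames = b.deliveredFrames - a.deliveredFrames;
  const latencyMs = b.deliveredFramesLatency - a.deliveredFramesLatency;
  return frames > 0 ? latencyMs / frames : 0;
}

// The numbers above: three buffers of 1024 frames with latencies of 10 ms, 500 ms and 36 ms.
const before: AudioStatsSnapshot = { deliveredFrames: 0, deliveredFramesLatency: 0 };
const after: AudioStatsSnapshot = {
  deliveredFrames: 3 * 1024,                      // 3072
  deliveredFramesLatency: 1024 * (10 + 500 + 36), // 559104
};
console.log(averageLatencyMs(before, after)); // 182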

henbos commented Oct 30, 2023

I tried to clarify that latency includes buffering. @o1ka who is the Chrome WebRTC audio expert, does this PR make sense to you?

o1ka commented Nov 1, 2023

The definition makes sense to me. The question re: average / rolling average is not quite clear. "current audio latency", i.e. the latency of the last delivered frame, has a specific value, it's not a statistic. The aggregated "delivered audio frames latency" allows the clients to have whatever statistics they need: overall average, the histogram of "average over polling interval", etc.

henbos commented Nov 1, 2023

@padenot @jan-ivar I tried to clarify that this is per frame, that it includes buffering, and that it is measured in ms. I tried to adjust the calculation padenot made earlier as I understood how it should work. Does it make sense now? What are the next steps?

henbos commented Nov 3, 2023

I forgot whether I linked this or not, but the "total measure + number of measurements" approach (in this case total latency and number of delivered frames) is based on the same principle described in the bottom half of WebRTC's Guidelines for design of stats objects, and MediaStreamTrack.stats will likely be used frequently together with RTCPeerConnection.getStats() even though they are independent APIs. Use cases include looking at several-second averages for sustained quality implications rather than temporary glitches, or even whole-call values for the purpose of A/B testing.

I am also strongly in favor of exposing most recent values (e.g. current latency), since I think they have more value than averages in cases such as WebAudio or other more real-time use cases. I think the two complement each other.
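
As a rough sketch of the monitoring side of this, here is how an application could approximate min/max/average itself by periodically sampling the most recent value (the kind of aggregation #128 would later provide natively). The latency attribute name and millisecond unit are assumed from this discussion:

function sampleLatency(getCurrentLatencyMs: () => number, periodMs = 1000) {
  let min = Infinity, max = -Infinity, sum = 0, count = 0;
  const timer = setInterval(() => {
    const latencyMs = getCurrentLatencyMs(); // e.g. () => track.stats.latency
    min = Math.min(min, latencyMs);
    max = Math.max(max, latencyMs);
    sum += latencyMs;
    count += 1;
  }, periodMs);
  return {
    stop: () => clearInterval(timer),
    summary: () => ({ min, max, avg: count > 0 ? sum / count : 0 }),
  };
}

Polling can only approximate the true min/max, since frames delivered between samples are not observed; that is part of the motivation for per-frame aggregates in the spec itself.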

@henbos henbos changed the title from "Add audio latency metrics." to "Add current audio latency metric" on Nov 8, 2023
henbos commented Nov 8, 2023

PTAL @padenot and @jan-ivar. As discussed in the meeting yesterday, I have updated this PR to:

  • Only add current latency, not total latency. A separate PR will be created to address min/max/avg; issue filed: [Audio Stats] Add average, min and max latency #128.
  • Change the wording a bit to say current latency, rather than the latency of the last audio frame specifically, so as to give the user agent room to expose a representative value even if the UA has different buffer sizes in different layers of the stack, which is an implementation detail we do not need to expose in a current latency metric.

henbos commented Nov 8, 2023

Assuming this PR can merge this week, I'll create the follow-up (min/max/avg) PR after this one has merged, since I apparently don't know how to create PR chains on GitHub.

henbos commented Nov 13, 2023

@jan-ivar @padenot Ping, is there anything that I've not addressed? If you prefer something to be phrased differently, please let me know how, but IIUC there is nothing preventing this from merging. Please let me know if that is not correct, and if so, what needs to happen.

padenot commented Nov 13, 2023

I don't understand why this contains things that have not been agreed upon, such as dropped frames.

index.html Outdated
@@ -545,14 +547,26 @@ <h4>The MediaStreamTrackAudioStats interface</h4>
such as if the track is muted or disabled, then the counters do
not increase.</p>
</div>
<li>
<p>The <dfn data-lt="current audio latency">current audio

This needs a rewrite. When defining something, it's best to proceed like so:

  • Define what a certain concept is. In our case, it's the time, in milliseconds, between the point in time an audio input device has acquired a signal, and the time it is able to provide it to an output device. Similarly, we could spec the second part as the point in time the audio data is available to script. It's also important to say that this must be the latest available figure. "Current" isn't very precise for something that changes 300 times a second.
  • Then say that it can be unavailable. Do we really want a nullable here, forcing authors to use an if statement (or something like that)?
  • Don't use the word "note" in normative statements, this is confusing. Or use an explicit non-normative section, in which case you can use the word may.
  • In a specification, say what needs to be done, not what doesn't need to be done: saying that this shouldn't include the output latency is bad; it should naturally follow from other definitions. If there's room for confusion, add an informative note.

Contributor Author

Define what a certain concept is [...] Don't use the word "note" [...] say what needs to be done, not what doesn't need to be done

Reworded. If this is not to your liking please suggest an edit that is.

I now have a definition of "input latency" which can be referenced by both the "latest input latency" (this PR) and for min/max/avg latency (future PR being discussed in #128).

In our case, it's the time, in milliseconds, between the point in time an audio input device has acquired a signal, and the time it is able to provide it to an output device.

The sink may not be an output device. But I reworded it to use your language except I say "to delivery of any of its sinks". Does that make sense to you?

Similarly, we could spec the second part as the point in time the audio data is available to script.

Not sure how to address this comment. Isn't "the script" just another sink?

Then say that it can be unavailable. Do we really want a nullable here, forcing authors to use an if statement (or something like that)?

I agree it would be nice to avoid "if not null" since it would only really be null before the first frames have been delivered. I guess it could be up to the user agent to make a sensible initial guess? I don't know; I removed nullable for API ergonomics reasons, but it's a bit under-specified what should happen if you call track.stats.latency inside the getUserMedia promise's resolve(), in which case I don't know whether a measurement has necessarily been made yet, depending on the implementation.
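
A minimal sketch of the ergonomics being discussed, assuming a non-nullable latency attribute on track.stats (the name comes from this thread; it is not in the standard TypeScript DOM typings, hence the cast):

async function logInitialLatency(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const [track] = stream.getAudioTracks();
  // Cast because MediaStreamTrack.stats is not in the built-in DOM typings.
  const stats = (track as unknown as { stats: { latency: number } }).stats;
  // With a non-nullable attribute this read is valid immediately after the
  // getUserMedia promise resolves, even if no frame has been delivered yet;
  // a nullable design would require a null check here instead.
  console.log(`current input latency: ${stats.latency} ms`);
}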

@@ -578,9 +592,11 @@ <h4>The MediaStreamTrackAudioStats interface</h4>
set {{MediaStreamTrackAudioStats/[[DeliveredFramesDuration]]}}
to [= delivered audio frames duration =],
set {{MediaStreamTrackAudioStats/[[DroppedFrames]]}} to
[= dropped audio frames =] and
[= dropped audio frames =],

I don't know why this is here.


henbos commented Nov 13, 2023

I don't understand why this contains things that have not been agreed upon, such as dropped frames.

This PR does not add dropped frames. Dropped frames were merged in a previous PR (#117). At the time I didn't understand that you had a problem with them, so I did what Jan-Ivar proposed and split up the PR in order to make progress (using the "editors can integrate" label). Since we haven't resolved this disagreement, I just filed #129.

As for this PR, it only adds input latency. Do you have any concerns with the definitions of latency? Edit: I saw your other comments now

@henbos henbos left a comment

Please take another look and suggest edits


henbos commented Nov 16, 2023

Friendly ping @padenot and @jan-ivar, do you have time to take a look before the editor's meeting today?

<div class="note">
<p>A sink that consumes audio may add additional processing
latency not included in this measurement, such as playout delay
or encode time.</p>
Contributor
@youennf youennf Nov 16, 2023

I do not think this value is intended for realtime processing.
Adding a note about the intent (the goal is to allow monitoring, not realtime processing) would be nice.
If we want very precise latency info for, say, web audio apps, it might be best to expose this value in Web Audio, measuring microphone to web audio sink.

Contributor Author

I added a note about monitoring, WDYT @youennf ?


This value is useful for Web Audio applications as is. Web Audio applications that record from an audio input device need to know how far in the past audio recordings should be shifted to be in phase with other elements. In a well-behaved system there is little expectation for this value to change, but it should be possible to see it change if need be (a common cause will be drift compensation when crossing time domains).

Examples: https://ampedstudio.com/manual/studio-settings/ (end of page, workaround by using a loop-back mic), https://www.w3.org/2021/03/media-production-workshop/talks/ulf-hammarqvist-audio-latency.html
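
As a hypothetical sketch of that alignment use case, assuming the input latency added by this PR is exposed as track.stats.latency in milliseconds: a recording made while monitoring playback lags the timeline by the input latency plus the output latency the user was hearing, so the captured samples are shifted earlier by that amount.

async function captureAlignmentOffsetSeconds(ctx: AudioContext): Promise<number> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const [track] = stream.getAudioTracks();
  // Cast because MediaStreamTrack.stats is not in the built-in DOM typings.
  const stats = (track as unknown as { stats?: { latency?: number } }).stats;
  const inputLatencySeconds = (stats?.latency ?? 0) / 1000;
  // Shift recorded samples earlier by this many seconds to bring them into
  // phase with the material the user was playing along to.
  return inputLatencySeconds + (ctx.outputLatency ?? 0);
}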

Contributor

If that value is useful, why not add it to WebAudio directly? This should be cheap enough to do and would give more precise latency info, given that it would compute the actual latency between the microphone and the web audio graph.


The Web Audio API does not deal with input, only output. It's problematic for various reasons, but it's how it was done. The object that is used to refer to the microphone is the MediaStreamTrack, which is why we're adding it there.

The latency between the microphone and the web audio graph is precisely what is being added here, in the sense that the web audio graph is a sink, because it runs off of a system audio output callback.

But there can be other sinks, such as an HTMLMediaElement, or using the microphone in some other way, directly feeding it to a MediaRecorder, etc.

Contributor

The Web Audio API does not deal with input,
The object that is used to refer to the microphone is the MediaStreamTrack

I do not understand this.
MediaStreamAudioSourceNode is the web audio sink of MediaStreamTrack.
Can you clarify why microphone-to-web-audio latency should not be put there instead?
At least in Safari, there is a small buffer at MediaStreamAudioSourceNode which could impact the web audio input latency.

there can be other sinks, such as an HTMLMediaElement

Exactly, with different latencies potentially. It makes more sense to put values there for those applications.
For instance, at least for Safari's implementation, droppedVideoFrames makes sense in https://w3c.github.io/media-playback-quality/#dom-htmlvideoelement-getvideoplaybackquality but does not make sense in MediaStreamTrack.

Contributor Author
@henbos henbos Nov 22, 2023

Re dropped video frames, in Chrome it is possible for frames to be dropped prior to sink delivery (measured by track.stats) or by the sink, for example the RTCPeerConnection (measured by pc.getStats). It would clearly be wrong to expose encoder frame drops inside track.stats. We could expose track frame drops inside the media-source stats object that is returned by getStats, but this would mean that every sink has to report something that is outside of its own scope. And let me remind you that the reason we're talking about track.stats rather than getStats is that when we added these and other metrics to webrtc-stats, we got pushback for adding stuff that is out of scope. (Which I think is justified, as there are non-WebRTC use cases that care about track stats.)

So likewise, if MediaStreamAudioSourceNode is adding latency that is not accounted for by track.stats, then that seems like a good place to measure it. But I still see value in each part of the pipeline measuring its own thing, rather than letting components be responsible for neighboring components and making assumptions about how many steps in the pipeline are of interest to the application.

Contributor Author

Friendly ping @padenot and @jan-ivar. Are there any changes requested regarding the latency definition?


I agree with @henbos. If MediaStreamAudioSourceNode (or its Track sibling) adds latency, then we can add something there. More likely than not we won't add something there, because if it adds latency there is a design problem in the implementation, and we'll argue in the exact same way we're arguing about audio frame drops: it's better for the Web that we fix our implementation than to add a way to measure avoidable latency overhead.

Contributor Author

If we agree, does that mean we can merge this PR?

Contributor
@youennf youennf left a comment

LGTM

Co-authored-by: youennf <youennf@users.noreply.github.com>
@alvestrand alvestrand merged commit 96c9d9d into w3c:main Dec 1, 2023
1 check passed