Active speaker information in mixed (RFC 6465) streams #27

Closed
aboba opened this issue Jan 31, 2014 · 2 comments

aboba commented Jan 31, 2014

From: Emil Ivov emcho@jitsi.org
To: public-orca@w3.org
Subject: active speaker information in mixed streams
Date: Tue, 28 Jan 2014 14:44:20 +0100
URL: http://lists.w3.org/Archives/Public/public-orca/2014Jan/0039.html

Hey all,

I just posted this to the WebRTC list here:

http://lists.w3.org/Archives/Public/public-webrtc/2014Jan/0256.html

But I believe it's a question that is also very much worth resolving
for ORTC, so I am also asking it here:

One requirement that we often bump against is the possibility to
extract active speaker information from an incoming mixed audio
stream. Acquiring the CSRC list from RTP would be a good start. Audio
levels as per RFC6465 would be even better.

Thoughts?

Emil

https://jitsi.org


[Emil Ivov]

With regard to energy levels, there are two main use cases:

  • acting on changes of the current speaker (e.g. in order to upscale
    their corresponding video and thumbnail everyone else)

  • showing energy levels for all participants

[Gustavo Garcia]

  1. The client-mixer audio level [RFC6464] is sent by Chrome in the corresponding RTP
    extension header, but AFAIK that information is not used by the browser receiving it.
  2. You have access to the audio level of the tracks received with the getStats API (in Chrome)

[Roman Shpount]

First of all, the latest value of the audio level is almost useless. You need
to apply some sort of averaging function to the audio level values you
receive to get something that makes sense (see section 5 of RFC 6464). For
instance, returning the max audio level for a specified interval, which
should be much longer than an individual packet duration, makes much more
sense.

Second, since scenarios where received audio will not be decoded would be
very uncommon for ORCA clients, the savings from exposing the audio level
from RTP packets are not significant in comparison with calculating this
value directly from decoded audio.

As far as SSRCs are concerned, it would make sense to expose the latest list
of contributing sources with some sort of timestamp indicating the last time
each SSRC was seen. You can also expire and remove SSRCs from the list
after some period of time.
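
The two suggestions above (report the max level over an interval rather than the latest value, and keep a per-source last-seen timestamp with expiry) could be sketched roughly as follows. This is only an illustration: the class name, method names, and default intervals are hypothetical and are not part of any proposal in this thread. It assumes levels use the RFC 6464/6465 encoding, 0 to 127 in -dBov, so a smaller number is louder and 127 is silence.

```javascript
// Hypothetical helper; none of these names come from a browser or ORTC API.
class ContributingSourceTracker {
  constructor(windowMs = 500, expiryMs = 10000) {
    this.windowMs = windowMs; // interval over which the loudest level is reported
    this.expiryMs = expiryMs; // drop a source not seen for this long
    this.sources = new Map(); // csrc -> { lastSeen, samples: [{ time, level }] }
  }

  // Record one (csrc, level) pair taken from a received packet at time `now` (ms).
  record(csrc, level, now) {
    let entry = this.sources.get(csrc);
    if (!entry) {
      entry = { lastSeen: now, samples: [] };
      this.sources.set(csrc, entry);
    }
    entry.lastSeen = now;
    entry.samples.push({ time: now, level });
  }

  // Return [{ csrc, lastSeen, loudestLevel }] for sources not yet expired at `now`.
  snapshot(now) {
    const result = [];
    for (const [csrc, entry] of this.sources) {
      if (now - entry.lastSeen > this.expiryMs) {
        this.sources.delete(csrc); // expired: remove from the list
        continue;
      }
      // Keep only the samples that fall inside the reporting window.
      entry.samples = entry.samples.filter((s) => now - s.time <= this.windowMs);
      // Loudest means numerically smallest; 127 (silence) if the window is empty.
      const loudestLevel = entry.samples.reduce(
        (min, s) => Math.min(min, s.level),
        127
      );
      result.push({ csrc, lastSeen: entry.lastSeen, loudestLevel });
    }
    return result;
  }
}
```

An application would call record() for each received packet and snapshot() whenever it wants the current list; the window should be much longer than one packet duration, as noted above.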

[Bernard Aboba]

With regard to energy levels, there are two main use cases:

  1. acting on changes of the current speaker (e.g. in order to upscale their corresponding video and thumbnail everyone else)
  2. showing energy levels for all participants

[BA] I believe that the polling proposal could address need #2 by delivering a list of CSRCs as well as an (averaged) level, but I'm not sure about #1.
#1 is about timely dominant speaker identification, presumably without false speaker switches.

To do this well, you may need to do more than firing an event based on changes in a ranked list of speakers based on averaged levels; better approaches tend to actually process the audio.

For example, see http://webee.technion.ac.il/Sites/People/IsraelCohen/Publications/CSL_2012_Volfin.pdf

Rather than providing access to per-packet header extensions or triggering an event for
each new level (which could end up resulting in an event for a large fraction of packets),
could the Web Audio specification be used?
https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html

In particular, I am thinking about the ScriptProcessorNode Interface (Section 4.12).
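
As a rough illustration of computing a level from decoded audio instead of reading it from RTP header extensions, the function below derives an RFC 6464-style value (0 to 127, in -dBov) from one buffer of samples. It is a sketch under assumptions: the function name is hypothetical, samples are assumed to be floats in [-1, 1] as Web Audio buffers provide, and full scale (1.0) is treated as the overload point. How the buffer is obtained (for example from a ScriptProcessorNode callback) is left out.

```javascript
// Hypothetical helper: maps a buffer of decoded samples to a 0..127 level in -dBov.
// 0 is the loudest value and 127 is used for silence.
function audioLevelFromSamples(samples) {
  if (samples.length === 0) return 127;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  const rms = Math.sqrt(sumSquares / samples.length);
  if (rms === 0) return 127; // digital silence
  const dbov = 20 * Math.log10(rms); // <= 0 when samples stay within [-1, 1]
  // Negate, round to an integer, and clamp to the 7-bit range.
  return Math.min(127, Math.max(0, Math.round(-dbov)));
}
```

The per-buffer values would still need smoothing over a longer interval before being used for speaker detection, for the reasons given earlier in the thread.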


[Peter Thatcher]

dictionary RtpContributingSource {
    unsigned int csrc;
    int audioLevel;
};

partial interface RtpReceiver {
    sequence<RtpContributingSource> getContributingSources();
};

Also, is it enough to require JS to poll? Why not have an event for
when the values change?

partial interface RtpReceiver {
    // Gets sequence<RtpContributingSource>
    attribute EventHandler? oncontributingsources;
};

[Justin Uberti]

As others have mentioned, the event rate here could be very high (50+ PPS),
and I don't think that resolution is really needed for active speaker
identification. I have seen systems that work well even when sampling this
information at ~ 5 Hz.

As such I am still inclined to leave this as a polling interface and allow
apps to control the resolution by their poll rate.
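
To make the polling approach concrete, here is a minimal sketch of an application sampling at about 5 Hz. It assumes a receiver object exposing the getContributingSources() method proposed earlier in this thread (not a shipped API), and it assumes audioLevel is in -dBov so that a lower number is louder. The function names and the injectable `timers` parameter are hypothetical.

```javascript
// Pick the csrc with the loudest (numerically smallest) audioLevel, or null.
function dominantSpeaker(sources) {
  let best = null;
  for (const s of sources) {
    if (best === null || s.audioLevel < best.audioLevel) best = s;
  }
  return best ? best.csrc : null;
}

// Poll `receiver` at `hz` and call onChange(csrc) whenever the loudest source changes.
// `timers` only needs setInterval/clearInterval; it defaults to the global timers.
// Returns a function that stops the polling.
function pollDominantSpeaker(receiver, onChange, hz = 5, timers = globalThis) {
  let current = null;
  const id = timers.setInterval(() => {
    const next = dominantSpeaker(receiver.getContributingSources());
    if (next !== current) {
      current = next;
      onChange(next);
    }
  }, 1000 / hz);
  return () => timers.clearInterval(id);
}
```

This naive "loudest wins" rule can produce false speaker switches, which is the concern raised above about use case #1; a real application would likely add smoothing or hysteresis on top of the polled values.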

[Roman Shpount]

I would actually think that a callback will be more efficient, as long as you
can specify the number of packets for each callback and the max number of CSRCs.
This would be similar to ScriptProcessorNode in Web Audio: it allows the
application to control the latency it can accept, requires no processing
when it is not used, provides detailed info about audio levels for
implementing any required post-processing, and allows the allocation of
the needed data structures to be optimized.

[Peter Thatcher]

Would it make sense to have an async getter that calls the callback
function more than once? For example, to get the current value once,
call like this:

rtpReceiver.getContributorSources(function(contributorSources) {
  // Use the contributor sources just once.
});

And to get called back every 100ms, call like this:

rtpReceiver.getContributorSources(function(contributorSources) {
  // Use the contributor sources every 100ms.
  return true;
}, 100);

And to stop the callback:

rtpReceiver.getContributorSources(function(contributorSources) {
  if (iAmAllDone) {
    // I'm all done. Don't call me anymore.
    return false;
  }
  return true;
}, 100);

That's somewhat halfway between an async getter and an event. Are
there any existing HTML5 APIs like that?
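
For illustration only, the semantics described above could be emulated in script roughly as follows. This is a sketch, not a proposed implementation: `readSources` is a hypothetical stand-in for whatever returns the current contributing sources, the `timers` parameter exists only so the timers can be swapped out, and stopping on any return value other than `true` is an assumption, since the examples only show `true` and `false`.

```javascript
// Hypothetical emulation of the repeating-callback getter.
// Without intervalMs: call back once, asynchronously.
// With intervalMs: call back every intervalMs until the callback stops returning true.
function getContributorSources(readSources, callback, intervalMs, timers = globalThis) {
  if (intervalMs === undefined) {
    timers.setTimeout(() => callback(readSources()), 0);
    return;
  }
  const id = timers.setInterval(() => {
    // Assumption: anything other than `true` ends the repetition.
    if (callback(readSources()) !== true) {
      timers.clearInterval(id);
    }
  }, intervalMs);
}
```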

[Roman Shpount]

How about something like this:

ContributingSourceProcessorNode createContributingSourceProcessor(
    optional unsigned long interval = 100,
    optional unsigned long maxContributingSources = 16);

interface ContributingSourceProcessorNode {
    attribute EventHandler onContributingSourceProcess;
};

dictionary ContributingSource {
    readonly attribute double packetTime;
    unsigned int csrc;
    int audioLevel;
};

interface ContributingSourceProcessingEvent : Event {
    readonly attribute sequence<ContributingSource> contributingSources;
};

This way you can create a processor node and specify the frequency with
which it should be called.

[Peter Thatcher]
Looks more complicated. What's the benefit? The callback-based version of my proposal already allows specifying the frequency, and is simpler.

aboba commented Apr 4, 2014

Here is a recap of where we are:
http://lists.w3.org/Archives/Public/public-ortc/2014Apr/0006.html

robin-raymond pushed a commit to robin-raymond/ortc that referenced this issue Apr 12, 2014
…c#27

Support for control of quality, resolution, framerate and layering added, as described in https://github.com/w3c/issues/31
RTCRtpListener object added and figure in Section 1 updated, as described in w3c#32
More complete support for RTP and Codec Parameters added, as described in w3c#33
Data Channel transport problem fixed, as described in w3c#34
Various NITs fixed, as described in w3c#37
Section 2.2 and 2.3 issues fixed, as described in w3c#38
Default values of some dictionary attributes added, to partially address the issue described in w3c#39
Support for ICE TCP added, as described in w3c#41
Fixed issue with sequences as attributes, as described in w3c#43
Fix for issues with onlocalcandidate, as described in w3c#44
Initial stab at a Stats API, as requested in w3c#46
Added support for ICE gather policy, as described in w3c#47

aboba commented Apr 29, 2014

This has been re-classified as a 1.2 feature.

robin-raymond pushed a commit to robin-raymond/ortc that referenced this issue Apr 29, 2014
- Support for contributing sources removed (re-classified as a 1.2 feature), as described in w3c#27
- Cleanup of DataChannel construction, as described in w3c#60
- Separate proposal on simulcast/layering, as described in w3c#61
- Separate proposal on quality, as described in w3c#62
- Fix for TCP candidate type, as described in w3c#63
- Fix to the fingerprint attribute, as described in w3c#64
- Fix to RTCRtpFeatures, as described in w3c#65
- Support for retrieval of remote certificates, as described in w3c#67
- Support for ICE error handling, described in w3c#68
- Support for Data Channel send rate control, as described in w3c#69
- Support for capabilities and settings, as described in w3c#70
- Removal of duplicate RTCRtpListener functionality, as described in w3c#71
- ICE gathering state added, as described in w3c#72
- Removed ICE role from the ICE transport constructor, as described in w3c#73