This repository was archived by the owner on Feb 25, 2026. It is now read-only.

Active speaker information in mixed (RFC 6465) streams #27

@aboba


From: Emil Ivov emcho@jitsi.org
To: public-orca@w3.org
Subject: active speaker information in mixed streams
Date: Tue, 28 Jan 2014 14:44:20 +0100
URL: http://lists.w3.org/Archives/Public/public-orca/2014Jan/0039.html

Hey all,

I just posted this to the WebRTC list here:

http://lists.w3.org/Archives/Public/public-webrtc/2014Jan/0256.html

But I believe it's a question that is also very much worth resolving
for ORTC, so I am also asking it here:

One requirement that we often bump up against is the ability to
extract active speaker information from an incoming mixed audio
stream. Acquiring the CSRC list from RTP would be a good start. Audio
levels as per RFC 6465 would be even better.

Thoughts?

Emil

https://jitsi.org


[Emil Ivov]

With regard to energy levels, there are two main use cases:

  • acting on changes of the current speaker (e.g. in order to upscale
    their corresponding video and thumbnail everyone else)

  • showing energy levels for all participants

[Gustavo Garcia]

  1. The client-to-mixer audio level [RFC 6464] is sent by Chrome in the corresponding RTP
     extension header, but AFAIK that information is not used by the browser receiving it.
  2. You have access to the audio level of the tracks received with the getStats API (in Chrome).

[Roman Shpount]

First of all, the latest value of the audio level is almost useless. You
need to apply some sort of averaging function to the audio level values you
receive to get something that makes sense (see section 5 of RFC 6464). For
instance, returning the maximum audio level for a specified interval, which
should be much longer than an individual packet duration, makes much more
sense.
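To make the smoothing concrete, here is a minimal sketch of the max-over-interval approach Roman describes, assuming levels follow the RFC 6464 convention (0 = loudest, 127 = silence, in -dBov). All names here are illustrative, not part of any proposed API.

```javascript
// Keep per-packet audio levels for a window much longer than one
// packet (here 500 ms) and report the loudest level in that window,
// instead of exposing the last raw per-packet value.
function makeLevelWindow(windowMs = 500) {
  let samples = []; // { time, level } pairs, level in -dBov (0..127)
  return {
    add(time, level) {
      samples.push({ time, level });
      // Drop samples that have fallen out of the window.
      samples = samples.filter(s => time - s.time < windowMs);
    },
    // Loudest level seen in the window: the numerically smallest
    // -dBov value; 127 (silence) when the window is empty.
    loudest() {
      return samples.length ? Math.min(...samples.map(s => s.level)) : 127;
    },
  };
}
```

An app would call `add()` per packet and read `loudest()` at whatever rate it polls, so single noisy packets never drive the UI directly.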

Second, since scenarios where received audio is not decoded would be
very uncommon for ORTC clients, the savings from exposing the audio level
from RTP packets are not significant in comparison with calculating this
value directly from the decoded audio.

As far as SSRCs are concerned, it would make sense to expose the latest
list of contributing sources with some sort of timestamp indicating the
last time each SSRC was seen. You could also expire and remove SSRCs from
the list after some period of time.
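A sketch of the bookkeeping Roman suggests: record the last time each contributing source was seen and drop entries not seen within some expiry period. The names and the 3-second default are illustrative only.

```javascript
// Track contributing sources with a last-seen timestamp, expiring
// entries that have not appeared in a packet within `expiryMs`.
function makeCsrcTracker(expiryMs = 3000) {
  const lastSeen = new Map(); // csrc -> timestamp of last packet
  return {
    // Call once per CSRC per received packet.
    observe(csrc, time) {
      lastSeen.set(csrc, time);
    },
    // Current CSRC list with timestamps, expired entries removed.
    active(time) {
      for (const [csrc, t] of lastSeen) {
        if (time - t >= expiryMs) lastSeen.delete(csrc);
      }
      return [...lastSeen].map(([csrc, t]) => ({ csrc, lastSeen: t }));
    },
  };
}
```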

[Bernard Aboba]

With regard to energy levels, there are two main use cases:

  1. acting on changes of the current speaker (e.g. in order to upscale their corresponding video and thumbnail everyone else)
  2. showing energy levels for all participants

[BA] I believe that the polling proposal could address need #2 by delivering a list of CSRCs as well as an (averaged) level, but I'm not sure about #1.
#1 is about timely dominant speaker identification, presumably without false speaker switches.

To do this well, you may need to do more than fire an event whenever a ranked list of speakers (ordered by averaged levels) changes; better approaches tend to actually process the audio.

For example, see http://webee.technion.ac.il/Sites/People/IsraelCohen/Publications/CSL_2012_Volfin.pdf
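Bernard's point is that naive "loudest wins" switching is jumpy. A minimal hedge against false speaker switches (far simpler than the approach in the paper he links) is hysteresis: only switch the dominant speaker when a challenger has out-ranked the incumbent for several consecutive observations. This is an illustrative sketch, not anything proposed in the thread.

```javascript
// Dominant-speaker selection with hysteresis: a new CSRC only becomes
// dominant after it has been the loudest source for `holdCount`
// consecutive observations, so one-off noise spikes don't switch.
function makeDominantSpeaker(holdCount = 3) {
  let current = null;    // currently dominant CSRC
  let challenger = null; // CSRC trying to take over
  let streak = 0;        // consecutive wins by the challenger
  // `loudestCsrc` is the top-ranked source from the latest observation.
  return function update(loudestCsrc) {
    if (loudestCsrc === current) {
      // Incumbent still loudest: reset any challenge.
      streak = 0;
      challenger = null;
    } else if (loudestCsrc === challenger) {
      if (++streak >= holdCount) {
        current = loudestCsrc;
        streak = 0;
        challenger = null;
      }
    } else {
      challenger = loudestCsrc;
      streak = 1;
    }
    return current;
  };
}
```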

Rather than providing access to per-packet header extensions or triggering an event for
each new level (which could end up producing an event for a large fraction of packets),
could the Web Audio specification be used?
https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html

In particular, I am thinking about the ScriptProcessorNode Interface (Section 4.12).


[Peter Thatcher]

dictionary RtpContributingSource {
    unsigned int csrc;
    int audioLevel;
};

partial interface RtpReceiver {
    sequence<RtpContributingSource> getContributingSources();
};

Also, is it enough to require JS to poll? Why not have an event for
when the values change?

partial interface RtpReceiver {
    // Gets sequence<RtpContributingSource>
    attribute EventHandler? oncontributingsources;
};

[Justin Uberti]

As others have mentioned, the event rate here could be very high (50+ PPS),
and I don't think that resolution is really needed for active speaker
identification. I have seen systems that work well even when sampling this
information at ~ 5 Hz.

As such I am still inclined to leave this as a polling interface and allow
apps to control the resolution by their poll rate.
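One way an app could "control the resolution by its poll rate" is to throttle reads of the source list. In this sketch, `pollFn` stands in for the proposed `getContributingSources()` polling API (a hypothetical name at this point in the thread); the poller re-reads at most once per `intervalMs`, e.g. 200 ms for the ~5 Hz Justin mentions, caching the last result in between.

```javascript
// Wrap a polling function so callers can invoke it freely while the
// underlying source is sampled at most once per `intervalMs`.
// `now` is injectable for testing; defaults to the wall clock.
function makeThrottledPoller(pollFn, intervalMs = 200, now = Date.now) {
  let last = -Infinity; // time of the last real poll
  let cached = [];      // last observed source list
  return function poll() {
    const t = now();
    if (t - last >= intervalMs) {
      cached = pollFn();
      last = t;
    }
    return cached;
  };
}
```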

[Roman Shpount]

I would actually think that a callback would be more efficient, as long as
you can specify the number of packets per callback and a maximum number of
CSRCs. This would be similar to the ScriptProcessorNode in Web Audio: it
would let the application control the latency it finds acceptable, require
no processing when unused, provide detailed audio level information for any
required post-processing, and allow optimizing the allocation of the needed
data structures.

[Peter Thatcher]

Would it make sense to have an async getter that calls the callback
function more than once? For example, to get the current value once,
call like this:

rtpReceiver.getContributorSources(function(contributorSources) {
  // Use the contributor sources just once.
});

And to get called back every 100ms, call like this:

rtpReceiver.getContributorSources(function(contributorSources) {
  // Use the contributor sources every 100ms.
  return true;
}, 100);

And to stop the callback:

rtpReceiver.getContributorSources(function(contributorSources) {
  if (iAmAllDone) {
    // I'm all done. Don't call me anymore.
    return false;
  }
  return true;
}, 100);

That's somewhat halfway between an async getter and an event. Are
there any existing HTML5 APIs like that?

[Roman Shpount]

How about something like this:

ContributingSourceProcessorNode createContributingSourceProcessor(
    optional unsigned long interval = 100,
    optional unsigned long maxContributingSources = 16);

interface ContributingSourceProcessorNode {
    attribute EventHandler onContributingSourceProcess;
};

dictionary ContributingSource {
    double packetTime;
    unsigned int csrc;
    int audioLevel;
};

interface ContributingSourceProcessingEvent : Event {
    readonly attribute sequence<ContributingSource> contributingSources;
};

This way you can create a processor node and specify the frequency with
which it should be called.

[Peter Thatcher]
Looks more complicated. What's the benefit? The callback-based version of my proposal already allows specifying the frequency, and it's simpler.
