From: Emil Ivov emcho@jitsi.org
To: public-orca@w3.org
Subject: active speaker information in mixed streams
Date: Tue, 28 Jan 2014 14:44:20 +0100
URL: http://lists.w3.org/Archives/Public/public-orca/2014Jan/0039.html
Hey all,
I just posted this to the WebRTC list here:
http://lists.w3.org/Archives/Public/public-webrtc/2014Jan/0256.html
But I believe it's a question that is also very much worth resolving
for ORTC, so I am also asking it here:
One requirement that we often bump against is the possibility to
extract active speaker information from an incoming mixed audio
stream. Acquiring the CSRC list from RTP would be a good start. Audio
levels as per RFC6465 would be even better.
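For concreteness, here is a minimal sketch of what consuming such data could look like, assuming the browser exposed the raw CSRC list and the RFC 6465 extension octets (the function name and inputs are hypothetical). Each octet carries a 7-bit level in -dBov (0 = loudest, 127 = silence), matched positionally to the CSRC list:

```javascript
// Hypothetical parser: pairs each CSRC from the RTP header with the
// corresponding level octet from an RFC 6465 header extension.
// Levels are -dBov, so 0 is loudest and 127 is silence; the top bit
// of each octet is masked off.
function parseMixedAudioLevels(csrcs, extensionBytes) {
  const levels = [];
  for (let i = 0; i < csrcs.length && i < extensionBytes.length; i++) {
    levels.push({ csrc: csrcs[i], audioLevel: extensionBytes[i] & 0x7f });
  }
  return levels;
}
```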
Thoughts?
Emil
https://jitsi.org
[Emil Ivov]
With regard to energy levels, there are two main use cases:
- acting on changes of the current speaker (e.g. in order to upscale
their corresponding video and thumbnail everyone else)
- showing energy levels for all participants
[Gustavo Garcia]
- The client-to-mixer audio level (RFC 6464) is sent by Chrome in the corresponding RTP
header extension, but AFAIK that information is not used by the browser receiving it.
- You do have access to the audio level of the received tracks via the getStats API (in Chrome).
[Roman Shpount]
First of all, the latest value of the audio level is almost useless. You need
to apply some sort of averaging function to the audio level values you
have received to get something that makes sense (see Section 5 of RFC 6464). For
instance, returning the maximum audio level over a specified interval, which
should be much longer than an individual packet's duration, makes much more
sense.
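A sketch of the kind of windowing Roman describes, returning the loudest level seen over a configurable interval (names and the explicit clock parameter are illustrative; levels are -dBov, so numerically smaller means louder):

```javascript
// Keeps per-packet levels (as -dBov) for a sliding time window and
// reports the loudest (numerically smallest) value in that window.
class WindowedLevel {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.samples = [];
  }
  push(level, nowMs) {
    this.samples.push({ level, t: nowMs });
    const cutoff = nowMs - this.windowMs;
    while (this.samples.length && this.samples[0].t < cutoff) {
      this.samples.shift(); // drop samples older than the window
    }
  }
  // Loudest level in the window; 127 (silence) if the window is empty.
  loudest() {
    return this.samples.reduce((m, s) => Math.min(m, s.level), 127);
  }
}
```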
Second, since scenarios where received audio is not decoded would be
very uncommon for ORTC clients, the savings from exposing the audio level from RTP
packets are not significant compared with calculating this value
directly from the decoded audio.
As far as SSRCs are concerned, it would make sense to expose the latest list of
contributing sources along with a timestamp indicating the last time
each SSRC was seen. You could also expire and remove SSRCs from the list
after some period of time.
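A sketch of that expiry scheme, with the clock passed in explicitly for testability (all names here are illustrative):

```javascript
// Tracks the last time each contributing source was seen and drops
// entries that have been absent longer than expiryMs.
class CsrcTracker {
  constructor(expiryMs) {
    this.expiryMs = expiryMs;
    this.lastSeen = new Map();
  }
  // csrcs: sources present in the packet just received; nowMs: clock.
  observe(csrcs, nowMs) {
    for (const csrc of csrcs) this.lastSeen.set(csrc, nowMs);
    for (const [csrc, t] of this.lastSeen) {
      if (nowMs - t > this.expiryMs) this.lastSeen.delete(csrc);
    }
  }
  active() {
    return [...this.lastSeen.keys()];
  }
}
```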
[Bernard Aboba]
With regard to energy levels, there are two main use cases:
- acting on changes of the current speaker (e.g. in order to upscale their corresponding video and thumbnail everyone else)
- showing energy levels for all participants
[BA] I believe that the polling proposal could address use case #2 by delivering a list of CSRCs as well as an (averaged) level, but I'm not sure about #1.
#1 is about timely dominant speaker identification, presumably without false speaker switches.
To do this well, you may need to do more than firing an event based on changes in a ranked list of speakers based on averaged levels; better approaches tend to actually process the audio.
For example, see http://webee.technion.ac.il/Sites/People/IsraelCohen/Publications/CSL_2012_Volfin.pdf
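As a much simpler illustration than the linked paper, even a hold-time hysteresis over averaged levels suppresses many false switches: a new speaker only becomes dominant after staying loudest for a minimum duration. This is an illustrative sketch, not the algorithm from the paper:

```javascript
// Dominant-speaker selection with hold-time hysteresis. A candidate
// must remain the loudest source for holdMs before it replaces the
// current dominant speaker.
class DominantSpeaker {
  constructor(holdMs) {
    this.holdMs = holdMs;
    this.current = null;
    this.candidate = null;
    this.candidateSince = 0;
  }
  // levels: array of { csrc, audioLevel } with audioLevel in -dBov
  // (smaller = louder). Returns the current dominant speaker's CSRC.
  update(levels, nowMs) {
    if (!levels.length) return this.current;
    const loudest = levels.reduce((a, b) => (b.audioLevel < a.audioLevel ? b : a));
    if (loudest.csrc === this.current) {
      this.candidate = null; // current speaker still dominates
      return this.current;
    }
    if (loudest.csrc !== this.candidate) {
      this.candidate = loudest.csrc; // new challenger: start the clock
      this.candidateSince = nowMs;
    } else if (nowMs - this.candidateSince >= this.holdMs) {
      this.current = this.candidate; // held long enough: switch
      this.candidate = null;
    }
    return this.current;
  }
}
```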
Rather than providing access to per-packet header extensions or firing an event for
each new level (which could end up producing an event for a large fraction of packets),
could the Web Audio specification be used?
https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html
In particular, I am thinking about the ScriptProcessorNode Interface (Section 4.12).
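A sketch of that approach: a ScriptProcessorNode computes a smoothed energy level from the decoded audio. The RMS helper is plain JavaScript; the node wiring assumes a browser AudioContext:

```javascript
// Root-mean-square level of a block of PCM samples in [-1, 1].
function rmsLevel(samples) {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Wires a ScriptProcessorNode between an audio source and the
// destination, reporting one RMS level per 2048-sample block.
// Browser-only: requires a live AudioContext.
function attachLevelMeter(audioContext, sourceNode, onLevel) {
  const processor = audioContext.createScriptProcessor(2048, 1, 1);
  processor.onaudioprocess = (e) => {
    onLevel(rmsLevel(e.inputBuffer.getChannelData(0)));
  };
  sourceNode.connect(processor);
  processor.connect(audioContext.destination);
  return processor;
}
```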
[Peter Thatcher]
dictionary RtpContributingSource {
    unsigned long csrc;
    long audioLevel;
};
partial interface RtpReceiver {
    sequence<RtpContributingSource> getContributingSources();
};
Also, is it enough to require JS to poll? Why not have an event for
when the values change?
partial interface RtpReceiver {
    // Gets sequence<RtpContributingSource>
    attribute EventHandler? oncontributingsources;
};
[Justin Uberti]
As others have mentioned, the event rate here could be very high (50+ PPS),
and I don't think that resolution is really needed for active speaker
identification. I have seen systems that work well even when sampling this
information at ~ 5 Hz.
As such I am still inclined to leave this as a polling interface and allow
apps to control the resolution by their poll rate.
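Under Justin's proposal the application samples at its own rate; a ~5 Hz poll over the getContributingSources() getter discussed above could look like this (a sketch against the receiver API proposed in this thread, not a shipped interface):

```javascript
// Polls the proposed getContributingSources() getter at a fixed
// interval (default 200 ms, i.e. ~5 Hz) and hands each snapshot to
// the application. Returns a function that stops the polling.
function pollSpeakers(rtpReceiver, onSources, intervalMs = 200) {
  const id = setInterval(() => {
    onSources(rtpReceiver.getContributingSources());
  }, intervalMs);
  return () => clearInterval(id);
}
```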
[Roman Shpount]
I would actually think that a callback will be more efficient, as long as you
can specify the number of packets per callback and the maximum number of CSRCs.
This would be similar to ScriptProcessorNode in Web Audio and would allow
the application to control the latency it finds acceptable, require no
processing when unused, provide detailed audio level information for any
required post-processing, and allow optimized allocation of the needed
data structures.
[Peter Thatcher]
Would it make sense to have an async getter that calls the callback
function more than once? For example, to get the current value once,
call like this:
rtpReceiver.getContributorSources(function(contributorSources) {
    // Use the contributor sources just once.
});
And to get called back every 100ms, call like this:
rtpReceiver.getContributorSources(function(contributorSources) {
    // Use the contributor sources every 100ms.
    return true;
}, 100);
And to stop the callback:
rtpReceiver.getContributorSources(function(contributorSources) {
    if (iAmAllDone) {
        // I'm all done. Don't call me anymore.
        return false;
    }
    return true;
}, 100);
That's somewhat halfway between an async getter and an event. Are
there any existing HTML5 APIs like that?
[Roman Shpount]
How about something like this:
ContributingSourceProcessorNode createContributingSourceProcessor(
    optional unsigned long interval = 100,
    optional unsigned long maxContributingSources = 16);

interface ContributingSourceProcessorNode {
    attribute EventHandler onContributingSourceProcess;
};

dictionary ContributingSource {
    double packetTime;
    unsigned long csrc;
    long audioLevel;
};

interface ContributingSourceProcessingEvent : Event {
    readonly attribute sequence<ContributingSource> contributingSources;
};
This way you can create a processor node and specify the frequency with
which it should be called.
[Peter Thatcher]
That looks more complicated. What's the benefit? The callback-based version of my proposal already allows specifying the frequency, and it is simpler.