Remove remaining "audio" references in spec #72

Merged · 4 commits · Jan 24, 2022
audio-explainer.md: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# Audio in mediacapture-transform

This document contains arguments for including audio processing in the Breakout
Box mechanism, and preserves pieces of text that have been removed from the spec
because there is no WG consensus on including audio.

# Spec changes needed

Include `<audio>` tags as a possible destination.
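
For illustration, a minimal sketch of such a destination, assuming the proposed `AudioTrackGenerator` defined later in this document and a hypothetical `processedAudioTrack` obtained from it:

```js
// Render a generated audio MediaStreamTrack through an <audio> element.
// `processedAudioTrack` is assumed to be the .track of an AudioTrackGenerator
// (see the IDL below), transferred back from the worker doing the processing.
const audioElement = document.querySelector('audio');
audioElement.srcObject = new MediaStream([processedAudioTrack]);
audioElement.play();
```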

Under "Use cases supported", include:

- *Audio processing*: This is the equivalent of the video processing use case, but for audio tracks. This use case overlaps partially with the {{AudioWorklet}} interface, but the model provided by this specification differs in significant ways:
- Pull-based programming model, as opposed to {{AudioWorklet}}'s clock-based model. This means that processing of each single block of audio data does not have a set time budget.
- Offers direct access to the data and metadata from the original {{MediaStreamTrack}}. In particular, timestamps come directly from the track as opposed to an {{AudioContext}}.
- Easier integration with video processing by providing the same API and programming model and allowing both to run on the same scope.
- Does not run on a real-time thread. This means that the model is not suitable for applications with strong low-latency requirements.

These differences make the model provided by this specification more
suitable than {{AudioWorklet}} for processing that requires greater tolerance
to transient CPU spikes, better integration with video
{{MediaStreamTrack}}s, and access to track metadata (e.g., timestamps), but
that does not have strong low-latency requirements such as local audio rendering.

An example of this would be <a href="https://arxiv.org/abs/1804.03619">
audio-visual speech separation</a>, which can be used to combine the video
and audio tracks from a speaker on the sender side of a video call and
remove noise not coming from the speaker (i.e., the "Noisy cafeteria" case).
Other examples that do not require integration with video but can benefit
from the model include echo detection and other forms of ML-based noise
cancellation. (A sketch of the pull-based consumption model follows this list.)
- Under Multi-source processing, add: "Audio-visual speech separation, referenced above, is another case of multi-source processing."
- *Custom audio or video sink*: In this use case, the purpose is not producing a processed {{MediaStreamTrack}}, but to consume the media in a different way. For example, an application could use [[WEBCODECS]] and [[WEBTRANSPORT]] to create an {{RTCPeerConnection}}-like sink, but using different codec configuration and networking protocols.
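
A minimal sketch of the pull-based model described above, assuming {{MediaStreamTrackProcessor}} exposes {{AudioData}} chunks for audio tracks as proposed here; `audioTrack` is a hypothetical audio {{MediaStreamTrack}} and the code runs inside an async function in a dedicated worker:

```js
// Pull-based consumption of audio, in contrast to AudioWorklet's clock-driven
// process() callback: the application reads chunks at its own pace.
const processor = new MediaStreamTrackProcessor({ track: audioTrack });
const reader = processor.readable.getReader();
while (true) {
  const { value: audioData, done } = await reader.read();
  if (done) break;
  // Timestamps come directly from the track, not from an AudioContext clock.
  console.log(`chunk at ${audioData.timestamp} µs, ${audioData.numberOfFrames} frames`);
  audioData.close();
}
```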

Under 'MediaStreamTrackProcessor', include:

If the track is an audio track, the chunks will be {{AudioData}} objects.

Under "Security and Privacy considerations", include AudioData as an alternative
to VideoFrame.

The additional IDL would be:

<pre class="idl">
[Exposed=DedicatedWorker]
interface AudioTrackGenerator {
constructor();
readonly attribute WritableStream writable;
attribute boolean muted;
readonly attribute MediaStreamTrack track;
};
</pre>
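
Putting the pieces together, a sketch of a worker-side processing pipeline, assuming the proposed `AudioTrackGenerator` above and an audio-capable {{MediaStreamTrackProcessor}} are available; `denoise` stands in for an arbitrary application-defined transform:

```js
// Dedicated worker: receive an audio MediaStreamTrack, process it, and hand
// back a new track produced by the proposed AudioTrackGenerator.
self.onmessage = async ({ data: { track } }) => {
  const processor = new MediaStreamTrackProcessor({ track });
  const generator = new AudioTrackGenerator();
  // The generated track can be attached to <audio>, RTCPeerConnection, etc.
  self.postMessage({ track: generator.track }, [generator.track]);

  const transformer = new TransformStream({
    async transform(audioData, controller) {
      // `denoise` is a hypothetical function returning a new AudioData chunk.
      controller.enqueue(await denoise(audioData));
      audioData.close();
    }
  });

  await processor.readable.pipeThrough(transformer).pipeTo(generator.writable);
};
```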
index.bs: 15 additions & 31 deletions
@@ -15,7 +15,6 @@ Markup Shorthands: css no, markdown yes
<pre class=anchors>
url: https://wicg.github.io/web-codecs/#videoframe; text: VideoFrame; type: interface; spec: WEBCODECS
url: https://wicg.github.io/web-codecs/#videoencoder; text: VideoEncoder; type: interface; spec: WEBCODECS
url: https://wicg.github.io/web-codecs/#audiodata; text: AudioData; type: interface; spec: WEBCODECS
url: https://www.w3.org/TR/mediacapture-streams/#mediastreamtrack; text: MediaStreamTrack; type: interface; spec: MEDIACAPTURE-STREAMS
url: https://www.w3.org/TR/mediacapture-streams/#dom-constrainulong; text: ConstrainULong; type: typedef; spec: MEDIACAPTURE-STREAMS
url: https://www.w3.org/TR/mediacapture-streams/#dom-constraindouble; text: ConstrainDouble; type: typedef; spec: MEDIACAPTURE-STREAMS
@@ -60,32 +59,18 @@ This specification provides access to raw media,
which is the output of a media source such as a camera, microphone, screen capture,
or the decoder part of a codec and the input to the
decoder part of a codec. The processed media can be consumed by any destination
that can take a MediaStreamTrack, including HTML &lt;video&gt; and &lt;audio&gt; tags,
that can take a MediaStreamTrack, including HTML &lt;video&gt; tags,
RTCPeerConnection, canvas or MediaRecorder.

This specification explicitly aims to support the following use cases:
- *Video processing*: This is the "Funny Hats" use case, where the input is a single video track and the output is a transformed video track.
- *Audio processing*: This is the equivalent of the video processing use case, but for audio tracks. This use case overlaps partially with the {{AudioWorklet}} interface, but the model provided by this specification differs in significant ways:
- Pull-based programming model, as opposed to {{AudioWorklet}}'s clock-based model. This means that processing of each single block of audio data does not have a set time budget.
- Offers direct access to the data and metadata from the original {{MediaStreamTrack}}. In particular, timestamps come directly from the track as opposed to an {{AudioContext}}.
- Easier integration with video processing by providing the same API and programming model and allowing both to run on the same scope.
- Does not run on a real-time thread. This means that the model is not suitable for applications with strong low-latency requirements.

These differences make the model provided by this specification more
suitable than {{AudioWorklet}} for processing that requires more tolerance
to transient CPU spikes, better integration with video
{{MediaStreamTrack}}s, access to track metadata (e.g., timestamps), but
not strong low-latency requirements such as local audio rendering.

An example of this would be <a href="https://arxiv.org/abs/1804.03619">
audio-visual speech separation</a>, which can be used to combine the video
and audio tracks from a speaker on the sender side of a video call and
remove noise not coming from the speaker (i.e., the "Noisy cafeteria" case).
Other examples that do not require integration with video but can benefit
from the model include echo detection and other forms of ML-based noise
cancellation.
- *Multi-source processing*: In this use case, two or more tracks are combined into one. For example, a presentation containing a live weather map and a camera track with the speaker can be combined to produce a weather report application. Audio-visual speech separation, referenced above, is another case of multi-source processing.
- *Custom audio or video sink*: In this use case, the purpose is not producing a processed {{MediaStreamTrack}}, but to consume the media in a different way. For example, an application could use [[WEBCODECS]] and [[WEBTRANSPORT]] to create an {{RTCPeerConnection}}-like sink, but using different codec configuration and networking protocols.
- *Custom video sink*: In this use case, the purpose is not producing a processed {{MediaStreamTrack}}, but to consume the media in a different way. For example, an application could use [[WEBCODECS]] and [[WEBTRANSPORT]] to create an {{RTCPeerConnection}}-like sink, but using different codec configuration and networking protocols.
- *Multi-source processing*: In this use case, two or more tracks are combined into one. For example, a presentation containing a live weather map and a camera track with the speaker can be combined to produce a weather report application.

Note: There is no WG consensus on whether or not audio use cases should be supported.

Note: The WG expects that the Streams spec will adopt the solutions outlined in
[the relevant explainer](https://github.com/whatwg/streams/blob/main/streams-for-raw-video-explainer.md), to solve some issues with the current Streams specification.

# Specification # {#specification}

@@ -105,8 +90,8 @@ media frames as input.
A {{MediaStreamTrackProcessor}} allows the creation of a
{{ReadableStream}} that can expose the media flowing through
a given {{MediaStreamTrack}}. If the {{MediaStreamTrack}} is a video track,
the chunks exposed by the stream will be {{VideoFrame}} objects;
if the track is an audio track, the chunks will be {{AudioData}} objects.
the chunks exposed by the stream will be {{VideoFrame}} objects.

This makes {{MediaStreamTrackProcessor}} effectively a sink in the
<a href="https://www.w3.org/TR/mediacapture-streams/#the-model-sources-sinks-constraints-and-settings">
MediaStream model</a>.
@@ -683,16 +668,15 @@ This API defines a {{MediaStreamTrack}} source and a {{MediaStreamTrack}} sink.
The security and privacy of the source ({{VideoTrackGenerator}}) relies
on the same-origin policy. That is, the data {{VideoTrackGenerator}} can
make available in the form of a {{MediaStreamTrack}} must be visible to
the document before a {{VideoFrame}} or {{AudioData}} object can be constructed
the document before a {{VideoFrame}} object can be constructed
and pushed into the {{VideoTrackGenerator}}. Any attempt to create
{{VideoFrame}} or {{AudioData}} objects using cross-origin data will fail.
{{VideoFrame}} objects using cross-origin data will fail.
Therefore, {{VideoTrackGenerator}} does not introduce any new
fingerprinting surface.

The {{MediaStreamTrack}} sink introduced by this API ({{MediaStreamTrackProcessor}})
exposes the same data that is exposed by other
{{MediaStreamTrack}} sinks such as WebRTC peer connections, Web Audio
{{MediaStreamAudioSourceNode}} and media elements. The security and privacy
{{MediaStreamTrack}} sinks such as WebRTC peer connections and media elements. The security and privacy
of {{MediaStreamTrackProcessor}} relies on the security and privacy of the
{{MediaStreamTrack}} sources of the tracks to which {{MediaStreamTrackProcessor}}
is connected. For example, camera, microphone and screen-capture tracks
@@ -708,7 +692,7 @@ mitigate this risk by limiting the number of pool-backed frames a site can
hold. This can be achieved by reducing the maximum number of buffered frames
and by refusing to deliver more frames to {{MediaStreamTrackProcessor/readable}}
once the budget limit is reached. Accidental exhaustion is also mitigated by
automatic closing of {{VideoFrame}} and {{AudioData}} objects once they
automatic closing of {{VideoFrame}} objects once they
are written to a {{VideoTrackGenerator}}.

# Backwards compatibility with earlier proposals # {#backwards-compatibility}
@@ -722,7 +706,7 @@ Previous proposals for this interface had an API like this:
[Exposed=Window,DedicatedWorker]
interface MediaStreamTrackGenerator : MediaStreamTrack {
constructor(MediaStreamTrackGeneratorInit init);
attribute WritableStream writable; // VideoFrame or AudioFrame
attribute WritableStream writable; // VideoFrame or AudioData
};

dictionary MediaStreamTrackGeneratorInit {