Remove remaining "audio" references in spec #72

Merged · 4 commits · Jan 24, 2022
audio-explainer.md: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# Audio in mediacapture-transform

This document contains arguments for including audio processing in the Breakout
Box mechanism, and preserves pieces of text that have been removed from the spec
because there is no WG consensus on including audio.

# Spec changes needed

Include `<audio>` tags as a possible destination.
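
For illustration, a minimal sketch of such a destination, assuming the proposed `AudioTrackGenerator` defined later in this document and a hypothetical `processedAudioTrack` obtained from it:

```js
// Render a generated audio MediaStreamTrack through an <audio> element.
// `processedAudioTrack` is assumed to be the .track of an AudioTrackGenerator
// (see the IDL below), transferred back from the worker doing the processing.
const audioElement = document.querySelector('audio');
audioElement.srcObject = new MediaStream([processedAudioTrack]);
audioElement.play();
```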

Under "Use cases supported", include:

- *Audio processing*: This is the equivalent of the video processing use case, but for audio tracks. This use case overlaps partially with the {{AudioWorklet}} interface, but the model provided by this specification differs in significant ways:
- Pull-based programming model, as opposed to {{AudioWorklet}}'s clock-based model. This means that processing of each single block of audio data does not have a set time budget.
- Offers direct access to the data and metadata from the original {{MediaStreamTrack}}. In particular, timestamps come directly from the track as opposed to an {{AudioContext}}.
- Easier integration with video processing by providing the same API and programming model and allowing both to run on the same scope.
- Does not run on a real-time thread. This means that the model is not suitable for applications with strong low-latency requirements.

These differences make the model provided by this specification more
suitable than {{AudioWorklet}} for processing that requires greater tolerance
to transient CPU spikes, better integration with video
{{MediaStreamTrack}}s, and access to track metadata (e.g., timestamps), but
that does not have strong low-latency requirements such as local audio rendering.

An example of this would be <a href="https://arxiv.org/abs/1804.03619">
audio-visual speech separation</a>, which can be used to combine the video
and audio tracks from a speaker on the sender side of a video call and
remove noise not coming from the speaker (i.e., the "Noisy cafeteria" case).
Other examples that do not require integration with video but can benefit
from the model include echo detection and other forms of ML-based noise
cancellation. (A sketch of the pull-based consumption model follows this list.)
- Under Multi-source processing, add: "Audio-visual speech separation, referenced above, is another case of multi-source processing."
- *Custom audio or video sink*: In this use case, the purpose is not producing a processed {{MediaStreamTrack}}, but to consume the media in a different way. For example, an application could use [[WEBCODECS]] and [[WEBTRANSPORT]] to create an {{RTCPeerConnection}}-like sink, but using different codec configuration and networking protocols.
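
A minimal sketch of the pull-based model described above, assuming {{MediaStreamTrackProcessor}} exposes {{AudioData}} chunks for audio tracks as proposed here; `audioTrack` is a hypothetical audio {{MediaStreamTrack}} and the code runs inside an async function in a dedicated worker:

```js
// Pull-based consumption of audio, in contrast to AudioWorklet's clock-driven
// process() callback: the application reads chunks at its own pace.
const processor = new MediaStreamTrackProcessor({ track: audioTrack });
const reader = processor.readable.getReader();
while (true) {
  const { value: audioData, done } = await reader.read();
  if (done) break;
  // Timestamps come directly from the track, not from an AudioContext clock.
  console.log(`chunk at ${audioData.timestamp} µs, ${audioData.numberOfFrames} frames`);
  audioData.close();
}
```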

Under 'MediaStreamTrackProcessor', include:

If the track is an audio track, the chunks will be {{AudioData}} objects.

Under "Security and Privacy considerations", include AudioData as an alternative
to VideoFrame.

The additional IDL would be:

<pre class="idl">
[Exposed=DedicatedWorker]
interface AudioTrackGenerator {
constructor();
readonly attribute WritableStream writable;
attribute boolean muted;
readonly attribute MediaStreamTrack track;
};
</pre>
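
Putting the pieces together, a sketch of a worker-side processing pipeline, assuming the proposed `AudioTrackGenerator` above and an audio-capable {{MediaStreamTrackProcessor}} are available; `denoise` stands in for an arbitrary application-defined transform:

```js
// Dedicated worker: receive an audio MediaStreamTrack, process it, and hand
// back a new track produced by the proposed AudioTrackGenerator.
self.onmessage = async ({ data: { track } }) => {
  const processor = new MediaStreamTrackProcessor({ track });
  const generator = new AudioTrackGenerator();
  // The generated track can be attached to <audio>, RTCPeerConnection, etc.
  self.postMessage({ track: generator.track }, [generator.track]);

  const transformer = new TransformStream({
    async transform(audioData, controller) {
      // `denoise` is a hypothetical function returning a new AudioData chunk.
      controller.enqueue(await denoise(audioData));
      audioData.close();
    }
  });

  await processor.readable.pipeThrough(transformer).pipeTo(generator.writable);
};
```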
index.bs: 15 additions & 31 deletions
@@ -15,7 +15,6 @@ Markup Shorthands: css no, markdown yes
<pre class=anchors>
url: https://wicg.github.io/web-codecs/#videoframe; text: VideoFrame; type: interface; spec: WEBCODECS
url: https://wicg.github.io/web-codecs/#videoencoder; text: VideoEncoder; type: interface; spec: WEBCODECS
url: https://wicg.github.io/web-codecs/#audiodata; text: AudioData; type: interface; spec: WEBCODECS
url: https://www.w3.org/TR/mediacapture-streams/#mediastreamtrack; text: MediaStreamTrack; type: interface; spec: MEDIACAPTURE-STREAMS
url: https://www.w3.org/TR/mediacapture-streams/#dom-constrainulong; text: ConstrainULong; type: typedef; spec: MEDIACAPTURE-STREAMS
url: https://www.w3.org/TR/mediacapture-streams/#dom-constraindouble; text: ConstrainDouble; type: typedef; spec: MEDIACAPTURE-STREAMS
@@ -60,32 +59,18 @@ This specification provides access to raw media,
which is the output of a media source such as a camera, microphone, screen capture,
or the decoder part of a codec and the input to the
decoder part of a codec. The processed media can be consumed by any destination
that can take a MediaStreamTrack, including HTML &lt;video&gt; and &lt;audio&gt; tags,
that can take a MediaStreamTrack, including HTML &lt;video&gt; tags,
RTCPeerConnection, canvas or MediaRecorder.

This specification explicitly aims to support the following use cases:
- *Video processing*: This is the "Funny Hats" use case, where the input is a single video track and the output is a transformed video track.
- *Audio processing*: This is the equivalent of the video processing use case, but for audio tracks. This use case overlaps partially with the {{AudioWorklet}} interface, but the model provided by this specification differs in significant ways:
- Pull-based programming model, as opposed to {{AudioWorklet}}'s clock-based model. This means that processing of each single block of audio data does not have a set time budget.
- Offers direct access to the data and metadata from the original {{MediaStreamTrack}}. In particular, timestamps come directly from the track as opposed to an {{AudioContext}}.
- Easier integration with video processing by providing the same API and programming model and allowing both to run on the same scope.
- Does not run on a real-time thread. This means that the model is not suitable for applications with strong low-latency requirements.

These differences make the model provided by this specification more
suitable than {{AudioWorklet}} for processing that requires more tolerance
to transient CPU spikes, better integration with video
{{MediaStreamTrack}}s, access to track metadata (e.g., timestamps), but
not strong low-latency requirements such as local audio rendering.

An example of this would be <a href="https://arxiv.org/abs/1804.03619">
audio-visual speech separation</a>, which can be used to combine the video
and audio tracks from a speaker on the sender side of a video call and
remove noise not coming from the speaker (i.e., the "Noisy cafeteria" case).
Other examples that do not require integration with video but can benefit
from the model include echo detection and other forms of ML-based noise
cancellation.
- *Multi-source processing*: In this use case, two or more tracks are combined into one. For example, a presentation containing a live weather map and a camera track with the speaker can be combined to produce a weather report application. Audio-visual speech separation, referenced above, is another case of multi-source processing.
- *Custom audio or video sink*: In this use case, the purpose is not producing a processed {{MediaStreamTrack}}, but to consume the media in a different way. For example, an application could use [[WEBCODECS]] and [[WEBTRANSPORT]] to create an {{RTCPeerConnection}}-like sink, but using different codec configuration and networking protocols.
- *Custom video sink*: In this use case, the purpose is not producing a processed {{MediaStreamTrack}}, but to consume the media in a different way. For example, an application could use [[WEBCODECS]] and [[WEBTRANSPORT]] to create an {{RTCPeerConnection}}-like sink, but using different codec configuration and networking protocols.
- *Multi-source processing*: In this use case, two or more tracks are combined into one. For example, a presentation containing a live weather map and a camera track with the speaker can be combined to produce a weather report application.

Note: There is no WG consensus on whether or not audio use cases should be supported.

Note: The WG expects that the Streams spec will adopt the solutions outlined in
[the relevant explainer](https://github.com/whatwg/streams/blob/main/streams-for-raw-video-explainer.md), to solve some issues with the current Streams specification.

# Specification # {#specification}

@@ -105,8 +90,8 @@ media frames as input.
A {{MediaStreamTrackProcessor}} allows the creation of a
{{ReadableStream}} that can expose the media flowing through
a given {{MediaStreamTrack}}. If the {{MediaStreamTrack}} is a video track,
the chunks exposed by the stream will be {{VideoFrame}} objects;
if the track is an audio track, the chunks will be {{AudioData}} objects.
the chunks exposed by the stream will be {{VideoFrame}} objects.

This makes {{MediaStreamTrackProcessor}} effectively a sink in the
<a href="https://www.w3.org/TR/mediacapture-streams/#the-model-sources-sinks-constraints-and-settings">
MediaStream model</a>.
@@ -683,16 +668,15 @@ This API defines a {{MediaStreamTrack}} source and a {{MediaStreamTrack}} sink.
The security and privacy of the source ({{VideoTrackGenerator}}) relies
on the same-origin policy. That is, the data {{VideoTrackGenerator}} can
make available in the form of a {{MediaStreamTrack}} must be visible to
the document before a {{VideoFrame}} or {{AudioData}} object can be constructed
the document before a {{VideoFrame}} object can be constructed
and pushed into the {{VideoTrackGenerator}}. Any attempt to create
{{VideoFrame}} or {{AudioData}} objects using cross-origin data will fail.
{{VideoFrame}} objects using cross-origin data will fail.
Therefore, {{VideoTrackGenerator}} does not introduce any new
fingerprinting surface.

The {{MediaStreamTrack}} sink introduced by this API ({{MediaStreamTrackProcessor}})
exposes the same data that is exposed by other
{{MediaStreamTrack}} sinks such as WebRTC peer connections, Web Audio
{{MediaStreamAudioSourceNode}} and media elements. The security and privacy
{{MediaStreamTrack}} sinks such as WebRTC peer connections and media elements. The security and privacy
of {{MediaStreamTrackProcessor}} relies on the security and privacy of the
{{MediaStreamTrack}} sources of the tracks to which {{MediaStreamTrackProcessor}}
is connected. For example, camera, microphone and screen-capture tracks
@@ -708,7 +692,7 @@ mitigate this risk by limiting the number of pool-backed frames a site can
hold. This can be achieved by reducing the maximum number of buffered frames
and by refusing to deliver more frames to {{MediaStreamTrackProcessor/readable}}
once the budget limit is reached. Accidental exhaustion is also mitigated by
automatic closing of {{VideoFrame}} and {{AudioData}} objects once they
automatic closing of {{VideoFrame}} objects once they
are written to a {{VideoTrackGenerator}}.

# Backwards compatibility with earlier proposals # {#backwards-compatibility}
@@ -722,7 +706,7 @@ Previous proposals for this interface had an API like this:
[Exposed=Window,DedicatedWorker]
interface MediaStreamTrackGenerator : MediaStreamTrack {
constructor(MediaStreamTrackGeneratorInit init);
attribute WritableStream writable; // VideoFrame or AudioFrame
attribute WritableStream writable; // VideoFrame or AudioData
};

dictionary MediaStreamTrackGeneratorInit {