Action-Response Cycle bottlenecks in interactive music apps #97

@anssiko

The Interactive ML - Powered Music Applications on the Web talk by @teropa explains how a key design consideration in apps for musical instruments is latency between the user input (e.g. a key press on an instrument, a video input) and musical output as illustrated by the Action-Response Cycle:

User Input > Create Input Tensor > Upload to GPU > Run Inference > Download from GPU > Process Output Tensors > Musical Output

This cycle must execute within ~0-20 ms for the experience to feel natural.
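
For concreteness, a minimal sketch of that cycle is shown below, assuming a TensorFlow.js model (the talk does not prescribe a library); `decodeNotes` and `synth` are hypothetical helpers:

```ts
// Minimal sketch of the Action-Response Cycle, assuming TensorFlow.js and a
// loaded LayersModel; `decodeNotes` and `synth` are hypothetical helpers.
import * as tf from '@tensorflow/tfjs';

async function onUserInput(pitch: number, velocity: number, model: tf.LayersModel) {
  const t0 = performance.now();

  // Create Input Tensor (CPU side)
  const input = tf.tensor2d([[pitch, velocity]]);

  // Upload to GPU + Run Inference (with the WebGL backend the upload happens
  // when the tensor is first used by the model)
  const output = model.predict(input) as tf.Tensor;

  // Download from GPU: data() copies the result back to the CPU
  const values = await output.data();

  // Process Output Tensors > Musical Output (placeholder)
  // synth.playNotes(decodeNotes(values));

  input.dispose();
  output.dispose();

  // The whole cycle should stay within roughly 20 ms to feel immediate
  console.log(`Action-Response Cycle: ${(performance.now() - t0).toFixed(1)} ms`);
}
```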

Real-time audio is mentioned as a very constrained capability on the web platform currently:

[...] you have this task of generating 48,000 audio samples per second per channel consistently without fault. Because if you fail to do that you have an audible glitch in your outputs. So it's a very hard constraint, and it has to be deterministic because of this reason.
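
To make the constraint concrete, here is a minimal AudioWorkletProcessor sketch (my illustration, not code from the talk): at 48 kHz the browser requests 128 samples per channel roughly every 2.7 ms, and process() must finish within that budget on every single call, or the output glitches.

```ts
// Minimal AudioWorkletProcessor sketch (illustration only; worklet-scope
// typings such as `sampleRate` and `registerProcessor` are assumed).
// At 48 kHz the render quantum is 128 samples per channel, i.e. ~2.7 ms,
// and process() must return in time on every call or the output glitches.
class GeneratorProcessor extends AudioWorkletProcessor {
  private phase = 0;

  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const channel = outputs[0][0]; // one 128-sample render quantum
    for (let i = 0; i < channel.length; i++) {
      // Placeholder DSP: a 440 Hz sine; an ML model generating raw audio
      // would have to fill these samples just as deterministically.
      channel[i] = Math.sin(2 * Math.PI * 440 * this.phase);
      this.phase += 1 / sampleRate; // sampleRate is a worklet-scope global
    }
    return true; // keep the node alive
  }
}

registerProcessor('generator-processor', GeneratorProcessor);
```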

A particularly demanding task is generating actual audio data in the browser with ML (as opposed to generating symbolic music data with ML). Proposals mentioned for consideration that may help lower the latency in this scenario (a rough sketch of the first follows the list):

  • Inference running in WebAssembly on the CPU on the audio thread
  • WebNN in Worklets
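
As a rough sketch of the first option (my illustration; the exported names `infer`, `get_output_ptr` and `memory` are hypothetical), an already-compiled WebAssembly module can be transferred to the AudioWorklet via its message port and called synchronously inside process(), so inference never leaves the audio thread:

```ts
// Rough sketch of Wasm inference on the audio thread (illustration only; the
// exported names `infer`, `get_output_ptr` and `memory` are hypothetical).
// A pre-compiled WebAssembly.Module is posted to the processor's port and
// instantiated synchronously, keeping inference on the audio thread.
class WasmInferenceProcessor extends AudioWorkletProcessor {
  private exports: any = null;

  constructor() {
    super();
    this.port.onmessage = (e: MessageEvent<WebAssembly.Module>) => {
      this.exports = new WebAssembly.Instance(e.data, {}).exports;
    };
  }

  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const out = outputs[0][0];
    if (this.exports) {
      this.exports.infer(); // run one model step for this render quantum
      out.set(new Float32Array(this.exports.memory.buffer,
                               this.exports.get_output_ptr(), out.length));
    }
    return true;
  }
}

registerProcessor('wasm-inference-processor', WasmInferenceProcessor);
```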

Another use case, which involves video input (from a webcam) and musical output, has the following per-frame path:

Webcam MediaStream > Draw to Canvas > Build Pixel Tensor > Upload to GPU > Run Inference > Download from GPU > Process Output Tensors > Musical Output

Notably, the steps to get data into the model (Webcam MediaStream > Draw to Canvas > Build Pixel Tensor) take roughly half of the total per-frame time.
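
A minimal sketch of that input path, assuming TensorFlow.js and an HTMLVideoElement attached to the webcam stream (library choice and helper names are my assumptions), makes the per-frame copy and per-pixel work explicit:

```ts
// Sketch of the per-frame input path above, assuming TensorFlow.js and an
// HTMLVideoElement attached to the webcam MediaStream (my assumptions).
import * as tf from '@tensorflow/tfjs';

function buildFrameTensor(video: HTMLVideoElement,
                          ctx: CanvasRenderingContext2D): tf.Tensor3D {
  // Draw to Canvas: copy the current video frame onto a 2D canvas
  ctx.drawImage(video, 0, 0, ctx.canvas.width, ctx.canvas.height);

  // Build Pixel Tensor: read the pixels back and repack them; both the
  // readback and the per-pixel loop run on the main thread on every frame
  const { data, width, height } =
    ctx.getImageData(0, 0, ctx.canvas.width, ctx.canvas.height);
  const rgb = new Float32Array(width * height * 3);
  for (let i = 0, j = 0; i < data.length; i += 4) {
    rgb[j++] = data[i] / 255;     // R
    rgb[j++] = data[i + 1] / 255; // G
    rgb[j++] = data[i + 2] / 255; // B (alpha dropped)
  }
  return tf.tensor3d(rgb, [height, width, 3]);
}
```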

The canvas path (copying rendered video frames to a canvas element, processing pixels extracted from the canvas, and rendering the result back to a canvas) was also identified as an inefficient pattern in the Media processing hooks for the Web talk by @tidoust.

@teropa concludes that this calls for APIs providing better abstractions for feeding input data into ML models:

Could there be some APIs that give me abstractions to do this in a more direct way to get immediate input into my machine learning model, without having to do quite so much work and run quite so much slow code on each frame.
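
As one data point (my addition, not from the talk), TensorFlow.js already offers a somewhat more direct abstraction: tf.browser.fromPixels() accepts the video element directly and skips the manual getImageData round trip, though the work still runs per frame on the main thread:

```ts
// One existing shortcut (my addition, not from the talk): TensorFlow.js can
// build a tensor straight from the video element, skipping the manual
// canvas/getImageData round trip, though the work still runs per frame
// on the main thread.
import * as tf from '@tensorflow/tfjs';

function frameTensor(video: HTMLVideoElement): tf.Tensor3D {
  return tf.browser.fromPixels(video); // int32 tensor of shape [height, width, 3]
}
```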

As a summary, the talk outlines the following areas as important:

  • Low and predictable latency
  • Not compromising the CPU/GPU capacity needed by the UI or audio
  • Inference in AudioWorklet context - Wasm or native [WebNN]?
  • Media integration (e.g. fast streaming inputs from MediaStream)

This issue is to discuss the proposals that involve Web API surface improvements, as well as other problematic aspects of real-time audio use cases.

Looping in @padenot for AudioWorklet expertise as well as to reflect on the recent work on WebCodecs that might also help with these real-time audio use cases. Feel free to tag other folks who might be interested.
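
For reference, a hedged sketch of how WebCodecs-adjacent work might shorten the video input path: MediaStreamTrackProcessor (an emerging proposal at the time of writing; TypeScript typings assumed) exposes webcam frames as WebCodecs VideoFrame objects directly, with no canvas round trip. The onFrame callback is a placeholder for the tensor-building step.

```ts
// Hedged sketch: MediaStreamTrackProcessor (an emerging proposal; TypeScript
// typings assumed) exposes webcam frames as WebCodecs VideoFrame objects
// directly, with no canvas round trip. `onFrame` is a placeholder for the
// tensor-building step.
async function readFrames(stream: MediaStream,
                          onFrame: (frame: VideoFrame) => void) {
  const [track] = stream.getVideoTracks();
  const processor = new MediaStreamTrackProcessor({ track });
  const reader = processor.readable.getReader();

  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done || !frame) break;
    onFrame(frame); // e.g. frame.copyTo(buffer) and build the input tensor
    frame.close();  // release the frame's memory promptly
  }
}
```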
