
Conversation

domenic (Collaborator) commented Jan 20, 2025

Closes #40. Somewhat helps with #70.
README.md Outdated

const response1 = await session.prompt([
  "Give a helpful artistic critique of how well the second image matches the first:",
  { type: "image", data: referenceImage },
  { type: "image", data: userDrawnImage } // closing lines restored; second variable name assumed
]);


Some models may only accept a single image or a single audio input per request. Consider describing the edge-case behavior (e.g., throw a "NotSupportedError" DOMException) when an unsupported number or combination of image/audio prompt pieces is passed. Maybe also give a single-image example here for wider compatibility.
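
For illustration, here is a rough sketch of how a page could fall back if prompt() rejected that way; the "NotSupportedError" behavior and the fallback prompt are assumptions taken from the suggestion above, not merged behavior:

try {
  // Hypothetical: assumes prompt() rejects with a "NotSupportedError"
  // DOMException when too many media pieces are passed, per the suggestion above.
  await session.prompt([
    "Give a helpful artistic critique of how well the second image matches the first:",
    { type: "image", data: referenceImage },
    { type: "image", data: userDrawnImage }
  ]);
} catch (e) {
  if (e.name === "NotSupportedError") {
    // Fall back to a single-image request for wider compatibility.
    await session.prompt([
      "Give a helpful artistic critique of this image:",
      { type: "image", data: userDrawnImage }
    ]);
  } else {
    throw e;
  }
}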

domenic (Collaborator, Author):

I think we should encapsulate that away from the user so they don't have to worry about it, by sending two requests to the backend.

sushraja-msft (Contributor) commented Jan 23, 2025

I think the number-of-images aspect of this can be handled through the contextoverflow event: https://github.com/webmachinelearning/prompt-api#tokenization-context-window-length-limits-and-overflow.

Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say 17) will only run into context length limitations. The context length limit can also be hit earlier with large images.
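
For reference, a minimal sketch of listening for that event, assuming the contextoverflow event described in the linked README section fires on the session:

session.addEventListener("contextoverflow", () => {
  // Fired when the context window overflows and older prompts are evicted;
  // each image consumes part of the token budget, so many (or large) images
  // can trigger this earlier.
  console.warn("Context overflowed; earlier prompt pieces may have been dropped.");
});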

Sending two requests to the backend may not work, though: the assistant is going to add its response in between.

Are there other limitations, in terms of mixing images/audio or the number of images/audio clips? I'll have to check what the server-side models do in their APIs.

domenic (Collaborator, Author):

> Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say 17) will only run into context length limitations.

I agree with you that contextoverflow is the right way to handle this. This seems very natural to me. Each image takes up some of the token budget.

I'm hesitant to add additional "artificial" restrictions on top of the existing context limit restriction. Hopefully that won't be necessary.

> Sending two requests to the backend may not work, though: the assistant is going to add its response in between.

At least for some architectures, this should be avoidable by not sending the control token that triggers a response from the model.
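
For example (a sketch assuming a Phi-3-style chat template; other architectures differ), two user turns can be concatenated with the assistant control token sent only once, at the end:

// Illustrative raw prompt text, assuming a Phi-3-style chat template.
// <|assistant|> is the control token that triggers generation, so it is
// omitted between the two "requests" and sent only once at the end.
const rawPrompt =
  "<|user|>\nFirst batch of prompt pieces<|end|>\n" +
  "<|user|>\nSecond batch of prompt pieces<|end|>\n" +
  "<|assistant|>\n";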

> Are there other limitations, in terms of mixing images/audio or the number of images/audio clips? I'll have to check what the server-side models do in their APIs.

I've put Deep Research on this question, and we should hear back soon...

domenic (Collaborator, Author):

Deep Research report. Some takeaways:

  • Generally, multiple images are fine, and mixing modalities is fine. (Although the report seemed to imply Llama might be more single-turn oriented.)
  • There seems to be some hallucination about only one audio clip per request; clicking through the provided sources, I cannot find backing for this. So audio is probably fine too, but we should keep an eye out.
  • File size limits are pretty generous, but they do exist.
  • Pixel size limits are on the order of 1,000 × 1,000, but rescaling is the common solution, sometimes provided automatically. For the web, I think automatic rescaling is a good idea.

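For illustration, a sketch of rescaling an oversized image before prompting; the 1024-pixel cap is an assumed value, and an implementation with automatic rescaling would make this unnecessary:

// Downscale via createImageBitmap(); leaving resizeHeight unset preserves
// the aspect ratio. The 1024px limit here is only an assumption.
const largeBlob = await (await fetch("/large-photo.jpeg")).blob();
const bitmap = await createImageBitmap(largeBlob, {
  resizeWidth: 1024,
  resizeQuality: "high"
});

const description = await session.prompt([
  "Describe this image:",
  { type: "image", data: bitmap }
]);
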
aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 1, 2025
Add IDL for new LanguageModel[Factory] API types, etc., per:
  webmachinelearning/prompt-api#71

API use with currently supported input types is unchanged.
API use with new input types throws TypeErrors for now.

Move create() WPTs into a new file as separate tests.

Bug: 385173789, 385173368
Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Commit-Queue: Mike Wasserman <msw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1414456}
aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 1, 2025
This reverts commit 259d706.

Reason for revert: Failing win builder:
https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_

Original change's description:
> Prompt API: Add multimodal input IDL skeleton
>
> Add IDL for new LanguageModel[Factory] API types, etc., per:
>   webmachinelearning/prompt-api#71
>
> API use with currently supported input types is unchanged.
> API use with new input types throws TypeErrors for now.
>
> Move create() WPTs into a new file as separate tests.
>
> Bug: 385173789, 385173368
> Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
> Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
> Reviewed-by: Clark DuVall <cduvall@chromium.org>
> Commit-Queue: Mike Wasserman <msw@chromium.org>
> Cr-Commit-Position: refs/heads/main@{#1414456}

Bug: 385173789, 385173368
Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
Commit-Queue: Mike Wasserman <msw@chromium.org>
Auto-Submit: Mike Wasserman <msw@chromium.org>
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1414477}
aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 4, 2025
This reverts commit 77f9f49.

Reason for revert: Workaround Windows IDL compiler path length issues

Original change's description:
> Revert "Prompt API: Add multimodal input IDL skeleton"
>
> This reverts commit 259d706.
>
> Reason for revert: Failing win builder:
> https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_
>
> Original change's description:
> > Prompt API: Add multimodal input IDL skeleton
> >
> > Add IDL for new LanguageModel[Factory] API types, etc., per:
> >   webmachinelearning/prompt-api#71
> >
> > API use with currently supported input types is unchanged.
> > API use with new input types throws TypeErrors for now.
> >
> > Move create() WPTs into a new file as separate tests.
> >
> > Bug: 385173789, 385173368
> > Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
> > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
> > Reviewed-by: Clark DuVall <cduvall@chromium.org>
> > Commit-Queue: Mike Wasserman <msw@chromium.org>
> > Cr-Commit-Position: refs/heads/main@{#1414456}
>
> Bug: 385173789, 385173368
> Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
> No-Presubmit: true
> No-Tree-Checks: true
> No-Try: true
> Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
> Commit-Queue: Mike Wasserman <msw@chromium.org>
> Auto-Submit: Mike Wasserman <msw@chromium.org>
> Reviewed-by: Clark DuVall <cduvall@chromium.org>
> Cr-Commit-Position: refs/heads/main@{#1414477}

Bug: 385173789, 385173368, 394123703
Change-Id: I7a47d8ff8c6b6ae797c3198608859640ae81e1df
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6222573
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Commit-Queue: Mike Wasserman <msw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1415399}
README.md Outdated

* For image inputs: [`ImageBitmapSource`](https://html.spec.whatwg.org/#imagebitmapsource), i.e. `Blob`, `ImageData`, `ImageBitmap`, `VideoFrame`, `OffscreenCanvas`, `HTMLImageElement`, `SVGImageElement`, `HTMLCanvasElement`, or `HTMLVideoElement` (will get the current frame). Also raw bytes via `BufferSource` (i.e. `ArrayBuffer` or typed arrays).

* For audio inputs: for now, `Blob`, `AudioBuffer`, `HTMLAudioElement`. Also raw bytes via `BufferSource`. Other possibilities we're investigating include `AudioData` and `MediaStream`, but we're not yet sure if those are suitable to represent "clips".
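
For illustration, a sketch that mixes two of those image source types in a single prompt, using the piece shape proposed in this PR:

// An HTMLCanvasElement (current frame) and raw bytes as image inputs.
const canvas = document.querySelector("canvas");
const bytes = new Uint8Array(await (await fetch("/photo.png")).arrayBuffer());

const comparison = await session.prompt([
  "Describe these two images:",
  { type: "image", data: canvas },
  { type: "image", data: bytes }
]);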

Reviewer:

Shall we use HTMLMediaElement instead and fail if there's no audio track enabled?

domenic (Collaborator, Author):

What would be the benefit of doing that?

Reviewer:

Web developers may want to analyse audio data from a video file. Adding HTMLMediaElement support would allow them to do that without having to use a separate HTMLAudioElement.

My question would rather be: why disallow audio inputs from HTMLVideoElement in the first place, when it actually supports audio?

domenic (Collaborator, Author):

I think it might be confusing to take video as input and only pull out the audio component. Do you know of any other web platform APIs that do that?

domenic (Collaborator, Author):

Hmm, those don't seem quite the same to me, because the video content isn't completely discarded, right? In those cases, the audio is re-routed, but the video keeps playing.

Reviewer:

I got confused. Thanks for explaining; I now see where those examples don't match what we're trying to achieve with this proposal for HTMLAudioElement support.
The audio data behind an HTMLAudioElement will be downloaded completely (if necessary), while ideally only the audio component of an HTMLMediaElement would be downloaded, not the video data.

Out of curiosity, is there a web platform API that also pulls out the audio component of an HTMLAudioElement, or for that matter of an HTMLMediaElement?

domenic (Collaborator, Author) commented Feb 19, 2025:

You raise a good point that this is unprecedented, and so I'm now a bit less sure that including HTMLAudioElement is a good idea.

To confirm, I went through all references to HTMLMediaElement and to HTMLAudioElement in the spec ecosystem and it looks like none of them consume the element in such a way.

In fact, doing so made me more hesitant about this because so many specs deal with patching HTMLMediaElement to make it "more streaming", e.g. streaming it to a web audio context, or to or from a realtime A/V stream.

The one spec that is all about consuming such data, the MediaRecorder spec, actually takes a MediaStream.

The only argument I have for accepting HTMLAudioElement is that, to me, HTMLAudioElement sits in between HTMLImageElement (which we accept per the precedent set by createImageBitmap()) and HTMLVideoElement. In all three cases, streaming is used to download them. In the image case, streaming to completion happens automatically once the image starts loading. For the video case, that's generally a bad idea, and we need strong incremental streaming support. Audio feels somewhere in between: by default it streams incrementally, but streaming to completion isn't as bad, at least on modern connections. So since we want to allow accepting an HTMLImageElement for image prompts, maybe we should accept an HTMLAudioElement for audio prompts?

But on balance perhaps it'd be best to leave out HTMLAudioElement for now, until we hear from developers that it'd be helpful for their use cases. My (not very informed) guess is that most audio sources will end up being Blobs or similar from fetch() anyway. WDYT?

Reviewer:

I agree with you that removing HTMLAudioElement support for now is probably best. We should re-add it when the API actually supports streaming data through MediaStreamTrack, like Web Speech.

As you said, web developers will be able to use fetch() for simple cases:

// fetch() returns a Promise<Response>; await it before calling .blob().
const audioBlob = await (await fetch('file.mp3')).blob();

const response = await session.prompt([
  "My response to your critique:",
  { type: "audio", content: audioBlob }
]);

aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 12, 2025
Incorporate updates in github.com/webmachinelearning/prompt-api/pull/71
Use new `union_name_map.conf` to extend the content union.

Bug: 385173789, 385173368, 394123703
Change-Id: I75bf612cc8a56fe7eb5cf9d679d27afce24db4f2
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6259144
Auto-Submit: Mike Wasserman <msw@chromium.org>
Commit-Queue: Andrey Kosyakov <caseq@chromium.org>
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Reviewed-by: Andrey Kosyakov <caseq@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1419585}
domenic merged commit 331914a into main on Feb 25, 2025 (1 check passed).
domenic deleted the multimodal branch on February 25, 2025.
bil-ash commented Apr 17, 2025

It would be nice if someone provided a multimodal example for the Prompt API, because all the examples I have seen so far are text-only.

bradtriebwasser:

> It would be nice if someone provided a multimodal example for the Prompt API, because all the examples I have seen so far are text-only.

@bil-ash The README.md outlines some multimodal examples. Is there anything specific you are looking for?
