
Conversation

domenic (Collaborator) commented Jan 20, 2025

Closes #40. Somewhat helps with #70.
README.md Outdated

const response1 = await session.prompt([
  "Give a helpful artistic critique of how well the second image matches the first:",
  { type: "image", data: referenceImage },
  { type: "image", data: userDrawnImage } // closing lines restored; second variable name assumed
]);


Some models may only accept a single image or a single audio input per request. Consider describing the edge-case behavior (e.g., throw a "NotSupportedError" DOMException) when an unsupported number or combination of image/audio prompt pieces is passed. Maybe also give a single-image example here for wider compatibility.
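
For illustration, here is a rough sketch of how a page could fall back if prompt() rejected that way; the "NotSupportedError" behavior and the fallback prompt are assumptions taken from the suggestion above, not merged behavior:

try {
  // Hypothetical: assumes prompt() rejects with a "NotSupportedError"
  // DOMException when too many media pieces are passed, per the suggestion above.
  await session.prompt([
    "Give a helpful artistic critique of how well the second image matches the first:",
    { type: "image", data: referenceImage },
    { type: "image", data: userDrawnImage }
  ]);
} catch (e) {
  if (e.name === "NotSupportedError") {
    // Fall back to a single-image request for wider compatibility.
    await session.prompt([
      "Give a helpful artistic critique of this image:",
      { type: "image", data: userDrawnImage }
    ]);
  } else {
    throw e;
  }
}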

domenic (Collaborator, Author):

I think we should encapsulate that away from the user so they don't have to worry about it, by sending two requests to the backend.

sushraja-msft (Contributor) commented Jan 23, 2025

I think the number-of-images aspect of this can be handled through the contextoverflow event: https://github.com/webmachinelearning/prompt-api#tokenization-context-window-length-limits-and-overflow.

Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say 17) will only run into context length limitations. The context length limit can also be hit earlier with large images.
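
For reference, a minimal sketch of listening for that event, assuming the contextoverflow event described in the linked README section fires on the session:

session.addEventListener("contextoverflow", () => {
  // Fired when the context window overflows and older prompts are evicted;
  // each image consumes part of the token budget, so many (or large) images
  // can trigger this earlier.
  console.warn("Context overflowed; earlier prompt pieces may have been dropped.");
});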

Sending two requests to the backend may not work, though: the assistant is going to add its response in between.

Are there other limitations, in terms of mixing images/audio or the number of images/audio clips? I'll have to check what the server-side models do in their APIs.

domenic (Collaborator, Author):

> Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say 17) will only run into context length limitations.

I agree with you that contextoverflow is the right way to handle this. This seems very natural to me. Each image takes up some of the token budget.

I'm hesitant to add additional "artificial" restrictions on top of the existing context limit restriction. Hopefully that won't be necessary.

> Sending two requests to the backend may not work, though: the assistant is going to add its response in between.

At least for some architectures, this should be avoidable by not sending the control token that triggers a response from the model.
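
For example (a sketch assuming a Phi-3-style chat template; other architectures differ), two user turns can be concatenated with the assistant control token sent only once, at the end:

// Illustrative raw prompt text, assuming a Phi-3-style chat template.
// <|assistant|> is the control token that triggers generation, so it is
// omitted between the two "requests" and sent only once at the end.
const rawPrompt =
  "<|user|>\nFirst batch of prompt pieces<|end|>\n" +
  "<|user|>\nSecond batch of prompt pieces<|end|>\n" +
  "<|assistant|>\n";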

> Are there other limitations, in terms of mixing images/audio or the number of images/audio clips? I'll have to check what the server-side models do in their APIs.

I've put Deep Research on this question, and we should hear back soon...

domenic (Collaborator, Author):

Deep Research report. Some takeaways:

  • Generally, multiple images are fine, and mixing modalities is fine. (Although the report seemed to imply Llama might be more single-turn oriented.)
  • There seems to be some hallucination about only one audio clip per request; clicking through the provided sources, I cannot find backing for this. So audio is probably fine too, but we should keep an eye out.
  • File size limits are pretty generous, but they do exist.
  • Pixel size limits are on the order of 1,000 × 1,000, but rescaling is the common solution, sometimes provided automatically. For the web, I think automatic rescaling is a good idea.

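For illustration, a sketch of rescaling an oversized image before prompting; the 1024-pixel cap is an assumed value, and an implementation with automatic rescaling would make this unnecessary:

// Downscale via createImageBitmap(); leaving resizeHeight unset preserves
// the aspect ratio. The 1024px limit here is only an assumption.
const largeBlob = await (await fetch("/large-photo.jpeg")).blob();
const bitmap = await createImageBitmap(largeBlob, {
  resizeWidth: 1024,
  resizeQuality: "high"
});

const description = await session.prompt([
  "Describe this image:",
  { type: "image", data: bitmap }
]);
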
aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 1, 2025
Add IDL for new LanguageModel[Factory] API types, etc., per:
  webmachinelearning/prompt-api#71

API use with currently supported input types is unchanged.
API use with new input types throws TypeErrors for now.

Move create() WPTs into a new file as separate tests.

Bug: 385173789, 385173368
Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Commit-Queue: Mike Wasserman <msw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1414456}
aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 1, 2025
This reverts commit 259d706.

Reason for revert: Failing win builder:
https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_

Original change's description:
> Prompt API: Add multimodal input IDL skeleton
>
> Add IDL for new LanguageModel[Factory] API types, etc., per:
>   webmachinelearning/prompt-api#71
>
> API use with currently supported input types is unchanged.
> API use with new input types throws TypeErrors for now.
>
> Move create() WPTs into a new file as separate tests.
>
> Bug: 385173789, 385173368
> Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
> Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
> Reviewed-by: Clark DuVall <cduvall@chromium.org>
> Commit-Queue: Mike Wasserman <msw@chromium.org>
> Cr-Commit-Position: refs/heads/main@{#1414456}

Bug: 385173789, 385173368
Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
Commit-Queue: Mike Wasserman <msw@chromium.org>
Auto-Submit: Mike Wasserman <msw@chromium.org>
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1414477}
aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 4, 2025
This reverts commit 77f9f49.

Reason for revert: Workaround Windows IDL compiler path length issues

Original change's description:
> Revert "Prompt API: Add multimodal input IDL skeleton"
>
> This reverts commit 259d706.
>
> Reason for revert: Failing win builder:
> https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_
>
> Original change's description:
> > Prompt API: Add multimodal input IDL skeleton
> >
> > Add IDL for new LanguageModel[Factory] API types, etc., per:
> >   webmachinelearning/prompt-api#71
> >
> > API use with currently supported input types is unchanged.
> > API use with new input types throws TypeErrors for now.
> >
> > Move create() WPTs into a new file as separate tests.
> >
> > Bug: 385173789, 385173368
> > Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
> > Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
> > Reviewed-by: Clark DuVall <cduvall@chromium.org>
> > Commit-Queue: Mike Wasserman <msw@chromium.org>
> > Cr-Commit-Position: refs/heads/main@{#1414456}
>
> Bug: 385173789, 385173368
> Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
> No-Presubmit: true
> No-Tree-Checks: true
> No-Try: true
> Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
> Commit-Queue: Mike Wasserman <msw@chromium.org>
> Auto-Submit: Mike Wasserman <msw@chromium.org>
> Reviewed-by: Clark DuVall <cduvall@chromium.org>
> Cr-Commit-Position: refs/heads/main@{#1414477}

Bug: 385173789, 385173368, 394123703
Change-Id: I7a47d8ff8c6b6ae797c3198608859640ae81e1df
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6222573
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Commit-Queue: Mike Wasserman <msw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1415399}
README.md Outdated

* For image inputs: [`ImageBitmapSource`](https://html.spec.whatwg.org/#imagebitmapsource), i.e. `Blob`, `ImageData`, `ImageBitmap`, `VideoFrame`, `OffscreenCanvas`, `HTMLImageElement`, `SVGImageElement`, `HTMLCanvasElement`, or `HTMLVideoElement` (will get the current frame). Also raw bytes via `BufferSource` (i.e. `ArrayBuffer` or typed arrays).

* For audio inputs: for now, `Blob`, `AudioBuffer`, `HTMLAudioElement`. Also raw bytes via `BufferSource`. Other possibilities we're investigating include `AudioData` and `MediaStream`, but we're not yet sure if those are suitable to represent "clips".
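
For illustration, a sketch that mixes two of those image source types in a single prompt, using the piece shape proposed in this PR:

// An HTMLCanvasElement (current frame) and raw bytes as image inputs.
const canvas = document.querySelector("canvas");
const bytes = new Uint8Array(await (await fetch("/photo.png")).arrayBuffer());

const comparison = await session.prompt([
  "Describe these two images:",
  { type: "image", data: canvas },
  { type: "image", data: bytes }
]);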

Reviewer:

Shall we use HTMLMediaElement instead and fail if there's no audio track enabled?

domenic (Collaborator, Author):

What would be the benefit of doing that?

Reviewer:

Web developers may want to analyse audio data from a video file. Adding HTMLMediaElement support would allow them to do that without having to use a separate HTMLAudioElement.

My question would rather be: why disallow audio inputs from HTMLVideoElement in the first place, when it actually supports audio?

domenic (Collaborator, Author):

I think it might be confusing to take video as input and only pull out the audio component. Do you know of any other web platform APIs that do that?

domenic (Collaborator, Author):

Hmm, those don't seem quite the same to me, because the video content isn't completely discarded, right? In those cases, the audio is re-routed, but the video keeps playing.

Reviewer:

I got confused. Thanks for explaining; I now see where those examples don't match what we're trying to achieve with this proposal for HTMLAudioElement support.
The audio data behind an HTMLAudioElement will be downloaded completely (if necessary), while ideally only the audio component of an HTMLMediaElement would be downloaded, not the video data.

Out of curiosity, is there a web platform API that also pulls out the audio component of an HTMLAudioElement, or for that matter of an HTMLMediaElement?

domenic (Collaborator, Author) commented Feb 19, 2025:

You raise a good point that this is unprecedented, and so I'm now a bit less sure that including HTMLAudioElement is a good idea.

To confirm, I went through all references to HTMLMediaElement and to HTMLAudioElement in the spec ecosystem and it looks like none of them consume the element in such a way.

In fact, doing so made me more hesitant about this because so many specs deal with patching HTMLMediaElement to make it "more streaming", e.g. streaming it to a web audio context, or to or from a realtime A/V stream.

The one spec that is all about consuming such data, the MediaRecorder spec, actually takes a MediaStream.

The only argument I have for accepting HTMLAudioElement is that, to me, HTMLAudioElement sits in between HTMLImageElement (which we accept per the precedent set by createImageBitmap()) and HTMLVideoElement. In all three cases, streaming is used to download them. In the image case, streaming to completion happens automatically once the image starts loading. For the video case, that's generally a bad idea, and we need strong incremental streaming support. Audio feels somewhere in between: by default it streams incrementally, but streaming to completion isn't as bad, at least on modern connections. So since we want to allow accepting an HTMLImageElement for image prompts, maybe we should accept an HTMLAudioElement for audio prompts?

But on balance perhaps it'd be best to leave out HTMLAudioElement for now, until we hear from developers that it'd be helpful for their use cases. My (not very informed) guess is that most audio sources will end up being Blobs or similar from fetch() anyway. WDYT?

Reviewer:

I agree with you that removing HTMLAudioElement support for now is probably best. We should re-add it when the API actually supports streaming data through MediaStreamTrack, like Web Speech.

As you said, web developers will be able to use fetch() for simple cases:

// fetch() returns a Promise<Response>; await it before calling .blob().
const audioBlob = await (await fetch('file.mp3')).blob();

const response = await session.prompt([
  "My response to your critique:",
  { type: "audio", content: audioBlob }
]);

aarongable pushed a commit to chromium/chromium that referenced this pull request Feb 12, 2025
Incorporate updates in github.com/webmachinelearning/prompt-api/pull/71
Use new `union_name_map.conf` to extend the content union.

Bug: 385173789, 385173368, 394123703
Change-Id: I75bf612cc8a56fe7eb5cf9d679d27afce24db4f2
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6259144
Auto-Submit: Mike Wasserman <msw@chromium.org>
Commit-Queue: Andrey Kosyakov <caseq@chromium.org>
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Reviewed-by: Andrey Kosyakov <caseq@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1419585}
domenic merged commit 331914a into main on Feb 25, 2025 (1 check passed).
domenic deleted the multimodal branch on February 25, 2025.
bil-ash commented Apr 17, 2025

It would be nice if someone provided a multimodal example for the Prompt API, because all the examples I have seen so far are text-only.

bradtriebwasser:

> It would be nice if someone provided a multimodal example for the Prompt API, because all the examples I have seen so far are text-only.

@bil-ash The README.md outlines some multimodal examples. Is there anything specific you are looking for?
