Add image and audio prompting API #71
Conversation
README.md (outdated diff):
```js
const response1 = await session.prompt([
  "Give a helpful artistic critique of how well the second image matches the first:",
  { type: "image", data: referenceImage },
```
Some models may only accept a single image or single audio input with each request. Consider describing that edge-case behavior (e.g., throw a `"NotSupportedError"` `DOMException`) when an unsupported number or combination of image/audio prompt pieces is passed. Maybe also give a single-image example here for wider compatibility.
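For illustration, a single-image variant with the suggested fallback might look like this sketch (the `"NotSupportedError"` behavior is the suggestion above, not current README text, and the variable name is a placeholder):

```js
// Sketch: single-image prompt, with a fallback for the suggested
// (hypothetical) "NotSupportedError" behavior.
try {
  const critique = await session.prompt([
    "Give a helpful artistic critique of this image:",
    { type: "image", data: userDrawnImage } // placeholder variable
  ]);
  console.log(critique);
} catch (e) {
  if (e.name !== "NotSupportedError") throw e;
  // The model rejected this number/combination of multimodal pieces;
  // fall back, e.g. to prompting with one piece at a time.
}
```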
I think we should encapsulate that away from the user so they don't have to worry about it, by sending two requests to the backend.
I think the number-of-images aspect of this can be handled through the contextoverflow event: https://github.com/webmachinelearning/prompt-api#tokenization-context-window-length-limits-and-overflow.

Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say, 17) will only run into context-length limitations. The context-length limit can also be reached earlier with large images.

Sending two requests to the backend may not work, though; the assistant is going to add its response in between.

Are there other limitations in terms of mixing images/audio, or on the number of images/audio? I'll have to check what the server-side models do in their APIs.
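For reference, a minimal sketch of that approach, per the overflow section linked above (event name as in the explainer at the time; the exact shape may change):

```js
// Sketch: let the existing context-window machinery handle "too many images".
session.addEventListener("contextoverflow", () => {
  // The oldest prompts (including image/audio pieces) may have been evicted
  // to fit the context window; warn or re-prime the session as needed.
  console.warn("Prompt context overflowed.");
});
```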
> Phi 3.5 vision (https://huggingface.co/microsoft/Phi-3.5-vision-instruct) recommends at most 16 images, but passing more (say, 17) will only run into context-length limitations.

I agree with you that contextoverflow is the right way to handle this. This seems very natural to me: each image takes up some of the token budget.

I'm hesitant to add additional "artificial" restrictions on top of the existing context-limit restriction. Hopefully that won't be necessary.

> Sending two requests to the backend may not work, though; the assistant is going to add its response in between.

At least for some architectures, this should be avoidable by not sending the control token that triggers a response from the model.

> Are there other limitations in terms of mixing images/audio, or on the number of images/audio? I'll have to check what the server-side models do in their APIs.

I've put Deep Research on this question, and we should hear back soon...
Deep Research report. Some takeaways:

- Generally, multiple images are fine, and mixing modalities is fine. (Although the report seemed to imply Llama might be more single-turn-based.)
- There seems to be some hallucination about only one audio clip per request; clicking through the provided sources, I cannot find backing for this. So audio is probably fine too, but we might need to keep an eye out.
- File-size limits are pretty generous, but they do exist.
- Pixel-size limits are O(1k × 1k), but rescaling is the common solution, sometimes provided automatically. For the web, I think automatic rescaling is a good idea.
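Until/unless automatic rescaling is specified, a developer-side sketch using `createImageBitmap()`'s resize options (dimensions are illustrative; this simple form does not preserve aspect ratio):

```js
// Sketch: downscale an image before prompting, so large images don't
// exhaust the token budget or hit pixel-size limits.
const bitmap = await createImageBitmap(imageBlob, {
  resizeWidth: 1024,  // illustrative target size
  resizeHeight: 1024,
  resizeQuality: "high"
});
const description = await session.prompt([
  "Describe this image:",
  { type: "image", data: bitmap }
]);
```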
Prompt API: Add multimodal input IDL skeleton

Add IDL for new LanguageModel[Factory] API types, etc., per:
webmachinelearning/prompt-api#71

API use with currently supported input types is unchanged.
API use with new input types throws TypeErrors for now.

Move create() WPTs into a new file as separate tests.

Bug: 385173789, 385173368
Change-Id: Id8ca1c8410f1a97bb7d28b4bc568020ff0412698
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6216647
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Commit-Queue: Mike Wasserman <msw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1414456}
Revert "Prompt API: Add multimodal input IDL skeleton"

This reverts commit 259d706.

Reason for revert: failing Win builder:
https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket/8724180044051350113/+/u/compile/raw_io.output_text_failure_summary_

Original change's description:
> Prompt API: Add multimodal input IDL skeleton (quoted in full above)

Bug: 385173789, 385173368
Change-Id: If058bf22fceb6880b7ba12ebccbdca74a0274535
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6221804
Commit-Queue: Mike Wasserman <msw@chromium.org>
Auto-Submit: Mike Wasserman <msw@chromium.org>
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1414477}
This reverts commit 77f9f49.

Reason for revert: workaround for Windows IDL compiler path-length issues.

Original change's description:
> Revert "Prompt API: Add multimodal input IDL skeleton" (the revert quoted in full above)

Bug: 385173789, 385173368, 394123703
Change-Id: I7a47d8ff8c6b6ae797c3198608859640ae81e1df
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6222573
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Commit-Queue: Mike Wasserman <msw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1415399}
README.md (outdated diff):
* For image inputs: [`ImageBitmapSource`](https://html.spec.whatwg.org/#imagebitmapsource), i.e. `Blob`, `ImageData`, `ImageBitmap`, `VideoFrame`, `OffscreenCanvas`, `HTMLImageElement`, `SVGImageElement`, `HTMLCanvasElement`, or `HTMLVideoElement` (will get the current frame). Also raw bytes via `BufferSource` (i.e. `ArrayBuffer` or typed arrays).
* For audio inputs: for now, `Blob`, `AudioBuffer`, `HTMLAudioElement`. Also raw bytes via `BufferSource`. Other possibilities we're investigating include `AudioData` and `MediaStream`, but we're not yet sure if those are suitable to represent "clips".
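To illustrate one of the listed audio inputs, a sketch that decodes fetched bytes into an `AudioBuffer` via Web Audio (prompt-piece shape per the README excerpt above; the file name is hypothetical):

```js
// Sketch: turn fetched audio bytes into an AudioBuffer, then prompt with it.
const audioCtx = new AudioContext();
const bytes = await (await fetch("clip.ogg")).arrayBuffer();
const clip = await audioCtx.decodeAudioData(bytes);
const transcript = await session.prompt([
  "Transcribe this clip:",
  { type: "audio", data: clip }
]);
```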
Shall we use `HTMLMediaElement` instead, and fail if there's no audio track enabled?
What would be the benefit of doing that?
Web developers may want to analyse audio data from a video file. Adding `HTMLMediaElement` support would allow them to do that without having to use a specific `HTMLAudioElement` for this.

My question would be more: why disallow audio inputs from `HTMLVideoElement` in the first place, when it actually supports audio?
I think it might be confusing to take video as input and only pull out the audio component. Do you know of any other web platform APIs that do that?
Web Audio's `AudioContext` `createMediaElementSource()` and `HTMLMediaElement` `setSinkId()` are two examples. To a certain extent, `HTMLMediaElement` `captureStream()` also applies.
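For concreteness, a sketch of the first example: Web Audio re-routes the element's audio through the graph, while any video rendering continues as normal:

```js
// Sketch: createMediaElementSource() re-routes a media element's audio
// into the Web Audio graph; the element's video (if any) keeps playing.
const media = document.querySelector("video");
const audioCtx = new AudioContext();
const source = audioCtx.createMediaElementSource(media);
source.connect(audioCtx.destination); // audio re-routed, video unaffected
```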
Hmm, those don't seem quite the same to me, because the video content isn't completely discarded, right? In those cases, the audio is re-routed, but the video keeps playing.
I got confused. Thanks for explaining; I now see where those examples don't match what we're trying to achieve with this proposal for `HTMLAudioElement` support.

The audio data in an `HTMLAudioElement` will be downloaded completely (if necessary), while both the video and audio data from an `HTMLMediaElement` should ideally not be downloaded completely, only the audio component.

Out of curiosity, is there a web platform API that also pulls out the audio component of an `HTMLAudioElement`, or for that matter of an `HTMLMediaElement`?
You raise a good point that this is unprecedented, and so I'm now a bit less sure that including `HTMLAudioElement` is a good idea.

To confirm, I went through all references to `HTMLMediaElement` and to `HTMLAudioElement` in the spec ecosystem, and it looks like none of them consume the element in such a way. In fact, doing so made me more hesitant about this, because so many specs deal with patching `HTMLMediaElement` to make it "more streaming", e.g. streaming it to a web audio context, or to or from a realtime A/V stream. The one spec that is all about consuming such data, the `MediaRecorder` spec, actually takes a `MediaStream`.

The only argument for accepting `HTMLAudioElement` I have is that, to me, `HTMLAudioElement` sits in between `HTMLImageElement` (which we accept per the precedent set by `createImageBitmap()`) and `HTMLVideoElement`. In all three cases, streaming is used to download them. In the image case, streaming to completion happens automatically once the image starts loading. For the video case, that's generally a bad idea, and we need strong incremental streaming support. Audio feels somewhere in between: by default it streams incrementally, but streaming to completion isn't "as bad", at least on the modern internet. So since we want to accept an `HTMLImageElement` for image prompts, maybe we should accept an `HTMLAudioElement` for audio prompts?

But on balance, perhaps it'd be best to leave out `HTMLAudioElement` for now, until we hear from developers that it'd be helpful for their use cases. My (not very informed) guess is that most audio sources will end up being `Blob`s or similar from `fetch()` anyway. WDYT?
I agree with you that removing `HTMLAudioElement` support for now is probably best. We should re-add it when the API actually supports streaming data through `MediaStreamTrack`, like Web Speech.

As you said, web developers will be able to use `fetch()` for simple cases:
```js
const audioBlob = await (await fetch('file.mp3')).blob();

const response = await session.prompt([
  "My response to your critique:",
  { type: "audio", content: audioBlob }
]);
```
Incorporate updates in github.com/webmachinelearning/prompt-api/pull/71

Use new `union_name_map.conf` to extend the content union.

Bug: 385173789, 385173368, 394123703
Change-Id: I75bf612cc8a56fe7eb5cf9d679d27afce24db4f2
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6259144
Auto-Submit: Mike Wasserman <msw@chromium.org>
Commit-Queue: Andrey Kosyakov <caseq@chromium.org>
Reviewed-by: Clark DuVall <cduvall@chromium.org>
Reviewed-by: Andrey Kosyakov <caseq@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1419585}
It would be nice if someone could provide a multimodal example for the prompting API, because all the examples I have seen so far are text-only.
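For what it's worth, a hedged sketch combining modalities in a single prompt, using the prompt-piece shape from the README excerpts earlier in this thread (variable names are placeholders; key names may differ in the final spec):

```js
// Sketch: one prompt mixing text, image, and audio pieces.
const answer = await session.prompt([
  "Does this audio clip sound like it was recorded in the scene pictured?",
  { type: "image", data: sceneImageBlob },    // placeholder variable
  { type: "audio", data: fieldRecordingBlob } // placeholder variable
]);
console.log(answer);
```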
Closes #40. Somewhat helps with #70.