Support for transformers #375
My expectation is the WG will look at transformers and related requirements and gaps as part of its v2 feature work. We considered the initial CR "v1", so we're good to move here now. |
It would be valuable to briefly state the 2 main kinds of applications of transformers, namely predictive and generative AI; each one, if not both, will be included in the charter. |
Support for transformers was discussed on today's call: The WG felt positive about the prospects of supporting transformers in WebNN and, in accordance with the contribution guidelines, decided to first explore applicable use cases in this issue, then move to investigating sample models, cross-framework support, and cross-platform implementability. |
@fdwr has been working on a Chromium WebNN prototype fdwr/chromium-src-webnn-dml#1 to inform what additional operators are needed in WebNN to support a well-known generative AI model, Stable Diffusion. I expect this prototyping effort to help inform this discussion on use cases. I believe this prototype is WIP, so @fdwr feel free to drop a comment here when appropriate to share your findings. @dani-lbnl do you have specific predictive or generative AI models in mind that are in use in your area of research? We could look into them more closely in a similar fashion. |
We've discussed this topic on a few of our bi-weekly calls and so far proposed investigation paths include Stable Diffusion (informed by @fdwr's Chromium experiment), SegmentAnything (thanks @huningxin!), Transformers.js/HuggingFace's transformers. Want to propose something else? Drop a comment here. We should use this issue to discuss the most promising use cases enabled by transformers that are a good fit to be run in the browser, then decompose them to see what new ops would be needed in WebNN. Based on this, update the use cases in the spec as appropriate. I'm proposing we try to identify a few key use cases first to keep the spec and implementation close to each other. I'll try to keep this topic on our bi-weekly agenda so folks can also bring their input on the call. |
@xenova may be able to provide insights from Transformers.js :-) |
Thanks for the ping! I'll list a few things that I have learnt/experienced while developing Transformers.js.
**Current abilities**

Transformers.js currently supports 17 different tasks across different modalities, including:
For the full list, see https://huggingface.co/docs/transformers.js/index#tasks We're mainly focused on adding tasks which have text-based inputs at the moment, primarily due to processing limitations. Some of the other modalities work quite well (e.g., Whisper for speech-to-text; demo), while others (e.g., image segmentation) take much longer and are not suitable for CPU-based inference. Once WebGPU support is added (see here for progress), we'll continue adding the more "demanding" tasks, like text-to-image (e.g., stable diffusion).

**Limitations**

First, a discussion of the limits. The current maximum model sizes I have tested and got working reliably are between 800M and 1B parameters. The main contributing factors are:
Focusing on NLP tasks, we've been able to get relatively large models running in the browser with onnxruntime-web (using their WASM backend). This includes:
However, in practice (especially due to current tech limitations), such large (general) models are better run outside of a browser. Once WebGPU becomes more standardized and approaches native performance (currently impacted by redundant bounds checking), this might change. But as of right now, I think it's best to focus on specific use cases.

**Practical use cases**

I've mentioned to some people that the original reason I developed Transformers.js was out of a need for an ML-powered chrome extension to block spam YouTube comments. I tested BERT, DistilBERT, and T5, and they all worked pretty well! There are some problems with multithreading in chrome extensions at the moment (like this), but once that is fixed, I think you'll see many more ML-powered chrome extensions which run locally in browsers. Anyway, here are some actually useful ideas:
I look forward to seeing the progression of the WebNN standard, and I hope that one day we can add it as a backend of Transformers.js! |
This topic was discussed on our 8 June 2023 call where @xenova gave a well-received presentation on Transformers.js (thanks!). The WG had an active discussion around the presented Transformers.js-enabled use cases. Transformers.js demonstrates that a number of transformer-centric generative models for various real-world tasks are now feasible in the browser:
These tasks will now inform the WG's WebNN v2 feature priorities, similarly to how the majority of the WG's existing WebNN v1 use cases were informed by predictive ("old school AI") models when we initiated this effort. Notably, many of the existing WebNN v1 use cases, such as the NLP-related ones, are now improved by transformers. This issue remains open for further feedback, comments and contributions from other projects in this space. I expect the ongoing Stable Diffusion Chromium experiment to soon provide additional insights into text-to-image use case feasibility in the browser context. Thank you for your continued contributions everyone! 🚀 |
We continued transformer-centric discussion on our 29 June 2023 call where @fdwr gave another well-received and informative presentation on Transformer models via WebNN in ORT & Chromium (thanks again!). We agreed to use this and the earlier Transformers.js presentation as input to inform our v2 op effort. We welcome further contributions from anyone interested in this space. We discussed our intent to start with a tracking issue for v2 ops (we can reuse this issue or spin up a new one) and have detailed op-specific discussion in dedicated issues. We will use our contributing guidelines as the guide, but at a high level we want to provide a list of proposed new ops and data types to support transformer-based generative AI use cases for key models. This allows us to seek broader review outside this group for the proposed expanded op set. |
From this presentation and prototype IDL, these operators are needed for:

**Elementwise comparison**

**equal**

Compares two inputs of the same element data type and returns an 8-bit tensor with 0=false or 1=true. It follows standard IEEE rules for NaNs. Denormal/subnormal comparison behavior is unspecified, dependent on the device CPU/GPU/NPU.

```webidl
partial interface MLGraphBuilder {
  MLOperand equal(MLOperand a, MLOperand b);
};
```

Pseudocode:
Element data types:
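To make the semantics concrete, here is a minimal Python sketch (illustrative only, not the WebNN API): flat lists stand in for tensors, the 8-bit output is modeled as ints 0/1, and broadcasting is omitted.

```python
def equal(a, b):
    # Elementwise compare; returns 0/1 per element, mirroring an 8-bit
    # output tensor. Under IEEE rules NaN != NaN, so comparing NaN with
    # anything (including NaN) yields 0.
    return [1 if x == y else 0 for x, y in zip(a, b)]

print(equal([1.0, float("nan"), 3.0], [1.0, float("nan"), 2.0]))  # [1, 0, 0]
```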
**greater**

```webidl
partial interface MLGraphBuilder {
  MLOperand greater(MLOperand a, MLOperand b);
};
```

Pseudocode:
Element data types:
**lesser**

```webidl
partial interface MLGraphBuilder {
  MLOperand lesser(MLOperand a, MLOperand b);
};
```

Pseudocode:
Element data types:
Alternate names?
**Elementwise logical functions/selection**

**logicalNot**

Inverts every element of an 8-bit tensor, not to be confused with a bitwise not.

Pseudocode:
Element data types:
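A minimal Python sketch of the semantics (a flat list standing in for an 8-bit tensor; any nonzero element is treated as true):

```python
def logical_not(x):
    # Logical (not bitwise) inversion: nonzero -> 0, zero -> 1,
    # matching an 8-bit 0/1 output tensor.
    return [0 if v else 1 for v in x]

print(logical_not([0, 1, 255]))  # [1, 0, 0]
```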
(future: logicalAnd, logicalOr, logicalXor, bitwiseAnd, bitwiseOr, bitwiseXor...)

**elementwiseIf / ternary select**

A per-element immediate selection between two value tensors, driven by a condition tensor.

```webidl
partial interface MLGraphBuilder {
  MLOperand elementwiseIf(MLOperand condition, MLOperand trueValues, MLOperand falseValues);
};
```

Pseudocode:
Decomposition:

Element data types:
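As a sketch of the select semantics and one possible arithmetic decomposition (flat Python lists standing in for tensors, condition elements assumed to be exactly 0 or 1, broadcasting omitted):

```python
def elementwise_if(condition, true_values, false_values):
    # Per-element ternary select: pick from true_values where the
    # condition element is nonzero, else from false_values.
    return [t if c else f
            for c, t, f in zip(condition, true_values, false_values)]

def elementwise_if_decomposed(condition, true_values, false_values):
    # Decomposition into mul/add primitives: c*t + (1-c)*f.
    # Only valid when condition elements are exactly 0 or 1.
    return [c * t + (1 - c) * f
            for c, t, f in zip(condition, true_values, false_values)]

print(elementwise_if([1, 0, 1], [10, 20, 30], [1, 2, 3]))  # [10, 2, 30]
```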
**Notes**

Input tensors are broadcast to the final output shape. So given

Alternate names:

**greaterOrEqual / lesserOrEqual**

For set completeness, these two are worth considering too (not in the models, but missing them would be an awkward gap). Note

**More elementwise unary operations**

**identity**

Returns the input as-is. Although it's a no-op copy, having this completes the set (every ML framework has one), provides a direct mapping for frameworks, and is a useful placeholder in more complex graphs that you can insert without the caller needing to stitch up topology (e.g. swapping out an activation function with a no-op, or working around issues with split where multiple inputs have the same name). We've already encountered cases where having this would have been useful when mapping from ONNX Runtime to WebNN too.

```webidl
partial interface MLGraphBuilder {
  MLOperand identity(MLOperand input);
};
```

Pseudocode:
Element data types:
Alternate names?
**sqrt**

Elementwise square root. See #438. It is equivalent to `pow(x, 0.5)`.

```webidl
partial interface MLGraphBuilder {
  MLOperand sqrt(MLOperand input);
};
```

Pseudocode:
Element data types:
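A minimal Python sketch of the elementwise semantics; mathematically, sqrt(x) equals pow(x, 0.5), which is one possible decomposition a backend could use:

```python
def sqrt_elementwise(x):
    # Elementwise square root via the pow(v, 0.5) equivalence.
    return [v ** 0.5 for v in x]

print(sqrt_elementwise([4.0, 9.0, 0.0]))  # [2.0, 3.0, 0.0]
```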
**erf**

The Gauss error function occurs frequently in probability and statistics.

```webidl
partial interface MLGraphBuilder {
  MLOperand erf(MLOperand input);
};
```

Pseudocode: polynomial expansion approximation
Element data types:
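The pseudocode above refers to a polynomial expansion; one classic choice is the Abramowitz & Stegun 7.1.26 approximation (maximum error around 1.5e-7), sketched here in Python and checked against `math.erf`. This is an illustrative decomposition, not necessarily what any particular backend uses:

```python
import math

def erf_approx(x):
    # Abramowitz & Stegun 7.1.26: erf(x) ~= 1 - poly(t) * exp(-x^2),
    # with t = 1/(1 + p*x), valid for x >= 0; odd symmetry handles x < 0.
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
           + t * (-1.453152027 + t * 1.061405429))))
    return sign * (1.0 - poly * math.exp(-x * x))

print(abs(erf_approx(0.5) - math.erf(0.5)) < 1e-6)  # True
```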
**reciprocal**

This inverse is often used in conjunction with multiplication because it's faster than division. GPUs typically implement a dedicated instruction for it.

```webidl
partial interface MLGraphBuilder {
  MLOperand reciprocal(MLOperand input);
};
```

Pseudocode:
Element data types:
**Reshaping operations**

Reshaping operations do not modify the values; they just reinterpret the elements with a new shape. The following are a class of operators that should either all be added to the set, or all be resolved into the explicit shape by the caller and implemented via `reshape`.
|
WebNN operator | TOSA | Stable HLO |
---|---|---|
argMax | tosa.argmax | --- |
argMin | tosa.argmin | --- |
cast | tosa.cast | stablehlo.convert |
elementwiseIf | tosa.select | stablehlo.select |
equal | tosa.equal | stablehlo.compare EQ |
erf | tosa.erf | --- |
expand | tosa.tile(a, tosa.div(b.shape, a.shape)) | stablehlo.broadcast_in_dim? |
fillSequence | --- | stablehlo.iota (lacks start and step) |
gather | tosa.gather | stablehlo.gather? (much more complicated) |
greater | tosa.greater | stablehlo.compare GT |
greaterOrEqual | tosa.greater_equal | stablehlo.compare GE |
identity | tosa.identity | --- |
lesser | --- | stablehlo.compare LT |
lesserOrEqual | --- | stablehlo.compare LE |
logicalNot | tosa.logical_not | stablehlo.not with bool |
meanVarianceNormalization | --- | (nearest is stablehlo.batch_norm_inference) |
reciprocal | tosa.reciprocal | --- |
reshape | tosa.reshape | stablehlo.reshape |
sqrt | tosa.reciprocal(tosa.rsqrt) or tosa.pow | stablehlo.sqrt |
triangularMatrix | --- | --- |
`triangularMatrix` has no direct/indirect mapping to either TOSA or StableHLO, and there is no known decomposition from smaller primitives. `meanVarianceNormalization` has no exact mapping, but there exists an operator of equivalent complexity and similarity in `stablehlo.batch_norm_inference`.
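The `expand` row in the table above maps to a tile whose repetition counts are the target shape divided elementwise by the (broadcastable) input shape, i.e. `tosa.tile(a, tosa.div(b.shape, a.shape))`. A small Python sketch of that count computation (the function name is my own, for illustration):

```python
def tile_repeats(input_shape, target_shape):
    # Per-dimension repetition counts for implementing expand via tile:
    # each target extent must be a whole multiple of the input extent
    # (1 for broadcast dimensions).
    assert len(input_shape) == len(target_shape)
    assert all(t % i == 0 for i, t in zip(input_shape, target_shape))
    return [t // i for i, t in zip(input_shape, target_shape)]

print(tile_repeats([1, 3, 1], [2, 3, 4]))  # [2, 1, 4]
```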
Questions for readers:
- Do you see any operators that make sense to add to this list for set completeness (even if not actually used in the models)?
- Are there any naming incongruities you see intra-spec? As a global web standard, I expect naming to be more holistically and rigorously thought out than in some existing libraries where operators were added more ad hoc over time; but it's also important for framework developers to have a clear mapping from their library to WebNN, and so including alternately known names directly in the specification for trivial Ctrl+F searchability is wise.
Chai will create a review for it - we can comment on the details there...
The WG would also like to add text-to-text models. @xenova, given your relevant Transformers.js experience, please feel free to propose the text-to-text model(s) you think are the most appropriate targets. (This topic was discussed at WebML WG Teleconference – 24 August 2023.) |
I'd love to! I'll break them up into text-to-text generation (encoder-decoder) models and text-generation (decoder-only) models:

**Text-to-text generation (encoder-decoder)**

Of the trending text-to-text models on the Hugging Face Hub, the majority are t5-based or some variation thereof (e.g., flan-t5-base, t5-v1_1, mt5-base). Even newer models like musicgen-small use a t5 text-encoder. Other non-t5 architectures include m2m100 (e.g., m2m100_418M) and bart (e.g., rebel-large). (See the latest list: https://huggingface.co/models?pipeline_tag=text2text-generation&sort=downloads) In fact, I've already added support for each of these architectures (except musicgen) to Transformers.js, for tasks including translation, summarization, and even instruction-finetuned text2text models.

**Text-generation (decoder-only)**

On the other hand, looking at the trending text-generation models, we unsurprisingly see a ton of llama models pop up: base models (e.g., llama-2-7b) as well as finetuned versions (e.g., Platypus2-70B-instruct) for conversational use cases. However, in a web context, I haven't seen anything larger than 7 billion parameters run in the browser. To see what's currently "state-of-the-art", check out the Open LLM Leaderboard; you can also sort by model sizes (<1B, ~3B, and ~7B are most relevant to our discussions). As a side note, for those unaware, other projects (particularly mlc.ai) have demonstrated that it is possible to run 7-billion parameter models in the browser with WebGPU, with 4-bit quantization. (See the latest list: https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads) On the smaller (and more reasonable/manageable) side of things, models like gpt2, bloom, and gpt-neo are strong contenders for use in web browsers, and are very useful when finetuned on specific use cases.
My favorite use-case right now is code-completion. In these cases, the best models I've seen are: StarCoder, CodeGen, and DeciCoder. I created a little playground/demo to showcase these models being run in-browser with Transformers.js, which you can play around with here (or see the demo video). |
Per WebML WG Teleconference – 7 September 2023 discussions, the following architectures were proposed as additional targets to provide a baseline for a range of browser-friendly use cases. Thanks @xenova for sharing all the insights and data from Hugging Face Hub that help the group make informed decisions! With the model targets in place and agreed, we will follow a similar path as with the first-wave models: do an op breakdown to better understand what is common across these architectures so we can make informed decisions on priorities. We'll conduct this evaluation in public and welcome contributions. |
All these transformer models contain dynamic shapes, and the good news is that ONNX Runtime Web recently enabled a really useful feature, sessionOptions.freeDimensionOverrides, which supports dynamic shape models by fixing their free dimensions. Besides, ONNX Runtime provides various graph optimizations to improve performance, such as constant folding, shape inference, node elimination, node fusion, and so on. These are enabled by default when initializing an inference session, i.e., ONNX Runtime Web applies all enabled graph optimizations before performing model inference. That means the WebNN EP will actually run the optimized graphs rather than the original dynamic shape models. After graph optimization, I found a number of nodes are eliminated or fused. The comparison table below gives a complete view of the op changes during graph optimization in all these transformer models. |
Please check the original source data of above spreadsheet from https://docs.google.com/spreadsheets/d/1ELfHuv2UqP2LoXWLgqsC0L8T_qqfBx48KxzFighl8d8/edit#gid=86906292. Next step, I will integrate the data from @fdwr's comment to the table for TOSA and StableHLO mapping. |
Regarding this: it looks identical to an index-based accessor and could lead to out-of-bounds reads. I don't think ahead-of-time checks are possible because the indices are dynamic. I think the spec should explicitly define this behavior. Backends seem to have different opinions on how it should be handled:
I brought this up based on a recently published WebGPU security technical report: https://chromium.googlesource.com/chromium/src/+/main/docs/security/research/graphics/webgpu_technical_report.md I'd recommend a read because it provides insight into what kind of security challenges we could face by exposing access to low level hardware (e.g. something can DMA). My takeaways:
|
Just checking: regarding sqrt, the supported data types section reads "Element data types:". Is that intentional, or shouldn't this be just floats? |
@sushraja-msft: Thanks - fixed typo. |
reciprocal

This CL implements directml nodes for the unary operators logical not, identity, erf, and reciprocal. The spec for these operators is available here: webmachinelearning/webnn#375 (comment). Unit tests are added for the operators; a follow-up change will test corner cases such as data type support, sqrt(-1), etc. that require additional test helper methods.

Bug: 1273291
Change-Id: I3ffbacdff7b7e0a0604c53c1869519bc3b761026
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4990981
Reviewed-by: ningxin hu <ningxin.hu@intel.com>
Commit-Queue: Sushanth Rajasankar <Sushraja@microsoft.com>
Reviewed-by: Rafael Cintron <rafael.cintron@microsoft.com>
Reviewed-by: Alex Gough <ajgo@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1219715}
This issue has been and will be discussed actively on our calls to keep us focused on this priority task. Please continue to use this issue for feedback, suggestions, questions. The WG currently believes the following 7 models across 4 broad modalities represent good targets for this push:
The WG also believes these models are implementable on major platforms and address diverse browser-friendly use cases with user demand. The WG participants continue implementation efforts to help inform this work. We may adjust this selection of models if new information emerges from these experiments. Please see the op breakdown table for a detailed mapping of these models to the proposed new ops. Also included is an ONNX, TOSA, and StableHLO op set mapping for comparison. Many thanks to @Honry for maintaining this table. |
Great to see this thread. As part of the WebLLM project https://github.com/mlc-ai/web-llm we are also building related compilation flows for WebGPU, with the ability to run llama up to 70b (with the latest M2 Max) https://webllm.mlc.ai/ There are great synergies with webnn-related projects that could enable future hybrid execution of models (e.g. webgpu for customized ops and some through webnn). |
kSqrt

This change implements blink-side changes to enable building ML graphs with the Erf, Identity, LogicalNot, and Reciprocal operators. The spec for these operators is available here: webmachinelearning/webnn#375 (comment)

Bug: 1273291
Change-Id: Idb6d6d82428f4773c782850908cf42ae8502943a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5015083
Reviewed-by: ningxin hu <ningxin.hu@intel.com>
Reviewed-by: Rafael Cintron <rafael.cintron@microsoft.com>
Reviewed-by: Jiewei Qian <qjw@chromium.org>
Commit-Queue: Sushanth Rajasankar <Sushraja@microsoft.com>
Cr-Commit-Position: refs/heads/main@{#1222600}
@tqchen thanks for your feedback! This WG has discussed WebLLM and is hugely inspired by the project. We'd be happy to have a high-bandwidth discussion with you to hear your learnings and suggestions around hybrid execution on one of our future bi-weekly calls when it fits your schedule. We meet Thu 7 am Pacific. The feedback from Transformers.js, ONNX Runtime Web, TF.js and other frameworks has informed WebNN API development and direction. We'd like to get your first-hand insights considered too, including hearing about any browser-specific workarounds and optimizations you've had to make to get large-language models running in today's browser builds. We can help get any such issues looked at with the help of the browser engineers who participate in this WG. Edit: Spun off into its own issue: #480 |
This operator expands input shapes to new shapes; the input shapes must be broadcastable according to the numpy broadcasting rule [1]. The spec for the operator is available here [2].

[1] https://www.w3.org/TR/webnn/#biblio-numpy-broadcasting-rule
[2] webmachinelearning/webnn#375 (comment)

Bug: 1273291
Change-Id: Ic8b16c293e175bc0865883fb2b4cc93473ddf039
Cq-Include-Trybots: luci.chromium.try:win11-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5022177
Commit-Queue: Junwei Fu <junwei.fu@intel.com>
Reviewed-by: ningxin hu <ningxin.hu@intel.com>
Reviewed-by: Rafael Cintron <rafael.cintron@microsoft.com>
Reviewed-by: Jiewei Qian <qjw@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1224053}
hi! I did an experiment converting the original pytorch whisper model to webnn by hand; here are some gaps I found for the whisper model, aside from the new ops you are proposing:
@anssiko @wchao1115 @fdwr I am still new to this space, let me know if I misunderstood anything :) |
@philloooo thanks for this experiment, @xenova may have insights on this given his recent work on whisper-tiny and also distil-whisper. |
This CL implements the directml node for the unary cast operator. The spec for this operator can be found here: webmachinelearning/webnn#375 (comment). Unit tests are added for all data types the operator supports.

Bug: 1273291
Change-Id: I1a3753a1eb41dddf1fdbcef036124de1151b7edf
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5050358
Reviewed-by: ningxin hu <ningxin.hu@intel.com>
Commit-Queue: Sushanth Rajasankar <Sushraja@microsoft.com>
Reviewed-by: Alex Gough <ajgo@chromium.org>
Reviewed-by: Rafael Cintron <rafael.cintron@microsoft.com>
Cr-Commit-Position: refs/heads/main@{#1228121}
@philloooo Hi Yajing. Cool, you're trying out WebNN. Thanks for your observations.
|
thanks! @fdwr |
@philloooo: WebNN focuses on statically compiled graphs (at least as of this comment, meaning for a different size, you need to rebuild the graph), but one common technique in such cases is to round up. For example with Stable Diffusion, the text prompt is a variable size, but the token id input is padded with empty tokens and rounded up to 77 tokens, and so the same graph can work with multiple different prompts. |
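To illustrate the round-up technique, here is a small Python sketch (a hypothetical helper of my own; 77 matches the Stable Diffusion token-input size mentioned above, and 0 is a stand-in "empty token" id):

```python
def pad_token_ids(token_ids, fixed_length=77, pad_id=0):
    # Round a variable-length prompt up to the graph's fixed input size,
    # truncating if it's too long, so one statically compiled graph can
    # serve prompts of any length.
    ids = token_ids[:fixed_length]
    return ids + [pad_id] * (fixed_length - len(ids))

print(len(pad_token_ids([101, 2023, 102])))  # 77
```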
It is my honour to announce the WG has just reached another major milestone by adding support for operations needed for well-known transformers! 👏 🚀 This work happened in PR #478 and is now delivered as a new W3C Candidate Recommendation Draft published on 11 December 2023 at https://www.w3.org/TR/webnn/ For a summary of changes, see: Thank you everyone, in particular the editors @wchao1115 @huningxin & co who diligently worked on this PR addressing in total 195 review comments, @fdwr @xenova for key contributions that helped shape and formulate the initial scope in #375 (comment) and #375 (comment), @Honry for the transformer models analysis, @BruceDai @mei1127 for WPT and webnn-baseline contributions, @wacky6 for continued careful review and comments also via Chromium CLs, @inexorabletash @zolkis for contributions that helped keep this PR aligned with the latest spec authoring conventions, @miaobin @shiyi9801 @RafaelCintron for all the implementation-informed insights, and all the other contributors whose GH handles escaped me right now -- your contributions are equally appreciated! Please join us to celebrate this major milestone on our 14 December 2023 teleconference! 🥳 🍿 (We will keep this meta issue open for discussion on future enhancements.) |
While our draft charter says that the group:
and while the first two are directly mentioned in WebNN, the latter aren't.