The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly. #824
Comments
Does this also happen when using a normal web worker? 👀
Normal workers do not have this problem, but SharedWorkers do.
Interesting - thanks for the information! @guschmue any idea what's going wrong?
Not sure about SharedWorkers, let me look at it.
@xenova Debugging the codebase, I spotted one glitch after another. I'm running my code in a web worker (specifically in a web extension) and was getting issues with the dynamic imports Transformers.js tries to do in that context. After working around those, I ran into plenty more issues, so I re-implemented the pipeline step by step manually:
```js
// load tokenizer config
const tokenizerConfig = mlModel.tokenizerConfig;
const tokenizerJSON = JSON.parse(
new TextDecoder("utf-8").decode(await mlModel.tokenizer.arrayBuffer()),
);
console.log("tokenizerConfig", tokenizerConfig);
console.log("tokenizer", tokenizerJSON);
// create tokenizer
const tokenizer = new XLMRobertaTokenizer(tokenizerJSON, tokenizerConfig);
console.log("tokenizer", tokenizer);
// tokenize input
const modelInputs = tokenizer(["foo", "bar"], {
padding: true,
truncation: true,
});
console.log("modelInputs", modelInputs);
// https://huggingface.co/Xenova/multilingual-e5-small in ORT format
const mlBinaryModelBuffer = await mlModel.blob.arrayBuffer();
const modelSession = await ONNX_WEBGPU.InferenceSession.create(
mlBinaryModelBuffer,
{
executionProviders: ["webgpu"],
},
);
console.log("Created model session", modelSession);
const modelConfig = mlModel.config;
console.log("modelConfig", modelConfig);
const model = new BertModel(modelConfig, modelSession);
console.log("model", model);
const outputs = await model(modelInputs);
let result =
outputs.last_hidden_state ?? outputs.logits ?? outputs.token_embeddings;
console.log("result", result);
result = mean_pooling(result, modelInputs.attention_mask);
console.log("meanPooling result", result);
// normalize embeddings
result = result.normalize(2, -1);
console.log("normalized result", result); When I run the model, which calls So, why was it I tried my luck with this patch: And at least creating Checking the other comments on this issue: microsoft/onnxruntime#20876 (comment) I realized that I'm not the only one running into this, and checking here, I think this could lead in the same direction. This guy also had an issue right after tokenization and invoking the model, it seems like... Next step - So I tried my luck with another nasty hack... But yeah, it doesn't help... the data structure simply seems to have changed in an incompatible way, as after all of that monkey patching of data structures, we get.... As we could see in the screenshot before, the code would access But it did not help... And here I had enough debugging fun for today... good night xD |
Location is not intended to be set, because just setting it would not move the data into the right place. I'm not sure how input_ids could ever be not on CPU, because the only way to get a tensor off the CPU is to create it from GPU data explicitly. Possibly related to the transformers.js Tensor class: when we introduced gpuBuffers, the transformers.js Tensor was changed to keep the original ort tensor instead of wrapping it (the code here: https://github.com/xenova/transformers.js/blob/v3/src/utils/tensor.js#L43). Let me try your example.
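To illustrate that point, a tensor built from a plain typed array (which is what a tokenizer produces) is CPU-resident by construction; ending up off-CPU requires an explicit GPU constructor. A hedged sketch using onnxruntime-web:

```js
import { Tensor } from "onnxruntime-web";

// CPU-resident by construction: created from a plain typed array.
const inputIds = new Tensor("int64", BigInt64Array.from([0n, 42n, 2n]), [1, 3]);
console.log(inputIds.location); // "cpu"

// A tensor only ends up off the CPU via an explicit constructor
// such as Tensor.fromGpuBuffer(...).
```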
Thank you for your response @guschmue, that makes sense. You can try my example by cloning https://github.com/kyr0/redaktool and running it locally. The rest of the Transformers.js code is from yesterday's revision of this code base - a simple copy & paste with some imports removed, as they collide with Worker limitations. Thank you in advance for taking a look!
oh, this looks cool. Let me try to get it to work.
@guschmue Great, thanks a lot! :) 💯 I'm also available via Discord 👍 https://discord.gg/4wR9t7cdWc
@guschmue Is there any way I can help? Please let me know :) I could spend some time this weekend debugging and fixing things =)
I tested chrome extensions with webgpu and that works fine.
a slightly modified version of your code works for me, and returns the expected output.
Thank you @guschmue, I'll try to reproduce on my side and will get back to you soon.
Ha, running into the same issue and found my place here - thank you again @kyr0 <3
@guschmue Hmm.. are you sure that it worked for you with the published @xenova/transformers package? (One of the reasons why I forked Transformers.js and patched the code, debugged my way through the call stack, and implemented it step by step manually was that I ended up with exactly this error.) Well, currently, with your code, I'm still ending up in an error. Btw., I was wondering a bit and searched the code base for where the `device` option is handled, but couldn't find it. Also, it would be interesting for me to know how the `device` option takes effect. I've changed the build system in my project... so I think from now on I can use a fork of the current v3 branch. @ChTiSh Haha, welcome to the "stuck club" ;)
@kyr0 GitHub search doesn't index branches other than main, so you would need to inspect the code directly. For example, the device is set here (lines 149 to 162 at commit 1b4d242).
@xenova Right.. however, the import in the example code was from @xenova/transformers, so I was assuming that the latest published version was meant. But he probably has the package locally linked to a build of the v3 branch (https://github.com/xenova/transformers.js/tree/v3)? I'll re-verify with v3 locally. Sorry for the confusion..
I think V3 is linked with 1.18 onnx webgpu.
@ChTiSh Right. I forked it, checked it out locally, and pinned it to that onnxruntime version. If only there was an option to import the runtime from user space and also pass down the WASM runtime Module as a Blob. I have it all.. but both libraries (onnxruntime and transformers.js) try hard to load their runtime themselves. Maybe I can monkey patch it to provide IoC.. I had a hook working with the `importWasmModule` override:
```ts
// copied that over from the `onnxruntime-web`
import getModule from "./transformers/ort-wasm-simd-threaded.jsep";
env.backends.onnx.importWasmModule = async (
mjsPathOverride: string,
wasmPrefixOverride: string,
threading: boolean,
) => {
return [
undefined,
async (moduleArgs = {}) => {
console.log("moduleArgs", moduleArgs); // got called, continued well...
return await getModule(moduleArgs);
},
];
};
```

But then, the emscripten-generated WASM runtime wrapper JS code would still attempt to load the WASM binary dynamically...
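For anyone trying the same, a possibly simpler angle (a sketch, not verified in an MV3 service worker): onnxruntime-web lets you redirect where the .wasm/.mjs files are fetched from via `wasmPaths`, which transformers.js exposes under `env.backends.onnx`:

```js
import { env } from "@xenova/transformers";

// Point the ONNX runtime at files bundled with the extension instead of the CDN.
// The "assets/ort/" path is illustrative; it must match your bundler's output.
env.backends.onnx.wasm.wasmPaths = chrome.runtime.getURL("assets/ort/");
```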
Maybe it should be highlighted again that this is probably not a problem with "simple" web extensions and their content scripts. I'm talking about running it in the service worker of a web extension (background script):
```jsonc
// excerpt from manifest.json
"background": {
"service_worker": "src/worker.ts", // <- Transformers.js is imported here.
"type": "module"
},
```
This might sound insane, and I might be completely hallucinating, but I went through the whole process of force-overriding the onnx runtime to 1.19 and then changing the default to resolve the conflict. At the very end I reached exactly the same outcome with nothing onnx-related in the service worker except the simple thread configuration, and literally just one line, `device: 'webgpu'`, in the pipeline instance.
catching up ...
Thank you @guschmue! That explains the different runtime behaviour.

Well, off-topic: qint8 support is growing and, to some extent, available at least in recent versions of Chrome; you can check out my code to verify. But yeah, there is no generalized support yet. I should have thought about that. Thanks for the heads-up; now that I think about it, it's obvious.

And man, there is so much potential for optimization in this backend impl.. Somebody should probably rewrite all the looping over data structures in WebAssembly, or at least unroll the loops to be JIT-optimizer friendly... I demonstrated the gains from performance-optimized code here: https://github.com/kyr0/fast-dotproduct The fast-dotproduct repo also demonstrates how, using emscripten, one can inline the emscripten-generated WASM binary in the runtime file, and the runtime file in the library file, so that there is no need to load anything dynamically; it's available instantly. WebAssembly nowadays is absolutely evergreen with > 97% support per https://caniuse.com/wasm -- I think there's not even the need to check whether the constructor is available :) Just a few ideas..

ps.: Once good test coverage sets the baseline for how each algo should work exactly, it would be safe to implement an alternative in WebAssembly without many breaking changes. Currently the coverage isn't exactly great, but I guess I understand why.. for an attempt at writing an alternative set of implementations, it would really make pragmatic sense to prevent regressions :) I'd be willing to start working on optimizing mean pooling / normalization, as I need that to be fast for my in-browser vector db - just in case there is a consensus on that being a good idea :) (I'm normalizing my locally inferred text embeddings so that a simple dot product yields a cosine similarity score, as the magnitudes are already 1; so "insert speed" currently has a bottleneck in the Transformers.js normalization and pooling algos.)
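To make that last point concrete, here is a plain-JS sketch of the mean pooling + L2 normalization step being discussed, written over a nested array for readability (the transformers.js implementations operate on flat typed arrays; names here are illustrative):

```js
// Mean-pool token embeddings with an attention mask, then L2-normalize,
// so dot(a, b) equals cosine similarity for any two results.
function meanPoolNormalize(hidden, mask) {
  const dim = hidden[0].length;
  const pooled = new Float32Array(dim);
  let count = 0;
  for (let t = 0; t < hidden.length; t++) {
    if (!mask[t]) continue; // skip padding tokens
    count++;
    for (let d = 0; d < dim; d++) pooled[d] += hidden[t][d];
  }
  let norm = 0;
  for (let d = 0; d < dim; d++) {
    pooled[d] /= count || 1;
    norm += pooled[d] * pooled[d];
  }
  norm = Math.sqrt(norm) || 1;
  for (let d = 0; d < dim; d++) pooled[d] /= norm;
  return pooled;
}
```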
System Info
vue
Environment/Platform
Description
```js
pipeline(this.task, this.model, {
dtype: {
encoder_model: 'fp32',
decoder_model_merged: 'q4', // or 'fp32' ('fp16' is broken)
},
device: 'webgpu',
progress_callback,
});
```

This came up when the pipeline was using webgpu in a SharedWorker:

"Error: The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly."

This problem does not occur in a normal web worker.
Reproduction
worker.js
```js
pipeline(this.task, this.model, {
dtype: {
encoder_model: 'fp32',
decoder_model_merged: 'q4', // or 'fp32' ('fp16' is broken)
},
device: 'webgpu',
progress_callback,
});
```
app.vue
```js
new SharedWorker(new URL('./worker.js', import.meta.url), {
type: 'module',
});
```
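For completeness, a SharedWorker communicates over a `MessagePort`, so the worker side needs an `onconnect` handler; a minimal sketch of that wiring (the handler and message shape are illustrative, not taken from the reproduction):

```js
// worker.js - minimal plumbing for the SharedWorker side; `instance` stands in
// for the resolved pipeline from the reproduction above (name is illustrative).
self.onconnect = (event) => {
  const port = event.ports[0];
  port.onmessage = async (e) => {
    const output = await instance(e.data.text);
    port.postMessage(output);
  };
};
```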