
The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly. #824

Open
raodaqi opened this issue Jun 27, 2024 · 25 comments
Labels
bug Something isn't working

Comments

@raodaqi commented Jun 27, 2024

System Info

vue

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

pipeline(this.task, this.model, {
  dtype: {
    encoder_model: 'fp32',
    decoder_model_merged: 'q4', // or 'fp32' ('fp16' is broken)
  },
  device: 'webgpu',
  progress_callback,
});
This problem came up when using webgpu from a SharedWorker:
"Error: The data is not on CPU. Use getData() to download GPU data to CPU, or use texture or gpuBuffer property to access the GPU data directly."
[screenshot]

Reproduction

worker.js
pipeline(this.task, this.model, {
  dtype: {
    encoder_model: 'fp32',
    decoder_model_merged: 'q4', // or 'fp32' ('fp16' is broken)
  },
  device: 'webgpu',
  progress_callback,
});

app.vue
new SharedWorker(new URL('./worker.js', import.meta.url), {
  type: 'module',
});
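
For context (not part of the original report): a SharedWorker communicates with each page through a port rather than a single onmessage handler, so the worker side typically needs wiring along these lines. This is only an illustrative sketch; runPipeline is a hypothetical helper wrapping the pipeline call shown above.

// Illustrative SharedWorker-side wiring; `runPipeline` is a hypothetical
// helper that wraps the pipeline call from the reproduction above.
self.onconnect = (event) => {
  const port = event.ports[0];
  port.onmessage = async (e) => {
    const result = await runPipeline(e.data);
    port.postMessage(result);
  };
};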

@raodaqi added the bug (Something isn't working) label on Jun 27, 2024
@xenova (Owner) commented Jun 29, 2024

Does this also happen when using a normal web worker? 👀

@raodaqi (Author) commented Jul 1, 2024

Does this also happen when using a normal web worker? 👀

Normal workers do not have this problem, but SharedWorkers do.
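
For anyone needing a stopgap in the meantime, the dedicated-Worker setup that reportedly avoids the error is simply (sketch, mirroring the SharedWorker snippet from the report and reusing the same worker.js):

// Workaround sketch: a dedicated Worker instead of a SharedWorker.
const worker = new Worker(new URL('./worker.js', import.meta.url), {
  type: 'module',
});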

@xenova (Owner) commented Jul 1, 2024

Interesting - thanks for the information! @guschmue any idea what's going wrong?

@guschmue commented Jul 2, 2024

Not sure about SharedWorkers, let me look at it.
Without looking, I'd assume SharedWorkers behave similarly to proxy = true.
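
If the SharedWorker path really does behave like proxy = true, one thing worth trying (a guess, not a confirmed fix) is forcing the proxy flag off before the pipeline is created; task and model stand in for the values from the report:

import { env, pipeline } from '@xenova/transformers';

// Guess, not a confirmed fix: force the non-proxied wasm/webgpu path.
env.backends.onnx.wasm.proxy = false;

const pipe = await pipeline(task, model, { device: 'webgpu' });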

@kyr0 commented Jul 9, 2024

@xenova Debugging the codebase I spotted one glitch after another.

I'm running my code in a web worker (specifically in a web extension) and was running into issues with dynamic imports and the webgpu backend not being available. So I checked for issues with the onnxruntime-web package, as this seemed to be an upstream issue. I found microsoft/onnxruntime#20876 and switched to onnxruntime-web@1.19.0-dev.20240621-69d522f4e9 as suggested by the developer.

After that, I ran into plenty of issues with the dynamic imports that transformers.js tries to do in env.js, so I got a bit tired and simply forked it, removed the code that tries to use Node.js functionality, and got rid of all the auto-loading. I ended up reverse engineering the pipeline that I needed for intfloat/multilingual-e5-small and got the following code running fine; the session is initialized just fine, all good. It runs without issues until await model(modelInputs); is invoked.

  // load tokenizer config
  const tokenizerConfig = mlModel.tokenizerConfig;
  const tokenizerJSON = JSON.parse(
    new TextDecoder("utf-8").decode(await mlModel.tokenizer.arrayBuffer()),
  );

  console.log("tokenizerConfig", tokenizerConfig);
  console.log("tokenizer", tokenizerJSON);

  // create tokenizer
  const tokenizer = new XLMRobertaTokenizer(tokenizerJSON, tokenizerConfig);

  console.log("tokenizer", tokenizer);

  // tokenize input
  const modelInputs = tokenizer(["foo", "bar"], {
    padding: true,
    truncation: true,
  });

  console.log("modelInputs", modelInputs);

  // https://huggingface.co/Xenova/multilingual-e5-small in ORT format
  const mlBinaryModelBuffer = await mlModel.blob.arrayBuffer();

  const modelSession = await ONNX_WEBGPU.InferenceSession.create(
    mlBinaryModelBuffer,
    {
      executionProviders: ["webgpu"],
    },
  );
  console.log("Created model session", modelSession);

  const modelConfig = mlModel.config;
  console.log("modelConfig", modelConfig);

  const model = new BertModel(modelConfig, modelSession);
  console.log("model", model);

  const outputs = await model(modelInputs);

  let result =
    outputs.last_hidden_state ?? outputs.logits ?? outputs.token_embeddings;
  console.log("result", result);

  result = mean_pooling(result, modelInputs.attention_mask);
  console.log("meanPooling result", result);

  // normalize embeddings
  result = result.normalize(2, -1);

  console.log("normalized result", result);

When I ran the model, which calls encoderForward(), the first issue occurred: setting token_type_ids to a zeroed Tensor didn't work because, apparently, model_inputs.input_ids.data was undefined.

So why was it undefined? I noticed that the proxied Tensor instance for token_type_ids now has a property called dataLocation (not location), and it was set to "cpu". Also, the data property no longer exists; instead there is a cpuData property storing the data:

[screenshot]

I tried my luck with this patch:

[screenshot]

And at least creating encoderFeeds.token_type_ids worked.
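
For reference, a sketch of that kind of workaround (building encoderFeeds.token_type_ids by hand; cpuData is the property name observed above, and the fallback to data covers the older layout):

import * as ort from 'onnxruntime-web';

// Build a zeroed token_type_ids tensor with the same dims as input_ids.
// `cpuData` is the property seen in the screenshots; `data` is the older name.
const inputIds = modelInputs.input_ids;
const idsData = inputIds.cpuData ?? inputIds.data;
encoderFeeds.token_type_ids = new ort.Tensor(
  'int64',
  new BigInt64Array(idsData.length), // all zeros
  inputIds.dims,
);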

Checking the other comments on microsoft/onnxruntime#20876 (comment), I realized that I'm not the only one running into this, and I think it points in the same direction. That user also hit an issue right after tokenization, when invoking the model, it seems...

Next step - onnxruntime-web really doesn't like its own data structure:
[screenshot]

[screenshot]

So I tried my luck with another nasty hack...

[screenshot]

But yeah, it doesn't help... the data structure simply seems to have changed in an incompatible way, because after all of that monkey patching of data structures, we get...

[screenshots]

As we could see in the screenshot before, the code would again access e.data instead of cpuData, which could potentially lead to a .byteLength of undefined. So I tried:

[screenshot]

But it did not help...

And here I had enough debugging fun for today... good night xD

@guschmue commented Jul 9, 2024

The location is not intended to be set directly, because just setting it would not move the data to the right place.
The only way to set the location is indirectly, via 'new ort.Tensor()' or 'ort.Tensor.fromGpuBuffer()'.

I'm not sure how input_ids could ever be not on the CPU, because the only ways to get a tensor off the CPU are to
call ort.Tensor.fromGpuBuffer() or to list an output in session_options.preferredOutputLocation.
transformers.js doesn't call the first, and input_ids would never be an output.
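
To make that concrete, a minimal sketch of the two ways a tensor ends up off the CPU, and of the getData() call the error message refers to (illustrative only; gpuBuffer, dims, modelBuffer and feeds are assumed to exist):

import * as ort from 'onnxruntime-web';

// 1) Wrapping an existing WebGPU buffer in a tensor:
const gpuTensor = ort.Tensor.fromGpuBuffer(gpuBuffer, { dataType: 'float32', dims });

// 2) Asking the session to keep an output on the GPU:
const session = await ort.InferenceSession.create(modelBuffer, {
  executionProviders: ['webgpu'],
  preferredOutputLocation: { last_hidden_state: 'gpu-buffer' },
});

// Reading GPU-resident data back to the CPU:
const results = await session.run(feeds);
const cpuData = await results.last_hidden_state.getData(); // downloads to the CPU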

Possibly related to the transformers.js Tensor class. When we introduced GPU buffers, the transformers.js Tensor was changed to keep the original ort tensor instead of wrapping it (the code here: https://github.com/xenova/transformers.js/blob/v3/src/utils/tensor.js#L43).

Let me try your example.

@kyr0 commented Jul 9, 2024

Thank you for your response @guschmue. That makes sense. You can try my example by cloning https://github.com/kyr0/redaktool and running bun install && bun run dev. After that, you can load the extension in Chrome or Edge by visiting chrome://extensions/, enabling developer mode, and clicking "Load unpacked extension". After selecting the folder of the cloned extension code (the folder that contains the manifest.json), it will load. You can then open a new tab of your choice or reload an existing one. Opening the service worker from the chrome://extensions/ tab will show the log. Up to modelInputs it works fine. If you uncomment all the code down to https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/model.ts#L69, run bun run dev again, re-install the extension code, reload the tab, and re-open the service worker log, you'll find that the issues start to occur. You can find my monkey-patching attempts here: https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/transformers/models.js#L503 and here: https://github.com/kyr0/redaktool/blob/main/src/lib/worker/embedding/transformers/models.js#L197 (you may want to comment out the latter to restore normal behaviour).

The rest of the transformers.js code is from yesterday's revision of this codebase: a simple copy & paste with some imports removed, as they collide with Worker limitations.

Thank you in advance for taking a look!

@guschmue commented Jul 9, 2024

Oh, this looks cool. Let me try to get it to work.
We'll get it sorted :)

@kyr0 commented Jul 10, 2024

@guschmue Great, thanks a lot! :) 💯 I'm also available via Discord 👍 https://discord.gg/4wR9t7cdWc

@kyr0 commented Jul 12, 2024

@guschmue Is there any way I can help? Please let me know :) I could spend some time this weekend debugging and fixing things =)

@guschmue

I tested Chrome extensions with webgpu and that works fine.
I'll get to your code soon.

@guschmue

A slightly modified version of your code works for me:

import { env, pipeline, AutoModel, AutoTokenizer } from '@xenova/transformers';

env.localModelPath = 'models/';
env.allowRemoteModels = false;
env.allowLocalModels = true;
env.backends.onnx.wasm.wasmPaths = "/public/";
env.backends.onnx.wasm.proxy = false;

const model_name = 'Xenova/multilingual-e5-small';
const tokenizer = await AutoTokenizer.from_pretrained(model_name);
let model = undefined;
AutoModel.from_pretrained(model_name, {device: 'webgpu'}).then((a) => {
    model = a;
    self.postMessage({
        status: 'ready'
    });
});


async function run(input_text) {
    const tokens = tokenizer(input_text);
    const output = await model(tokens);
    console.log(output);
    return "done";
}

self.addEventListener('message', async (event) => {
    const data = event.data;
    const text = data.text;

    run(text).then((result) => {
        self.postMessage({
            status: 'resp',
            text: result
        });
    });
});

and returns:

ot {cpuData: Float32Array(88320), dataLocation: 'cpu', type: 'float32', dims: Array(3), size: 88320}
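
For completeness (not from the original comment), a hypothetical main-thread counterpart driving that worker could look like this:

const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });

worker.addEventListener('message', (event) => {
  if (event.data.status === 'ready') {
    worker.postMessage({ text: 'query: how do I embed text in the browser?' });
  } else if (event.data.status === 'resp') {
    console.log('worker replied:', event.data.text);
  }
});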

@kyr0 commented Jul 17, 2024

Thank you @guschmue, I'll try to reproduce on my side and will get back to you soon.

@ChTiSh commented Jul 17, 2024

Ha, running into the same issue and found my place here, thank you again @kyr0 <3

@kyr0 commented Jul 18, 2024

@guschmue Hmm.. are you sure it worked for you with the webgpu backend and didn't silently fall back to the WASM backend (downstream in Transformers.js)? Because it seems that @xenova/transformers has webgpu disabled: https://github.com/xenova/transformers.js/blob/main/src/backends/onnx.js#L28

(That's one of the reasons why I forked Transformers.js and patched the code, debugged my way through the call stack, and implemented it step by step manually, to the point where I ended up at forwardEncode on the ONNX runtime, running into several issues there.)

Well, currently, with your code, I'm ending up with a "no available backend found." error.
[screenshot]

Btw. from_pretrained isn't processing the device property either, if I'm not mistaken:
https://github.com/xenova/transformers.js/blob/main/src/processors.js#L2208

I was wondering a bit, and checked for the device symbol in the whole repo and found only docs related code:
https://github.com/search?q=repo%3Axenova%2Ftransformers.js%20device&type=code

Also, it would be interesting to know how the postMessage code worked in my setup, and how the fetch would resolve the WASM runtime inside a Worker in my code, with the public dir set to /public/.

I've changed the build system in my project... so I think from now on I can use a fork of the current main revision and track down the issue more easily...

@ChTiSh Haha, welcome to the "stuck club" ;)

@xenova (Owner) commented Jul 18, 2024

@kyr0 GitHub search doesn't index branches other than main, so you would need to inspect the code directly. For example, the device is set here:

let device = options.device;
if (device && typeof device !== 'string') {
    if (device.hasOwnProperty(fileName)) {
        device = device[fileName];
    } else {
        console.warn(`device not specified for "${fileName}". Using the default device.`);
        device = null;
    }
}

// If the device is not specified, we use the default (supported) execution providers.
const executionProviders = deviceToExecutionProviders(
    /** @type {import("./utils/devices.js").DeviceType|null} */(device)
);
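
Based on that snippet, device can be a single string or an object keyed by model file name; a hedged usage sketch (the file names here are illustrative):

// Single device for all model files:
const model = await AutoModel.from_pretrained(model_id, { device: 'webgpu' });

// Or per model file, falling back to the default where unspecified:
const pipe = await pipeline(task, model_id, {
  device: {
    encoder_model: 'webgpu',
    decoder_model_merged: 'wasm',
  },
});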

@kyr0 commented Jul 19, 2024

@xenova Right.. however, the import in the code here was from @xenova/transformers, so I assumed it was the latest published version, i.e. @xenova/transformers@2.17.2.

But he probably has the package locally linked to a build of the v3 branch? I'll re-verify with v3 locally. Sorry for the confusion..

@ChTiSh commented Jul 19, 2024 via email

@kyr0 commented Jul 19, 2024

@ChTiSh Right. I forked it, checked it out locally, pinned it to 1.19.0-dev.20240621-69d522f4e9, built it, linked it locally, and used the code from above. But it attempts to import() dynamically, and that's not going to work in a worker extension context:
[screenshot]

If only there were an option to import the runtime from user space and also pass down the WASM runtime Module as a Blob. I have it all.. but both libraries (onnxruntime and transformers.js) try hard to fetch()/import() dynamically.

Maybe I can monkey patch it to provide IoC.. I had a hook working with the main branch where the internal call to importWasmModule would be passed up to the userland code so that I could implement env.backends.onnx.importWasmModule like that:

// copied over from `onnxruntime-web`
import getModule from "./transformers/ort-wasm-simd-threaded.jsep";

env.backends.onnx.importWasmModule = async (
  mjsPathOverride: string,
  wasmPrefixOverride: string,
  threading: boolean,
) => {
  return [
    undefined,
    async (moduleArgs = {}) => {
      console.log("moduleArgs", moduleArgs); // got called, continued well...
      return await getModule(moduleArgs);
    },
  ];
};

But then, the emscripten-generated WASM runtime wrapper JS code would still attempt to fetch() the actual WASM file. I'm still looking for a way to pass the Blob or an object URL in via moduleArgs, or to trick the backend state by mocking importWasmModule completely and assigning the internal WASM module reference some other way.
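
One avenue that might work (an assumption, not something verified in this thread): emscripten-generated factories generally accept a wasmBinary override in their module arguments, which skips the internal fetch() of the .wasm file. Building on the importWasmModule hook above, with wasmBlob as a hypothetical Blob holding the WASM binary:

// Assumption: the emscripten factory honors `wasmBinary` in its module args,
// bypassing its internal fetch() of the .wasm file. `wasmBlob` is hypothetical.
const wasmBinary = await wasmBlob.arrayBuffer();

env.backends.onnx.importWasmModule = async () => [
  undefined,
  async (moduleArgs = {}) => getModule({ ...moduleArgs, wasmBinary }),
];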

@kyr0 commented Jul 19, 2024

Maybe it should be highlighted again that this is probably not a problem with "simple" web extensions and their content scripts. I'm talking about running it in the service worker of a web extension (background script).

// excerpt from manifest.json
"background": {
    "service_worker": "src/worker.ts", // <- Transformers.js is imported here.
    "type": "module"
  },

@ChTiSh commented Jul 19, 2024

This might sound insane, and I might be completely hallucinating, but I went through the whole process of force-overriding the onnx runtime to 1.19 and then changing the default to resolve the conflict, and at the very end I reached exactly the same outcome by putting nothing related to onnx in the service worker except the simple thread configuration, and literally just having one line, device: 'webgpu', in the instance.
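
For anyone else pinning the same version: one common way to force a transitive onnxruntime-web version with npm is an overrides entry in package.json (sketch; the version tag is the one mentioned earlier in this thread):

{
  "overrides": {
    "onnxruntime-web": "1.19.0-dev.20240621-69d522f4e9"
  }
}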

@kyr0 commented Jul 19, 2024

Yeah, the "funny" part is, if you debug it through, will Tensorflow.js/ORT internally actually use the GPU? Because here I am, running the code like that and... have fun checking the screenshots :)

Code:
[screenshot]

When I debug from_pretrained(), we can clearly see that the session is configured to use webgpu:
[screenshot]

It's constructing a session...
[screenshot]

But here the fun begins... why is the instance a WebAssembly one?
[screenshot]

[screenshot]

Result:
[screenshot]

I guess I'm hallucinating too xD Well, I haven't checked the ORT implementation... maybe the WASM backend calls through to WebGPU and returns the data via the HEAP, which is then passed back by the runtime and deserialized as an ONNX Tensor reporting its location as cpu... but it's late, 4:30 am again, I'm going to sleep... :)

@guschmue

Catching up...
Yes, I'm sure I'm using webgpu.
My package.json points to a local repo with the transformers.js v3 branch.
And in the service worker console I can see logs from webgpu, which you can enable with:

env.backends.onnx.logLevel = "verbose";
env.backends.onnx.debug = true;


@guschmue

q8 isn't going to work with webgpu - webgpu itself doesn't support it yet (though that might come).
We'd fall back to the wasm op.
env.backends.onnx.logLevel = "verbose";
would tell you which device each op landed on.

[screenshot]
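
So, to keep the heavy ops on the webgpu device, one option is to request a non-quantized dtype explicitly (sketch, mirroring the options from the original report):

// Sketch: avoid q8 weights so ops are not forced back onto the wasm EP.
const pipe = await pipeline(task, model_id, {
  device: 'webgpu',
  dtype: 'fp32', // or a per-file map, as in the original report
});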

@kyr0 commented Jul 22, 2024

Thank you @guschmue ! That explains the different runtime behaviour.

Well, off-topic: limited core qint8 functionality is growing and is, to some extent, available at least in recent versions of Chrome. You can check out my code to verify:
https://github.com/kyr0/fast-dotproduct/blob/main/experiments/dot4U8Packed.js#L29

But yeah, there is no generalized shader-u8 or anything, that's right. There's only shader-f16 for float16 data types:
https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/backend-webgpu.ts#L231

I should have thought about that. Thanks for the heads-up. Now that I think about it, it's obvious.

And man, there is so much potential for optimization in this backend impl.. Somebody should probably rewrite all the looping over data structures in WebAssembly, or at least unroll the loops to be JIT-optimizer friendly:
https://github.com/microsoft/onnxruntime/blob/main/js/web/lib/wasm/jsep/backend-webgpu.ts#L532

I demonstrated the gains from performance-optimized code here: https://github.com/kyr0/fast-dotproduct
Analyzing this repo's code, I realized that https://github.com/xenova/transformers.js/blob/v3/src/utils/tensor.js and https://github.com/xenova/transformers.js/blob/v3/src/utils/maths.js in particular could also benefit massively from JIT optimization and a WebAssembly-based implementation.

The fast-dotproduct repo also demonstrates how, using emscripten, one can inline the emscripten-generated WASM binary in the runtime file, and the runtime file in the library file, so that there is no need to load anything dynamically; it's available instantly. WebAssembly nowadays is absolutely evergreen, with > 97% support on https://caniuse.com/wasm -- I don't think there's even a need to check whether the constructor is available or not :)

Just a few ideas..

P.S.: Once good test coverage sets the baseline for exactly how each algorithm should work, it would be safe to implement an alternative in WebAssembly without many breaking changes. Currently, the coverage isn't exactly great, but I guess I understand why.. still, for an attempt at writing an alternative set of implementations, it would make real pragmatic sense, to prevent regressions :)

I'd be willing to start working on optimizing mean pooling / normalization, as I need that to be fast for my in-browser vector DB, in case there is consensus that this is a good idea :) (I'm normalizing my locally inferred text embeddings so that a simple dot product yields a cosine similarity score, since the magnitudes are already 1; so "insert speed" currently has a bottleneck in the Transformers.js normalization and pooling algorithms; see the sketch below.)
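
For reference, a plain-JS baseline sketch of the masked mean pooling + L2 normalization step being discussed, assuming a row-major [batch, seqLen, hidden] Float32Array and a 0/1 attention mask; any WASM or loop-unrolled variant would be measured against something like this:

// Masked mean pooling followed by L2 normalization.
function meanPoolAndNormalize(hiddenStates, attentionMask, dims) {
  const [batch, seqLen, hidden] = dims;
  const out = new Float32Array(batch * hidden);
  for (let b = 0; b < batch; b++) {
    let count = 0;
    for (let t = 0; t < seqLen; t++) {
      if (!Number(attentionMask[b * seqLen + t])) continue; // skip padding tokens
      count++;
      const base = (b * seqLen + t) * hidden;
      for (let h = 0; h < hidden; h++) out[b * hidden + h] += hiddenStates[base + h];
    }
    let norm = 0;
    for (let h = 0; h < hidden; h++) {
      out[b * hidden + h] /= Math.max(count, 1); // mean over non-padded tokens
      norm += out[b * hidden + h] * out[b * hidden + h];
    }
    norm = Math.sqrt(norm) || 1;
    for (let h = 0; h < hidden; h++) out[b * hidden + h] /= norm; // unit length
  }
  return out;
}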
