Custom operations #6

Open · dsmilkov opened this issue Jan 10, 2019 · 24 comments
@dsmilkov commented Jan 10, 2019

Starting a thread to open the discussion for supporting custom operations.

The ML field is fast moving, and model architectures and operations are evolving quickly. In TensorFlow.js, we have around 200 ops and we still run into missing ops when someone tries to port a new model to the browser. I believe that the number of built-in ops here will be relatively small and will grow very slowly due to standardization.

Thus, it is important to provide a way for library authors to write custom ops that can interop with the built-in neural net ops. That means having high-performance data exchange between custom ops and built-in ops. I stress the importance of high performance; otherwise, library authors would revert to implementing all of the ops using lower-level APIs (e.g. WebGPU).

A good way to start is to understand the scope and complexity of the problem. We can look at technical details on how browser vendors plan to implement built-in ops, which gives us details about where these ops run and where the data lives.

@tomoyukilabs (Contributor)

I agree that we should provide a way to create custom ops. More specifically, I guess that developers would like to write custom ops in a manner similar to how they write them in Python ML frameworks, e.g. using tf.math in TensorFlow, numpy, etc.

The current charter states that basic algebra ops using WebGL, WebGPU shaders and WebAssembly SIMD are out of scope. On the other hand, platform-level neural network APIs seem to support several basic arithmetic ops, according to the ops mapping table by @huningxin. IMO, we need to survey which ops are typically needed to write custom ops and which of them are, or are not, already provided by platform-level APIs. Also, it might be important to know how browser vendors will implement built-in ops.

@DanielMazurkiewicz

@dsmilkov @tomoyukilabs - updated API proposal, please check if this would suit your needs: https://github.com/DanielMazurkiewicz/WebAI

(custom operations and atomic ml operations API described at the end of document)

@cynthia commented Feb 8, 2019

For starters, the ergonomics of algebraic operations (especially with custom units) would be problematic, as there is no operator overloading. (There have been attempts in the past, but none have been particularly successful.) The pipeline operator proposal (which will also take a decent amount of time) might help with this, but only in a limited way, since precedence isn’t nice/straightforward.
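To make the ergonomics gap concrete, here is a tiny sketch; the Tensor class and its methods are invented purely for illustration and are not part of any proposal:

// Illustrative only: a made-up Tensor class showing how algebra reads without operator overloading.
class Tensor {
  constructor(data) { this.data = Float32Array.from(data); }
  add(t) { return new Tensor(this.data.map((v, i) => v + t.data[i])); }
  mul(t) { return new Tensor(this.data.map((v, i) => v * t.data[i])); }
  relu() { return new Tensor(this.data.map(v => Math.max(0, v))); }
}

const x = new Tensor([1, -2, 3]);
const w = new Tensor([0.5, 0.5, 0.5]);
const b = new Tensor([0.1, 0.1, 0.1]);

// "relu(x * w + b)" has to be spelled out as method chaining:
const y = x.mul(w).add(b).relu();
console.log(y.data); // ≈ [0.6, 0, 1.6]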

The operator list on the document linked above definitely is not enough to do proper custom layers.

I’m of the opinion that this should probably not be in scope for the initial deliverable. It’s a pretty big pond to boil.

@tomoyukilabs (Contributor)

The current charter says, "The APIs in scope of this group will not be tied to any particular platform and will be implementable on top of existing major platform APIs." In other words, I guess that we may need to consider how operations not included in major platform-level APIs should be provided to framework implementors or web developers.

There may be a couple of possible ways to provide such operations:

  • define those operations in the WebNN spec as normative
  • provide a way to construct custom operations

The operator list on the document linked above definitely is not enough to do proper custom layers.
I’m of the opinion that this should probably not be in scope for the initial deliverable. It’s a pretty big pond to boil.

I agree with that. However, if custom operations are not supported, the set of usable DNN models will depend entirely on the operations supported by the WebNN API. We should carefully consider what the built-in and custom operations should be, in terms of both application-level use cases and ease of implementation for browser vendors.

@cynthia commented Feb 10, 2019

@dsmilkov @tomoyukilabs - updated API proposal, please check if this would suit your needs: https://github.com/DanielMazurkiewicz/WebAI

As @dsmilkov emphasized high performance: falling back to the main thread to do the operation in JS means 1) a tensor copy from the accelerator to the CPU over the bus and back, which is expensive, and 2) code that is not friendly to vectorization.

@DanielMazurkiewicz commented Feb 10, 2019

@dsmilkov @tomoyukilabs - updated API proposal, please check if this would suit your needs: https://github.com/DanielMazurkiewicz/WebAI

As @dsmilkov emphasized high performance: falling back to the main thread to do the operation in JS means 1) a tensor copy from the accelerator to the CPU over the bus and back, which is expensive, and 2) code that is not friendly to vectorization.

OK, I don't know exactly what is happening at the bottom of TF; I assumed it falls back to JS and copies data between VRAM and regular RAM for every operation (or for the first and last operation in a chain). But if not, this API proposal is still valid, since it doesn't force any particular way of using the operations. It could be used to build GPU code programmatically from JavaScript (that is the only other way I can imagine it works, but please correct me if I'm wrong here too); an appropriate set of operators (or additional operators) would just need to be provided then.

@huningxin (Contributor) commented Feb 13, 2019

@dsmilkov wrote:

Thus, it is important to provide a way for library authors to write custom ops that can interop with the built-in neural net ops.

I agree extensibility is an important requirement for the low-level API whose major consumers would be the libraries.

@tomoyukilabs wrote:

The current charter states that basic algebra ops using WebGL, WebGPU shaders and WebAssembly SIMD are out of scope.

Today's libraries implement neural network ops with WebGL and WebAssembly. I suppose they will continue to adopt new features, like WebGPU compute shaders and WebAssembly SIMD/threads, for performance enhancement. So it makes sense to me that this group leaves the implementation of custom ops out of scope.

@dsmilkov wrote:

That means having high-performance data exchange between custom ops and built-in ops. I stress the importance of high performance; otherwise, library authors would revert to implementing all of the ops using lower-level APIs (e.g. WebGPU).

I agree high performance is the main reason to create this API, as the current charter states that WebNN is for neural network inference hardware acceleration. To access hardware acceleration through WebNN, a library can partition a neural network into sub-graphs based on the ops WebNN supports. Sub-graphs whose ops are all supported can be executed by WebNN; the remaining sub-graphs need to be executed by custom ops within the library. The WebNN sub-graphs and the custom ops should exchange data (tensors) in an efficient way; otherwise, it is easy to lose the performance gain of WebNN execution to memory movement/reordering overhead. We observed this issue when experimenting with TensorFlow.js optimization on our WebNN POC earlier.
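As a rough illustration of that partitioning step (the op names and the supported-op list below are assumptions for the sketch, not any real framework or WebNN API):

// Sketch: greedily split a linear op sequence into WebNN-supported
// sub-graphs and segments that must run as the library's custom ops.
const WEBNN_OPS = new Set(['conv2d', 'add', 'relu', 'matmul']); // assumed support list

function partition(ops) {
  const segments = [];
  for (const op of ops) {
    const target = WEBNN_OPS.has(op.type) ? 'webnn' : 'custom';
    const last = segments[segments.length - 1];
    if (last && last.target === target) last.ops.push(op);
    else segments.push({ target, ops: [op] });
  }
  return segments;
}

const graph = [{ type: 'conv2d' }, { type: 'add' }, { type: 'relu' },
               { type: 'mish' },   // unsupported: runs as a custom op
               { type: 'matmul' }];
partition(graph).forEach(s => console.log(s.target, s.ops.map(o => o.type).join(', ')));
// webnn  conv2d, add, relu
// custom mish
// webnn  matmul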

@dsmilkov wrote:

A good way to start is to understand the scope and complexity of the problem. We can look at technical details on how browser vendors plan to implement built-in ops, which gives us details about where these ops run and where the data lives.

It sounds like a good approach.

Based on our POC and platform API investigation, WebNN may execute the neural network on different devices, e.g. CPU, GPU or accelerator. The API for custom ops implementation would be related to the WebNN execution device. For instance, a library may select WebAssembly-based custom ops when WebNN executes sub-graphs on CPU and select WebGPU-based custom ops when WebNN executes on GPU.

Given that WebGPU is still a work in progress, I propose to look at WebAssembly-based custom ops first. Within our POC, the existing graph execution API accepts inputs and outputs as ArrayBufferView, which WebAssembly-based custom ops are able to read and write. For WebNN CPU execution, we've prototyped backends with BNNS for macOS, MKL-DNN for Linux/Windows, and NNAPI for Android (the current version of NNAPI doesn't support hardware selection; we can work around that by using a device that only has a CPU implementation of NNAPI). Based on this setup, we can start to investigate the solution and performance of WebAssembly-based custom ops. Later, by coordinating with WebGPU (@Kangz), we can extend the investigation to WebGPU-based custom ops with WebNN GPU execution, e.g. the MPS backend of our POC.
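To sketch what that WebAssembly path could look like (the execution calls follow the POC style shown later in this thread; the customOps module and its mish export are hypothetical):

// Sketch: a WebNN (POC-style) execution writes its sub-graph output into a
// Float32Array, and a hypothetical WebAssembly kernel then consumes it as a custom op.
const output = new Float32Array(outputSize);       // outputSize: assumed, known from the graph
execution.setInput(0, inputData);                  // inputData: Float32Array
execution.setOutput(0, output);                    // WebNN writes the sub-graph result here
await execution.startCompute();

// Hypothetical Wasm custom op; customOps.memory and customOps.mish are assumptions.
const wasmView = new Float32Array(customOps.memory.buffer, 0, output.length);
wasmView.set(output);                              // copy into Wasm linear memory
customOps.mish(0, output.length);                  // run the custom kernel in place
output.set(wasmView);                              // copy the result back out

Even in this simple path, the copies into and out of Wasm linear memory are exactly the kind of overhead the investigation would need to measure.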

Any thoughts?

@tomoyukilabs (Contributor)

LGTM, @huningxin. Thanks for your detailed explanation.

@dsmilkov (Author)

LGTM as well. It comes down to high-performance data exchange between the built-in operations and user code. "Custom operations" follow from that and are the responsibility of user libraries to implement.

@huningxin (Contributor)

Thanks @tomoyukilabs @dsmilkov !

As mentioned in the 14 Feb 2019 call, I will investigate the "custom operations" support on CPU as the first step and hopefully can report back in the March call.

@gramalingam (Contributor)

Doesn't this essentially require standardizing the in-memory representation of tensors?
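For instance (a hypothetical descriptor, not anything from a spec), "standardizing the in-memory representation" would mean agreeing on something like the following, so that built-in ops and custom ops interpret the same bytes the same way:

// Hypothetical shape of a standardized tensor exchanged between built-in and custom ops.
// None of these field names come from any spec; they just show what would need agreement.
const tensor = {
  dataType: 'float32',                         // element type
  dimensions: [1, 224, 224, 3],                // logical shape
  layout: 'NHWC',                              // memory order of the dimensions
  data: new Float32Array(1 * 224 * 224 * 3),   // densely packed in the given layout
};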

@huningxin (Contributor)

I will investigate the "custom operations" support on CPU as the first step and hopefully can report back in the March call.

Here is the investigation report. I'll update it in today's CG meeting.

@gramalingam wrote:

Doesn't this essentially require standardizing the in-memory representation of tensors?

I agree this is an essential requirement. And according to our investigation, we also need to pay attention to memory re-layout overhead, say from a plain tensor layout format (e.g. NHWC) to a native CPU backend's layout (e.g. MKL-DNN uses a blocked layout).

@huningxin (Contributor) commented Jun 26, 2019

As a case study of custom ops support for frameworks, @pinzhenx and I prototyped a WebNN backend for ONNX.js.

There are some findings:

  1. If the framework is able to share the graph info with a backend, that helps the integration of a graph-building API such as WebNN. For example, in our prototype, when ONNX.js loads a model file and builds the graph in memory, the WebNN backend is invoked to transform the graph through the SessionHandler.transformGraph interface. The WebNN backend tries to partition out a sub-graph whose ops are supported by WebNN and rewrites the graph by replacing that sub-graph with a custom WebNN op node (WebNNGraphOp).
  2. If the framework is able to execute ops with multiple backends, that helps integrating WebNN with the framework's custom ops. For example, in our prototype, the WebNN backend only supports 11 ops; however, it can fall back to the WASM and CPU backends thanks to the SessionHandler.resolve and ResolveRule interfaces. So when ONNX.js executes the nodes of the graph, executing the WebNNGraphOp node actually executes the corresponding WebNN graph. The remaining op nodes are executed by the WASM backend, and if they are not supported there either, they fall back to the CPU backend.
  3. Performance-wise, the bigger the WebNN sub-graph, the better the speedup for whole-graph execution. For example, when offloading the whole graph of ResNet-50, we observed a 9X speedup (WebNN-GPU vs. WebGL) for GPU execution and a 30X speedup (WebNN-CPU vs. WASM) for CPU execution on a PC.
  4. Tensor layout conversions introduce significant overhead and should be minimized as much as possible. For example, ONNX.js uses NCHW, but the current WebNN foundation API only supports NHWC. In our prototype, when executing a WebNN sub-graph, the WebNN backend converts input tensors from NCHW to NHWC and converts output tensors back, in JavaScript code (a minimal sketch of such a conversion follows below). We may consider supporting NCHW in WebNN so the browser can optimize the conversion natively.
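For reference, a minimal sketch of such a layout conversion in JavaScript (dimension names and loop order only; not the actual ONNX.js backend code):

// Sketch: convert an NCHW Float32Array into a new NHWC Float32Array.
function nchwToNhwc(src, [N, C, H, W]) {
  const dst = new Float32Array(src.length);
  for (let n = 0; n < N; ++n)
    for (let c = 0; c < C; ++c)
      for (let h = 0; h < H; ++h)
        for (let w = 0; w < W; ++w)
          dst[((n * H + h) * W + w) * C + c] = src[((n * C + c) * H + h) * W + w];
  return dst;
}

The reverse (NHWC to NCHW) is the mirror image; doing this per execution in JavaScript is exactly the overhead item 4 above refers to.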

@huningxin (Contributor) commented Aug 22, 2019

As a follow-up to the 8 Aug 2019 call, I've done a very initial investigation of WebGPU and WebNN memory sharing. The investigation is based on the WebGPU backend of TF.js, the WebGPU Dawn project, and the WebNN POC. In particular, I only touched the Metal backend of Dawn and the MPS backend of the WebNN POC.

Compilation

For sub-graph compilation, WebNN may need to support a WebGPUDevice as the compilation target (Compilation.setGPUDevice(WebGPUDevice device)), so the framework can make sure WebNN allocates and compiles the sub-graph on the same GPU device as its WebGPU backend. For example:

// Create a Compilation object for the constructed model for sub-graph.
const nn_compilation = await model.createCompilation();

// webgpu_backend: framework's WebGPUBackend
// Get the GPUDevice of WebGPUBackend and set that as WebNN compilation target.
nn_compilation.setGPUDevice(webgpu_backend.device);

// Finish the compilation.
await nn_compilation.finish();

// Create an Execution object for the compiled model.
const nn_execution = await nn_compilation.createExecution();

In the WebNN implementation, for example the MPS backend, it would get the MTLDevice object associated with the WebGPUDevice, and it could create the MPSNNGraph with that object via MPSNNGraph - initWithDevice:resultImage.

Execution

The framework implements custom kernels in WebGPU compute shaders and uses WebGPUBuffer for data input and output. To allow the framework to interleave execution of WebGPU kernels and WebNN sub-graphs, WebNN may need to support WebGPUBuffer objects as inputs and outputs of an execution (Execution.setInput(unsigned long index, WebGPUBuffer buffer) and Execution.setOutput(unsigned long index, WebGPUBuffer buffer)). For example:

// webgpu_backend: framework's WebGPUBackend
// pre_input, pre_output, post_input, post_output: framework's Tensor backed by WebGPUBuffer
// pre_program, post_program: framework's WebGPUProgram for pre and post processing

// Write the input_data from CPU to GPU.
// input_data: Float32Array
webgpu_backend.write(pre_input, input_data);

// Compile and run pre-processing kernel in WebGPU compute shader.
webgpu_backend.compileAndRun(pre_program, [pre_input], pre_output);

// Set pre processing kernel's output WebGPUBuffer as input of WebNN execution.
nn_execution.setInput(0, tensorToGPUBuffer(pre_output));

// Set post processing kernel's input WebGPUBuffer as output of WebNN execution.
nn_execution.setOutput(0, tensorToGPUBuffer(post_input));

// Start the WebNN sub-graph execution.
nn_execution.startCompute();

// Compile and run the post-processing kernel in WebGPU compute shader.
webgpu_backend.compileAndRun(post_program, [post_input], post_output);

// Get the output data from GPU to CPU.
// output_data: Float32Array
let output_data = await webgpu_backend.read(post_output);

In the WebNN implementation, for example the MPS backend, it would:

  • Reuse the same MTLDevice associated with WebGPUDevice.
  • Get the MTLBuffer associated with input and output WebGPUBuffer.
  • Allocate MPSImage for inputs with MTLDevice.
  • Create MTLCommandBuffer from the MTLCommandQueue associated with the WebGPUDevice.
  • Encode a compute shader that copies and reorders data from MTLBuffer to MPSImage (MPSImage layout).
  • Encode MPSNNGraph to MTLCommandBuffer by MPSNNGraph - encodeToCommandBuffer:sourceImages: that returns the output MPSImage.
  • Encode a compute shader that copies and reorders data from output MPSImage to output MTLBuffer.
  • Commit MTLCommandBuffer.

Layout

Depending on the framework's and WebNN's tensor layout definitions, the framework may need to reorder tensors before and after WebNN execution.

@kainino0x commented Aug 22, 2019

Here's a thought about a way to define custom ops that sidesteps the issue of data layout. What if we could define them in a more abstract way? Consider Halide: you can define things in a pointwise way. Abstractly, for us it might look like:

relu(x)[n,h,w,c] = max(0, x[n,h,w,c])

It would take a lot of work (an entire compiler!) to generate peak-performance {shaders,programs} for things defined in this way, but, since we want peak-performance ops to be natively supported by the API (e.g. conv2d/matmul), this may be OK.
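A hypothetical JavaScript surface for such pointwise definitions might look like the following (definePointwiseOp is invented here for illustration; an implementation would still need a compiler to lower the expression to shaders or vectorized code):

// Hypothetical API sketch: a pointwise op is a scalar expression applied at
// every [n, h, w, c] index, so the definition never exposes the memory layout.
const relu = nn.definePointwiseOp(x => Math.max(0, x));
const leakyRelu = nn.definePointwiseOp((x, alpha = 0.01) => (x >= 0 ? x : alpha * x));

// Purely illustrative usage inside a graph:
// const out = relu(builder.conv2d(input, filter));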

@huningxin (Contributor) commented Sep 30, 2019

At the WebML CG F2F on 17 Sep, I shared the initial investigation results of WebNN-WebGPU interoperability. Based on the feedback, I'd like to share more details of the test code and the proposed API change.

Tests

The test code is hosted at https://github.com/huningxin/webnn_webgpu_interop_test. It uses TensorFlow.js (tensorflow/tfjs@b3eed68) as an example framework. All tests use the WebGPU backend of TensorFlow.js by invoking tf.setBackend('webgpu').

Test 1 - conv2d/add/relu in WebGPU

This test executes all ops with WebGPU compute shaders. No WebNN acceleration. source

// input, filter, bias are tf.tensor
let convOutput = tf.conv2d(input, filter, 1, 'same');
let addOutput = tf.add(convOutput, bias);
let reluOutput = tf.relu(addOutput);
let result = await reluOutput.data();

Test 2 - conv2d (WebNN) -> ArrayBufferView -> add/relu (WebGPU)

This test executes conv2d with WebNN and add/relu with WebGPU. The WebNN conv2d result is read back into a TypedArray and uploaded to a WebGPUBuffer for the WebGPU add/relu. source

// Create a WebNN model that contains conv2d
const model = await createWebNNConv(filterValue, noBias, noRelu);
const compilation = await model.createCompilation();
compilation.setPreference(nn.PREFER_SUSTAINED_SPEED);
await compilation.finish();
const execution = await compilation.createExecution();
// input and output are TypedArray
execution.setInput(0, input);
execution.setOutput(0, output);
// Wait for computation done and data is read back
await execution.startCompute();
// Upload to WebGPUBuffer
let outputTensor = tf.tensor(output, inputDims);
let addOutput = tf.add(outputTensor, biasTensor);
let reluOutput = tf.relu(addOutput);
let result = await reluOutput.data();

Test 3 - conv2d (WebNN) -> WebGPUBuffer -> add/relu (WebGPU)

This test executes conv2d with WebNN and add/relu with WebGPU. The WebNN conv2d result is written to a WebGPUBuffer, which is used as input for the WebGPU add/relu. source

// Create a WebNN model that contains conv2d
const model = await createWebNNConv(filterValue, noBias, noRelu);
const compilation = await model.createCompilation();
// Set WebNN compilation for the same WebGPUDevice as tfjs's
compilation.setGPUDevice(tf.backend().device);
compilation.setPreference(nn.PREFER_SUSTAINED_SPEED);
await compilation.finish();
const execution = await compilation.createExecution();
// input, output, bias are tf.tensor
// Get underlying WebGPUBuffer
const inputBuffer = tf.backend().getBuffer(input.dataId);
const outputBuffer = tf.backend().getBuffer(output.dataId);
// Set WebGPUBuffer as input and output to WebNN execution
execution.setInputGPUBuffer(0, inputBuffer);
execution.setOutputGPUBuffer(0, outputBuffer);
// Enqueue the execution to command buffer, no need to wait
execution.startCompute();
let addOutput = tf.add(output, bias);
let reluOutput = tf.relu(addOutput);
// Read back result from GPU
let result = await reluOutput.data();

Test 4 - conv2d/bias/relu (WebNN)

This test executes conv2d/bias/relu all in WebNN. The input and output of the WebNN execution are WebGPUBuffers that WebGPU compute shaders can produce or consume. source

// Create a WebNN model that contains conv2d, bias and relu
const model = await createWebNNConv(filterValue, biasValue, fuseRelu);
const compilation = await model.createCompilation();
// Set WebNN compilation for the same WebGPUDevice as tfjs's
compilation.setGPUDevice(tf.backend().device);
compilation.setPreference(nn.PREFER_SUSTAINED_SPEED);
await compilation.finish();
const execution = await compilation.createExecution();
// Get underlying WebGPUBuffer of input and output (tf.tensor)
const inputBuffer = tf.backend().getBuffer(input.dataId);
const outputBuffer = tf.backend().getBuffer(output.dataId);
// Set WebGPUBuffer as input and output to WebNN execution
execution.setInputGPUBuffer(0, inputBuffer);
execution.setOutputGPUBuffer(0, outputBuffer);
// Enqueue the execution to command buffer, no need to wait
execution.startCompute();
// Read back result from GPU
let result = await output.data();

Result

The prototype of WebNN-WebGPU interop support is based on Chromium 78.0.3891.0. The current prototype only supports macOS, where WebNN ops are implemented with MPSCNN kernels and WebGPU is implemented on the Metal API. The source code is hosted at https://github.com/huningxin/chromium-src/tree/webnn_webgpu_interop. The results were collected on a MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports) running macOS 10.14.6.

The test log

WebNN-WebGPU Interop Test
Start
TF.js sets backend as WebGPU
conv2d input dims: [1,100,100,100] and filter dims: [3,3,100,100]

Test1 - conv2d/add/relu (WebGPU): 63.46 ms
Test2 - conv2d (WebNN) -> ArrayBufferView -> add/relu (WebGPU): 39.57 ms
Test3 - conv2d (WebNN) -> WebGPUBuffer -> add/relu (WebGPU): 22.49 ms
Test4 - conv2d/add/relu (WebNN): 20.82 ms

Summary:

  • Test1 is how a framework implements ops today.
  • Test2 demonstrates WebNN and WebGPU interop with the existing API. The existing WebNN API only supports ArrayBufferView as inputs and outputs, so user code has to move data across GPU and CPU, which is not optimal.
  • Test3 demonstrates the WebNN API proposal that supports compilation for a WebGPUDevice and execution with WebGPUBuffer as inputs and outputs. WebNN ops and WebGPU custom ops can exchange data efficiently.
  • Test4 demonstrates WebNN sub-graph optimization, e.g. fusing add and relu into the conv2d op.

Proposal

(according to #6 (comment) @dsmilkov)

  • Hardware-optimized ops can be exposed by WebNN.
  • Custom ops can be implemented with WebGPU compute shaders.
  • Support high-performance data exchange between WebNN ops and WebGPU custom ops:
    • Propose to extend the WebNN Compilation interface to allow compiling ops for a specific WebGPUDevice (avoids moving data across GPU devices).
    • Propose to extend the WebNN Execution interface to accept WebGPUBuffer as inputs and outputs (avoids moving data across GPU and CPU).

@anssiko (Member) commented Aug 25, 2020

(Context: this topic is being discussed in the workshop GH, e.g.: w3c/machine-learning-workshop#68 (comment))

@wchao1115 @huningxin does operator composition in WebNN API address this issue adequately? Anything we'd like to spin into a separate issue?

It would be helpful if you could summarize briefly in here the proposed way forward for WebNN API so we can then confirm with @dsmilkov whether the issue could be closed.

@wchao1115 (Collaborator)

@anssiko It has been our goal to make sure that for every operator we define, if there is a semantically equivalent composition graph of lower-level operators, we also provide the definitions for all of the lower-level operators in that graph. The latest and perhaps most illustrative example of that practice is the GRU operators #83; notice that I also added definitions for slice and squeeze in the same PR. The idea is that once we complete our first set of the API, we will also have completed all the building-block operators needed to build those APIs. It's hard to know exactly when to stop, since the domain is ever evolving, but at least we know that we are also covering the fundamental computation units as we add new operators to the spec.
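As a toy illustration of that decomposition principle (the builder method names below are assumed for the sketch, not the spec's exact surface):

// Illustrative only: hardSigmoid(x) = max(0, min(1, alpha * x + beta))
// expressed as a composition of lower-level operators.
function hardSigmoid(builder, x, alpha = 0.2, beta = 0.5) {
  const scaled = builder.add(builder.mul(x, builder.constant(alpha)), builder.constant(beta));
  return builder.max(builder.constant(0), builder.min(builder.constant(1), scaled));
}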

@anssiko (Member) commented Mar 3, 2023

Putting a v2 label on this to check the WG's latest thinking as we look at our longer-term aspirations. No immediate action required.

@bbernhar

@anssiko It seems this issue could be merged with #688.

@a-sully (Contributor) commented May 30, 2024

I'll note that WebGPU interop is only one approach for supporting custom ops. While I agree we should support WebGPU interop, this approach is not free of drawbacks - e.g. we don't yet have an idea of what the performance impact of communication/synchronization and data copies between the GPUDevice and WebNN's respective execution context (e.g. a CPU, GPU, and/or dedicated ML accelerator) will be.

There are other approaches for supporting custom ops that WebNN might consider, such as StableHLO's recently added composite operator (see the RFC), which builds on @philloooo's discussions of a "core operator set" (#573).

@bbernhar commented May 30, 2024

@a-sully Understood. I would recommend we open a separate issue when (if ever) WebGPU interop performance becomes a concern, and an entirely new issue, alongside a concrete proposal, for supporting custom ops within WebNN itself. This issue is unactionable and, until recently, was inactive, so I'd vote we close.

@inexorabletash (Member)

+1 to comments by both @a-sully (core op set, composite ops, ..) and @bbernhar (need concrete proposals, this issue is not actionable)

@anssiko can we close?

@huningxin (Contributor)

e.g. we don't yet have an idea of what the performance impact of communication/synchronization and data copies between the GPUDevice and WebNN's respective execution context (e.g. a CPU, GPU, and/or dedicated ML accelerator) will be

Custom operators could also be implemented in WebAssembly and exchange tensor data with a WebNN CPU execution context. This is used by frameworks today, like ONNXRuntime Web. We should continue to support this usage even after we have WebGPU interop later.
