
Feature Request: Accelerate TensorFlow core on FPGA - How? #8820

Closed
wilderfield opened this issue Mar 29, 2017 · 16 comments
Labels
  • comp:micro: Related to TensorFlow Lite Microcontrollers
  • stale: marks the issue/PR stale, to be closed automatically if no activity
  • stat:contribution welcome: contributions welcome
  • type:feature: feature requests

Comments

@wilderfield

Consider the two following hardware scenarios:

  1. Linux running on x86 with FPGA fabric connected via PCIe
  2. Linux running on an Arm A53 with an AXI interface to FPGA fabric (think Xilinx Zynq)

How could the TensorFlow core be accelerated for these scenarios?
FPGA vendors do offer OpenCL SDKs, so kernels compiled to FPGA bitstreams can be driven through the standard OpenCL host APIs to parallelize computations.

Forgive me, I am still learning in this area, but I am intrigued by the future possibility of this and would love to help in any way I can.
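
For concreteness, here is a minimal sketch of the host side of that vendor OpenCL flow (Xilinx SDAccel / Intel FPGA SDK style). The bitstream file name and kernel name are made up, and buffer setup is elided:

```cpp
// Host-side sketch: load a precompiled FPGA bitstream as an OpenCL "binary"
// and create a kernel from it. File name "matmul.xclbin" and kernel name
// "matmul" are hypothetical.
#include <CL/cl.h>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);

  cl_int err;
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

  // FPGA targets use clCreateProgramWithBinary on a vendor-compiled
  // bitstream rather than clCreateProgramWithSource.
  std::ifstream f("matmul.xclbin", std::ios::binary);
  std::vector<unsigned char> bin((std::istreambuf_iterator<char>(f)),
                                 std::istreambuf_iterator<char>());
  const unsigned char* bin_ptr = bin.data();
  size_t bin_size = bin.size();
  cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                              &bin_ptr, nullptr, &err);
  clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);

  cl_kernel kernel = clCreateKernel(prog, "matmul", &err);
  // ... create cl_mem buffers, clSetKernelArg, clEnqueueNDRangeKernel,
  // clEnqueueReadBuffer as in any other OpenCL host program ...
  return 0;
}
```

Whatever acceleration path TensorFlow exposes would ultimately have to drive something like this underneath, whether over PCIe or over AXI on a Zynq.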

@aselle
Contributor

aselle commented Mar 31, 2017

This would be a great question for StackOverflow! I'm not aware of people who have talked about how to do this already, but that would be the best place to look. Thanks!

@aselle aselle closed this as completed Mar 31, 2017
@wilderfield
Author

wilderfield commented Mar 31, 2017 via email

@morciari

I'm interested in this too. Maybe it would be better to implement an IP block in the FPGA optimized for TensorFlow, rather than a generic OpenCL IP.

@alexshirley

I'm also interested, but from the angle of creating an accelerator and then using the hooks in TensorFlow to access it.

@aselle aselle reopened this Jun 9, 2017
@aselle aselle added the stat:contribution welcome label Jun 9, 2017
@wilderfield
Author

wilderfield commented Jun 10, 2017 via email

@viper7882

Hi @aselle ,

Hugh Perkins has created Coriander, which can run NVIDIA® CUDA™ code on OpenCL 1.2 devices, including FPGAs. You might want to take a look and see if it suits your need to connect your deep learning software to OpenCL 1.2 devices. Kindly attribute his name and his contribution if you plan to use his work.

@raulpuric

raulpuric commented Jun 29, 2017

@aselle TensorFlow is always lauded for its portability. However, even if one were to use @viper7882's solution, they'd still be constrained to GPUs, traditional CPUs, and commonly used FPGAs with OpenCL vendor libraries. This is not truly general portability.

Would it be possible to add a custom device feature/interface that lets us register unique devices, treating the tensor forward and backward computations within that device context as a giant black box that can be fed values via gRPC serialization? Or the ability to pass a custom compiled executable/binary and link it into the rest of the graph (not just a TensorFlow lambda func)? This would greatly help the applied deep learning community, so we can stop pulling values out of the runtime graph, processing them, and reinserting them into the computation graph.

Here are several use cases based on projects I'm working on now:

  • Integration of TensorFlow into high-throughput/low-latency real-time systems where we want to define custom caching logic for only a few specified ops to meet unique runtime criteria.
  • Usage of TensorFlow to interface with custom ASICs without OpenCL libraries
    • usage of ASICs specific to performing fast neural memory access in NTMs/NDCs/etc.
    • usage of analog devices with a digital interface, such as an analog neural network (sub)graph or an analog transmitter/receiver
  • Easy ability to use TensorFlow autodifferentiation with more abstract concepts in NN theory (such as deep function machines)
  • Easy ability to add distributed (something like MPI) computation for a specific tensor. FireCaffe/DeepScale already have this functionality.

My basic understanding is that device contexts solely serve to define the compile logic of the tensors in the context and to properly link the compiled nodes with the rest of the graph. I'd implement a PR myself, but I only have a good grasp of the gRPC serialization logic.

Wish I understood the graph compilation portion better :(
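
For what it's worth, the closest existing hook seems to be the documented custom-op path rather than a new device interface. A rough sketch of the "black box" idea expressed that way might look as follows; the op name and the fpga_run() driver call are made up for illustration:

```cpp
// Hedged sketch, not an existing TensorFlow interface: the "black box" idea
// expressed through the documented custom-op mechanism. The kernel is
// registered on the CPU device but hands its flat buffers to an opaque
// accelerator call.
#include <cstdint>

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;

// Placeholder for whatever actually drives the accelerator: a PCIe DMA
// transfer, a gRPC call to a service, mmap'd AXI registers, etc.
extern void fpga_run(const float* in, float* out, int64_t n);

REGISTER_OP("FpgaBlackBox")
    .Input("x: float")
    .Output("y: float")
    .SetShapeFn([](shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));  // output shape mirrors the input
      return Status::OK();
    });

class FpgaBlackBoxOp : public OpKernel {
 public:
  explicit FpgaBlackBoxOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    const Tensor& x = ctx->input(0);
    Tensor* y = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, x.shape(), &y));
    // Everything between reading the input and filling the output is
    // opaque to TensorFlow.
    fpga_run(x.flat<float>().data(), y->flat<float>().data(),
             x.NumElements());
  }
};

REGISTER_KERNEL_BUILDER(Name("FpgaBlackBox").Device(DEVICE_CPU),
                        FpgaBlackBoxOp);
```

This only covers the forward pass; a gradient would still have to be registered separately, and it does not address the placement, caching, or distributed-tensor use cases listed above.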

@Piedone

Piedone commented Oct 25, 2017

Would it be feasible to swap out specific parts of TensorFlow and accelerate them, e.g. with an FPGA implementation? Like creating FPGA-accelerated ops, as @kinsumliu suggests under #12538?

We've built Hastlayer, which automatically transforms .NET programs into equivalent FPGA implementations. What we could thus do is see how implementing some TF features with Hastlayer on an FPGA would work.

@stevefox

Google designed an ASIC for exactly this, the Tensor Processing Unit (TPU). A fair amount of information on it has been publicly released.

You could use this as a starting point to design soft IP for an FPGA that supports a handful of TensorFlow ops you'd like to speed up, then write the TF device handlers and ops.

You could potentially even design the software interface with Xilinx's OpenCL macros (and maybe even the IP) to make the TensorFlow ops simpler to implement, so you don't need to write HDL manually. There's also reconfigure.io (program an FPGA with Go). It sounds like Hastlayer, which @Piedone mentioned, does something similar as well.

@Piedone

Piedone commented Nov 27, 2017

Yes, Hastlayer works in a very similar way, @stevefox.

What do you think: which part of TF would be most interesting to swap out for an accelerated implementation? It seems to me that the most important piece of the TPU is the matrix multiply unit, a feature somewhat similar to what GPUs offer.

@nlbutts

nlbutts commented Jan 18, 2018

I was able to get TensorFlow Lite compiled for the Zynq and did a quick benchmark. I wrote up my results here: https://github.com/nlbutts/tflite_zynq

My next step was looking at accelerating the convolution operation, which is where the software spends most of its time when running MobileNet.
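
One way the convolution hot spot could be intercepted, assuming a standard TFLite C++ build, is to override the builtin CONV_2D registration in the op resolver. Register_FPGA_CONV_2D() below is a hypothetical TfLiteRegistration whose prepare/invoke would offload to the fabric, and the header paths are for current TensorFlow Lite (in the TF 1.x era this lived under tensorflow/contrib/lite):

```cpp
// Rough sketch: swap the builtin CONV_2D kernel for a custom one when
// building the interpreter. Register_FPGA_CONV_2D() is hypothetical.
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Provided elsewhere: a TfLiteRegistration whose prepare/invoke hand the
// convolution inputs to the FPGA instead of the reference C++ code.
extern TfLiteRegistration* Register_FPGA_CONV_2D();

std::unique_ptr<tflite::Interpreter> BuildInterpreter(
    const tflite::FlatBufferModel& model) {
  tflite::ops::builtin::BuiltinOpResolver resolver;
  // AddBuiltin() replaces the existing CONV_2D entry in the resolver.
  resolver.AddBuiltin(tflite::BuiltinOperator_CONV_2D,
                      Register_FPGA_CONV_2D());

  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(model, resolver)(&interpreter);
  return interpreter;
}
```

Newer TFLite versions also provide a delegate mechanism (TfLiteDelegate) for handing whole subgraphs to an accelerator, but swapping a single kernel like this is a smaller first step for benchmarking.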

@GiuseppeDiGuglielmo

@nlbutts did you make any progress? Is anyone aware of a solution that interfaces TensorFlow with FPGAs?

There are some old attempts to add new devices to TensorFlow, but I did not see anything about FPGAs in particular.

@wilderfield
Author

wilderfield commented Jul 6, 2018 via email

@zhaohb

zhaohb commented Mar 22, 2019

I have now added an FPGA device to TensorFlow, and I can call the FPGA device's kernel to perform an addition operation. However, the FPGA device is still only a virtual device: I have not yet implemented the API to collect information about the FPGA; I will do that later.
The code:
zhaohb@7456ddb
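
For anyone following along, the general shape of such a change is sketched below. This is very rough, not a substitute for the commit linked above, and the DeviceFactory/Device virtual interfaces differ between TensorFlow versions, so treat the signatures as approximate:

```cpp
// Approximate outline of the two registration points a new device type
// plugs into; exact interfaces vary across TensorFlow versions.
#include <memory>
#include <vector>

#include "tensorflow/core/common_runtime/device.h"
#include "tensorflow/core/common_runtime/device_factory.h"

namespace tensorflow {

class FPGADeviceFactory : public DeviceFactory {
 public:
  Status CreateDevices(const SessionOptions& options,
                       const string& name_prefix,
                       std::vector<std::unique_ptr<Device>>* devices) override {
    // Query the board/driver here and append one Device per FPGA found,
    // named e.g. "<name_prefix>/device:FPGA:0". Until real discovery is
    // implemented, this stays a "virtual" device as described above.
    return Status::OK();
  }
};

// Makes "/device:FPGA:*" visible to the placer alongside CPU and GPU.
REGISTER_LOCAL_DEVICE_FACTORY("FPGA", FPGADeviceFactory);

}  // namespace tensorflow

// Kernels are then registered against the new device type, e.g.:
//   REGISTER_KERNEL_BUILDER(Name("Add").Device("FPGA"), FpgaAddOp);
```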

@mohantym mohantym added the comp:micro and type:feature labels Oct 17, 2022
@github-actions

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Apr 16, 2023
@github-actions

This issue was closed because it has been inactive for 1 year.
