Feature Request: Accelerate TensorFlow core on FPGA - How? #8820
Comments
This would be a great question for StackOverflow! I'm not aware of people who have talked about how to do this already, but that would be the best place to look. Thanks! |
Andrew,
It was more of a feature request than a question... Can you make sure that it makes it to someone at Google? Although they may have already thought about it, or be doing it behind the scenes... |
I'm interested in this too. Maybe it would be better to implement an IP core on an FPGA optimized for TensorFlow, rather than a generic OpenCL IP. |
I'm also interested, but from the aspect of creating an accelerator and then using the hooks in Tensorflow to access it. |
FYI, Xilinx has DNN accelerators now. Currently they are integrated with Caffe; I'm sure TensorFlow is on the roadmap.
http://www.datacenterknowledge.com/archives/2016/11/14/xilinx-unleashes-fpga-accelerator-stack-supporting-caffe-openstack/ |
Hi @aselle, Hugh Perkins has created Coriander, which can run NVIDIA® CUDA™ code on OpenCL 1.2 devices, including FPGAs. You might want to take a look if that suits your need to connect your deep learning software to OpenCL 1.2 devices. Kindly credit his name and his contribution if you plan to use his work. |
@aselle TensorFlow is always lauded for its portability. However, even if one were to use @viper7882's solution, they'd still be constrained to GPUs, traditional CPUs, and commonly used FPGAs with OpenCL vendor libraries. This is not true general portability. Would it be possible to add a custom device feature/interface that lets us add unique devices, treating the tensor forward and backward computations within the context as a giant black box that can be fed values via gRPC serialization? Or the ability to pass a custom compiled executable/binary and link it into the rest of the graph (not just a TensorFlow lambda func)? This would greatly help the applied deep learning community, so we can stop pulling values out of the runtime graph, processing them, and reinserting them into the computation graph. Here are just several use cases based on projects I'm working on now:
My basic understanding is that the device contexts solely serve to define the compile logic of the tensors in the context and to properly link the compiled nodes with the rest of the graph. I'd implement a PR myself, but I only have a good grasp on the gRPC serialization logic. Wish I understood the graph compilation portion better :( |
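No such black-box device interface exists today, but the closest existing hook is probably tf.py_func (TF 1.x), which lets arbitrary host code sit inside the graph. A minimal sketch, where accelerator_matmul is a hypothetical stand-in for a gRPC client talking to an external device:

```python
import numpy as np
import tensorflow as tf  # TF 1.x

def accelerator_matmul(a, b):
    # Hypothetical: serialize a and b, ship them to the accelerator
    # over gRPC, and return the result. NumPy stands in here.
    return np.matmul(a, b).astype(np.float32)

x = tf.placeholder(tf.float32, [None, 128])
w = tf.Variable(tf.random_normal([128, 64]))

# TF treats the computation as an opaque op; values cross the
# boundary as fully materialized arrays.
y = tf.py_func(accelerator_matmul, [x, w], tf.float32)
y.set_shape([None, 64])  # py_func drops static shape information
```

The catch is exactly the backward pass: py_func has no gradient by default, so a true black box would also need a tf.RegisterGradient plus a gradient_override_map for the backward computation.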
Would it be feasible to swap out specific parts of TensorFlow and accelerate them, e.g. with an FPGA implementation? Like creating FPGA-accelerated ops as @kinsumliu suggests under #12538? We've built Hastlayer, which automatically transforms .NET programs into equivalent FPGA implementations. What we could thus do is see how implementing some TF features with Hastlayer on an FPGA would work. |
Google designed an ASIC for exactly this, called the Tensor Processing Unit (TPU). A fair amount of information on it has been publicly released.
You could use this as a starting point to design soft IP for an FPGA that supports a handful of TensorFlow ops you'd like to speed up, then write the TF device handlers and ops. You could potentially even design the software interface with Xilinx's OpenCL macros (and maybe even the IP) to make the TensorFlow ops simpler to implement, so you don't need to write HDL manually. There's also reconfigure.io (program an FPGA with Go). It sounds like Hastlayer, which @Piedone mentioned, does something similar as well. |
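To sketch what "write the TF device handlers and ops" looks like from the Python side (all names below are hypothetical; the kernel itself would be a C++ OpKernel wrapping the FPGA driver):

```python
import tensorflow as tf  # TF 1.x

# Hypothetical shared object built from a C++ OpKernel that talks to
# the FPGA; a REGISTER_OP("FpgaConv2D") on the C++ side yields the
# snake_cased Python wrapper used below.
fpga_ops = tf.load_op_library("libfpga_conv.so")

x = tf.placeholder(tf.float32, [1, 224, 224, 3])
f = tf.Variable(tf.random_normal([3, 3, 3, 32]))
y = fpga_ops.fpga_conv2d(x, f, strides=[1, 1, 1, 1], padding="SAME")
```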
Yes, Hastlayer works in a very similar way, @stevefox. Which part of TF do you think would be most interesting to swap out for an accelerated implementation? It seems to me that the most important piece of the TPU is the matrix multiplier, somewhat similar to what GPUs offer. |
I was able to get TensorFlow Lite compiled for the Zynq and ran a quick benchmark. I typed up my results here: https://github.com/nlbutts/tflite_zynq My next step was looking at accelerating the convolution operation, which is where the software spends most of its time when using Mobilenet. |
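For anyone wanting to reproduce that kind of measurement, a rough sketch of a TFLite timing loop (written against today's tf.lite Python API; the model path is a placeholder, and on a Zynq you would more likely use the C++ interpreter):

```python
import time
import numpy as np
import tensorflow as tf

# "mobilenet_v1.tflite" is a placeholder for any converted model.
interpreter = tf.lite.Interpreter(model_path="mobilenet_v1.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

frame = np.random.rand(*inp["shape"]).astype(inp["dtype"])  # fake input frame
interpreter.set_tensor(inp["index"], frame)

runs = 50
start = time.time()
for _ in range(runs):
    interpreter.invoke()
print("ms per inference:", (time.time() - start) / runs * 1000)
```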
@nlbutts did you make any progress? Is anyone aware of a solution that interfaces Tensorflow with FPGAs? There are some old attempts to add new devices to Tensorflow, but I did not see anything about FPGAs in particular. |
To those interested, I would like to share some thoughts and some info. Full disclosure: I opened this feature request about 1.5 years ago, back when I was more clueless. Also, I have worked at Xilinx for the past 4 years. Just 3 months ago I started as a "Machine Learning Engineer", so now I am very aware of Xilinx's approach in this space.

Machine learning/computer vision is a very fast-paced market. It seems like every year there is a new state-of-the-art network architecture. Given that an FPGA design cycle can take 6 months to a year, especially when a company is lacking FPGA know-how, Xilinx provisioned a hardware and software team to design a "soft" neural network accelerator core. The core is meant to be general purpose, in that it can accelerate most network architectures. We are calling it "XDNN". The software is getting tagged as "xfDNN": Xilinx Fast DNN. We are so great at coming up with names!

The core is designed specifically for accelerating inference, and it takes advantage of fixed-point arithmetic to squeeze more compute into the FPGA fabric/DSPs. This means trained networks must undergo an offline quantization process to go from float32 to int8. Various research shows that this is an effective method to achieve faster inference with minimal loss in accuracy (1-2%).
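For readers new to the idea, here is a minimal sketch of symmetric per-tensor float32-to-int8 quantization in NumPy; the scale choice and rounding are illustrative assumptions, not the actual xfDNN quantizer:

```python
import numpy as np

def quantize_int8(w):
    """Toy symmetric per-tensor quantization: float32 -> int8."""
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)  # e.g. conv filter weights
q, s = quantize_int8(w)
print("max abs reconstruction error:", np.abs(dequantize(q, s) - w).max())
```

Production quantizers typically pick scales per layer (or per channel) from calibration data rather than from the raw weight range alone.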
The core accelerates convolutions, pooling (max/average), eltwise add, and ReLU. We run the majority of the network on the FPGA and leave the final fully connected or region layers to the CPU. (There was no gain in adding hardware support for these final layers, since FC layers are being used less and less, and region layers can be quite custom.)

Currently, we only support cloud ("Amazon AWS EC2 F1") and on-premise ("VCU1525"). However, we are working on bringing the core into Zynq UltraScale+ devices, as well as other cloud providers such as Nimbix.

Check out our repo if you are interested: https://github.com/Xilinx/ml-suite
Also, you can kick the tires directly at: https://aws.amazon.com/marketplace/pp/B077FM2JNS

Right now, the documentation on GitHub isn't great, but we are about to push a new release with much improved docs. There is an EA branch that shows a lot of our software. My job right now is to enhance our documentation and also add usability features (ease-of-use Python stuff). So feel free to complain to me; it is my job. We have some really nice Jupyter notebook tutorials coming. I can be reached at b.lozano.havoc@gmail.com or bryan.lozano@xilinx.com.

Now we come to framework support...

Initially, we did some experiments enabling Caffe to directly call our accelerator. We got it working, but it was clear that there would be a lot of work to add support for various networks, and to support this across Caffe/TensorFlow/MXNet. TensorFlow was still evolving when we started. We simply didn't have enough software people to support that.

What we settled on is a compiler frontend that takes your network definition and weights. The compiler does some optimization (layer merging), then generates a schedule of commands to accelerate a given network. This way, the majority of our software can be framework agnostic, but the compiler has to be "multilingual". The compiler reads Caffe prototxt or frozen TensorFlow graphs, for example.
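To make the compiler-frontend idea concrete, a generic sketch (not Xilinx's actual tooling; the file name is hypothetical) of reading a frozen TensorFlow graph and walking its nodes to spot fusable layer chains, using the TF 1.x API of the era:

```python
import tensorflow as tf  # TF 1.x

# "frozen_model.pb" is a placeholder for any frozen GraphDef.
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# A frontend would walk nodes like this to find, say, Conv2D -> BiasAdd -> Relu
# chains that can be merged into a single accelerator command.
accelerated = {"Conv2D", "MaxPool", "AvgPool", "Add", "Relu"}
for node in graph_def.node:
    if node.op in accelerated:
        print(node.op, node.name, list(node.input))
```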
I would still love it if TensorFlow had hooks to directly run on the FPGA, but I don't see it happening anytime soon. It probably wouldn't be that bad, especially if TensorFlow can make OpenCL calls.

Anyways, if you want more info, don't hesitate to reach out. Again, I can be reached at b.lozano.havoc@gmail.com or bryan.lozano@xilinx.com.

Thanks,
Bryan |
I have now added an FPGA device to TensorFlow, and I can call the FPGA device's kernel to perform an addition operation. For the moment, though, the FPGA device is only a virtual device; I have not yet implemented the API to collect the FPGA's info, but I will do that later. |
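For context, once a device is registered with the runtime, pinning ops to it looks the same as for GPUs. A sketch, assuming a (hypothetical) device registered under the name "FPGA":

```python
import tensorflow as tf  # TF 1.x

with tf.device("/device:FPGA:0"):  # assumes an "FPGA" device is registered
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a + b                      # Add dispatches to the FPGA kernel

config = tf.ConfigProto(log_device_placement=True)  # verify the placement
with tf.Session(config=config) as sess:
    print(sess.run(c))
```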
Consider the following two hardware scenarios:
How could the tensorflow core be accelerated for these scenarios?
FPGA vendors do offer OpenCL toolchains, so the OpenCL APIs can be used to parallelize computations on the FPGA fabric.
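For illustration, here is the shape of such an OpenCL flow from Python (via pyopencl); on a real FPGA the kernel would be precompiled offline by the vendor toolchain into a bitstream rather than built at runtime as below:

```python
import numpy as np
import pyopencl as cl  # picks up whichever OpenCL platform the vendor installs

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

ctx = cl.create_some_context()  # would select the FPGA platform if installed
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_g = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# On FPGAs this source would be compiled offline to a binary; building
# at runtime, as here, is the GPU/CPU convenience path.
prg = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""").build()

prg.vadd(queue, a.shape, None, a_g, b_g, out_g)
res = np.empty_like(a)
cl.enqueue_copy(queue, res, out_g)
assert np.allclose(res, a + b)
```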
Forgive me, I am a bit ignorant and still learning in this area, but I am intrigued by the future possibility of this and would love to help in any way I can.